key: cord-024346-shauvo3j authors: kruglov, vasiliy n. title: using open source libraries in the development of control systems based on machine vision date: 2020-05-05 journal: open source systems doi: 10.1007/978-3-030-47240-5_7 sha: doc_id: 24346 cord_uid: shauvo3j the possibility of the boundaries detection in the images of crushed ore particles using a convolutional neural network is analyzed. the structure of the neural network is given. the construction of training and test datasets of ore particle images is described. various modifications of the underlying neural network have been investigated. experimental results are presented. when processing crushed ore mass at ore mining and processing enterprises, one of the main indicators of the quality of work of both equipment and personnel is the assessment of the size of the crushed material at each stage of the technological process. this is due to the need to reduce material and energy costs for the production of a product unit manufactured by the plant: concentrate, sinter or pellets. the traditional approach to the problem of evaluating the size of crushed material is manual sampling with subsequent sieving with sieves of various sizes. the determination of the grain-size distribution of the crushed material in this way entails a number of negative factors: -the complexity of the measurement process; -the inability to conduct objective measurements with sufficient frequency; -the human error factor at the stages of both data collection and processing. these shortcomings do not allow you to quickly adjust the performance of crushing equipment. the need for obtaining data on the coarseness of crushed material in real time necessitated the creation of devices for in situ assessment of parameters such as the grain-size distribution of ore particles, weight-average ore particle and the percentage of the targeted class. the machine vision systems are able to provide such functionality. they have high reliability, performance and accuracy in determining the geometric dimensions of ore particles. at the moment, several vision systems have been developed and implemented for the operational control of the particle size distribution of crushed or granular material. in [9] , a brief description and comparative analysis of such systems as: split, wipfrag, fragscan, cias, ipacs, tucips is given. common to the algorithmic part of these systems is the stage of dividing the entire image of the crushed ore mass into fragments corresponding to individual particles with the subsequent determination of their geometric sizes. such a segmentation procedure can be solved by different approaches, one of which is to highlight the boundaries between fragments of images of ore particles. classical methods for borders highlighting based on the assessment of changes in brightness of neighboring pixels, which implies the use of mathematical algorithms based on differentiation [4, 8] . figure 1 shows typical images of crushed ore moving on a conveyor belt. algorithms from the opencv library, the sobel and canny filters in particular, used to detect borders on the presented images, have identified many false boundaries and cannot be used in practice. this paper presents the results of recognizing the boundaries of images of stones based on a neural network. 
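as a minimal sketch of the classical baseline discussed above, the snippet below applies the sobel and canny filters from the opencv python bindings to a grayscale conveyor-belt frame; the file names and threshold values are illustrative assumptions, not parameters reported in the paper.

```python
import cv2

# hypothetical input: a grayscale frame of crushed ore on the conveyor (768x576)
img = cv2.imread("ore_frame.png", cv2.IMREAD_GRAYSCALE)
img = cv2.GaussianBlur(img, (5, 5), 0)            # suppress texture noise before differentiation

# sobel gradient magnitude (first-order differentiation)
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
sobel_mag = cv2.convertScaleAbs(cv2.magnitude(gx, gy))

# canny edge map; the thresholds are illustrative and would need tuning in practice
canny_edges = cv2.Canny(img, 50, 150)

cv2.imwrite("sobel_edges.png", sobel_mag)
cv2.imwrite("canny_edges.png", canny_edges)
```

on textured ore images both filters respond to surface cracks and shadows as strongly as to true particle boundaries, which is exactly the failure mode described above.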
this approach has been less studied and described in the literature, however, it has recently acquired great significance in connection with its versatility and continues to actively develop with the increasing of a hardware performance [3, 5] . to build a neural network and apply machine learning methods, a sample of images of crushed ore stones in gray scale was formed. the recognition of the boundaries of the ore particles must be performed for stones of arbitrary size and configuration on a video frame with ratio 768 × 576 pixels. to solve this problem with the help of neural networks, it is necessary to determine what type of neural network to use, what will be the input information and what result we want to get as the output of the neural network processing. analysis of literary sources showed that convolutional neural networks are the most promising when processing images [3, [5] [6] [7] . convolutional neural network is a special architecture of artificial neural networks aimed at efficient pattern recognition. this architecture manages to recognize objects in images much more accurately, since, unlike the multilayer perceptron, two-dimensional image topology is considered. at the same time, convolutional networks are resistant to small displacements, zooming, and rotation of objects in the input images. it is this type of neural network that will be used in constructing a model for recognizing boundary points of fragments of stone images. algorithms for extracting the boundaries of regions as source data use image regions having sizes of 3 × 3 or 5 × 5. if the algorithm provides for integration operations, then the window size increases. an analysis of the subject area for which this neural network is designed (a cascade of secondary and fine ore crushing) showed: for images of 768 × 576 pixels and visible images of ore pieces, it is preferable to analyze fragments with dimensions of 50 × 50 pixels. thus, the input data for constructing the boundaries of stones areas will be an array of images consisting of (768 − 50)*(576 − 50) = 377668 halftone fragments measuring 50 × 50 pixels. in each of these fragments, the central point either belongs to the boundary of the regions or not. based on this assumption, all images can be divided into two classes. to mark the images into classes on the source images, the borders of the stones were drawn using a red line with a width of 5 pixels. this procedure was performed manually with the microsoft paint program. an example of the original and marked image is shown in fig. 2 . then python script was a projected, which processed the original image to highlight 50 × 50 pixels fragments and based on the markup image sorted fragments into classes preserving them in different directories to write the scripts, we used the python 3 programming language and the jupyter notebook ide. thus, two data samples were obtained: training dataset and test dataset for the assessment of the network accuracy. as noted above, the architecture of the neural network was built on a convolutional principle. the structure of the basic network architecture is shown in fig. 3 [7] . the network includes an input layer in the format of the tensor 50 × 50 × 1. the following are several convolutional and pooling layers. after that, the network unfolds in one fully connected layer, the outputs of which converge into one neuron, to which the activation function, the sigmoid, will be applied. 
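a minimal keras sketch of the base architecture described above (input tensor 50 × 50 × 1, a few convolution/pooling stages, one fully connected layer and a single sigmoid output); the filter counts and kernel sizes are assumptions for illustration, since the paper reports only the overall structure.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_base_model():
    # 50x50 grayscale patch in, probability that its central pixel is a boundary point out
    model = keras.Sequential([
        layers.Input(shape=(50, 50, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),    # the single fully connected layer
        layers.Dense(1, activation="sigmoid"),  # "boundary point" probability
    ])
    return model
```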
at the output, we obtain the probability that the center point of the input fragment belongs to the "boundary point" class. the keras open source library was used to develop and train a convolutional neural network [1, 2, 6, 10] . the basic convolutional neural network was trained with the following parameters: -10 epoch; -error -binary cross-entropy; -quality metric -accuracy (percentage of correct answers); -optimization algorithm -rmsprop. the accuracy on the reference data set provided by the base model is 90.8%. in order to improve the accuracy of predictions, a script was written that trains models on several configurations, and also checks the quality of the model on a test dataset. to improve the accuracy of the predictions of the convolutional neural network, the following parameters were varied with respect to the base model: -increasing the number of layers: +1 convolutional +1 pooling; -increasing of the number of filters: +32 in each layer; -increasing the size of the filter up to 5*5; -increasing the number of epochs up to 30; -decreasing in the number of layers. these modifications of the base convolutional neural network did not lead to an improvement in its performance -all models had the worst quality on the test sample (in the region of 88-90% accuracy). the model of the convolutional neural network, which showed the best quality, was the base model. its quality in the training sample is estimated at 90.8%, and in the test sample -at 83%. none of the other models were able to surpass this figure. data on accuracy and epoch error are shown in fig. 4 and 5 . if you continue to study for more than 10 epochs, then the effect of retraining occurs: the error drops, and accuracy increases only on training samples, but not on test ones. figure 6 shows examples of images with neural network boundaries. as you can see from the images, not all the borders are closed. the boundary discontinuities are too large to be closed using morphological operations on binary masks; however, the use of the "watershed" algorithm [8] will reduce the identification error of the boundary points. in this work, a convolutional neural network was developed and tested to recognize boundaries on images of crushed ore stones. for the task of constructing a convolutional neural network model, two data samples were generated: training and test dataset. when building the model, the basic version of the convolutional neural network structure was implemented. in order to improve the quality of model recognition, a configuration of various models was devised with deviations from the basic architecture. an algorithm for training and searching for the best model by enumerating configurations was implemented. in the course of the research, it was found that the basic model has the best quality for recognizing boundary points. it shows the accuracy of the predictions for the targeted class at 83%. based on the drawn borders on the test images, it can be concluded that the convolutional neural network is able to correctly identify the boundary points with a high probability. it rarely makes mistakes for cases when there is no boundary (false positive), but often makes mistakes when recognizing real boundary points (false negative). the boundary breaks are too large to be closed using morphological operations on binary masks, however, the use of the "watershed" algorithm will reduce the identification error for boundary points. funding. the work was performed under state contract 3170γc1/48564, grant from the fasie. 
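the training configuration reported above (rmsprop, binary cross-entropy, accuracy metric, 10 epochs) maps directly onto the keras api; the dataset files and batch size in the sketch below are placeholders, and build_base_model refers to the architecture sketch given earlier.

```python
import numpy as np

model = build_base_model()                      # from the earlier sketch
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# x_train: (n, 50, 50, 1) float patches, y_train: (n,) 0/1 labels -- placeholder files
x_train = np.load("train_patches.npy")
y_train = np.load("train_labels.npy")
x_test = np.load("test_patches.npy")
y_test = np.load("test_labels.npy")

history = model.fit(x_train, y_train, epochs=10, batch_size=64,
                    validation_data=(x_test, y_test))
print("test accuracy:", history.history["val_accuracy"][-1])
```

training beyond 10 epochs would, as noted above, mainly improve the training metrics while the test accuracy stagnates, i.e. overfitting.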
keras: the python deep learning library deep learning with python, 1st edn machine learning: the art and science of algorithms that make sense of data digital image processing hands-on machine learning with scikit-learn and tensorflow: concepts, tools, and techniques to build intelligent systems deep learning with keras: implement neural networks with keras on theano and tensorflow comprehensive guide to convolutional neural networks -the eli5 way image processing, analysis and machine vision identifying, visualizing, and comparing regions in irregularly spaced 3d surface data python data science handbook: essential tools for working with data, 1st edn key: cord-016196-ub4mgqxb authors: wang, cheng; zhang, qing; gan, jianping title: study on efficient complex network model date: 2012-11-20 journal: proceedings of the 2nd international conference on green communications and networks 2012 (gcn 2012): volume 5 doi: 10.1007/978-3-642-35398-7_20 sha: doc_id: 16196 cord_uid: ub4mgqxb this paper summarizes the relevant research of the complex network systematically based on statistical property, structural model, and dynamical behavior. moreover, it emphatically introduces the application of the complex network in the economic system. transportation network, and so on are of the same kind [2] . emphasis on the structure of the system and the system analysis from structure are the research thinking of the complex network. the difference is that the property of the topological structure of the abstracted real networks is different from the network discussed before, and has numerous nodes, as a result we call it complex network [3] . in recent years, a large number of articles are published in world leading publication such as science, nature, prl, and pnas, which reflects indirectly that complex network has been a new research hot spot. the research in complex network can be simply summarized as contents of three aspects each of which has close and further relationships: rely on the statistical property of the positivist network measurement; understanding the reason why the statistical property has the property it has through building the corresponding network model; forecasting the behavior of the network system based on the structure and the formation rule of the network. the description of the world in the view of the network started in 1736 when german mathematician eular solved the problem of johannesburg's seven bridges. the difference of complex network researching is that you should view the massive nodes and the properties they have in the network from the point of the statistics firstly. the difference of the properties means the different internal structures of the network; moreover the different internal structures of the network bring about the difference of the systemic function. therefore, the first step of our research on complex network is the description and understanding of the statistical properties, sketched as follows: in the research of the network, generally speaking we define the distance between two nodes as the number of the shortest path edge of the two connectors; the diameter of the net as the maximum range between any two points; the average length of the net is the average value of the distance among all the nodes, it represents the degree of separation of the nodes in the net, namely the size of the net. 
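the distance, diameter and average path length defined above can be computed directly for any small graph; a sketch using the networkx library (an assumption of this illustration, not a tool used in the paper) follows, on a toy random graph that is assumed to be connected.

```python
import networkx as nx

G = nx.erdos_renyi_graph(n=200, p=0.05, seed=1)          # toy network for illustration

d_0_10 = nx.shortest_path_length(G, source=0, target=10)  # distance between two nodes
diameter = nx.diameter(G)                                  # maximum distance over all pairs
avg_len = nx.average_shortest_path_length(G)               # "size" of the network

print(d_0_10, diameter, avg_len)
```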
an important discovery in complex network research is that the average path length of most large-scale real networks is much smaller than one would expect, a phenomenon known as the ''small-world effect''. this observation comes from milgram's famous small-world experiment: participants were asked to forward a letter to one of their acquaintances so that it would eventually reach a designated recipient, in order to measure the distribution of path lengths in the social network. the result showed that the average number of intermediaries was only about six, and the experiment is also the origin of the popular notion of ''six degrees of separation''. the extent to which nodes cluster together is measured by the clustering (convergence) coefficient c, which expresses how tightly knit the network is. in social networks, for example, a friend of your friend is often also your friend, or two of your friends are friends with each other. it is computed as follows: suppose node i is connected to k_i other nodes; if these k_i nodes were all connected to each other there would be k_i(k_i − 1)/2 edges among them, so if they actually share e_i edges, the clustering coefficient of node i is the ratio c_i = e_i / [k_i(k_i − 1)/2] = 2e_i / [k_i(k_i − 1)]. the clustering coefficient of the network is the average of the coefficients of all its nodes. clearly the clustering coefficient equals 1 only in a fully connected network, and in most other networks it is less than 1. nevertheless, nodes in most large-scale real-world networks tend to flock together: although the clustering coefficient c is far less than 1, it is far greater than n^(−1). the degree k_i of node i is, in graph-theoretic terms, the total number of edges attached to node i, and the average of the degrees over all nodes is called the average degree of the network, denoted ⟨k⟩. the degrees of the nodes are described by the distribution function p(k), which gives the probability that a randomly chosen node has exactly k edges; equivalently, it equals the number of nodes of degree k divided by the total number of nodes. the statistical properties described above are the foundation of complex network research; further study has revealed other important statistical properties of real-world networks, such as network resilience, betweenness, and the correlations between degree and clustering coefficient. the simplest network model is the regular network, characterized by every node having the same number of neighbors, for example the one-dimensional chain, the two-dimensional lattice, and the complete graph. paul erdős and alfréd rényi introduced the completely random network model in the late 1950s: in a graph of n nodes, any two nodes are connected with probability p. its average degree is ⟨k⟩ = p(n − 1) ≈ pn, its average path length is l ∼ ln n / ln⟨k⟩, and its clustering coefficient is c = p; when n is very large, the degree distribution is approximately a poisson distribution. the random network model was a significant achievement in network research, but it can hardly describe the actual properties of the real world, and many new models have since been proposed. as experiments show, most real-world networks exhibit both the small-world property (short average path length) and aggregation (large clustering coefficient). the regular network has high clustering but a large average shortest path length, while the random graph has the opposite properties, a short path length but a small clustering coefficient.
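a short numerical check of the formulas above on an erdős–rényi graph, assuming the networkx library: the measured average degree, clustering coefficient and path length should come out close to p(n − 1), p and ln n / ln⟨k⟩ respectively (the sketch assumes the sampled graph is connected, which holds with high probability for these parameters).

```python
import math
import networkx as nx

n, p = 2000, 0.01
G = nx.erdos_renyi_graph(n, p, seed=42)

k_avg = sum(dict(G.degree()).values()) / n          # <k> ~ p(n - 1)
C = nx.average_clustering(G)                        # C ~ p
L = nx.average_shortest_path_length(G)              # L ~ ln n / ln<k>

print(f"<k> = {k_avg:.2f}   (p(n-1) = {p * (n - 1):.2f})")
print(f"C   = {C:.4f} (p = {p})")
print(f"L   = {L:.2f}   (ln n / ln<k> = {math.log(n) / math.log(k_avg):.2f})")
```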
so regular networks and random networks cannot reflect the properties of the real world, which shows that the real world is neither completely regular nor completely random. in 1998 watts and strogatz found a network model that exhibits both the small-world property and high clustering, which was a great breakthrough in complex network research. they rewired each edge to a randomly chosen new node with probability p, thereby building a network that lies between the regular network and the random network (called the ws network for short); it has a short average path length and a large clustering coefficient, with the regular network and the random network as special cases when p is 0 and 1, respectively. after the ws model was put forward, many scholars proposed further modifications, among which the nw small-world model of newman and watts is the most widely used. the difference between the nw model and the ws model is that the nw model adds shortcut edges between pairs of nodes instead of cutting off original edges of the regular network. the advantage of the nw model is that it simplifies the theoretical analysis, since the ws model may produce isolated nodes while the nw model does not. in fact, when p is small and n is large, the theoretical results of the two models coincide; they are now jointly referred to as small-world models. although the small-world models describe the small-world property and high clustering of the real world well, theoretical analysis reveals that their degree distribution still takes an exponential form. empirical results show that it is more accurate to describe most large-scale real-world networks by a power law, namely p(k) ∼ k^(−γ). compared with an exponential distribution, a power law has no peak: most nodes have few connections while a few nodes have very many, and there is no characteristic scale as in the random network, which is why barabási and others call networks with such power-law degree distributions scale-free networks. to explain the formation of scale-free networks, barabási and albert proposed the famous ba model. they argued that earlier models ignored two important properties of real networks, growth and preferential attachment: the former means that new nodes continually join the network, and the latter means that arriving nodes prefer to connect to nodes that already have large degree. they not only analyzed the generating algorithm of the ba model by simulation, but also gave an analytical solution using the mean-field method of statistical physics, with the result that after sufficient evolution time the degree distribution of the ba network becomes stationary, a power law with exponent 3. the ba model is another great breakthrough in complex network research, demonstrating a deeper understanding of the objective network world. since then, many scholars have made improvements to the model, such as nonlinear preferential attachment, accelerated growth, local events such as edge rewiring, aging, and adaptive competition. note that most, but not all, real-world networks are scale-free, since the degree distributions of some real networks are truncated power laws. besides the small-world and scale-free models, scholars have also proposed other network models, such as the local-world evolving network model, weighted evolving network models, and deterministic network models, to describe the structure of real-world networks.
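for illustration, both the ws small-world construction and the ba preferential-attachment construction are available in networkx (an assumption of this sketch, not a tool used by the authors); comparing them shows the high clustering of the former and the heavy-tailed degree distribution of the latter.

```python
import collections
import networkx as nx

ws = nx.watts_strogatz_graph(n=1000, k=10, p=0.1, seed=0)   # rewire each edge with prob. p
ba = nx.barabasi_albert_graph(n=1000, m=5, seed=0)          # growth + preferential attachment

print("ws: C =", round(nx.average_clustering(ws), 3),
      "L =", round(nx.average_shortest_path_length(ws), 2))
print("ba: C =", round(nx.average_clustering(ba), 3),
      "L =", round(nx.average_shortest_path_length(ba), 2))

# degree histogram of the ba graph: many low-degree nodes and a few hubs (power law, exponent ~3)
hist = collections.Counter(dict(ba.degree()).values())
print(sorted(hist.items())[:10])
```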
study of the network structure is important, but the ultimate purpose is that we can understand and explain the system's modus operand based on these networks, and then we can forecast and control the behavior of network system. this systemic dynamical property based on network is generally called dynamical behavior, it involves so many things such as systemic transfusion, synchronization, phase change, web search and network navigator. the researched above has strong theoretical, a kind of research of network behavior which has strong applied has increasingly aroused our interests, for example the spread of computer virus on computer net, the spread of the communicable disease among multitude and the spread of rumours in society and so on, all of them are actually some propagation behavior obeying certain rules and spreading on certain net. the traditional network propagation models are always found based on regular networks, we have to review the issue with the further research of the complex networks. we emphatically introduce the research of the application. one of the uppermost and foremost purposes of network propagation behavior research is that we can know the mechanism transmission of the disease well. substitute node for the unit infected, if one unit can associate with another in infection or the other way round through some way, then we regard that the two units have connection, in this way can we get the topological structure of network propagation, the relevant propagation model can be found to study the propagation behavior in turn. obviously, the key to network propagation model studying is the formulation of the propagation rule and the choice of the network topological structure. however, it does not conform to the actual fact simply regarding the disease contact network as regular uniform connect network. moore studied the disease propagation behavior in small-world, discovering that the propagation threshold value of disease in small-world is much less than it does in regular network, in the same propagation degree, experience the same time, the propagation scope of disease in the small-world is significantly greater than the propagation scope in the regular network, that is to say: compared to regular network, disease in the smallworld inflects easily; paster satornas and others studied the propagation behavior in the scale-free world, the result turns out to be amazing: there is always positive propagation degree threshold value in both of regular world and small-world, while the propagation degree threshold value approves to be 0. we can get the similar results when analyzing the scale-free world. as lots of experiments put realworld network has both small-world and scale-free, the conclusion described above is quite frustrated. fortunately, no matter virus or computer virus they all has little infectious (k ¼ 1), doing little harm. however, once the intensity of disease or virus reaches some degree, we have to pay enough attention to it, the measurement to control it can not totally rely on the improvement of medical conditions, we have to take measures to quarantine the nodes and turn off the relevant connections in order to cut off avenue of infection in which we can we change the topological structure of the propagation network. in fact, just in this way can we defeat the war of fighting sars in 2003 summer in our country. 
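the threshold behaviour discussed above can be explored with a minimal sis simulation sketch on an arbitrary network (infection probability τ per infected neighbour per step, recovery probability γ); running it on regular, ws and ba graphs lets one compare steady-state infection levels. all parameter values below are illustrative.

```python
import random
import networkx as nx

def sis_step(G, infected, tau, gamma, rng):
    new_infected = set()
    for i in G:
        if i in infected:
            if rng.random() > gamma:          # stays infected unless it recovers
                new_infected.add(i)
        else:
            for j in G.neighbors(i):          # each infected neighbour transmits with prob. tau
                if j in infected and rng.random() < tau:
                    new_infected.add(i)
                    break
    return new_infected

rng = random.Random(1)
G = nx.barabasi_albert_graph(2000, 3, seed=1)             # scale-free contact network
infected = set(rng.sample(list(G.nodes()), 20))
for _ in range(200):
    infected = sis_step(G, infected, tau=0.05, gamma=0.2, rng=rng)
print("steady-state infected fraction:", len(infected) / G.number_of_nodes())
```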
the study of the disease's mechanism transmission is not all of the questions our ultimate goal is that we can master how to control disease propagation efficiently. while in practical applications, it is hard to stat the number of nodes namely the number of units which have possibilities connect with other nodes in infection period. for example in the research of std spread, researchers get the information about psychopath and high risk group only through questionnaire survey and oral questioning, while their reply has little reliability, for that reason, quite a lot of immunization strategy have been put forward by some scholars based on above-mentioned opinion, such as ''who is familiar with the immune'', ''natural exposure'', ''vaccination''. analyzing disease spread phenomenon is not just the purpose of researching network propagation behavior; what is more a large amount of things can be analyzed through it. for example we can apply it to propagation behavior's research in social network, the basic ideas showed as follows: first we should abstract the topological structure of the social network out from complex network theory, then analyze the mechanism transmission according to some propagation rules, analyze how to affect the propagation through some ways at last. actually, this kind of work has already started, such as the spread of knowledge, the spread of new product network and bank financial risk; they have both relation and difference, the purpose of the research of the former is to contribute to its spread; the latter is to avoid its spread. systems science. shanghai scientific and technological educational publishing house pearson education statistical mechanics of complex network the structure and function of complex networks key: cord-015861-lg547ha9 authors: kang, nan; zhang, xuesong; cheng, xinzhou; fang, bingyi; jiang, hong title: the realization path of network security technology under big data and cloud computing date: 2019-03-12 journal: signal and information processing, networking and computers doi: 10.1007/978-981-13-7123-3_66 sha: doc_id: 15861 cord_uid: lg547ha9 this paper studies the cloud and big data technology based on the characters of network security, including virus invasion, data storage, system vulnerabilities, network management etc. it analyzes some key network security problems in the current cloud and big data network. above all, this paper puts forward technical ways of achieving network security. cloud computing is a service that based on the increased usage and delivery of the internet related services, it promotes the rapidly development of the big data information processing technology, improves the processing and management abilities of big data information. with tie rapid development of computer technology, big data technology brings not only huge economic benefits, but the evolution of social productivity. however, serials of safety problems appeared. how to increase network security has been become the key point. this paper analyzes and discusses the technical ways of achieving network security. cloud computing is a kind of widely-used distributed computing technology [1] [2] [3] . its basic concept is to automatically divide the huge computing processing program into numerous smaller subroutines through the network, and then hand the processing results back to the user after searching, calculating and analyzing by a large system of multiple servers [4] [5] [6] . 
with this technology, web service providers can process tens of millions, if not billions, of pieces of information in a matter of seconds, reaching a network service as powerful as a supercomputer [7, 8]. cloud computing is a resource delivery and usage model: resources (hardware, software) are obtained via the network, and the network providing the resources is called the 'cloud'. the hardware resources in the 'cloud' appear infinitely scalable and can be used whenever needed [9] [10] [11]. cloud computing is a product of the rapid development of computer science and technology. however, the problem of computer network security in the context of cloud computing brings a lot of trouble to people's life, work and study [12] [13] [14]. therefore, scientific and effective management measures should be taken, in combination with the characteristics of cloud computing technology, to minimize computer network security risks and improve the stability and security of the computer network. this paper briefly introduces cloud computing, analyzes the computer network security problems under cloud computing, and expounds the network security protection measures under cloud computing. processing data by cloud computing can save energy expenditure and reduce the cost of handling big data, which promotes the healthy development of cloud computing technology. big data analysis by cloud computing can be represented by a directed acyclic data-flow graph G = (V, E), where the cloud service modules in the parallel selection mechanism form the node set V = {i | i = 1, 2, ..., v} and the remote data-transfer channels form the edge set E = {(i, j) | i, j ∈ V}. the data transmission structure of the data-flow model in the c/s framework is described by the directed graph model G_p = (V_p, E_p, scap), where E_p represents the link set, V_p the set of physical nodes bearing the channels, and scap the data-unit capacity of each physical node. in addition, the undirected graph G_s = (V_s, E_s, sars) expresses the data packet markers input by the application. the process of link mapping between cloud computing components and the overall architecture can be explained as follows: for the different customer demands, an optimized resource-allocation model is built to construct the application model for big data processing. the built-in network link structure for big data information processing is as follows: in fig. 1, the i-th transmission package in the cloud computer is denoted I_th, and t_i represents the transmission time of I_th. the interval at which a component is mapped to a thread or process is given by J_i = t_i − t_d; when J_i lies in the range (−∞, ∞), the weight of node i is w_i, its computing time. the detailed application model of big data information processing is shown in fig.
2. in the mobile cloud system model, a grid architecture relies on local computing resources and the wireless network to build the cloud, and selected components of the data-flow graph are migrated to the cloud. for the formal modeling of cloud data processing, {G(V, E), s_i, d_i, J} is the given data-flow application; assuming that the channel capacity is infinite, the problem of using cloud computing technology to optimize big data information processing is described as a maximization over the placement variables x_i and y_{i,j}, max_{x_i, y_{i,j}}, where the energy overhead of data flow migrating between groups in mobile cloud computing is described accordingly. 4 main characteristics of network security technology. in the context of big data and cloud computing, users can save data in the cloud and then process and manage it there. compared with the original network technology, it carries certain data network risks, but its security coefficient is higher. cloud security technology can utilize modern network security techniques to realize centralized upgrades and guarantee the overall security of big data. since the data is stored in the cloud, enhancing cloud management is the only way to ensure its security. big data stored in the cloud usually affects network data. most enterprises connect multiple servers so as to build computing terminals with strong performance. cloud computing itself is convenient: customers do not need to purchase additional hardware facilities, only storage and computing services. owing to this particularity, cloud computing can effectively reduce resource consumption and is also a new form of energy conservation and environmental protection. when local computers encounter risks, data stored in the cloud will not be affected or lost, and at the same time the data can be shared. the sharing and transfer of raw data is generally based on physical connections before data transfer is implemented; in contrast, data sharing under big data cloud computing can be realized directly through the cloud, and users can collect data with the help of various terminals, giving the system a strong data-sharing capability. most computer networks face risks from system vulnerabilities. criminals use illegal means to exploit system vulnerabilities to invade other systems. system vulnerabilities include not only the weaknesses of the computer network system itself, but also those introduced when users download unknown plug-ins, which can easily compromise the computer system. with the continuous development of the network, virus forms are also diverse, but a virus mainly refers to a destructive program created by human factors. because of this diversity, the degree of impact also differs: customer information and enterprise files can be stolen by viruses, resulting in huge economic losses, and some viruses are highly destructive, damaging customer data and even paralyzing the network system. in the context of big data cloud computing, external storage of the cloud computing platform can be realized through various distributed facilities. the service characteristics of the system are mainly evaluated in terms of efficiency, security and stability. storage security plays a very important role in the computer network system; computer network systems are of many kinds, storage is large, and the data has diversified characteristics.
the traditional storage methods have been unable to meet the needs of social development. optimizing the data encryption methods cannot meet the demand of the network. the deployment of cloud computing data and finishing need data storage has certain stability and security, to avoid economic losses to the user. in order to ensure data security, it is necessary to strengthen computer network management. all computer managers and application personnel are the main body of computer network security management. if the network management personnel do not have a comprehensive understanding of their responsibilities and adopt an unreasonable management method, data leakage will occur. especially for enterprise, government and other information management, network security management is very important. in the process of application, many computers do not pay enough attention to network security management, leading to the crisis of computer intrusion, thus causing data exposure problems. 6 ways to achieve network security one of the main factors influencing the big data cloud save system is data layout. exploring it at the present stage is usually combined with the characteristics of the data to implement the unified layout. management and preservation function are carried out through data type distribution, and the data is encrypted. the original data stored in more than one cloud, different data management level has different abilities to resist attacks. for cloud computing, data storage, transmission and sharing can apply encryption technology. during data transmission, the party receiving the data can decrypt the encrypted data, so as to prevent the data from being damaged or stolen during the transmission. the intelligent firewall can identify the data through statistics, decision-making, memory and other ways, and achieve the effect of access control. by using the mathematical concept, it can eliminate the large-scale computing methods applied in the matching verification process and realize the mining of the network's own characteristics, so as to achieve the effect of direct access and control. the intelligent firewall technology includes risk identification, data intrusion prevention and outlaw personnel supply warning. compared with the original firewall technology, the intelligent firewall technology can further prevent the network system from being damaged by human factors and improve the security of network data. the system encryption technology is generally divided into public key and private key with the help of encryption algorithm to prevent the system from being attacked. meanwhile, service operators are given full attention to monitor the network operation and improve the overall security of the network. in addition, users should improve their operation management of data. in the process of being attacked by viruses, static and dynamic technologies are used. dynamic technologies are efficient in operation and can support multiple types of resources. safety isolation system is usually called virtualizes distributed firewalls (vdfw). it made up of security isolation system centralized management center and security service virtual machine (svm). the main role of this system is to achieve network security. the key functions of the system are as follows. access control functions analyze source/destination ip addresses, mac address, port and protocol, time, application characteristics, virtual machine object, user and other dimensions based on state detection access control. 
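as a concrete illustration of the encrypted-transmission idea above (the sender encrypts, and only the receiving party holding the key can decrypt), here is a minimal sketch using the fernet symmetric scheme from the python cryptography package; the package choice and the sample payload are assumptions for illustration, not part of the original paper.

```python
from cryptography.fernet import Fernet

# key distribution is assumed to happen over a separate secure channel
key = Fernet.generate_key()
f = Fernet(key)

payload = b"user record stored in or transferred through the cloud"
token = f.encrypt(payload)          # what is actually stored / transmitted
restored = f.decrypt(token)         # only a holder of the key recovers the data

assert restored == payload
print(len(payload), "->", len(token), "bytes on the wire")
```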
meanwhile, it supports many functions, including the access control policy grouping, search, conflict detection. intrusion prevention module judge the intrusion behavior by using protocol analysis and pattern recognition, statistical threshold and comprehensive technical means such as abnormal traffic monitoring. it can accurately block eleven categories of more than 4000 kinds of network attacks, including overflow attacks, rpc attack, webcgi attack, denial of service, trojans, worms, system vulnerabilities. moreover, it supports custom rules to detect and alert network attack traffic, abnormal messages in traffic, abnormal traffic, flood and other attacks. it can check and kill the trojan, worm, macro, script and other malicious codes contained in the email body/attachments, web pages and download files based on streaming and transparent proxy technology. it supports ftp, http, pop3, smtp and other protocols. it identifies the traffic of various application layers, identify over 2000 protocols; its built-in thousands of application recognition feature library. this paper studies the cloud and big data technology. in the context of large data cloud computing, the computer network security problem is gradually a highlight, and in this case, the computer network operation condition should be combined with the modern network frame safety technology, so as to ensure the security of the network information, thus creating a safe network operation environment for users. application and operation of computer network security prevention under the background of big data era research on enterprise network information security technology system in the context of big data self-optimised coordinated traffic shifting scheme for lte cellular systems network security technology in big data environment data mining for base station evaluation in lte cellular systems user-vote assisted self-organizing load balancing for ofdma cellular systems discussion on network information security in the context of big data telecom big data based user offloading self-optimisation in heterogeneous relay cellular systems application of cloud computing technology in computer secure storage user perception aware telecom data mining and network management for lte/lte-advanced networks selfoptimised joint traffic offloading in heterogeneous cellular networks network information security control mechanism and evaluation system in the context of big data mobility load balancing aware radio resource allocation scheme for lte-advanced cellular networks wcdma data based lte site selection scheme in lte deployment key: cord-011400-zyjd9rmp authors: peixoto, tiago p. title: network reconstruction and community detection from dynamics date: 2019-09-18 journal: nan doi: 10.1103/physrevlett.123.128301 sha: doc_id: 11400 cord_uid: zyjd9rmp we present a scalable nonparametric bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. we show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy of the reconstruction which, in turn, can better inform the uncovering of communities. we illustrate the use of our method with observations arising from epidemic models and the ising model, both on synthetic and empirical networks, as well as on data containing only functional information. 
the observed functional behavior of a wide variety of large-scale systems is often the result of a network of pairwise interactions. however, in many cases, these interactions are hidden from us, either because they are impossible to measure directly, or because their measurement can be done only at significant experimental cost. examples include the mechanisms of gene and metabolic regulation [1], brain connectivity [2], the spread of epidemics [3], systemic risk in financial institutions [4], and influence in social media [5]. in such situations, we are required to infer the network of interactions from the observed functional behavior. researchers have approached this reconstruction task from a variety of angles, resulting in many different methods, including thresholding the correlation between time series [6], inversion of deterministic dynamics [7] [8] [9], statistical inference of graphical models [10] [11] [12] [13] [14] and of models of epidemic spreading [15] [16] [17] [18] [19] [20], as well as approaches that avoid explicit modeling, such as those based on transfer entropy [21], granger causality [22], compressed sensing [23] [24] [25], generalized linearization [26], and matching of pairwise correlations [27, 28]. in this letter, we approach the problem of network reconstruction in a manner that is different from the aforementioned methods in two important ways. first, we employ a nonparametric bayesian formulation of the problem, which yields a full posterior distribution of possible networks that are compatible with the observed dynamical behavior. second, we perform network reconstruction jointly with community detection [29], where, at the same time as we infer the edges of the underlying network, we also infer its modular structure [30]. as we will show, while network reconstruction and community detection are desirable goals on their own, joining these two tasks has a synergistic effect, whereby the detection of communities significantly increases the accuracy of the reconstruction, which in turn improves the discovery of the communities, when compared to performing these tasks in isolation. some other approaches combine community detection with functional observation. berthet et al. [31] derived necessary conditions for the exact recovery of group assignments for dense weighted networks generated with community structure given observed microstates of an ising model. hoffmann et al. [32] proposed a method to infer community structure from time-series data that bypasses network reconstruction by employing a direct modeling of the dynamics given the group assignments, instead. however, neither of these approaches attempts to perform network reconstruction together with community detection. furthermore, they are tied down to one particular inverse problem, and as we will show, our general approach can be easily extended to an open-ended variety of functional models. bayesian network reconstruction.-we approach the network reconstruction task similarly to the situation where the network edges are measured directly, but via an uncertain process [33, 34]: if d is the measurement of some process that takes place on a network, we can define a posterior distribution for the underlying adjacency matrix a via bayes' rule, P(A|D) = P(D|A) P(A) / P(D), where P(D|A) is an arbitrary forward model for the dynamics given the network, P(A) is the prior information on the network structure, and P(D) = Σ_A P(D|A) P(A) is a normalization constant comprising the total evidence for the data d.
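a schematic sketch of how such a posterior can be explored in practice: a metropolis–hastings chain that proposes single-edge flips and accepts them according to P(D|A)P(A). the two functions log_likelihood and log_prior are placeholders for a concrete forward model and structured prior (such as those defined next); this is an illustration of the bayesian formulation, not the authors' reference implementation.

```python
import math
import random

import numpy as np

def mcmc_reconstruction(n_nodes, log_likelihood, log_prior, n_steps=10000, seed=0):
    """sample adjacency matrices from P(A|D) ∝ P(D|A) P(A) by flipping one edge at a time."""
    rng = random.Random(seed)
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    log_post = log_likelihood(A) + log_prior(A)
    for _ in range(n_steps):
        i, j = rng.sample(range(n_nodes), 2)            # propose flipping edge (i, j)
        A[i, j] = A[j, i] = 1 - A[i, j]
        new_log_post = log_likelihood(A) + log_prior(A)
        if math.log(rng.random()) < new_log_post - log_post:
            log_post = new_log_post                     # accept the flip
        else:
            A[i, j] = A[j, i] = 1 - A[i, j]             # reject: undo the flip
    return A
```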
we can unite reconstruction with community detection via an, at first, seemingly minor, but ultimately consequential modification of the above equation, where we introduce a structured prior P(A|b), with b representing the partition of the network into communities, i.e., b = {b_i}, where b_i ∈ {1, …, B} is the group membership of node i. this partition is unknown, and is inferred together with the network itself, via the joint posterior distribution P(A, b|D) = P(D|A) P(A|b) P(b) / P(D). the prior P(A|b) is an assumed generative model for the network structure. in our work, we will use the degree-corrected stochastic block model (dc-sbm) [35], which assumes that, besides differences in degree, nodes belonging to the same group have statistically equivalent connection patterns, according to the joint probability P(A|κ, λ, b) = ∏_{i<j} e^{−κ_i κ_j λ_{b_i b_j}} (κ_i κ_j λ_{b_i b_j})^{A_ij} / A_ij!, with λ_rs determining the average number of edges between groups r and s and κ_i the average degree of node i. the marginal prior P(A|b) is obtained by integrating over all remaining parameters weighted by their respective prior distributions, which can be computed exactly for standard prior choices, although it can be modified to include hierarchical priors that have an improved explanatory power [36] (see supplemental material [37] for a concise summary). the use of the dc-sbm as a prior probability in eq. (2) is motivated by its ability to inform link prediction in networks where some fraction of edges have not been observed or have been observed erroneously [34, 39]. the latent conditional probabilities of edges existing between groups of nodes are learned by the collective observation of many similar edges, and these correlations are leveraged to extrapolate the existence of missing or spurious ones. the same mechanism is expected to aid the reconstruction task, where edges are not observed directly, but the observed functional behavior yields a posterior distribution on them, allowing the same kind of correlations to be used as an additional source of evidence for the reconstruction, going beyond what the dynamics alone says. our reconstruction approach is finalized by defining an appropriate model for the functional behavior, determining P(D|A). here, we will consider two kinds of indirect data. the first comes from a susceptible-infected-susceptible (sis) epidemic spreading model [40], where σ_i(t) = 1 means node i is infected at time t, and 0 otherwise. the likelihood for this model is P(σ|A, τ, γ) = ∏_t ∏_i P(σ_i(t+1)|σ(t)), where P(σ_i(t+1)|σ(t)) = f(e^{m_i(t)}, σ_i(t+1))^{1−σ_i(t)} f(γ, σ_i(t+1))^{σ_i(t)} is the transition probability for node i at time t, with f(p, σ) = (1 − p)^σ p^{1−σ}, and where m_i(t) = Σ_j A_ij ln(1 − τ_ij) σ_j(t) is the contribution from all neighbors of node i to its infection probability at time t. in the equations above, the value τ_ij is the probability of an infection via an existing edge (i, j), and γ is the 1 → 0 recovery probability. with these additional parameters, the full posterior distribution for the reconstruction becomes P(A, b, τ|σ) = P(σ|A, τ) P(τ) P(A|b) P(b) / P(σ). since τ_ij ∈ [0, 1], we use the uniform prior P(τ) = 1. note, also, that the recovery probability γ plays no role in the reconstruction algorithm, since its term in the likelihood does not involve A [and, hence, gets cancelled out in the denominator P(σ|γ) = P(γ|σ) P(σ) / P(γ)]. this means that the above posterior only depends on the infection events 0 → 1 and, thus, is also valid without any modifications for all epidemic variants susceptible-infected (si), susceptible-infected-recovered (sir), susceptible-exposed-infected-recovered (seir), etc. [40], since the infection events occur with the same probability for all these models.
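the sis likelihood above translates almost directly into code; the sketch below evaluates log P(σ|A, τ) for a given adjacency matrix, infection probabilities τ_ij and a binary time series σ of shape (T, N). the recovery term in γ is omitted since, as noted, it does not affect the reconstruction; this is an illustrative implementation, not the authors' code.

```python
import numpy as np

def sis_log_likelihood(A, tau, sigma, eps=1e-12):
    """log P(sigma | A, tau) for the infection (0 -> 1) events of an SIS/SI time series."""
    T, N = sigma.shape
    # m_i(t) = sum_j A_ij * ln(1 - tau_ij) * sigma_j(t)
    log_no_inf = A * np.log(1.0 - tau + eps)             # element-wise ln(1 - tau_ij) on edges
    logL = 0.0
    for t in range(T - 1):
        m = log_no_inf @ sigma[t]                        # contribution of infected neighbours
        p_inf = 1.0 - np.exp(m)                          # prob. a susceptible node gets infected
        susceptible = sigma[t] == 0
        infected_next = sigma[t + 1] == 1
        p = np.where(infected_next, p_inf, 1.0 - p_inf)  # f(e^{m_i(t)}, sigma_i(t+1))
        logL += np.sum(np.log(p[susceptible] + eps))     # only susceptible nodes contribute
    return logL
```

plugging this into the generic metropolis sketch given earlier (as log_likelihood, with a log_prior derived from the sbm) illustrates how the joint reconstruction can be sampled.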
the second functional model we consider is the ising model, where spin variables on the nodes s ∈ {−1, 1}^N are sampled according to the joint distribution P(s|A, β, J, h) = exp(β Σ_{i<j} A_ij J_ij s_i s_j + β Σ_i h_i s_i) / Z(A, β, J, h), where β is the inverse temperature, J_ij is the coupling on edge (i, j), h_i is a local field on node i, and Z(A, β, J, h) = Σ_s exp(β Σ_{i<j} A_ij J_ij s_i s_j + β Σ_i h_i s_i) is the partition function. for comparison, a reconstruction can also be obtained by thresholding the correlation matrix, setting â_ij = 1 if c_ij > c*, and 0 otherwise. the value of c* was chosen to maximize the posterior similarity, which represents the best possible reconstruction achievable with this method. nevertheless, the network thus obtained is severely distorted. the inverse correlation method comes much closer to the true network, but is superseded by the joint inference with community detection. empirical dynamics.-we turn to the reconstruction from observed empirical dynamics with unknown underlying interactions. the first example is the sequence of M = 619 votes of N = 575 deputies in the 2007 to 2011 session of the lower chamber of the brazilian congress. each deputy voted yes, no, or abstained for each legislation, which we represent as {1, −1, 0}, respectively. since the temporal ordering of the voting sessions is likely to be of secondary importance to the voting outcomes, we assume the votes are sampled from an ising model [the addition of zero-valued spins changes eq. (9) only slightly by replacing 2 cosh(x) → 1 + 2 cosh(x)]. figure 4 shows the result of the reconstruction, where the division of the nodes uncovers a cohesive government and a split opposition, as well as a marginal center group, which correlates very well with the known party memberships and can be used to predict unseen voting behavior (see supplemental material [37] for more details). in fig. 5, we show the result of the reconstruction of the directed network of influence between N = 1833 twitter users from 58224 retweets [50] using a si epidemic model (the act of "retweeting" is modeled as an infection event, using eqs. (5) and (6) with γ = 0) and the nested dc-sbm. the reconstruction uncovers isolated groups with varying propensities to retweet, as well as groups that tend to influence a large fraction of users. by inspecting the geolocation metadata on the users, we see that the inferred groups amount, to a large extent, to different countries, although clear subdivisions indicate that this is not the only factor governing the influence among users (see supplemental material [37] for more details). conclusion.-we have presented a scalable bayesian method to reconstruct networks from functional observations that uses the sbm as a structured prior and, hence, performs community detection together with reconstruction. the method is nonparametric and, hence, requires no prior stipulation of aspects of the network and size of the model, such as number of groups. by leveraging inferred correlations between edges, the sbm includes an additional source of evidence and, thereby, improves the reconstruction accuracy, which in turn also increases the accuracy of the inferred communities. the overall approach is general, requiring only appropriate functional model specifications, and can be coupled with an open-ended variety of such models other than those considered here. [figure 5 caption, fragment: see [51, 52] for details on the layout algorithm; the edge colors indicate the infection probabilities τ_ij as shown in the legend, and the text labels show the dominating country membership for the users in each group.]
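the naive correlation-thresholding baseline mentioned above can be sketched in a few lines: compute the pearson correlation matrix of the observed spins and declare an edge wherever c_ij exceeds a threshold c*; sweeping c* and keeping the value that best matches a reference network reproduces the "best possible" thresholding reconstruction. the similarity measure used here (jaccard overlap of edges) is an illustrative stand-in for the posterior similarity of the paper.

```python
import numpy as np

def threshold_reconstruction(spins, c_star):
    """spins: (M, N) array of observed states; returns a binary adjacency estimate."""
    C = np.corrcoef(spins, rowvar=False)       # pearson correlation between node pairs
    A_hat = (C > c_star).astype(int)
    np.fill_diagonal(A_hat, 0)
    return A_hat

def edge_jaccard(A_true, A_hat):
    iu = np.triu_indices_from(A_true, k=1)
    a, b = A_true[iu].astype(bool), A_hat[iu].astype(bool)
    return (a & b).sum() / max((a | b).sum(), 1)

# sweep c* against a known reference network A_true and keep the best value:
# best_score, best_c = max((edge_jaccard(A_true, threshold_reconstruction(spins, c)), c)
#                          for c in np.linspace(0.0, 1.0, 101))
```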
inferring gene regulatory networks from multiple microarray datasets dynamic models of large-scale brain activity estimating spatial coupling in epidemiological systems: a mechanistic approach bootstrapping topological properties and systemic risk of complex networks using the fitness model the role of social networks in information diffusion network inference with confidence from multivariate time series revealing network connectivity from response dynamics inferring network topology from complex dynamics revealing physical interaction networks from statistics of collective dynamics learning factor graphs in polynomial time and sample complexity reconstruction of markov random fields from samples: some observations and algorithms, in approximation, randomization and combinatorial optimization. algorithms and techniques which graphical models are difficult to learn estimation of sparse binary pairwise markov networks using pseudo-likelihoods inverse statistical problems: from the inverse ising problem to data science inferring networks of diffusion and influence on the convexity of latent social network inference learning the graph of epidemic cascades statistical inference approach to structural reconstruction of complex networks from binary time series maximum-likelihood network reconstruction for sis processes is np-hard network reconstruction from infection cascades escaping the curse of dimensionality in estimating multivariate transfer entropy causal network inference by optimal causation entropy reconstructing propagation networks with natural diversity and identifying hidden sources efficient reconstruction of heterogeneous networks from time series via compressed sensing robust reconstruction of complex networks from sparse data universal data-based method for reconstructing complex networks with binary-state dynamics reconstructing weighted networks from dynamics reconstructing network topology and coupling strengths in directed networks of discrete-time dynamics community detection in networks: a user guide bayesian stochastic blockmodeling exact recovery in the ising blockmodel community detection in networks with unobserved edges network structure from rich but noisy data reconstructing networks with unknown and heterogeneous errors stochastic blockmodels and community structure in networks nonparametric bayesian inference of the microcanonical stochastic block model for summary of the full generative model used, details of the inference algorithm and more information on the analysis of empirical data efficient monte carlo and greedy heuristic for the inference of stochastic block models missing and spurious interactions and the reconstruction of complex networks epidemic processes in complex networks spatial interaction and the statistical analysis of lattice systems equation of state calculations by fast computing machines monte carlo sampling methods using markov chains and their applications asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications artifacts or attributes? 
effects of resolution on the little rock lake food web note that, in this case, our method also exploits the heterogeneous degrees in the network via the dc-sbm, which can refinements of this approach including thouless-anderson-palmer (tap) and bethe-peierls (bp) corrections [14] yield the same performance for this example pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of ising models the simple rules of social contagion hierarchical block structures and high-resolution model selection in large networks hierarchical edge bundles: visualization of adjacency relations in hierarchical data key: cord-017423-cxua1o5t authors: wang, rui; jin, yongsheng; li, feng title: a review of microblogging marketing based on the complex network theory date: 2011-11-12 journal: 2011 international conference in electrics, communication and automatic control proceedings doi: 10.1007/978-1-4419-8849-2_134 sha: doc_id: 17423 cord_uid: cxua1o5t microblogging marketing which is based on the online social network with both small-world and scale-free properties can be explained by the complex network theory. through systematically looking back at the complex network theory in different development stages, this chapter reviews literature from the microblogging marketing angle, then, extracts the analytical method and operational guide of microblogging marketing, finds the differences between microblog and other social network, and points out what the complex network theory cannot explain. in short, it provides a theoretical basis to effectively analyze microblogging marketing by the complex network theory. as a newly emerging marketing model, microblogging marketing has drawn the domestic academic interests in the recent years, but the relevant papers are scattered and inconvenient for a deep research. on the microblog, every id can be seen as a node, and the connection between the different nodes can be seen as an edge. these nodes, edges, and relationships inside form the social network on microblog which belongs to a typical complex network category. therefore, reviewing the literature from the microblogging marketing angle by the complex network theory can provide a systematic idea to the microblogging marketing research. in short, it provides a theoretical basis to effectively analyze microblogging marketing by the complex network theory. the start of the complex network theory dates from the birth of small-world and scale-free network model. these two models provide the network analysis tools and information dissemination interpretation to the microblogging marketing. "six degrees of separation" found by stanley milgram and other empirical studies show that the real network has a network structure of high clustering coefficient and short average path length [1] . watts and strogatz creatively built the smallworld network model with this network structure (short for ws model), reflecting human interpersonal circle focus on acquaintances to form the high clustering coefficient, but little exchange with strangers to form the short average path length [2] . every id in microblog has strong ties with acquaintance and weak ties with strangers, which matches the ws model, but individuals can have a large numbers of weak ties in the internet so that the online microblog has diversity with the real network. 
barabási and albert built a model with a growth mechanism and a preferential attachment mechanism to reflect the fact that real networks have degree distributions following a power law. because the power-law degree distribution has no characteristic scale, this model is called the scale-free network model (short for ba model) [3]. the power-law distribution shows that most nodes have low degree and weak impact while a few nodes have high degree and strong impact, confirming the "matthew effect" in sociology and matching the microblog structure in which celebrities have much greater influence than grassroots users, which the small-world model cannot describe. in brief, the complex network theory pioneered by the small-world and scale-free network models overcomes the constraints on network size and structure of regular and random networks, and describes the basic structural features of high clustering coefficient, short average path length, power-law degree distribution, and scale-free characteristics. the existing literature analyzing microblogging marketing with the complex network theory is scarce, which makes it worth further study. the complex network theory has evolved from the small-world and scale-free models to other major models such as the epidemic model and the game model. the study of diffusion behavior on these evolved complex network models is valuable and can reveal the spread of microblogging marketing messages in depth. the epidemic model divides the crowd into three basic types: susceptible (s), infected (i), and removed (r), and builds models according to the relationships among these types during disease spread in order to analyze the transmission rate, the infection level, and the infection threshold needed to control the disease. typical epidemic models are the sir model and the sis model. the difference lies in that the infected (i) in the sir model become removed (r) after recovery, so the sir model is used for diseases that confer immunity, while the infected (i) in the sis model acquire no immunity and simply become susceptible (s) again after recovery; therefore, the sis model is used for diseases without immunity. these two models gave rise to other epidemic models: the sir model becomes the sirs model when the removed (r) can become susceptible (s) again, and the sis model becomes the si model, describing a disease that breaks out in a short time, when the infected (i) cannot be cured. epidemic models are widely applied on complex networks, for example to the dissemination of computer viruses [4], information [5], and knowledge [6]. guimerà et al. find hierarchical and community structure in social networks [7]. due to the hierarchical structure, barthélemy et al. indicate that a disease outbreak follows hierarchical dissemination from the large-degree node group to the small-degree node group [8]. due to the community structure, liu et al. indicate that community structure lowers the infection threshold and raises the steady-state density of infection, and is thus in favor of the infection [9]; fu finds that the real interpersonal social network has a positive correlation of the node degree distribution, whereas the online interpersonal social network has a negative one [10]. the former means that circles form among celebrities but not grassroots users, while the latter means that contacts form between celebrities and grassroots users on the microblog.
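as an aside not present in the original chapter, the following python sketch uses networkx to generate the two benchmark models just reviewed and to report the structural signatures mentioned above: high clustering with short paths for the ws model, and the concentration of links on a few hubs for the ba model. sizes and parameters are arbitrary illustrative choices.

```python
# illustrative sketch (not from the original chapter): comparing the ws and ba
# benchmark models discussed above with networkx; parameters are arbitrary.
import networkx as nx
from collections import Counter

n = 2000
ws = nx.watts_strogatz_graph(n, k=10, p=0.1, seed=1)   # small-world: high clustering, short paths
ba = nx.barabasi_albert_graph(n, m=5, seed=1)          # scale-free: heavy-tailed degree distribution

for name, g in [("ws", ws), ("ba", ba)]:
    clustering = nx.average_clustering(g)
    path_len = nx.average_shortest_path_length(g)      # both graphs are connected for these parameters
    max_deg = max(dict(g.degree()).values())
    print(f"{name}: clustering={clustering:.3f}, avg path length={path_len:.2f}, max degree={max_deg}")

# the ba graph concentrates links on a few hubs ("matthew effect"),
# while the ws graph keeps degrees narrowly distributed around k.
deg_counts = Counter(d for _, d in ba.degree())
print("ba degree tail (degree, count):", sorted(deg_counts.items())[-5:])
```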
the game theory combined with the complex network theory can explain the interpersonal microlevel interaction such as tweet release, reply, and retweet because it can analyze the complex dynamic process between individuals such as the game learning model, dynamic evolutionary game model, local interaction model, etc.(1) game learning model: individuals make the best decision by learning from others in the network. learning is a critical point to decision-making and game behavior, and equilibrium is the long-term process of seeking the optimal results by irrational individuals [11] . bala and goyal draw the "neighbor effect" showing the optimal decision-making process based on the historical information from individuals and neighbors [12] . (2) dynamic evolutionary game model: the formation of the social network seems to be a dynamic outcome due to the strategic choice behavior between edge-breaking and edge-connecting based on the individual evolutionary game [13] . fu et al. add reputation to the dynamic evolutionary game model and find individuals are more inclined to cooperate with reputable individuals in order to form a stable reputation-based network [14] . (3) local interaction model: local network information dissemination model based on the strong interactivity in local community is more practical to community microblogging marketing. li et al. restrain preferential connection mechanism in a local world and propose the local world evolutionary network model [15] . burke et al. construct a local interaction model and find individual behavior presents the coexistence of local consistency and global decentrality [16] . generally speaking, microblog has characteristics of the small-world, scale-free, high clustering coefficient, short average path length, hierarchical structure, community structure, and node degree distribution of positive and negative correlation. on one hand, the epidemic model offers the viral marketing principles to microblogging marketing, such as the sirs model can be used for the long-term brand strategy and the si model can be used for the short-term promotional activity; on the other hand, the game model tells microblogging marketing how to find opinion leaders in different social circles to develop strategies for the specific community to realize neighbor effect and local learning to form global microblog coordination interaction. rationally making use of these characteristics can preset effective strategies and solutions for microblogging marketing. the complex network theory is applied to biological, technological, economic, management, social, and many other fields by domestic scholars. zhou hui proves the spread of sars rumors has a typical small-world network features [17] . duan wenqi studies new products synergy diffusion in the internet economy by the complex network theory to promote innovation diffusion [18] . wanyangsong (2007) analyzes the dynamic network of banking crisis spread and proposes the interbank network immunization and optimization strategy [19] . although papers explaining microblogging marketing by the complex network theory have not been found, these studies have provided the heuristic method, such as the study about the online community. based on fu's study on xiao nei sns network [10] , hu haibo et al. 
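as a purely illustrative toy (not one of the cited models), the following sketch implements a minimal "neighbor effect": nodes repeatedly imitate the majority behaviour of their neighbourhood, the kind of local learning dynamic that the game models reviewed above formalize. the network, the initial adoption fraction, and the number of updates are arbitrary.

```python
# toy sketch of a "neighbor effect" (local imitation) dynamic on a network;
# a generic illustration only, not the specific models cited above.
import random
import networkx as nx

random.seed(0)
g = nx.watts_strogatz_graph(200, k=6, p=0.05, seed=0)
# each node holds a binary choice, e.g., retweet a topic (1) or ignore it (0)
state = {v: 1 if random.random() < 0.1 else 0 for v in g}

for _ in range(2000):
    v = random.choice(list(g))
    neighbors = list(g.neighbors(v))
    if neighbors:
        # adopt the majority behaviour of the local neighbourhood
        adopters = sum(state[u] for u in neighbors)
        state[v] = 1 if adopters > len(neighbors) / 2 else 0

print("fraction of adopters after local imitation:", sum(state.values()) / len(state))
```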
carry out a case study on the ruo lin sns network and conclude that the online interpersonal social network not only has almost the same network characteristics as the real interpersonal social network, but also has a negative correlation of the node degree distribution while the real interpersonal social network has a positive one. this is because the online interpersonal social network makes it easier for strangers to establish relationships, so that people with small influence can reach people with big influence and form weak ties in plenty by breaking through the limited range of the real world [20]. these studies can be used to effectively develop marketing strategies and to control the scope and effectiveness of microblogging marketing. there is great potential for research on the emerging microblog network platform using the complex network theory. the complex network theory provides micro and macro models for analyzing the marketing process in microblogging marketing. the complex network characteristics of the small world, scale-freeness, high clustering coefficient, short average path length, hierarchical structure, community structure, and positive or negative correlation of the node degree distribution, together with their applications in various industries, provide theoretical and practical methods to conduct and implement microblogging marketing. the basic research idea is: extract the network topology of the microblog with the complex network theory; then analyze the marketing processes and dissemination mechanisms with the epidemic model, the game model, or other models while taking into account the impact of macro and micro factors; finally, find measures for improving or limiting the marketing effect in order to promote beneficial activities and control impedimental activities in enterprises' microblogging marketing. because of the macro and micro complexity and uncertainty of the online interpersonal social network, previous static and dynamic marketing theories cannot give a reasonable explanation. based on the strong ties and weak ties that connect individuals in the complex network, goldenberg et al. find: (1) after an external short-term promotion activity, strong ties and weak ties become the main force driving product diffusion; (2) strong ties have strong local impact and weak transmission ability, while weak ties have strong transmission ability and weak local impact [21]. therefore, the strong local impact of strong ties and the strong transmission ability of weak ties need to be used rationally in microblogging marketing. through system simulation and data mining, the complex network theory can provide an explanatory framework and mathematical tools for microblogging marketing as an operational guide. microblogging marketing is based on the online interpersonal social network, which differs from nonpersonal social networks and the real interpersonal social network. therefore, the corresponding study results cannot simply be mixed when human factors are involved. pastor-satorras et al. propose the targeted immunization solution, which gives protection priority to larger-degree nodes according to the sis scale-free network model [22]. this suggests the importance of cooperation with the large influential ids as opinion leaders in microblogging marketing. remarkably, the large influential ids are usually taken to be the ids with large follower counts on the microblog platform, which can be read from the microblog database.
the trouble is, as scarce resources, the large influential ids have a higher cooperative cost, but the large followers' ids are not all large influential ids due to the online public relations behaviors such as follower purchasing and watering. this problem is more complicated than simply the epidemic model. the complex network theory can be applied in behavior dynamics, risk control, organizational behavior, financial markets, information management, etc.. microblogging marketing can learn the analytical method and operational guide from these applications, but the complex network theory cannot solve all the problems of microblogging marketing, mainly: 1. the complexity and diversity of microblogging marketing process cannot completely be explained by the complex network theory. unlike the natural life-like virus, individuals on microblog are bounded rational, therefore, the decisionmaking processes are impacted by not only the neighbor effect and external environment but also by individuals' own values, social experience, and other subjective factors. this creates a unique automatic filtering mechanism of microblogging information dissemination: information recipients reply and retweet the tweet or establish and cancel contact only dependent on their interests, leading to the complexity and diversity. therefore, interaction-worthy topics are needed in microblogging marketing, and the effective followers' number and not the total followers' number of id is valuable. this cannot be seen in disease infection. 2. there are differences in network characteristics between microblog network and the real interpersonal social network. on one hand, the interpersonal social network is different from the natural social network in six points: (1) social network has smaller network diameter and average path length; (2) social network has higher clustering coefficient than the same-scale er random network; (3) the degree distribution of social network has scale-free feature and follows power-law; (4) interpersonal social network has positive correlation of node degree distribution but natural social network has negative; (5) local clustering coefficient of the given node has negative correlation of the node degree in social network; (6) social network often has clear community structure [23] . therefore, the results of the natural social network are not all fit for the interpersonal social network. on the other hand, as the online interpersonal social network, microblog has negative correlation of the node degree distribution which is opposite to the real interpersonal social network. this means the results of the real interpersonal social network are not all fit for microblogging marketing. 3. there is still a conversion process from information dissemination to sales achievement in microblogging marketing. information dissemination on microblog can be explained by the complex network models such as the epidemic model, but the conversion process from information dissemination to sales achievement cannot be simply explained by the complex network theory, due to not only individual's external environment and neighborhood effect, but also consumer's psychology and willingness, payment capacity and convenience, etc.. according to the operational experience, conversion rate, retention rates, residence time, marketing topic design, target group selection, staged operation program, and other factors are needed to be analyzed by other theories. 
above all, microblogging marketing which attracts the booming social attention cannot be analyzed by regular research theories. however, the complex network theory can provide the analytical method and operational guide to microblogging marketing. it is believed that microblogging marketing on the complex network theory has a good study potential and prospect from both theoretical and practical point of view. the small world problem collective dynamics of 'small-world' networks emergence of scaling in random networks how viruses spread among computers and people information exchange and the robustness of organizational networks network structure and the diffusion of knowledge team assembly mechanisms determine collaboration network structure and team performance romualdo pastor-satorras, alessandro vespignani: velocity and hierarchical spread of epidemic outbreaks in scale-free networks epidemic spreading in community networks social dilemmas in an online social network: the structure and evolution of cooperation the theory of learning in games learning from neighbors a strategic model of social and economic networks reputation-based partner choice promotes cooperation in social networks a local-world evolving network model the emergence of local norms in networks research of the small-world character during rumor's propagation study on coordinated diffusion of new products in internet market doctoral dissertation of shanghai jiaotong university structural analysis of large online social network talk of the network: a complex systems look at the underlying process of word-of-mouth immunization of complex networks meeting strangers and friends of friends: how random are socially generated networks key: cord-241057-cq20z1jt authors: han, jungmin; cresswell-clay, evan c; periwal, vipul title: statistical physics of epidemic on network predictions for sars-cov-2 parameters date: 2020-07-06 journal: nan doi: nan sha: doc_id: 241057 cord_uid: cq20z1jt the sars-cov-2 pandemic has necessitated mitigation efforts around the world. we use only reported deaths in the two weeks after the first death to determine infection parameters, in order to make predictions of hidden variables such as the time dependence of the number of infections. early deaths are sporadic and discrete so the use of network models of epidemic spread is imperative, with the network itself a crucial random variable. location-specific population age distributions and population densities must be taken into account when attempting to fit these events with parametrized models. these characteristics render naive bayesian model comparison impractical as the networks have to be large enough to avoid finite-size effects. we reformulated this problem as the statistical physics of independent location-specific `balls' attached to every model in a six-dimensional lattice of 56448 parametrized models by elastic springs, with model-specific `spring constants' determined by the stochasticity of network epidemic simulations for that model. the distribution of balls then determines all bayes posterior expectations. important characteristics of the contagion are determinable: the fraction of infected patients that die ($0.017pm 0.009$), the expected period an infected person is contagious ($22 pm 6$ days) and the expected time between the first infection and the first death ($25 pm 8$ days) in the us. 
the rate of exponential increase in the number of infected individuals is $0.18pm 0.03$ per day, corresponding to 65 million infected individuals in one hundred days from a single initial infection, which fell to 166000 with even imperfect social distancing effectuated two weeks after the first recorded death. the fraction of compliant socially-distancing individuals matters less than their fraction of social contact reduction for altering the cumulative number of infections. the pandemic caused by the sars-cov-2 virus has swept across the globe with remarkable rapidity. the parameters of the infection produced by the virus, such as the infection rate from person-to-person contact, the mortality rate upon infection and the duration of the infectivity period are still controversial . parameters such as the duration of infectivity and predictions such as the number of undiagnosed infections could be useful for shaping public health responses as the predictive aspects of model simulations are possible guides to pandemic mitigation [7, 10, 20] . in particular, the possible importance of superspreaders should be understood [24] [25] [26] [27] . [5] had the insight that the early deaths in this pandemic could be used to find some characteristics of the contagion that are not directly observable such as the number of infected individuals. this number is, of course, crucial for public health measures. the problem is that standard epidemic models with differential equations are unable to determine such hidden variables as explained clearly in [6] . the early deaths are sporadic and discrete events. these characteristics imply that simulating the epidemic must be done in the context of network models with discrete dynamics for infection spread and death. the first problem that one must contend with is that even rough estimates of the high infection transmission rate and a death rate with strong age dependence imply that one must use large networks for simulations, on the order of 10 5 nodes, because one must avoid finite-size effects in order to accurately fit the early stochastic events. the second problem that arises is that the contact networks are obviously unknown so one must treat the network itself as a stochastic random variable, multiplying the computational time by the number of distinct networks that must be simulated for every parameter combination considered. the third problem is that there are several characteristics of sars-cov-2 infections that must be incorporated in any credible analysis, and the credibility of the analysis requires an unbiased sample of parameter sets. these characteristics are the strong age dependence of mortality of sars-cov-2 infections and a possible dependence on population density which should determine network connectivity in an unknown manner. thus the network nodes have to have location-specific population age distributions incorporated as node characteristics and the network connectivity itself must be a free parameter. 3 an important point in interpreting epidemics on networks is that the simplistic notion that there is a single rate at which an infection is propagated by contact is indefensible. in particular, for the sars-cov-2 virus, there are reports of infection propagation through a variety of mucosal interfaces, including the eyes. thus, while an infection rate must be included as a parameter in such simulations, there is a range of infection rates that we should consider. 
indeed, one cannot make sense of network connectivity without taking into account the modes of contact, for instance if an individual is infected during the course of travel on a public transit system or if an individual is infected while working in the emergency room of a hospital. one expects that network connectivity should be inversely correlated with infectivity in models that fit mortality data equally well but this needs to be demonstrated with data to be credible, not imposed by fiat. the effective network infectivity, which we define as the product of network connectivity and infection rate, is the parameter that needs to be reduced by either social distancing measures such as stay-at-home orders or by lowering the infection rate with mask wearing and hand washing. a standard bayesian analysis with these features is computationally intransigent. we therefore adopted a statistical physics approach to the bayesian analysis. we imagined a six-dimensional lattice of models with balls attached to each model with springs. each ball represents a location for which data is available and each parameter set determines a lattice point. the balls are, obviously, all independent but they face competing attractions to each lattice point. the spring constants for each model are determined by the variation we find in stochastic simulations of that specific model. one of the dimensions in the lattice of models corresponds to a median age parameter in the model. each location ball is attracted to the point in the median age parameter dimension that best matches that location's median age, and we only have to check that the posterior expectation of the median age parameter for that location's ball is close to the location's actual median age. thus we can decouple the models and the data simulations without having to simulate each model with the characteristics of each location, making the bayesian model comparison amenable to computation. finally, the distribution of location balls over the lattice determines the posterior expectation values of each parameter. we matched the outcomes of our simulations with data on the two week cumulative death counts after the first death using bayes' theorem to obtain parameter estimates for the infection dynamics. we used the bayesian model comparison to determine posterior expectation values for parameters for three distinct datasets. finally, we simulated the effects of various partially effective social-distancing measures on random networks and parameter sets given by the posterior expectation values of our bayes model comparison. we used data for the sars-cov-2 pandemic as compiled by [28] from the original data we generated random g(n, p = 2l/(n − 1)) networks of n = 90000 or 100000 nodes with an average of l links per node using the python package networkx [36] . scalepopdens ≡ l is one of the parameters that we varied. we compared the posterior expectation for this parameter for a location with the actual population density in an attempt to predict the appropriate way to incorporate measurable population densities in epidemic on network models [37, 38] . we used the python epidemics on networks package [39, 40] to simulate networks with specific parameter sets. we defined nodes to have status susceptible, infected, recovered or dead. we started each simulation with exactly one infected node, chosen at random. the simulation has two sorts of events: 1. 
an infected node connected to a susceptible node can change the status of the susceptible node to infected with an infection rate, infrate. this event is network connectivity dependent. therefore we expect to see a negative or inverse correlation between infrate and scalepopdens. 2. an infected node can transition to recovered status with a recovery rate, recrate, or transition to a dead status with a death rate, deathrate. both these rates are entirely node-autonomous. the reciprocal of the recrate parameter (recdays in the following) is the number of days an individual is contagious. we assigned an age to each node according to a probability distribution parametrized by the median age of each data set (country or state). as is well-known, there is a wide disparity in median ages in different countries. the probability distribution approximately models the triangular shape of the population pyramids that is observed in demographic studies. we parametrized it as a function of age a as follows: here medianage is the median age of a specific country, maxage = 100 y is a global maxiit is computationally impossible to perform model simulations for the exact age distribution for each location. we circumvented this problem, as detailed in the next subsection (bayes setup), by incorporating a scalemedage parameter in the model, scaled so that scalemedage = 1.0 corresponds to a median age of 40 years. the node age is used to make the deathrate of any node age-specific in the form of an age-dependent weight: where a[n] is the age of node n and ageincrease = 5.5 is an age-dependence exponent. w(a) is normalized so that a w(a|ageincrease)p (a|medianage = 38.7y) = 1, using the median age of new york state's population as the value of ageincrease given above was approximately determined by fitting to the observed age-specific mortality statistics of new york state [35] . however, we included ageincrease as a model parameter since the strong age dependence of sars-cov-2 mortality is not well understood, with the normalization adjusted appropriately as a function of ageincrease. note that a decrease in the median age with all rates and the age-dependence exponent held constant will lead to a lower number of deaths. we use simulations to find the number of dead nodes as a function of time. the first time at which a death occurs following the initial infection in the network is labeled time-firstdeath. figure close to its actual median age. we implemented bayes' theorem as usual. the probability of a model, m, given a set of after the first death did not affect our results. as alluded to in the previous subsection, the posterior expectation of the scalemedage parameter (×40 y) for each location should turn out to be close to the actual median age for each location in our setup, and this was achieved (right column, figure 5 ). we simulated our grid of models on the nih biowulf cluster. our grid comprised of 56448 ×2 parametrized models simulated with 40 random networks each and parameters in all possible combinations from the following lists: parameters. in particular, note that the network infectivity (infcontacts) has a factor of two smaller uncertainty than either of its factors as these two parameters (infrate and scalepopdens) cooperate in the propagation of the contagion and therefore turn out to have a negative posterior weighted correlation coefficient (table i ). 
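the following python sketch (not from the paper) assembles the ingredients described above: the g(n, p = 2l/(n − 1)) random network built with networkx, node ages drawn from a location-dependent distribution, and an age-weighted death rate normalized against the age distribution. since the exact parametrizations of the age distribution and of the weight w(a) are not reproduced above, the functional forms used here are labeled assumptions.

```python
# sketch of the network and age setup described above. the g(n, p = 2*l/(n-1)) construction,
# the parameter names, and the normalization of the age weights follow the text; the shape of
# the age distribution and the power-law form of w(a) are stand-in assumptions, since the
# exact parametrizations are not reproduced in the text above.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, scalePopDens = 10_000, 4.0                      # smaller than the paper's 90000-100000 nodes
medianAge, maxAge, ageIncrease = 38.7, 100, 5.5
deathRate = 0.002                                  # illustrative baseline death rate

g = nx.gnp_random_graph(n, p=2 * scalePopDens / (n - 1), seed=0)

# crude stand-in for the triangular population pyramid (probability decreasing with age)
ages = np.arange(maxAge + 1)
p_age = np.clip(1.0 - ages / (2.0 * medianAge), 0.0, None)
p_age /= p_age.sum()
node_age = rng.choice(ages, size=n, p=p_age)

# age-dependent weight, normalized so that sum_a w(a) * p(a | medianage = 38.7 y) = 1
w_unnorm = (ages / maxAge) ** ageIncrease          # assumed functional form, not the paper's
w = w_unnorm / (w_unnorm * p_age).sum()
node_death_rate = deathRate * w[node_age]          # per-node, age-weighted death rate

print("mean degree:", 2 * g.number_of_edges() / n)
print("mean age:", node_age.mean(), " mean relative death weight:", w[node_age].mean())
```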
the concordance of posterior expectation values across datasets is summarized in table i. this goes along with the approximately 80 day period between the first infection and the first death for a few outlier trajectories. however, it is also clear from the histograms in figure 9 and the mean timefirstdeath given in table i that the likely value of this duration is considerably shorter. finally, we evaluated a possible correlation between the actual population density and the scalepopdens parameter governing network connectivity. we found a significant correlation. when we added additional countries to the european union countries in this regression, we obtained (p < 0.0019, r = 0.33): scalepopdens(us&eu+) = 0.11 ln(population per km^2) + 2.9. while epidemiology is not the standard stomping ground of statistical physics, bayesian model comparison is naturally interpreted in a statistical physics context. we showed that taking this interpretation seriously leads to enormous reductions in computational effort. given the complexity of translating the observed manifestations of the pandemic into an understanding of the virus's spread and the course of the infection, we opted for a simple data-driven approach, taking into account population age distributions and the age dependence of the death rate. while the conceptual basis of our approach is simple, there were computational difficulties we had to overcome to make the implementation amenable to computation with finite computational resources. our results were checked not to depend on the size of the networks we simulated, on the number of stochastic runs we used for each model, nor on the number of days that we used for the linear regression. all the values we report in table i are well within most estimated ranges in the literature but with the benefit of uncertainty estimates performed with a uniform model prior. while each location ball may range over a broad distribution of models, the consensus posterior distribution (table i) shows remarkable concordance across datasets. we can predict the posterior distribution of the time of initial infection, timefirstdeath, as shown in table i. the dynamic model can predict the number of people infected after the first infection (right panel, figure 10) and relative to the time of first death (left panel, figure 10) because we made no use of infection or recovery statistics in our analysis [9]. note the enormous variation in the number of infections for the same parameter set, only partly due to stochasticity of the networks themselves, as can be seen by comparing the upper and lower rows of figure 4. with parameters intrinsic to the infection held fixed, we can predict the effect of various degrees of social distancing by varying network connectivity. we assumed that a certain fraction of nodes in the network would comply with social distancing and only these compliant nodes would reduce their connections at random by a certain fraction. figure 12 shows the effects of four such combinations of compliant node fraction and fraction of connections reduced (table ii). comparison with the posterior expectations of parameters (table i) shows that the bayes entropy of the model posterior distribution is an important factor to consider, validating our initial intuition that optimization of model parameters would be inappropriate in this analysis.
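a minimal sketch of the social-distancing scenario just described, assuming a plain discrete-time susceptible/infected/removed process rather than the authors' epidemics-on-networks pipeline; node counts, probabilities, and the compliant-fraction/reduction-fraction combinations are illustrative stand-ins (the 22-day infectious period echoes the posterior estimate quoted in the abstract).

```python
# simplified sketch of the social-distancing experiment described above: a compliant fraction
# of nodes drops a fraction of its links at random before the outbreak. this is a stand-in
# discrete-time simulation, not the paper's epidemics-on-networks pipeline.
import random
import networkx as nx

def epidemic_size(g, inf_prob=0.05, rec_prob=1 / 22, steps=250, seed=1):
    rng = random.Random(seed)
    status = {v: "S" for v in g}
    status[rng.choice(list(g))] = "I"
    ever_infected = 1
    for _ in range(steps):
        newly_infected, removed = [], []
        for v in g:
            if status[v] == "I":
                if rng.random() < rec_prob:
                    removed.append(v)
                for u in g.neighbors(v):
                    if status[u] == "S" and rng.random() < inf_prob:
                        newly_infected.append(u)
        for u in newly_infected:
            if status[u] == "S":
                status[u] = "I"
                ever_infected += 1
        for v in removed:
            status[v] = "R"
    return ever_infected

def distanced_copy(g, compliant_frac, reduction_frac, seed=2):
    rng = random.Random(seed)
    h = g.copy()
    for v in [u for u in h if rng.random() < compliant_frac]:
        edges = list(h.edges(v))
        rng.shuffle(edges)
        h.remove_edges_from(edges[: int(reduction_frac * len(edges))])
    return h

g = nx.gnp_random_graph(10_000, p=2 * 4.0 / (10_000 - 1), seed=0)
print("no distancing:", epidemic_size(g))
for cf, rf in [(0.5, 0.5), (0.9, 0.5), (0.5, 0.9), (0.9, 0.9)]:
    print(f"compliant={cf}, reduction={rf}:", epidemic_size(distanced_copy(g, cf, rf)))
```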
the regression we found (eq.'s 6, 7, 8) with respect to population density must be considered in light of the fact that many outbreaks are occurring in urban areas so they are not necessarily reflective of the true population density dependence. furthermore, we did not find a significant regression for the countries of the european union by themselves, perhaps because they have a smaller range of population densities, though the addition of these countries into the us states data further reduced the regression p-value of the null hypothesis without materially altering regression parameters. detailed epidemiological data could be used to clarify its significance. [ [24] [25] [26] [27] have suggested the importance of super-spreader events but we did not encounter any difficulty in modeling the available data with garden variety g(n, p) networks. certainly if the network has clusters of older nodes, there will be abrupt jumps in the cumulative death count as the infection spreads through the network. furthermore, it would be interesting to consider how to make the basic model useful for more heterogenous datasets such as all countries of the world with vastly different reporting of death statistics. using the posterior distribution we derived as a starting point for more complicated models may be an approach worth investigating. infectious disease modeling is a deep field with many sophisticated approaches in use [39, [41] [42] [43] and, clearly, our analysis is only scratching the surface of the problem at hand. network structure, in particular, is a topic that has received much attention in social network research [37, 38, [44] [45] [46] . bayesian approaches have been used in epidemics on networks modeling [47] and have also been used in the present pandemic context in [2, 27, 48] . to our knowledge, there is no work in the published literature that has taken the approach adopted in this paper. there are many caveats to any modeling attempt with data this heterogenous and complex. first of all, any model is only as good as the data incorporated and unreported sars-cov-2 deaths would impact the validity of our results. secondly, if the initial deaths occur in specific locations such as old-age care centers, our modeling will over-estimate the death rate. a safeguard against this is that the diversity of locations we used may compensate to a limited extent. detailed analysis of network structure from contact tracing can be used to correct for this if such data is available, and our posterior model probabilities could guide such refinement. thirdly, while we ensured that our results did not depend on our model ranges as far as practicable, we cannot guarantee that a model with parameters outside our ranges could not be a more accurate model. the transparency of our analysis and the simplicity of our assumptions may be helpful in this regard. all code is available 23 an seir infectious disease model with testing and conditional quarantine the lancet infectious diseases the lancet infectious diseases the lancet infectious diseases proceedings of the 7th python in science conference 2015 winter simulation conference (wsc) agent-based modeling and network dynamics infectious disease modeling charting the next pandemic: modeling infectious disease spreading in the data science age we are grateful to arthur sherman for helpful comments and questions and to carson chow for prepublication access to his group's work [6] . 
this work was supported by the key: cord-010758-ggoyd531 authors: valdano, eugenio; fiorentin, michele re; poletto, chiara; colizza, vittoria title: epidemic threshold in continuous-time evolving networks date: 2018-02-06 journal: nan doi: 10.1103/physrevlett.120.068302 sha: doc_id: 10758 cord_uid: ggoyd531 current understanding of the critical outbreak condition on temporal networks relies on approximations (time scale separation, discretization) that may bias the results. we propose a theoretical framework to compute the epidemic threshold in continuous time through the infection propagator approach. we introduce the weak commutation condition allowing the interpretation of annealed networks, activity-driven networks, and time scale separation into one formalism. our work provides a coherent connection between discrete and continuous time representations applicable to realistic scenarios. contagion processes, such as the spread of diseases, information, or innovations [1] [2] [3] [4] [5] , share a common theoretical framework coupling the underlying population contact structure with contagion features to provide an understanding of the resulting spectrum of emerging collective behaviors [6] . a common keystone property is the presence of a threshold behavior defining the transition between a macroscopic-level spreading regime and one characterized by a null or negligibly small contagion of individuals. known as the epidemic threshold in the realm of infectious disease dynamics [1] , the concept is analogous to the phase transition in nonequilibrium physical systems [7, 8] , and is also central in social contagion processes [5, [9] [10] [11] [12] [13] . a vast array of theoretical results characterize the epidemic threshold [14] , mainly under the limiting assumptions of quenched and annealed networks [4, [15] [16] [17] [18] , i.e., when the time scale of the network evolution is much slower or much faster, respectively, than the dynamical process. the recent availability of data on time-resolved contacts of epidemic relevance [19] has, however, challenged the time scale separation, showing it may introduce important biases in the description of the epidemic spread [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] and in the characterization of the transition behavior [31, [34] [35] [36] [37] . departing from traditional approximations, few novel approaches are now available that derive the epidemic threshold constrained to specific contexts of generative models of temporal networks [22, 32, 35, [38] [39] [40] [41] or considering generic discrete-time evolving contact patterns [42] [43] [44] . in particular, the recently introduced infection propagator approach [43, 44] is based on a matrix encoding the probabilities of transmission of the infective agent along time-respecting paths in the network. its spectrum allows the computation of the epidemic threshold at any given time scale and for an arbitrary discrete-time temporal network. leveraging an original mapping of the temporal network and epidemic spread in terms of a multilayer structure, the approach is valid in the discrete representation only, similarly to previous methods [17, 18, 35] . meanwhile, a large interest in the study of continuously evolving temporal networks has developed, introducing novel representations [19, 20, 27, 45] and proposing optimal discretization schemes [44, 46, 47] that may, however, be inaccurate close to the critical conditions [48] . 
most importantly, the two representations, continuous and discrete, of a temporal network remain disjointed in current network epidemiology. a discrete-time evolving network is indeed a multilayer object interpretable as a tensor in a linear algebraic representation [49]. this is clearly no longer applicable when time is continuous, as it cannot be expressed in the form of successive layers. hence, a coherent theoretical framework to bridge the gap between the two representations is still missing. in this letter, we address this issue by analytically deriving the infection propagator in continuous time. formally, we show that the dichotomy between discrete time and continuous time translates into the separation between a linear algebraic approach and a differential one, and that the latter can be derived as the structural limit of the former. our approach yields a solution for the threshold of epidemics spreading on generic continuously evolving networks, and a closed form under a specific condition that is then validated through numerical simulations. in addition, the proposed novel perspective allows us to cast an important set of network classes into one single rigorous and comprehensive mathematical definition, including annealed [4, 50, 51] and activity-driven [35, 52] networks, widely used in both methodological and applied research. let us consider a susceptible-infected-susceptible (sis) epidemic model unfolding on a continuously evolving temporal network of n nodes. the sis model constitutes a basic paradigm for the description of epidemics with reinfection [1]. infectious individuals (i) can propagate the contagion to susceptible neighbors (s) with rate λ, and recover to the s state with rate μ. the temporal network is described by the adjacency matrix a(t), with t ∈ [0, t]. we consider a discretized version of the system by sampling a(t) at discrete time steps of length δt (fig. 1). this yields a finite sequence of adjacency matrices {a_1, a_2, ..., a_{t_step}}, where t_step = ⌊t/δt⌋ and a_h = a(h δt). the sequence approximates the original continuous-time network with increasing accuracy as δt decreases. we describe the sis dynamics on this discrete sequence of static networks as a discrete-time markov chain [17, 18], where p_{h,i} is the probability that a node i is in the infectious state at time step h, and μδt (λδt) is the probability that a node recovers (transmits the infection) during a time step δt, for sufficiently small δt. by mapping the system into a multilayer structure encoding both network evolution and diffusion dynamics, the infection propagator approach derives the epidemic threshold as the solution of the equation ρ[p(t_step)] = 1 [43, 44], where ρ is the spectral radius of the infection propagator matrix p(t_step). the generic element p_ij(t_step) represents the probability that the infection can propagate from node i at time step 1 to node j at time step t_step, when λ is close to λ_c and within the quenched mean-field approximation (locally treelike network [53]). for this reason, p is denoted as the infection propagator. to compute the continuous-time limit of the infection propagator, we observe that p obeys the recursive relation p(h + 1) = p(h)[1 − μδt + λδt a_{h+1}]. expressed in continuous time and dividing both sides by δt, the relation becomes, in the limit δt → 0, a system of n^2 coupled differential equations [eq. (4)]. the lhs of eq. (4) is the derivative of p, which is well behaved if all entries are continuous functions of time.
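a small numerical sketch of the discrete-time construction above. assuming the identity initial condition, the quoted recursion p(h + 1) = p(h)[1 − μδt + λδt a_{h+1}] makes the propagator the ordered product of the per-step factors, and the threshold is the value of λ at which its spectral radius crosses 1. the snapshot sequence is a toy random network; this illustrates the approach and is not the authors' code.

```python
# sketch of the discrete-time infection propagator described above: ordered product of the
# per-step factors, with the threshold located by bisection on lambda where rho crosses 1.
import numpy as np
import networkx as nx

def propagator(snapshots, lam, mu, dt):
    n = snapshots[0].shape[0]
    p = np.eye(n)
    for a in snapshots:
        p = p @ ((1.0 - mu * dt) * np.eye(n) + lam * dt * a)
    return p

def spectral_radius(m):
    return max(abs(np.linalg.eigvals(m)))

# toy temporal network: a sequence of independent g(n, p) snapshots
rng = np.random.default_rng(0)
n, t_steps, dt, mu = 100, 50, 1.0, 0.1
snapshots = [nx.to_numpy_array(nx.gnp_random_graph(n, 0.05, seed=int(s)))
             for s in rng.integers(0, 10_000, size=t_steps)]

# bisection on lambda for the critical point rho[p] = 1
lo, hi = 1e-4, 1.0
for _ in range(40):
    mid = 0.5 * (lo + hi)
    if spectral_radius(propagator(snapshots, mid, mu, dt)) > 1.0:
        hi = mid
    else:
        lo = mid
print("estimated critical transmission rate:", 0.5 * (lo + hi))
```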
a_ij(t) are, however, often binary, so that their evolution is a sequence of discontinuous steps. to overcome this, it is possible to approximate these steps with one-parameter families of continuous functions, compute the threshold, and then perform the limit of the parameter that recovers the discontinuity. more formally, this is equivalent to interpreting derivatives in the sense of tempered distributions [54]. in order to check that our limit process correctly connects the discrete-time framework to the continuous-time one, let us now consider the standard markov chain formulation of the continuous dynamics. performing a linear stability analysis of the disease-free state [i.e., around p_i(t) = 0] in the quenched mean-field approximation [17, 18], we obtain eq. (7), and we note that this expression is formally equivalent to eq. (5). in particular, each row of the matrix p_ij of eq. (5) satisfies eq. (7). furthermore, the initial condition p_ij(0) = δ_ij guarantees that, in varying the row i, we consider all vectors of the space basis as initial condition. every solution p(t) of eq. (7) can therefore be expressed as a linear combination of the rows of p(t). any fundamental matrix solution of eq. (7) obeys eq. (5) within the framework of the floquet theory of nonautonomous linear systems [55]. the equivalence of the two equations shows that our limit of the discrete-time propagator encodes the dynamics of the continuous process. it is important to note that the limit process leading to eq. (4) entails a fundamental change of paradigm in the representation of the network structure and contagion process, where the linear algebraic representation suitable in discrete time turns into a differential geometrical description of the continuous-time flow. while network and spreading dynamics in discrete time are encoded in a multilayer adjacency tensor, the continuous-time description proposed in eq. (5) rests on a representation of the dynamical process in terms of a manifold whose points are adjacency matrices (or a rank-2 tensor in the sense of ref. [49]) corresponding to possible network and contagion states. the dynamics of eq. (5) is then a curve on such a manifold, indicating which adjacency matrices to visit and in which order. in practice, we recover that the contagion process on a discrete temporal network, corresponding to an ordered subset of the full multilayer structure of ref. [49], becomes in the limit δt → 0 a spreading on a continuous temporal network represented through a one-dimensional ordered subset of a tensor field (formally the pullback on the evolution curve). the two frameworks, so far considered independently and mutually exclusive, thus merge coherently through a smooth transition in this novel perspective. we now turn to solving eq. (4) to derive an analytic expression of the infection propagator. by defining the rescaled transmissibility γ = λ/μ, we can solve eq. (4) in terms of a series in μ [56], with p^(0) = 1 and under the assumption that γ remains finite around the epidemic threshold for varying recovery rates. the recursion relation from which we derived eq. (4) provides the full propagator at the final time t. equation (8) computed at the final time therefore yields the infection propagator for the continuous-time adjacency matrix a(t), and is defined by the sum of the terms given in eq. (9). equations (8) and (9) can be put in a compact form by using dyson's time-ordering operator t [57].
it is defined as t a(t_1)a(t_2) = a(t_1)a(t_2)θ(t_1 − t_2) + a(t_2)a(t_1)θ(t_2 − t_1), with θ being heaviside's step function. the expression of the propagator is thus given by eq. (10), which represents an explicit general solution of eq. (4) that can be computed numerically to arbitrary precision [56]. the epidemic threshold in the continuous-time limit is then given by ρ[p(t)] = 1. we now discuss a special case where we can recover a closed-form solution of eq. (10), and thus of the epidemic threshold. we consider continuously evolving temporal networks satisfying the following condition (weak commutation): [a(t), ∫_0^t dx a(x)] = 0, i.e., the adjacency matrix at a certain time a(t) commutes with the aggregated matrix up to that time. in the introduced tensor field formalism, the weak commutation condition represents a constraint on the temporal trajectory, or equivalently, an equation of motion for a(t). equation (11) implies that the order of factors in eq. (9) no longer matters. hence, we can simply remove the time-ordering operator t in eq. (10), yielding eq. (12), where ⟨a⟩ = (1/t) ∫_0^t dt a(t) is the adjacency matrix averaged over time. the resulting expression for the epidemic threshold for weakly commuting networks is then given by eq. (13). this closed-form solution proves to be extremely useful as a wide range of network classes satisfies the weak commutation condition of eq. (11). an important class is constituted by annealed networks [4, 50, 51]. in the absence of dynamical correlations, the annealed regime leads to ⟨[a(x), a(y)]⟩ = 0, as the time ordering of contacts becomes irrelevant. equation (11) can thus be reinterpreted as ⟨[a(t), a(x)]⟩_x = 0, where the average is carried out over x ∈ [0, t). for long enough t, (1/t) ∫_0^t dx a(x) approximates well the expected adjacency matrix ⟨a⟩ of the annealed model, leading the annealed regime to satisfy eq. (13). this result thus provides an alternative mathematical framework for the conceptual interpretation of annealed networks in terms of weak commutation. originally introduced to describe disorder on quenched networks [58, 59], annealed networks were mathematically described in probabilistic terms, with the probability of establishing a contact depending on the degree distribution p(k) and the two-node degree correlations p(k′|k) [50]. here we show that temporal networks whose adjacency matrix a(t) asymptotically commutes with the expected adjacency matrix are found to be in the annealed regime. equation (13) can also be used to test the limits of the time scale separation approach, by considering a generic temporal network not satisfying the weak commutation condition. if μ is small, we can truncate the series of the infection propagator [eq. (8)] at the first order, p = 1 + μ p^(1) + o(μ^2), where p^(1)(t) = t[γ⟨a⟩ − 1], to recover indeed eq. (13). the truncation thus provides a mathematical expression of the range of validity of the time-scale-separation scheme for spreading processes on temporal networks, since temporal correlations can be disregarded when the network evolves much faster than the spreading process. extending the result for annealed networks, we show that the weak commutation condition also holds for networks whose expected adjacency matrix depends on time through a scalar function (instead of being constant as in the annealed case), ⟨a(t)⟩ = c(t)⟨a(0)⟩. also in this case we have ⟨[a(x), a(y)]⟩ = 0, so that the same treatment performed for annealed networks applies.
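the following sketch computes the time-averaged adjacency matrix ⟨a⟩ from a toy snapshot sequence and reads off the threshold from the weakly commuting closed form. removing the time ordering turns the propagator into an exponential of t(λ⟨a⟩ − μ), so ρ[p(t)] = 1 gives λ_c = μ/ρ[⟨a⟩]; this explicit form of the elided eqs. (12) and (13) is inferred from the surrounding text rather than quoted from it. a normalized commutator is also reported as a rough diagnostic of weak commutation.

```python
# sketch of the weakly commuting closed form discussed above. the threshold estimate
# lambda_c = mu / rho[<a>] is an inferred reading of the elided equations; the snapshots
# are an annealed-like toy sequence, and the commutator vanishes exactly only for
# strictly weakly commuting networks.
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
n, t_steps, mu = 100, 200, 0.1
snapshots = [nx.to_numpy_array(nx.gnp_random_graph(n, 0.05, seed=int(s)))
             for s in rng.integers(0, 10_000, size=t_steps)]

a_avg = sum(snapshots) / t_steps
rho_avg = float(max(abs(np.linalg.eigvals(a_avg))))
print("rho[<A>] =", round(rho_avg, 3), " estimated threshold lambda_c =", round(mu / rho_avg, 5))

# rough diagnostic: normalized commutator of each snapshot with the aggregated matrix so far
cumulative = np.zeros((n, n))
worst = 0.0
for a in snapshots:
    cumulative += a
    rel = np.linalg.norm(a @ cumulative - cumulative @ a) / (np.linalg.norm(a) * np.linalg.norm(cumulative))
    worst = max(worst, rel)
print("largest normalized commutator along the sequence:", round(worst, 3))
```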
examples are provided by global trends in activation patterns, as often considered in infectious disease epidemiology to model seasonal variations of human contact patterns (e.g., due to the school calendar) [60]. when the time scale separation approach is not applicable, we find another class of weakly commuting temporal networks that is used as a paradigmatic network example for the study of contagion processes occurring on the same time scale as the contact evolution: the activity-driven model [35]. it considers heterogeneous populations where each node i activates according to an activity rate a_i, drawn from a distribution f(a). when active, the node establishes m connections with randomly chosen nodes lasting a short time δ (δ ≪ 1/a_i). since the dynamics lacks time correlations, the weak commutation condition holds, and the epidemic threshold can be computed from eq. (13). in the limit of large network size, it is possible to write the average adjacency matrix as ⟨a⟩_ij = (mδ/n)(a_i + a_j) + o(1/n^2). through row operations we find that the matrix has rank(⟨a⟩) = 2, and thus only two nonzero eigenvalues, α, σ, with α > σ. we compute them through the traces of ⟨a⟩ (tr[⟨a⟩] = α + σ and tr[⟨a⟩^2] = α^2 + σ^2) to obtain the expression of ρ[⟨a⟩] for eq. (13), given in eq. (14). the epidemic threshold then becomes the expression in eq. (15), yielding the same result as ref. [35], provided here that the transmission rate λ is multiplied by δ to make it a probability, as in ref. [35]. finally, we verify that for the trivial example of static networks, with an adjacency matrix constant in time, eq. (13) reduces immediately to the result of refs. [17, 18]. we now validate our analytical prediction against numerical simulations on two synthetic models. the first is the activity-driven model with activation rate a_i = a, m = 1, and average interactivation time τ = 1/a = 1, fixed as the time unit of the simulations. the transmission parameter is the probability upon contact λδ, and the model is implemented in continuous time. the second model is based on a bursty interactivation time distribution p(δt) ∼ (ϵ + δt)^(−β) [31], with β = 2.5 and ϵ tuned to obtain the same average interactivation time as before, τ = 1. we simulate a sis spreading process on the two networks with four different recovery rates, μ ∈ {10^(−3), 10^(−2), 10^(−1), 1}, i.e., ranging from a value that is 3 orders of magnitude larger than the time scale τ of the networks (slow disease), to a value equal to τ (fast disease). we compute the average simulated endemic prevalence for specific values of λ, μ using the quasistationary method [61] and compare the threshold computed with eq. (13) with the simulated critical transition from extinction to the endemic state. as expected, we find eq. (13) to hold for the activity-driven model at all time scales of the epidemic process (fig. 2), as the network lacks temporal correlations. the agreement with the transition observed in the bursty model, however, is recovered only for slow diseases, as at those time scales the network is found in the annealed regime. when network and disease time scales become comparable, the weakly commuting approximation of eq. (13) no longer holds, as burstiness results in dynamical correlations in the network evolution [31]. our theory offers a novel mathematical framework that rigorously connects discrete-time and continuous-time critical behaviors of spreading processes on temporal networks.
it uncovers a coherent transition from an adjacency tensor to a tensor field resulting from a limit performed on the structural representation of the network and contagion process. we derive an analytic expression of the infection propagator in the general case that assumes a closed-form solution in the introduced class of weakly commuting networks. this allows us to provide a rigorous mathematical interpretation of annealed networks, encompassing the different definitions historically introduced in the literature. this work also provides the basis for important theoretical extensions, assessing, for example, the impact of bursty activation patterns or of the adaptive dynamics in response to the circulating epidemic. finally, our approach offers a tool for applicative studies on the estimation of the vulnerability of temporal networks to contagion processes in many real-world scenarios, for which the discrete-time assumption would be inadequate. we thank luca ferreri and mason porter for fruitful discussions. this work is partially sponsored by the ec-health contract no. 278433 (predemics) and the anr contract no. anr-12-monu-0018 (harmsflu) to v. c., and the ec-anihwa contract no. anr-13-anwa-0007-03 (liveepi) to e. v., c. p., and v. c. * present address: department d'enginyeria informàtica i matemàtiques modeling infectious diseases in humans and animals generalization of epidemic theory: an application to the transmission of ideas epidemics and rumours epidemic spreading in scale-free networks a simple model of global cascades on random networks modelling dynamical processes in complex socio-technical systems contact interactions on a lattice on the critical behavior of the general epidemic process and dynamical percolation cascade dynamics of complex propagation propagation and immunization of infection on general networks with both homogeneous and heterogeneous components dynamics of rumor spreading in complex networks kinetics of social contagion critical behaviors in contagion dynamics epidemic processes in complex networks resilience of the internet to random breakdowns spread of epidemic disease on networks epidemic spreading in real networks: an eigenvalue viewpoint discrete time markov chain approach to contact-based disease spreading in complex networks modern temporal network theory: a colloquium impact of non-poissonian activity patterns on spreading processes disease dynamics over very different time-scales: foot-and-mouth disease and scrapie on the network of livestock movements in the uk epidemic thresholds in dynamic contact networks how disease models in static networks can fail to approximate disease in dynamic networks representing the uk's cattle herd as static and dynamic networks impact of human activity patterns on the dynamics of information diffusion small but slow world: how network topology and burstiness slow down spreading dynamical strength of social ties in information spreading high-resolution measurements of face-to-face contact patterns in a primary school dynamical patterns of cattle trade movements multiscale analysis of spreading in a large communication network bursts of vertex activation and epidemics in evolving networks interplay of network dynamics and heterogeneity of ties on spreading dynamics predicting and controlling infectious disease epidemics using temporal networks, f1000prime rep the dynamic nature of contact networks in infectious disease epidemiology activity driven modeling of time varying networks temporal percolation in 
activity-driven networks contrasting effects of strong ties on sir and sis processes in temporal networks monogamous networks and the spread of sexually transmitted diseases epidemic dynamics on an adaptive network effect of social group dynamics on contagion epidemic threshold and control in a dynamic network virus propagation on time-varying networks: theory and immunization algorithms analytical computation of the epidemic threshold on temporal networks infection propagator approach to compute epidemic thresholds on temporal networks: impact of immunity and of limited temporal resolution machine learning: ecml effects of time window size and placement on the structure of an aggregated communication network epidemiologically optimal static networks from temporal network data limitations of discrete-time approaches to continuous-time contagion dynamics mathematical formulation of multilayer networks langevin approach for the dynamics of the contact process on annealed scale-free networks thresholds for epidemic spreading in networks controlling contagion processes in activity driven networks beyond the locally treelike approximation for percolation on real networks a course of modern analysis some results in floquet theory, with application to periodic epidemic models the magnus expansion and some of its applications the radiation theories of tomonaga, schwinger, and feynman optimal disorder for segregation in annealed small worlds diffusion in scale-free networks with annealed disorder recurrent outbreaks of measles, chickenpox and mumps: i. seasonal variation in contact rates epidemic thresholds of the susceptible-infected-susceptible model on networks: a comparison of numerical and theoretical results key: cord-007415-d57zqixs authors: da fontoura costa, luciano; sporns, olaf; antiqueira, lucas; das graças volpe nunes, maria; oliveira, osvaldo n. title: correlations between structure and random walk dynamics in directed complex networks date: 2007-07-30 journal: appl phys lett doi: 10.1063/1.2766683 sha: doc_id: 7415 cord_uid: d57zqixs in this letter the authors discuss the relationship between structure and random walk dynamics in directed complex networks, with an emphasis on identifying whether a topological hub is also a dynamical hub. they establish the necessary conditions for networks to be topologically and dynamically fully correlated (e.g., word adjacency and airport networks), and show that in this case zipf's law is a consequence of the match between structure and dynamics. they also show that real-world neuronal networks and the world wide web are not fully correlated, implying that their more intensely connected nodes are not necessarily highly active.
we address the relationship between structure and dynamics in complex networks by taking the steady-state distribution of the frequency of visits to nodes-a dynamical feature-obtained by performing random walks 1 along the networks. a complex network 2-5 is taken as a graph with directed edges and associated weights, which are represented in terms of the weight matrix w. the n nodes in the network are numbered as i = 1, 2, ..., n, and a directed edge with weight m, extending from node j to node i, is represented as w(i, j) = m. no self-connections (loops) are considered. the in and out strengths of a node i, abbreviated as is(i) and os(i), correspond to the sum of the weights of its in- and outbound connections, respectively. the stochastic matrix s for such a network is given by s(i, j) = w(i, j)/os(j). the matrix s is assumed to be irreducible; i.e., any of its nodes can be accessible from any other node, which allows the definition of a unique and stable steady state. an agent, placed at any initial node j, chooses among the adjacent outbound edges of node j with probability equal to s(i, j). this step is repeated a large number of times t, and the frequency of visits to each node i is calculated as v(i) = (number of visits during the walk)/t. in the steady state (i.e., after a long time period t), v = sv and the frequency of visits to each node along the random walk may be calculated in terms of the eigenvector associated with the unit eigenvalue (e.g., ref. 6). for proper statistical normalization we set Σ_p v(p) = 1. the dominant eigenvector of the stochastic matrix has theoretically and experimentally been verified to be remarkably similar to the corresponding eigenvector of the weight matrix, implying that the adopted random walk model shares several features with other types of dynamics, including linear and nonlinear summations of activations and flow in networks. in addition to providing a modeling approach intrinsically compatible with dynamics involving successive visits to nodes by a single or multiple agents, such as is the case with world wide web (www) navigation, text writing, and transportation systems, random walks are directly related to diffusion. more specifically, as time progresses, the frequency of visits to each network node approaches the activity values which would be obtained by the traditional diffusion equation. a full congruence between such frequencies and activity diffusion is obtained at the equilibrium state of the random walk process. therefore, random walks are also directly related to the important phenomenon of diffusion, which plays an important role in a large number of linear and nonlinear dynamic systems including disease spreading and pattern formation. random walks are also intrinsically connected to markov chains, electrical circuits, and flows in networks, and even dynamical models such as ising. for such reasons, random walks have become one of the most important and general models of dynamics in physics and other areas, constituting a primary choice for investigating dynamics in complex networks. the correlations between activity (the frequency of visits to nodes v) and topology (out strength os or in strength is) can be quantified in terms of the pearson correlation coefficient r. for full activity-topology correlation in directed networks, i.e., |r| = 1 between v and os or between v and is, it is enough that (i) the network must be strongly connected, i.e., s is irreducible, and (ii) for any node, the in strength must be equal to the out strength.
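the steady-state calculation just described is easy to reproduce; the sketch below builds s from a small made-up weight matrix, extracts the eigenvector associated with the unit eigenvalue and normalizes it so that the visit frequencies sum to one. the toy matrix is an invented example, not data from the letter.

```python
# Sketch of the steady-state visit frequency described above: v = S v, where
# S(i, j) = W(i, j) / OS(j) and OS(j) is the out strength of node j.
# The small weight matrix below is a made-up toy example.
import numpy as np

W = np.array([[0, 2, 1],
              [1, 0, 3],
              [2, 1, 0]], dtype=float)   # W[i, j]: weight of the edge j -> i

out_strength = W.sum(axis=0)              # OS(j) = sum_i W(i, j)
in_strength = W.sum(axis=1)               # IS(i) = sum_j W(i, j)
S = W / out_strength                      # column-stochastic matrix

# frequency of visits: eigenvector of S associated with the unit eigenvalue
eigvals, eigvecs = np.linalg.eig(S)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v = v / v.sum()                           # normalization: sum_p v(p) = 1

print("steady-state visit frequencies:", np.round(v, 3))
print("in strengths :", in_strength)
print("out strengths:", out_strength)
```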
the proof of the statement above is as follows. because the network is strongly connected, its stochastic matrix s has a unit eigenvector in the steady state, i.e., v = sv. since s(i, j) = w(i, j)/os(j), the ith element of the vector sos is given as (s os)(i) = Σ_j s(i, j) os(j) = Σ_j w(i, j) = is(i). by hypothesis, is(i) = os(i) for any i and, therefore, both os and is are eigenvectors of s associated with the unit eigenvalue. then os = is = v, implying full correlation between frequency of visits and both in and out strengths. an implication of this derivation is that for perfectly correlated networks, the frequency of symbols produced by random walks will be equal to the out strength or in strength distributions. therefore, an out strength scale-free 3 network must produce sequences obeying zipf's law 7 and vice versa. if, on the other hand, the node distribution is gaussian, the frequency of visits to nodes will also be a gaussian function; that is to say, the distribution of nodes is replicated in the node activation. although the correlation between node strength and random walk dynamics in undirected networks has been established before 8 (including full correlation 9,10), the findings reported here are more general since they are related to any directed weighted network, such as the www and the airport network. indeed, the correlation conditions for undirected networks can be understood as a particular case of the conditions above. a fully correlated network will have |r| = 1. we obtained r = 1 for texts by darwin 11 and wodehouse 12 and for the network of airports in the usa 13 . the word association network was obtained by representing each distinct word as a node, while the edges were established by the sequence of immediately adjacent words in the text after the removal of stopwords 14 and lemmatization 15 . more specifically, the fact that word u has been followed by word v, m times during the text, is represented as w(v, u) = m. zipf's law is known to apply to this type of network 16 . the airport network presents a link between two airports if there exists at least one flight between them. the number of flights performed in one month was used as the strength of the edges. we obtained r for various real networks (table i), including the fully correlated networks mentioned above. to interpret these data, we recall that a small r means that a hub (large in or out strength) in topology is not necessarily a center of activity. notably, in all cases considered r is greater for the in strength than for the out strength. this may be understood with a trivial example of a node from which a high number of links emerge (implying large out strength) but which has only very few inbound links. this node, in a random walk model, will be rarely occupied and thus cannot be a center of activity, though it will strongly affect the rest of the network by sending activation to many other targets. understanding why a hub in terms of in strength may fail to be very active is more subtle. consider a central node receiving links from many other nodes arranged in a circle, i.e., the central node has a large in strength but with the surrounding nodes possessing small in strength. in other words, if a node i receives several links from nodes with low activity, this node i will likewise be fairly inactive. in order to further analyze the latter case, we may examine the correlations between the frequency of visits to each node i and the cumulative hierarchical in and out strengths of that node.
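the condition derived above is easy to verify numerically: a directed weighted network whose in and out strengths coincide at every node yields |r| = 1 between visit frequency and strength, while breaking the balance lowers the correlation. the matrices below are synthetic toy examples only; a symmetric weight matrix is used as a convenient way of enforcing is(i) = os(i).

```python
# Numerical check of the statement proved above: when every node's in strength
# equals its out strength, the visit frequency correlates perfectly with both.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

def visit_frequency(W):
    S = W / W.sum(axis=0)                      # S(i, j) = W(i, j) / OS(j)
    eigvals, eigvecs = np.linalg.eig(S)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

n = 200
W_balanced = rng.integers(0, 4, size=(n, n)).astype(float)
W_balanced = W_balanced + W_balanced.T          # symmetric => IS(i) = OS(i)
np.fill_diagonal(W_balanced, 0.0)

W_directed = W_balanced * rng.random((n, n))    # breaks the balance

for name, W in [("balanced", W_balanced), ("unbalanced", W_directed)]:
    v = visit_frequency(W)
    r_in, _ = pearsonr(v, W.sum(axis=1))        # IS(i) = sum_j W(i, j)
    r_out, _ = pearsonr(v, W.sum(axis=0))       # OS(j) = sum_i W(i, j)
    print(f"{name:10s}  r(v, IS) = {r_in:.3f}   r(v, OS) = {r_out:.3f}")
```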
the hierarchical degree 17-19 of a network node provides a natural extension of the traditional concept of node degree. table i. number of nodes (no. nodes), number of edges (no. edges), means and standard deviations of the clustering coefficient (cc), cumulative hierarchical in strengths for levels 1-4 (is1-is4), cumulative hierarchical out strengths for levels 1-4 (os1-os4), and the pearson correlation coefficients between the activation and all cumulative hierarchical in strengths and out strengths (r is1 - r os4) for the complex networks considered in the present work. for the least correlated network analyzed, viz., that of the largest strongly connected cluster in the network of www links in the domain of ref. 21 (massey university, new zealand) (refs. 22 and 23), activity could not be related to in strength at any hierarchical level. because the pearson coefficient corresponds to a single real value, it cannot adequately express the coexistence of the many relationships between activity and degrees present in this specific network as well as possibly heterogeneous topologies. very similar results were obtained for other www networks, which indicate that the reasons why topological hubs have not been highly active cannot be identified at the present moment (see, however, discussion for higher correlated networks below). however, for the two neuronal structures of table i that are not fully correlated (network defined by the interconnectivity between cortical regions of the cat 24 and network of synaptic connections in c. elegans 25), activity was shown to increase with the cumulative first and second hierarchical in strengths. in the cat cortical network, each cortical region is represented as a node, and the interconnections are reflected by the network edges. significantly, in a previous paper, 26 it was shown that when connections between cortex and thalamus were included, the correlation between activity and outdegree increased significantly. this could be interpreted as a result of increased efficiency with the topological hubs becoming highly active. furthermore, for the fully correlated networks, such as word associations obtained for texts by darwin and wodehouse, activity increased basically with the square of the cumulative second hierarchical in strength (see supplementary fig. 2 in ref. 20). in addition, the correlations obtained for these two authors are markedly distinct, as the work of wodehouse is characterized by substantially steeper increase of frequency of visits for large in strength values (see supplementary fig. 3 in ref. 20). therefore, the results considering higher cumulative hierarchical degrees may serve as a feature for authorship identification. in conclusion, we have established (i) a set of conditions for full correlation between topological and dynamical features of directed complex networks and demonstrated that (ii) zipf's law can be naturally derived for fully correlated networks. result (i) is of fundamental importance for studies relating the dynamics and connectivity in networks, with critical practical implications. for instance, it not only demonstrates that hubs of connectivity may not correspond to hubs of activity but also provides a sufficient condition for achieving full correlation. result (ii) is also of fundamental importance as it relates two of the most important concepts in complex systems, namely, zipf's law and scale-free networks.
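the cumulative hierarchical strengths referred to in table i can be illustrated with a rough sketch. the reading used below, which aggregates the in strengths found within ell steps upstream of a node, is an assumption made for illustration only; the precise definition of hierarchical degrees and strengths is the one given in refs. 17-19.

```python
# Illustrative sketch only: one simple reading of "cumulative hierarchical
# in strength", taken here as the total in strength of the nodes reachable
# within ell steps upstream of node i. The exact definition in the letter
# follows refs. 17-19; this reading is an assumption.
import numpy as np
from collections import deque

def upstream_ball(W, i, ell):
    """Nodes reachable from i within ell steps, following edges backwards."""
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == ell:
            continue
        for j in np.nonzero(W[node, :])[0]:       # j -> node is an edge
            if j not in seen:
                seen.add(j)
                frontier.append((j, d + 1))
    return seen

def cumulative_hierarchical_in_strength(W, i, ell):
    ball = upstream_ball(W, i, ell)
    return sum(W[k, :].sum() for k in ball)       # total in strength in the ball

rng = np.random.default_rng(1)
W = (rng.random((30, 30)) < 0.1) * rng.integers(1, 5, (30, 30))
np.fill_diagonal(W, 0)
print([cumulative_hierarchical_in_strength(W, 0, ell) for ell in range(1, 5)])
```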
even though sharing the feature of power law, these two key concepts had been extensively studied on their own. the result reported in this work paves the way for important additional investigations, especially by showing that zipf's law may be a consequence of dynamics taking place in scale-free systems. in the cases where the network is not fully correlated, the pearson coefficient may be used as a characterizing parameter. for a network with very small correlation, such as the www links between the pages in a new zealand domain analyzed here, the reasons for hubs failing to be active could not be identified, probably because of the substantially higher complexity and heterogeneity of this network, including varying levels of clustering coefficients, as compared to the neuronal networks. this work was financially supported by fapesp and cnpq (brazil). luciano da f. costa thanks grants 05/00587-5 (fapesp) and 308231/03-1 (cnpq). 1 markov chains: gibbs fields, monte carlo simulation, and queues (springer); the formation of vegetable mould through the action of worms, with observations on their habits (murray); the pothunters (a & c black); bureau of transportation statistics: airline on-time performance data; modern information retrieval (addison-wesley); the oxford handbook of computational linguistics (oxford); human behaviour and the principle of least effort (addison-wesley). key: cord-024830-cql4t0r5 authors: mcmillin, stephen edward title: quality improvement innovation in a maternal and child health network: negotiating course corrections in mid-implementation date: 2020-05-08 journal: j of pol practice & research doi: 10.1007/s42972-020-00004-z sha: doc_id: 24830 cord_uid: cql4t0r5 this article analyzes mid-implementation course corrections in a quality improvement innovation for a maternal and child health network working in a large midwestern metropolitan area. participating organizations received restrictive funding from this network to screen pregnant women and new mothers for depression, make appropriate referrals, and log screening and referral data into a project-wide data system over a one-year pilot program. this paper asked three research questions: (1) what problems emerged by mid-implementation of this program that required course correction? (2) how were advocacy targets developed to influence network and agency responses to these mid-course problems? (3) what specific course corrections were identified and implemented to get implementation back on track? this ethnographic case study employs qualitative methods including participant observation and interviews. data were analyzed using the analytic method of qualitative description, in which the goal of data analysis is to summarize and report an event using the ordinary, everyday terms for that event and the unique descriptions of those present. three key findings are noted. first, network participants quickly responded to the emerged problem of under-performing screening and referral completion statistics. second, they shifted advocacy targets away from executive appeals and toward the line staff actually providing screening. third, participants endorsed two specific course corrections, using "opt out, not opt in" choice architecture at intake and implementing visual incentives for workers to track progress. opt-out choice architecture and visual incentives served as useful means of focusing organizational collaboration and correcting mid-implementation problems.
this study examines inter-organizational collaboration among human service organizations serving a specific population of pregnant women and mothers at risk for perinatal depression. these organizations received restrictive funding from a local community network to screen this population for risk for depression, make appropriate referrals as indicated, and log screening and referral data into a project-wide data system for a 1-year pilot program. this paper asked three specific research questions: (1) what problems emerged by mid-implementation of the screening and referral program that required course correction? (2) how were advocacy targets developed to influence network and agency responses to these mid-course problems? (3) what specific course corrections were identified and implemented to get implementation back on track? previous scholarship (mcmillin 2017) reported the background of how the maternal and child health organization studied here began as a community committee funded by the state legislature to address substance use by pregnant women and new mothers. ultimately this committee grew into a 501(c)3 nonprofit backbone organization (mcmillin 2017) that increasingly served as a pass-through entity for many grants it administered and dispersed to health and social service agencies who were members and partners of the network and primarily served families with young children. one important grant was shared with six network partner agencies to create a pilot program and data-sharing system for a universal screening and referral protocol for perinatal mood and anxiety disorders. this innovation used a network-wide shared data software system into which staff from all six partner agencies entered their screening results and the referrals they gave to clients. universal screening and referral for perinatal mood and anxiety disorders and cooccurring issues meant that every partner agency would do some kind of screening (virtually always an empirically validated instrument such as the edinburgh postnatal depression scale), and every partner would respond to high screens (scores over the designated clinical cutoff of whatever screening instrument was being used, indicating the presence of or high risk for clinical depression) with a referral to case managers in partner agencies that were also funded by the network. the funded partners faced a very tight timeline that anticipated regular screening and enrollment of an estimated number of clients in case management and depression treatment for every month of the fiscal program year. a slow start in screening and enrolling patients meant that funded partners would likely be in violation of their grant contract with the network while facing a rapidly closing window of time in which they would be able to catch up and provide enough contracted services to meet the contractual numbers for their catchment area, which could jeopardize funding for a second year. this paper covers the 4 months in the middle of the pilot program year when network staff realized that funded partners were seriously behind schedule in the amount of screens and referrals for perinatal mood and depression these agencies were contracted to make at this point in the fiscal year. although challenging and complex for many human service organizations, collaboration with competitors in the form of "co-opetive relationships" has been linked to greater innovation and efficiency (bunger et al. 2017, p. 13) . 
but grant cycle funding can add to this complexity in the form of the "capacity paradox," in which small human service organizations working with specific populations face funding restrictions because they are framed as too small or lacking capacity for larger scale grants and initiatives (terrana and wells 2018, p. 109) . finally, once new initiatives are implemented in a funded cycle, human service organizations are increasingly expected to engage in extensive, timely, and often very specific data collection to generate evidence of effectiveness for a particular program (benjamin et al. 2018) . mid-course corrections during implementation of prevention programs targeted to families with young children have long been seen as important ways to refine and modify the roles of program staff working with these populations and add formal and informal supports to ongoing implementation and service delivery prior to final evaluation (lynch et al. 1998; wandersman et al. 1998) . mid-course corrections can help implementers in current interventions or programs adopt a more facilitative and individualized approach to participants that can improve implementation fidelity and cohesion (lynch et al. 1998; sobeck et al. 2006) . comprehensive reviews of implementation of programs for families with young children have consistently found that well-measured implementation improves program outcomes significantly, especially when dose or service duration is also assessed (durlak and dupre 2008; fixsen et al. 2005) . numerous studies have emphasized capturing implementation data at a low enough level to be able to use it to improve service data quickly and hit the right balance of implementation fidelity and thoughtful, intentional implementation adaptation (durlak and dupre 2008; schoenwald et al. 2010; schoenwald et al. 2013; schoenwald and hoagwood 2001; tucker et al. 2006 ). inter-organizational networks serving families with young children face special challenges in making mid-course corrections while maintaining implementation fidelity across member organizations (aarons et al. 2011; hanf and o'toole 1992) . implementation through inter-organizational networks is never merely a result of clear success or clear failure; rather, it is an ongoing assessment of how organizational actors are cooperating or not across organizational boundaries (cline 2000) . frambach and schillewaert (2002) echo this point by noting that intra-organizational and individual cooperation, consistency, and variance also have strong effects on the eventual level of implementation cohesion and fidelity that a given project is able to reach. moreover, recent research suggests that while funders and networks may emphasize and prefer inter-organizational collaboration, individual agency managers in collaborating organizations may see risks and consequences of collaboration and may face dilemmas in complying with network or funder expectations (bunger et al. 2017) . similar organizations providing similar services with overlapping client bases may fear opportunism or poaching from collaborators, and interpersonal trust as well as contracts or memoranda of understanding might be needed to assuage these concerns (bunger 2013; bunger et al. 2017) . even successful collaboration may expedite mergers between collaborating organizations that are undesired or disruptive to stakeholders and sectors (bunger 2013) . 
while funders may often prefer to fund larger and more comprehensive merged organizations, smaller specialized community organizations founded by and for marginalized populations may struggle to maintain their community connection and focus as subordinate components of larger firms (bunger 2013) . organizational policy practice and advocacy for mid-course corrections in a pilot program likely looks different from the type of advocacy and persuasion efforts that might seek to gain buy-in for initial implementation of the program. fischhoff (1989) notes that in the public health workforce, individual workers rarely know how to organize their work to anticipate the possibility or likelihood of mid-course corrections because most work is habituated and routinized to the point that it is rarely intentionally changed, and when it is changed, it is due to larger issues on which workers expect extensive further guidance. when a need for even a relatively minor mid-course correction is identified, it can result in everyone concerned "looking bad," from the workers adapting their implementation to the organizations requesting the changes (fischhoff 1989, p. 112) . there is also some evidence that health workers have surprisingly stable, pre-existing beliefs about their work situations and experiences, and requests to make mid-course corrections in work situations may have to contend with workers' pre-existing, stable beliefs about the program they are implementing no matter how well-reasoned the proposed course corrections are (harris and daniels 2005) . given a new emphasis in social work that organizational policy advocacy should be re-conceptualized as part of everyday organizational practice (mosley 2013) , a special focus on strategies that contribute to the success of professional networks and organizations that can leverage influence beyond that of a single agency becomes increasingly important. given the above problems with inter-organizational collaboration, increased attention has turned to automated methods of implementation that reduce burden on practitioners without unduly reducing freedom of choice and action. behavioral economics and behavioral science approaches have been suggested as ways to assist direct practitioners to follow policies and procedures that they are unlikely to intend to violate. evidence suggests that behavior in many contexts is easy to predict with high accuracy, and behavioral economics seeks to alter people's behavior in predictable, honest, and ethical ways without forbidding any options or significantly adding too-costly incentives, so that the best or healthiest choice is the easiest choice (thaler and sunstein 2008) . following mosley's (2013) recommendation, this paper examines in detail how a heavily advocated quality improvement pilot program for a maternal and child health network working in a large midwestern metropolitan area attempted to make mid-implementation course corrections for a universal screening and referral program for perinatal mood and anxiety disorders conducted by its member agencies. this paper answers the call of recent policy practice and advocacy research to examine how "openness, networking, sharing of tasks," and building and maintaining positive relationships are operative within organizational practice across multiple organizations (ruggiano et al. 2015, p. 227 ). 
additionally, this paper focuses on extending recent research to understand how mandated screening for perinatal mood and anxiety disorders can be implemented well (yawn et al. 2015) . this study used an ethnographic case study method because treating the network and this pilot program as a case study makes it possible to examine unique local data while also locating and investigating counter-examples to what was expected locally (stake 1995). this method makes it possible to inform and modify grand generalizations about the case before such generalizations become widely accepted (stake 1995). this study also used ethnographic methods such participant observation and informal, unstructured interview conversations at regularly scheduled meetings. adding these ethnographic approaches to a case study which is tightly time-limited can help answer research questions fully and efficiently (fusch et al. 2017 ). data were collected at regular network meetings, which are 2-3 h long and held twice a month. one meeting is a large group of about 30 participants who supervise or perform screening and case management for perinatal mood and anxiety disorders as well as local practitioners in quality improvement and workforce training and development. a second executive meeting was held with 8-12 participants, typically network staff and the two co-chairs of each of organized three groups, a screening and referral group, a workforce training group, and a quality improvement group, to debrief and discuss major issues reported at the large group meeting. for this study, the author served as a consultant to the quality improvement group and attended and took notes on network meetings in the middle of the program year (november through february) to investigate how mid-course corrections in the middle of the contract year were unfolding. these network meetings generally used a world café focus group method, in which participants move from a large group to several small groups discussing screening and referral, training, and quality improvement specifically, then moved back to report small group findings to the large group (fouché and light 2011) . the author typed extensive notes on an ipad, and notetaking during small group breakouts could only intermittently capture content due to the roving nature of the world café model. note-taking was typically unobtrusive because virtually all participants in both small and large group meetings took notes on discussion. note-taking and note-sharing were also a frequent and iterative process, in which the author commonly forwarded the notes taken at each meeting to network staff and group participants after each meeting to gain their insights and help construct the agenda of the next meeting. by the middle of the program year, network participants had gotten to know each other and the author quite well, so the author was typically able to easily arrange additional conversations for purposes of member checking. these informal meetings supplemented the two regular monthly meetings of the network and allowed for specific follow-up in which participants were asked about specific comments and reactions they had shared at previous meetings. brief powerpoint presentations were also used at the beginning of successive meetings during the program year to summarize announcements and ideas from the last meeting and encourage new discussion. often, powerpoints were used to remind participants of dates, deadlines, statistics, and refined and summarized concepts. 
because so many other meeting participants took their own notes and shared them during meetings, a large amount of feedback on meeting topics and their meaning were able to be obtained. the author then coded the author's meeting notes in an iterative and sequenced process guided by principles of qualitative description, in which the goal of data analysis is to summarize and report an event using the ordinary, everyday terms for that event and the unique descriptions of those present (sandelowski and leeman 2012) . this analytic method was chosen because it is especially useful when interviewing health professionals about a specific topic, in that interpretation stays very close to the data presented while leveraging all of the methodological strengths of qualitative research, such as multiple, iterative coding, member checking, and data triangulation (neergaard et al. 2009 ). in this way, qualitative organizational research remains rigorous, while the significance of findings is easily translated to wider audiences for rapid action in intervention and implementation (sandelowski and leeman 2012) . by the middle of the program year, network meeting participants explicitly recognized that mid-course corrections were needed in the implementation of the new quality improvement and data-sharing program for universal screening and referral of perinatal mood and anxiety disorders. after iterative analysis of shared meeting notes, three key challenges were salient as themes from network meetings in the middle of the program year. regarding the first research question, concerning what problems emerged by midimplementation that required course correction, data showed that the numbers of clients screened and referred were a fraction of what was contractually anticipated by midway through the program year. this problem was two-fold, in that fewer screenings than expected were reported, but also data showed that those clients who screened as at risk for a perinatal mood and anxiety disorder were not consistently being given the referrals to further treatment indicated by the protocol. this was the first time the network had seen "real numbers" from the collected data for the program year that could be compared with estimated and predicted numbers for each part of the program year, both in terms of numbers anticipated to be screened and especially in terms of numbers expected to be referred to the case management services being funded by the network. however, the numbers were starkly disappointing: only about half of those whose screening scores were high enough to trigger referrals were actually offered referrals, and only about 2/3 of those who received referrals actually accepted the referral and followed up for further care. by the middle of the program year, only 16% of expected, estimated referrals had been issued, and no network partner was at the 50% expected. in responding to this data presentation, participants offered several possible patientlevel explanations. first, several noted that patients commonly experience inconsistent providers during perinatal care and may have little incentive to follow up on referrals after such a fragmented experience. one participant noted a patient who had been diagnosed with preeclampsia (a pregnancy complication marked by high blood pressure) by her first provider, but the diagnostician gave no further information beyond stating the diagnosis, and then the numerous other providers this patient saw never mentioned it again. 
this patient moved through care knowing nothing about her diagnosis and with little incentive to accept or follow up with other referrals. other participants noted that the typical approach to discharge planning and care transitions by providers was a poor match for clean, universal screening and referral, and that satisfaction surveys had captured patient concerns about how they were given information, which was typically on paper and presented to the patient as she leaves the hospital or medical office. as one participant noted, "we flood them the day mom leaves the hospital and we're lucky if the paper ever gets out of the back seat of the car." others noted that patients may not follow up on referrals simply because they are feeling overwhelmed with other areas of life or are feeling emotionally better without further treatment. however, while these explanations may shed light on why referred patients did not follow up on or keep referrals, they do nothing to explain why no referral or follow-up was offered for screens that were above the referral cutoff. two further explanations were salient. one explanation centered on the idea that some positive screens were potentially being ignored because staff may be reluctant to engage or feared working with clients in crisis-described as an, "if i don't ask the question, i don't have to deal with it" mindset. all screening tools used numeric scores, so that triggered referrals were not dependent on staff having to decide independently to make a referral, but conveying the difficult news that a client had scored high enough on a depression scale to warrant a follow-up referral may have been daunting to some staff. an alternative explanation suggested that staff were not ignoring positive screens but were not understanding the intricacies and expectations of the screening process. of the community agencies partnering with the network to provide screening, many were also able to provide case management as well, but staff did not realize that internal referrals to a different part of their agency still needed to be documented. in this case, a number of missed referrals could have been provided but never documented in the network datasharing system. regarding the second research question, concerning how advocacy targets needed to change based on the identification of the problem, participants agreed that the previous plan to reinforce the importance of the screening program to senior executives in current and potential partner agencies (mcmillin 2017) needed to be updated to reflect a much tighter focus on the line staff actually doing the work (or alternatively not doing the work in the ways expected) in the months remaining in the funded program year. one participant noted that the elusive warm handoff-a discharge and referral where the patient was warmly engaged, clearly understood the expected next steps, and was motivated to visit the recommended provider for further treatment-was also challenging for staff who might default to a "just hand them the paper" mindset, especially for those staff who were overwhelmed and understaffed. the network was funding additional case managers to link patients to treatment, but partner agencies were expected to screen using usual staff, who had been trained but not increased or otherwise compensated to do the screening. 
additional concerns mentioned the knowledge and preparation of staff to make good referrals, with an example noted of one staff member seemingly unaware of how to make a domestic violence referral even through a locally well-known agency specializing in interpersonal violence treatment and prevention has been working with the network for some time. meeting participants agreed that in the time remaining for the screening and referral pilot, advocacy efforts would have to be diverted away from senior executives and toward line staff if there was to be any chance of meeting enrollment targets and justifying further funding for the screening and referral program. participants also noted that while the operational assumption was that agencies that were network partners this pilot year would remain as network partners for future years of the universal screening and referral program, there was no guarantee about this. partner agencies that struggled to complete the pilot year of the program, with disappointing numbers, may decline to participate next year, especially if they lost network funding based on their challenged performance in the current program year. this suggested that additional advocacy at the executive level might still be needed, as executives could lead their agencies out of the network screening system after june 30, but that for the remainder of the program year, the line staff themselves who were performing screening needed to be heavily engaged and lobbied to have any hope of fully completing the pilot program on time. regarding the third research question, concerning specific course corrections identified and implemented to get implementation back on track, a prolonged brainstorming session was held after the disappointing data were shared. this effort produced a list of eight suggested "best practices" to help engage staff performing screening duties to excel in the work: (1) making enrollment targets more visible to staff, perhaps by using visual charts and graphs common in fundraising drives; (2) using "opt-out" choice architecture that would automatically enroll patients who screened above the cutoff score unless the patient objected; (3) sequencing screens with other paperwork and assessments in ways that make sure screens are completed and acted upon in a timely way; (4) offering patients incentives for completing screens; (5) educating staff on reflective practice and compassion fatigue to avoid or reduce feelings of being overwhelmed about screening; (6) using checklists that document work that is done and undone; (7) maintaining intermittent contact and follow-up with patients to check on whether they have accepted and followed up on referrals; and (8) using techniques of prolonged engagement so that by the time staff are screening patients for perinatal mood and anxiety disorders, patients are more likely to be engaged and willing to follow up. further discussion of these best practices noted that there was no available funding to compensate either patients for participating in screening or staff for conducting screening. long-term contact or prolonged engagement also seemed to be difficult to implement rapidly in the remaining months of the program year. low-cost, rapid implementation strategies were seen as most needed, and it was noted that strategies from behavioral economics were the practices most likely to be rapidly implemented at low-cost. 
visual charts and graphs displaying successful screenings and enrollments while also emphasizing the remaining screenings and enrollments needed to be on schedule were chosen for further training because these tactics would involve virtually no additional cost to partner agencies and could be implemented immediately. likewise, shifting to "opt-out" enrollment procedures was encouraged, where referred patients would be automatically enrolled in case management unless they specifically objected. in addition, the network quickly scheduled a workshop on how to facilitate meetings so that supervisors in partner agencies would be assisted in immediately discussing and implementing the above course corrections and behavioral strategies with their staff. training on using visual incentives emphasized three important components of using this technique. first, it was important to make sure that enrollment goals were always visually displayed in the work area of staff performing screening and enrollment work. this could be something as simple as a hand-drawn sign in the work area noting how many patients had been enrolled compared with what that week's enrollment target was. ideally this technique would transition to an infographic that was connected to an electronic dashboard in real time-where results would be transparently displayed for all to see in an automatic way that did not take additional staff time to maintain. second, the visual incentive needed to be displayed vividly enough to alter or motivate new worker behavior, but not so vividly as to compete with, distract, or delay new worker behavior. in many social work settings, participants agreed that weekly updates are intuitive for most staff. without regular check-in's and updates of the target numbers, it could be easy for workers to lose their sense of urgency about meeting these time-constrained goals. third, training emphasized teaching staff how behavioral biases could reduce their effectiveness. many staff are used to (and often good at) cramming and working just-in-time, but this is not possible when staff cannot control all aspects of work. screeners cannot control the flow of enrollees-rather they must be ready to enroll new clients intermittently as soon as they see a screening is positive-so re-learning not to cram or work just-in-time suggested a change in workplace routines for many staff. training on "opt-out" choice architecture for network enrollment procedures emphasized using behavioral economics and behavioral science to empower better direct practitioners. evidence suggests that behavior in many contexts is easy to predict with high accuracy, and behavioral economics seeks to alter people's behavior in predictable, honest, and ethical ways without forbidding any options or significantly adding too-costly incentives, so that the best or healthiest choice is the easiest choice (thaler and sunstein 2008) . training here also emphasized meeting thaler and sunstein's (2008) two standards for good choice architecture: (1) the choice had to be transparent, not hidden, and (2) it had to be cheap and easy to opt out. good examples of such choice architecture were highlighted, such as email and social media group enrollments, where one can unsubscribe and leave such a group with just one click. 
bad or false examples of choice architecture were also highlighted, such as auto-renewal for magazines or memberships where due dates by which to opt out are often hidden and there is always one more financial charge before one is free of the costly enrollment or subscription. training concluded by advising network participants to use opt-in choice architecture when the services in question are highly likely to be spam, not meaningful, or only relevant to a fraction of those approached. attendees were advised to use optout choice architecture when the services in question are highly likely to be meaningful, not spam, and relevant to most of those approached. since those with positive depression screens were not approached unless they had scored high on a depression screening, automatic enrollment in a case management service where clients would receive at least one more contact from social services was highly relevant to the population served in this pilot program and was encouraged, with clients always having the right to refuse. to jump-start changing the behavior of the staff in partner agencies actually doing the screenings, making the referrals, and enrolling patients in the case management program, the network quickly scheduled a facilitation training so that supervisors and all who led staff or chaired meetings could be prepared and empowered to discuss enrollment and teach topics like opt-out enrollment to staff. this training emphasized the importance of creating spaces for staff to express doubt or confusion about what was being asked of them. one technique that resonated with participants was doing check-ins with staff during a group training by asking staff to make "fists to fives," a hand signal on a 0-5 scale on how comfortable they were with the discussion, where holding a fist in the air is discomfort, disagreement, or confusion and waving all five fingers of one hand in the air meant total comfort or agreement with a query or topic. training also emphasized that facilitators and trainers should acknowledge that conflict and disagreement typically comes from really caring, so it was important to "normalize discomfort," call it out when people in the room seem uncomfortable, and reiterate that the partner agency is able and willing to have "the tough conversations" about the nature of the work. mid-course corrections attempted during implementation of a quality improvement system in a maternal and child health network offered several insights to how organizational policy practice and advocacy techniques may rapidly change on the ground. specifically, findings highlighted the importance of checking outcome data early enough to be able to respond to implementation concerns immediately. participants broadly endorsed organizational adoption of behavioral economic techniques to influence rapidly the work behavior of line staff engaged in screening and referral before lobbying senior executives to extend the program to future years. these findings invite further exploration of two areas: (1) the workplace experiences of line staff tasked with mid-course implementation corrections, and (2) the organizational and practice experiences of behavioral economic ("nudge") techniques. this network's approach to universal screening and referral was very clearly meant to be truly neutral or even biased in the client's favor. 
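a small, purely hypothetical sketch of the opt-out logic described in these trainings is given below. the function names, the screening cutoff of 10 and the data fields are invented stand-ins, not details of the network's actual instruments or data system; the point is only to show how the default path changes when enrollment is opt-out rather than opt-in.

```python
# Hypothetical illustration of "opt out, not opt in" enrollment after a
# depression screen. Names and the cutoff of 10 are invented for the example;
# the network's real screening instruments, cutoffs, and data system differ.
from dataclasses import dataclass

CUTOFF = 10  # hypothetical clinical cutoff for the screening instrument

@dataclass
class ScreeningResult:
    client_id: str
    score: int
    declined_referral: bool = False   # the client can always say no

def referral_opt_in(result: ScreeningResult, staff_offered: bool) -> bool:
    # Opt-in: nothing happens unless a staff member actively offers enrollment.
    return result.score >= CUTOFF and staff_offered and not result.declined_referral

def referral_opt_out(result: ScreeningResult) -> bool:
    # Opt-out: a high screen is automatically queued for case management
    # unless the client explicitly declines; no extra staff action is needed.
    return result.score >= CUTOFF and not result.declined_referral

screen = ScreeningResult(client_id="example-001", score=14)
print("opt-in, staff forgot to offer :", referral_opt_in(screen, staff_offered=False))
print("opt-out default               :", referral_opt_out(screen))
```

under the opt-in version a referral silently disappears whenever a busy staff member forgets to offer it; under the opt-out version the high screen is queued by default and only an explicit decline removes it, which mirrors the behavior the training sought from partner agencies.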
staff were allowed and even encouraged to use their own individual judgment and discretion to refer clients for case management, even if the client did not score above the clinical cutoff of the screening instrument. mindful of the dangers of rigid, top-down bureaucracy, the network explicitly sought to empower line staff to work in clients' favor, yet still experienced disappointing results. this outcome suggests several possibilities. first, it is possible that, as participants implied, line staff were sometimes demoralized workers or nervous non-clinicians who were not eager to convey difficult news regarding high depression scores to clients who may have already been difficult to serve. as hasenfeld's classic work (hasenfeld 1992) has explicated, the organizational pull toward people-processing in lieu of true people-changing is powerful in many human service organizations. tummers (2016) also recently showcases the tendency of workers to prioritize service delivery to motivated rather than unmotivated clients. smith (2017) suggests that regulatory and contractual requirements can ameliorate disparities in who gets prioritized for what kind of human service, but the variability if human service practice makes this problem hard to eliminate altogether. however, it is also possible that line staff did not see referral as clearly in a client's best interest but rather as additional paperwork and paper-pushing within their own workplaces, additional work that line staff were given neither extra time nor compensation to complete. given that ultimately the number of internal referrals that were undercounted or undocumented was seen as an important cause of disappointing project outcomes, staff reluctance to engage in extra bureaucratic sorting tasks is a distinct possibility. the line staff here providing screening may have seen their work as less of a clinical assessment process and more of a tedious, internal bureaucracy geared toward internal compliance and payment rather than getting needy clients to worthwhile treatment. further research on the experience of line staff members performing time-sensitive sorting tasks is needed to understand how even in environments explicitly trying to be empowering and supportive of worker discretion; worker discretion may have negative impacts on desired implementation outcomes. in addition to the experience of line staff in screening clients, the interest and embrace of agency supervisors in choosing behavioral economic techniques for staff training and screening processes also deserves further study. grimmelikhuijsen et al. (2017) advocate broadly for further study and understanding of behavioral public administration which integrates behavioral economic principles and psychology, noting that whether one agrees or disagrees with the "nudge movement" (p. 53) in public administration, it is important to understand its growing influence. ho and sherman (2017) pointedly critique nudging and behavioral economic approaches, noting that they may hold promise for improving implementation and service delivery but do not focus on front-line workers, and the quality and consistency of organizational and bureaucratic services in which arbitrariness remains a consistent problem. finally, more research is needed on links between organizational policy implementation and state policy. in this case, state policy primarily set report deadlines and funding amounts with little discernible impact on ongoing organizational implementation. 
this gap also points to challenges in how policymakers can feasibly learn from implementation innovation in the community and how successful innovations can influence the policy process going forward. this article's findings and any inferences drawn from them must be understood in light of several study limitations. this study used a case study method and ethnographic approaches of participant observation, a process which always runs the risk of the personal bias of the researcher intruding into data collection as well as the potential for social desirability bias among those observed. moreover, a case study serves to elaborate a particular phenomenon and issue, which may limit its applicability to other cases or situations. a critical review of the use of the case study method in high-impact journals in health and social sciences found that case studies published in these journals used clear triangulation and member-checking strategies to strengthen findings and also used well-regarded case study approaches such as stake's and qualitative analytic methods such as sandelowski's (hyett et al. 2014 ). this study followed these recommended practices. continued research on health and human service program implementation that follows the criteria and standards analyzed by hyett et al. (2014) will contribute to the empirical base of this literature while ameliorating some of these limitations. research suggests that collaboration may be even more important for organizations than for individuals in the implementation of social innovations (berzin et al. 2015) . the network studied here adopted behavioral economics as a primary means of focusing organizational collaboration. however, a managerial turn to nudging or behavioral economics must do more than achieve merely administrative compliance. "opt-out, not-in" organizational approaches could positively affect implementation of social programs in two ways. first, it could eliminate unnecessary implementation impediments (such as the difficult conversations about depression referrals resisted by staff in this case) by using tools such as automatic enrollment to push these conversations to more specialized staff who could better advise affected clients. second, such approaches could reduce the potential workplace dissatisfaction of line staff, including any potential discipline they could face for incorrectly following more complicated procedures. thaler and sunstein (2003) explicitly endorse worker welfare as a rationale and site for behavioral economic approaches. they note that every system as a system has been planned with an array of choice decisions already made, and given that there is always a starting default, it should be set to predict desired best outcomes. this study supports considering behavioral economic approaches for social program implementation as a way to reset maladaptive default settings and provide services in ways that can be more just and more effective for both workers and clients. advancing a conceptual model of evidence-based practice implementation in public service sectors. administration and policy in mental health and mental health services research policy fields, data systems, and the performance of nonprofit human service organizations defining our own future: human service leaders on social innovation administrative coordination in nonprofit human service delivery networks: the role of competition and trust. 
nonprofit and voluntary sector quarterly; institutional and market pressures on interorganizational collaboration and competition among private human service organizations; defining the implementation problem: organizational management versus cooperation; implementation matters: a review of research on the influence of implementation on program outcomes and the factors affecting implementation; helping the public make health risk decisions; implementation research: a synthesis of the literature; "the world café" in social work research; organizational innovation adoption: a multi-level framework of determinants and opportunities for future research; how to conduct a mini-ethnographic case study: a guide for novice researchers; behavioral public administration: combining insights from public administration and psychology; revisiting old friends: networks, implementation structures and the management of inter-organizational relations; daily affect and daily beliefs; human services as complex organizations; managing street-level arbitrariness: the evidence base for public sector quality improvement; methodology or method? a critical review of qualitative case study reports; successful program development using implementation evaluation; organizational policy advocacy for a quality improvement innovation in a maternal and child health network: lessons learned in early implementation; recognizing new opportunities: reconceptualizing policy advocacy in everyday organizational practice; qualitative description-the poor cousin of health research?; identifying attributes of relationship management in nonprofit policy advocacy; writing usable qualitative health research findings; effectiveness, transportability, and dissemination of interventions: what matters when?; workforce development and the organization of work: the science we need. administration and policy in mental health and mental health services research; clinical supervision in effectiveness and implementation research; the future of nonprofit human services; lessons learned from implementing school-based substance abuse prevention curriculums; financial struggles of a small community-based organization: a teaching case of the capacity paradox; libertarian paternalism is not an oxymoron. university of chicago public law & legal theory working paper 43; nudge: improving decisions about health, wealth, and happiness; lessons learned in translating research evidence on early intervention programs into clinical care. mcn; the relationship between coping and job performance; comprehensive quality programming and accountability: eight essential strategies for implementing successful prevention programs; identifying perinatal depression and anxiety: evidence-based practice in screening, psychosocial assessment and management. the author declares that this work complied with appropriate ethical standards. the author declares that they have no conflict of interest.
key: cord-218639-ewkche9r authors: ghavasieh, arsham; bontorin, sebastiano; artime, oriol; domenico, manlio de title: multiscale statistical physics of the human-sars-cov-2 interactome date: 2020-08-21 journal: nan doi: nan sha: doc_id: 218639 cord_uid: ewkche9r protein-protein interaction (ppi) networks have been used to investigate the influence of sars-cov-2 viral proteins on the function of human cells, laying out a deeper understanding of covid--19 and providing ground for drug repurposing strategies. however, our knowledge of (dis)similarities between this one and other viral agents is still very limited. here we compare the novel coronavirus ppi network against 45 known viruses, from the perspective of statistical physics. our results show that classic analysis such as percolation is not sensitive to the distinguishing features of viruses, whereas the analysis of biochemical spreading patterns allows us to meaningfully categorize the viruses and quantitatively compare their impact on human proteins. remarkably, when gibbsian-like density matrices are used to represent each system's state, the corresponding macroscopic statistical properties measured by the spectral entropy reveals the existence of clusters of viruses at multiple scales. overall, our results indicate that sars-cov-2 exhibits similarities to viruses like sars-cov and influenza a at small scales, while at larger scales it exhibits more similarities to viruses such as hiv1 and htlv1. the covid-19 pandemic, with global impact on multiple crucial aspects of human life, is still a public health threat in most areas of the world. despite the ongoing investigations aiming to find a viable cure, our knowledge of the nature of disease is still limited, especially regarding the similarities and differences it has with other viral infections. on the one hand, sars-cov-2 shows high genetic similarity to sars-cov 1 with the rise of network medicine [6] [7] [8] [9] [10] [11] , methods developed for complex networks analysis have been widely adopted to efficiently investigate the interdependence among genes, proteins, biological processes, diseases and drugs 12 . similarly, they have been used for characterizing the interactions between viral and human proteins in case of sars-cov-2 [13] [14] [15] , providing insights into the structure and function of the virus 16 and identifying drug repurposing strategies 17, 18 . however, a comprehensive comparison of sars-cov-2 against other viruses, from the perspective of network science, is still missing. here, we use statistical physics to analyze 45 viruses, including sars-cov-2. we consider the virus-human protein-protein interactions (ppi) as an interdependent system with two parts, human ppi network targeted by viral proteins. in fact, due to the large size of human ppi network, its structural properties barely change after being merged with viral components. consequently, we show that percolation analysis of such interdependent systems provides no information about the distinguishing features of viruses. instead, we model the propagation of perturbations from viral nodes through the whole system, using bio-chemical and regulatory dynamics, to obtain the spreading patterns and compare the average impact of viruses on human proteins. finally, we exploit gibbsian-like density matrices, recently introduced to map network states, to quantify the impact of viruses on the macroscopic functions of human ppi network, such as von neumann entropy. 
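a short sketch may make the density-matrix language concrete. the construction below uses the combinatorial laplacian and a gibbs-like state rho = exp(-beta l)/tr[exp(-beta l)], which is the common convention in the network density-matrix literature; treating that as the operator intended here is an assumption, since the text does not spell it out, and the scale-free toy graph merely stands in for the much larger human ppi network.

```python
# Sketch (assumed conventions): Gibbsian-like density matrix of a network and
# its spectral (von Neumann) entropy, evaluated at several values of beta.
# rho = exp(-beta * L) / Tr[exp(-beta * L)], with L the combinatorial Laplacian.
import numpy as np
import networkx as nx
from scipy.linalg import expm

def spectral_entropy(G, beta):
    L = nx.laplacian_matrix(G).toarray().astype(float)
    rho = expm(-beta * L)
    rho /= np.trace(rho)
    eigs = np.linalg.eigvalsh(rho)
    eigs = eigs[eigs > 1e-12]
    return -np.sum(eigs * np.log2(eigs))

G = nx.barabasi_albert_graph(200, 3, seed=0)   # toy stand-in for a PPI network
for beta in (0.01, 0.1, 1.0, 10.0):
    print(f"beta = {beta:5.2f}   S = {spectral_entropy(G, beta):.3f} bits")
```

smaller values of beta probe short diffusion scales while larger values emphasize large-scale organization, which is the sense in which beta can serve as a resolution parameter.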
the inverse temperature β is used as a resolution parameter to perform a multiscale analysis. we use the above information to cluster together viruses and our findings indicate that sars-cov-2 groups with a number of pathogens associated with respiratory infections, including sars-cov, influenza a and human adenovirus (hadv), at the smallest scales, which are more influenced by local topological features. interestingly, at larger scales, it exhibits more similarity with viruses from distant families such as hiv1 and human t-cell leukemia virus type 1 (htlv1). our results shed light on unexplored aspects of sars-cov-2 from the perspective of the statistical physics of complex networks, and the presented framework opens the door to further theoretical developments aiming to characterize the structure and dynamics of virus-host interactions, as well as grounds for further experimental investigation and potentially novel clinical treatments. here, we use data regarding the viral proteins and their interactions with human proteins for 45 viruses (see methods and fig. 1). to obtain the virus-human interactomes, we link the data to the biostr human ppi network (19). percolation of the interactomes. arguably, the simplest conceptual framework to assess how and why a networked system loses its functionality is the process of percolation 19. here, the structure of interconnected systems is modeled by a network g with n nodes, which can be fully represented by an adjacency matrix a (a_ij = 1 if nodes i and j are connected and 0 otherwise) 20. this point of view assumes that, as a first approximation, there is an intrinsic relation between connectivity and functionality: when node removal occurs, the more capable a system is of remaining assembled, the better it will perform its tasks. hence, we have a quantitative way to assess the robustness of the system. if one wants to single out the role played by a certain property of the system, instead of selecting the nodes randomly, they can be sequentially removed following that criterion. for instance, if we want to find out the relevance of the most connected elements for functionality, we can remove a fraction of the nodes with largest degree 21, 22. technically, the criterion can be any metric that allows us to rank nodes, although in practical terms topologically oriented protocols are the most frequently used due to their accessibility, such as degree, betweenness, etc. therefore percolation is, in effect, a topological analysis, since its input and output are based on structural information. in the past, the use of percolation has proved useful to shed light on several aspects of protein-related networks, such as the identification of functional clusters 23 and protein complexes 24, the verification of the quality of functional annotations 25, or the critical properties as a function of mutation and duplication rates 26, to name but a few. following this research line, we apply percolation analysis to all the ppi networks to understand whether this technique brings any information that allows us to differentiate among viruses. the considered protocols are the random selection of nodes, the targeting of nodes by degree - i.e., the number of connections they have - and their removal by betweenness centrality - i.e., a measure of the likelihood of a node to be in the information flow exchanged through the system by means of shortest paths.
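as a rough illustration of these protocols, the minimal python sketch below (using networkx) removes nodes at random, by degree and by betweenness centrality from a stand-in graph and tracks the normalized size of the largest connected component; the graph, the removal fractions and all names are illustrative assumptions, not the paper's data or code.

```python
# a minimal sketch of the three removal protocols, assuming the interactome is
# available as an undirected networkx graph; the stand-in graph, the removal
# fractions and all names are illustrative, not the paper's data or code.
import random
import networkx as nx

def lcc_size(g):
    # size of the largest connected component (0 for an empty graph)
    return max((len(c) for c in nx.connected_components(g)), default=0)

def percolation_curve(g, ranking, fractions):
    # remove the top fraction f of nodes in `ranking` and report
    # s = |largest component after removal| / |original network| for each f
    n0 = g.number_of_nodes()
    curve = []
    for f in fractions:
        h = g.copy()
        h.remove_nodes_from(ranking[: int(f * n0)])
        curve.append((f, lcc_size(h) / n0))
    return curve

g = nx.barabasi_albert_graph(2000, 3, seed=1)           # stand-in interactome
fracs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

by_degree = sorted(g, key=g.degree, reverse=True)        # degree-targeted attack
by_betweenness = sorted(g, key=nx.betweenness_centrality(g).get, reverse=True)
at_random = random.Random(1).sample(list(g), g.number_of_nodes())

for name, rank in [("random", at_random), ("degree", by_degree), ("betweenness", by_betweenness)]:
    print(name, percolation_curve(g, rank, fracs))
```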
we apply these attack strategies and compute the resulting (normalized) size of the largest connected component s in the network, which serves as a proxy for the remaining functional part, as commented above. this way, when s is close to unity the function of the network has been scarcely impacted by the intervention, while when s is close to 0 the network can no longer be operative. the results are shown in fig. 3. surprisingly, for each attacking protocol, we observe that the curves of the size of the largest connected component neatly collapse onto a common curve. in other words, percolation analysis completely fails at finding virus-specific discriminators. viruses do respond differently depending on the ranking used, but this is somewhat expected due to the correlation between the metrics employed and the position of the nodes in the network. we can shed some light on the similar virus-wise response to percolation by looking at the topological structure of the interactomes. despite being viruses of diverse nature and causing such different symptomatology, their overall structure shows a high level of similarity when it comes to the protein-protein interactions. indeed, for every pair of viruses we find the fraction of nodes f_n and fraction of links f_l that simultaneously participate in both. averaging over all pairs, we obtain f_n = 0.9996 ± 0.0002 and f_l = 0.9998 ± 0.0007. that means that the interactomes are structurally very similar, and so are the dismantling ranks, which explains the collapse of the percolation curves. if purely topological analysis is not able to differentiate between viruses, then we need more convoluted, non-standard techniques to tackle this problem. in the next sections we employ these alternative approaches. analysis of perturbation propagation. ppi networks represent the large-scale set of interacting proteins. in the context of regulatory networks, edges encode dependencies for activation/inhibition with transcription factors. ppi edges can also represent the propensity for pairwise binding and the formation of complexes. the analytical treatment of these processes is described via bio-chemical dynamics 27, 28 and regulatory dynamics 29. in bio-chemical (bio-chem) dynamics, these interactions are proportional to the product of the concentrations of the reactants, thus resulting in a second-order interaction, forming dimers. the protein concentration x_i (i = 1, 2, ..., n) also depends on its degradation rate b_i and on the amount of protein synthesized at a rate f_i. the resulting law of mass action, dx_i/dt = f_i − b_i x_i − Σ_j a_ij x_i x_j, summarizes the formation of complexes and the degradation/synthesis processes that occur in a ppi. regulatory (michaelis-menten, m-m) dynamics can instead be characterized by an interaction with neighbors described by a hill function that saturates at unity, Σ_j a_ij x_j^h / (1 + x_j^h). in the context of the study of signal propagation, recent works have introduced the definition of the network global correlation function 30, 31 as g_ij = (dx_i/x_i) / (dx_j/x_j): ultimately, the idea is that a constant perturbation applied to node j brings the system to a new steady state x_i → x_i + dx_i, and dx_i/x_i quantifies the magnitude of the response of node i to the perturbation in j. this also allows the definition of measures such as the impact 31 of a node, i_i = Σ_j a_ij g_ij, describing the response of i's neighbors to its perturbation. interestingly, it was found that these measures can be described with power laws of the degrees (i_i ≈ k_i^φ), via universal exponents dependent on the underlying dynamical odes, allowing one to effectively describe the interplay between topology and dynamics.
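a minimal, self-contained sketch of this kind of numerical procedure is given below: it integrates a mass-action system of the form described above on a toy network, applies a sustained perturbation to a few hypothetical virally targeted proteins, and reports the relative steady-state responses and their cumulative sum. the graph, the rates, the perturbation size and the choice of targets are all assumptions, not the paper's values.

```python
# a minimal sketch, not the authors' code: integrate mass-action dynamics of the
# form dx_i/dt = f_i - b_i x_i - sum_j a_ij x_i x_j on a toy network, apply a
# sustained perturbation to a few hypothetical virally targeted proteins and
# measure the relative steady-state response of every protein.
import numpy as np
import networkx as nx
from scipy.integrate import solve_ivp

g = nx.erdos_renyi_graph(200, 0.03, seed=2)   # toy stand-in for the interactome
a = nx.to_numpy_array(g)
n = a.shape[0]
b = 1.0                                        # assumed degradation rate

def mass_action(_, x, synth):
    # bio-chem dynamics: synthesis - degradation - dimer formation
    return synth - b * x - x * (a @ x)

def steady_state(synth):
    sol = solve_ivp(mass_action, (0, 200), np.ones(n), args=(synth,),
                    method="LSODA", rtol=1e-8)
    return sol.y[:, -1]

baseline_synthesis = np.full(n, 1.0)           # assumed synthesis rate f_i
x0 = steady_state(baseline_synthesis)

# sustained perturbation on the human proteins targeted by a hypothetical virus
viral_targets = [0, 1, 2]
perturbed_synthesis = baseline_synthesis.copy()
perturbed_synthesis[viral_targets] *= 1.1
x1 = steady_state(perturbed_synthesis)

g_v = np.abs(x1 - x0) / x0                     # per-protein relative response
print("cumulative correlation ||g_v||_1 =", g_v.sum())
```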
in our case, φ = 0 for both processes, therefore the perturbation from i has the same impact on its neighbors, regardless of its degree. we exploit the definition of g_ij to define the vector g_v of perturbations of concentrations induced by the interaction with the virus v, where the k-th entry aggregates the responses g_ki over the viral source nodes i ∈ v 31. the steps we follow to assess the impact of the viral nodes on the human interactome via the microscopic dynamics are described next. we first obtain the equilibrium states of the human interactome by numerical integration of the equations. then, for each virus, we compute the system response to perturbations starting at every viral node i ∈ v, which is eventually encoded in g_v. finally, we repeat these steps for both the bio-chem and m-m models. the amount of correlation generated is a measure of the impact of the virus on the interactome equilibrium state. we estimate it as the 1-norm of the correlation vectors, ||g_v||_1 = Σ_i |g_v,i|, which we refer to as cumulative correlation. the results are presented in fig. 4. by allowing for multiple sources of perturbation, the biggest responses in magnitude will come from direct neighbors of these sources, making them the dominant contributors to ||g_v||_1. with i_i not depending on the source degree, these results support the idea that, with these specific forms of dynamical processes on top of the interactome, the overall impact of a perturbation generated by a virus is proportional to the number of human proteins it interacts with. the results shown in fig. 5 highlight that propagation patterns strongly depend on the sources (i.e., the affected nodes v), and strong similarities will generally be found within the same family and for viruses that share common impacted proteins in the interactome. conversely, families and viruses with small (or null) overlap in the sources exhibit low similarity and are not sharply distinguishable. to cope with this, we adopt a rather macroscopic view of the interactomes in the next section. analysis of spectral information. we have shown that the structural properties of the human ppi network do not significantly change after being targeted by viruses. percolation analysis seems ineffective in distinguishing the specific characteristics of virus-host interactomes while, in contrast, the propagation of biochemical signals from viral components into the human ppi network has proven successful in assessing the viruses in terms of their average impact on human proteins. remarkably, the propagation patterns can be used to hierarchically cluster the viruses, although some of the clusters are highly dependent on the choice of threshold (fig. 5). in this section, we describe each interactome through a gibbsian-like density matrix ρ(β, g) = e^{−βl} / z(β, g), which is defined in terms of the propagator of a diffusion process on top of the network, normalized by the partition function z(β, g) = tr[e^{−βl}], and which has an elegant physical meaning in terms of dynamical trapping for diffusive flows 38. consequently, the counterpart of the massieu function - also known as free entropy - in statistical physics can be defined for networks as φ(β, g) = ln z(β, g); note that a low value of the massieu function indicates high information flow between the nodes. the von neumann entropy s(β, g) = −tr[ρ(β, g) ln ρ(β, g)], which encodes the information content of graph g, can be directly derived from the massieu function. finally, the difference between the von neumann entropy and the massieu function follows βu(β, g) = s(β, g) − φ(β, g), where u(β, g) is the counterpart of internal energy in statistical physics. in the following, we use the above quantities to compare the interactomes corresponding to the different virus-host systems.
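a minimal sketch of how these spectral quantities can be computed from the laplacian of a networkx graph is given below; the stand-in graph and the β values are illustrative, and the formulas follow the standard definitions quoted above (z = tr[e^{−βl}], φ = ln z, s = −tr[ρ ln ρ], βu = s − φ) rather than the authors' exact code.

```python
# a minimal sketch of the spectral quantities for the gibbsian-like state
# rho = exp(-beta * L) / Z; the stand-in graph and beta values are assumptions.
import numpy as np
import networkx as nx

def spectral_quantities(g, beta):
    # eigenvalues of the laplacian give the spectrum of rho directly
    lam = np.linalg.eigvalsh(nx.laplacian_matrix(g).toarray().astype(float))
    w = np.exp(-beta * lam)
    z = w.sum()                                  # partition function, tr[e^{-beta L}]
    p = w / z                                    # eigenvalues of rho
    massieu = np.log(z)                          # free entropy, phi
    entropy = -np.sum(p * np.log(p + 1e-300))    # von neumann entropy, s
    energy = (entropy - massieu) / beta          # from beta * u = s - phi
    return z, massieu, entropy, energy

g = nx.watts_strogatz_graph(500, 6, 0.1, seed=3)   # stand-in interactome
for beta in (0.1, 1.0, 10.0):
    print(beta, spectral_quantities(g, beta))
```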
in fact, as the number of viral nodes is much smaller than the number of human proteins, we model each virus-human interdependent system as a perturbation of the large human ppi network g (see fig. 6). after considering the viral perturbations, the von neumann entropy, the massieu function and the energy of the human ppi network change slightly. the magnitude of such perturbations can be calculated as explained in fig. 6 for the von neumann entropy and the massieu function, while the perturbation in internal energy follows from their difference, βδu(β, g) = δs(β, g) − δφ(β, g), according to eq. 7. the parameter β encodes the propagation time in the diffusion dynamics, or equivalently an inverse temperature from a thermodynamic perspective, and is used as a resolution parameter tuned to characterize macroscopic perturbations due to node-node interactions at different scales, from short to long range 40. based on the perturbation values and using the k-means algorithm, a widely adopted clustering technique, we group the viruses together (see fig. 6, tab. 1 and tab. 2). at small scales, sars-cov-2 appears in a cluster with a number of other viruses causing respiratory illness, including sars-cov, influenza a and hadv. however, at larger scales, it exhibits more similarity with hiv1, htlv1 and hpv type 16. table 1: the summary of clustering results at small scales (β ≈ 1 from fig. 6) is presented. remarkably, at this scale, sars-cov-2 groups with a number of respiratory diseases including sars-cov, influenza a and hadv. table 2: the summary of clustering results at larger scales (from fig. 6) is presented. here, sars-cov-2 shows higher similarity to hiv1, htlv1 and hpv type 16. comparing covid-19 against other viral infections is still a challenge. in fact, various approaches can be adopted to characterize and categorize the complex nature of viruses and their impact on human cells. in this study, we used an approach based on statistical physics to analyze virus-human interactomes. overview of the data set. it is worth noting that to build the covid-19 virus-host interactions, a different procedure had to be used. in fact, since sars-cov-2 is too novel, we could not find its ppi in the string repository and we have considered, instead, the targets experimentally observed in gordon et al 13, consisting of 332 human proteins. the remainder of the procedure used to build the virus-host ppi is the same as before. see fig. 1 for summary information about each virus. a key enzyme involved in the process of prostaglandin biosynthesis; ifih1 (interferon induced with helicase c domain 1, ncbi gene id: 64135), encoding mda5, an intracellular sensor of viral rna responsible for triggering the innate immune response: it is fundamental for activating the pro-inflammatory response that includes interferons, and for this reason it is targeted by several virus families which are able to hinder the innate immune response by evading its specific interferon response. contributions. ag, oa and sb performed numerical experiments and data analysis. mdd conceived and designed the study. all authors wrote the manuscript.
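as a small, hedged illustration of the grouping step described above (and not of the actual data), the sketch below applies k-means to placeholder per-virus perturbation features at one resolution; the feature values, the virus labels and the number of clusters are all made up for demonstration.

```python
# a toy illustration of the k-means grouping step; the per-virus features are
# random placeholders standing in for the entropy / massieu-function
# perturbations at one resolution beta, and k = 4 is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
virus_names = [f"virus_{i}" for i in range(45)]
features = rng.normal(size=(45, 2))        # e.g. (delta S, delta Phi) per virus

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
for name, label in zip(virus_names, km.labels_):
    print(name, "-> cluster", label)
```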
references.
the proximal origin of sars-cov-2
the genetic landscape of a cell
epidemiologic features and clinical course of patients infected with sars-cov-2 in singapore
a trial of lopinavir-ritonavir in adults hospitalized with severe covid-19
remdesivir, lopinavir, emetine, and homoharringtonine inhibit sars-cov-2 replication in vitro
network medicine: a network-based approach to human disease
focus on the emerging new fields of network physiology and network medicine
human symptoms-disease network
network medicine approaches to the genetics of complex diseases
the human disease network
the multiplex network of human diseases
network medicine in the age of biomedical big data
a sars-cov-2 protein interaction map reveals targets for drug repurposing
structural genomics and interactomics of 2019 wuhan novel coronavirus, 2019-ncov, indicate evolutionary conserved functional regions of viral proteins
structural analysis of sars-cov-2 and prediction of the human interactome
fractional diffusion on the human proteome as an alternative to the multi-organ damage of sars-cov-2
network medicine framework for identifying drug repurposing opportunities for covid-19
predicting potential drug targets and repurposable drugs for covid-19 via a deep generative model for graphs
network robustness and fragility: percolation on random graphs
introduction to percolation theory
error and attack tolerance of complex networks
breakdown of the internet under intentional attack
identification of functional modules in a ppi network by clique percolation clustering
identifying protein complexes from interaction networks based on clique percolation and distance restriction
percolation of annotation errors through hierarchically structured protein sequence databases
infinite-order percolation and giant fluctuations in a protein interaction network
computational analysis of biochemical systems: a practical guide for biochemists and molecular biologists
propagation of large concentration changes in reversible protein-binding networks
an introduction to systems biology
quantifying the connectivity of a network: the network correlation function method
universality in network dynamics
the statistical physics of real-world networks
classical information theory of networks
the von neumann entropy of networks
structural reducibility of multilayer networks
spectral entropies as information-theoretic tools for complex network comparison
complex networks from classical to quantum
enhancing transport properties in interconnected systems without altering their structure
scale-resolved analysis of brain functional connectivity networks with spectral entropy
unraveling the effects of multiscale network entanglement on disintegration of empirical systems (under revision)
string v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
biogrid: a general repository for interaction datasets
the biogrid interaction database: 2019 update
gene help: integrated access to genes of genomes in the reference sequence collection
competing financial interests. the authors declare no competing financial interests.
acknowledgements. the authors thank vera pancaldi for useful discussions.
key: cord-024571-vlklgd3x authors: kim, yushim; kim, jihong; oh, seong soo; kim, sang-wook; ku, minyoung; cha, jaehyuk title: community analysis of a crisis response network date: 2019-07-28 journal: soc sci comput rev doi: 10.1177/0894439319858679 sha: doc_id: 24571 cord_uid: vlklgd3x this article distinguishes between clique family subgroups and communities in a crisis response network. then, we examine the way organizations interacted to achieve a common goal by employing community analysis of an epidemic response network in korea in 2015. the results indicate that the network split into two groups: core response communities in one group and supportive functional communities in the other. the core response communities include organizations across government jurisdictions, sectors, and geographic locations. other communities are confined geographically, homogenous functionally, or both. we also find that whenever intergovernmental relations were present in communities, the member connectivity was low, even if intersectoral relations appeared together within them. other or are friends, know each other, etc." which generally refers to a social circle (mokken, 1979, p. 161) , while a community is formed through concrete social relationships (e.g., high school friends) or sets of people perceived to be similar, such as the italian community and twitter community (gruzd, wellman, & takhteyev, 2011; hagen, keller, neely, depaula, & robert-cooperman, 2018) . in social network analysis, a clique is operationalized as " . . . a subset of actors in which every actor is adjacent to every other actor in the subset (borgatti, everett, & johnson, 2013, p. 183) , while communities refer to " . . . groups within which the network connections are dense, but between which they are sparser" (newman & girvan, 2004, p. 69) . the clique and its variant definitions (e.g., n-cliques and k-cores) focus on internal edges, while the community is a concept based on the distinction between internal edges and the outside. we argue that community analysis can provide useful insights about the interrelations among diverse organizations in the ern. we have not yet found any studies that have investigated cohesive subgroups in large multilevel, multisectoral erns through a community lens. with limited guidance from the literature on erns, we lack specific expectations or hypotheses about what the community structure in the network may look like. therefore, our study focuses on identifying and analyzing communities in the 2015 middle east respiratory syndrome coronavirus (mers) response in south korea as a case study. we address the following research questions: (1) in what way were distinctive communities divided in the ern? and (2) how did the interorganizational relations relate to the internal characteristics of the communities? by detecting and analyzing the community structure in an ern, we offer insights for future empirical studies on erns. the interrelations in erns have been examined occasionally by analyzing the entire network's structure. for example, the katrina case exhibited a large and sparse network, 1 in which a small number of nodes had a large number of edges and a large number of nodes had a small number of edges (butts, acton, & marcum, 2012) . the katrina response network can be thought of as " . . . a loosely connected set of highly cohesive clusters, surrounded by an extensive 'halo' of pendant trees, small independent components, and isolates" (butts et al., 2012, p. 23) . 
the network was sparse and showed a tree-like structure but also included cohesive substructures. other studies on the katrina response network have largely concurred with these observations (comfort & haase, 2006; kapucu, arslan, & collins, 2010) . in identifying cohesive subgroups in the katrina response network, these studies rely on the analysis of cliques: "a maximal complete subgraph of three or more nodes" (wasserman & faust, 1994, p. 254) or clique-like (n-cliques or k-cores). the n-cliques can include nodes that are not in the clique but are accessible. similarly, k-cores refer to maximal subgraphs with a minimum degree of at least k. many cliques were identified in the katrina response network, in which federal and state agencies appeared frequently (comfort & haase, 2006; kapucu, 2005) . using k-cores analysis, butts, acton, and marcum (2012) suggest that the katrina response network's inner structure was built around a small set of cohesive subgroups that was divided along institutional lines corresponding to five state clusters (alabama, colorado, florida, georgia, and virginia), a cluster of u.s. federal organizations, and one of nongovernmental organizations. while these studies suggest the presence of cohesive subgroups in erns, we have not found any research that thoroughly discussed subsets of organizations' significance in erns. from the limited literature, we identify two different, albeit related, reasons that cohesive subgroups have interested ern researchers. in their analysis of cohesive subgroups using cliques, comfort and haase (2006) assume that a cohesive subgroup can facilitate achieving shared tasks as a group, but it can be less adept at managing the full flow of information and resources across groups and thus decreasing the entire network's coherence. kapucu and colleagues (2010) indicate that the recurrent patterns of interaction among the sets of selected organizations may be the result of excluding other organizations in decision-making, which may be a deterrent to all organizations' harmonious concerted efforts in disaster responses. comfort and haase (2006) view cliques as an indicator of " . . . the difficulty of enabling collective action across the network" (p. 339), 2 and others have adhered closely to this perspective (celik & corbacioglu, 2016; hossain & kuti, 2010; kapucu, 2005) . cohesive subgroups such as cliques are assumed to be a potential hindrance to the entire network's performance. the problem with this perspective is that one set of eyes can perceive cohesive subgroups in erns as a barrier, while another can regard them as a facilitator of an effective response. while disaster and emergency response plans are inherently limited and not implemented in practice as intended (clarke, 1999) , stakeholder organizations' responses may be performed together with presumed structures, particularly in a setting in which government entities are predominant. for example, the incident command system (ics) 3 was designed to improve response work's efficiency by constructing a standard operating procedure (moynihan, 2009 ). structurally, one person serves as the incident commander who is responsible for directing all other responders (kapucu & garayev, 2016) . ics is a somewhat hierarchical command-and-control system with functional arrangements in five key resources and capabilities-that is, command, operations, planning, logistics, and finance (kapucu & garayev, 2016) . 
in an environment in which such an emergency response model is implemented, it is realistic to expect clusters and subgroups to reflect the model's structural designs and arrangements, and they may be intentionally designed to facilitate coordination, communication, and collaboration with other parts or subgroups efficiently in a large response network. others are interested in identifying cohesive subgroups because they may indicate a lack of cross-jurisdictional and cross-sectoral collaboration in erns. during these responses, public organizations in different jurisdictions participate, and a sizable number of organizations from nongovernmental sectors also become involved (celik & corbacioglu, 2016; comfort & haase, 2006; kapucu et al., 2010; spiro, acton, & butts, 2013) . organizational participation by multiple government levels and sectors is often necessary because knowledge, expertise, and resources are distributed in society. participating organizations must collaborate and coordinate their efforts. however, studies have suggested that interactions in erns are limited and primarily occur among similar organizations, particularly within the same jurisdiction. that is, public organizations tend to interact more frequently with other public organizations in specific geographic locations (butts et al., 2012; hossain & kuti, 2010; kapucu, 2005; tang, deng, shao, & shen, 2017) . these studies indicate that organizations have been insufficiently integrated across government jurisdictions (tang et al., 2017) or sectors (butts et al., 2012; hossain & kuti, 2010) , and the identification of cliques composed of similar organizations reinforces such a concern. in our view, there is a greater, or perhaps more interesting, question related to the crossjurisdictional and cross-sectoral integration in interorganizational response networks: how are intergovernmental relations mixed with intersectoral relations in erns? here, we use the term interorganizational relations to refer to both intergovernmental and intersectoral relations. intergovernmental relations refer to the interaction among organizations across different government levels (local, provincial, and national) , and intersectoral relations involve the interaction among organizations across different sectors (public, private, nonprofit, and civic sectors). recent studies have suggested that both intergovernmental and intersectoral relations shape erns (kapucu et al., 2010; kapucu & garayev, 2011; tang et al., 2017) , but few have analyzed the way the two interorganizational relations intertwine. if the relation interdependencies in the entire network are of interest to ern researchers, as is the case in this article, focusing on cliques may not necessarily be the best approach to the question because clique analysis may continue to find sets of selected organizations that are tightly linked for various reasons. the analysis of cliques is a very strict way of operationalizing cohesive subgroups from a social network perspective (moody & coleman, 2015) , and there are two issues with using it to identify cohesive subgroups in erns. first, clique analysis assumes complete connections of three or more subgroup members, while real-world networks tend to have many small overlapping cliques that do not represent distinct groups (moody & coleman, 2015) . 
even if substantively meaningful cliques appear, they may not necessarily imply a lack of information flow across subgroups or the exclusion of other organizations, as previous ern studies have assumed (comfort & haase, 2006; kapucu et al., 2010). second, clique analysis assumes no internal differentiation in members' structural positions within the subgroup (wasserman & faust, 1994). in a task-oriented network such as an ern, organizations within a subgroup may look similar (e.g., all fire organizations). however, this does not imply that they are identical in their structural positions. when these assumptions in clique analysis do not hold, identifying cohesive subgroups as cliques is inappropriate (wasserman & faust, 1994). similarly, other clique-like approaches (n-cliques and k-cores) demand an answer to the question: "what is the n or k?" the clique and clique-like approaches have a limited ability to define and identify cohesive subgroups in a task-oriented network because they do not clearly explain why the subgroups need to be defined and identified in such a manner. we propose a different way of thinking about and finding subsets of organizations in erns: community. when a network consists of subsets of nodes with many edges that connect nodes of the same subset, but few that lie between subsets, the network is said to have a community structure (wilkinson & huberman, 2004). network researchers have developed methods with which to detect communities (fortunato, latora, & marchiori, 2004; latora & marchiori, 2001; lim, kim, & lee, 2016; newman & girvan, 2004; yang & leskovec, 2014). optimization approaches, such as the louvain and leiden methods, which we use in this article, sort nodes into communities by maximizing a clustering objective function (e.g., modularity). beginning with each node in its own group, the algorithm joins groups together in pairs, choosing the pairs that maximize the increase in modularity (moody & coleman, 2015). this method performs an iterative process of node assignments until modularity is maximized and leads to a hierarchical nesting of nodes (blondel, guillaume, lambiotte, & lefebvre, 2008). recently, the louvain algorithm was upgraded and improved as the leiden algorithm, which addresses some issues in the louvain algorithm (traag, waltman, & van eck, 2018). modularity (q), which shows the quality of partitions, is measured and assessed quantitatively as q = Σ_i [e_ii − (Σ_j e_ij)²], in which e_ii is the fraction of the intra-edges of community i over all edges, and e_ij is the fraction of the inter-edges between community i and community j over all edges. modularity scores are used to compare assignments of nodes into different communities and also the final partitions. it is calculated as a normalized index value: if there is only one group in a network, q takes the value of zero; if all ties are within separate groups, q takes the maximum value of one. thus, a higher q indicates a greater portion of intra- than inter-edges, implying a network with a strong community structure (fortunato et al., 2004). currently, there are two challenges in community detection studies. first, the modular structure in complex networks usually is not known beforehand (traag et al., 2018). we know the community structure only after it is identified. second, there is no formal definition of community in a graph (reichardt & bornholdt, 2006; wilkinson & huberman, 2004); it simply is a concept of relative density (moody & coleman, 2015). a high modularity score ensures only that " . . .
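the sketch below illustrates modularity-based community detection of this kind; note that the article's own analysis used the leidenalg python package (linked in the acknowledgments), whereas this example uses the louvain implementation shipped with recent networkx releases as a stand-in, on a toy graph.

```python
# a minimal sketch of modularity-based community detection; the article's own
# analysis used the leidenalg package (see the acknowledgments), while this
# stand-in uses the louvain implementation shipped with recent networkx releases.
import networkx as nx
from networkx.algorithms import community

g = nx.karate_club_graph()                         # toy stand-in network

parts = community.louvain_communities(g, seed=0)   # modularity optimization
q = community.modularity(g, parts)                 # quality of the partition

print("number of communities:", len(parts))
print("modularity q = %.3f" % q)
```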
the groups as observed are distinct, not that they are internally cohesive" (moody & coleman, 2015, p. 909 ) and does not guarantee any formal limit on the subgroup's internal structure. thus, internal structure must be examined, especially in such situations as erns. despite these limitations, efforts to reveal underlying community structures have been undertaken with a wide range of systems, including online and off-line social systems, such as an e-mail corpus of a million messages in organizations (tyler, wilkinson, & huberman, 2005) , zika virus conversation communities on twitter (hagen et al., 2018) , and jazz musician networks (gleiser & danon, 2003) . further, one can exploit complex networks by identifying their community structure. for example, salathé and jones (2010) showed that community structures in human contact networks significantly influence infectious disease dynamics. their findings suggest that, in a network with a community structure, targeting individuals who bridge communities for immunization is better than intervening with highly connected individuals. we exploit the community detection and analysis to understand an ern's substructure in the context of an infectious disease outbreak. it is difficult to know the way communities in erns will form beforehand without examining clusters and their compositions and connectivity in the network. we may expect to observe communities that consist of diverse organizations because organizations' shared goal in erns is to respond to a crisis by performing necessary tasks (e.g., providing mortuary and medical services as well as delivering materials) through concerted efforts on the part of those with different capabilities (moynihan, 2009; waugh, 2003) . organizations that have different information, skills, and resources may frequently interact in a disruptive situation because one type alone, such as the government or organizations in an affected area, cannot cope effectively with the event (waugh, 2003) . on the other hand, we also cannot rule out the possibility shown in previous studies (butts et al., 2012; comfort & haase, 2006; kapucu, 2005) . organizations that work closely in normal situations because of their task similarity, geographic locations, or jurisdictions may interact more frequently and easily, even in disruptive situations (hossain & kuti, 2010) , and communities may be identified that correspond to those factors. a case could be made that communities in erns consist of heterogeneous organizations, but a case could also be made that communities are made up of homogeneous organizations with certain characteristics. it is equally difficult to set expectations about communities' internal structure in erns. we can expect that, regardless of their types, sectors, and locations, some organizations work and interact closely-perhaps even more so in such a disruptive situation. emergent needs for coordination, communication, and collaboration also can trigger organizational interactions that extend beyond the usual or planned structure. thus, the relations among organizations become dense and evolve into the community in which every member is connected. on the other hand, a community in the task network may not require all of the organizations within it to interact. for example, if a presumed structure is strongly established, organizations are more likely to interact with others within the planned structure following the chain of command and control. 
even without such a structure, government organizations may coordinate their responses following the existing chain of command and control in their routine. we may expect to observe communities with a sparse connection among organizations. thus, the way communities emerge in erns is an open empirical question that can be answered by examining the entire network. several countries have experienced novel infectious disease outbreaks over the past decade (silk, 2018; swaan et al., 2018; williams et al., 2015) and efforts to control such events have been more or less successful, depending upon the instances and countries. in low probability, high-consequence infectious diseases such as the 2015 mers outbreak in south korea, a concerted response among individuals and organizations is virtually the only way to respond because countermeasures-such as vaccines-are not readily available. thus, to achieve an effective response, it is imperative to understand the way individuals and organizations mobilize and respond in public health emergencies. however, the response system for a national or global epidemic is highly complex (hodge, 2015; sell et al., 2018; williams et al., 2015) because of several factors: (1) the large number of organizations across multiple government levels and sectors, (2) the diversity of and interactions among organizations for the necessary (e.g., laboratory testing) or emergent (e.g., hospital closure) tasks, and (3) concurrent outbreaks or treatments at multiple locations attributable to the virus's rapid spread. all of these factors create challenges when responding to public health emergencies. we broadly define a response network as the relations among organizations that can act as critical channels for information, resources, and support. when two organizations engage in any mers-specific response interactions, they are considered to be related in the response. examples of interactions include taking joint actions, communicating with each other, or sharing crucial information and resources (i.e., exchanging patient information, workforce, equipment, or financial support) related to performing the mers tasks, as well as having meetings among organizations to establish a collaborative network. we collected response network data from the following two archival sources: (1) news articles from south korea's four major newspapers 4 published between may 20, 2015, and december 31, 2015 (the outbreak period), and (2) a postevent white paper that the ministry of health and welfare published in december 2016. in august 2016, hanyang university's research center in south korea provided an online tagging tool for every news article in the country's news articles database that included the term "mers (http://naver.com)." a group of researchers at the korea institute for health and social affairs wrote the white paper (488 pages, plus appendices) based on their comprehensive research using multiple data sources and collection methods. the authors of this article and graduate research assistants, all of whom are fluent in korean, were involved in the data collection process from august 2016 to september 2017. because of the literature's lack of specific guidance on the data to collect from archival materials to construct interorganizational network data, we collected the data through trial and error. we collected data from news articles through two separate trials (a total of 6,187 articles from the four newspapers). 
the authors and a graduate assistant then ran a test trial between august 2016 and april 2017. in july 2017, the authors developed a data collection protocol based on the test trial experience collecting the data from the news articles and white paper. then, we recollected the data from the news articles between august 2017 and september 2017 using the protocol. 5 when we collected data by reviewing archival sources, we first tagged all apparent references within the source text to organizations' relational activities. organizations are defined as "any named entity that represents (directly or indirectly) multiple persons or other entities, and that acts as a de facto decision making unit within the context of the response" (butts et al., 2012, p. 6) . if we found an individual's name on behalf of the individual's organization (e.g., the secretary of the ministry of health and welfare), we coded the individual as the organization's representative. these organizational interactions were coded for a direct relation based on "whom" to "whom" and for "what purpose." then, these relational activity tags were rechecked. all explicit mentions of relations among organizations referred to in the tagged text were extracted into a sociomatrix of organizations. we also categorized individual organizations into different "groups" using the following criteria. first, we distinguished the entities in south korea from those outside the country (e.g., world health organization [who], centers for disease control and prevention [cdc] ). second, we sorted governmental entities by jurisdiction (e.g., local, provincial/metropolitan, or national) and then also by the functions that each organization performs (e.g., health care, police, fire). for example, we categorized local fire stations differently from provincial fire headquarters because these organizations' scope and role differ within the governmental structure. we categorized nongovernmental entities in the private, nonprofit, or civil society sectors that provide primary services in different service areas (e.g., hospitals, medical waste treatment companies, professional associations). at the end of the data collection process, 69 organizational groups from 1,395 organizations were identified (see appendix). 6 we employed the leiden algorithm using python (traag et al., 2018) , which we discussed in the previous section. the leiden algorithm is also available for gephi as a plugin (https://gephi.org/). after identifying communities, the network can be reduced to these communities. in generating the reduced graph, each community appears within a circle, the size of which varies according to the number of organizations in the community. the links between communities indicate the connections among community members. the thickness of the lines varies in proportion to the number of pairs of connected organizations. this process improves the ability to understand the network structure drastically and provides an opportunity to analyze the individual communities' internal characteristics such as the organizations' diversity and their connectivity for each community. shannon's diversity index (h) is used as a measure of diversity because uncertainty increases as species' diversity in a community increases (dejong, 1975) . the h index accounts for both species' richness and evenness in a community (organizational groups in a community in our case). s indicates the total number of species. 
the fraction of the population that constitutes species i is represented by p_i and is multiplied by the natural logarithm of the proportion (ln p_i). the resulting product is then summed across species and multiplied by −1, that is, h = −Σ_{i=1}^{s} p_i ln p_i. high h values represent more diverse communities. shannon's e is calculated by e = h / ln s, which indicates the equality of the various species in a community. when all of the species are equally abundant, maximum evenness (i.e., 1) is obtained. while limited, density and the average clustering coefficient can capture the basic idea of a subgraph's structural cohesion or "cliquishness" (moody & coleman, 2015). a graph's density (d) is the proportion of possible edges present in the graph, which is the ratio between the number of edges present and the maximum possible. it ranges from 0 (no edges) to 1 (if all possible lines are present). a graph's clustering coefficient (c) is the probability that two neighbors of a node are neighbors themselves. it essentially measures the way a node's neighbors form a clique. c is 1 in a fully connected graph. the mers response network in the data set consists of 1,395 organizations and 4,801 edges. table 1 shows that most of the organizations were government organizations (approximately 80%) and 20% were nongovernmental organizations from different sectors. local government organizations constituted the largest proportion of organizations (68%). further, one international organization (i.e., who) and foreign government agencies or foreign medical centers (i.e., cdc, erasmus university medical center) appeared in the response network. organizations coordinated with approximately three other organizations on average (average degree: 3.44). however, six organizations coordinated with more than 100 others. the country's health authorities, such as the ministry of health and welfare (mohw: 595 edges), central mers management headquarters (cmmh: 551 edges), and korea centers for disease control and prevention (kcdc: 253 edges), were found to have a large number of edges. the ministry of environment (304 edges) also coordinated with many other organizations in the response. the national medical center had 160 edges, and the seoul metropolitan city government had 129. the leiden algorithm detected 27 communities in the network, labeled as 0 through 26 in figures 1-3 and tables 2 and 3. the final modularity score (q) was 0.584, showing that the community detection algorithm partitioned and identified the communities in the network reasonably well. in real-world networks, modularity scores " . . . typically fall in the range from about 0.30 to 0.70. high values are rare" (newman & girvan, 2004, p. 7). the number of communities was also consistent between the leiden and louvain algorithms (26 communities in the louvain algorithm). the modularity score was slightly higher in the leiden algorithm than the q = 0.577 in the louvain. figure 1 presents the mers response network with communities in different colors to show the organizations' clustering, using the forceatlas2 layout in gephi. in figure 2, the network's community structure is clear to the human eye. from the figures (and the community analysis in table 2), we find that the mers response network was divided into two sets of communities according to which communities were at the center of the network and their nature of activity in the response: core response communities in one group and supportive functional communities in the other.
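the community-level measures defined above can be computed with a short python sketch like the following; the toy graph, the made-up communities and the hypothetical "group" node attribute are assumptions for illustration only, not the study's data.

```python
# a minimal sketch of the community-level measures defined above; the toy graph,
# the made-up communities and the hypothetical "group" node attribute are
# assumptions for illustration only.
import math
from collections import Counter
import networkx as nx

def shannon_h_e(groups):
    # shannon diversity h and evenness e over a list of group labels
    counts = Counter(groups)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    s = len(counts)
    e = h / math.log(s) if s > 1 else 1.0
    return h, e

def community_profile(g, members):
    sub = g.subgraph(members)
    groups = [g.nodes[n].get("group", "unknown") for n in members]
    h, e = shannon_h_e(groups)
    return {"size": len(members), "diversity_h": h, "evenness_e": e,
            "density": nx.density(sub), "avg_clustering": nx.average_clustering(sub)}

g = nx.karate_club_graph()
for node in g:
    g.nodes[node]["group"] = "gov" if node % 3 else "nonprofit"   # hypothetical labels
communities = [set(range(0, 17)), set(range(17, 34))]             # made-up partition
for i, members in enumerate(communities):
    print(i, community_profile(g, members))
```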
the two core communities (1 and 2) at the center of the response network included a large number of organizations, with a knot of intergroup coordination among the groups surrounding those two. these communities included organizations across government jurisdictions, sectors, and geographic locations ( table 2 , description) and were actively involved in the response during the mers outbreak. while not absolute, we observe that the network of a dominating organization had a "mushroom" shape of interactions with other organizations within the communities (also see figure 3a ). the dominant organizations were the central government authorities such as the mohw, the cmmh, and kcdc. the national health authorities led the mers response. other remaining communities were (1) confined geographically, (2) oriented functionally, or (3) both. first, some communities consisted of diverse organizations in the areas where two mers hospitals are located-seoul metropolitan city and gyeonggi province (communities 3 and 5). organizations in these communities span government levels and sectors within the areas affected. second, two communities consisted of organizations with different functions and performed supportive activities (community 4, also see figure 3b ). other supportive functional communities that focus on health (community 11, see figure 3c ) or foreign affairs (community 15) had a "spiderweb" shape of interactions among organizations within the communities. third, several communities consisted of a relatively small number of organizations connected to one in the center (communities 16, 17, 18, and 19) . these consisted of local fire organizations in separate jurisdictions (see figure 3d ) that were both confined geographically and oriented functionally. table 2 summarizes the characteristics of the 27 communities in the response network. in table 2 , we also note distinct interorganizational relations present within the communities. the two core response communities include both intergovernmental and intersectoral relations. 7 that is, organizations across government jurisdictions or sectors were actively involved in response to the epidemic in the communities. while diverse organizations participated in these core communities, the central government agencies led and directed other organizations, which reduced member connectivity. among the supportive functional communities, those that are confined geographically showed relatively high diversity but low connectivity (communities 3, 5, and 6 through 10). these communities included intergovernmental relations within geographic locations. secondly, communities of organizations with a specialized function showed relatively high diversity or connectivity. these included organizations from governmental and nongovernmental sectors and had no leading or dominating organizations. for example, communities 11 and 12 had intersectoral relations but no intergovernmental relations. thirdly, within each community of fire organizations in different geographic locations, one provincial or metropolitan fire headquarters was linked to multiple local fire stations in a star network. these communities, labeled igf, had low member diversity and member connectivity, while they were organizationally and functionally coherent. table 3 summarizes the results elaborated above. 
in addition to the division of communities along the lines of the nature of their response activities, we observe that the structural characteristics of communities with only intersectional or international relations showed high diversity and high connectivity. whenever intergovernmental relations were present in communities, however, the member connectivity was low, even if intersectoral relations appeared together within them. we use the community detection method to gain a better understanding of the patterns of associations among diverse response organizations in an epidemic response network. the large data sets available and increased computational power significantly transform the study of social networks and can shed light on topics such as cohesive subgroups in large networks. network studies today involve mining enormous digital data sets such as collective behavior online (hagen et al., 2018) , an e-mail corpus of a million messages (tyler, wilkinson, & buberman, 2005) , or scholars' massive citation data (kim & zhang, 2016) . the scale of erns in large disasters and emergencies is noteworthy (moynihan, 2009; waugh, 2003) , and over 1,000 organizations appeared in butts et al. (2012) study as well as in this research. their connections reflect both existing structural forms by design and by emergent needs. the computational power needed to analyze such large relational data is ever higher and the methods simpler now, which allows us to learn about the entire network. we find two important results. first, the national public health ern in korea split largely into two groups. the core response communities' characteristics were that (1) they were not confined geographically, (2) organizations were heterogeneous across jurisdictional lines as well as sectors, and (3) the community's internal structure was sparse even if intersectoral relations were present. on the other hand, supportive functional communities' characteristics were that (1) they were communities of heterogeneous organizations in the areas affected that were confined geographically; (2) the communities of intersectoral, professional organizations were heterogeneous, densely connected, and not confined geographically; and (3) the communities of traditional emergency response organizations (e.g., fire) were confined geographically, homogeneous, and connected sparsely in a centralized fashion. these findings show distinct features of the response to emerging infectious diseases. the core response communities suggest that diverse organizations across jurisdictions, sectors, and functions actually performed active and crucial mers response activities. however, these organizations' interaction and coordination inside the communities were found to be top down from the key national health authorities to all other organizations. this observation does not speak to the quality of interactions in the centralized top-down structure, but one can also ask how effective such a structure can be in a setting where diverse organizations must share authority, responsibilities, and resources. second, infectious diseases spread rapidly and can break out in multiple locations simultaneously. the subgroup patterns in response networks to infectious diseases can differ from those of location-bound natural disasters such as hurricanes and earthquakes. 
while some organizations may not be actively or directly involved in the response, communities of these organizations can be formed to prepare for potential outbreaks or provide support to the core response communities during the event. second, we also find that the communities' internal characteristics (diversity and connectivity) differed depending upon the types of interorganizational relations that appeared within the communities. based on these analytical results, two propositions about the community structure in the ern can be developed: (1) if intergovernmental relations operate in a community, the community's member connectivity may be low, regardless of member diversity. (2) if community members are functionally similar, (a) professional organization communities' (e.g., health or foreign affairs) member connectivity may be dense and (b) emergency response organization communities' (e.g., fire) member connectivity may be sparse. the results suggest that the presence of intergovernmental relations within the communities in erns may be associated with low member connectivity. however, this finding does not imply that those communities with intergovernmental relations are not organizationally or functionally cohesive. instead, we may expect a different correlation between members' functional similarity and their member connectivity depending upon the types of professions, as seen in 2(a) and (b). organizations' concerted efforts during a response to an epidemic is a prevalent issue in many countries (go & park, 2018; hodge, gostin, & vernick, 2007; seo, lee, kim, & lee, 2015; swaan et al., 2018) . the 2015 mers outbreak in south korea led to 16,693 suspected cases, 186 infected cases, and 38 deaths in the country (korea centers for disease control and prevention, 2015) . the south korean government's response to it was severely criticized for communication breakdowns, lack of leadership, and information secrecy (korea ministry of health and welfare, 2016). the findings of this study offer a practical implication for public health emergency preparedness and response in the country studied. erns' effective structure has been a fundamental question and a source of continued debate (kapucu et al., 2010; nowell, steelman, velez, & yang, 2018 ). the answer remains unclear, but the recent opinion leans toward a less centralized and hierarchical structure, given the complexity of making decisions in disruptive situations (brooks, bodeau, & fedorowicz, 2013; comfort, 2007; hart, rosenthal, & kouzmin, 1993) . our analysis shows clearly that the community structure and structures within communities in the network were highly centralized (several mushrooms) and led by central government organizations. given that the response to the outbreak was severely criticized for its poor communication and lack of coordination, it might be beneficial to include more flexibility and openness in the response system in future events. we suggest taking advice from the literature above conservatively because of the contextual differences in the event and setting. this study's limitations also deserve mention. several community detection methods have been developed with different assumptions for network partition. some algorithms take deterministic group finding approaches that partition the network based on betweenness centrality edges (girvan & newman, 2002) or information centrality edges (fortunato et al., 2004) . other algorithms take the optimization approaches we use in this article. 
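as a rough sketch of how a deterministic partition and an optimization-based partition can be compared side by side (the kind of side analysis reported next), the example below contrasts girvan-newman and louvain on a toy graph; louvain stands in for leiden here, and the choice of the girvan-newman partition depth is an illustrative assumption.

```python
# a rough sketch comparing a deterministic partition (girvan-newman) with an
# optimization-based one (louvain standing in for leiden) on a toy graph; the
# choice of girvan-newman partition depth is an illustrative assumption.
import networkx as nx
from networkx.algorithms import community

g = nx.karate_club_graph()

louvain = community.louvain_communities(g, seed=0)

# take the first girvan-newman partition with at least as many groups as louvain
gn_iter = community.girvan_newman(g)
gn = next(p for p in gn_iter if len(p) >= len(louvain))

for name, parts in [("louvain", louvain), ("girvan-newman", [set(c) for c in gn])]:
    print(name, "communities:", len(parts),
          "modularity: %.3f" % community.modularity(g, parts))
```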
in our side analyses, we tested three algorithms with the same data set: g-n, louvain, and leiden. the modularity scores were consistent, as reported in this article, but the number of communities differed between g-n and the other two algorithms. the deterministic group finding approach (g-n) found a substantially higher number of communities. the modularity score can help make sense of the partition initially, but the approach is limited (reichardt & bornholdt, 2006). thus, two questions remain: which algorithm do we choose, and how do we know whether the community structure is robust (karrer, levina, & newman, 2008)? in their nature, these questions do not differ from the question of which statistical model to use given the assumptions and the types of data at hand. the algorithms also require further examination and tests. while we reviewed the data sources carefully multiple times to capture the response coordination, communication, and collaboration, the process of collecting and cleaning data can never be free from human error. it was a time-consuming, labor-intensive process that required trial and error. further, the original written materials can have their own biases that reflect the source's perspective. government documents may provide richer information about the government's actions but less so about other social endeavors. media data, such as newspapers, also have their limitations as information sources to capture rich social networks. accordingly, our results must be interpreted in the context of these limitations. in conclusion, this article examines the community structure in a large ern, which is a quite new, but potentially fruitful, approach to the field. we tested a rapidly developing analytical approach to the ern to generate theoretical insights and find paths to exploit such insights for better public health emergency preparedness and response in the future. much work remains to build and refine the theoretical propositions on crisis response networks drawn from this rich case study. notes. the katrina response network consisted of 1,577 organizations and 857 connections with a mean degree. except for the quote, comfort and haase (2006) do not provide further explanation. the incident command system was established originally for the response to fire and has been expanded to other disaster areas. in the end, we found that the process was not helpful because of the volume and redundancy of content in the news articles different newspapers published, which is not an issue in the analysis because it can be filtered and handled easily using a network analysis tool. because we had not confronted previous disaster response studies that collected network data from text materials, such as news articles and situation reports, and reported their reliability. we also classified organizations based on specialty, such as quarantine, economy, police, tourism, and so on, regardless of jurisdictions. twenty-seven specialty areas were classified. we note that the result of the diversity analysis using the 27 specialty areas did not differ from that using the 69 organizational groups. the correlation of the diversity indices based on the two different classification criteria was r = .967.
we report the result based on organization groups because the classification criterion can indicate better the different types of we did not measure the frequency, intensity, or quality of interorganizational relations but only the presence of either or both relations within the communities fast unfolding of communities in large networks organising for effective emergency management: lessons from research analyzing social networks network management in emergency response: articulation practices of state-level managers-interweaving up, down, and sideways interorganizational collaboration in the hurricane katrina response from linearity to complexity: emergent characteristics of the 2006 avian influenza response system in turkey comparing coordination structures for crisis management in six countries mission improbable: using fantasy documents to tame disaster crisis management in hindsight: cognition, coordination, communication communication, coherence, and collective action a comparison of three diversity indices based on their components of richness and evenness method to find community structures based on information centrality community structure in social and biological networks community structure in jazz a comparative study of infectious disease government in korea: what we can learn from the 2003 sars and the 2015 mers outbreak imagining twitter as an imagined community crisis communications in the age of social media: a network analysis of zika-related tweets crisis decision making: the centralization revisited global and domestic legal preparedness and response: 2014 ebola outbreak pandemic and all-hazards preparedness act disaster response preparedness coordination through social networks interorganizational coordination in dynamic context: networks in emergency response management examining intergovernmental and interorganizational response to catastrophic disasters: toward a network-centered approach collaborative decision-making in emergency and disaster management structure and network performance: horizontal and vertical networks in emergency management robustness of community structure in networks digital government and wicked problems subgroup analysis of an epidemic response network of organizations: 2015 mers outbreak in korea middle east respiratory syndrome coronavirus outbreak in the republic of korea the 2015 mers white paper. seoul, south korea: ministry of health and welfare efficient behavior of small-world networks blackhole: robust community detection inspired by graph drawing cliques, clubs and clans clustering and cohesion in networks: concepts and measures the network governance of crisis response: case studies of incident command systems finding and evaluating community structure in networks the structure of effective governance of disaster response networks: insights from the field when are networks truly modular? 
dynamics and control of diseases in networks with community structure public health resilience checklist for high-consequence infectious diseases-informed by the domestic ebola response in the united states epidemics crisis management systems in south korea infectious disease threats and opportunities for prevention extended structures of mediation: re-examining brokerage in dynamic networks ebola preparedness in the netherlands: the need for coordination between the public health and the curative sector leveraging intergovernmental and cross-sectoral networks to manage nuclear power plant accidents: a case study from from louvain to leiden: guaranteeing well-connected communities e-mail as spectroscopy: automated discovery of community structure within organizations social network analysis: methods and applications terrorism, homeland security and the national emergency management network a method for finding communities of related genes cdc's early response to a novel viral disease, middle east respiratory syndrome coronavirus structure and overlaps of communities in networks author biographies yushim kim is an associate professor at the school of public affairs at arizona state university and a coeditor of journal of policy analysis and management. her research examines environmental and urban policy issues and public health emergencies from a systems perspective jihong kim is a graduate student at the department of seong soo oh is an associate professor of public administration at hanyang university, korea. his research interests include public management and public sector human resource management he is an associate editor of information sciences and comsis journal. his research interests include data mining and databases her research focuses on information and knowledge management in the public sector and its impact on society, including organizational learning, the adoption of technology in the public sector, public sector data management, and data-driven decision-making in government jaehyuk cha is a professor at the department of computer and software, hanyang university, korea. his research interests include dbms, flash storage system the authors appreciate research assistance from jihyun byeon and useful comments from chan wang, haneul choi, and young jae won. the early idea of this article using partial data from news articles was presented at the 2019 dg.o research conference and published as conference proceeding (kim, kim, oh, kim, & ku, 2019) . data are available from the author at ykim@asu.edu upon request. we used python to employ the leiden community detection algorithm (see the source code: https://github.com/ vtraag/leidenalg). network measures, such as density and clustering coefficient, as well as the diversity index were calculated using python libraries (networkx, math, pandas, nump). we used gephi 0.9.2 for figures and mendeley for references. the authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. the authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the national research foundation of korea grant funded by the korean government (ministry of science and ict; no. 2018r1a5a7059549). supplemental material for this article is available online. 
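the appendix above names the libraries (networkx, pandas, numpy) but not the exact code; a hedged sketch of how the per-community measures could be computed is given below. the blau-type index and the toy organization graph are illustrative assumptions, not the article's data or necessarily its exact diversity formula:

```python
# per-community connectivity and diversity measures, assuming an undirected networkx
# graph per community with a node attribute "group" holding each organization's
# functional group; the example community is hypothetical.
import networkx as nx
from collections import Counter

def community_profile(g: nx.Graph) -> dict:
    groups = [d.get("group") for _, d in g.nodes(data=True)]
    shares = [c / len(groups) for c in Counter(groups).values()]
    blau = 1.0 - sum(s ** 2 for s in shares)   # member diversity: 0 = fully homogeneous
    return {
        "density": nx.density(g),              # member connectivity
        "clustering": nx.average_clustering(g),
        "diversity": blau,
    }

g = nx.Graph([("moh", "cdc"), ("cdc", "hospital_a"), ("hospital_a", "city_a"),
              ("city_a", "police_a")])
nx.set_node_attributes(g, {"moh": "health", "cdc": "health", "hospital_a": "health",
                           "city_a": "local_gov", "police_a": "police"}, name="group")
print(community_profile(g))
```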
key: cord-225177-f7i0sbwt authors: pastor-escuredo, david; tarazona, carlota title: characterizing information leaders in twitter during covid-19 crisis date: 2020-05-14 journal: nan doi: nan sha: doc_id: 225177 cord_uid: f7i0sbwt information is key during a crisis such as the current covid-19 pandemic as it greatly shapes people opinion, behaviour and even their psychological state. it has been acknowledged from the secretary-general of the united nations that the infodemic of misinformation is an important secondary crisis produced by the pandemic. infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. for instance, infodemics can lead to hatred between population groups that fragment the society influencing its response or result in negative habits that help the pandemic propagate. on the contrary, reliable and trustful information along with messages of hope and solidarity can be used to control the pandemic, build safety nets and help promote resilience and antifragility. we propose a framework to characterize leaders in twitter based on the analysis of the social graph derived from the activity in this social network. centrality metrics are used to identify relevant nodes that are further characterized in terms of users parameters managed by twitter. we then assess the resulting topology of clusters of leaders. although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence in the collective behaviour of the network and the propagation of information. misinformation and fake news are a recurrent problem of our digital era [1] [2] [3] . the volume of misinformation and its impact grows during large events, crises and hazards [4] . when misinformation turns into a systemic pattern it becomes an infodemic [5, 6] . infodemics are frequent specially in social networks that are distributed systems of information generation and spreading. for this to happen, the content is not the only variable but the structure of the social network and the behavior of relevant people greatly contribute [6] . during a crisis such as the current covid-19 pandemic, information is key as it greatly shapes people's opinion, behaviour and even their psychological state [7] [8] [9] . however, the greater the impact the greater the risk [10] . it has been acknowledged from the secretary-general of the united nations that the infodemic of misinformation is an important secondary crisis produced by the pandemic. during a crisis, time is critical, so people need to be informed at the right time [11, 12] . furthermore, information during a crisis leads to action, so population needs to be properly informed 1 center of innovation and technology for development, technical university madrid, spain 2 lifed lab, madrid, spain to act right [13] . thus, infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. for instance, infodemics can lead to hatred between population groups [14] that fragment the society influencing its response or result in negative habits that help the pandemic propagate. on the contrary, reliable and trustful information along with messages of hope and solidarity can be used to control the pandemic, build safety nets and help promote resilience and antifragility. 
to fight misinformation and hate speech,content-based filtering is the most common approach taken [6, [15] [16] [17] . the availability of deep learning tools makes this task easier and scalable [18] [19] [20] . also, positioning in search engines is key to ensure that misinformation does not dominate the most relevant results of the searches. however, in social media, besides content, people's individual behavior and network properties, dynamics and topology are other relevant factors that determine the spread of information through the network [21] [22] [23] . we propose a framework to characterize leaders in twitter based on the analysis of the social graph derived from the activity in this social network [24] . centrality metrics are used to identify relevant nodes that are further characterized in terms of users' parameters managed by twitter [25] [26] [27] [28] [29] . although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence in the collective behaviour of the network and the propagation of information [27, 30] . tweets were retrieved using the real-time streaming api of twitter. two concurrent filters were used for the streaming: location and keywords. location was restricted to a bounding box enclosing the city of madrid [-3.7475842804 each tweet was analyzed to extract mentioned users, retweeted users, quoted users or replied users. for each of these events the corresponding nodes were added to an undirected graph as well as a corresponding edge initializing the edge property "flow". if the edge was already created, the property "flow" was incremented. this procedure was repeated for each tweet registered. the network was completed by adding the property "inverse flow", that is 1/flow, to each edge. the resulting network featured 107544 nodes and 116855 edges. to compute centrality metrics the network described above was filtered. first, users with a node degree (number of edges connected to the note) less than a given threshold (experimentally set to 3) were removed from the network as well as the edges connected to those nodes. the reason of this filtering was to reduce computation cost as algorithms for centrality metrics have a high computation cost and also removed poorly connected nodes as the network built comes from sparse data (retweets, mentions and quotes). however, it is desirable to minimize the amount of filtering performed to study large scale properties within the network. the resulting network featured 15845 nodes and 26837 edges. additionally the network was filtered to be connected which is a requirement for the computation of several of the centrality metrics described bellow. for this purpose the subnetworks connected were identified, selecting the largest connected network as the target network for analysis. the resulting network featured 12006 nodes and 25316 edges. several centrality metrics were computed: cfbetweenness, betweenness, closeness, cfcloseness, eigenvalue, degree and load. each of this centrality metric highlights a specific relevance property of a node with regards to the whole flow through the network. descriptors explanations are summarized in table 1 . besides the network-based metrics, twitter user' parameters were collected: followers, following and favorites so the relationships with relevance metrics could be assessed. we applied several statistical tools to characterize users in terms of the relevance metrics. 
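a condensed sketch of the construction and filtering steps just described, using networkx as one possible toolkit and assuming the tweets have already been parsed into (author, interacted user, interaction type) records; the records are hypothetical, while the degree threshold and metric choices follow the text above:

```python
# build the undirected interaction graph with "flow" weights, filter it, and compute
# several of the centrality metrics listed above.
import networkx as nx

interactions = [("u1", "u2", "retweet"), ("u1", "u3", "mention"), ("u1", "u4", "quote"),
                ("u1", "u5", "reply"), ("u2", "u3", "retweet"), ("u2", "u4", "mention"),
                ("u2", "u5", "retweet"), ("u3", "u4", "quote"), ("u4", "u5", "mention"),
                ("u2", "u3", "retweet")]

g = nx.Graph()
for src, dst, _kind in interactions:
    if g.has_edge(src, dst):
        g[src][dst]["flow"] += 1                 # repeated interaction: increment flow
    else:
        g.add_edge(src, dst, flow=1)
for _, _, d in g.edges(data=True):
    d["inverse_flow"] = 1.0 / d["flow"]          # distance-like weight for path metrics

# remove poorly connected nodes (degree < 3) and keep the largest connected component
g.remove_nodes_from([n for n, k in dict(g.degree()).items() if k < 3])
g = g.subgraph(max(nx.connected_components(g), key=len)).copy()

metrics = {
    "degree": dict(g.degree()),
    "betweenness": nx.betweenness_centrality(g, weight="inverse_flow"),
    "load": nx.load_centrality(g, weight="inverse_flow"),
    "cf_betweenness": nx.current_flow_betweenness_centrality(g, weight="flow"),
    "cf_closeness": nx.current_flow_closeness_centrality(g, weight="flow"),
    "eigenvector": nx.eigenvector_centrality_numpy(g, weight="flow"),
}
```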
we also implemented visualizations of different variables and of the network for a better understanding of the leading nodes' characterization and topology. we compared the relevance in the network derived from the centrality metrics with the users' profile variables of twitter: number of followers, number of following and retweet count. figure 1 shows a scatter plot matrix among all variables. the principal diagonal of the figure shows the distribution of each variable, which is typically characterized by a high concentration in low values and a very long tail. these distributions imply that few nodes concentrate most of the relevance within the network. more surprisingly, the same distributions are observed for twitter users' parameters such as the number of followers or friends (following). the load centrality of a node is the fraction of all shortest paths that pass through that node. load centrality is slightly different from betweenness. the scatter plots show that there is no significant correlation between variables except for the pair of betweenness and load centralities, as expected because they have similar definitions. this fact is remarkable, as different centrality metrics provide a different perspective on leading nodes within the network, and a node's relevance does not necessarily correlate with the number of related users but also depends on the content dynamics. users were ranked using one variable as the reference. figure 2 shows the ranking resulting from using the eigenvalue centrality as the reference. the values were saturated at the 95th percentile of the distribution to improve visualization and avoid the effect of a few values far out of range. this visualization confirms the lack of correlation between variables and the highly asymmetric distribution of the descriptors. figure 3 summarizes the values of each leader for each descriptor, showing that even within the top-ranked leaders there is a very large variability. this means that some nodes are singular events within the network that require further analysis to be interpreted, as they could be leaders in society or just a product of the network dynamics. figure 4 shows the ranking resulting from using current flow betweenness centrality as the reference. in this case, the distribution of this reference variable is smoother and shows a more gradual behavior of leaders. to assess how the nodes with high relevance are distributed, we projected the network into graphs by selecting the subgraph of nodes with a certain level of relevance (threshold on the network). the resulting network graphs may therefore not be connected. the eigenvalue-ranked graph shows high connectivity and very big nodes (see fig. 5 ). this is consistent with the definition of eigenvalue centrality, which highlights how a node is connected to nodes that are also highly connected. this structure has implications for the reinforcement of specific messages and information within highly connected clusters, which can act as promoters of solutions or may become lobbies of information. the current flow betweenness shows an unconnected graph, which is very interesting as decentralized nodes play a key role in transporting information through the network (see fig. 6 ). the current flow closeness also shows an unconnected graph, which means that the social network is rather homogeneously distributed overall, with parallel communities of information that do not necessarily interact with each other (see fig. 7 ).
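a short sketch of the ranking, percentile saturation and relevance-threshold projection described above, reusing the `metrics` dictionary and graph `g` from the previous snippet; the 95th-percentile cap follows the text, while the 90% relevance threshold is an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame(metrics)                        # rows: users, columns: centrality metrics

# saturate each metric at its 95th percentile to limit the effect of extreme nodes
capped = df.clip(upper=df.quantile(0.95), axis=1)

# rank users by one reference variable, e.g. eigenvector centrality
ranking = capped.sort_values("eigenvector", ascending=False)
print(ranking.head(10))

# sub-graph of nodes above a relevance threshold (90th percentile of current-flow
# betweenness here); the result may be disconnected, as noted above
threshold = df["cf_betweenness"].quantile(0.90)
leaders = df.index[df["cf_betweenness"] >= threshold]
g_leaders = g.subgraph(leaders)
print(g_leaders.number_of_nodes(), "leader nodes,", g_leaders.number_of_edges(), "edges")
```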
by increasing the size of the graph more clusters can be observed, specially in the eigenvalue-ranked network (fig. 8) . some clusters also appear for the current flow betweenness and current flow closeness (see fig.9 and 10). these clusters may have a key role in establishing bridges between different communities of practice, knowledge or region-determined groups. as the edges of the network are characterized in terms of flows between users, these bridges can be understood in terms of volume of information between communities. the distributions of the centrality metrics indicate that there are some nodes with massive relevance. these nodes can be seen as events within the flow of communication through the network [23] that require further contextualization to be interpreted. these nodes can propagate misinformation or make news or messages viral. further research is required to understand the cause of this massive relevance events, for instance, if it is related to a relevant concept or message or whether it is an emerging event of the network dynamics and topology. another way to assess these nodes is if they are consistently behaving this way along time or they are a temporal event. also, it may be necessary to contextualize with the type of content they normally spread to understand their exceptional relevance. besides the existence of massive relevance nodes, the quantification and understanding of the distribution of high relevant nodes has a lot of potential applications to spread messages to reach a wide number of users within the network. current flow betweenness particularly seems a good indicator to identify nodes to create a safety net in terms of information and positive messages. the distribution of the nodes could be approached for the general network or for different layers or subnetworks, isolated depending on several factors: type of interaction, type of content or some other behavioral pattern. experimental work is needed to test how a message either positive or negative spreads when started at one of the relevant nodes or close to the relevant nodes. for this purpose we are working towards integrating a network of concepts and the network of leaders. understanding the dynamics of narratives and concept spreading is key for a responsible use of social media for building up resilience against crisis. we also plan to make interactive graph visualization to browse the relevance of the network and dynamically investigate how relevant nodes are connected and how specific parts of the graph are ranked to really understand the distribution of the relevance variables as statistical parameters are not suitable to characterize a common pattern. it is necessary to make a dynamic ethical assessment of the potential applications of this study. understanding the network can be used to control purposes. however, we consider it is necessary that social media become the basis of pro-active response in terms of conceptual content and information. digital technologies must play a key role on building up resilience and tackle crisis. fake news detection on social media: a data mining perspective the science of fake news fake news and the economy of emotions: problems, causes, solutions. digital journalism social media and fake news in the 2016 election viral modernity? epidemics, infodemics, and the 'bioinformational'paradigm how to fight an infodemic. 
the lancet the covid-19 social media infodemic corona virus (covid-19)"infodemic" and emerging issues through a data lens: the case of china infodemic": leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak infodemic and risk communication in the era of cov-19 information flow during crisis management: challenges to coordination in the emergency operations center the signal code: a human rights approach to information during crisis quantifying information flow during emergencies measuring political polarization: twitter shows the two sides of venezuela false news on social media: a data-driven survey hate speech detection: challenges and solutions an emotional analysis of false information in social media and news articles declare: debunking fake news and false claims using evidence-aware deep learning csi: a hybrid deep model for fake news detection a deep neural network for fake news detection dynamical strength of social ties in information spreading impact of human activity patterns on the dynamics of information diffusion efficiency of human activity on information spreading on twitter multiple leaders on a multilayer social media the ties that lead: a social network approach to leadership. the leadership quarterly detecting opinion leaders and trends in online social networks exploring the potential for collective leadership in a newly established hospital network who takes the lead? social network analysis as a pioneering tool to investigate shared leadership within sports teams discovering leaders from community actions analyzing world leaders interactions on social media we would like to thank the center of innovation and technology for development at technical university madrid for support and valuable input, specially to xose ramil, sara romero and mónica del moral. thanks also to pedro j. zufiria, juan garbajosa, alejandro jarabo and carlos garcía-mauriño for collaboration. key: cord-027463-uc0j3fyi authors: brandi, giuseppe; di matteo, tiziana title: a new multilayer network construction via tensor learning date: 2020-05-25 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50433-5_12 sha: doc_id: 27463 cord_uid: uc0j3fyi multilayer networks proved to be suitable in extracting and providing dependency information of different complex systems. the construction of these networks is difficult and is mostly done with a static approach, neglecting time delayed interdependences. tensors are objects that naturally represent multilayer networks and in this paper, we propose a new methodology based on tucker tensor autoregression in order to build a multilayer network directly from data. this methodology captures within and between connections across layers and makes use of a filtering procedure to extract relevant information and improve visualization. we show the application of this methodology to different stationary fractionally differenced financial data. we argue that our result is useful to understand the dependencies across three different aspects of financial risk, namely market risk, liquidity risk, and volatility risk. indeed, we show how the resulting visualization is a useful tool for risk managers depicting dependency asymmetries between different risk factors and accounting for delayed cross dependencies. 
the constructed multilayer network shows a strong interconnection between the volumes and prices layers across all the stocks considered while a lower number of interconnections between the uncertainty measures is identified. network structures are present in different fields of research. multilayer networks represent a widely used tool for representing financial interconnections, both in industry and academia [1] and has been shown that the complex structure of the financial system plays a crucial role in the risk assessment [2, 3] . a complex network is a collection of connected objects. these objects, such as stocks, banks or institutions, are called nodes and the connections between the nodes are called edges, which represent their dependency structure. multilayer networks extend the standard networks by assembling multiple networks 'layers' that are connected to each other via interlayer edges [4] and can be naturally represented by tensors [5] . the interlayer edges form the dependency structure between different layers and in the context of this paper, across different risk factors. however, two issues arise: 1 the construction of such networks is usually based on correlation matrices (or other symmetric dependence measures) calculated on financial asset returns. unfortunately, such matrices being symmetric, hide possible asymmetries between stocks. 2 multilayer networks are usually constructed via contemporaneous interconnections, neglecting the possible delayed cause-effect relationship between and within layers. in this paper, we propose a method that relies on tensor autoregression which avoids these two issues. in particular, we use the tensor learning approach establish in [6] to estimate the tensor coefficients, which are the building blocks of the multilayer network of the intra and inter dependencies in the analyzed financial data. in particular, we tackle three different aspects of financial risk, i.e. market risk, liquidity risk, and future volatility risk. these three risk factors are represented by prices, volumes and two measures of expected future uncertainty, i.e. implied volatility at 10 days (iv10) and implied volatility at 30 days (iv30) of each stock. in order to have stationary data but retain the maximum amount of memory, we computed the fractional difference for each time series [7] . to improve visualization and to extract relevant information, the resulting multilayer is then filtered independently in each dimension with the recently proposed polya filter [8] . the analysis shows a strong interconnection between the volumes and prices layers across all the stocks considered while a lower number of interconnection between the volatility at different maturity is identified. furthermore, a clear financial connection between risk factors can be recognized from the multilayer visualization and can be a useful tool for risk assessment. the paper is structured as follows. section 2 is devoted to the tensor autoregression. section 3 shows the empirical application while sect. 4 concludes. tensor regression can be formulated in different ways: the tensor structure is only in the response or the regression variable or it can be on both. the literature related to the first specification is ample [9, 10] whilst the fully tensor variate regression received attention only recently from the statistics and machine learning communities employing different approaches [6, 11] . the tensor regression we are going to use is the tucker tensor regression proposed in [6] . 
the model is formulated making use of the contracted product, the higher order counterpart of the matrix product [6], and can be expressed as

$$\mathcal{Y} = \mathcal{A} + \langle \mathcal{X}, \mathcal{B} \rangle_{I_X, J_B} + \mathcal{E}, \tag{1}$$

where $\mathcal{X} \in \mathbb{R}^{N \times I_1 \times \cdots \times I_N}$ is the regressor tensor, $\mathcal{Y} \in \mathbb{R}^{N \times J_1 \times \cdots \times J_M}$ is the response tensor, $\mathcal{E} \in \mathbb{R}^{N \times J_1 \times \cdots \times J_M}$ is the error tensor, $\mathcal{A} \in \mathbb{R}^{1 \times J_1 \times \cdots \times J_M}$ is the intercept tensor, while the slope coefficient tensor, which represents the multilayer network we are interested in learning, is $\mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}$. the subscripts $I_X$ and $J_B$ are the modes over which the product is carried out. in the context of this paper, $\mathcal{X}$ is a lagged version of $\mathcal{Y}$, hence $\mathcal{B}$ represents the multilinear interactions that the variables in $\mathcal{X}$ generate in $\mathcal{Y}$. these interactions are generally asymmetric and take into account lagged dependencies, $\mathcal{B}$ being the mediator between two tensor datasets separated in time. therefore, $\mathcal{B}$ represents a perfect candidate to use for building a multilayer network. however, the $\mathcal{B}$ coefficient is high dimensional. in order to resolve the issue, a tucker structure is imposed on $\mathcal{B}$ such that it is possible to recover the original $\mathcal{B}$ from smaller objects. one of the advantages of the tucker structure, contrarily to other tensor decompositions such as the parafac, is that it can handle dimension-asymmetric tensors, since each factor matrix does not need to have the same number of components. tensor regression is prone to over-fitting when intra-mode collinearity is present. in this case, a shrinkage estimator is necessary for a stable solution. in fact, the presence of collinearity between the variables of the dataset degrades the forecasting capabilities of the regression model. in this work, we use the tikhonov regularization [12]. known also as ridge regularization, it rewrites the standard least squares problem as

$$\min_{\mathcal{A},\,\mathcal{B}} \; \big\| \mathcal{Y} - \mathcal{A} - \langle \mathcal{X}, \mathcal{B} \rangle_{I_X, J_B} \big\|_F^2 + \lambda \, \| \mathcal{B} \|_F^2, \tag{2}$$

where $\lambda > 0$ is the regularization parameter and $\| \cdot \|_F^2$ is the squared frobenius norm. the greater the $\lambda$, the stronger the shrinkage effect on the parameters. however, high values of $\lambda$ increase the bias of the tensor coefficient $\mathcal{B}$. indeed, the shrinkage parameter is usually set via data-driven procedures rather than input by the user. the tikhonov regularization can be computationally very expensive for big data problems. to solve this issue, [13] proposed a decomposition of the tikhonov regularization. the learning of the model parameters is a nonlinear optimization problem that can be solved by iterative algorithms such as the alternating least squares (als) introduced by [14] for the tucker decomposition. this methodology solves the optimization problem by dividing it into small least squares problems. recently, [6] developed an als algorithm for the estimation of the tensor regression parameters with tucker structure in both the penalized and unpenalized settings. for the technical derivation refer to [6]. in this section, we show the results of the construction of the multilayer network via the tensor regression proposed in eq. 1. the dataset used in this paper is composed of stocks listed in the dow jones (dj). these stocks' time series are recorded on a daily basis from 01/03/1994 up to 20/11/2019, i.e. 6712 trading days. we use 26 of the 30 listed stocks, as they are the ones for which the entire time series is available. for the purpose of our analysis, we use log-differentiated prices, volumes, implied volatility at 10 days (iv10) and implied volatility at 30 days (iv30). in particular, we use the fractional difference algorithm of [7] to balance stationarity and residual memory in the data.
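a minimal sketch of this transformation and the stationarity check; the plain truncated-window implementation below is not the fast algorithm of [7], the series is synthetic, and statsmodels' adfuller stands in for the augmented dickey-fuller test:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def frac_diff(x: np.ndarray, d: float, window: int = 100) -> np.ndarray:
    # binomial weights of (1 - B)^d, truncated at `window` lags
    w = [1.0]
    for k in range(1, window):
        w.append(-w[-1] * (d - k + 1) / k)
    w = np.array(w)
    out = np.full_like(x, np.nan, dtype=float)
    for t in range(window - 1, len(x)):
        out[t] = np.dot(w, x[t - window + 1:t + 1][::-1])   # sum_k w_k * x_{t-k}
    return out

prices = np.cumsum(np.random.default_rng(1).standard_normal(1500))  # hypothetical log-price path
fd = frac_diff(prices, d=0.2)                                        # order used in the article
stat, pvalue, *_ = adfuller(fd[~np.isnan(fd)])
print(f"adf statistic {stat:.2f}, p-value {pvalue:.3f}")
```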
in fact, the original time series have the full amount of memory but they are non-stationary, while integer log-differentiated data are stationary but have small residual memory due to the process of differentiation. in order to preserve the maximum amount of memory in the data, we use the fractional differentiation algorithm with different levels of fractional differentiation and then test for stationarity using the augmented dickey-fuller test [15]. we find that all the data are stationary when the order of differentiation is α = 0.2. this means that only a small amount of memory is lost in the process of differentiation. the tensor regression presented in eq. 1 has some parameters to be set, i.e. the tucker rank and the shrinkage parameter λ for the penalized estimation of eq. 2, as discussed in [6]. regarding the tucker rank, we used the full rank specification since we do not want to reduce the number of independent links. in fact, using a reduced rank would imply common factors being mapped together, an undesirable feature for this application. regarding the shrinkage parameter λ, we selected the value as follows. first, we split the data into a training set composed of 90% of the sample and a test set with the remaining 10%. we then estimated the regression coefficients for different values of λ on the training set and computed the predicted r^2 on the test set. we used a grid of λ = 0, 1, 5, 10, 20, 50, and the predicted r^2 is maximized at λ = 0 (no shrinkage). in this section, we show the results of the analysis carried out with the data presented in sect. 3.1. the multilayer network built via the estimated tensor autoregression coefficient $\mathcal{B}$ represents the interconnections between and within each layer. in particular, $b_{i,j,k,l}$ is the connection between stock $i$ in layer $j$ and stock $k$ in layer $l$. it is important to notice that the estimated dependencies are in general not symmetric, i.e. $b_{i,j,k,l} \neq b_{k,j,i,l}$. however, the multilayer network constructed using $\mathcal{B}$ is fully connected. for this reason, a method for filtering those networks is necessary. different methodologies are available for filtering information from complex networks [8, 16]. in this paper, we use the polya filter of [8] as it can handle directed weighted networks and it is both flexible and statistically driven. in fact, it employs a tuning parameter $a$ that drives the strength of the filter and returns the p-values for the null hypothesis of random interactions. we filter every network independently (both intra and inter connections) using a parametrization such that 90% of the total links are removed. in order to assess the dependency across the layers, we analyze two standard multilayer network measures, i.e. inter-layer assortativity and edge overlapping. a standard way to quantify inter-layer assortativity is to calculate pearson's correlation coefficient over the degree sequences of two layers; it represents a measure of association between layers. high positive (negative) values of this measure mean that the two risk factors act in the same (opposite) direction. overlapping edges, instead, are the links between pairs of stocks present contemporaneously in two layers. high values of this measure mean that the stocks have common connection behaviour. as can be seen from fig. 1, prices and volatility have a huge portion of overlapping edges; still, these layers are disassortative, as the correlation between the node degree sequences across the two layers is negative.
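a small sketch of these two measures on two hypothetical layers sharing the same set of stocks; the toy graphs are not the estimated networks:

```python
import numpy as np
import networkx as nx

stocks = list("ABCDEF")
prices = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")])
vol10 = nx.Graph([("A", "B"), ("B", "C"), ("C", "E"), ("E", "F")])
for layer in (prices, vol10):
    layer.add_nodes_from(stocks)        # same node set in every layer

# inter-layer assortativity: pearson correlation between the two degree sequences
deg_p = np.array([prices.degree(s) for s in stocks])
deg_v = np.array([vol10.degree(s) for s in stocks])
assortativity = np.corrcoef(deg_p, deg_v)[0, 1]

# overlapping edges: links present in both layers
overlap = len(set(map(frozenset, prices.edges())) & set(map(frozenset, vol10.edges())))
print(f"inter-layer assortativity {assortativity:.2f}, overlapping edges {overlap}")
```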
this was an expected result since the negative relationship between prices and volatility is a stylized fact in finance. not surprisingly, the two measures of volatility are highly assortative and have a huge fraction of overlapping edges. finally, we show in fig. 2 the filtered multilayer network constructed via the tensor coefficient b estimated via the tensor autoregression of eq. 1. as it can be possible to notice, the volumes layer has more interlayer connections rather than intralayer connections. since each link represents the effect that one variable has on itself and other variables in the future, this means that stocks' liquidity risk mostly influences future prices and expected uncertainty. the two volatility networks have a relatively small number of interlayer connections despite being assortative. this could be due to the fact that volatility risk tends to increase or decrease through a specific maturity rather than across maturities. it is also possible to notice that more central stocks, depicted as bigger nodes in fig. 2 , have more connections but that this feature does not directly translate in a higher strength (depicted as darker colour of the nodes). this is a feature already emphasized in [3] for financial networks. fig. 2 . estimated multilayer network. node colours: loglog scale; darker colour is associated to higher strength of the node. node size: loglog scale; darker colour is associated to higher k-coreness score. edge colour: uniform. from a financial point of view, such graphical representation put together three different aspects of financial risk: market risk, liquidity risk (in terms of volumes exchanged) and forward looking uncertainty measures, which account for expected volatility risk. in fact, the stocks in the volumes layer are not strongly interconnected but produce a huge amount of risk propagation through prices and volatility. understanding the dynamics of such multilayer network representation would be a useful tool for risk managers in order to understand risk balances and propose risk mitigation techniques. in this paper, we proposed a methodology to build a multilayer network via the estimated coefficient of the tucker tensor autoregression of [6] . this methodology, in combination with a filtering technique, has proven able to reproduce interconnections between different financial risk factors. these interconnections can be easily mapped to real financial mechanisms and can be a useful tool for monitoring risk as the topology within and between layers can be strongly affected in distressed periods. in order to preserve the maximum memory information in the data but requiring stationarity, we made use of fractional differentiation and found out that the variables analyzed are stationary with differentiation of order α = 0.2. the model can be extendedto a dynamic framework in order to analyze the dependency structures under different market conditions. 
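to make the estimation step of sect. 2 concrete, the following simplified sketch illustrates only the contracted product and the closed-form tikhonov (ridge) solution on matricized tensors, with an unstructured coefficient, no intercept and purely synthetic data; it is not the penalized tucker als of [6]:

```python
import numpy as np

rng = np.random.default_rng(0)
N, I1, I2, J1, J2 = 200, 4, 3, 4, 3              # observations and tensor mode sizes
X = rng.standard_normal((N, I1, I2))             # lagged regressor tensor
B_true = rng.standard_normal((I1, I2, J1, J2))
Y = np.einsum("nab,abcd->ncd", X, B_true) + 0.1 * rng.standard_normal((N, J1, J2))

lam = 1.0                                        # shrinkage parameter (the article tunes it
                                                 # by predicted r^2 on a held-out split)
Xm = X.reshape(N, I1 * I2)                       # matricize: one row per observation
Ym = Y.reshape(N, J1 * J2)
Bm = np.linalg.solve(Xm.T @ Xm + lam * np.eye(I1 * I2), Xm.T @ Ym)   # ridge closed form
B_hat = Bm.reshape(I1, I2, J1, J2)               # multilayer-network coefficient estimate

Y_fit = np.einsum("nab,abcd->ncd", X, B_hat)
print("in-sample r^2:", 1 - np.sum((Y - Y_fit) ** 2) / np.sum((Y - Y.mean()) ** 2))
```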
the multiplex dependency structure of financial markets risk diversification: a study of persistence with a filtered correlation-network approach systemic liquidity contagion in the european interbank market the structure and dynamics of multilayer networks unveil stock correlation via a new tensor-based decomposition method predicting multidimensional data via tensor learning a fast fractional difference algorithm a pólya urn approach to information filtering in complex networks tensor regression with applications in neuroimaging data analysis parsimonious tensor response regression tensor-on-tensor regression on the stability of inverse problems a decomposition of the tikhonov regularization functional oriented to exploit hybrid multilevel parallelism principal component analysis of three-mode data by means of alternating least squares algorithms introduction to statistical time series complex networks on hyperbolic surfaces key: cord-002929-oqe3gjcs authors: strano, emanuele; viana, matheus p.; sorichetta, alessandro; tatem, andrew j. title: mapping road network communities for guiding disease surveillance and control strategies date: 2018-03-16 journal: sci rep doi: 10.1038/s41598-018-22969-4 sha: doc_id: 2929 cord_uid: oqe3gjcs human mobility is increasing in its volume, speed and reach, leading to the movement and introduction of pathogens through infected travelers. an understanding of how areas are connected, the strength of these connections and how this translates into disease spread is valuable for planning surveillance and designing control and elimination strategies. while analyses have been undertaken to identify and map connectivity in global air, shipping and migration networks, such analyses have yet to be undertaken on the road networks that carry the vast majority of travellers in low and middle income settings. here we present methods for identifying road connectivity communities, as well as mapping bridge areas between communities and key linkage routes. we apply these to africa, and show how many highly-connected communities straddle national borders and when integrating malaria prevalence and population data as an example, the communities change, highlighting regions most strongly connected to areas of high burden. the approaches and results presented provide a flexible tool for supporting the design of disease surveillance and control strategies through mapping areas of high connectivity that form coherent units of intervention and key link routes between communities for targeting surveillance. networks, the regular and planar nature of road networks precludes the formation of clear communities, i.e. roads that cluster together shaping areas that are more connected within their boundaries than with external roads. highly connected regional communities can promote rapid disease spread within them, but can be afforded protection from recolonization by surrounding regions of reduced connectivity, making them potentially useful intervention or surveillance units 6, 26, 27 . for isolated areas, a focused control or elimination program is likely to stand a better chance of success than those highly connected to high-transmission or outbreak regions. for example, reaching a required childhood vaccination coverage target in one district is substantially more likely to result in disease control and elimination success if that district is not strongly connected to neighbouring districts where the target has not been met. 
the identification of 'bridge' routes between highly connected regions could also be of value in targeting limited resources for surveillance 28 . moreover, progressive elimination of malaria from a region needs to ensure that parasites are not reintroduced into areas that have been successfully cleared, necessitating a planned strategy for phasing that should be informed by connectivity and mobility patterns 26 . here we develop methods for identifying and mapping road connectivity communities in a flexible, hierarchical way. moreover, we map 'bridge' areas of low connectivity between communities and apply these new methods to the african continent. finally, we show how these can be weighted by data on disease prevalence to better understand pathogen connectivity, using p. falciparum malaria as an example. african road network data. data on the african road network (arn) were obtained from gps navigation and cartography as described in a previous study 24 . the dataset maps primary and secondary roads across the continent, and while it does have commercial restrictions, it is a more complete and consistent dataset than alternative open road datasets (e.g. openstreetmap 29 , groads 30 ). visual inspection and comparison between the arn and other spatial road inventories validated the improved accuracy and consistency of arn, however a quantitative validation analysis was not possible due to the lack of consistent ground-truth data at continental scales. figure 1a shows the african road network data used in this analysis. the road network dataset is a commercial restricted product and requests for it can be directly addressed to garmin 31 . plasmodium falciparum malaria prevalence and population maps. to demonstrate how geographically referenced data on disease occurrence or prevalence can be integrated into the approaches outlined, gridded data on plasmodium falciparum malaria prevalence were obtained from the malaria atlas project (http:// www.map.ox.ac.uk/). these represent modelled estimates of the prevalence of p. falciparum parasites in 2015 per 5 × 5 km grid square across africa 32 . additionally, gridded data on estimated population totals per 1 × 1 km grid square across africa in 2015 were obtained from the worldpop program (http://www.worldpop.org/). the population data were aggregated to the same 5 × 5 km gridding as the malaria data, and then multiplied together to obtain estimates of total numbers of p. falciparum infections per 5 × 5 km grid square. detecting communities in the african road network. we modeled the arn as a'primal' road network, where roads are links and road junctions are nodes 33 . spatial road networks have, as any network embedded in two dimensions, physical spatial constraints that impose on them a grid-like structure. in fact, the arn primal network is composed of 300, 306 road segments that account for a total length of 2, 304, 700 km, with an average road length of 7.6 km ± 13.2 km. such large standard deviations, as already observed elsewhere 23, 24, 34 , are due to the long tailed distribution of road lengths, as illustrated in fig. 1c . another property of road network structure is the frequency distribution of the degree of nodes, defined as the number of links connected to each node. most networks in nature and society have a long tail distribution of node degree, implying the existence of hubs (nodes that connect to a large amount of other nodes) 21 , with the majority of nodes connecting to very few others. 
for road networks, however, the degree distribution strongly peaks around 3, indicating that most of the roads are connected with two other roads. the long tail distribution of the length of road segments, coupled with the peaked degree distribution, indicates the presence of a translationally invariant grid-like structure, in which road density smoothly varies among regions while their connectivity and structure do not. within such grid-like structures it is very difficult to identify clustered communities, i.e. groups of roads that are more connected within themselves than to other groups. this observation is confirmed by the spatial distribution of betweenness centrality (bc), which measures the number of times the shortest paths between each pair of nodes pass through a road. the probability distribution of bc is long tailed (fig. 1d), while its spatial distribution spreads across the entire network, with a structural backbone form, as shown in fig. 1b. again, under such conditions and because of the absence of bottlenecks, any strategy to detect communities that employs pruning on bc values 35 will be minimally effective. to detect communities in road networks we follow the observation that human displacement in urban networks is guided by straight lines 36 . therefore, geometry can be used to detect communities of roads by assuming that people tend to move more along streets than between streets. we developed a community detection pipeline that converts a primal road network, where roads are links and road junctions are nodes 33 , to a dual network representation, where links are nodes and street junctions are links between nodes 37 , by means of straightness and contiguity of roads. it is important to note here that the units of analysis are road segments, which here are typically short and straight between intersections, making the straightness assumption valid. community detection in the dual network is then performed using a modularity optimization algorithm 38 . the communities found in the dual network are then mapped back to the original primal road network. these communities encode information about the geometry of the road pattern but can also incorporate weights associated with a particular disease to guide the process of community detection. nodes in the dual network represent lines in the primal network. the conversion from primal to dual is done by using a modified version of the algorithm known as continuity negotiation 37 . in brief, we assume that a pair of adjacent edges belongs to the same street if the angle θ between these edges is smaller than θ_c = 30°. we also assume that the angle between two adjacent edges (i, j) and (j, p) is given by the dot product $\cos(\theta) = \mathbf{r}_{i,j} \cdot \mathbf{r}_{j,p} / (\|\mathbf{r}_{i,j}\| \, \|\mathbf{r}_{j,p}\|)$, where $\mathbf{r}_{i,j} = \mathbf{r}_j - \mathbf{r}_i$. under these assumptions, the angle between two edges belonging to a perfect straight line is zero, while it assumes a value of 90° for perpendicular edges. our algorithm starts by searching for the edge that generates the longest road in the primal space, as can be seen in fig. 2a. then, a node is created in the dual space and assigned to this road. next, we search for the edge that generates the second longest road, and a new node is created in the dual space and assigned to this road. if there is at least one intersection between the new road and the previous one, we connect the respective nodes in the dual space. the algorithm continues until all the edges in the primal space are assigned to a node in the dual space, as shown in fig. 2b.
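a compact sketch of the angle criterion used in the primal-to-dual conversion; the segments are hypothetical and the pairwise test below is only the building block, not the full continuity-negotiation procedure:

```python
import numpy as np

def angle_deg(e1, e2):
    # e1 = (i, j), e2 = (j, p): vectors r_ij = r_j - r_i and r_jp = r_p - r_j
    r_ij = np.array(e1[1]) - np.array(e1[0])
    r_jp = np.array(e2[1]) - np.array(e2[0])
    cos_t = np.dot(r_ij, r_jp) / (np.linalg.norm(r_ij) * np.linalg.norm(r_jp))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

THETA_C = 30.0
seg_a = ((0.0, 0.0), (1.0, 0.0))
seg_b = ((1.0, 0.0), (2.0, 0.2))     # nearly straight continuation
seg_c = ((1.0, 0.0), (1.0, 1.0))     # perpendicular branch

print(angle_deg(seg_a, seg_b) < THETA_C)   # True: same street, same dual node
print(angle_deg(seg_a, seg_c) < THETA_C)   # False: different streets
```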
note that the conversion from primal to the dual road network has been used extensively to estimate human perception and movement along road networks (space syntax, see 36 ) , which also supports our use of road geometry to detect communities. despite the regular structure of the network in the primal space, the topology of these networks in the dual space is very rich. for instance the degree distribution in dual space follows the power-law p(k) k −γ . this property has been previously identified in urban networks 33 and it is strongly related to the long tailed distribution of road lengths in these networks (see fig. 1c ). since most of the roads are short, most of the nodes in dual space will have a small number of connections. on the other hand, there are a few long roads (fig. 2a ) that originate at hubs in the dual space (fig. 2b ). our approach for detecting communities in road networks consists then in performing classical community detection in the dual representation ( fig. 2c) and then bringing the result back to the primal representation, as shown in fig. 2d . the algorithm used to detect the communities is the modularity-based algorithm by clauset and newman 35 . the hierarchical mapping of communities on the african road network, with outputs for 10, 20, 30 and 40 sets of communities, is shown in fig. 3 . the maps highlight how connectivity rarely aligns with national borders, with the areas most strongly connected through dense road networks typically straddling two or more countries. the hierarchical nature of the approach is illustrated through the breakdown of the 10 large regions in fig. 3a into further sub-regions in b, c and d, emphasizing the main structural divides within each region in mapped in 3a. some large regions appear consistently in each map, for example, a single community spans the entire north african coast, extending south into the sahara. south africa appears as wholly contained within a single community, while the horn of africa containing somalia and much of ethiopia and kenya in consistently mapped as one community. the four maps shown are example outputs, but any number of communities can be identified. the clustering that maximises modularity produces 104 communities, and these are mapped in fig. 4 . even with division into 104 communities, the north africa region remains as a single community, strongly separated from sub-saharan africa by large bridge regions. south africa also remains as almost wholly within its own community, with somalia and namibia showing similar patterns. the countries with the largest numbers of communities tend to be those with the least dense infrastructure equating to poor connectivity, such as drc and angola, though west africa also shows many distinct clusters, especially within nigeria. apart from the sahara, the largest bridge regions of poor connectivity are located across the central belt of sub-saharan africa, where population densities are low and transport infrastructure is both sparse and often poor. the communities mapped in figs 3 and 4 align in many cases with recorded population and pathogen movements. for example, the broad southern and eastern community divides match well those seen in hiv-1 subtype analyses 12 and community detection analyses based on migration data 27 . at more regional scales, there also exist similarities with prior analyses based on human and pathogen movement patterns. for example, the western, coastal and northern communities within kenya in fig. 
4b , identified previously through mobile phone and census derived movement data 39, 40 . further, guinea, liberia and sierra leone typically remain mostly within a single community in fig. 3 , with some divides evident in fig. 4c . this shows some strong similarities with the spread of ebola virus through genome analysis 15 , particularly the multiple links between rural guinea and sierra leone, though fig. 4c highlights a divide between the regions containing conakry and freetown when africa is broken into the 104 communities. figure 3 highlights the connections between kinshasa in western drc and angola, with the recent yellow fever outbreak spreading within the communities mapped. figure 4d shows the'best' communities map for an area of southern africa, and the strong cross-border links between swaziland, southern mozambique and western south africa are mapped within a single community, as well as wider links highlighted in fig. 3 , matching the travel patterns found from swaziland malaria surveillance data 41 . integrating p. falciparum malaria prevalence and population data with road networks for weighted community detection. the previous section outlined methods for community detection on unweighted road networks. to integrate disease occurrence, prevalence or incidence data for the identification of areas of likely elevated movement of infections or for guiding the identification of operational control units, an adaptation to weighted networks is required. we demonstrate this through the integration of the data on estimated numbers of p. falciparum infections per 5 × 5 km grid square into the community detection pipeline. the final pipeline for community detection calculated a trade-off between form and function of roads in order to obtain a network partition. the form is related to the topology of the road network and is taken into account during the primal-dual conversion. the topological component guarantees that only neighbor and well connected locations could belong to the same community. the functional part, on the other hand, is calculated by the combination of estimated p. falciparum malaria prevalence multiplied by population to obtain estimated numbers of infections, as outlined above. the two factors were combined to form a weight to each edge of our primal network. the weight w i, j of edge (i, j) is defined as where m(r) is the p. falciparum malaria prevalence and p(r) is the population count, both at coordinate r. these values are obtained directly from the data. when the primal representation is converted into its dual version, the weights of primal edges, given by eq. 1, are converted into weights of dual nodes, which are defined as where i represents the i th dual node and ω i represents the set of all the primal edges that were combined together to form the dual node i (see fig. 2a,b) . finally, weights for the dual edges are created from the weights of dual nodes, by simply assuming the dual network weighted by values of λ i,¯j was used as input for a weighted community detection algorithm. ultimately, when the communities detected in the dual space are translated back to primal space, we have that neighbor locations with similar values of estimated p. falciparum infections belong to the same communities. for the example of p. falciparum malaria used here, the max function was used, representing maximum numbers of infections on each road segment in 2015. this was chosen to identify connectivity to the highest burden areas. 
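the exact functional forms of eqs. 1-3 are not reproduced here, so the aggregation in the sketch below is an assumption based on the description above: each primal edge takes the maximum of prevalence times population over the grid cells it crosses, and a dual node inherits the maximum weight of its member segments:

```python
import numpy as np

def edge_weight(cells_prevalence, cells_population):
    # cells_*: values of the 5 x 5 km grid squares crossed by the road segment
    # assumption: max of (prevalence x population) along the segment, as described above
    return float(np.max(np.asarray(cells_prevalence) * np.asarray(cells_population)))

def dual_node_weight(primal_edge_weights):
    # assumption: a road (dual node) takes the maximum weight of its member segments
    return max(primal_edge_weights)

# hypothetical segment crossing three grid squares
w = edge_weight([0.12, 0.30, 0.05], [800, 1500, 200])
print(w, dual_node_weight([w, 310.0, 96.0]))
```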
areas with large numbers of infections are often 'sources' , with infected populations moving back and forward from them spreading parasites elsewhere 6, 42 . therefore, mapping which regions are most strongly connected to them is of value. alternative metrics can be used however, depending on the aims of the analyses. the integration of p. falciparum malaria prevalence and population (fig. 5a ) through weighting road links by the maximum values across them produces a different pattern of communities (fig. 5b) to those based solely on network structure (fig. 3) . the mapping of 20 communities is shown here, as it identifies key regions of known malaria connectivity, as outlined below. the mapping shows areas of key interest in malaria elimination efforts connected across national borders, such as much of namibia linked to southern angola 43 , but the zambezi region of namibia more strongly linked to the community encompassing neighbouring zambia, zimbabwe and botswana 44 . in namibia, malaria movement communities identified through the integration of mobile phone-based movement data and case-based risk mapping 26 show correspondence in mapping a northeast community. moreover, swaziland is shown as being central to a community covering, southern mozambique and the malaria endemic regions of south africa, matching closely the origin locations of the majority of internationally imported cases to swaziland and south africa 41, 45, 46 . the movements of people and malaria between the highlands and southern and western regions of uganda, and into rwanda 47 , also aligns with the community patterns shown in fig. 5b . finally, though quantifying different factors, the analyses show a similar east-west split to that found in analyses of malaria drug resistance mutations 6, 48 and malaria movement community mapping 27 . the emergence of new disease epidemics is becoming a regular occurrence, and drug and insecticide resistance are continuing to spread around the world. as global, regional and local efforts to eliminate a range of infectious diseases continue and are initiated, an improved understanding of how regions are connected through human transport can therefore be valuable. previous studies have shown how clusters of connectivity exist within the global air transport network 49, 50 and shipping traffic network 50 , but these represent primarily the sources of occasional long-distance disease or vector introductions 1, 8 , rather than the mode of transport that the majority of the population uses regularly. the approaches presented here focused on road networks provide a tool for supporting the design of disease and resistance surveillance and control strategies through mapping (i) areas of high connectivity where pathogen circulation is likely to be high, forming coherent units of intervention; (ii) areas of low connectivity between communities that form likely natural borders of lower pathogen exchange; (iii) key link routes between communities for targetting surveillance efforts. the outputs of the analyses presented here highlight how highly connected areas consistently span national borders. with infectious disease control, surveillance, funding and strategies principally implemented country by country, this emphasises a mismatch in scales and the need for cross-border collaboration. such collaborations are being increasingly seen, for example with countries focused on malaria elimination (e.g. 
51,52), but the outputs here show that the most efficient disease elimination strategies may need to reconsider units of intervention, moving beyond being constrained by national borders. results from the analysis of pathogen movements elsewhere confirm these international connections (e.g. 6,12,41,48), building up additional evidence on how pathogen circulation can be substantially more prevalent in some regions than others. the approaches developed here provide a complement to other approaches for defining and mapping regional disease connectivity and mobility 9. previously, census-based migration data has been used to map blocks of countries of high and low connectivity 27, but these analyses are restricted to national scales and cover only longer-term human mobility. efforts are being made to extend these to subnational scales 53,54, but they remain limited to large administrative unit scales and the same long timescales. mobile phone call detail records (cdrs) have also been used to estimate and map pathogen connectivity 26,40, but the nature of the data means that they do not include cross-border movements, so they remain limited to national-level studies. an increasing number of studies are uncovering patterns in human and pathogen movements and connectivity through travel history questionnaires (e.g. 41,47,55,56), resulting in valuable information, but typically limited to small areas and short time periods. there exist a number of limitations to the methods and outputs presented here that future work will aim to address. firstly, the hierarchies of road types are not currently taken into account in the network analyses, meaning that a major highway and small local roads contribute equally to community detection and epidemic spreading. the lack of reliable data on road typologies, and inconsistencies in classifications between countries, make this challenging to incorporate, however. moreover, the relative importance of a major road versus secondary roads, tertiary roads and tracks is exceptionally difficult to quantify within a country, let alone between countries and across africa. finally, data on seasonal variations in road access do not exist consistently across the continent. our focus has therefore been on connectivity, in terms of how well regions are connected based on existing road networks, irrespective of the ease of travel. a broader point that deserves future research is that while intuition suggests a correspondence in most places, connectivity may not always translate into human or pathogen movement. future directions for the work presented here include quantitative comparison and integration with other connectivity data, the integration of different pathogen weightings, and the extension to other regions of the world. qualitative comparisons outlined above show some good correspondence with analyses of alternative sources of connectivity and disease data. a future step will be to compare these different connections and communities quantitatively to examine the weight of evidence for delineating areas of strong and weak connectivity. this could potentially follow similar studies looking at community structure on weighted networks, such as in the us based on commuting data 57, or the uk and belgium from mobile network data 58,59. here, p. falciparum malaria was used to provide an example of the potential for weighting analyses by pathogen occurrence, prevalence, incidence or transmission suitability.
moreover, future work will examine the integration of alternative pathogen weightings. the maximum difference method was used here to pick out regions well connected to areas of high p. falciparum burden, but the potential exists to use different weighting methods depending on requirements, strategic needs, and the nature of the pathogen being studied. despite the rapid growth of air travel, shipping and rail in many parts of the world, roads continue to be the dominant route on which humans move on sub-national, national and regional scales. they form a powerful force in shaping the development of areas, facilitating trade and economic growth, but also bringing with them the exchange of pathogens. results here show that their connectivity is not equal, however, with strong clusters of high connectivity separated by bridge regions of low network density. these structures can have a significant impact on how pathogens spread, and by mapping them, a valuable evidence base to guide disease surveillance as well as control and elimination planning can be built.

results were produced through four main phases.

phase 1: road network cleaning and weighted adjacency list production: the road cleaning operation aimed to produce a road network from the georeferenced vectorial network of road infrastructure. this phase was conducted using esri arcmap 10.4 (http://desktop.arcgis.com/en/arcmap/) through the use of the topological cleaning tool. the tool integrates contiguous roads, removes very short links and removes overlapping road segments. road junctions were created using the polyline to node conversion tool, while road-link association was computed using the spatial join tool. malaria prevalence values were assigned to each road using the spatial join tool. the adjacency matrix output, also containing the coordinates of each road junction, was extracted in the form of a text file.

phase 2: conversion from the primal to the dual network: the primal network created in phase 1 was then used as input for a continuity negotiation-like algorithm. the goal of this algorithm was to translate the primal network into its dual representation (see fig. 2a,b). the implementation of the negotiation-like algorithm used the igraph library in c++ (http://igraph.org/c/) on an octa-core imac. the conversion took around 20 hours for a primal network with ~200k nodes. the algorithm works by first identifying roads composed of many contiguous edges in the primal space. two primal edges are assumed to be contiguous if the angle between them is not greater than 30°. because the dual representation generated by the algorithm strongly depends on the starting edge, we started by looking for the edge that produces the longest road. as soon as this edge was found, a dual node was created to represent that road. next we proceeded to look for the edge that produced the second longest road and created a dual node for that road. we continued this process until every primal edge had been assigned to a road. finally, dual nodes were connected to each other if their primal counterparts (roads) crossed each other in the primal space.

phase 3: community detection: we used a traditional modularity optimization-based algorithm to identify communities in the dual representation of the road network. the modularity metrics were computed in r using the igraph library (http://igraph.org/r/). to incorporate the prevalence of malaria, we used the malaria prevalence values as edge weights for community detection.
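As a small illustration of phase 3, the sketch below runs modularity-based community detection on a toy weighted dual network in Python with networkx; it is not the authors' R/igraph code, and the road names and weights are hypothetical stand-ins for the malaria-derived weights described above.

```python
# Weighted modularity-based community detection on a toy dual network.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

dual = nx.Graph()
# dual nodes are roads; an edge means the two roads cross in the primal network
dual.add_weighted_edges_from([
    ("road_a", "road_b", 120.0),   # weight ~ estimated infections on the link
    ("road_b", "road_c", 15.0),
    ("road_c", "road_d", 300.0),
    ("road_a", "road_c", 10.0),
])

communities = greedy_modularity_communities(dual, weight="weight")
for i, community in enumerate(communities):
    print(i, sorted(community))
```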
phase 4: mapping communities. detected communities were mapped back to the primal road network with the use of the spatial join tool in arcmap. all maps were produced in arcmap. global transport networks and infectious disease spread severe acute respiratory syndrome h5n1 influenza-continuing evolution and spread geographic dependence, surveillance, and origins of the 2009 influenza a (h1n1) virus the global tuberculosis situation and the inexorable rise of drug-resistant disease the transit phase of migration: circulation of malaria and its multidrug-resistant forms in africa population genomics studies identify signatures of global dispersal and drug resistance in plasmodium vivax air travel and vector-borne disease movement mapping population and pathogen movements unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza h3n2 the blood dna virome in 8,000 humans spatial accessibility and the spread of hiv-1 subtypes and recombinants the early spread and epidemic ignition of hiv-1 in human populations spread of yellow fever virus outbreak in angola and the democratic republic of the congo 2015-16: a modelling study virus genomes reveal factors that spread and sustained the ebola epidemic commentary: containing the ebola outbreak-the potential and challenge of mobile network data world development report 2009: reshaping economic geography population distribution, settlement patterns and accessibility across africa in 2010 the structure of transportation networks elementary processes governing the evolution of road networks urban street networks, a comparative analysis of ten european cities the scaling structure of the global road network street centrality and densities of retail and services in bologna integrating rapid risk mapping and mobile phone call record data for strategic malaria elimination planning international population movements and regional plasmodium falciparum malaria elimination strategies cross-border malaria: a major obstacle for malaria elimination information technology outreach services -itos-university of georgia. global roads open access data set, version 1 (groadsv1) the effect of malaria control on plasmodium falciparum in africa between the network analysis of urban streets: a primal approach random planar graphs and the london street network. 
the eur finding community structure in very large networks networks and cities: an information perspective the network analysis of urban streets: a dual approach modularity and community structure in networks the use of census migration data to approximate human movement patterns across temporal scales quantifying the impact of human mobility on malaria travel patterns and demographic characteristics of malaria cases in swaziland human movement data for malaria control and elimination strategic planning malaria risk in young male travellers but local transmission persists: a case-control study in low transmission namibia the path towards elimination reviewing south africa's malaria elimination strategy (2012-2018): progress, challenges and priorities targeting imported malaria through social networks: a potential strategy for malaria elimination in swaziland association between recent internal travel and malaria in ugandan highland and highland fringe areas multiple origins and regional dispersal of resistant dhps in african plasmodium falciparum malaria the worldwide air transportation network: anomalous centrality, community structure, and cities' global roles the complex network of global cargo ship movements asian pacific malaria elimination network mapping internal connectivity through human migration in malaria endemic countries census-derived migration data as a tool for informing malaria elimination policy key traveller groups of relevance to spatial malaria transmission: a survey of movement patterns in four subsaharan african countries infection importation: a key challenge to malaria elimination on bioko island, equatorial guinea an economic geography of the united states: from commutes to megaregions redrawing the map of great britain from a network of human interactions uncovering space-independent communities in spatial networks e.s., m.p.v. and a.j.t. conceived and designed the analyses. e.s. and m.p.v. designed the road network community mapping methods and undertook the analyses. all authors contributed to writing and reviewing the manuscript. competing interests: the authors declare no competing interests.publisher's note: springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.open access this article is licensed under a creative commons attribution 4.0 international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. 
key: cord-024552-hgowgq41 authors: zhang, ruixi; zen, remmy; xing, jifang; arsa, dewa made sri; saha, abhishek; bressan, stéphane title: hydrological process surrogate modelling and simulation with neural networks date: 2020-04-17 journal: advances in knowledge discovery and data mining doi: 10.1007/978-3-030-47436-2_34 sha: doc_id: 24552 cord_uid: hgowgq41 environmental sustainability is a major concern for urban and rural development. actors and stakeholders need economic, effective and efficient simulations in order to predict and evaluate the impact of development on the environment and the constraints that the environment imposes on development. numerical simulation models are usually computationally expensive and require expert knowledge. we consider the problem of hydrological modelling and simulation. with a training set consisting of pairs of inputs and outputs from an off-the-shelf simulator, we show that a neural network can learn a surrogate model effectively and efficiently and thus can be used as a surrogate simulation model. moreover, we argue that the neural network model, although trained on some example terrains, is generally capable of simulating terrains of different sizes and spatial characteristics.

an article in the nikkei asian review dated 13 september 2019 warns that both the cities of jakarta and bangkok are sinking fast. these iconic examples are far from being the only human developments under threat. the united nations office for disaster risk reduction reports that the lives of millions were affected by the devastating floods in south asia and that around 1,200 people died in bangladesh, india and nepal [30]. climate change, increasing population density, weak infrastructure and poor urban planning are the factors that increase the risk of floods and aggravate consequences in those areas. under such scenarios, urban and rural development stakeholders are increasingly concerned with the interactions between the environment and urban and rural development. in order to study such complex interactions, stakeholders need effective and efficient simulation tools. a flood occurs with a significant temporary increase in discharge of a body of water. among the variety of factors leading to floods, heavy rain is one of the most prevalent [17]. when heavy rain falls, water overflows from river channels and spills onto the adjacent floodplains [8]. the hydrological process from rainfall to flood is complex [13]. it involves nonlinear, time-varying interactions between rain, topography, soil types and other components associated with the physical process. several physics-based hydrological numerical simulation models, such as hec-ras [26], lisflood [32] and lisflood-fp [6], are commonly used to simulate floods. however, such models are usually computationally expensive, and expert knowledge is required both for their design and for accurate parameter tuning. we consider the problem of hydrological modelling and simulation. neural network models are known for their flexibility, efficient computation and capacity to deal with nonlinear correlations inside data. we propose to learn a flood surrogate model by training a neural network with pairs of inputs and outputs from the numerical model. we empirically demonstrate that the neural network can be used as a surrogate model to effectively and efficiently simulate the flood. the neural network that we train learns a general model.
with the model trained on a given data set, the neural network is capable of directly simulating spatially different terrains. moreover, while a neural network is generally constrained to a fixed size of its input, the model that we propose is able to simulate terrains of different sizes and spatial characteristics. this paper is structured as follows. section 2 summarises the main related works regarding physics-based hydrological and flood models as well as statistical machine learning models for flood simulation and prediction. section 3 presents our methodology. section 4 presents the data set, parameter settings and evaluation metrics. section 5 describes and evaluates the performance of the proposed models. section 6 presents the overall conclusions and outlines future directions for this work. current flood models simulate the fluid movement by solving equations derived from physical laws with many hydrological process assumptions. these models can be classified into one-dimensional (1d), two-dimensional (2d) and three-dimensional (3d) models depending on the spatial representation of the flow. the 1d models treat the flow as one-dimensional along the river and solve the 1d saint-venant equations, such as hec-ras [1] and swmm [25]. the 2d models receive the most attention and are perhaps the most widely used models for floods [28]. these models solve different approximations of the 2d saint-venant equations. two-dimensional models such as hec-ras 2d [9] have been implemented for simulating floods in the assiut plateau in southwestern egypt [12] and in the bolivian amazonia [23]. another 2d flow model, lisflood-fp, solves a dynamic wave model by neglecting the advection term, which reduces the computational complexity [7]. the 3d models are more complex and mostly unnecessary, as 2d models are adequate [28]. therefore, we focus our work on 2d flow models. instead of conceptual physics-based models, several statistical machine learning-based models have been utilised [4,21]. one state-of-the-art machine learning model is the neural network model [27]. tompson [29] uses a combination of neural network models to accelerate the simulation of fluid flow. bar-sinai [5] uses neural network models to study the numerical partial differential equations of fluid flow in two dimensions. raissi [24] developed physics-informed neural networks for solving general partial differential equations and tested them on the scenario of incompressible fluid movement. dwivedi [11] proposes a distributed version of physics-informed neural networks and studies the case of the navier-stokes equations for fluid movement. besides the idea of accelerating the computation of partial differential equations, some neural networks have been developed in an entirely data-driven manner. ghalkhani [14] develops a neural network for a flood forecasting and warning system in the madarsoo river basin in iran. khac-tien [16] combines the neural network with a fuzzy inference system for daily water level forecasting. other authors [31,34] apply neural network models to predict floods from collected gauge measurements. those models, implementing neural networks in one dimension, did not take spatial correlations into account. the authors of [18,35] use combinations of convolutional and recurrent neural networks as surrogate models for navier-stokes-based fluid models in higher dimensions.
the recent work [22] develops a convolutional neural network model to predict floods in two dimensions by taking the spatial correlations into account. the authors focus on one specific region of the colorado river. it uses a convolutional neural network and a conditional generative adversarial network to predict the water level at the next time step. the authors conclude that neural networks can achieve high approximation accuracy while being a few orders of magnitude faster. instead of focusing on one specific region and learning a model specific to the corresponding terrain, our work focuses on learning a general surrogate model applicable to terrains of different sizes and spatial characteristics with a data-driven machine learning approach. we propose to train a neural network with pairs of inputs and outputs from an existing flood simulator. the output provides the necessary supervision. we choose the open-source python library landlab, which is lisflood-fp based. we first define our problem in subsect. 3.1. then, we introduce the general ideas of the numerical flood simulation model and landlab in subsect. 3.2. finally, we present our solution in subsect. 3.3.

we first introduce the representation of the three hydrological parameters that we use in the two-dimensional flood model. a digital elevation model (dem) d is a w × l matrix representing the elevation of a terrain surface. a water level h is a w × l matrix representing the water elevation of the corresponding dem. a rainfall intensity i generally varies spatially and should be a matrix representing the rainfall intensity; however, the current simulator assumes that the rainfall does not vary spatially, so in our case i is a constant scalar. our work intends to find a model that can represent the flood process. the flood happens because the rain drives the water level to change over the terrain region. the model receives three inputs: a dem d, the water level h^t and the rainfall intensity i^t at the current time step t. the model outputs the water level h^(t+1) as the result of the rainfall i^t on dem d. the learning process can be formulated as learning the function l such that h^(t+1) = l(d, h^t, i^t).

physics-driven hydrology models for floods in two dimensions are usually based on the two-dimensional shallow water equations, which are a simplified version of the navier-stokes equations averaged over the depth direction [28]. by ignoring the diffusion of momentum due to viscosity, turbulence, wind effects and coriolis terms [10], the two-dimensional shallow water equations include two parts, the conservation of mass and the conservation of momentum, shown in eqs. 1 and 2:

∂h/∂t + ∂(hu)/∂x + ∂(hv)/∂y = 0, (1)

∂(hu)/∂t + ∂(hu²)/∂x + ∂(huv)/∂y + gh ∂(h + z)/∂x + gh s_fx = 0,
∂(hv)/∂t + ∂(huv)/∂x + ∂(hv²)/∂y + gh ∂(h + z)/∂y + gh s_fy = 0, (2)

where h is the water depth, g is the gravity acceleration, (u, v) are the velocities in the x and y directions, z(x, y) is the topography elevation function, and s_fx, s_fy are the friction slopes [33], which are estimated with the friction coefficient η as s_fx = η²u√(u² + v²)/h^(4/3) and s_fy = η²v√(u² + v²)/h^(4/3). for the two-dimensional shallow water equations there are no analytical solutions; therefore, many numerical approximations are used. lisflood-fp is a simplified approximation of the shallow water equations, which reduces the computational cost by ignoring the convective acceleration term (the second and third terms of the two equations in eq. 2) and utilising an explicit finite difference numerical scheme. lisflood-fp first calculates the flow between pixels [20]. for simplicity, we show the 1d version of the equations in the x-direction in eq. 3; the 1d result is directly transferable to 2d due to the uncoupled nature of those equations [3].
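Before turning to the per-pixel update described next, the following deliberately simplified Python sketch illustrates the overall structure of such an explicit scheme: face fluxes driven by the water-surface gradient, followed by a mass-balance update of the water level with rainfall as the only input. It ignores friction and stability control, so it is a sketch of the structure of eqs. 3-5 rather than the exact LISFLOOD-FP or landlab formulation.

```python
# Simplified explicit scheme in the spirit of LISFLOOD-FP (no friction term).
import numpy as np

def step(h, z, qx, qy, rain, dt=1.0, dx=30.0, g=9.81):
    """One explicit update: face fluxes from the water-surface gradient, then
    a mass balance on each pixel with uniform rainfall as the only input."""
    eta = h + z                                   # water surface elevation
    hfx = np.minimum(h[:, :-1], h[:, 1:])         # flow depth at x-faces
    hfy = np.minimum(h[:-1, :], h[1:, :])         # flow depth at y-faces
    qx = qx - g * hfx * dt * (eta[:, 1:] - eta[:, :-1]) / dx   # flux update
    qy = qy - g * hfy * dt * (eta[1:, :] - eta[:-1, :]) / dx
    dh = np.zeros_like(h)                         # mass balance per pixel
    dh[:, :-1] -= qx * dt / dx
    dh[:, 1:]  += qx * dt / dx
    dh[:-1, :] -= qy * dt / dx
    dh[1:, :]  += qy * dt / dx
    return np.maximum(h + dh + rain * dt, 0.0), qx, qy

h = np.zeros((64, 64)); z = np.random.rand(64, 64) * 5
qx = np.zeros((64, 63)); qy = np.zeros((63, 64))
for _ in range(10):
    h, qx, qy = step(h, z, qx, qy, rain=1e-5)
```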
then, for each pixel, its water level h is updated according to eq. 4. to sum up, for each pixel at location i, j, the solution derived from lisflood-fp can be written in the format shown in eq. 5, where h^t_{i,j} is the water level at location i, j at time step t, or in general as h^(t+1) = θ(d, h^t, i^t). however, the numerical solution θ is computationally expensive and embeds assumptions about the hydrological processes in a flood. moreover, the numerical solution θ demands extensive parameter tuning against high-resolution two-dimensional water level measurements, as mentioned in [36]. therefore, we use such a numerical model to generate pairs of inputs and outputs for the surrogate model. we choose the lisflood-fp-based open-source python library landlab [2], since it is a popular simulator in regional two-dimensional flood studies. landlab includes tools and process components that can be used to create hydrological models over a range of temporal and spatial scales. in landlab, the rainfall and friction coefficients are considered to be spatially constant, and evaporation and infiltration are both temporally and spatially constant. the inputs of landlab are a dem and a time series of rainfall intensity; the output is a time series of water levels. we propose here that a neural network model can provide an alternative solution for such a complex hydrological dynamic process. neural networks are well known as collections of nonlinear connected units, flexible enough to model the complex nonlinear mechanisms behind such processes [19]. moreover, a neural network can be easily implemented on general purpose graphics processing units (gpus) to boost its speed. in the numerical solution of the shallow water equations shown in subsect. 3.2, the two-dimensional spatial correlation is important to predict the water level in a flood. therefore, inspired by the capacity of neural networks to extract spatial correlation features, we intend to investigate whether a neural network model can learn the flood model l effectively and efficiently. we propose a small and flexible neural network architecture. in the numerical solution of eq. 5, the water level of each pixel at the next time step is only correlated with the surrounding pixels. therefore, we use, as input, a 3 × 3 sliding window on the dem with the corresponding water levels and rain at each time step t. the output is the corresponding 3 × 3 water level at the next time step t + 1. the pixels at the boundary have different hydrological dynamic processes; therefore, we pad both the water level and the dem with zero values, and we expect the neural network model to learn the different hydrological dynamic processes at the boundaries. one advantage of our proposed architecture is that the neural network is not restricted by the input size of the terrain for either training or testing. therefore, it is a general model that can be used for any terrain size. figure 1 illustrates the proposed architecture on a region of size 6 × 6. in this section, we empirically evaluate the performance of the proposed model. in subsect. 4.1, we describe how to generate synthetic dems. subsect. 4.2 presents the experimental setup to test our method on synthetic dems as a micro-evaluation. subsect. 4.3 presents the experimental setup for the case of the onkaparinga catchment. subsect. 4.4 presents details of our proposed neural network. subsect. 4.5 shows the evaluation metrics of our proposed model. in order to generate synthetic dems, we modify alexandre delahaye's work 1.
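Before describing the data generation, the following minimal PyTorch sketch illustrates the sliding-window surrogate outlined above. The channel sizes are illustrative assumptions and do not reproduce the exact architecture (or the 169-parameter count) reported below; only the inputs (3 × 3 DEM and water-level windows plus a scalar rain intensity broadcast to 3 × 3) and the 3 × 3 water-level output follow the description in the text.

```python
# Minimal sketch of the 3x3 sliding-window surrogate (illustrative layer sizes).
import torch
import torch.nn as nn

class WindowSurrogate(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(3),
            nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(),
            nn.BatchNorm2d(4),
            nn.Conv2d(4, 1, kernel_size=3, padding=1),
        )

    def forward(self, dem, h, rain):
        # dem, h: (batch, 1, 3, 3); rain: (batch, 1) broadcast to a 3x3 map
        rain_map = rain.view(-1, 1, 1, 1).expand(-1, 1, 3, 3)
        x = torch.cat([dem, h, rain_map], dim=1)
        return self.net(x)                      # water-level window at t + 1

model = WindowSurrogate()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
dem = torch.rand(8, 1, 3, 3); h = torch.rand(8, 1, 3, 3); rain = torch.rand(8, 1)
loss = nn.functional.mse_loss(model(dem, h, rain), h)   # dummy target for the sketch
loss.backward(); opt.step()
```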
we arbitrarily set the size of the dems to 64 × 64 and their resolution to 30 metres. we generate three types of dems in our data set that resemble real-world terrain surfaces, as shown in fig. 2a, namely a river in a plain, a river with a mountain on one side and a plain on the other, and a river in a valley with mountains on both sides. we evaluate the performance in two cases. in case 1, the network is trained and tested with one dem. this dem has a river in a valley with mountains on both sides, as shown in fig. 2a right. in case 2, the network is trained and tested with 200 different synthetic dems. the data set is generated with landlab. for all the flood simulations in landlab, the boundary condition is set to be closed on all four sides. this means that rainfall is the only source of water in the whole region. the roughness coefficient is set to 0.003. we control the initial process, rainfall intensity and duration time for each sample. different initial processes ensure different initial water levels across the whole region. after the initial process, the system runs for 40 h with no rain for stabilisation. we run the simulation for 12 h and record the water levels every 10 min. therefore, for one sample, we record a total of 72 time steps of water levels. table 1 summarises the parameters for generating samples in both case 1 and case 2. the onkaparinga catchment, located at the lower onkaparinga river, south of adelaide, south australia, has experienced many notable floods, especially in 1935 and 1951, and much research and many reports have addressed this region [15]. we obtain two dems with sizes 64 × 64 and 128 × 128 from the australian intergovernmental committee on surveying and mapping's elevation information system 2. figure 2b shows the dem of the lower onkaparinga river. we implement the neural network model under three cases. in case 3, we train and test on the 64 × 64 onkaparinga river dem. in case 4, we test the 64 × 64 onkaparinga river dem directly with the model trained in case 2. in case 5, we test the 128 × 128 onkaparinga river dem directly with the model trained in case 2. we generate the data set for both the 64 × 64 and 128 × 128 dems from landlab. the initial process, rainfall intensity and rain duration time of both dems are controlled in the same way as in case 1. the architecture of the neural network model is visualised in fig. 1. it first upsamples the rain input to 3 × 3 and concatenates it with the 3 × 3 water level input. this is followed by several batch normalisation and convolutional layers. the activation functions are relu, and all convolutional layers use 'same' padding. the neural network has 169 parameters in total. the model is trained with adam with a learning rate of 10^-4. the batch size for training is 8. the data set has been split with a ratio of 8:1:1 for training, validation and testing. the number of training epochs is 10 for cases 1 and 3, and 5 for case 2. we train the neural network model on a machine with a 3 ghz amd ryzen 7 1700 8-core processor, 64 gb of ddr4 memory and an nvidia gtx 1080 ti gpu with 3584 cuda cores and 11 gb of memory. the operating system is ubuntu 18.04. in order to evaluate the performance of our neural network model, we use global measurement metrics for the overall flood in the whole region; these metrics include the global mean squared error and the global mean absolute percentage error (mape). case 5 is to test the scalability of our model for dems of different sizes.
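The exact metric definitions are not reproduced here; the following short sketch shows the standard global mean squared error and mean absolute percentage error over the water-level grid, which is what these global metrics are assumed to correspond to.

```python
# Sketch of the assumed global evaluation metrics over the simulated grid.
import numpy as np

def global_mse(h_true, h_pred):
    """Mean squared error of the simulated water level over the whole grid."""
    return float(np.mean((h_true - h_pred) ** 2))

def global_mape(h_true, h_pred, eps=1e-8):
    """Mean absolute percentage error over pixels with non-negligible depth."""
    mask = np.abs(h_true) > eps
    return float(np.mean(np.abs((h_true[mask] - h_pred[mask]) / h_true[mask])) * 100)

h_true = np.random.rand(64, 64)
h_pred = h_true + 0.01 * np.random.randn(64, 64)
print(global_mse(h_true, h_pred), global_mape(h_true, h_pred))
```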
in table 2b , for global performance, the mape of case 5 is around 50% less than both case 3 and case 4, and for local performance, the mape of case 5 is 34.45%. similarly, without retraining the existed model, the trained neural network from case 2 can be applied directly on dem with different size with a good global performance. we present the time needed for the flood simulation of one sample in landlab and in our neural network model (without the training time) in table 3 . the average time of the neural network model for a 64 × 64 dem is around 1.6 s, while it takes 47 s in landlab. furthermore, for a 128 × 128 dem, landlab takes 110 more time than the neural network model. though the training of the neural network model is time consuming, it can be reused without further training or tuning terrains of different sizes and spatial characteristics. it remains effective and efficient (fig. 4 ). we propose a neural network model, which is trained with pairs of inputs and outputs of an off-the-shelf numerical flood simulator, as an efficient and effective general surrogate model to the simulator. the trained network yields a mean absolute percentage error of around 20%. however, the trained network is at least 30 times faster than the numerical simulator that is used to train it. moreover, it is able to simulate floods on terrains of different sizes and spatial characteristics not directly represented in the training. we are currently extending our work to take into account other meaningful environmental elements such as the land coverage, geology and weather. hec-ras river analysis system, user's manual, version 2 the landlab v1. 0 overlandflow component: a python tool for computing shallow-water flow across watersheds improving the stability of a simple formulation of the shallow water equations for 2-d flood modeling a review of surrogate models and their application to groundwater modeling learning data-driven discretizations for partial differential equations a simple raster-based model for flood inundation simulation a simple inertial formulation of the shallow water equations for efficient two-dimensional flood inundation modelling rainfall-runoff modelling: the primer hec-ras river analysis system hydraulic userś manual numerical solution of the two-dimensional shallow water equations by the application of relaxation methods distributed physics informed neural network for data-efficient solution to partial differential equations integrating gis and hec-ras to model assiut plateau runoff flood hydrology processes and their variabilities application of surrogate artificial intelligent models for real-time flood routing extreme flood estimation-guesses at big floods? 
water down under 94: surface hydrology and water resources papers the data-driven approach as an operational real-time flood forecasting model analysis of flood causes and associated socio-economic damages in the hindukush region deep fluids: a generative network for parameterized fluid simulations fully convolutional networks for semantic segmentation optimisation of the twodimensional hydraulic model lisfood-fp for cpu architecture neural network modeling of hydrological systems: a review of implementation techniques physics informed data driven model for flood prediction: application of deep learning in prediction of urban flood development application of 2d numerical simulation for the analysis of the february 2014 bolivian amazonia flood: application of the new hec-ras version 5 physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations storm water management model-user's manual v. 5.0. us environmental protection agency hydrologic engineering center hydrologic modeling system, hec-hms: interior flood modeling decentralized flood forecasting using deep neural networks flood inundation modelling: a review of methods, recent advances and uncertainty analysis accelerating eulerian fluid simulation with convolutional networks comparison of the arma, arima, and the autoregressive artificial neural network models in forecasting the monthly inflow of dez dam reservoir lisflood: a gis-based distributed model for river basin scale water balance and flood simulation real-time waterlevel forecasting using dilated causal convolutional neural networks latent space physics: towards learning the temporal evolution of fluid flow in-situ water level measurement using nirimaging video camera acknowledgment. this work is supported by the national university of singapore institute for data science project watcha: water challenges analytics. abhishek saha is supported by national research foundation grant number nrf2017vsg-at3dcm001-021. key: cord-256707-kllv27bl authors: zhang, jun; cao, xian-bin; du, wen-bo; cai, kai-quan title: evolution of chinese airport network date: 2010-09-15 journal: physica a doi: 10.1016/j.physa.2010.05.042 sha: doc_id: 256707 cord_uid: kllv27bl with the rapid development of the economy and the accelerated globalization process, the aviation industry plays a more and more critical role in today’s world, in both developed and developing countries. as the infrastructure of aviation industry, the airport network is one of the most important indicators of economic growth. in this paper, we investigate the evolution of the chinese airport network (can) via complex network theory. it is found that although the topology of can has remained steady during the past few years, there are many dynamic switchings inside the network, which have changed the relative importance of airports and airlines. moreover, we investigate the evolution of traffic flow (passengers and cargoes) on can. it is found that the traffic continues to grow in an exponential form and has evident seasonal fluctuations. we also found that cargo traffic and passenger traffic are positively related but the correlations are quite different for different kinds of cities. 
ranging from biological systems to economic and social systems, many real-world complex systems can be represented by networks, including chemical-reaction networks, neuronal networks, food webs, telephone networks, the world wide web, railroad and airline routes, social networks and scientific-collaboration networks [1][2][3]. obviously, real networks are neither regular lattices nor simple random networks. since the small-world network model [4] and the scale-free network model [5] were put forward at the end of the last century, people have found that many real complex networks are actually associated with the small-world property and a scale-free, power-law degree distribution. in the past ten years, the theory of complex networks has drawn continuous attention from diverse scientific communities, on topics such as network modelling [6][7][8], synchronization [9,10], information traffic [11][12][13][14], epidemic spreading [15,16], cascading failures [17][18][19][20], evolutionary games [21][22][23][24][25], social dynamics [26] etc. one interesting and important research direction is understanding transportation infrastructures in the framework of complex network theory [27][28][29][30][31][32][33][34]. with the acceleration of the globalization process, the aviation industry plays a more and more critical role in the economy, and many scientists have paid special attention to the airline transportation infrastructure. complex network theory is naturally a useful tool, since the airports can be denoted by vertices and the flights can be denoted by edges. in the past few years, some interesting research has been reported studying airport networks from the point of view of network theory. for example, amaral et al. and guimerà et al. comprehensively investigated the worldwide airport network (wan) [35,36]. they found that wan is a typical scale-free small-world network and that the most connected nodes in wan are not necessarily the most central nodes, which means critical locations might not coincide with highly-connected hubs in the infrastructure. this interesting phenomenon inspired them to propose a geographical-political-constrained network model. barrat et al. studied weighted airport networks and found correlations between weighted quantities and topology [37,38]. they proposed a weighted evolving network model to expand our understanding of the weighted features of real systems. furthermore, they proposed a global epidemic model to study the role of wan in the prediction and predictability of global epidemics. also, several empirical works on the chinese airport network [39][40][41] and the indian airport network [42] reveal that national airport networks can exhibit different properties from the global scale of wan, i.e., a two-regime power-law degree distribution and the disassortative mixing property. as the aviation industry is an important indicator of economic growth, it is necessary and very meaningful to investigate the evolution of the airport network. recently, gautreau et al. studied the us airport network in the time period 1990-2000. they found that most statistical indicators are stationary and that intense activity takes place at the microscopic level, with many disappearing/appearing links between airports [43]. rocha studied the brazilian airport network (ban) in the time period 1995-2006.
he also found the network structure is dynamic, with changes in the importance of airports and airlines, and that the traffic on ban has doubled during a period in which the topology of ban has shrunk [44]. inspired by their interesting work, we investigate the evolution of the chinese airport network (can) from the year 1950 to 2008 (1991-2008 for detailed traffic information and 2002-2009 for detailed topology information). it is found that the airway traffic volume increased in an exponential form while the topology had no significant change. the paper is organized as follows. in the next section, the description of the can data is presented. the statistical analysis of can topology is given in section 3. in section 4, we analyze the evolution of traffic flow on can. the paper is concluded in the last section. the airport network is the backbone of the aviation industry. it includes airports and direct flights linking airport pairs. since the aviation industry is closely related to economic development, we first investigate the development of the chinese economy, airports and flights. fig. 1(a) shows the development of chinese gdp from 1950 to 2008. one can see that it has greatly increased in those 58 years. however, the development of airlines (fig. 1(b)) and airports (fig. 1(c)) is not consistent with that of gdp. for the development of airports (fig. 1(c)), one can see that the number of airports grew in 1987-1995. although the airline infrastructure (e.g., airports and airlines) does not keep growing due to various constraints, the traffic on can keeps growing with the gdp, as shown in fig. 2. it should be noted that: • the timetable contains both domestic and international airlines. as we only focus on the domestic information, the international airlines are excluded. • since ref. [45] is a statistical yearbook edited by caac, it contains not only the scheduled flights but also the temporary flights, whereas the timetables only comprise the scheduled flights. thus the number of airlines in the timetable is smaller than the data in ref. [45] by about 150. • airports in one city are viewed as one airport. for instance, there are 3 airports in shanghai and chengdu, and 2 airports in beijing. • the timetables are not perfectly consistent with real flights due to weather or emergencies. fig. 3(a) shows the degree distribution p(k) of can, which follows a two-regime power-law distribution with two different exponents as in refs. [39][40][41] (i.e., for small degrees, p(k) ∝ k^(λ1) with λ1 = −0.49; and for large degrees, p(k) ∝ k^(λ2) with λ2 = −2.63). we also investigated the directed can, and it is found that p(k_in) and p(k_out) are almost the same as p(k). fig. 3(b) shows the correlation between k_in and k_out. one can see that the in-out degree correlation is very strong: the slope is 1.000421. this means that one can fly from one airport to another and return using the same airline. another important topological property is the degree-degree correlation. it is defined as the mean degree of the neighbors (k_nn, which is closely related to the network modularity [46]) of a given airport. fig. 3(c) shows the results of the degree-degree correlation of the undirected can, and we can find that the degrees of adjacent airports have a significant linear anti-correlation. fig. 3(d) exhibits the relationship of clustering coefficient c and degree k.
as it shows, lower-degree nodes have larger clustering coefficients. all the results above are well in accordance with the results reported by li and cai [39], liu and zhou [40] and liu et al. [41]. in networks, a node participating in more shortest paths is usually more important; thus the betweenness is proposed to quantify a node's importance in traffic [47]. fig. 4 shows the relation between degree and betweenness. one can see that betweenness generally obeys an exponential function of degree, but there exist three nodes whose betweenness is obviously much larger: urumqi, xi'an and kunming. the three nodes are all located in west china: kunming is the central city of the southwest, xi'an is the central city of the northwest and urumqi is the central city of the far northwest. the western population needs to be connected to the political centers (e.g., beijing) and economic centers (e.g., shanghai and shenzhen) in the east. however, due to the long distance from western china to eastern china (over 3000 km), it is costly and unnecessary to make all western airports directly link to the eastern airports. thus some transit airports are naturally formed as the bridge between east and west china.

now we study the evolution of the topological properties of can. (table 1 caption: evolution of topology parameters of can from year 2002 to 2009, where k is the average degree, k_in the ingoing degree, k_out the outgoing degree, d the average shortest path length, D the network diameter, c the clustering coefficient and r a reciprocity parameter measuring the asymmetry of directed networks, defined in [44] with a_ij = 1 if there is a direct flight from airport i to j and a_ij = 0 otherwise; the data of ban are reproduced from ref. [44].) it can be seen from table 1 that the topological properties of can do not significantly change from 2002 to 2009. similarly, the topological properties of the brazilian airport network did not significantly change during a long period of time [44]. next we make a comparison between the two networks. fig. 5(a) compares the average shortest path length d of can and ban. one can see that d of can is around 2.25 and is slightly smaller than that of ban. fig. 5(b) shows the diameter D, which is also slightly smaller in can. this means that can is more convenient for passengers. table 2 gives detailed results of shortest paths of can in the first half year of 2009: about 10% of paths are direct connections and over 98% of paths consist of no more than 2 flights. fig. 5(c) shows that the average clustering coefficient c of can is apparently larger than that of ban, and fig. 5(d) shows that can is more reciprocal than ban. from the discussions above, we know that can is an asymmetric small-world network with a two-regime power-law degree distribution, a high clustering coefficient, a short average path length, a negative degree-degree correlation, a negative clustering-degree correlation and an exponential betweenness-degree correlation. although the topology characteristics of can are quite steady from year 2002 to 2009, a dynamic switching process underlies the evolution of can. fig. 6 shows the measured fluctuation of can from year 2002 to 2009. fig. 6(a) shows the fluctuation of airports, and we can see that the fluctuation (including the added and removed airports) is usually between 5 and 15, but for the second half of 2007 and the first half of 2008 the fluctuation is evidently more vigorous.
fig. 6(b) shows that the percentage of changed airlines is usually smaller than 20%, and the majority of changes is induced by aoo and doo. but for the second half of 2007 and the first half of 2008, when many airports were added and removed, aon and dor contribute the majority of changes. this section investigates the evolution of traffic on can. as shown in fig. 7, the traffic (including cargoes and passengers) has evident seasonal fluctuations, as in the united states. if the seasonal fluctuations are averaged out, one finds that the traffic of can increases exponentially. we can also observe similar growth (fig. 8). fig. 9 displays the cumulative distribution of node strength s, namely the throughput of each airport, including passengers (s_passenger, see fig. 9(a)) and cargoes (s_cargo, see fig. 9(b)). the distributions are quite broad: 5 orders of magnitude for passengers and 7 for cargoes. the correlations of k and s are also presented. fig. 9(c) shows the dependence of s_passenger on k, and fig. 9(d) that of s_cargo, denoting a strong correlation between strength and topology: s_passenger ∝ k^1.58 and s_cargo ∝ k^2.20. we also examined the data from year 2002 to 2007 and the results are similar. fig. 10 shows the correlations of cargo traffic and passenger traffic from year 2001 to 2008. one can find a strong linear correlation between cargo traffic and passenger traffic for both the total traffic of can and the traffic of a single airport/city. however, the ratios of cargo traffic and passenger traffic are quite different. as shown in fig. 10(a), the slope is 0.045 for the total traffic of can. for the municipalities of beijing (fig. 10(b)) and shanghai (fig. 10(c)), the slopes are different from those of chengdu (fig. 10(d)) and kunming (fig. 10(e)); for the latter two cities, the slopes are obviously larger, indicating that the passenger traffic is more active in these two cities. in summary, we investigate the evolution of the chinese airport network (can), including the topology, the traffic and the interplay between them. we relate the evolution of can to the development of the chinese economy. the traffic on can (passengers and cargoes) grew almost linearly with chinese gdp: 1 million rmb of gdp can support about 7 passengers and 153 kg of cargo. we also found that there exists a dynamic switching process inside the network, i.e., from the year 2002 to 2009, although the main topological indicators of can were quite stationary, there were airports and airlines added and/or removed. moreover, the traffic flow (including passengers and cargoes) on can is studied. the traffic grew at an exponential rate with seasonal fluctuations, and the traffic throughput of an airport has a power-law correlation with its degree: s_passenger ∝ k^1.58 and s_cargo ∝ k^2.20. our comparative studies also show that cargo traffic and passenger traffic are positively related, but with different ratios for different kinds of cities. we also found that the outbreak of global epidemic diseases can greatly affect passenger traffic. for example, during the epidemic spreading period of severe acute respiratory syndrome (sars, 2003), the passenger traffic decreased sharply while the cargo traffic was not affected. our work can provide some insights into understanding the evolution of the airport network as affected by social factors such as the development of the economy and the outbreak of disease. proc. natl. acad. sci. usa 99 proc. natl. acad. sci. usa 97 proc. natl. acad. sci proc. natl. acad. sci. usa proc. natl. acad. sci.
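As an illustration of the kind of strength-degree analysis summarised above, the following sketch (run on a toy weighted network, not the CAN data) computes node degree and strength and fits the scaling exponent β in s ∝ k^β by linear regression in log space; the generated topology and weights are placeholders only.

```python
# Fit the strength-degree scaling exponent on a toy weighted network.
import numpy as np
import networkx as nx

g = nx.barabasi_albert_graph(500, 3, seed=1)           # toy topology
for u, v in g.edges():
    g[u][v]["weight"] = np.random.lognormal(mean=1.0)  # toy traffic weights

k = np.array([d for _, d in g.degree()])               # degree
s = np.array([d for _, d in g.degree(weight="weight")])  # strength (weighted degree)

mask = (k > 0) & (s > 0)
beta, intercept = np.polyfit(np.log(k[mask]), np.log(s[mask]), 1)
print(f"fitted exponent beta = {beta:.2f}")  # CAN reports ~1.58 (passengers), ~2.20 (cargo)
```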
usa we thank gang yan, rui jiang and mao-bin hu for their useful discussions. this work is supported by the program for new century excellent talents in university (ncet-07-0787) and the foundation for innovative research groups of the national natural science foundation of china (no. 60921001). key: cord-104001-5clslvqb authors: wang, xiaoqi; yang, yaning; liao, xiangke; li, lenli; li, fei; peng, shaoliang title: selfrl: two-level self-supervised transformer representation learning for link prediction of heterogeneous biomedical networks date: 2020-10-21 journal: biorxiv doi: 10.1101/2020.10.20.347153 sha: doc_id: 104001 cord_uid: 5clslvqb predicting potential links in heterogeneous biomedical networks (hbns) can greatly benefit various important biomedical problem. however, the self-supervised representation learning for link prediction in hbns has been slightly explored in previous researches. therefore, this study proposes a two-level self-supervised representation learning, namely selfrl, for link prediction in heterogeneous biomedical networks. the meta path detection-based self-supervised learning task is proposed to learn representation vectors that can capture the global-level structure and semantic feature in hbns. the vertex entity mask-based self-supervised learning mechanism is designed to enhance local association of vertices. finally, the representations from two tasks are concatenated to generate high-quality representation vectors. the results of link prediction on six datasets show selfrl outperforms 25 state-of-the-art methods. in particular, selfrl reveals great performance with results close to 1 in terms of auc and aupr on the neodti-net dataset. in addition, the pubmed publications demonstrate that nine out of ten drugs screened by selfrl can inhibit the cytokine storm in covid-19 patients. in summary, selfrl provides a general frame-work that develops self-supervised learning tasks with unlabeled data to obtain promising representations for improving link prediction. in recent decades, networks have been widely used to represent biomedical entities (as nodes) and their relations (as edges). predicting potential links in heterogeneous biomedical networks (hbns) can be beneficial to various significant biology and medicine problems, such as target identification, drug repositioning, and adverse drug reaction predictions. for example, network-based drug repositioning methods have already offered promising insights to boost the effective treatment of covid-19 disease (zeng et al. 2020; xiaoqi et al. 2020) , since it outbreak in december of 2019. many network-based learning approaches have been developed to facilitate link prediction in hbns. in particularly, network representation learning methods, that aim at converting high-dimensionality networks into a low-dimensional space while maximally preserve structural * to whom correspondence should be addressed: fei li (pitta-cus@gmail.com) , and shaoliang peng (slpeng@hnu.edu.cn). properties (cui et al. 2019) , have provided effective and potential paradigms for link prediction li et al. 2017) . nevertheless, most of the network representation learning-based link prediction approaches heavily depend on a large amount of labeled data. the requirement of large-scale labeled data may not be met in many real link prediction for biomedical networks (su et al. 2020) . 
to address this issue, many studies have focused on developing unsupervised representation learning algorithms that use the network structure and vertex attributes to learn low-dimension vectors of nodes in networks (yuxiao et al. 2020) , such as grarep (cao, lu, and xu 2015) , tadw (cheng et al. 2015) , line (tang et al. 2015) , and struc2vec (ribeiro, saverese, and figueiredo 2017) . however, these network presentation learning approaches are aimed at homogeneous network, and cannot applied directly to the hbns. therefore, a growth number of studies have integrated meta paths, which are able to capture topological structure feature and relevant semantic, to develop representation learning approaches for heterogeneous information networks. dong et al. used meta path based random walk and then leveraged a skip-gram model to learn node representation (dong, chawla, and swami 2017). shi et al. proposed a fusion approach to integrate different representations based on different meta paths into a single representation (shi et al. 2019 ). ji et al. developed the attention-based meta path fusion for heterogeneous information network embedding (ji, shi, and wang 2018) . wang et al. proposed a meta path-driven deep representation learning for a heterogeneous drug network (xiaoqi et al. 2020) . unfortunately, most of the meta path-based network representation approaches focused on preserving vertex-level information by formalizing meta paths and then leveraging a word embedding model to learn node representation. therefore, the global-level structure and semantic information among vertices in heterogeneous networks is hard to be fully modeled. in addition, these representation approaches is not specially designed for link prediction, thus resulting in learning an inexplicit representation for link prediction. on the other hand, self-supervised learning, which is a form of unsupervised learning, has been receiving more and more attention. self-supervised representation learning for-mulates some pretext tasks using only unlabeled data to learn representation vector without any manual annotations (xiao et al. 2020) . self-supervised representation learning technologies have been widely use for various domains, such as natural language processing, computer vision, and image processing. however, very few approaches have been generalized for hbns because the structure and semantic information of heterogeneous networks is significantly differ between domains, and the model trained on a pretext task may be unsuitable for link prediction tasks. based on the above analysis, there are two main problems in link prediction based on network representation learning. the first one is how to design a self-supervised representation learning approach based on a great amount of unlabeled data to learn low-dimension vectors that integrate the differentview structure and semantic information of hbns. the second one is how to ensure the pretext tasks in self-supervised representation learning be beneficial for link prediction of hbns. in order to overcome the mentioned issues, this study proposes a two-level self-supervised representation learning (selfrl) for link prediction in heterogeneous biomedical networks. first, a meta path detection self-supervised learning mechanism is developed to train a deep transformer encoder for learning low-dimensional representations that capture the path-level information on hbns. meanwhile, sel-frl integrates the vertex entity mask task to learn local association of vertices in hbns. 
finally, the representations from the entity mask and meta path detection is concatenated for generating the embedding vectors of nodes in hbns. the results of link prediction on six datasets show that the proposed selfrl is superior to 25 state-of-the-art methods. in summary, the contributions of the paper are listed below: • we proposed a two-level self-supervised representation learning method for hbns, where this study integrates the meta path detection and vertex entity mask selfsupervised learning task based on a great number of unlabeled data to learn high quality representation vector of vertices. • the meta path detection self-supervised learning task is developed to capture the global-level structure and semantic feature of hbns. meanwhile, vertex entity-masked model is designed to learn local association of nodes. therefore, the representation vectors of selfrl integrate two-level structure and semantic feature of hbns. • the meta path detection task is specifically designed for link prediction. the experimental results indicate that selfrl outperforms 25 state-of-the-art methods on six datasets. in particular, selfrl reveals great performance with results close to 1 in terms of auc and aupr on the neodti-net dataset. heterogeneous biomedical network a heterogeneous biomedical network is defined as g = (v, e) where v denotes a biomedical entity set, and e rep-resents a biomedical link set. in a heterogeneous biomedical network, using a mapping function of vertex type φ(v) : v → a and a mapping function of relation type ψ(e) : e → r to associate each vertex v and each edge e, respectively. a and r denote the sets of the entity and relation types, where |a| + |r| > 2. for a given heterogeneous network g = (v, e), the network schema t g can be defined as a directed graph defined over object types a and link types r, that is, t g = (a, r). the schema of a heterogeneous biomedical network expresses all allowable relation types between different type of vertices, as shown in figure 1 . figure 1 : schema of the heterogeneous biomedical network that includes four types of vertices (i.e., drug, protein, disease, and side-effect). network representation learning plays a significant role in various network analysis tasks, such as community detection, link prediction, and node classification. therefore, network representation learning has been receiving more and more attention during recent decades. network representation learning aims at learning low-dimensional representations of network vertices, such that proximities between them in the original space are preserved (cui et al. 2019 ). the network representation learning approaches can be roughly categorized into three groups: matrix factorizationbased network representation learning approaches, random walk-based network representation learning approaches, and neural network-based network representation learning approaches (yue et al. 2019 ). the matrix factorization-based network representation learning methods extract an adjacency matrix, and factorize it to obtain the representation vectors of vertices, such as, laplacian eigenmaps (belkin and niyogi 2002) and the locally linear embedding methods (roweis and saul 2000) . the traditional matrix factorization has many variants that often focus on factorizing the high-order data matrix, such as, grarep (cao, lu, and xu 2015) and hope (ou et al. 2016) . inspired by the word2vec (mikolov et al. 2013) ... 
inspired by word2vec (mikolov et al. 2013), random walk-based methods such as deepwalk (perozzi, alrfou, and skiena 2014), node2vec (grover and leskovec 2016), and metapath2vec/metapath2vec++ (dong, chawla, and swami 2017) transform a network into node sequences. these models were later extended by struc2vec (ribeiro, saverese, and figueiredo 2017) for the purpose of better modeling structural identity. over the past years, neural network models have been widely used in various domains, and they have also been applied to network representation learning. in neural network-based network representation learning, different methods adopt different learning architectures and use various network information as input. for example, line (tang et al. 2015) aims at embeddings that preserve both local and global network structure properties, sdne (wang, cui, and zhu 2016) and dngr (cao 2016) were developed using a deep autoencoder architecture, and graphgan (wang et al. 2017) adopts generative adversarial networks to model the connectivity of nodes. predicting potential links in hbns can greatly benefit various important biomedical problems. this study proposes selfrl, a two-level self-supervised representation learning algorithm, to improve the quality of link prediction. the flowchart of the proposed selfrl is shown in figure 2. considering that meta paths reflect heterogeneous characteristics and rich semantics, selfrl first uses a random walk strategy guided by meta paths to generate node sequences that are treated as the true paths of hbns. meanwhile, an equal number of false paths is produced by randomly replacing some of the nodes in each true path. then, based on the true paths, this work proposes a vertex entity mask self-supervised learning task to train a deep transformer encoder for learning entity-level representations. in addition, a meta path detection-based self-supervised learning task based on all true and false paths is designed to train a deep transformer encoder for learning path-level representation vectors. finally, the representations obtained from the two-level self-supervised learning tasks are concatenated to generate the embedding vectors of vertices in hbns, which are then used for link prediction. true path generation a meta path is a composite relation denoting a sequence of adjacent links between nodes a_1 and a_(l+1) in a heterogeneous network, and can be expressed in the form a_1 →(r_1) a_2 →(r_2) · · · →(r_l) a_(l+1), where r_i represents a relation schema between two object types. different adjacent links indicate distinct semantics. in this study, all the meta paths are reversible and no longer than four nodes. this is based on the results of previous studies showing that meta paths longer than four nodes may be too long to contribute informative features (fu et al. 2016). in addition, sun et al. have suggested that short meta paths are good enough, and that long meta paths may even reduce the quality of semantic meanings (sun et al. 2011). in this work, each network vertex and each meta path are regarded as a vocabulary token and a sentence, respectively. indeed, a large percentage of meta paths are biased towards highly visible objects. therefore, three key steps are defined to keep a balance between different semantic types of meta paths: (1) generate all sequences according to meta paths whose forward and reverse directional sampling probabilities are the same and equal to 0.5;
(2) count the total number of meta paths of each type, and calculate their median value (n); (3) randomly select n paths if the total number of meta paths of a type is larger than n; otherwise, select all sequences. the selected paths are able to reflect topological structures and interaction mechanisms between vertices in hbns, and will be used to design self-supervised learning tasks to learn low-dimensional representations of network vertices. false path generation the paths selected using the above procedure are treated as the true paths of hbns. an equal number of false paths is produced by randomly replacing some nodes in each of the true paths; in other words, each true path corresponds to one false path. there must be no relation between the replacement nodes and their context in the false paths, and the number of replaced nodes is less than the length of the current path. for instance, a true path (i.e., d3 p8 d4 e9) is shown in figure 2(b). during the generation of the corresponding false path, the 1st and 3rd tokens, which correspond to d3 and d4, are randomly chosen, and two nodes from the hbn, which correspond to d2 and d1, are also randomly chosen. if there is no relationship between d2 and p8, node d3 is replaced with d2; if there is a relationship between d2 and p8, another node from the network is chosen until the mentioned condition is satisfied. similarly, node d4 is replaced with d1, because there are no relations between d1 and e9 (or p8). finally, the path (i.e., d2 p8 d1 e9) is treated as a false path. meta path detection in the general language understanding evaluation benchmark, the corpus of linguistic acceptability (cola) is a binary classification task where the goal is to predict whether a sentence is linguistically acceptable or not. in addition, perozzi et al. have suggested that paths generated by short random walks can be regarded as short sentences (perozzi, alrfou, and skiena 2014). inspired by their work, this study assumes that true paths can be treated as linguistically acceptable sentences, while false paths can be regarded as linguistically unacceptable sentences. based on this hypothesis, we propose the meta path detection task, where the goal is to predict whether a path is acceptable or not. in the proposed selfrl, a set of true and false paths is fed into the deep transformer encoder for learning path-level representation vectors. selfrl maps a path of symbol representations to an output vector of continuous representations that is fed into the softmax function to predict whether a path is a true or false path. apparently, the only distinction between true and false paths is whether there is an association between the nodes of the path sequence; therefore, the meta path detection task is, to a certain extent, an extension of link prediction. in particular, when a path includes only two nodes, meta path detection is equal to link prediction. for instance, judging whether a path such as d1 s5 in figure 2(b) is a true or false path is the same as predicting whether there is a relation between d1 and s5. however, the meta path detection task is generally more difficult than link prediction, because it requires understanding long-range composite relationships between vertices of hbns. therefore, the meta path detection-based self-supervised learning task encourages the model to capture high-level structure and semantic information in hbns, thus facilitating link prediction performance.
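the false-path generation step above can be sketched as follows; this is an illustrative reading of the procedure, not the authors' code, and the function name, the number of replaced positions, and the retry limit are assumptions.

```python
# false-path generation sketch: a few positions of a true path are replaced by
# randomly drawn nodes that have no edge to their neighbouring path nodes, so
# each true path yields exactly one false path.
import random
import networkx as nx

def make_false_path(g: nx.Graph, true_path, n_replace=2, max_tries=100):
    path = list(true_path)
    # replace fewer positions than the path length, as required above
    positions = random.sample(range(len(path)), k=min(n_replace, len(path) - 1))
    nodes = list(g.nodes())
    for pos in positions:
        neighbours = [path[i] for i in (pos - 1, pos + 1) if 0 <= i < len(path)]
        for _ in range(max_tries):
            candidate = random.choice(nodes)
            # accept the candidate only if it is unrelated to the path context
            if all(not g.has_edge(candidate, nb) for nb in neighbours):
                path[pos] = candidate
                break
    return path

# e.g. the true path d3-p8-d4-e9 of figure 2(b) could become d2-p8-d1-e9
```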
in order to capture the local information on hbns, this study develops the vertex entity mask-based self-supervised learning task, in which nodes in true paths are randomly masked and the model then predicts those masked nodes. the entity mask task has been widely applied in natural language processing; however, using it to drive a heterogeneous biomedical network representation model is a less explored direction. in this work, the vertex entity mask task follows the implementation described in bert and is almost identical to the original (devlin et al. 2018). in brief, 15% of the vertex entities in true paths are randomly chosen for prediction. for each selected vertex entity, three different operations are applied to improve the model generalization performance: the selected vertex entity is replaced with the <mask> token 80% of the time, is replaced with a random node 10% of the time, and is kept unchanged the remaining 10% of the time. finally, the masked path is used to train a deep transformer encoder model according to the vertex entity mask task, where the last hidden vectors corresponding to the masked vertex entities are fed into the softmax function to predict their original vertices with a cross entropy loss. the vertex entity mask task preserves a local contextual representation of every vertex. in summary, the vertex entity mask-based self-supervised learning task captures the local associations of vertices in hbns, while the meta path detection-based self-supervised learning task enhances the global-level structure and semantic features of the hbns. therefore, the two-level representations are concatenated as the final embedding vectors that integrate structure and semantic information on hbns from different levels, as shown in figure 2(f). model architecture the model of selfrl is a deep transformer encoder, and the implementation is almost identical to the original (vaswani et al. 2017). selfrl follows the overall architecture that includes stacked self-attention, point-wise fully connected layers, and a softmax function, as shown in figure 3. multi-head attention an attention function can be described as mapping a query vector and a set of key-value pairs to an output vector. multi-head attention allows the model to jointly attend to information from different representation subspaces, and is given by multihead(q, k, v) = concat(h_1, . . . , h_h) w^o, where w^o is a parameter matrix and h_i is the attention function of the i-th subspace, given as h_i = attention(q w_i^q, k w_i^k, v w_i^v) = softmax((q w_i^q)(k w_i^k)^t / √(d_k^(h_i))) (v w_i^v). here q w_i^q, k w_i^k, and v w_i^v respectively denote the query, key, and value representations of the i-th subspace, the w matrices are the parameters that transform q, k, and v into the h_i subspaces, and d and d_k^(h_i) represent the dimensionality of the model and of the h_i submodel. (table 1: the node and edge statistics of the datasets. here, ddi, dti, dsa, dda, pda, and ppi represent the drug-drug interaction, drug-target interaction, drug-side-effect association, drug-disease association, protein-disease association, and protein-protein interaction, respectively.) position-wise feed-forward network in addition to the multi-head attention layers, the proposed selfrl model includes a fully connected feed-forward network, which consists of two linear transformations with a relu activation in between: ffn(x) = max(0, x w_1 + b_1) w_2 + b_2. the same linear transformations are applied at every position, while their parameters vary from layer to layer. residual connection for each sub-layer, a residual connection and a normalization mechanism are employed; that is, the output of each sub-layer is layernorm(x + f(x)), where x and f(x) stand for the input and the transformation function of the sub-layer, respectively.
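returning to the vertex entity mask task described earlier in this section, the sketch below mirrors the bert-style 80/10/10 masking on a node sequence; the function and token names are illustrative and not taken from the selfrl implementation. in training, only the positions with a non-empty label would contribute to the cross entropy loss, consistent with the description above.

```python
# bert-style vertex entity masking sketch: 15% of the vertices in a true path
# are selected; of those, 80% become <mask>, 10% become a random vertex, and
# 10% stay unchanged. labels mark which positions must be predicted.
import random

def mask_path(path, vocabulary, mask_rate=0.15, mask_token="<mask>"):
    masked, labels = [], []
    for node in path:
        if random.random() < mask_rate:
            labels.append(node)                           # prediction target
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)                 # 80%: replace with <mask>
            elif r < 0.9:
                masked.append(random.choice(vocabulary))  # 10%: random vertex
            else:
                masked.append(node)                       # 10%: keep original
        else:
            masked.append(node)
            labels.append(None)                           # position is not predicted
    return masked, labels
```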
in this work, the performance of selfrl is evaluated comprehensively by link prediction on six datasets, and the results of selfrl are compared with those of 25 methods. for the neodti-net dataset, the performance of selfrl is compared with those of seven state-of-the-art methods, including mscmf; the details on how to set the hyperparameters of these baseline approaches can be found in neodti (wan et al. 2018). for the deepdr-net dataset, the link prediction results generated by selfrl are compared with those of seven baseline algorithms, including deepdr (zeng et al. 2019), dtinet (luo et al. 2017), kernelized bayesian matrix factorization (kbmf) (gonen and kaski 2014), support vector machine (svm) (cortes and vapnik 1995), random forest (rf) (breiman 2001), random walk with restart (rwr) (cao et al. 2014), and katz (singh-blom et al. 2013). the details of the baseline approaches and hyperparameter selection can be found in deepdr (zeng et al. 2019). for the single network datasets, selfrl is compared with 11 network representation methods, namely laplacian eigenmaps (belkin and niyogi 2003), singular value decomposition (svd), graph factorization (gf) (ahmed et al. 2013), hope (ou et al. 2016), grarep (cao, lu, and xu 2015), deepwalk (perozzi, alrfou, and skiena 2014), node2vec (grover and leskovec 2016), struc2vec (ribeiro, saverese, and figueiredo 2017), line (tang et al. 2015), sdne (wang, cui, and zhu 2016), and gae (kipf and welling 2016); more implementation details can be found in bionev (yue et al. 2019). the hyperparameters of the baseline methods were set to their default values, and the original data of neodti (wan et al. 2018), deepdr (zeng et al. 2019), and bionev (yue et al. 2019) were used in the experiments. the parameters of the proposed selfrl follow those of bert (devlin et al. 2018), in which the number of transformer blocks (l), the number of self-attention heads (a), and the hidden size (h) are set to 12, 12, and 768, respectively. for the neodti-net dataset, the embedding vectors are fed into the inductive matrix completion model (imc) (jain and dhillon 2013) to predict dtis. the number of negative samples, randomly chosen from negative pairs, is ten times that of the positive samples, following the guidelines in neodti (wan et al. 2018). then, to reduce data bias, ten-fold cross-validation is performed repeatedly ten times, and the average value is reported. for the deepdr-net dataset, a collective variational autoencoder (cvae) is used to predict ddas. all positive samples and an equal number of negative samples randomly selected from unknown pairs are used to train and test the model, following the guidelines in deepdr (zeng et al. 2019); five-fold cross-validation is then performed repeatedly 10 times. for the neodti-net and deepdr-net datasets, the area under the precision-recall curve (aupr) and the area under the receiver operating characteristic curve (auc) are adopted to evaluate the link prediction performance of all approaches. for the other datasets, the representation vectors are fed into a logistic regression binary classifier for link prediction; the training set (80%) and the testing set (20%) consist of an equal number of positive samples and negative samples randomly selected from all the unknown interactions, following the guidelines in bionev.
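a hedged sketch of the single-network evaluation protocol described above is given below: node embeddings are turned into edge features (here with a hadamard product, one of several common operators) and fed to a logistic regression classifier, with auc and aupr computed on a held-out split. the feature operator and split details are assumptions for illustration.

```python
# link prediction evaluation sketch: hadamard edge features, logistic
# regression classifier, and auc/aupr on a held-out 20% test split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_link_prediction(emb, pos_pairs, neg_pairs):
    x = np.array([emb[u] * emb[v] for u, v in pos_pairs + neg_pairs])
    y = np.array([1] * len(pos_pairs) + [0] * len(neg_pairs))
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    scores = clf.predict_proba(x_te)[:, 1]
    return roc_auc_score(y_te, scores), average_precision_score(y_te, scores)
```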
the performance of the different methods is evaluated by accuracy (acc), auc, and f1 score. the overall performance of all methods for dti prediction on the neodti-net dataset is presented in figure 4. selfrl shows great results, with auc and aupr values close to 1, and significantly outperforms the baseline methods. in particular, neodti and dtinet were specially developed for the neodti-net dataset; however, selfrl is still superior to both, improving the aupr by approximately 10% and 15%, respectively. the results of dda prediction of selfrl and the baseline methods are presented in figure 5. these experimental results demonstrate that selfrl generates better dda predictions on the deepdr-net dataset than the baseline methods; however, the improvements of selfrl in terms of auc and aupr are less than 2%. a major reason for this modest advantage is that selfrl considers only four types of objects and edges, whereas deepdr includes 12 types of vertices and 11 types of edges of drug-related data. in addition, deepdr specifically integrates a multi-modal deep autoencoder (mda) and a cvae model to improve dda prediction on the deepdr-net dataset, and the selfrl+cvae combination may disturb the original balance between the mda and cvae. the above results and analysis indicate that the proposed selfrl is a powerful network representation approach for complex heterogeneous networks and can achieve very promising results in link prediction. such a good performance of the proposed selfrl is due to the following facts: (1) selfrl designs a two-level self-supervised learning task to integrate the local associations of a node and the global-level information of hbns; (2) the meta path detection self-supervised learning task, which is an extension of link prediction, is specially designed for link prediction; in particular, path detection on two nodes is equal to link prediction, so the representations generated by meta path detection are able to facilitate link prediction performance; (3) selfrl uses meta paths to integrate the structural and semantic features of hbns. in this section, the link prediction results on four single network datasets are presented to further verify the representation capability of selfrl. the results are summarized in table 2, and the best results are marked in boldface. selfrl shows higher accuracy in link prediction on the four single networks compared to the other 11 baseline approaches. in particular, the proposed selfrl achieves an approximately 2% improvement in terms of auc and acc over the second best method on the string-ppi dataset. the auc value of link prediction on the ndfrt-dda dataset is improved from 0.963 to 0.971 when selfrl is compared with grarep, whereas grarep only achieves an enhancement of 0.001 compared to line, the third best method on the string-ppi dataset; therefore, the improvement of selfrl is significant in comparison to the enhancement of grarep over line. meanwhile, we also notice that selfrl has only a small advantage over the second best method on the ctd-dda and drugbank-ddi datasets. one possible reason for this result is that the structure and semantics of the ctd-dda and drugbank-ddi datasets are simple and monotonous, so most network representation approaches are able to achieve good performance on them.
consequently, the proposed selfrl is a promising representation method for single network datasets, and contributes to link prediction by introducing a two-level self-supervised learning task. in neodti and deepdr, low-dimensional representations of nodes in hbns are first learned by network representation approaches and then fed into classifier models for predicting potential links among vertices. to further examine the contribution of the network representation approaches, the low-dimensional representation vectors are fed into an svm, a traditional and popular classifier, for link prediction. the experimental results of these combinations are shown in table 3. selfrl achieves the best performance in link prediction for complex heterogeneous networks, providing an improvement of over 10% with regard to auc and aupr compared to neodti and deepdr. with the change of classifiers, the auc of selfrl in link prediction drops from 0.988 to 0.962 on the neodti-net dataset, while the auc value of neodti drops by approximately 9%; interestingly, the results on the deepdr-net dataset are similar. therefore, the experimental results indicate that the network representation performance of selfrl is more robust and better than those of the other embedding approaches. this is mainly because selfrl integrates a two-level self-supervised learning model to fuse the rich structure and semantic information from different views; meanwhile, path detection is an extension of link prediction, yielding better representations for link prediction. the emergence and rapid expansion of covid-19 have posed a global health threat. recent studies have demonstrated that the cytokine storm, namely the excessive inflammatory response, is a key factor leading to death in patients with covid-19. therefore, it is urgent and important to discover potential drugs that prevent the cytokine storm in covid-19 patients. meanwhile, it has been proven that interleukin (il)-6 is a potential target of the anti-inflammatory response, and drugs targeting il-6 are promising agents for blocking the cytokine storm in severe covid-19 patients (mehta et al. 2020). in the experiments, selfrl is used for drug repositioning for covid-19, with the aim of discovering agents that bind to il-6 to block the cytokine storm in patients. the low-dimensional representation vectors generated by selfrl are fed into the imc algorithm to predict the confidence scores between il-6 and each drug in the neodti-net dataset. then, the top-10 agents with the highest confidence scores are selected as potential therapeutic agents for covid-19 patients. the 10 candidate drugs and their anti-inflammatory mechanisms of action in silico are shown in table 4. the knowledge from pubmed publications demonstrates that nine out of the ten drugs are able to reduce the release and expression of il-6 and thereby exert anti-inflammatory effects in silico. meanwhile, three drugs (i.e., dasatinib, carvedilol, and indomethacin) inhibit the release of il-6 by reducing the mrna levels of il-6, whereas imatinib inhibits the function of human monocytes to prevent the expression of il-6. in addition, although the anti-inflammatory mechanisms of action of five agents (i.e., arsenic trioxide, irbesartan, amiloride, propranolol, and sorafenib) are uncertain, these agents can still reduce the release or expression of il-6 and thus produce anti-inflammatory effects.
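before summarizing these findings, the sketch below illustrates the repositioning step described above in a simplified form: an imc-style bilinear scorer produces confidence scores between the il-6 embedding and every drug embedding, and the top-10 drugs are returned. the function name and the learned matrix w are assumptions, not the exact neodti-net pipeline.

```python
# simplified drug-ranking sketch: bilinear confidence scores between a target
# embedding (e.g. il-6) and each drug embedding, followed by a top-k selection.
import numpy as np

def top_k_drugs(drug_emb: dict, target_vec: np.ndarray, w: np.ndarray, k: int = 10):
    scores = {d: float(v @ w @ target_vec) for d, v in drug_emb.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```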
therefore, the top ten agents predicted by selfrl-based drug repositioning can be considered for inhibiting the cytokine storm in patients with covid-19 and should be taken into account in clinical studies on covid-19. these results further indicate that the proposed selfrl is a powerful network representation learning approach that can facilitate link prediction in hbns. in this study, selfrl uses transformer encoders to learn representation vectors through the proposed vertex entity mask and meta path detection tasks. meanwhile, the entity- and path-level representations are evaluated separately and jointly; the corresponding dti and dda prediction results on the neodti-net and deepdr-net datasets are shown in table 5. (table 5: the dti and dda prediction results of selfrl and baseline methods on the neodti-net and deepdr-net datasets. mlth and clth stand for the mean and concatenation values of the representations from the last two hidden layers, respectively; atlre denotes the mean value of the two-level representation from the last hidden layer.) selfrl achieves the best performance. meanwhile, the results show that the two-level representation is superior to the single-level representations. interestingly, concatenating the vectors from the last two hidden layers improves the link prediction performance compared to taking their mean value for each level of the representation model. this is intuitive, since the two-level representation can fuse the structural and semantic information from different views of hbns, and a larger number of dimensions can provide more and richer information. this study proposes a two-level self-supervised representation learning method, termed selfrl, for link prediction in heterogeneous biomedical networks. the proposed selfrl designs a meta path detection-based self-supervised learning task and integrates a vertex entity-level mask task to capture the rich structure and semantics from two-level views of hbns. the results of link prediction indicate that selfrl is superior to 25 state-of-the-art approaches on six datasets. in the future, we will design more self-supervised learning tasks over unlabeled data to improve the representation performance of the model, and we will also develop an effective multi-task learning framework within the proposed model. references:
- distributed large-scale natural graph factorization
- drug-target interaction prediction through domain-tuned network-based inference
- laplacian eigenmaps and spectral techniques for embedding and clustering
- laplacian eigenmaps for dimensionality reduction and data representation
- new directions for diffusion-based network prediction of protein function: incorporating pathways with confidence
- deep neural network for learning graph representations
- grarep: learning graph representations with global structural information
- network representation learning with rich text information
- support-vector networks
- a survey on network embedding
- bert: pre-training of deep bidirectional transformers for language understanding
- predicting drug target interactions using meta-path-based semantic network analysis
- kernelized bayesian matrix factorization
- node2vec: scalable feature learning for networks
- provable inductive matrix completion
- attention based meta path fusion for heterogeneous information network embedding
- variational graph auto-encoders. arxiv: machine learning
- random forests
- deepcas: an end-to-end predictor of information cascades
- predicting drug-target interaction using a novel graph neural network with 3d structure-embedded graph representation
- a network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information
- covid-19: consider cytokine storm syndromes and immunosuppression. the lancet
- drug-target interaction prediction by learning from local information and neighbors
- distributed representations of words and phrases and their compositionality
- asymmetric transitivity preserving graph embedding
- deepwalk: online learning of social representations
- struc2vec: learning node representations from structural identity. in knowledge discovery and data mining
- nonlinear dimensionality reduction by locally linear embedding
- heterogeneous information network embedding for recommendation
- prediction and validation of gene-disease associations using methods inspired by social network analyses
- network embedding in biomedical data science
- pathsim: meta path-based top-k similarity search in heterogeneous information networks
- line: large-scale information network embedding
- neodti: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions
- glue: a multi-task benchmark and analysis platform for natural language understanding
- structural deep network embedding
- graphgan: graph representation learning with generative adversarial nets
- shine: signed heterogeneous information network embedding for sentiment link prediction
- semi-supervised drug-protein interaction prediction from heterogeneous biological spaces
- self-supervised learning: generative or contrastive. arxiv
- network representation learning-based drug mechanism discovery and anti-inflammatory response against
- a novel approach for drug response prediction in cancer cell lines via network representation learning
- graph embedding on biomedical networks: methods, applications and evaluations
- heterogeneous network representation learning using deep learning
- deepdr: a network-based deep learning approach to in silico drug repositioning

key: cord-127900-78x19fw4 authors: leung, abby; ding, xiaoye; huang, shenyang; rabbany, reihaneh title: contact graph epidemic modelling of covid-19 for transmission and intervention strategies date: 2020-10-06 journal: nan doi: nan sha: doc_id: 127900 cord_uid: 78x19fw4 the coronavirus disease 2019 (covid-19) pandemic has quickly become a global public health crisis unseen in recent years. it is known that the structure of the human contact network plays an important role in the spread of transmissible diseases. in this work, we study a structure-aware model of covid-19, cgem. this model becomes similar to the classical compartment-based models in epidemiology if we assume the contact network is an erdos-renyi (er) graph, i.e. everyone comes into contact with everyone else with the same probability. in contrast, cgem is more expressive and allows for plugging in the actual contact networks, or more realistic proxies for them. moreover, cgem enables more precise modelling of enforcing and releasing different non-pharmaceutical intervention (npi) strategies. through a set of extensive experiments, we demonstrate significant differences between the epidemic curves when assuming different underlying structures.
more specifically, we demonstrate that the compartment-based models overestimate the spread of the infection by a factor of 3, and, under some realistic assumptions on the compliance factor, underestimate the effectiveness of some of the npis, mischaracterize others (e.g. predicting a later peak), and underestimate the scale of the second peak after reopening. epidemic modelling of covid-19 has been used to inform public health officials across the globe, and the subsequent decisions have significantly affected every aspect of our lives, from the financial burdens of closing down businesses and the overall economic crisis, to the long-term effects of delayed education and the adverse effects of confinement on mental health. given the huge and long-term impact of these models on almost everyone in the world, it is crucial to design models that are as realistic as possible to correctly assess the costs and benefits of different intervention strategies. yet, current models used in practice have many known issues. in particular, the commonly used compartment-based models from classical epidemiology do not consider the structure of real-world contact networks. it has been shown previously that contact network structure changes the course of an infection spread significantly (keeling 2005; bansal, grenfell, and meyers 2007). in this paper, we demonstrate the structural effect of different underlying contact networks on covid-19 modelling. (figure 1: standard compartment models assume an underlying er contact network, whereas real networks have a non-random structure, as seen in the montreal wifi example. in each network, two infected patients with 5 and 29 edges are selected randomly, and the networks in comparison have the same number of nodes and edges. in the wifi network, infected patients are highly likely to spread their infection within their local communities, while in the er graph they have a wide-spread reach.) non-pharmaceutical interventions (npis) played a significant role in limiting the spread of covid-19. understanding the effectiveness of npis is crucial for more informed policy making at public agencies (see the timeline of npis applied in canada in table 2). however, the commonly used compartment-based models are not expressive enough to directly study different npis. for example, ogden et al. (2020) described the predictive modelling efforts for covid-19 within the public health agency of canada. to study the impact of different npis, they used an agent-based model in addition to a separate deterministic compartment model. one significant disadvantage of the compartment model is its inability to realistically model the closure of public places such as schools and universities. this is due to the fact that compartment models assume that each individual has the same probability of being in contact with every other individual in the population, which is rarely true in reality. only by incorporating real-world contact networks into compartment models can one disconnect network hubs to realistically simulate the effect of closure. therefore, ogden et al. (2020) needed to rely on a separate stochastic agent-based model to simulate the closure of public places. in contrast, our proposed cgem is able to directly model all npis used in practice realistically. in this work, we propose to incorporate structural information of the contact network between individuals and show the effects of npis applied on different categories of contact networks.
in this way, we can 1) more realistically model various npis, and 2) avoid the homogeneous mixing assumption imposed by compartment models and utilize different networks for different population demographics. first, we perform simulations on various synthetic and real-world networks to compare the impact of the contact network structure on the spread of the disease. second, we demonstrate that the degree of effectiveness of npis can vary drastically depending on the underlying structure of the contact network. we focus on the effects of 4 widely adopted npis: 1) quarantining infected and exposed individuals, 2) social distancing, 3) closing down non-essential workplaces and schools, and 4) the use of face masks. lastly, we simulate the effect of re-opening strategies and show that the outcome again depends on the assumed underlying structure of the contact networks. to design a realistic model of the spread of the pandemic, we used a wifi hotspot network from montreal to simulate real-world contact networks. given that our data is from montreal, we focus on studying the montreal timeline, but the basic principles are valid generally, and cgem is designed to be used with any realistic contact network. we believe that cgem can improve our understanding of the current covid-19 pandemic and inform public agencies on future npi decisions. summary of contributions: • we show that the structure of the contact networks significantly changes the epidemic curves and that the current compartment-based models are subject to overestimating the scale of the spread • we demonstrate that the degree of effectiveness of different npis depends on the assumed underlying structure of the contact networks • we simulate the effect of re-opening strategies and show that the outcome will depend again on the assumed underlying structure of the contact networks reproducibility: code for the model and synthetic network generation are in the supplementary material. the real-world data can be accessed through the original source. different approaches have accounted for network structures in epidemiological modelling. degree block approximation (barabási et al. 2016) considers the degree distribution of the network by grouping nodes with the same degree into the same block and assuming that they have the same behavior. percolation theory methods (newman 2002) can approximate the final size of the epidemic for networks with specified degree distributions. recently, sambaturu et al. (2020) designed effective vaccination strategies based on real and diverse contact networks (vogel 2020; lawson et al. 2020). various modifications have been made to the compartment differential equations to account for the network effect (aparicio and pascual 2007; keeling 2005; bansal, grenfell, and meyers 2007). simulation-based approaches are often used when the underlying networks are complex and mathematically intractable. grefenstette et al. (2013) employed an agent-based model to simulate the dynamics of the seir model with a census-based synthetic population, where the contact networks are implied by the behavior patterns of the agents. chen et al. (2020) adopted the independent cascade (ic) model (saito, nakano, and kimura 2008) to simulate the disease propagation and used a facebook network as a proxy for the contact network. social networks, however, are not always a good approximation of physical contact networks. in our study, we attempt to better ground the simulations by inferring the contact networks from wifi hub connection records.
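an illustrative sketch (not the authors' code) of inferring a contact network from such wifi hotspot logs is shown below: users whose sessions at the same hub overlap in time are connected by an edge. the record field names and the exact overlap rule are assumptions.

```python
# contact-network inference sketch: group wifi sessions by hub and connect
# pairs of distinct users whose sessions at that hub overlap in time.
import itertools
import networkx as nx

def contact_graph(records):
    """records: iterable of dicts with user_id, node_id (hub), t_in, t_out."""
    g = nx.Graph()
    by_hub = {}
    for r in records:
        by_hub.setdefault(r["node_id"], []).append(r)
    for sessions in by_hub.values():
        for a, b in itertools.combinations(sessions, 2):
            overlap = min(a["t_out"], b["t_out"]) > max(a["t_in"], b["t_in"])
            if a["user_id"] != b["user_id"] and overlap:
                g.add_edge(a["user_id"], b["user_id"])
    return g
```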
(table 2: cgem can realistically model all npis used in practice, while existing models miss one or more npis.) tuite, fisman, and greer (2020) developed a compartment model of covid-19 transmission in ontario to project outcomes such as the prevalence of hospital admissions and icu use, and death. they assumed the effect of physical-distancing measures was to reduce the number of contacts per day across the entire population. in addition, enhanced testing and contact tracing were assumed to move individuals with non-severe symptoms from the infectious to isolated compartments. in this work, we also examine the effect of closure of public places, which is difficult to simulate in a realistic manner with standard compartment models. ogden et al. (2020) described the predictive modelling efforts for covid-19 within the public health agency of canada. they estimated that more than 70% of the canadian population may be infected by covid-19 if no intervention is taken. they proposed an agent-based model and a deterministic compartment model. in the compartment model, similar to tuite, fisman, and greer (2020), the effects of physical distancing are modelled by reducing daily per capita contact rates, and the agent-based model is used to separately simulate the effects of closing schools, workplaces and other public places. in this work, we compare the effects of all npis used in practice through a unified model and show how different contact networks change the outcome of npis. in addition, ferguson et al. (2020) employed an individual-based simulation model to evaluate the impact of npis, such as quarantine, social distancing and school closure, using the number of deaths and icu bed demand as proxies to compare the effectiveness of npis. in comparison, our model can directly utilize contact networks, and we also model the impact of wearing masks. block et al. (2020) proposed three selective social distancing strategies based on the observation that epidemic dynamics depend on the network structure; the strategies aim to increase network clustering and eliminate shortcuts, and are shown to be more effective than naive social distancing. reich, shalev, and kalvari (2020) proposed a selective social distancing strategy which lowers the mean degree of the network by limiting super-spreaders; the authors also compared the impact of various npis, including testing, contact tracing, quarantine and social distancing. neural network based approaches (soures et al. 2020; dandekar and barbastathis 2020) have also been proposed to estimate the effectiveness of quarantine and forecast the spread of the disease. in a classic seir model, referred to as base seir, the dynamics of the system at each time step can be described by the following equations (aron and schwartz 1984): ds/dt = −β s i, de/dt = β s i − σ e, di/dt = σ e − γ i, dr/dt = γ i, where an individual can be in one of 4 states: (s) susceptible, (e) exposed, (i) infected (and able to infect susceptible nodes), or (r) recovered at any given time step t, and β, σ, γ are the transition rates from s to e, e to i, and i to r, respectively. similarly, in cgem, an individual can be either (s) susceptible, (e) exposed, (i) infected or (r) recovered. we do not consider reinfection, but extensions are straightforward. unlike the equation-based seir model, which assumes homogeneous mixing, cgem takes into account the contact patterns between individuals by simulating the spread of the disease over a contact network: each individual becomes a node in the network and the edges represent the connections between people. algorithm 1 shows the pseudo code for cgem. given a contact network, we assume that a node comes into contact with all its neighbours at each time step.
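since algorithm 1 is only referenced here, the sketch below shows what one cgem time step over a networkx contact graph could look like under the state-transition rules just described (susceptible neighbours of infected nodes become exposed with probability φ, exposed nodes become infected with probability σ, infected nodes recover with probability γ); it is a minimal reading of the model, not the released code. iterating this step and recording the state counts at every step yields the epidemic curves compared in the experiments below.

```python
# one cgem time step over a contact network, with node states "S", "E", "I", "R".
import random
import networkx as nx

def cgem_step(g: nx.Graph, state: dict, phi: float, sigma: float, gamma: float):
    new_state = dict(state)
    for node, s in state.items():
        if s == "I":
            for nb in g.neighbors(node):
                if state[nb] == "S" and random.random() < phi:
                    new_state[nb] = "E"          # exposure through contact
            if random.random() < gamma:
                new_state[node] = "R"            # recovery
        elif s == "E" and random.random() < sigma:
            new_state[node] = "I"                # becoming infectious
    return new_state
```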
more specifically, at each time step, each susceptible neighbour of an infected individual contracts the disease with transmission probability φ and enters the exposed state (illustrated below). we then randomly select exposed nodes to become infected with probability σ and let infected nodes recover with probability γ. following (barabási et al. 2016), the parameters of the synthetic graph generation can be adjusted to produce graphs of the same size, thus facilitating a fair comparison between different structures; we discuss the details in the following sections. inferring transmission rate by definition, β represents the likelihood that the disease is transmitted from an infected to a susceptible individual in a unit of time. barabási et al. (2016) assume that on average each node comes into contact with ⟨k⟩ neighbors; the relationship between β and the transmission rate φ can then be expressed as β = ⟨k⟩ · φ (1), where ⟨k⟩ is the average degree of the nodes. in the case of a regular random network, all nodes have the same degree, i.e. k = ⟨k⟩, and equation 1 reduces to β = k · φ (2). since the homogeneous mixing assumption made by the standard seir model can be well simulated by running cgem over a regular random network, we propose to bridge the two models with the following procedure: 1. fit the classic seir model to real data to estimate β. 2. run cgem over regular random networks with different values of k and with φ derived from equation 2. 3. choose k = k* which produces the best fit to the predictions of the classic seir model. the regular random network with average degree k* is then the contact network the classic seir model is approximating, and φ* = β/k* is the implied transmission rate. we use this transmission rate for the other contact networks studied, so that the dynamics of the disease (transmissibility) is fixed and only the structure of the contact graph changes. tuning synthetic network generators as a proxy for actual contact networks, which are often not available, we can pair cgem with synthetic networks with more realistic properties comparable to real-world networks, e.g. a heavy-tailed degree distribution and a small average shortest path. to adjust the parameters of these generators, we reframe the problem as: given transmission rate φ* and population size n, are there other networks which can produce the same infection curve? for this, we carry out a procedure similar to the one above. for example, we can run cgem with transmission rate φ* over scale-free networks generated with different values of m_ba, where m_ba is the number of edges a new node can form in the barabasi-albert algorithm (barabási et al. 2016). the m_ba which produces the best fit to the infection curve gives us a synthetic contact network that is realistic in terms of the number of edges compared to the real contact network.
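a hedged sketch of this calibration procedure is given below: a base seir curve is generated from the fitted β, σ, γ, cgem is run over k-regular random graphs with φ = β/k, and the k* with the smallest error fixes the implied transmission rate φ* = β/k*. the helper cgem_run is assumed to exist (for example, by iterating the cgem step sketched earlier and returning the daily infected counts), and the candidate range of k is arbitrary.

```python
# calibration sketch: compare cgem runs on k-regular graphs against the base
# seir infection curve and pick k* minimizing the squared error.
import numpy as np
import networkx as nx

def base_seir_curve(beta, sigma, gamma, n, days, e0=3, i0=1):
    s, e, i, r = n - e0 - i0, e0, i0, 0
    curve = []
    for _ in range(days):
        ds, de = -beta * s * i / n, beta * s * i / n - sigma * e
        di, dr = sigma * e - gamma * i, gamma * i
        s, e, i, r = s + ds, e + de, i + di, r + dr
        curve.append(i)
    return np.array(curve)

def fit_k_star(beta, sigma, gamma, n, days, cgem_run, k_values=range(4, 40, 2)):
    target = base_seir_curve(beta, sigma, gamma, n, days)
    errors = {}
    for k in k_values:
        g = nx.random_regular_graph(k, n)       # n * k must be even
        phi = beta / k                          # from eq. (2)
        errors[k] = np.mean((cgem_run(g, phi, sigma, gamma, days) - target) ** 2)
    k_star = min(errors, key=errors.get)
    return k_star, beta / k_star                # implied phi*
```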
to account for this in our model, we apply quarantining by removing all edges from a subset of exposed and infected nodes. social distancing social distancing reduces opportunities for close contact between individuals by limiting contacts to those from the same household and staying at least 6 feet apart from others when out in public. in cgem, a percentage of edges from each node are removed to simulate the effects of social distancing to different extents. wearing masks masks are shown to be effective in reducing the transmission rate of covid-19, with a relative risk (rr) of 0.608 (ollila et al. 2020). we simulate this by assigning a mask-wearing state to each node and varying the transmissibility φ based on whether the 2 nodes in contact are wearing masks or not. we define the new transmission rate under this npi, φ_mask, as follows: φ_mask = m_2 · φ if both nodes are wearing masks, m_1 · φ if 1 node is wearing a mask, and m_0 · φ otherwise. closure: removing hubs places of mass gathering (e.g. schools and workplaces) put a large number of people in close proximity. if infected individuals are present in these locations, they can have a large number of contacts and very quickly infect many others. in a network, these nodes with a high number of connections, or high degree, are known as hubs. by removing the top degree hubs, we simulate the effects of cancelling mass gatherings and closing down schools and non-essential workplaces. in cgem, we remove all edges from r% of top degree nodes to simulate the closure of schools and non-essential workplaces. however, some hubs, such as (workers in) grocery stores and some government agencies, must remain open, so we assign each hub a successful removal rate of p_success to control this effect. compliance given that the npis are complied with by the majority but not all of the individuals, we randomly assign a fixed percentage of the nodes as non-compliers. we set this to 26% in all the simulations based on a recent survey (bricker 2020). due to the economic and psychological impacts of a complete lockdown on society, it is critical to know how safe it is to resume commercial and social activities once the pandemic has stabilized. therefore, we also investigate the impact of relaxing each npi and the risk of a second wave of infection. more specifically, we simulate a complete reversal of the npis by adding back the edges that were removed when the npi was first applied, returning the underlying structure to its original form. we compare the spread of covid-19 on synthetic and real-world networks. these networks include 3 synthetic networks: (1) the regular random network, where all nodes have the same degree, (2) the erdős-rényi random network, where the degree distribution is poisson, and (3) the barabasi-albert network, where the degree distribution follows a power law. additionally, we analyzed 4 real-world networks: the usc35 network from the facebook100 dataset (traud, mucha, and porter 2012), consisting of facebook friendship links between students and staff at the university of southern california in september 2005, and 3 snapshots of a real-world wifi hotspot network from montreal, a network often used as a proxy for human contact networks when studying disease transmission (yang et al. 2020). in the montreal wifi network, edges are formed between nodes (mobile phones) that are connected to the same public wifi hub at the same time.
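before turning to the network statistics, the npi mechanisms above can be sketched as simple graph operations; the percentages, the mask multipliers m2/m1/m0, and the function names are parameters of the illustration rather than the exact cgem implementation.

```python
# npis as structural/graph operations: quarantine and social distancing remove
# edges, hub removal disconnects top-degree nodes with a success probability,
# and masks rescale the transmission probability.
import random
import networkx as nx

def quarantine(g, nodes, rate=0.75):
    for n in nodes:                              # exposed/infected nodes
        if random.random() < rate:
            g.remove_edges_from(list(g.edges(n)))

def social_distance(g, frac=0.3):
    for n in list(g.nodes()):                    # drop a fraction of each node's contacts
        edges = list(g.edges(n))
        g.remove_edges_from(random.sample(edges, int(frac * len(edges))))

def remove_hubs(g, top_percent=0.05, p_success=0.8):
    hubs = sorted(g.degree, key=lambda x: x[1], reverse=True)
    for n, _ in hubs[: int(top_percent * g.number_of_nodes())]:
        if random.random() < p_success:
            g.remove_edges_from(list(g.edges(n)))

def masked_phi(phi, mask_u, mask_v, m2=0.6, m1=0.8, m0=1.0):
    return phi * (m2 if mask_u and mask_v else m1 if mask_u or mask_v else m0)
```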
as shown in table 3, each of the 7 networks consists of 17,800 nodes, corresponding to 1/100th of the population of the city of montreal, and has between 110,000 and 220,000 edges, with the exception of the usc network. due to the aggregated nature of the usc dataset, edge sampling is enforced during the contact phase in order to obtain a reasonable disease spread. the synthetic networks are in general more closely connected than the montreal wifi networks, despite having a similar number of nodes and edges. only the largest connected component is considered in all networks. the structure of the contact network plays an important role in the spread of a disease (bansal, grenfell, and meyers 2007): it dictates how likely susceptible nodes are to come into contact with infected ones, and therefore it is crucial to evaluate how the disease spreads on each network with the same initial parameters. here, the classic seir model is fitted against the infection rates from the date of the 100th case in montreal to april 4, i.e. before any npi is applied, to obtain β. with eq. 2, the transmission rate φ is estimated to be 0.0371 and is used across all networks. in all experiments, we also seed the population with the same initial number of 3 exposed nodes and 1 infected node. the parameters used to generate the synthetic networks are obtained following the procedures described in the previous section. all results are averaged across 10 runs, and the grey shaded region shows the 95% confidence interval of each curve. as shown in figure 2, the er network fits the base seir model almost perfectly (compare the green 'er' and black 'base' curves). observation 1 cgem closely approximates the base seir model when the contact network is assumed to be an erdős-rényi graph. all networks drastically overestimate the spread of covid-19 when compared with real-world data. this can be expected to some degree, as in this experiment we are projecting the curves assuming no npi is in effect, which is not what happened in reality (see the 'real' orange curve). however, we observe that all 3 synthetic networks, including the er model, exceedingly overshoot, showing almost the entire population getting infected, whereas the real-world wifi networks predict a 3x lower peak. observation 2 assuming an erdős-rényi graph as the contact network overestimates the impact of covid-19 by more than a factor of 3 when compared with more realistic structures. in order to limit the effects of the pandemic, the federal and provincial governments introduced a number of measures to reduce the spread of covid-19. we simulate the effects of 4 different non-pharmaceutical interventions, or npis, at different strengths to determine their effectiveness. these include (1) quarantining exposed and infected individuals, (2) social distancing between nodes, (3) removing hubs, and (4) the use of face masks. quarantine we apply quarantining in our model on march 23, when both the quebec and canadian governments asked those who had returned from foreign travel or experienced flu-like symptoms to self-isolate. we remove all edges from 50, 75, and 95% of exposed and infected nodes to simulate various strengths of quarantining. figure 8 displays the effect of quarantining on the different graph structures. quarantining infected and exposed nodes both reduces and delays the peak of all infection curves; however, the peak is not delayed as much in the wifi graphs as the er graph predicts, which is important information when planning for the healthcare system.
out of all the tested npis, applying quarantine has the most profound reduction on all infection curves. observation 3 quarantining delays the peak of infection on the er graph, whereas the peak on the real-world graphs is lowered but not delayed significantly. social distancing reduces the number of close contacts; different degrees of 10%, 30%, and 50% of the edges of each node are removed to simulate this. figure 9 shows the effects of social distancing on the infection curves of each network structure. it is effective in reducing the peak of the pandemic on all networks, but again delays the peaks only on the synthetic networks. similar to observation 3, we have: observation 4 social distancing delays the peak of infection on the er graph, whereas the peak on the real-world graphs is lowered but not delayed significantly. removing hubs we remove all edges from 1% of top degree nodes to simulate the closure of schools and from 5 and 10% of top degree nodes to simulate the closure of non-essential workplaces. these npis are applied on march 23, coinciding with the dates of school and non-essential business closures in quebec. p_success is set to 0.8 unless otherwise stated. figure 10 shows the effects of removing hubs. this npi is very effective on the ba network and all 3 montreal wifi networks, since these networks have a power law degree distribution and hubs are present; however, it is not very effective on the regular and er random networks. observation 5 the er graph significantly underestimates the effect of removing hubs. removing hubs is most effective on networks with a power law degree distribution, since hubs act as super-spreaders and removing them effectively contains the virus. however, no hubs are present in the er and regular random networks, and thus removing hubs reduces to removing random nodes. luckily, real-world contact networks have power law degree distributions, making hub removal an effective strategy in practice. wearing masks we set m_2 = 0.6, m_1 = 0.8 and m_0 = 1, and use the following transmission rate φ_mask in cgem: φ_mask = 0.6 · φ if both nodes are wearing masks, 0.8 · φ if 1 node is wearing a mask, and φ otherwise. wearing masks is only able to flatten the infection curve on the synthetic networks, but does not reduce the final epidemic attack rate, i.e. the total size of the infected population, as shown in figure 11. however, in the real-world wifi networks, wearing masks is able to both flatten the curve and also significantly reduce the final epidemic attack rate. observation 6 the er graph significantly underestimates the effect of wearing masks in terms of the total decrease in the final attack rate. (figure 6: difference between the cumulative curves from wearing masks and not wearing masks. the cumulative curves represent the total impact, and the difference shows how much drop in the final attack rate is estimated with the npi enforced.) we experiment with reopening for all the npis, but for brevity we only report the results for allowing hubs back, which corresponds to the current reopening of schools and public places. the results for the other npis are available in the extended results. for removing hubs, we apply reopening on july 18 (denoted by the second vertical line in figure 7), after many non-essential businesses and workplaces were allowed to open in quebec. because the synthetic networks estimate that most of the population would be infected before the hubs are reopened, we calibrate the number of infected and recovered individuals at the point of reopening to align with the
statistics available in the real-world data. the simulation therefore continues after reopening with all the models having the same number of susceptible individuals; otherwise, in the er graph, everyone is infected at that point. we can see in figure 7 that the er and regular random networks significantly underestimate the extent of second wave infections. the ba and wifi networks all show second wave infections with a higher peak than the initial one, prompting more caution when considering reopening businesses and schools. observation 7 the er graph significantly underestimates the second peak after reopening public places, i.e. allowing back hubs. in this paper, we propose to model covid-19 on contact networks (cgem) and show that such modelling, when compared to traditional compartment-based models, gives significantly different epidemic curves. moreover, cgem subsumes the traditional models while providing more expressive power to model the npis. we hope that cgem can be used to achieve more informed policy making when studying reopening strategies for covid-19. references:
- building epidemiological models from r0: an implicit treatment of transmission in networks
- seasonality and period-doubling bifurcations in an epidemic model
- when individual behaviour matters: homogeneous and network models in epidemiology
- network science
- social network-based distancing strategies to flatten the covid-19 curve in a post-lockdown world
- one quarter (26 percent) of canadians admit they're not practicing physical distancing
- a time-dependent sir model for covid-19 with undetectable infected persons
- neural network aided quarantine control model estimation of global covid-19 spread
- impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand
- fred (a framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations
- temporal dynamics in viral shedding and transmissibility of covid-19
- epidemic wave dynamics attributable to urban community structure: a theoretical characterization of disease transmission in a large network
- données covid-19 au québec
- the implications of network structure for epidemic dynamics
- covid-19: recovery and re-opening tracker
- crawdad dataset ilesansfil/wifidog
- situation of the coronavirus covid-19 in montreal
- spread of epidemic disease on networks
- predictive modelling of covid-19 in canada
- face masks prevent transmission of respiratory diseases: a meta-analysis of randomized controlled trials
- modeling covid-19 on a network: super-spreaders, testing and containment. medrxiv
- prediction of information diffusion probabilities for independent cascade model
- designing effective and practical interventions to contain epidemics
- sir-net: understanding social distancing measures with hybrid neural network model for covid-19 infectious spread
- social structure of facebook networks
- mathematical modelling of covid-19 transmission and mitigation strategies in the population of ontario
- covid-19: a timeline of canada's first-wave response
- targeted pandemic containment through identifying local contact network bottlenecks

montreal wifi network 3 snapshots of the montreal wifi network are used in this paper, covering the following time periods: 2004-08-27 to 2006-11-30, 2007-07-01 to 2008-02-26, and 2009-12-02 to 2010-03-08. each entry in the dataset consists of a unique connection id, a user id, a node id (wifi hub), a timestamp in, and a timestamp out.
nodes in the network are the users in each connection, and an edge forms between users who have connected to the same wifi hub at the same time. connections are sampled within the aforementioned timestamp-in dates to obtain ∼17,800 nodes. since there are many disconnected nodes in the wifi networks, only the giant connected component is used. synthetic networks we compared cgem on the wifi networks and on 3 synthetic network models: the regular, er, and ba networks. in each of these models, we set the number of nodes to 17,800 and fit the respective parameters to best match the infection curve of the base model and the number of edges in the wifi networks; the fitted parameters are listed in table 5. all the experiments have been performed on a stock laptop. the following assumptions are made in cgem: 1. individuals who recover from covid-19 cannot be infected again; 2. symptomatic and asymptomatic individuals have the same transmission rate and they quarantine with the same probability; 3. a certain percentage of the population do not comply with npis regardless of their connections. quarantine figure 8 shows the results of quarantining on all graph structures. quarantining infected and exposed nodes both reduces and delays the peak of all infection curves; however, the peak is not delayed as much in the wifi graphs when compared to the regular and er graphs. social distancing figure 9 shows the results of applying social distancing on all networks. like quarantining, this is effective in reducing the peaks of the infection curves on all networks, but the delay of the peaks is only apparent on the synthetic networks. removing hubs figure 10 shows the results of applying school and business closure on all networks. the er and regular random networks significantly underestimate the effect of removing hubs. wearing masks figure 11 shows the results of wearing masks versus not wearing masks on each network. figure 12 shows the infection curves of all the networks with all npis applied: on march 23, 50% social distancing and 50% quarantine are applied, and 10% of hubs are removed with a success rate of 0.8; wearing masks is applied on april 6. the wifi networks more closely resemble the shape of the real infection curve.

key: cord-164703-lwwd8q3c authors: noury, zahra; rezaei, mahdi title: deep-captcha: a deep learning based captcha solver for vulnerability assessment date: 2020-06-15 journal: nan doi: nan sha: doc_id: 164703 cord_uid: lwwd8q3c captcha is a human-centred test to distinguish a human operator from bots, attacking programs, or other computerised agents that try to imitate human intelligence. in this research, we investigate a way to crack visual captcha tests with an automated deep learning based solution. the goal of this research is to investigate the weaknesses and vulnerabilities of captcha generator systems, and hence to help develop more robust captchas without taking the risks of manual trial-and-error efforts. we develop a convolutional neural network called deep-captcha to achieve this goal. the proposed platform is able to investigate both numerical and alphanumerical captchas. to train and develop an efficient model, we have generated a dataset of 500,000 captchas. in this paper, we present our customised deep neural network model, review the research gaps and the existing challenges, and discuss the solutions to cope with these issues. our network's cracking accuracy reaches a high rate of 98.94% and 98.31% for the numerical and the alpha-numerical test datasets, respectively.
that means more works is required to develop robust captchas, to be non-crackable against automated artificial agents. as the outcome of this research, we identify some efficient techniques to improve the security of the captchas, based on the performance analysis conducted on the deep-captcha model. captcha, abbreviated for completely automated public turing test to tell computers and humans apart is a computer test for distinguishing between humans and robots. as a result, captcha could be used to prevent different types of cyber security treats, attacks, and penetrations towards the anonymity of web services, websites, login credentials, or even in semiautonomous vehicles [13] and driver assistance systems [27] when a real human needs to take over the control of a machine/system. in particular, these attacks often lead to situations when computer programs substitute humans, and it tries to automate services to send a considerable amount of unwanted emails, access databases, or influence the online pools or surveys [4] . one of the most common forms of cyber-attacks is the ddos [8] attack in which the target service is overloaded with unexpected traffic either to find the target credentials or to paralyse the system, temporarily. one of the classic yet very successful solutions is utilising a captcha system in the evolution of the cybersecurity systems. thus, the attacking machines can be distinguished, and the unusual traffics can be banned or ignored to prevent the damage. in general, the intuition behind the captcha is a task that can distinguish humans and machines by offering them problems that humans can quickly answer, but the machines may find them difficult, both due to computation resource requirements and the algorithm complexity [5] . captchas can be in form of numerical or alpha-numerical strings, voice, or image sets. figure 1 shows a few samples of the common alpha-numerical captchas and their types. one of the commonly used practices is using text-based captchas. an example of these types of questions can be seen in figure 2 , in which a sequence of random alphanumeric characters or digits or combinations of them are distorted and drawn in a noisy image. there are many techniques and fine-details to add efficient noise and distortions to the captchas to make them more complex. for instance [4] and [9] recommends several techniques to add various type of noise to improve the security of captchas schemes such as adding crossing lines over the letters in order to imply an anti-segmentation schema. although these lines should not be longer than the size of a letter; otherwise, they can be easily detected using a line detection algorithm. another example would be using different font types, size, and rotation at the character level. one of the recent methods in this regard can be found in [28] which is called visual cryptography. on the other hand, there are a few critical points to avoid while creating captchas. for example, overestimating the random noises; as nowadays days the computer vision-based algorithms are more accurate and cleverer in avoiding noise in contrast to humans. besides, it is better to avoid very similar characters such as the number '0' and the letter 'o', letter 'l' and 'i' which cannot be easily differentiated, both by the computer and a human. besides the text-based captchas, other types of captchas are getting popular recently. 
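returning to the distortion techniques discussed above, the sketch below illustrates with pillow the devices mentioned for text captchas: per-character rotation, a short anti-segmentation line no longer than a single letter, and scattered point noise. the canvas size, default font and noise amounts are arbitrary illustrative choices, not those of any particular captcha scheme.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def draw_captcha(text, width=135, height=50):
    """Render text with per-character rotation, a short crossing line and point noise."""
    img = Image.new("L", (width, height), color=255)
    font = ImageFont.load_default()                 # placeholder font choice
    x = 5
    for ch in text:
        glyph = Image.new("L", (20, 30), color=255)
        ImageDraw.Draw(glyph).text((2, 2), ch, fill=0, font=font)
        glyph = glyph.rotate(random.uniform(-20, 20), expand=True, fillcolor=255)
        img.paste(glyph, (x, random.randint(0, 10)))
        x += 22
    draw = ImageDraw.Draw(img)
    # anti-segmentation line, kept shorter than a single character
    x0, y0 = random.randint(0, width - 20), random.randint(0, height - 1)
    draw.line([(x0, y0), (x0 + 15, y0 + random.randint(-10, 10))], fill=0, width=1)
    # scattered point noise
    for _ in range(100):
        draw.point((random.randrange(width), random.randrange(height)), fill=0)
    return img
```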
one example would be image-based captchas that include sample images of random objects such as street signs, vehicles, statues, or landscapes and asks the user to identify a particular object among the given images [22] . these types of captchas are especially tricky due to the context-dependent spirit. figure 3 shows a sample of this type of captchas. however, in this paper, we will focus on text-based captchas as they are more common in high traffic and dense networks and websites due to their lower computational cost. before going to the next section, we would like to mention another application of the captcha systems that need to be discussed, which is its application in ocr (optical character recognition) systems. although current ocr algorithms are very robust, they still have some weaknesses in recognising different hand-written scripts or corrupted texts, limiting the usage of these algorithms. utilising captchas proposes an excellent enhancement to tackle such problems, as well. since the researchers try to algorithmically solve captcha challenges this also helps to improve ocr algorithms [7] . besides, some other researchers, such as ahn et al. [6] , suggest a systematic way to employ this method. the proposed solution is called recaptcha, and it merely offers a webbased captcha system that uses the inserted text to finetune its ocr algorithms. the system consists of two parts: first, the preparation stage which utilises two ocr algorithms to transcribe the document independently. then the outputs are compared, and then the matched parts are marked as correctly solved; and finally, the users choose the mismatched words to create a captcha challenge dataset [14] . this research tries to solve the captcha recognition problem, to detect its common weaknesses and vulnerabilities, and to improve the technology of generating captchas, to ensure it will not lag behind the ever-increasing intelligence of bots and scams. the rest of the paper is organised as follows: in section 2., we review on the literature by discussing the latest related works in the field. then we introduce the details of the proposed method in section 3.. the experimental results will be provided in section 4., followed by the concluding remarks in section 5.. in this this section, we briefly explore some of the most important and the latest works done in this field. geetika garg and chris pollett [1] performed a trained python-based deep neural network to crack fix-lengthed captchas. the network consists of two convolutional maxpool layers, followed by a dense layer and a softmax output layer. the model is trained using sgd with nesterov momentum. also, they have tested their model using recurrent layers instead of simple dense layers. however, they proved that using dense layers has more accuracy on this problem. in another work done by sivakorn et al. [2] , they have created a web-browser-based system to solve image captchas. their system uses the google reverse image search (gris) and other open-source tools to annotate the images and then try to classify the annotation and find similar images, leading to an 83% success rate on similar image captchas. stark et al. [3] have also used a convolutional neural network to overcome this problem. however, they have used three convolutional layers followed by two dense layers and then the classifiers to solve six-digit captchas. besides, they have used a technique to reduce the size of the required training dataset. 
in researches done in [4] and [9] the authors suggest addition of different types of noise including crossing line noise or point-based scattered noise to improve the complexity and security of the captchas patterns. furthermore, in [11] , [12] , [18] , and [31] , also cnn based methods have been proposed to crack captcha images. [24] has used cnn via the style transfer method to achieve a better result. [29] has also used cnn with a small modification, in comparison with the densenet [32] structure instead of common cnns. also, [33] and [21] have researched chinese captchas and employed a cnn model to crack them. on the other hand, there are other approaches which do not use convolutional neural networks, such as [15] . they use classical image processing methods to solve captchas. as another example, [17] uses a sliding window approach to segment the characters and recognise them one by one. another fascinating related research field would be the adversarial captcha generation algorithm. osadchy et al. [16] add an adversarial noise to an original image to make the basic image classifiers misclassifying them, while the image still looks the same for humans. [25] also uses the same approach to create enhanced text-based images. similarly, [26] and [10] , use the generative models and generative adversarial networks from different point of views to train a better and more efficient models on the data. deep learning based methodologies are widely used in almost all aspects of our life, from surveillance systems to autonomous vehicles [23] , robotics, and even in the recent global challenge of the covid-19 pandemic [35] . to solve the captcha problem, we develop a deep neural network architecture named deep-captcha using customised convolutional layers to fit our requirements. below, we describe the detailed procedure of processing, recognition, and cracking the alphanumerical captcha images. the process includes input data pre-processing, encoding of the output, and the network structure itself. applying some pre-processing operations such as image size reduction, colour space conversion, and noise reduction filtering can have a tremendous overall increase on the network performance. the original size of the image data used in this research is 135 × 50 pixel which is too broad as there exist many blank areas in the captcha image as well as many codependant neighbouring pixels. our study shows by reducing the image size down to 67 × 25 pixel, we can achieve almost the same results without any noticeable decrease in the systems performance. this size reduction can help the training process to become faster since it reduces the data without having much reduction in the data entropy. colour space to gray-space conversion is another preprocessing method that we used to reduce the size of the data while maintaining the same level of detection accuracy. in this way, we could further reduce the amount of redundant data and ease the training and prediction process. converting from a three-channel rgb image to a grey-scale image does not affect the results, as the colour is not crucial on the textbased captcha systems. the last preprocessing technique that we consider is the application of a noise reduction algorithm. after a careful experimental analysis on the appropriate filtering approaches, we decided to implement the conventional median-filter to remove the noise of the input image. 
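a minimal sketch of the three pre-processing steps described above, assuming opencv: grayscale loading, downscaling from 135 × 50 to 67 × 25 pixels, and median filtering. the 3 × 3 filter window is our assumption, since the text only states that a conventional median filter was used.

```python
import cv2
import numpy as np

def preprocess(path: str, size=(67, 25), ksize: int = 3) -> np.ndarray:
    """Load one captcha image, downscale, grayscale and denoise it.

    size is (width, height); ksize is the assumed median-filter window.
    Returns a float array in [0, 1] with a trailing channel axis.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # colour adds nothing for text captchas
    img = cv2.resize(img, size, interpolation=cv2.INTER_AREA)
    img = cv2.medianBlur(img, ksize)               # windowed median, as in algorithm 1
    return img.astype("float32")[..., np.newaxis] / 255.0
```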
the algorithm eliminates the noise of the image by using the median value of the surrounding pixel values instead of the pixel itself. the algorithm is described in algorithm 1, in which we generate the result image from the input 'image' using a predefined window size. unlike standard classification problems, where we have a fixed number of classes, in captcha recognition the number of possible outputs depends on the number of digits and the length of the character set in the designed captcha. this leads to exponential growth in the number of combinations to be detected. hence, for a captcha problem with five numerical digits, we have around 100,000 different combinations. as a result, we are required to encode the output data to fit into a single neural network. the initial encoding we used in this research was to employ nb_input = d × l neurons, where d is the length of the alphabet set and l is the number of characters in the captcha. the layer utilises the sigmoid activation function, s(x) = 1 / (1 + e^(−x)), where x is the input value and s(x) is the output of the sigmoid function. by increasing x, s(x) converges to 1, and by decreasing it, s(x) approaches 0. applying the sigmoid function adds a non-linearity feature to the neurons, which improves the learning potential and the capacity of those neurons to deal with non-linear inputs. these sets of neurons can be arranged so that the first set of d neurons represents the first letter of the captcha, the second set of d neurons represents the second letter of the captcha, and so on. in other words, assuming d = 10, the 15th neuron tells whether the second character of the captcha is the fifth symbol of the alphabet or not. a visual representation can be seen in figure 4.a, where the method encompasses three numerical serial digits that represent 621 as the output. however, this approach proved unsatisfactory due to its incapability of normalising the numerical values and the impossibility of using the softmax function as the output layer of the intended neural network. therefore, we employed l parallel softmax layers instead, softmax(z_i) = e^(z_i) / Σ_{j=1}^{k} e^(z_j), where i is the corresponding class for which the softmax is being calculated, z_i is the input value of that class, and k is the maximum number of classes. each softmax layer individually comprises d neurons, as in figure 4.b, and these d neurons in turn represent the alphabet that is used to create the captchas (for example 0 to 9, or a to z). each of the l units represents the location of the digit in the captcha pattern (for example, locations 1 to 3). using this technique allows us to normalise each softmax unit individually over its d neurons. in other words, each unit can normalise its weights over the different alphabet symbols; hence it performs better overall. although recurrent neural networks (rnns) can be one of the options for predicting captcha characters, in this research we have focused on sequential models as they perform faster than rnns, yet can achieve very accurate results if the model is well designed. the structure of our proposed network is depicted in figure 5. the network starts with a convolutional layer with 32 neurons (filters), the relu activation function, and 5 × 5 kernels. a 2 × 2 max-pooling layer follows this layer. then, we have two further sets of these convolutional-maxpooling pairs with the same parameters except for the number of neurons, which are set to 48 and 64, respectively. we have to note that all of the convolutional layers have the "same" padding parameter.
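the d × l output encoding described above can be made concrete with a small helper that turns a captcha string into l one-hot vectors of length d and maps the l softmax outputs back to characters; d = 10 and l = 5 follow the numerical setting in the text, while the function names are ours.

```python
import numpy as np

ALPHABET = "0123456789"   # d = 10 symbols for the numerical captchas
D, L = len(ALPHABET), 5   # five-digit captchas

def encode_label(text):
    """Return L one-hot vectors of length D, one per character position."""
    assert len(text) == L
    targets = []
    for ch in text:
        onehot = np.zeros(D, dtype="float32")
        onehot[ALPHABET.index(ch)] = 1.0
        targets.append(onehot)
    return targets          # suitable as labels for L parallel softmax heads

def decode_prediction(outputs):
    """Map the L softmax outputs back to a string by taking each argmax."""
    return "".join(ALPHABET[int(np.argmax(o))] for o in outputs)

# e.g. encode_label("62194")[1] has a 1 in position 2, i.e. the second
# character of the captcha is the digit '2'.
```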
after the convolutional layers, there is a dense layer with 512 neurons, the relu activation function and a 30% drop-out rate. finally, we have l separate softmax layers, where l is the number of expected characters in the captcha image. the loss function of the proposed network is the binary cross-entropy, as we need to compare these binary matrices all together: loss = −(1/n) Σ_{i=1}^{n} [y_i log p(x_i) + (1 − y_i) log(1 − p(x_i))], where n is the number of samples and p is the predictor model. x_i and y_i represent the input data and the label of the ith sample, respectively. since the label can be either zero or one, only one part of this equation is active for each sample. we also employed the adam optimiser, which is briefly described in equations 4 to 8, where m_t and v_t represent exponentially decaying averages of the past gradients and past squared gradients, respectively: m_t = β_1 m_{t−1} + (1 − β_1) g_t and v_t = β_2 v_{t−1} + (1 − β_2) g_t^2. β_1 and β_2 are configurable constants. g_t is the gradient of the optimising function and t is the learning iteration. in equations 6 and 7, bias-corrected estimates of m and v are calculated as m̂_t = m_t / (1 − β_1^t) and v̂_t = v_t / (1 − β_2^t). finally, using equation 8, θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε), and by updating θ_t in each iteration, the optimum value of the function can be attained. m̂_t and v̂_t are calculated via equations 6 and 7, and η, the step size (also known as the learning rate), is set to 0.0001 in our approach. the intuition behind using the adam optimiser is its capability to train the network in a reasonable time. this can be easily inferred from figure 6a, in which the adam optimiser achieves the same results as stochastic gradient descent (sgd), but with a much faster convergence. after several experiments, we trained the network for 50 epochs with a batch size of 128. as can be inferred from figure 6a, even after 30 epochs the network tends to an acceptable convergence. as a result, 50 epochs seem to be sufficient for the network to perform steadily. furthermore, figure 6e would also suggest the same inference based on the measured accuracy metrics. after developing the above-described model, we trained the network on 500,000 randomly generated captchas using the python imagecaptcha library [38]. see figure 7 for some of the randomly generated numerical captchas with a fixed length of five digits. to be balanced, the dataset consists of ten randomly generated images from each permutation of a five-digit text. we tested the proposed model on another set of half a million captcha images as our test dataset. as represented in table i, the network reached an overall accuracy rate of 99.33% on the training set and 98.94% on the test dataset. we have to note that the provided accuracy metrics are calculated based on the number of correctly detected captchas as a whole (i.e. correct detection of all five individual digits in a given captcha); otherwise, the accuracy for individual digits is even higher, as per table ii. we have also conducted a confusion matrix check to better visualise the outcome of this research. figure 8 shows how the network performs on each digit regardless of the position of that digit in the captcha string. as a result, the network seems to work extremely accurately on the digits, with less than 1% misclassification for each digit.
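putting the architecture and training details above together, a minimal keras sketch might look as follows. the 25 × 67 × 1 input shape corresponds to the resized grayscale images; the filter counts, kernel size, padding, dense width, drop-out rate, loss, optimiser and learning rate are taken from the text, and anything else is left at library defaults. this is our reconstruction, not the released deep-captcha code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_deep_captcha(height=25, width=67, d=10, l=5):
    """CNN with three conv/max-pool blocks, a 512-unit dense layer and
    l parallel softmax heads of size d, compiled with binary cross-entropy
    and Adam (learning rate 0.0001), as described in the text."""
    inputs = keras.Input(shape=(height, width, 1))
    x = inputs
    for filters in (32, 48, 64):
        x = layers.Conv2D(filters, (5, 5), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = [layers.Dense(d, activation="softmax", name=f"char_{i}")(x)
               for i in range(l)]
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model.fit(x_train, y_train, epochs=50, batch_size=128) matches the training
# schedule reported above; y_train is a list of l one-hot label arrays.
```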
by analysing the network performance and visually inspecting 100 misclassified samples we pointed out some important results as follows that can be taken into account to decrease the vulnerability of the captcha generators: while an average human could solve the majority of the misclassified captchas, the following weaknesses were identified in our model that caused failure by the deep-captcha solver: • in 85% of the misclassified samples, the gray-level intensity of the generated captchas were considerably lower than the average intensity of the gaussian distributed pepper noise in the captcha image. • in 54% of the cases, the digits 3, 8, or 9 were the cause of the misclassification. • in 81.8% of the cases, the misclassified digits were rotated for 10 • or more. • confusion between the digits 1 and 7 was also another cause of the failures, particularly in case of more than 20 • counter-clockwise rotation for the digit 7. consequently, in order to cope with the existing weakness and vulnerabilities of the captcha generators, we strongly suggest mandatory inclusion of one or some of the digits 3, fig. 7 : samples of the python numerical image-captcha library used to train the deep-captcha. 7, 8 and 9 (with/without counter-clockwise rotations) with a significantly higher rate of embedding in the generated captchas comparing to the other digits. this will make the captchas harder to distinguish for automated algorithms such as the deep-captcha, as they are more likely to be confused with other digits, while the human brain has no difficulties in identifying them. a similar investigation was conducted for the alphabetic part of the failed detections by the deep-captcha and the majority of the unsuccessful cases were tied to either too oriented characters or those with close contact to neighbouring characters. for instance, the letter "g" could be confused with "8" in certain angles, or a "w" could be misclassified as an "m" while contacting with an upright letter such as "t ". in general, the letters that can tie together with one/some of the letters: w, v, m, n can make a complex scenario for the deep-captcha. therefore we suggest more inclusion of these letters, as well as putting these letters in close proximity to others letter, may enhance the robustness of the captchas. our research also suggests brighter colour (i.e. lower grayscale intensity) alpha-numerical characters would also help to enhance the difficulty level of the captchas. in this section, we compare the performance of our proposed method with 10 other state-of-the-art techniques. the comparison results are illustrated in table iii followed by further discussions about specification of each method. as mentioned in earlier sections, our approach is based on convolutional neural network that has three pairs of convolutional-maxpool layers followed by a dense layer that is connected to a set of softmax layers. finally, the network is trained with adam optimiser. in this research we initially focused on optimising our network to solve numerical captchas; however, since many existing methods work on both numerical and alphanumerical captchas, we developed another network capable of solving both types. also, we trained the network on 700,000 alphanumerical captchas. for a better comparison and to have a more consistent approach, we only increased the number of neurons in each softmax units from 10 to 31 to cover all common latin characters and digits. 
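one way to act on the hardening suggestion above - embedding the easily confused digits 3, 7, 8 and 9 at a higher rate - is sketched below using the python captcha package's imagecaptcha generator referred to earlier; the 3× weighting and the file naming are arbitrary illustrative choices.

```python
import os
import random
from captcha.image import ImageCaptcha

DIGITS = "0123456789"
# up-weight the digits the analysis above found hardest for the solver
WEIGHTS = [3 if d in "3789" else 1 for d in DIGITS]

def generate_hardened_captchas(n, length=5, out_dir="captchas"):
    """Write n captcha images whose texts over-sample the digits 3, 7, 8 and 9."""
    os.makedirs(out_dir, exist_ok=True)
    gen = ImageCaptcha(width=135, height=50)
    labels = []
    for i in range(n):
        text = "".join(random.choices(DIGITS, weights=WEIGHTS, k=length))
        gen.write(text, f"{out_dir}/{i:06d}_{text}.png")
        labels.append(text)
    return labels
```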
the reason behind having 31 neurons is that we have used all latin alphabets and numbers except for i, l, 1, o, 0 due to their similarity to each other and existing difficulties for an average human to tell them apart. although we have used both upper and lower case of each letter to generate a captcha, we only designate a single neuron for each of these cases in order to simplicity. in order to compare our solution, first, we investigated the research done by wang et al. [29] which includes evaluations on the following approaches: densenet-121 and resnet-50 which are fine-tuned model of the original densenet and resnet networks to solve captchas as well as dfcr which is an optimised method based on the densenet network. the dfcr has claimed an accuracy of 99.96% which is the best accuracy benchmark among other methods. however, this model has only been trained on less than 10,000 samples and only on four-digit captcha images. although the quantitative comparison in table iii shows the [29] on top of our proposed method, the validity of the method can neither be verified on larger datasets, nor on complex alphanumerical captchas with more than half a million samples, as we conducted in our performance evaluations. the next comparing method is [36] which uses an svm based method and also implementation of the vgg-16 network to solve captcha problems. the critical point of this method is the usage of image preprocessing, image segmentation and one by one character recognition. these techniques have lead to 98.81% accuracy on four-digit alphanumerical captchas. the network has been trained on a dataset composed of around 10,000 images. similarly, tod-cnn [20] have utilised segmentation method to locate the characters in addition to using a cnn model which is trained on a 60,000 dataset. the method uses a tensorflow object detection (tod) technique to segment the image and characters. goodfellow et al. [14] have used distbelief implementation of cnns to recognise numbers more accurately. the dataset used in this research was the street view house numbers (svhn) which contains images taken from google street view. finally, the last discussed approach is [37] which compares vgg16, vgg cnn m 1024, and zf. although they have relatively low accuracy compared to other methods, they have employed r-cnn methods to recognise each character and locate its position at the same time. in conclusion, our methods seem to have relatively satisfactory results on both numerical and alphanumerical captchas. having a simple network architecture allows us to utilise this network for other purposes with more ease. besides, having an automated captcha generation technique allowed us to train our network with a better accuracy while maintaining the detection of more complex and more comprehensive captchas comparing to state-of-the-art. we designed, customised and tuned a cnn based deep neural network for numerical and alphanumerical based captcha detection to reveal the strengths and weaknesses of the common captcha generators. using a series of paralleled softmax layers played an important role in detection improvement. we achieved up to 98.94% accuracy in comparison to the previous 90.04% accuracy rate in the same network, only with sigmoid layer, as described in section 3.2. and table i . although the algorithm was very accurate in fairly random captchas, some particular scenarios made it extremely challenging for deep-captcha to crack them. 
we believe taking the addressed issues into account can help to create more reliable and robust captcha samples which makes it more complex and less likely to be cracked by bots or aibased cracking engines and algorithms. as a potential pathway for future works, we suggest solving the captchas with variable character length, not only limited to numerical characters but also applicable to combined challenging alpha-numerical characters as discussed in section 4.. we also recommend further research on the application of recurrent neural networks as well as the classical image processing methodologies [30] to extract and identify the captcha characters, individually. neural network captcha crackers i am robot:(deep) learning to break semantic image captchas captcha recognition with active deep learning recognition of captcha characters by supervised machine learning algorithms captcha: using hard ai problems for security recaptcha: human-based character recognition via web security measures designing a secure text-based captcha ddos attack evolution accurate, data-efficient, unconstrained text recognition with convolutional neural networks yet another text captcha solver: a generative adversarial network based approach breaking microsofts captcha breaking captchas with convolutional neural networks look at the driver, look at the road: no distraction! no accident! multi-digit number recognition from street view imagery using deep convolutional neural networks an optimized system to solve text-based captcha no bot expects the deepcaptcha! introducing immutable adversarial examples, with applications to captcha generation the end is nigh: generic solving of text-based captchas captcha breaking with deep learning a survey on breaking technique of text-based captcha a low-cost approach to crack python captchas using ai-based chosen-plaintext attack a security analysis of automated chinese turing tests im not a human: breaking the google recaptcha simultaneous analysis of driver behaviour and road condition for driver distraction detection captcha image generation using style transfer learning in deep neural network captcha image generation systems using generative adversarial networks a generative vision model that trains with high data efficiency and breaks text-based captchas toward next generation of driver assistance systems: a multimodal sensor-based platform applying visual cryptography to enhance text captchas captcha recognition based on deep convolutional neural network object detection, classification, and tracking captcha recognition with active deep learning densely connected convolutional networks an approach for chinese character captcha recognition using cnn image flip captcha zero-shot learning and its applications from autonomous vehicles to covid-19 diagnosis: a review research on optimization of captcha recognition algorithm based on svm captcha recognition based on faster r-cnn key: cord-256713-tlluxd11 authors: welch, david title: is network clustering detectable in transmission trees? date: 2011-06-03 journal: viruses doi: 10.3390/v3060659 sha: doc_id: 256713 cord_uid: tlluxd11 networks are often used to model the contact processes that allow pathogens to spread between hosts but it remains unclear which models best describe these networks. one question is whether clustering in networks, roughly defined as the propensity for triangles to form, affects the dynamics of disease spread. 
we perform a simulation study to see if there is a signal in epidemic transmission trees of clustering. we simulate susceptible-exposed-infectious-removed (seir) epidemics (with no re-infection) over networks with fixed degree sequences but different levels of clustering and compare trees from networks with the same degree sequence and different clustering levels. we find that the variation of such trees simulated on networks with different levels of clustering is barely greater than those simulated on networks with the same level of clustering, suggesting that clustering can not be detected in transmission data when re-infection does not occur. to understand the dynamics of infectious diseases it is crucial to understand the structure and interactions within the host population. conversely, it is possible to learn something about host population structure by observing the pattern of pathogen spread within it. in either case, it is necessary to have a good model of the host population structure and interactions within it. networks, where nodes of the network represent hosts and edges between nodes represent contacts across which pathogens may be transmitted, are now regularly used to model host interactions [1] [2] [3] . while many models have been proposed to describe the structure of these contact networks for different populations and different modes of transmission, it is not yet understood how different features of networks affect the spread of pathogens. one promising development in this field is the use of statistical techniques which aim to model a contact network based on data relating to the passage of a pathogen through a population. such data includes infection times [4] [5] [6] and genetic sequences that are collected from an epidemic present in the population of interest [7] [8] [9] . these data have previously been shown to be useful for reconstructing transmission histories (the distinction between a contact network and a transmission history is that a contact network includes all edges between hosts across which disease may spread, whereas the transmission history is just the subset of edges across which transmission actually occurred). infection times can be used to crudely reconstruct transmission histories by examining which individuals were infectious at the time that any particular individual was infected [10] . genetic sequences from viruses are informative about who infected whom by comparing the similarity between sequences. due to the random accumulation of mutations in the sequences, we expect sequences from an infector/infectee pair to be much closer to each other than sequences from a randomly selected pair in the population (see [11] for a review of modern approaches to analysing viral genetic data). the work of [4] [5] [6] seeks to extend the use of this data to reconstruct a model for the whole contact network rather than just the transmission history. in theory, these statistical methods could settle arguments about which features of the network are important in the transmission of the disease and which are simply artifacts of the physical system. in this article, we focus on clustering in networks and ask whether or not networks which differ only in their level of clustering could be distinguished if all we observed was transmission data from an epidemic outbreak. the answer to this question will determine whether these new statistical techniques can be extended to estimate the level of clustering in a network. 
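the crude reconstruction of transmission histories from infection times mentioned above amounts to listing, for each case, everyone who was infectious at the moment that case was infected; a minimal sketch, with a hypothetical record format, is shown below.

```python
def candidate_infectors(cases):
    """cases: dict mapping node -> (t_infected, t_infectious_start, t_removed).

    For each case, list everyone who was infectious at the moment that case
    was infected, giving a crude, non-unique reconstruction of the
    transmission history.
    """
    out = {}
    for v, (t_inf, _, _) in cases.items():
        out[v] = [u for u, (_, t_start, t_end) in cases.items()
                  if u != v and t_start <= t_inf < t_end]
    return out
```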
throughout, we consider a population with n individuals that interact through some contact process. this population and its interactions are fully described by an undirected random network, denoted y, on n nodes. a simple example of a network is shown in figure 1 with illustrations of some of the terms we use in this article. y can be represented by the symmetric binary matrix [y_ij] where y_ij = y_ji = 1 if an edge is present between nodes i and j, otherwise y_ij = 0. we stipulate that there are no loops in the network, so y_ii = 0 for all i. the degree of the ith node, denoted d_i, is the number of edges connected to i, so d_i = Σ_{j:j≠i} y_ij. clustering is one of the central features of observed social networks [12, 13]. intuitively, clustering is the propensity for triangles or other small cycles to form, so that, for example, a friend of my friend is also likely to be my friend. where there is a positive clustering effect, the existence of edges (i, j) and (i, k) increases the propensity for the edge (j, k) to exist, while a negative clustering effect implies that (j, k) is less likely to exist given the presence of (i, j) and (i, k). when there is no clustering effect, the presence or absence of (i, j) and (i, k) has no bearing on that of (j, k). thus clustering is one of the most basic of the true network effects: when it is present, the relationship between two nodes depends not only on properties of the nodes themselves but on the presence or absence of other relationships in the network. the effect of clustering on the dynamics of stochastic epidemics that run over networks remains largely unknown, though it has been studied in a few special cases. the difficulty with studying this effect in isolation is in trying to construct a network model where clustering can change but other properties of the network are held constant. in the simulations we study here, we focus on holding the degree sequence of a network constant - that is, each node maintains the same number of contacts - while varying the level of clustering. intuition suggests that clustering will have some effect on epidemic dynamics since, in a graph with no cycles, if an infection is introduced to a population at node i and there is a path leading to j then k, k can only become infected if j does first. however, where cycles are present, there may be multiple paths leading from i to k that do not include j, giving a different probability that k becomes infected and a different expected time to infection for k. figure 1. an example of a network on 7 nodes. the nodes are the red dots, labelled 1 to 7, and represent individuals in the population. the edges are shown as black lines connecting the nodes and represent possible routes of transmission. the degree of each node is the number of edges adjacent to it, so that node 5 has degree 3 and node 7 has degree 1. the degree sequence of the network is the count of nodes with a given degree and can be represented by the vector (0, 2, 0, 3, 1, 1), showing that there are 0 nodes of degree 0, 2 of degree 1, 0 of degree 2 and so on. a cycle in the network is a path starting at a node and following distinct edges to end up back at the same node. for example, the path from node 6 to node 1 to node 3 and back to node 6 is a cycle, but there is no cycle that includes node 4. clustering is a measure of the propensity of cycles of length 3 (triangles) to form. here, the edges (2,1) and (2,6) form a triangle with the edge (1,6), so work to increase clustering in the network.
however, the edges (2,1) and (2,5) do not comprise part of a triangle as (1,5) does not exist, so work to decrease clustering. previous work on the effect of clustering on epidemic dynamics has produced a variety of results which are largely specific to particular types of networks. newman [14] and britton et al. [15] show that for a class of networks known as random intersection graphs in which individuals belong to one or more overlapping groups and groups form fully connected cliques, an increase in clustering reduces the epidemic threshold, that is, major outbreaks may occur at lower levels of transmissibility in highly clustered networks. newman [14] , using heuristic methods and simulations, suggests that for sufficiently high levels of transmissibility the expected size of an outbreak is smaller in a highly clustered network than it would be in a similar network with lower clustering. these articles show that graphs with different levels of clustering do, at least in some cases, have different outbreak probabilities and final size distributions for epidemic outbreaks. kiss and green [16] provide a succinct rebuttal to the suggestion that the effects found by [14] and [15] are solely due to clustering. they show that, while the mean degree of the network is preserved in the random intersection graph, the degree distribution varies greatly (in particular, there are many zero-degree nodes) and variance of this distribution increases with clustering. an increase in the variance of the degree distribution has previously been shown to lower the epidemic threshold. they demonstrate that a rewiring of random intersection graphs that preserves the degree sequence but decreases clustering produces networks with similarly lowered epidemic thresholds and even smaller mean outbreak sizes. our experiments, reported below, are similar in spirit to those of [16] but look at networks with different degree distributions and study in detail how epidemic data from networks with varying levels of clustering might vary. ball et al. [17] show, using analytical techniques, that clustering induced by household structure in a population (where individuals have many contacts with individuals in the same household and fewer global contacts with those outside of the household) has an effect on probability of an outbreak and the expected size of any outbreak. the probability of an outbreak, in some special cases, is shown to be monotonically decreasing with clustering coefficient and the expected outbreak size also decreases with clustering. there is no suggestion that these results will apply to clustered networks outside of this specific type of network or that they apply when degree distributions are held constant. eames [18] also studies networks with two types of contacts: regular contacts (between people who live or work together, for example) and random contacts (sharing a train ride, for example). using simulations of a stochastic epidemic model and deterministic approximations, it is shown that both outbreak final size and probability of an outbreak are reduced with increased clustering, particularly when regular contacts dominate. as the number of random contacts increases, the effect of clustering reduces to almost zero. strong effects on the expected outbreak size in networks with no random contacts are observed for values of the clustering coefficient above about 0.4, however, no indication of the magnitude of the variance of these effects is given. 
keeling [19] reports similar results, introducing clustering to a network using a spatial technique-nodes live in a two-dimensional space and two nodes are connected by an edge with a probability inversely proportional to their distance. the clustering comes about by randomly choosing positions in space to which nodes are attracted before connections are made. the results suggest that changes in clustering at lower levels has little effect on the probability of an outbreak, but as the clustering coefficient reaches about 0.45, the chance of an outbreak reduces significantly. as in [14] and [15] , while the mean degree of network nodes is held constant here, nothing is said about the degree distribution as clustering varies. serrano and boguñá [20] look specifically at infinite power-law networks and shows that the probability of an outbreak increases as clustering increases but the expected size of an outbreak decreases. some more recent papers seek to distinguish the effects of clustering from confounding factors such as assortativity and degree sequence. miller [21] develops analytic approximations to study the interplay of various effects such as clustering, heterogeneity in host infectiousness and susceptibility and the weighting of contacts on the spread of disease over a network. the impact of clustering on the probability and size of an outbreak is found to be small on "reasonable" networks so long as the average degree of the network is not too low. the rate at which the epidemic spreads, measured by the reproduction number, r 0 , is found to reduce with increased clustering in such networks. in networks with low mean degree, r 0 may be reduced to point of affecting the probability and size of an outbreak. miller [22] points out that studies of the effects of clustering should take into account assortativity in the network, that is, the correlations in node degree between connected nodes. assortativity has been shown to affect epidemic dynamics and changing the level of clustering in a network can change the level of assortativity. to distinguish between the effects of assortativity and clustering, a method of producing networks with arbitrary degree distributions and arbitrary levels of clustering with or without correlated degrees is presented and studied using percolation methods. the effect of increasing clustering in these models is to reduce the probability of outbreaks and reduce the expected size of an epidemic. badham and stocker [23] use simulated networks and epidemics to study the relationship between assortativity and clustering. their results suggest that increased clustering diminished the final size of the epidemic, while the effect of clustering on probability of outbreak was not very clear. like [23] , moslonka-lefebvre et al. [24] use simulations to try to distinguish the effects of clustering and assortativity but look at directed graphs. here, they find that clustering has little effect on epidemic behaviour. melnik et al. [25] propose that the theory developed for epidemics on unclustered (tree-like) networks applies with a high degree of accuracy to networks with clustering so long as the network has a small-world property [12] . that is, if the mean length of the shortest path between vertices of the clustered network is sufficiently small, quantities such as the probability of an outbreak on the network can be estimated using known results that require only the degree distribution and degree correlations. 
the theory is tested using simulations on various empirical networks from a wide range of domains and synthetic networks simulated from theoretical models. taken together, these studies show that clustering can have significant effects on crucial properties of epidemics on networks such as the probability, size and speed of an outbreak. these results primarily relate to the final outcome and mean behaviour of epidemics. however, if we can obtain a transmission tree for an outbreak then we have information from the start to the finish of a particular epidemic, including times of infection and who infected whom. since epidemics are stochastic processes, data from a particular epidemic may differ considerably from the predicted mean. whether or not such data contains information about clustering in the underlying network is the question we seek to address here. we simulate epidemics over networks with fixed degree distributions and varying levels of clustering and inspect various summary statistics of the resulting epidemic data, comparing the summaries for epidemics run over networks with the same degree distribution but different levels of clustering. the precise details of the simulations are described in section 2. the results of the simulations, presented in section 3, show that there is likely little to no signal of clustering in a contact network to be found in a single realisation of an epidemic process over that network. we conclude that it is unlikely that clustering parameters can be inferred solely from epidemiological data that relates to the transmission tree and suggest that further work in parameter estimation for contact networks would be best focused on other properties of contact networks such as degree distribution or broader notions of population structure. we simulate multiple networks from two network models: a bernoulli model [26] and a power-law model [27]. under the bernoulli model (also called the erdős-rényi or binomial model), an edge between nodes i and j is present with some fixed probability 0 ≤ p ≤ 1 and absent with probability 1 − p, independently of all other edges. due to their simplicity, bernoulli networks are well-studied and commonly used in disease modeling but are not generally thought to be accurate models of social systems. a bernoulli network is trivial to construct by sampling first the total number of edges in the graph, |y| ~ binomial(n(n − 1)/2, p), where n is the number of nodes in the network, and then sampling |y| edges uniformly at random without replacement. we set n = 500 and p = 7/n = 0.014 in the simulations reported below. a power-law network is defined as having a power-law degree distribution, that is, for nodes i = 1, . . . , n, p(d_i = k) ∝ k^(−α) for some α > 0. power-law networks are commonly used to model social interactions and various estimates of α in the range 1.5-2.5 have been claimed for observed social networks. in the model used here, we set α = 1.8. we simulate power-law networks using a reed-molloy type algorithm [28]. that is, the degree of each node, d_i, i = 1, . . . , n, is sampled from the appropriate distribution. node i is then assigned d_i "edge stubs" and pairs of stubs are sampled uniformly without replacement to be joined and become edges. when all stubs have been paired, loops are removed and multiple edges between the same nodes are collapsed to single edges.
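a sketch of the two network generators described above, assuming networkx: gnp_random_graph gives the bernoulli model directly, and a reed-molloy (configuration-model) construction with a discrete power-law degree distribution approximates the second model once loops and parallel edges are removed and the largest component is taken. the maximum-degree cut-off used to make the power law sampleable is our assumption.

```python
import numpy as np
import networkx as nx

def bernoulli_network(n=500, p=7 / 500, seed=None):
    """Bernoulli / Erdos-Renyi network: each edge present independently with prob p."""
    return nx.gnp_random_graph(n, p, seed=seed)

def power_law_network(n=600, alpha=1.8, k_max=100, seed=None):
    """Reed-Molloy style construction for an approximate power-law degree sequence."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, k_max + 1)
    probs = k ** (-alpha)
    probs /= probs.sum()
    degrees = rng.choice(k, size=n, p=probs)
    if degrees.sum() % 2:                     # stub pairing needs an even total
        degrees[0] += 1
    g = nx.configuration_model(degrees.tolist(), seed=seed)
    g = nx.Graph(g)                           # collapse multiple edges
    g.remove_edges_from(nx.selfloop_edges(g))  # remove loops
    giant = max(nx.connected_components(g), key=len)
    return g.subgraph(giant).copy()           # keep only the largest component
```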
this last step of removing loops and multiple edges causes the resulting graph to be only an approximation of a power-law graph, but the approximation is good for even moderately large n. we set n = 600 and consider only the largest connected component of the network in the simulation reported below. the size of the networks considered here is smaller than some considered in simulation studies though on a par with others (see, for example, [25] who looks at a wide range of network sizes). we choose these network sizes partly for convenience and partly because the current computational methods for statistical fitting of epidemic data to network models would struggle with networks much larger than a few hundred nodes [6], so our interest is in networks around this size. from each sampled network, y, we generate two further networks, y_hi and y_lo, that preserve the degrees of all nodes in y but have, respectively, high and low levels of clustering. we achieve this using a monte carlo algorithm implemented in the ergm package [29] in r [30] that randomly rewires the input network while preserving the degree distribution. a similar algorithm is implemented in bansal et al. [31]. for details of the ergm model and implementation of this algorithm, we refer the reader to the package manual [32] and note that the two commands used to simulate our networks are y_hi = simulate(y ~ gwesp(0.2, fixed=T), theta0 = 5, constraints = ~degreedist, burnin = 5e+5) and y_lo = simulate(y ~ gwesp(0.2, fixed=T), theta0 = -5, constraints = ~degreedist, burnin = 5e+5). we measure clustering in the resulting networks using the clustering coefficient [12], defined as follows. let N_i = {j | y_ij = 1} be the neighbourhood of vertex i and d_i = |N_i| be the degree of i. let n_i = Σ_{j<k: j,k ∈ N_i} y_jk be the number of edges between neighbours of i. for d_i > 1, the local clustering coefficient is c_i = n_i / (d_i(d_i − 1)/2), which is the ratio of extant edges between neighbours of i to possible edges. for d_i ∈ {0, 1}, let c_i = 0. the (global) clustering coefficient is the mean of the local coefficients, c = (1/n) Σ_i c_i. the choice of c_i = 0 for d_i ∈ {0, 1} is somewhat arbitrary, though other possible choices, such as c_i = 1 or excluding those statistics from the mean, give similar qualitative results in our experiments. over each simulated network, we simulate a stochastic susceptible-exposed-infectious-removed (seir) epidemic. all nodes are initially susceptible to the infection. the outbreak starts when a single node is chosen uniformly at random and exposed to the disease. after a gamma-distributed waiting period with mean k_e θ_e and variance k_e θ_e^2, the node becomes infectious. the infection may spread across the edges of the network, from infectious nodes to susceptible nodes, according to a poisson process with rate β. infected nodes recover after an infectious period with a gamma-distributed waiting time with mean k_i θ_i and variance k_i θ_i^2. once a node is recovered, it plays no further part in the spread of the infection. the process stops when there are no longer any exposed or infectious nodes. for each pair, y_hi and y_lo, we start the infection from the same node. we condition on the outbreak infecting at least 20 nodes. the parameter values are set at β = 0.1, k_e = k_i = 1 and θ_e = θ_i = 3 in the simulations reported below. a transmission tree encodes all information about the epidemic outbreak it describes. as such, it is a very complicated object.
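the seir process just described can be simulated with a simple event queue: exposure starts a gamma-distributed latent period, infectiousness starts a gamma-distributed infectious period, and transmission along each edge from an infectious node is a poisson process with rate β, so the first candidate transmission time on an edge is exponential with rate β and only counts if it falls before the infector recovers. the sketch below is our own simplified implementation of that logic, not the code used for the paper.

```python
import heapq
import itertools
import random

def simulate_seir(g, beta=0.1, k_e=1, theta_e=3, k_i=1, theta_i=3, seed=None):
    """Event-driven SEIR epidemic over a networkx graph g.

    Returns the transmission tree as a list of (time, infectee, infector)
    records (infector is None for the index case) and a list of
    (time, node) recovery records.
    """
    rng = random.Random(seed)
    order = itertools.count()                     # heap tie-breaker
    state = {v: "S" for v in g}                   # S, E, I or R
    events, tree, recoveries = [], [], []

    start = rng.choice(list(g))
    state[start] = "E"
    tree.append((0.0, start, None))
    heapq.heappush(events, (rng.gammavariate(k_e, theta_e), next(order),
                            "infectious", start, None))
    while events:
        t, _, kind, v, source = heapq.heappop(events)
        if kind == "infectious":
            state[v] = "I"
            t_rec = t + rng.gammavariate(k_i, theta_i)
            heapq.heappush(events, (t_rec, next(order), "recover", v, None))
            for u in g.neighbors(v):
                t_try = t + rng.expovariate(beta)  # first Poisson event on this edge
                if t_try < t_rec:
                    heapq.heappush(events, (t_try, next(order), "transmit", u, v))
        elif kind == "transmit" and state[v] == "S":
            state[v] = "E"
            tree.append((t, v, source))
            heapq.heappush(events, (t + rng.gammavariate(k_e, theta_e), next(order),
                                    "infectious", v, None))
        elif kind == "recover":
            state[v] = "R"
            recoveries.append((t, v))
    return tree, recoveries
```

to condition on outbreaks of at least 20 infections, as in the text, one would simply rerun the function until len(tree) ≥ 20.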
to compare sets of transmission trees and decide whether there are some systematic differences between them, we rely on various summary statistics derived from the trees and compare the distribution of the summaries over the ensembles in question. the summaries we use can be divided into two groups, those relating solely to the number of infected through time and those relating to topology of the tree. the first group of summaries can all be derived from the epidemic curves, that is, the number infected as a function of time. from this, we derive scalar summaries being the total number of individuals infected, the length of the epidemic (measured from the time of the first infection to the last recovery), the maximum of the epidemic curve and the time of that maximum. we label each individual in the population (equivalently, each node in the contact network) with labels 1, . . . , n . a transmission tree, a distinct graph from the contact network, has a time component and can be defined as follows; an example of a transmission tree and the notation is given in figure 2 . there are three types of nodes in a transmission tree (not to be confused with nodes in the contact network): the root node corresponding to the initial infection, transmission or internal nodes corresponding to transmission events, and leaf or external nodes corresponding to recovery events. leaf nodes are defined by the time and label pair (t i , u i ) where t ≥ 0 is the time of the recovery event and u i is the label of individual that recovered. the internal nodes are associated with the triple (t i , u i , v i ) being the time of the event, t i , the label u i of the exposed individual and v i that is the transmitter or "parent" of the infection. the root node is like an internal node but the infection parent is given as 0, so is denoted (t 0 , u 0 , 0). the branches of the tree are times between infection, transmission and recovery events for a particular vertex. for example, if the individual labelled u is infected at event (t 1 , u, v 1 ), is involved in transmission events (t k , v k , u), k = 2, . . . , m − 1, and recovers at (t m , u) where t i < t j for i < j and {v 1 , . . . , u m−1 } are other individuals in the population, there are m − 1 branches of the transmission tree at u defined by the intervals (t i , t i+1 ], for i = 1, . . . , m − 1. we summarise the transmission tree using the following statistics: the mean branch length between internal nodes (corresponding to the mean time between secondary infections for each individual); the mean branch length of those branches adjacent to a leaf node (which corresponds to the mean time from the last secondary infection to removal for each individual); the number of secondary infections caused by each infected individual (that is, for each infected individual v we count the number of internal nodes that have the form (t i , u i , v), for some i); and, the distribution of infective descendants for each individual, v, which is defined recursively as the sum of secondary infections caused by v and the secondary infections caused by the secondary infections of v and so on. an equivalent definition is to say that number of infective descendants of v is the number of leaves that have a node of the form (t, u i , v) as an ancestor. finally, we consider the number of cherries in the tree [33] which is the number of pairs of leaves that are adjacent to a common internal node. 
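given transmission records in a (time, infectee, infector) form and recovery records in a (time, node) form, the scalar tree summaries above are straightforward to compute. the sketch below counts secondary infections, internal and leaf-adjacent branch lengths, and cherries; the cherry test used here - a transmission event after which the infectee transmits to nobody and the infector makes no further transmissions - is our reading of "pairs of leaves adjacent to a common internal node" in this representation.

```python
from collections import defaultdict

def tree_summaries(tree, recoveries):
    """Summaries of a transmission tree given as
    tree = [(time, infectee, infector), ...] (infector None for the index case)
    and recoveries = [(time, node), ...]."""
    infection_time = {v: t for t, v, _ in tree}
    recovery_time = {v: t for t, v in recoveries}
    offspring = defaultdict(list)                 # infector -> times of its transmissions
    for t, v, src in tree:
        if src is not None:
            offspring[src].append(t)

    n_secondary = {v: len(offspring.get(v, [])) for v in infection_time}

    # branches between internal nodes: from each individual's own infection to
    # its first transmission, and between its successive transmissions
    internal = []
    for v, times in offspring.items():
        pts = [infection_time[v]] + sorted(times)
        internal += [b - a for a, b in zip(pts[:-1], pts[1:])]

    # branches adjacent to leaves: last transmission (or infection, if none) to recovery
    external = [recovery_time[v] - max(offspring[v], default=infection_time[v])
                for v in infection_time if v in recovery_time]

    # cherries: transmission events whose two descendant branches both end in recoveries
    cherries = sum(1 for t, v, src in tree
                   if src is not None
                   and n_secondary.get(v, 0) == 0
                   and t == max(offspring[src]))
    return n_secondary, internal, external, cherries
```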
this simple statistic is chosen as it is easy to compute and contains information about the topology or shape of the tree. to compare the number of cherries in outbreaks of different size, we look at the ratio of extant cherries to the maximum possible number of cherries for the given outbreak. the experimental pipeline can thus be summarised as: 1. repeat for i = 1, . . . , 500: (a) sample a graph y i according to given degree distribution. (b) simulate two further graphs y hi i and y lo i with high clustering and low clustering, respectively, using a monte carlo sampler that rewires y i to alter the clustering level while preserving the degree of each node. we report results here for seir epidemics run over bernoulli and power-law networks. a number of smaller trials that we do not report were run: with different values chosen for the network and epidemic parameters; on networks with the same degree distributions as a random intersection graph; and, using an sir epidemic rather than an seir. the results for those smaller trials were qualitatively similar to the results reported here. the distributions of the measured clustering coefficients is shown in figure 3 and show that the simulated networks with high and low clustering for a given degree distribution are easily distinguished from one another. the bernoulli networks with low clustering contain no triangles, so the clustering coefficient for each of these networks is zero, while for highly-clustered bernoulli networks, clustering coefficients are in the range (0.28,0.33). for the power-law networks, the low clustered networks have clustering in the range (0.00,0.09) while the highly clustered networks have clustering in the range (0.24,0.38). figures 4 and 5 show comparisons of summary statistics for networks with differing levels of clustering and bernoulli degree distributions. the summaries show some differences between the outbreaks on the differently clustered networks. in particular, the outbreaks in the highly-clustered networks spread more slowly, on average, leading to marginally longer epidemics with fewer individuals infected at the peak of the outbreak, that occurs slightly later, than we see in outbreaks on the networks with low clustering. these mean effects are in line with the predictions of [22] . the variances of the measured statistics, however, are sufficiently large due to stochastic effects in the model that the ranges of the distributions overlap almost completely in most cases. statistics derived from the transmission tree appear to add little information, with only the number of cherries differing in the mean. figures 6 and 7 show the corresponding distributions for networks with power-law degree distributions. again, differences in the means between the two sets of statistics are apparent with the mean length of epidemic, total number infected and number infected at peak all lower in the epidemics on networks with high-clustering. the largest difference is found in the total number infected, where in the low-clustered networks, the range of the statistic is (231, 445) while it is just (211, 361) in the high-clustered networks. the primary cause here is due to the change in size of the largest connected component of the network. if we adjust for this by looking instead at the proportion of the giant component infected, the distributions again overlap almost completely with the range for the proportion infected in the low-clustered networks being (0.39, 0.74) and (0.42, 0.74) for the high-clustered networks. 
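the clustering coefficients summarised in figure 3 follow directly from the definition given earlier; in networkx, nx.clustering returns the local coefficients c_i (with c_i = 0 when d_i < 2, matching the convention used here) and nx.average_clustering returns their mean. a hand-rolled version of the same calculation is shown below as a cross-check.

```python
import networkx as nx

def global_clustering(g: nx.Graph) -> float:
    """Mean of the local clustering coefficients, with c_i = 0 when d_i < 2."""
    total = 0.0
    for i in g:
        nbrs = list(g.neighbors(i))
        d = len(nbrs)
        if d < 2:
            continue                              # c_i = 0 contributes nothing
        links = sum(1 for a in range(d) for b in range(a + 1, d)
                    if g.has_edge(nbrs[a], nbrs[b]))
        total += links / (d * (d - 1) / 2)
    return total / g.number_of_nodes()

# sanity check against the library implementation:
# abs(global_clustering(g) - nx.average_clustering(g)) < 1e-12
```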
the results presented above suggest that the behaviour of an epidemic on a random network with a given degree sequence is relatively unaffected by the level of clustering in the network. some effect is seen, but it is small relative to the random variation we see between epidemics on similarly clustered networks. the results also suggest that the complete transmission tree from an epidemic provides little information about clustering that is not present in the epidemic curve. these results do not imply that clustering has little effect, rather they suggest as noted in [16] , the apparently strong effect of clustering observed by some is more likely to due to a change in the degree distribution-an effect we have nullified by holding the degree sequence constant. these broader effects are probably best analysed on a grosser level such as the household or subgroup level rather than at the individual level at which clustering is measured. our simulation method, in which the degree sequence for each network is held constant while clustering levels are adjusted, places significant restrictions on the space of possible graphs and therefore clustering coefficients. the levels of clustering achieved in the simulations reported here (for example, having a clustering coefficient in the low-clustered bernoulli case of 0 versus a mean of 0.30 for the high-clustered case) are not so high as those considered in the some of the simulations and theoretical work described in section 1, and this may partly account for the limited effect on epidemic outcomes that we find here. there is little known about the levels of clustering found in real contact networks [31] (though one recent detailed study [34] find values for clustering in a social contact network in the region 0.15-0.5) and no evidence to suggest that very extreme values of clustering are achieved for a given degree sequence. it is plausible, however, that the degree sequence of a social network of interest could be found-for example, via ego-centric or full-network sampling [34] [35] [36] -and therefore reasonable to explore the achievable levels of clustering conditional on the degree sequence. in doing so, we separate the effects on epidemic dynamics of change in the degree sequence of the contact network from those of clustering. from a statistical point of view, these results indicate that even with full data from a particular epidemic outbreak, such as complete knowledge of the transmission tree, it is unlikely that the level of clustering in the underlying contact network could be accurately inferred independently of the degree distribution. this is primarily due to the large stochastic variation found from one epidemic to the next that masks the relatively modest effects of clustering on an outbreak. with this much stochastic noise, we suggest that it would require data from many outbreaks over the same network (that is, pathogens with a similar mode of transmission spreading in the same population) to infer the clustering level of that network with any accuracy. the results also suggest that attempting to estimate a clustering parameter without either estimating or fixing the degree sequence, as in goudie [37] , may see the estimated clustering parameter acting chiefly a proxy for the degree sequence. 
it cannot be ruled out that a statistical method which takes into account the complete data rather than the summaries we use here, or which takes data from parts of the parameter space that we have not touched on here, could find some signal of clustering from such data. in practice, however, it would be highly unusual to have access to anything approaching complete data. a more realistic data set might include times of onset and recovery from disease symptoms for some individuals in the population and sequences taken from viral genetic material. the noise that characterises such data sets already makes it difficult to accurately reconstruct the transmission tree; this extra uncertainty would likely make any inference of a clustering parameter, in the absence of other information, very difficult. i thank david hunter, marcel salathé, mary poss and an anonymous referee for useful comments and references that improved this paper. this work is supported by nih grant r01-gm083603-01.
references:
1. a survey of statistical network models. foundations and trends in machine learning
2. network epidemiology: a handbook for survey design and data collection
3. the structure and function of complex networks
4. bayesian inference for stochastic epidemics in populations with random social structure. scand
5. bayesian inference for contact networks given epidemic data. scand
6. a network-based analysis of the 1861 hagelloch measles data
7. episodic sexual transmission of hiv revealed by molecular phylodynamics
8. integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus
9. statistical inference to advance network models in epidemiology
10. different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
11. evolutionary analysis of the dynamics of viral infectious disease
12. collective dynamics of small world networks
13. why social networks are different from other types of networks
14. properties of highly clustered networks
15. epidemics on random graphs with tunable clustering
16. comment on "properties of highly clustered networks"
17. analysis of a stochastic sir epidemic on a random network incorporating household structure
18. modelling disease spread through random and regular contacts in clustered populations
19. the implications of network structure for epidemic dynamics
20. percolation and epidemic thresholds in clustered networks
21. spread of infectious disease through clustered populations
22. percolation and epidemics in random clustered networks
23. the impact of network clustering and assortativity on epidemic behaviour. theor
24. disease spread in small-size directed networks: epidemic threshold, correlation between links to and from nodes, and clustering
25. the unreasonable effectiveness of tree-based theory for networks with clustering
26. on random graphs
27. statistical mechanics of complex networks
28. a critical point for random graphs with a given degree sequence
29. a package to fit, simulate and diagnose exponential-family models for networks, version 2.2-2
30. r development core team. r: a language and environment for statistical computing. r foundation for statistical computing
31. exploring biological network structure with clustered random networks
32. a package to fit, simulate and diagnose exponential-family models for networks
33. distributions of cherries for two models of trees
34. a high-resolution human contact network for infectious disease transmission
35. using data on social contacts to estimate age-specific transmission parameters for respiratory-spread infectious agents
36. social contacts and mixing patterns relevant to the spread of infectious diseases
37. what does a tree tell us about a network?
this article is an open access article distributed under the terms and conditions of the creative commons attribution license.
key: cord-191876-03a757gf authors: weinert, andrew; underhill, ngaire; gill, bilal; wicks, ashley title: processing of crowdsourced observations of aircraft in a high performance computing environment date: 2020-08-03 journal: nan doi: nan sha: doc_id: 191876 cord_uid: 03a757gf
as unmanned aircraft systems (uass) continue to integrate into the u.s. national airspace system (nas), there is a need to quantify the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. both regulators and standards developing organizations have made extensive use of monte carlo collision risk analysis simulations using probabilistic models of aircraft flight. we've previously determined that the observations of manned aircraft by the opensky network, a community network of ground-based sensors, are appropriate to develop models of the low altitude environment. this work overviews the high performance computing workflow designed and deployed on the lincoln laboratory supercomputing center to process 3.9 billion observations of aircraft. we then trained the aircraft models using more than 250,000 flight hours at 5,000 feet above ground level or below. a key feature of the workflow is that all the aircraft observations and supporting datasets are available as open source technologies or have been released to the public domain.
the continuing integration of unmanned aircraft system (uas) operations into the national airspace system (nas) requires new or updated regulations, policies, and technologies to maintain safe and efficient use of the airspace. to help achieve this, regulatory organizations such as the federal aviation administration (faa) and the international civil aviation organization (icao) mandate the use of collision avoidance systems to minimize the risk of a midair collision (mac) between most manned aircraft (e.g. 14 cfr § 135.180). monte carlo safety simulations and statistical encounter models of aircraft behavior [1] have enabled the faa to develop, assess, and certify systems to mitigate the risk of airborne collisions. these simulations and models are based on observed aircraft behavior and have been used to design, evaluate, and validate collision avoidance systems deployed on manned aircraft worldwide [2]. for assessing the safety of uas operations, the monte carlo simulations need to determine if the uas would be a hazard to manned aircraft. therefore, there is an inherent need for models that represent how manned aircraft behave. while various models have been developed for decades, many of these models were not designed to model manned aircraft behavior where uas are likely to operate [3]. in response, new models designed to characterize the low altitude environment are required.
in response, we previously identified and determined that the opensky network [4], a community network of ground-based sensors that observe aircraft equipped with automatic dependent surveillance-broadcast (ads-b) out, would provide sufficient and appropriate data to develop new models [5]. ads-b was initially developed and standardized to enable aircraft to leverage satellite signals for precise tracking and navigation [6, 7]. however, the previous work did not train any models. this work considered only aircraft observed by the opensky network within the united states and flying between 50 and 5,000 feet above ground level (agl). thus this work does not consider all aircraft, as not all aircraft are equipped with ads-b. the scope of this work was informed by the needs of the faa uas integration office, along with the activities of the standards development organizations of astm f38, rtca sc-147, and rtca sc-228. initial scoping discussions were also informed by the uas excom science and research panel (sarp), an organization chartered under the excom senior steering group; however, the sarp did not provide a final review of the research. we focused on two objectives identified by the aviation community to support integration of uas into the nas: first, to train a generative statistical model of how manned aircraft behave at low altitudes; and second, to estimate the relative frequency with which a uas would encounter a specific type of aircraft. these contributions are intended to support current and expected uas safety system development and evaluation and to facilitate stakeholder engagement to refine our contributions for policy-related activities. the primary contribution of this paper is the design and evaluation of the high performance computing (hpc) workflow used to train models and complete analyses that support the community's objectives. refer to previous work [5, 8] on how to use the results from this workflow. this paper focuses primarily on the use of the lincoln laboratory supercomputing center (llsc) [9] to process billions of aircraft observations in a scalable and efficient manner. we first briefly overview the storage and compute infrastructure of the llsc. the llsc and its predecessors have been widely used to process aircraft tracks and support aviation research for more than a decade. the llsc high-performance computing (hpc) systems have two forms of storage: distributed and central. distributed storage comprises the local storage on each of the compute nodes and is typically used for running database applications. central storage is implemented using the open source lustre parallel file system on a commercial storage array. lustre provides high performance data access to all the compute nodes, while maintaining the appearance of a single filesystem to the user. the lustre filesystem is used in most of the largest supercomputers in the world. of note, the block size of lustre is 1 mb; thus any file created on the llsc will take at least 1 mb of space. the processing described in this paper was conducted on the llsc hpc system [9]. the system consists of a variety of hardware platforms, but we specifically developed, executed, and evaluated our software using compute nodes based on dual-socket haswell (intel xeon e5-2683 v3 @ 2.0 ghz) processors. each haswell processor has 14 cores and can run two threads per core with intel hyper-threading technology. each haswell node has 256 gb of memory.
this section describes the high performance computing workflow and the results for each step. a shell script was used to download the raw data archives for a given monday from the opensky network. data was organized by day and hour. both the opensky network and our architecture create a dedicated directory for a given day, such as 2020-06-22. after extracting the raw data archives, up to 24 comma separated value (csv) files populate the directory; each hour in utc time corresponds to a specific file. however, there are a few cases where not every hour of the day was available. the files contain all the abstracted observations of all aircraft for that given hour. for a specific aircraft, observations are updated at least every ten seconds. for this paper, we downloaded 85 mondays spanning february 2018 to june 2020, totaling 2002 hours. the size of each hourly file was dependent upon the number of active sensors that hour, the time of day, the quantity of aircraft operations, and the diversity of the operations. across a given day, the hourly files can range in size by hundreds of megabytes, with the maximum file size between 400 and 600 megabytes. together, all the hourly files for a given day currently require about 5-9 gigabytes of storage. we observed that on average the daily storage requirement for 2019 was greater than for 2018. parsing, organizing, and aggregating the raw data for a specific aircraft required high performance computing resources, especially when organizing the data at scale. many aviation use cases require organizing data and building a track corpus for each specific aircraft. yet it was unknown how many unique aircraft were observed in a given hour and whether a given hourly file had any observations for a specific aircraft. to efficiently organize the raw data, we needed to address these unknowns. we identified unique aircraft by parsing and aggregating the national aircraft registries of the united states, canada, the netherlands, and ireland. registries were processed for each individual year for 2018-2020. all registries specified the registered aircraft's type (e.g. rotorcraft, fixed wing single-engine, etc.), the registration expiration date, and a globally unique hex identifier of the transponder equipped on the aircraft. this identifier is known as the icao 24-bit address [10], with (2^24 - 2) unique addresses available worldwide. some of the registries also specified the maximum number of seats for each aircraft. using the registries, we created a four-tier directory structure to organize the data. the highest level directory corresponds to the year, such as 2019. the next level was organized by twelve general aircraft types, such as fixed wing single-engine, glider, or rotorcraft. the third directory level was based on the number of seats, with each directory representing a range of seats. a dedicated directory was created for aircraft with an unknown number of seats. the lowest level directory was based on the sorted unique icao 24-bit addresses. for each seat-based directory, up to 1000 icao 24-bit address directories are created. additionally, to account for the fact that the four aircraft registries do not contain all registered aircraft globally, a second-level directory titled "unknown" was created and populated with directories corresponding to each hour of data. the top and bottom level directories remained the same as for the known aircraft types. the bottom directories for unknown aircraft are generated at runtime.
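a minimal sketch, under assumed seat-bucket boundaries and a hypothetical helper for the icao 24-bit address ranges, of how the four-tier directory path described above could be constructed for one registry entry; this is an illustration rather than the published code.

```python
import os

SEAT_BUCKETS = [(1, 10), (11, 100), (101, 1000)]  # assumed bucket boundaries

def seat_dir(seats):
    """Map a seat count to a directory name such as 'seats_001_010'."""
    if seats is None:
        return "seats_unknown"
    for lo, hi in SEAT_BUCKETS:
        if lo <= seats <= hi:
            return f"seats_{lo:03d}_{hi:03d}"
    return "seats_unknown"

def icao_bucket(icao24, boundaries):
    """Pick the icao24 range directory, e.g. 'a00c12_a00d20', from a sorted
    list of hex boundaries (lower bound inclusive, upper bound exclusive)."""
    value = int(icao24, 16)
    for lo, hi in zip(boundaries, boundaries[1:]):
        if int(lo, 16) <= value < int(hi, 16):
            return f"{lo}_{hi}"
    return "unknown"

def tier_path(root, year, aircraft_type, seats, icao24, boundaries):
    """Assemble year / type / seat bucket / icao range for one aircraft."""
    return os.path.join(root, str(year), aircraft_type,
                        seat_dir(seats), icao_bucket(icao24, boundaries))

# example with a hypothetical registry entry and boundary list
print(tier_path("organized", 2019, "rotorcraft", 4, "a00c45",
                ["a00c12", "a00d20", "a00ecf"]))
# organized/2019/rotorcraft/seats_001_010/a00c12_a00d20
```

treating the lower bound of each icao range as inclusive and the upper bound as exclusive mirrors the convention described above.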
this hierarchy ensures that there are no more than 1000 directories per level, as recommended by the llsc, while organizing the data to easily enable comparative analysis between years or different types of aircraft. the hierarchy was also sufficiently deep and wide to support efficient parallel process i/o operations across the entire structure. for example, a full directory path for the first three tiers of the directory hierarchy could be: "2020/rotorcraft/seats_001_010/." this directory would contain all the known unique icao 24-bit addresses for rotorcraft with 1-10 seats in 2020. within this directory would be up to 1000 directories, such as "a00c12_a00d20" or "a00d20_a00ecf". this lowest level directory would be used to store all the organized raw data for aircraft with an icao 24-bit address in that range. the first hex value was inclusive, but the second hex value was not. with a directory structure established, each hourly file was then loaded into memory, parsed, and lightly processed. observations with incomplete or missing position reports were removed, along with any observations outside a user-defined geographic polygon. the default polygon, illustrated by figure 1, was a convex hull with a buffer of 60 nautical miles around approximately north america, central america, the caribbean, and hawaii. units were also converted to u.s. aviation units. the country polygons were sourced from natural earth, a public domain map dataset [11]. specifically, for the 85 mondays across the three years, 2214 directories were generated across the first three tiers of the hierarchy and 802,159 directories were created in total across the entire hierarchy. of these, 770,661 directories were non-empty. the majority of the directories were created within the unknown aircraft type directories. as overviewed by tables 1 and 2, about 3.9 billion raw observations were organized, with about 1.4 billion observations available after filtering. there was a 15% annual increase in observations per hour from 2018 to 2019. however, a 50% decrease in the average number of observations per hour was observed when comparing 2020 to 2019; this could be attributed to the covid-19 pandemic. this worldwide pandemic sharply curtailed travel, especially travel between countries. this reduction in travel was reflected in the amount of data filtered using the geospatial polygon. in 2018 and 2019, about 41-44% of observations were filtered based on their location. however, only 27% of observations were filtered for march to june 2020. conversely, the share of observations removed due to quality control did not vary significantly across years, as 26%, 20%, and 25% were removed for 2018, 2019, and 2020, respectively. these results were generated using 512 cpus across 2002 tasks, where each task corresponded to a specific hourly file. tasks were uniformly distributed across cpus; a dynamic self-scheduling parallelization approach was not implemented. each task required on average 626 seconds to execute, with a median time of 538 seconds. the maximum and minimum times to complete a task were 2153 and 23 seconds. across all tasks, about 348 hours of total compute time was required to parse and filter the 85 days of data. it is expected that if the geospatial filtering were relaxed and observations from europe were not removed, the compute time would increase due to increased demands on creating and writing to hourly files for each aircraft.
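the geospatial filter described above can be sketched roughly as follows; the column names, the crude one-degree approximation of a 60 nautical mile buffer, the hypothetical file name, and the use of shapely and pandas are assumptions made for illustration, not the authors' implementation.

```python
import pandas as pd
from shapely.geometry import MultiPoint, Point
from shapely.prepared import prep

def build_filter_polygon(boundary_points, buffer_deg=1.0):
    """Convex hull of the country boundary points, buffered by roughly
    60 nautical miles (about one degree of latitude); the degree-based
    buffer is a crude stand-in used only for this sketch."""
    hull = MultiPoint(boundary_points).convex_hull
    return prep(hull.buffer(buffer_deg))

def filter_hourly_file(csv_path, polygon):
    """Drop rows with missing positions or positions outside the polygon."""
    df = pd.read_csv(csv_path)
    df = df.dropna(subset=["lat", "lon"])          # assumed column names
    inside = df.apply(lambda row: polygon.contains(Point(row["lon"], row["lat"])),
                      axis=1)
    return df[inside]

# usage sketch with invented boundary points and a hypothetical file name
poly = build_filter_polygon([(-170.0, 15.0), (-50.0, 15.0),
                             (-50.0, 75.0), (-170.0, 75.0)])
# kept = filter_hourly_file("2020-06-22/hour_13.csv", poly)
```

preparing the hull geometry once lets the containment test be reused cheaply across the millions of rows in each hourly file.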
since files were created for every hour for each unique aircraft, tens of millions of small files less than 1 megabyte in size were created. this was problematic as small files typically use a single object storage target, thus serializing access to the data. additionally, in a cluster environment, hundreds or thousands of concurrent, parallel processes accessing small files can lead to large amounts of random i/o for file access and generate massive amounts of network traffic. this results in increased latency for file access and higher network traffic, significantly slows down i/o, and consequently degrades overall application performance. while this approach to data organization may provide acceptable performance on a laptop or desktop computer, it was unsuitable for use in a shared, distributed hpc system. in response, we created zip archives for each of the bottom directories. in a new parent directory, we replicated the first three tiers of the directory hierarchy from the previous step. then, instead of creating directories based on the icao 24-bit addresses, we archived each directory of hourly csv files from the previous organization step. we then removed the hourly csv files from storage. this was achieved using llmapreduce [12], with a task created for each of the 770,661 non-empty bottom level directories. similar to the previous organization step, all tasks were completed in a few hours but with no optimization for load balancing. the performance of this step could be improved by distributing tasks based on the number of files in the directories or the estimated size of the output archive. a key advantage of archiving the organized data is that the archives can be updated with new data as it becomes available. if the geospatial filtering parameters and aircraft registry data do not change, only new opensky data needs to be organized. once organized into individual csv files, llmapreduce can be used again to update the existing archives. this substantially reduces the computational and storage requirements to process new data. the archived data can now be segmented, have outliers removed, and be interpolated. additionally, above ground level altitude was calculated, airspace class was identified, and dynamic rates (e.g. vertical rate) were calculated. we also split the raw data into track segments based on unique position updates and time between updates. this ensures that each segment does not include significantly interpolated or extrapolated observations. track segments with fewer than ten points are removed. figure 2 illustrates the track segments for an faa-registered fixed wing multi-engine aircraft from march to june 2020. note that segment length can vary from tens to hundreds of nautical miles. track segment length was dependent upon the aircraft type, availability of active opensky network sensors, and nearby terrain. however, the ability to generate track segments that span multiple states represents a substantial improvement over previous processing approaches for the development of aircraft behavior models. then for each segment we detect altitude outliers using a 1.5 scaled median absolute deviations approach and smooth the track using a gaussian-weighted average filter with a 30-second time window. dynamic rates, such as acceleration, are calculated using a numerical gradient. outliers are then detected and removed based on these rates. outlier thresholds were based on aircraft type.
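a rough sketch of the per-segment cleaning just described: scaled-mad altitude outlier removal, a gaussian-weighted moving average over roughly 30 seconds, a type-specific speed threshold, and resampling to a one-second interval. the column names, the pandas/numpy implementation and the exact scaling constant are assumptions, so this should be read as an illustration rather than the published workflow.

```python
import numpy as np
import pandas as pd

def gaussian_smooth(values, window_s=30, dt_s=10, std_s=10):
    """Gaussian-weighted moving average; window and std in seconds,
    dt_s is the nominal report interval (assumed to be about 10 s)."""
    half = max(int(window_s / (2 * dt_s)), 1)
    offsets = np.arange(-half, half + 1) * dt_s
    weights = np.exp(-0.5 * (offsets / std_s) ** 2)
    weights /= weights.sum()
    return np.convolve(values, weights, mode="same")

def clean_and_resample(seg, mad_scale=1.5, speed_limit_kt=250.0):
    """seg: DataFrame with a DatetimeIndex and columns 'alt_ft', 'speed_kt'
    (assumed names). Returns a cleaned track resampled to one second."""
    # 1. altitude outliers via a 1.5-scaled median absolute deviation test
    alt = seg["alt_ft"]
    mad = (alt - alt.median()).abs().median()
    seg = seg[(alt - alt.median()).abs() <= mad_scale * 1.4826 * mad]

    # 2. gaussian-weighted smoothing of altitude over roughly 30 seconds
    seg = seg.copy()
    seg["alt_ft"] = gaussian_smooth(seg["alt_ft"].to_numpy())

    # 3. type-specific dynamic-rate outliers (speed shown as the example)
    seg = seg[seg["speed_kt"] <= speed_limit_kt]

    # 4. resample and interpolate to a regular one-second interval
    return seg.resample("1s").mean().interpolate(method="time")
```

the default speed limit of 250 knots anticipates the rotorcraft example discussed next; other aircraft types would pass a different limit.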
for example, speeds greater than 250 knots were considered outliers for rotorcraft, while fixed wing multi-engine aircraft had a threshold of 600 knots. the tracks were then interpolated to a regular one-second interval. lastly, we estimated the above ground level altitude using digital elevation models. this altitude estimation was the most computationally intensive component of the entire workflow. it consists of loading into memory and interpolating srtm3 or noaa globe [13] digital elevation models (dems) to determine the elevation for each interpolated track segment position. to reduce the computational load prior to processing the terrain data, a c++ based polygon test was used to identify which track segment positions were over the ocean, as defined by natural earth data. points over the ocean are assumed to have an elevation of 0 feet mean sea level, and their elevations are not estimated using the dems. for the 85 days of organized data, approximately 900,000,000 interpolated track segments were generated. for each aircraft in a given year, a single csv was generated containing all the computed segments. in total across the three years, 619,337 files were generated. as these files contained significantly more rows and columns than when organizing the raw data, the majority of these final files were greater than 1 mb in size. the output of this step did not face any significant storage block size challenges. similar to the previous step, tasks were created based on the bottom tier of the directory hierarchy. specifically, for processing, parallel tasks were created for each archive. during processing, archives were extracted to a temporary directory while the final output was stored in standard memory. given the processed data, this section overviews two applications that exploit and disseminate the data to inform and support the aviation safety community. as the aircraft type was identified when organizing the raw data, it was a straightforward task to estimate the observed distribution of aircraft types per hour. these distributions are not reflective of all aircraft operations in the united states, as not all aircraft are observed by the opensky network. the distributions were also calculated independently for each aircraft type, so the yearly (row) percentages may not sum to 100%. furthermore, the relatively low percentage of unknown aircraft was due to the geospatial filtering applied when organizing the raw data. if the same aircraft registries were used but the filtering were changed to include only tracks in europe, the percentage of unknown aircraft would likely rise significantly. this analysis can be extended by identifying specific aircraft manufacturers and models, such as the boeing 777. however, the manufacturer and model information are not consistent within an aircraft registry nor across different registries. for example, entries of "cessna 172," "textron cessna 172," and "textron c172" all refer to the same aircraft model. one possible explanation for the differences between entries is that cessna used to be an independent aircraft manufacturer and was eventually acquired by textron. depending on the year of registration, the name of the aircraft may differ but the size and performance of the aircraft remain constant. since over 300,000 aircraft with unique icao 24-bit addresses were identified annually across the aircraft registries, parsing and organizing the aircraft models can be formulated as a traditional natural language processing problem.
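as a toy illustration of the kind of normalisation and fuzzy matching this could involve, the following sketch uses python's standard difflib; the canonical label list, alias rules and threshold are invented for the example and are not the authors' method, which they identify only as future work.

```python
from difflib import SequenceMatcher

CANONICAL_MODELS = ["cessna 172", "boeing 777", "robinson r44"]  # toy list

def normalise(entry):
    """Lower-case and strip a couple of known manufacturer aliases."""
    entry = entry.lower().replace("textron ", "").replace("c172", "cessna 172")
    return " ".join(entry.split())

def best_match(entry, candidates=CANONICAL_MODELS, threshold=0.8):
    """Return the canonical model with the highest similarity ratio,
    or None if nothing clears the (arbitrary) threshold."""
    cleaned = normalise(entry)
    scored = [(SequenceMatcher(None, cleaned, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

for raw in ["Cessna 172", "Textron Cessna 172", "Textron C172"]:
    print(raw, "->", best_match(raw))
```

in practice the candidate list would come from a curated set of manufacturer and model names rather than the three toy entries shown here.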
parsing the aircraft registries differs from the common problem of parsing aviation incident or safety reports [14, 15, 16] due to the reduced word count and the structured format of the registries. future work will focus on using fuzzy string matching to identify similar aircraft. for many aviation safety studies, manned aircraft behavior is represented using mit lincoln laboratory encounter models. each encounter model is a bayesian network, a generative statistical model that mathematically represents aircraft behavior during close or safety critical encounters, such as near midair collisions. the development of the modern models started in 2008 [1], with significant updates in 2013 [17] and 2018 [18]. all the models were trained using the llsc [9] or its predecessors. the most widely used of these models were trained using observations collected by ground-based secondary surveillance radars from the 84th radar evaluation squadron (rades) network. aircraft observations by the rades network are based on mode 3a/c, an identification friend or foe technology that provides less metadata than ads-b. notably, aircraft type or model cannot be explicitly correlated or identified with specific aircraft tracks. instead, we filtered the rades observations based on the flying rules reported by the aircraft. however, this type of filtering is not unique to the rades data; it is also supported by the opensky network data. additionally, due to the performance of the rades sensors, we filtered out any observations below 500 feet agl due to position uncertainties associated with radar time of arrival measurements. observations of ads-b equipped aircraft by the opensky network differ because ads-b enables aircraft to broadcast their own estimate of their location, which is often based on precise gnss measurements. the improved position reporting of ads-b enabled the new opensky network-based models to be trained with an altitude floor of 50 feet agl, instead of 500. specifically, three new statistical models of aircraft behavior were trained, one for each of the aircraft types of fixed wing multi-engine, fixed wing single-engine, and rotorcraft. a key advantage of these models is the data reduction and dimensionality reduction. a model was created for each of the three aircraft types and stored as a human readable text file. each file requires approximately 0.5 megabytes. this is a significant reduction from the hundreds of gigabytes used to store the original 85 days of data. table iv reports the quantity of data used to train each model. for example, the rotorcraft model was trained from about 25,000 flight hours over 85 days. however, like the rades-based model, these models do not represent the geospatial or temporal distribution of the training data. for example, a limitation of these models is that they do not inform whether more aircraft were observed in new york city than in los angeles. figures comparing the new models with the rades-based model [17] illustrate how different aircraft behave, such as rotorcraft flying relatively lower and slower than fixed wing multi-engine aircraft. also note that the rades-based model has no altitude observations below 500 feet agl, whereas 18% of the approximately 25,000 rotorcraft flight hours were observed at 50-500 feet agl. it has not been assessed if the opensky network-based models can be used as surrogates for other aircraft types or operations.
additionally, the new models do not fully supersede the existing rades-based models, as each model represents different varieties of aircraft behavior. on github.com, please refer to the mit lincoln laboratory (@mit-ll) and airspace encounter models (@airspace-encounter-models) organizations.
references:
1. airspace encounter models for estimating collision risk
2. safety analysis of upgrading to tcas version 7.1 using the 2008 u.s. correlated encounter model
3. well-clear recommendation for small unmanned aircraft systems based on unmitigated collision risk
4. bringing up opensky: a large-scale ads-b sensor network for research
5. developing a low altitude manned encounter model using ads-b observations
6. vision on aviation surveillance systems
7. ads-mode s: initial system description
8. representative small uas trajectories for encounter modeling
9. interactive supercomputing on 40,000 cores for machine learning and data analysis
10. mode s: an introduction and overview (secondary surveillance radar)
11. introducing natural earth data, naturalearthdata.com
12. llmapreduce: multi-level map-reduce for high performance data analysis
13. the global land one-kilometer base elevation (globe) digital elevation model, version 1.0
14. using structural topic modeling to identify latent topics and trends in aviation incident reports
15. temporal topic modeling applied to aviation safety reports: a subject matter expert review
16. ontologies for aviation data management. ieee/aiaa 35th digital avionics systems conference (dasc)
17. uncorrelated encounter model of the national airspace system, version 2.0
18. correlated encounter model for cooperative aircraft in the national airspace system version 2.0
we greatly appreciate the support and assistance provided by sabrina saunders-hodge, richard lin, and adam hendrickson from the federal aviation administration. we also would like to thank fellow colleagues dr. rodney cole, matt edwards, and wes olson.
key: cord-290033-oaqqh21e authors: georgalakis, james title: a disconnected policy network: the uk's response to the sierra leone ebola epidemic date: 2020-02-13 journal: soc sci med doi: 10.1016/j.socscimed.2020.112851 sha: doc_id: 290033 cord_uid: oaqqh21e
this paper investigates whether the inclusion of social scientists in the uk policy network that responded to the ebola crisis in sierra leone (2013–16) was a transformational moment in the use of interdisciplinary research. in contrast to the existing literature, which relies heavily on qualitative accounts of the epidemic and ethnography, this study tests the dynamics of the connections between critical actors with quantitative network analysis. this novel approach explores how individuals are embedded in social relationships and how this may affect the production and use of evidence. the meso-level analysis, conducted between march and june 2019, is based on the traces of individuals' engagement found in secondary sources. source material includes policy and strategy documents, committee papers, meeting minutes and personal correspondence. social network analysis software, ucinet, was used to analyse the data and netdraw for the visualisation of the network. far from being one cohesive community of experts and government officials, the network of 134 people was weakly held together by a handful of super-connectors. social scientists' poor connections to the government-embedded biomedical community may explain why they were most successful when they framed their expertise in terms of widely accepted concepts.
the whole network was geographically and racially almost entirely isolated from those affected by or directly responding to the crisis in west africa. nonetheless, the case was made for interdisciplinarity and the value of social science in emergency preparedness and response. the challenge now is moving from the rhetoric to action on complex infectious disease outbreaks in ways that value all perspectives equally. global health governance is increasingly focused on epidemic and pandemic health emergencies that require an interdisciplinary approach to accessing scientific knowledge to guide preparedness and crisis response. of acute concern is zoonotic disease, which can spread from animals to humans and easily cross borders. the "grave situation" of the chinese coronavirus (covid-19) outbreak seems to have justified these fears and is currently the focus of an international mobilisation of scientific and state resources (wood, 2020). covid-19 started in wuhan, the capital of china's hubei province, and has been declared a public health emergency of international concern (pheic) by the world health organisation (who). the interactions currently taking place, nationally and internationally, between evidence, policy and politics are complex and relate to theories around the role of the researcher as broker or advocate and the form and function of research policy networks (pielke, 2007) and (ward et al., 2011) and (georgalakis and rose, 2019). in this paper i seek to explore these areas further through the lens of the uk's response to ebola in west africa. this policy context has been selected in relation to the division of the affected countries between key donors. the british government assumed responsibility for sierra leone and sought guidance from health officials, academics, humanitarian agencies and clinicians. the ebola epidemic that struck west africa in 2013 has been described as a "transformative moment for global health" (kennedy and nisbett, 2015, p.2), particularly in relation to the creation of a transdisciplinary response that was meant to take into account cultural practices and the needs of communities. the mobilisation of anthropological perspectives towards enhancing the humanitarian intervention was celebrated as an example of research impact by the uk's economic and social research council (esrc) and department for international development (dfid) (esrc, 2016). an eminent group of social scientists called for future global emergency health interventions to learn from this critical moment of interdisciplinary cooperation and mutual understanding (s. a. abramowitz et al., 2015). however, there has been much criticism of this narrative, ranging from the serious
there are two questions i hope to address through a critical commentary on the events that unfolded and with social network analysis of the uk based research and policy network that emerged: i) how transformational was the uk policy response to ebola in relation to changes in evidence use patterns and behaviours? ii) how does the form and function of the uk policy network relate to epistemic community theory? the first question will explore the degree to which social scientists and specifically anthropologists and medical anthropologists, were incorporated into the uk policy network. the second question seeks to locate the dynamics of this network in the literature on network theory and the role of epistemic communities in influencing policy during emergencies. the paper does not attempt to evidence the impact of anthropology in the field or take sides in hotly debated issues such as support for home care. instead, it looks at how individuals are embedded in social relationships and how this may affect the production and use of evidence (victor et al., 2017) . the emerging field of network analysis around the generation and uptake of evidence in policy, recommends this critical realist constructivist methodology. it utilises interactive theories of evidence use, the study of whole networks and the analysis of the connections between individuals in policy and research communities (nightingale and cromby, 2002; oliver and faul, 2018) . although ebola related academic networks have been mapped, this methodological approach has never previously been applied to the policy networks that coalesced around the international response. hagel et al. show how research on the ebola virus rapidly increased during the crisis in west africa and identified a network of institutions affiliated through co-authorship. unfortunately, their data tell us very little about the type of research being published and how it was connected into policy processes (hagel et al., 2017) . in contrast, this paper seeks to inform the ongoing movements promoting interdisciplinarity as key to addressing global health challenges. zoonotic disease has been the subject of particular concerns around the, "connections and disconnections between social, political and ecological worlds" (bardosh, 2016, p. 232) . with the outbreak of covid-19 in china at the end of 2019, its rapid spread overseas and predictions of more frequent and more deadly pandemics and epidemics in the future, the importance of breaking down barriers between policy actors, humanitarians, social scientists, doctors and medical scientists can only increase with time. before we look at detailed accounts of events relating to the uk policy network, first we must consider what the key policy issues were relating to an anthropological response versus a purely clinical one. anthropological literature exists, from previous outbreaks, documenting the cultural practices that affected the spread of ebola (hewlett and hewlett, 2007) . the main concerns relate to how local practices may accelerate the spread of the virus and the need to address these in order to lower infection rates. ebola is highly contagious, particularly from contamination by bodily fluids. in west africa, many local customs exist around burial practices that clinicians believe heighten the risk to communities. common characteristics of these are, the washing of bodies by family members, passing clothing belonging to the deceased to family and the touching of the body (richards, 2016) . 
another concern, as the crisis unfolded, was people attempting to provide home care to victims of the virus. the clinical response was to create isolation units or ebola treatment units (etus) in which to assess and treat suspected cases (west & von saint andré-von arnim, 2014) . community based care centres were championed by the uk government but their deployment came late and opinion was divided around their effectiveness. clinicians regarded etus as an essential part of the response and wanted to educate people to discourage them from engaging in what they regarded as deeply unsafe practices, including home care (walsh and johnson, 2018) and (msf, 2015) . anthropologists with expertise in the region focused instead on engaging communities more constructively, managing stigma and understanding local behaviours and customs (fairhead, 2014) , (richards, 2014b) and (berghs, 2014) . anthropologist, paul richards, argues that agencies' and clinicians' lack of understanding of local customs worsened the crisis (richards, 2016) and that far from being ignorant and needing rescuing from themselves, communities had coping strategies of their own. his studies from sierra leone and liberia relate how some villages isolated themselves, created their own burial teams and successfully protected those who came in contact with suspected cases with makeshift protective garments (richards, 2014a) . anthropologists working in west africa during the epidemic prioritised studies of social mobilisation and community engagement and worked with communities directly on ebola transmission. sharon abramowitz, in her review of the anthropological response across guinea, liberia and sierra leone, provides examples from the field work of chiekh niang (fleck, 2015) , sylvain faye, juliene anoko, almudena mari saez, fernanda falero, patricia omidian, several medicine sans frontiers (msf) anthropologists and others (s. abramowitz, 2017) . however, abramowitz argues that learning generated by these ethnographic studies was largely ignored by the mainstream response. however, not everyone has welcomed the intervention of the international anthropological community. some critics have argued that social scientists in mostly european and north american universities were poorly suited to providing sound advice given their lack of familiarity with field-based operations. adia benton suggests that predominantly white northern anthropologists have an "inflated sense of importance" that led them to exaggerate the relevance of their research. this in turn helped reinforce concepts of "superior northern knowledge" (benton, 2017, p. 520 ). this racial optic seems to contradict the portrayal of plucky anthropologists being the victims of knowledge hierarchies that favour other knowledges over their own. our focus here, on the mobilisation of knowledge from an international community of experts, recommends that we consider how this can be understood in relation to group dynamics as well as individual relationships. particularly relevant is peter haas' theory of epistemic communities. haas helped define epistemic communities and how they differ from other policy communities, such as interest groups and advocacy coalitions (haas, 1992) . they share common principles and analytical and normative beliefs (causal beliefs). they have an authoritative claim to policy relevant expertise in a particular domain and haas claims that policy actors will routinely place them above other interest groups in their level of expertise. 
he believes that epistemic communities and smaller more temporary collaborations within them, can influence policy. he observes that in times of crisis and acute uncertainty, policy actors often turn to them for advice. the emergence of an epistemic community focused on the uk policy response was framed by the division of the affected countries between key donors along historic colonial lines. namely, the uk was to lead in sierra leone, the united states in liberia and the french in guinea. this seems to have focused social scientists in the uk on engaging effectively with a government and wider scientific community who seemed to want to draw on their expertise. this was a relatively close-knit community of scholars who already worked together, co-published and cited each other's work and in many cases worked in the same academic institutions. crucially, their ranks were swelled by a small number of epidemiologists and medical anthropologists who shared their concerns. from the time msf first warned the international community of an unprecedented outbreak of ebola in guinea at the end of march 2014, it was six months before an identifiable and organised movement of social scientists emerged (msf, 2015) . things began to happen quickly when the who announced in early september of that year that conventional biomedical responses to the outbreak were failing (who, 2014a) . this acted like a siren call to social scientists incensed by the reported treatment of local communities and the way in which a narrative had emerged blaming local customs and ignorance for the rapid spread of the virus. british anthropologist, james fairhead, hastily organised a special panel on ebola at the african studies association (asa) annual conference, that was taking place at the university of sussex (uos) on the september 10, 2014. amongst the panellists were: anthropologist melissa leach, director of the institute of development studies (ids); audrey gazepo, university of ghana, medical anthropologist melissa parker from the london school of hygiene and tropical medicine (lshtm); anthropologist and public health specialist, anne kelly from kings college london and stefan elbe, uos. informally, after the conference, this group discussed the idea of an online repository or platform for the supply of regionally relevant social science (f. martineau et al., 2017) . this would later become the ebola response anthropology platform (erap). in the days and weeks that followed it was the personal and professional connections of these individuals that shaped the network engaging with the uk's intervention. just two days after the emergency panel at the asa, jeremy farrar, director of the wellcome trust, convened a meeting of around 30 public health specialists and researchers, including leach, on the uk's response to the epidemic. discussions took place on the funding and organisation of the anthropological response. the government was already drawing on the expertise and capacity of public health england (phe), the ministry of defence (mod) and the department of health (doh), to drive its response but social scientists had no seat at the table. the government's chief medical officer (cmo) sally davies called a meeting of the ebola scientific assessment and response group (esarg), on the 19th september, focused on issues which included community transmission of ebola. leach's inclusion as the sole anthropologist was largely thanks to farrar and chris whitty, dfid's chief scientific advisor (m leach, 2014). 
there was already broad acceptance of the need for the response to focus on community engagement and the who had been issuing guidance on how to engage and what kind of messaging to use for those living in the worst affected areas (who, 2014c) . in their account of these events three of the central actors from the uk's anthropological community describe how momentum gathered quickly and that: "it felt as if we were pushing at an open door" (f. martineau et al., 2017, 481) . by the following month, the uk's coalition government was embracing its role as the leading bilateral donor in sierra leone and wanted to raise awareness and funds from other governments and foundations. a high level conference: defeating ebola in sierra leone, had been quickly organised, in partnership with the sierra leone government, at which an international call for assistance was issued (dfid, 2014) . it was shortly after this that the cmo, at the behest of the government's cabinet office briefing room (cobra), formed the scientific advisory group for emergencies on ebola (sage). by its first meeting on the october 16, 2014, british troops were on the ground along with volunteers from the uk national health service (nhs) (stc, 2016). leach was pulled into this group along with most of the members of esarg that had met the previous month. it was decided in this initial meeting to set up a social science sub-group including whitty, leach and the entire steering group of the newly established erap (sage, 2014a). this included not just british-based anthropologists but also paul richards and esther mokuwa from njala university, sierra leone. from this point anthropologists appeared plugged into the government's architecture for guiding their response. there were several modes for the interaction between social scientists and policy actors that focused on the uk led response. firstly, there were the formal meetings of committees or other bodies that were set up to directly advise the uk government in london. secondly, there were the multitude of ad-hoc interactions, conversations, meetings and briefings, some of which were supported with written reports. then, there was the distribution of briefings, reports and previously published works by erap which included use of the pre-existing health, education advice and resource team (heart) platform, which already provided bespoke services to dfid in the form of a helpdesk (heart, 2019). erap was up and running by the 14th october and during the crisis the platform published around 70 open access reports which were accessed by over 16,000 users (erap, 2016). there were also a series of webinars and workshops and an online course (lshtm, 2015) . according to ids and lshtm's application to the esrc's celebrating impact awards (m. leach et al., 2016) , the policy actors that participated in these interactions included: uk government officials in dfid's london head quarters and its sierra leone country office, in the mod and the government's office for science (go-science). closest of all to the prime minister and the cabinet office was sage. they also communicated with international non-governmental organisations (ingos) like help aged international and christian aid who requested briefings or meetings. erap members advised the who via three core committees, as well as the united nations mission for ebola emergency response (unmeer) and the united nations food and agricultural organisation (unfao). 
by the end of the crisis members of erap had given written and oral evidence to three separate uk parliamentary inquiries. these interactions were not entirely limited to policy audiences. erap members also contributed to the design of training sessions and a handbook on psychosocial impact of ebola delivered to all the clinical volunteers from the nhs prior to their deployment from december 2014 onwards (redruk, 2014). the way in which anthropologists engaged in policy and practice seemed to reflect an underlying assumption that they would work remotely to the response and engage primarily with the uk government, multilaterals and ingos. a strength of this approach, apart from the obvious personal safety and logistical implications, was that anthropologists enjoyed a proximity to key actors in london. face to face meetings could be held and committees joined in person (f. martineau et al., 2017) . a good example of a close working relationship that required a personal interaction were the links built with two policy analysts working in the mod. not even dfid staff had made this connection and it was thanks to a member of the erap steering committee that one of these officials was able to join the sage social science subcommittee and provide a valuable connection back into the ministry (martineau et al., 2017) . with proximity to the uk government in london came distance from the policy professionals and humanitarians in sierra leone. just 3% of erap's initial funding was focused on field work. although, this later went up and comparative analysis on resistance in guinea and sierra leone and between ebola and lassa fever was undertaken (wilkinson and fairhead, 2017) , as well as a review of the disaster emergency committee (dec) crisis appeal response (oosterhoff, 2015) . there was also an evaluation of the community care centres and additional funding from dfid supported village-level fieldwork by erap researchers from njala university, leading to advice to social mobilisation teams. nonetheless, the network's priority was on giving advice to donors and multilaterals, albeit at a great distance from the action. this type of intervention has not escaped accusations of "armchair anthropology" (delpla and fassin, 2012) in (s. abramowitz, 2017, p. 430 ). rather than relying solely on this qualitative account, drawn largely from those directly involved in these events, social network analysis (sna) produces empirical data for exploring the connections between individuals and within groups (crossley and edwards, 2016) . it is a quantitative approach rooted in graph theory and covers a range of methods which are frequently combined with qualitative methods (s. p. borgatti et al., 2018) . in this case, the network comprises of nodes who are the individuals identified as being directly involved in some of the key events just described. a second set of nodes are the events or interactions themselves. content analysis of secondary sources linked to these events provides an unobtrusive method for identifying in some detail the actors who will have left traces of their involvement. sna allows us to establish these actors' ties to common nodes (they were part of the same committee or event or contributed to the same reports.) furthermore, we can assign non-network related attributes to each of our nodes such as gender, location, role and organisation affiliation type. 
not only does this approach provide a quantitative assessment of who was involved and through which channels, but the mathematical foundations of sna allow for whole network analysis of cohesion across certain groups. you may calculate levels of homophily (the tendency of individuals to associate with similar others) between genders, disciplines and organisational types and identify sub-networks and the super-connectors that bridge them (s. p. borgatti et al., 2018). the descriptive and statistical examination of graphs provides a means with which to test a hypothesis and associated network theory that is concerned with the entirety of the social relations in a network and how these affect the behaviour of individuals within them (stovel and shaw, 2012) and (ward et al., 2011). the quantitative analysis of secondary sources was conducted between march and june 2019, utilising content analysis of artefacts which included reports, committee papers, public statements, policy documentation and correspondence. sna software, ucinet, was used to analyse nodes and ties and netdraw for the visualisation of the network (s. p. borgatti et al., 2018). the source material is limited to artefacts relating to the uk government's response to the ebola outbreak in sierra leone. the focal events were selected with reference to criteria that included the apparent prominence or influence of these groups on the uk's response to the crisis and the remit of these groups to focus on the social response, as opposed to the purely clinical one. tracing the events and policy moments which reveal how individual social scientists engaged with the ebola crisis from mid-2014 requires one to look well beyond academic literature. whilst some of this material is openly available, a degree of insider knowledge is required to identify who the key actors were and the modes of their engagement. this is partly a reflection of a sociological approach to policy research that treats networks, only partially visible in the public domain, as a social phenomenon (gaventa, 2006). the calculation of network homogeneity (how interconnected the network is), the identification of cliques or sub-networks and the centrality of particular nodes can be mathematically stable measures of the function of the network. however, the reliability of this study mainly rests on its validity. the assignment of attributes is in some cases fairly subjective. whereas gender and location are verifiable, the choice of whether an individual is an international policy actor or a national policy actor must be inferred from their official role during the crisis period. sometimes this can be based on the identity of their home institution. given dfid's central focus on overseas development assistance, its officials have been classified as internationals, rather than nationals. in some cases, individuals may be qualified clinicians or epidemiologists, but their role in the crisis may have been primarily policy related and not medical or scientific. therefore, they are classified as policy actors, not scientists. other demographic attributes could have been identified, such as race and age, which would have enabled more options for data analysis. a key factor here is the use of a two-mode matrix that identifies connections via people participating in the same events or forums, rather than direct social relationships such as friendship. therefore, measurement validity is largely determined by whether connections of this type can be used to determine how knowledge and expertise flow between individuals.
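a minimal sketch of how such a two-mode, person-by-event structure and its node attributes can be represented; the names and attribute values below are invented placeholders rather than individuals from the study, and networkx is used purely for illustration (the study itself used ucinet and netdraw).

```python
import networkx as nx
from networkx.algorithms import bipartite

# two-mode (bipartite) network: people in one node set, focal events in the other
people = {
    "researcher_a": {"gender": "f", "location": "north", "role": "social science"},
    "official_b":   {"gender": "m", "location": "north", "role": "policy"},
    "clinician_c":  {"gender": "f", "location": "south", "role": "scientist other"},
}
events = ["sage_social_science_subgroup", "erap_report"]

g = nx.Graph()
for person, attrs in people.items():
    g.add_node(person, bipartite="person", **attrs)
g.add_nodes_from(events, bipartite="event")

# a tie records that a person left a trace of involvement in an event
g.add_edges_from([
    ("researcher_a", "sage_social_science_subgroup"),
    ("official_b", "sage_social_science_subgroup"),
    ("researcher_a", "erap_report"),
    ("clinician_c", "erap_report"),
])

# the two-mode incidence matrix: rows are people, columns are events
incidence = bipartite.biadjacency_matrix(g, row_order=list(people),
                                         column_order=events)
print(incidence.toarray())
```

a matrix of this person-by-event shape corresponds to the two-mode data described above, and keeping the demographic attributes on the person nodes makes later homophily and brokerage calculations straightforward.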
to mitigate the risk that this co-occurrence measurement fails to capture knowledge exchange toward policy processes, particular care was taken with the sampling of focal events used to generate the network. the majority of errors in sna relate to the omission of nodes or ties. fig. 1 sets out the advantages and disadvantages of each of the selected events and the data artefacts used to identify associated individuals. i am aware that some critics might take exception to my choice of network. it is sometimes suggested that by focusing on northern dominated networks or the actions of bilaterals and multilaterals, you simply reinforce coloniality and a racist framing of development and aid (richardson et al., 2016) and (richardson, 2019). however, there is a valid, even essential, purpose here. only by seeking to understand the politics of knowledge and the social and political dynamics of global health and humanitarian networks can we challenge injustice and historically reinforced narratives that favour some perspectives over others. the secondary sources identify 134 unique individuals, all but five of whom can be identified by name. four types of attribute are assigned to these nodes: gender, location (global north or south), organisation type and organisational role. attributes have been identified through an internet search of institutional websites, linkedin and related online resources. role and organisation type are recorded for the period of the crisis. the total number of nodes given at the bottom of fig. 2 is slightly lower due to the anonymity of five individuals whose gender and role could not be established. looking at this distribution of attributes across the whole network, one can make the following observations in relation to how prominently different characteristics are represented:
i. females slightly outnumber males in the social science category but there are twice as many male 'scientists other' as female. they are a combination of clinicians, virologists, epidemiologists and other biomedical expertise.
ii. there are just nine southern based nodes out of a total of 134 and none of these are policy makers or practitioners. this is racially and geographically a northern network with just a sliver of west african perspectives. these included yvonne aki-sawyerr, veteran ebola campaigner and current mayor of freetown, four academics from njala university and development professionals working in the sierra leone offices of agencies such as the unfao.
iii. although 'scientists other' only just outnumber social scientists, this is heavily skewed by one of the eight interaction nodes, the lessons for development conference, which was primarily a learning event and not part of the advisory processes around the response. many individuals who participated in this event are not active in any of the other seven interactions. if we remove these non-active nodes from the network, we are left with just 23 social scientists compared to 32 'scientists other'. the remaining core policy network of 77 individuals appears to be weighted towards the biomedical sciences.
netdraw's standardised graph layout algorithm has been used in fig. 3 to optimise distances between nodes, which helps to visualise cohesive sub-groups or sub-networks and produces a less cluttered picture (s. p. borgatti et al., 2018). however, it should be noted that graph layout algorithms provide aesthetic benefits at the expense of attribute-based or values-based accuracy.
the exact lengths of ties between nodes and their positions do not correspond exactly to the quantitative data. we can drag one of these nodes into another position to make it stand out more clearly without changing its mathematical or sociological properties (s. p. borgatti et al., 2018). we can see in this graph layout the clustering of the eight interactive nodes or focal events and observe some patterns in the attributes of the nodes closest to them. the right-hand side is heavily populated with social scientists. as mentioned above, this is influenced by the lessons for development event. as you move to the left side, fewer social scientists are represented and they are outnumbered by other disciplines. the state owned or driven interactions, such as sage and parliamentary committees, appear on this left side and the anthropological epistemic community driven or owned interactions, such as erap reports and lessons for development, appear on the right side. the apparent connectors or bridges are in the centre. these bridges can be conceptualised as both focal events, including the erap steering committee, the sage social science sub-committee and the asa ebola panel, or as the key individual nodes connected to these. we know that many informal interactions between researchers, officials and humanitarians are not captured here. we are only seeing a partial picture of the network, traces of which remain preserved in documents pertaining to the eight nodal events sampled. nonetheless, so far the quantitative data seem to correspond closely with the qualitative accounts of the crisis. also of interest is the visual representation of organisation affiliation. all bar one of the 39 social scientists (in the whole network, fig. 3) are affiliated to a research organisation, whereas one third of the members of other scientific disciplines are attached to government institutions, donors or multilaterals. these are the public health officials and virologists working in the doh, phe and elsewhere. they appear predominantly on the left side with much stronger proximity to government led initiatives. however, it is also clear that whilst social scientists are a small minority in the government led events, the right side of the graph includes a significant number of practitioners, policy actors and clinicians. it is this part of the network that most closely resembles an inter-epistemic community. for the centrally located bridging nodes we can see a small number of social scientists and policy actors embedded in government. as accounts of the crisis have suggested, these individuals appear to have been the super-connectors. a final point of clarification is that this is not a map showing the actual knowledge flow between actors during the crisis. each of the spider shaped sub-networks represents co-occurrence of individuals on committees, panels and other groups. we can infer from this some likelihood of knowledge exchange but we cannot measure this. one exception to these co-occurrence types of tie between nodes is the erap reports (bottom right), which reveal a cluster of nodes who contributed to reports along with those who requested them. even though this represents a knowledge flow of sorts, we can still only record the interaction and make assumptions about the actual flow of knowledge. a variation of degree centrality, eigenvector centrality, counts the number of nodes adjacent to a given node and weighs each adjacent node by its centrality.
the eigenvector equation, used by netdraw, calculates each node's centrality proportionally to the sum of centralities of the nodes it is adjacent to. netdraw increases the size of nodes in relation to their popularity or eigenvector value. the better connected nodes are to others who are also well connected, the larger the nodes appear (s. p. borgatti et al., 2018). in order to focus on the key influencers or knowledge brokers in the network, we entirely remove nodes solely connected to the lessons for development conference. as mentioned earlier, this event is a poor proxy for research-policy interactions and unduly over-represents social scientists who were otherwise unconnected to advisory or knowledge exchange activities. this reduces the number of individuals in the network from 134 to 77. we also utilise ucinet's transform function to convert the two-mode incidence matrix into a one-mode adjacency matrix. ties between nodes are now determined by connections through co-occurrence. we no longer need to see the events and committees themselves but can visualise the whole network as a social network of connected individuals. we can now observe, and mathematically calculate, how inter-connected or homogeneous this research-policy network really is. we see in fig. 4 a more exaggerated separation of social science and other sciences on the right and left of the graph than in fig. 3. we can also see three distinct sub-networks emerging, bridged by six key nodes with high centrality values. the highly interconnected sub-network on the right is shaped in part by erap and the production of briefings and their supply to a small number of policy actors. we can see here the visualisation of slightly higher centrality scores than for the government scientific advisors on the left. by treating this as a relational network we observe that interactions like the establishment of a sage sub-group for social scientists increased the homophily of the right side of the network and reduced its interconnectivity with the whole network. although one must be cautious about assigning too much significance to the position of individual nodes in a whole network analysis, the central location of the two social scientists and a dfid official closely corresponds to the accounts of the crisis. this heterogeneous brokerage demonstrates the tendency of certain types of actors to be the sole link between dissimilar nodes (hamilton et al., 2020). likewise, some boundary nodes or outliers, such as one of the mod's advisors at the bottom of the network, are directly mentioned in the qualitative accounts. just four individuals in this whole network are based in africa, suggesting almost complete isolation from humanitarians operating on the ground and from african scholarship. both the qualitative accounts of the role of anthropologists in the crisis and the whole network analysis presented here largely correspond with haas' definition of epistemic communities. the international community of anthropologists and medical anthropologists that mobilised in autumn 2014 do indeed share common principles and analytical and normative beliefs. debates around issues, such as the level to which communities could reduce transmission rates themselves, did not prevent this group from providing a coherent response to the key policy dilemmas. this community did indeed emerge or coalesce around the demand placed on their expertise by policy makers concerned with the community engagement dimensions of the response.
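the two-mode to one-mode conversion and the eigenvector centrality logic described above can be illustrated with a short numerical sketch. this is purely our own illustration in plain numpy rather than ucinet/netdraw, and the incidence matrix is hypothetical, not the study's data:

```python
import numpy as np

# hypothetical two-mode incidence matrix: rows = individuals, columns = events/committees,
# a 1 indicating that the individual took part in that event
incidence = np.array([
    [1, 1, 0],   # person a
    [1, 0, 1],   # person b
    [0, 1, 1],   # person c
    [0, 0, 1],   # person d
])

# one-mode projection: two individuals are tied if they co-occur in at least one event
adjacency = incidence @ incidence.T
np.fill_diagonal(adjacency, 0)

# eigenvector centrality: each node's score is proportional to the sum of its
# neighbours' scores, i.e. the principal eigenvector of the adjacency matrix
eigenvalues, eigenvectors = np.linalg.eig(adjacency.astype(float))
principal = np.abs(eigenvectors[:, np.argmax(eigenvalues.real)].real)
centrality = principal / principal.sum()
print(centrality)   # nodes well connected to other well connected nodes score highest
```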
in the area of burial practices, there does appear to be some indication of the knowledge of social scientists being incorporated into the response. various interactions between anthropologists, dfid and the who did provide the opportunity to raise the socio-political-economic significance of funerals. for example, it was explained that the funerals of high status individuals would be much more problematic in terms of the numbers of people exposed (f. martineau et al., 2017). anthropologists contributed to the writing of the who's guidelines for safe and dignified burials (who, 2014b). however, their advice was only partially incorporated into these guidelines and the wider policies of the who at the time. the suggestion for a radical decentralised approach to formal burial response that would require the creation of community-based burial teams was ignored until much later in the crisis and never fully implemented. as loblova and dunlop suggest in their critique of epistemic community theory, the extent to which anthropology could influence policy was bounded by the beliefs and understanding of policy communities themselves (löblová, 2018) and (dunlop, 2017). olga loblova argues that there is a selection bias in the tendency to look at case studies where there has been a shift in policy along the lines of the experts' knowledge. likewise, claire dunlop suggests that haas' framework may exaggerate their influence on policy. she separates the power of experts to control the production of knowledge and engage with key policy actors from policy objectives themselves. she refers to adult education literature and its implications for what decision makers learn from epistemic communities, or to put it another way, the cognitive influence of research evidence (dunlop, 2009). she argues that the more control that knowledge exchange processes place with the "learners" in terms of framing, content and the intended policy outcomes, the less influential epistemic communities will be (dunlop, 2017). hence, in contested areas such as home care, it was the more embedded and credible clinical epistemic community that prevailed. from october 2014, anthropologists were arguing that given limited access to etus (ebola treatment units), which were struggling at that time, home care was an inevitability and so should be supported. where they saw the provision of home care kits as an ethical necessity, many clinicians, humanitarians and global health professionals regarded home care as deeply unethical, with the potential to lead to a two tier system of support (f. martineau et al., 2017) and (whitty et al., 2014). in sierra leone, irish diplomat sinead walsh was baffled by what she saw as the blocking of the distribution of home care kits. an official from the us centers for disease control and prevention (cdc) was quoted in an article in the new york times as saying that home care was "admitting defeat" (nossiter, 2014, cited in walsh and johnson, 2018). home care was never prioritised in sierra leone whereas in liberia hundreds of thousands of kits were distributed (walsh and johnson, 2018). in this area, clinicians, humanitarians and policy actors seemed to maintain a policy position directly opposed to anthropologically based advice. network theory provides further evidence around why this may have been the case. in his study of uk think tanks, jordan tchilingirian suggests that policy think tanks operate on the periphery of more established networks and enjoy fluctuating levels of support and interest in their ideas.
ideas and knowledge do not simply flow within the network, given that dominant paradigms and political, social and cultural norms privilege better established knowledge communities (tchilingirian, 2018). this is reminiscent of meyer's work on the boundaries that exist between "amateurs" and "policy professionals" (meyer, 2008). moira faul's research on global education policy networks proposes that far from being "flat," networks can augment existing power relations and knowledge hierarchies (faul, 2016). this is worth considering when one observes how erap's supply of research knowledge and the sage sub-committee for anthropologists only increased the homophily of the social science sub-community, leaving it weakly connected to the core policy network (fig. 4). the positive influence of anthropological advice on the uk's response was cited by witnesses to the subsequent parliamentary committee inquiries in 2016. however, there is some indication of different groups or networks favouring different narratives. the international development select committee (idc) was very clear in its final report that social science had been a force for good in the response and recommended that dfid grow its internal anthropological capacity (idc, 2016a, b). this contrasts with the report of the science and technology committee (stc), which, despite including evidence from at least one anthropologist, does not make a direct reference to anthropology in its report (stc, 2016). this is perhaps the public health officials in their core domain of infectious disease outbreaks reasserting their established authority. this sector has been described as the uk's "biomedical bubble", which benefits from much higher public support and funding than the social sciences (jones and wilsdon, 2018). just the presence of anthropologists in an evidence session of the stc is a very rare event, in contrast to the idc which regularly reaches out to social scientists. not everyone agrees that the threat of under-investing in social science was the primary issue. the stc's report highlights the view that there was a lack of front line clinicians represented on committees advising the uk government, particularly from aid organisations (stc, 2016). regardless of assessments of how successfully anthropological knowledge influenced policy and practice during the epidemic, there has been a subsequent elevation of social science in global health preparedness and humanitarian response programmes. writing on behalf of the wellcome trust in 2018, joão rangel de almeida says: "epidemics are a social phenomenon as much as a biological one, so understanding people's behaviours and fears, their cultural norms and values, and their political and economic realities is essential too." (rangel de almeida, 2018). the social science in humanitarian action platform, which involves many of the same researchers who were part of the sierra leone response, has subsequently been supported by unicef, usaid and the joint initiative on epidemic preparedness (jiep) with funding from dfid and wellcome. its network of social science advisers has been producing briefings to assist with the ebola response in the democratic republic of congo (drc) (farrar, 2019) and has mobilised in response to the covid-19 respiratory illness epidemic. network theory provides a useful framework with which to explore the politics of knowledge in global health with its emphasis on individuals' social context.
by analysing data pertaining to researchers' and policy professionals' participation in policy networks, one can test assumptions around interdisciplinarity and identify powerful knowledge gatekeepers. detailed qualitative accounts of policy processes need not be available, as they happened to be in this case, to employ this methodology. assuming the researcher has access to meeting minutes and other records of who attended which events or who was a member of which committees and groups, similar analysis of network homophily and centrality will be possible. the greatest potential for learning, with significant policy and research implications, comes from mixed methods approaches. by combining qualitative research to populate your network with a further round of data gathering to understand it better, you can reveal the social and political dynamics truly driving evidence use and decision making (oliver and faul, 2018). although this study lacked this scope, it has still successfully identified the shape of the research-policy network that emerged around the uk led response to ebola and the clustering of actors within it. the network was a diverse group of scientists, practitioners and policy professionals. however, it favoured the views of government scientists with their emphasis on epidemiology and the medical response. it was also almost entirely lacking in west african members. nonetheless, it was largely thanks to a strong political demand for anthropological knowledge, in response to perceived community violence and distrust, that social scientists got a seat at the table. this was brokered by a small group of individuals from both government and research organisations, who had prior relationships to build on. the emergent inter-epistemic community was only partially connected into the policy network and we should reject the description of the whole network as trans-disciplinary. social scientists were most successful in engaging when they framed their expertise in terms of already widely accepted concepts, such as the need for better communications with communities. they were least successful when their evidence countered strongly held beliefs in areas such as home care. their high level of homophily as a group, or sub-network, only deepened the ability of decision makers to ignore them when it suited them to do so. the epistemic community's interactivity with uk policy did not significantly alter policy design or implementation and it did not challenge fundamentally eurocentric development knowledge hierarchies. it was transformative only inasmuch as it helped the epistemic community itself learn how to operate in this environment. the real achievement has been in influencing longer term evidence use behaviours. they made the case for interdisciplinarity and the value of social science in emergency preparedness and response. the challenge now is moving from rhetoric to action on complex infectious disease outbreaks. as demonstrated by ebola in drc and covid-19, every global health emergency we face will have its own unique social and political dimensions. we must remain cognisant of the learning arising from the international response to sierra leone's tragic ebola epidemic. it suggests that despite the increasing demand for interdisciplinarity, social science evidence is frequently contested and policy networks have a strong tendency to leave control over its production and use in the hands of others.
credit authorship contribution statement: james georgalakis: conceptualization, methodology, software, formal analysis, investigation, data curation, writing - original draft, visualization.

epidemics (especially ebola)
social science intelligence in the global ebola response
one health: science, politics and zoonotic disease in africa
ebola at a distance: a pathographic account of anthropology's relevance
stigma and ebola: an anthropological approach to understanding and addressing stigma operationally in the ebola response
ucinet 6 for windows. analytic technologies
cases, mechanisms and the real: the theory and methodology of mixed-method social network analysis
une histoire morale du temps present
policy transfer as learning: capturing variation in what decisionmakers learn from epistemic communities
the irony of epistemic learning: epistemic communities, policy learning and the case of europe's hormones saga
ebola response anthropology platform
erap milestone achievements up until
the global community must unite to intensify ebola response in the drc
networks and power: why networks are hierarchical not flat and what can be done about it
the human factor. world health organization
finding the spaces for change: a power analysis
introduction: identifying the qualities of research-policy partnerships in international development - a new analytical framework
introduction: epistemic communities and international policy coordination
analysing published global ebola virus disease research using social network analysis
evaluating heterogeneous brokerage: new conceptual and methodological approaches and their application to multi-level environmental governance networks
health. education advice and resource team
ebola, culture and politics: the anthropology of an emerging disease
responses to the ebola crisis
ebola: responses to a public health emergency. house of commons
the biomedical bubble: why uk research and innovation needs a greater diversity of priorities
the ebola epidemic: a transformative moment for global health
ebola: engaging long-term social science research to transform epidemic response
when epistemic communities fail: exploring the mechanism of policy influence
online course: ebola in context: understanding transmission, response and control
epistemologies of ebola: reflections on the experience of the ebola response anthropology platform
on the boundaries and partial connections between amateurs and professionals
pushed to the limit and beyond
social constructionism as ontology: exposition and example
a hospital from hell
networks and network analysis in evidence, policy and practice
ebola crisis appeal response review
social science research: a much-needed tool for epidemic control. wellcome
redruk, 2014. pre-departure ebola response training
burial/other cultural practices and risk of evd transmission in the mano river region
burial/other cultural practices and risk of evd transmission in the mano river region
ebola: how a people's science helped end an epidemic
on the coloniality of global public health
biosocial approaches to the 2013-2016 ebola pandemic. health hum. rights 18, 115
sage scientific advisory group for emergencies - ebola summary minute of 2nd meeting
scientific advisory group for emergencies - ebola summary minute of 3rd meeting
science in emergencies: uk lessons from ebola. house of commons
stovel
producing knowledge, producing credibility: british think-tank researchers and the construction of policy reports
the oxford handbook of political networks
getting to zero: a doctor and a diplomat on the ebola frontline
network analysis and political science
clinical presentation and management of severe ebola virus disease
infectious disease: tough choices to reduce ebola transmission
key messages for social mobilization and community engagement in intense transmission areas
comparison of social resistance to ebola response in sierra leone and guinea suggests explanations lie in political configurations not culture
coronavirus: china president warns spread of disease 'accelerating', as canada confirms first case. the independent

acknowledgements i thank dr jordan tchilingirian (university of western australia) for discussions and support on ucinet. i thank professor melissa leach and dr annie wilkinson (institute of development studies) for access to archival data.

key: cord-200147-ans8d3oa authors: arimond, alexander; borth, damian; hoepner, andreas; klawunn, michael; weisheit, stefan title: neural networks and value at risk date: 2020-05-04 journal: nan doi: nan sha: doc_id: 200147 cord_uid: ans8d3oa

utilizing a generative regime switching framework, we perform monte-carlo simulations of asset returns for value at risk threshold estimation. using equity markets and long term bonds as test assets in the global, us, euro area and uk setting over an up to 1,250 weeks sample horizon ending in august 2018, we investigate neural networks along three design steps relating (i) to the initialization of the neural network, (ii) its incentive function according to which it has been trained and (iii) the amount of data we feed. first, we compare neural networks with random seeding with networks that are initialized via estimations from the best-established model (i.e. the hidden markov). we find the latter to outperform in terms of the frequency of var breaches (i.e. the realized return falling short of the estimated var threshold). second, we balance the incentive structure of the loss function of our networks by adding a second objective to the training instructions so that the neural networks optimize for accuracy while also aiming to stay in empirically realistic regime distributions (i.e. bull vs. bear market frequencies). in particular this design feature enables the balanced incentive recurrent neural network (rnn) to outperform the single incentive rnn as well as any other neural network or established approach by statistically and economically significant levels. third, we halve our training data set of 2,000 days. we find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets ...

while leading papers on machine learning in asset pricing focus predominantly on returns and stochastic discount factors (chen, pelger & zhu 2020; gu, kelly & xiu 2020), we are motivated by the global covid-19 virus crisis and the subsequent stock market crash to investigate if and how machine learning methods can enhance value at risk (var) threshold estimates. in line with gu, kelly & xiu (2020: 7), we would like to open by declaring our awareness that "[m]achine learning methods on their own do not identify deep fundamental associations" without human scientists designing hypothesized mechanisms into an estimation problem.
1 nevertheless, measurement errors can be reduced based on machine learning methods. hence, machine learning methods employed as means to an end instead of as an end in themselves can significantly support researchers in challenging estimation tasks. 2 in their already legendary paper, gu, kelly & xiu (gkx in the following, 2020) apply machine learning to a key problem in the academic finance literature: 'measuring asset risk premia'. they observe that machine learning improves the description of expected returns relative to traditional econometric forecasting methods based on (i) better out-of-sample r-squared and (ii) forecasts earning larger sharpe ratios. more specifically, they compare four 'traditional' methods (ols, glm, pcr/pca, pls) with regression trees (e.g. random forests) and a simple 'feed forward neural network' based on 30k stocks over 720 months, using 94 firm characteristics, 74 sectors and 900+ baseline signals. crediting inter alia (i) flexibility of functional form and (ii) enhanced ability to prioritize vast sets of baseline signals, they find the feed forward neural networks (ffnn) to perform best. contrary to results reported from computer vision, gkx further observe that "'shallow' learning outperforms 'deep' learning" (p.47), as their neural network with 3 hidden layers excels beyond neural networks with more hidden layers. they interpret this result as a consequence of a relatively much lower signal to noise ratio and much smaller data sets in finance. interestingly, the outperformance of nns over the other 5 methods widens at portfolio compared to stock level, another indication that an understanding of the signal to noise ratio in financial markets is crucial when training neural networks. that said, while classic ols is statistically significantly weaker than all other models, nn3 beats all others but not always at statistically significant levels. gkx finally confirm their results via monte carlo simulations. they show that if one generated two hypothetical security price datasets, one linear and un-interacted and one nonlinear and interactive, ols and glm would dominate in the former, while nns dominate in the latter. they conclude by attributing the "predictive advantage [of neural networks] to accommodation of nonlinear interactions that are missed by other methods." (p.47) following gkx, an extensive literature on machine learning in finance is rapidly emerging. chen, pelger and zhu (cpz in the following, 2020) introduce more advanced (i.e. recurrent) neural networks and estimate a (i) non-linear asset pricing model (ii) regularized under no-arbitrage conditions operationalized via a stochastic discount factor (iii) while considering economic conditions. in particular, they attribute the time varying dependency of the stochastic discount factor of about ten thousand us stocks to macroeconomic state processes via a recurrent long short term memory (lstm) network. in cpz's (2020: 5) view "it is essential to identify the dynamic pattern in macroeconomic time series before feeding them into a machine learning model". avramov et al. (2020) replicate the approaches of gkx (2020), cpz (2020), and two conditional factor pricing models: kelly, pruitt, and su's (2019) linear instrumented principal component analysis (ipca) and gu, kelly, and xiu's (2019) nonlinear conditional autoencoder in the context of real-world economic restrictions.
while they find strong fama french six factor (ff6) adjusted returns in the original setting without real world economic constraints, these returns reduce by more than half if microcaps or firms without credit ratings are excluded. in fact, avramov et al. (2020: 3) find that "[e]xcluding distressed firms, all deep learning methods no longer generate significant (value-weighted) ff6-adjusted return at the 5% level." they confirm this finding by showing that the gkx (2020) and cpz (2020) machine learning signals perform substantially weaker in economic conditions that limit arbitrage (i.e. low market liquidity, high market volatility, high investor sentiment). curiously though, avramov et al. (2020: 5) find that the only linear model they analyse - kelly et al.'s (2019) ipca - "stands out … as it is less sensitive to market episodes of high limits to arbitrage." their finding as well as the results of cpz (2020) imply that economic conditions have to be explicitly accounted for when analysing the abilities and performance of neural networks. furthermore, avramov et al. (2020) as well as gkx (2020) and cpz (2020) make anecdotal observations that machine learning methods appear to reduce drawdowns. 1 while their manuscripts focused on return predictability, we devote our work to risk predictability in the context of market wide economic conditions. the covid-19 crisis as well as the density of economic crises in the previous three decades imply that catastrophic 'black swan' type risks occur more frequently than predicted by symmetric economic distributions. consequently, underestimating tail risks can have catastrophic consequences for investors. hence, the analysis of risks with the ambition to avoid underestimations deserves, in our view, equivalent attention to the analysis of returns with its ambition to identify investment opportunities resulting from mispricing. more specifically, since a symmetric approach such as the "mean-variance framework implicitly assumes normality of asset returns, it is likely to underestimate the tail risk for assets with negatively skewed payoffs" (agarwal & naik, 2004:85). empirically, equity market indices usually exhibit, not only since covid-19, negative skewness in their return payoffs (albuquerque, 2012, kozhan et al. 2013). consequently, it is crucial for a post covid-19 world with its substantial tail risk exposures (e.g. second pandemic wave, climate change, cyber security) that investors are provided with tools which avoid the underestimation of risks as best as possible. naturally, neural networks with their near unlimited flexibility in modelling non-linearities appear suitable candidates for such conservative tail risk modelling that focuses on avoiding the underestimation of risks. we consider giglio & xiu (2019) and kozak, nagel & santosh (2020) as also noteworthy, as are efforts by fallahgouly and franstiantoz (2020) and horel and giesecke (2019) to develop significance tests for neural networks. our paper investigates if basic and/or more advanced neural networks have the capability of underestimating tail risk less often at common statistical significance levels. we operationalize tail risk as value at risk, which is the most used tail risk measure in both commercial practice as well as academic literature (billio et al. 2012, billio and pellizon, 2000, jorion, 2005, nieto & ruiz, 2015). specifically, we estimate var thresholds using classic methods (i.e. mean/variance, hidden markov model) 1 as well as machine learning methods (i.e.
feed forward, convolutional, recurrent), which we advance via initialization of input parameters and regularization of the incentive function. recognizing the importance of economic conditions (avramov et al. 2020, chen et al. 2020), we embed our analysis in a regime-based asset allocation setting. specifically, we perform monte-carlo simulations of asset returns for value at risk threshold estimation in a generative regime switching framework. using equity markets and long term bonds as test assets in the global, us, euro area and uk setting over an up to 1,250 weeks sample horizon ending in august 2018, we investigate neural networks along three design steps relating (i) to the initialization of the neural network's input parameters, (ii) its incentive function according to which it has been trained and which can lead to extreme outputs if it is not regularized, as well as (iii) the amount of data we feed. first, we compare neural networks with random seeding with networks that are initialized via estimations from the best-established model (i.e. the hidden markov). we find the latter to outperform in terms of the frequency of var breaches (i.e. the realized return falling short of the estimated var threshold). second, we balance the incentive structure of the loss function of our networks by adding a second objective to the training instructions so that the neural networks optimize for accuracy while also aiming to stay in empirically realistic regime distributions (i.e. bull vs. bear market frequencies). this design feature leads to better regularization of the neural network, as it substantially reduces extreme outcomes that can result from a single incentive function. in particular this design feature enables the balanced incentive recurrent neural network (rnn) to outperform the single incentive rnn as well as any other neural network or established approach by statistically and economically significant levels. third, we halve our training data set of 2,000 days. we find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets. our contributions are fivefold. first, we extend the currently return focused literature of machine learning in finance (avramov et al. 2020, chen et al. 2020, gu et al. 2020) to also focus on the estimation of risk thresholds. assessing the advancements that machine learning can bring to risk estimation potentially offers valuable innovation to asset owners such as pension funds and can better protect the retirement savings of their members. 2 second, we advance the design of our three types of neural networks by initializing their input parameters with the best established model. while initializations are a common research topic in core machine learning fields such as image classification or machine translation (glorot & bengio, 2010), we are not aware of any systematic application of initialized neural networks in the field of finance. hence, demonstrating the statistical superiority of an initialized neural network over its non-initialized counterpart appears a relevant contribution to the community. third, while cpz (2020) regularize their neural networks via no arbitrage conditions, we regularize via balancing the incentive function of our neural networks on multiple objectives (i.e. estimation accuracy and empirically realistic regime distributions).
this prevents any single objective from leading to extreme outputs and hence balances the computational power of the trained neural network in desirable directions. in fact, our results show that amendments to the incentive function may be the strongest tool available to us in engineering neural networks. fourth, we also hope to make a marginal contribution to the literature on value at risk estimation. whereas our paper is focused on advancing machine learning techniques and is therefore, following billio and pellizon (2000), anchored in a regime based asset allocation setting 1 to account for time varying economic states (cpz, 2020), we still believe that the nonlinearity and flexible form especially of recurrent neural networks may be of interest to the var (forecasting) literature (billio et al. 2012, nieto & ruiz, 2015, patton et al. 2019). fifth, our final contribution lies in the documentation of weaknesses of neural networks as applied to finance. while avramov et al. (2020) subject neural networks to real world economic constraints and find these to substantially reduce their performance, we expose our neural networks to data scarcity and document just how much data these new approaches need to advance the estimation of risk thresholds. naturally, such a long data history may not always be available in practice when estimating asset management var thresholds and therefore established methods and neural networks are likely to be used in parallel for the foreseeable future. in section two, we will describe our testing methodology including all five competing models (i.e. mean/variance, hidden markov model, feed forward neural network, convolutional neural network, recurrent neural network). section three describes data, model training, monte carlo simulations and baseline results. section four then advances our neural networks via initialization and balancing the incentive functions and discusses the results of both features. section five conducts robustness tests and sensitivity analyses before section six concludes. 1 we acknowledge that most recent statistical advances in value at risk estimation have concentrated on jointly modelling value at risk and expected shortfall and were therefore naturally less focused on time varying economic states (patton et al. 2019, taylor 2019, 2020).

value at risk estimation with mean/variance approach

when modelling financial time series related to investment decisions, the asset return $y_{p,t}$ of portfolio $p$ at time $t$ as defined in equation (1) below is the focal point of interest instead of the asset price $P_{p,t}$, since investors earn on the difference between the price at which they bought and the price at which they sold:

$$y_{p,t} = \frac{P_{p,t} - P_{p,t-1}}{P_{p,t-1}} \quad (1)$$

value-at-risk (var) metrics are an important tool in many areas of risk management. our particular focus is on var measures as a means to perform risk budgeting in asset allocation. asset owners such as pension funds or insurances as well as asset managers often incorporate var measures into their investment processes (jorion, 2005). value at risk is defined in equation (2) as the lower bound of a portfolio's return, which the portfolio or asset is not expected to fall short of with a certain probability $a$ within the next period of allocation of $n$ days:
$$\Pr\big(y_{t+n} < -\mathrm{VaR}_t(a)\big) = 1 - a \quad (2)$$

for example, an investment fund indicates that, based on the composition of its portfolio and on current market conditions, there is a 95% or 99% probability it will not lose more than a specified amount of assets over the next 5 trading days. the var measurement can be interpreted as a threshold (billio and pellizon 2000). if the actual portfolio or asset return falls below this threshold, we refer to this as a var breach. the classic mean variance approach of measuring var values is based on the assumption that asset returns follow a (multivariate) normal distribution. var thresholds can then be measured by estimating the mean and covariance $(\mu, \Sigma)$ of the asset returns by calculating the sample mean and sample covariance of the respective historical window. the 1% or 5% percentile of the resulting normal distribution will be an appropriate estimator of the 99% or 95% var threshold, respectively. we refer to this way of estimating var thresholds as being the "classical" approach and use it as the baseline of our evaluation. this classic approach, however, does not sufficiently reflect the skewness of real world equity markets and the divergences of return distributions across different economic regimes. in other words, the classic approach does not take into account longer term market dynamics, which express themselves as phases of growth or of downside, also commonly known as bull and bear markets. for this purpose, regime switching models had grown in popularity well before machine learning entered finance (billio and pellizon 2000). in this study, we model financial markets inter alia using neural networks while accounting for shifts in economic regimes (avramov et al. 2020, chen et al. 2020). due to the generative nature of these networks, they are able to perform monte-carlo simulation of future returns, which could be beneficial for var estimation. in an asset manager's risk budgeting it is advantageous to know about the current market phase (regime) and estimate the probability that the regime changes (schmeding et al., 2019). the most common way of modelling market regimes is by distinguishing between bull markets and bear markets. regime switching models based on hidden markov models are an established tool for regime based modelling. hidden markov models (hmm) - which are based on markov chains - are models that allow for analysing and representing characteristics of time series such as negative skewness (ang and bekaert, 2002; timmerman, 2000). we employ the hmm for the special case of two economic states called 'regimes' in the hmm context. specifically, we model asset returns $y_t \in \mathbb{R}^N$ (we are looking at $N \geq 1$ assets) at time $t$ to follow an $N$-dimensional gaussian process with hidden states $s_t \in \{1, 2\}$ as shown in equation (3):

$$y_t \mid s_t = s \;\sim\; \mathcal{N}(\mu_s, \Sigma_s) \quad (3)$$

the returns are modelled to have state dependent expected returns $\mu_s \in \mathbb{R}^N$ as well as covariances $\Sigma_s \in \mathbb{R}^{N \times N}$. the dynamic of $s_t$ follows a homogeneous markov chain with transition probability matrix $A$, with $a_{11} = \Pr(s_t = 1 \mid s_{t-1} = 1)$ and $a_{22} = \Pr(s_t = 2 \mid s_{t-1} = 2)$. this definition describes if and how states change over time. it is also important to note the 'markov property': the probability of being in any state at the next point in time only depends on the present state, not the sequence of states that preceded it. furthermore, the probability of being in a state at a certain point in time is given as $\pi_t = \Pr(s_t = 1)$ and $(1 - \pi_t) = \Pr(s_t = 2)$.
this is also called the smoothed state probability. by estimating the smoothed probability $\pi_t$ of the last element of the historical window as the present regime probability, we can use the model to start from there and perform monte-carlo simulations of future asset returns for the next days. 1 this is outlined for the two-regimes case in figure 1 below. 2

figure 1: algorithm for the hidden markov monte-carlo simulation (for two regimes); step 1: estimate $\theta = (\pi_0, A, \mu, \Sigma)$ from the historical window.

when graves (2013) successfully made use of a long short-term memory (lstm) based recurrent neural network to generate realistic sequences of handwriting, he followed the idea of using a mixture density network (mdn) to parametrize a gaussian mixture predictive distribution (bishop, 1995). compared to standard neural networks (multi-layer perceptrons) as used by gkx (2020), this network does not only predict the conditional average of the target variable as a point estimate (in gkx's case expected risk premia), but rather estimates the conditional distribution of the target variable. given the autoregressive nature of graves' approach, the output distributions are not assumed to be static over time, but dynamically conditioned on previous outputs, thus capturing the temporal context of the data. we consider both characteristics as being beneficial for modelling financial market returns, which experience a low signal to noise ratio as highlighted by gkx's results due to inherently high levels of intertemporal uncertainty. the core of the proposed neural network regime switching framework is a (swappable) neural network architecture, which takes as input the historical sequence of daily asset returns. at the output level, the framework computes regime probabilities and provides learnable gaussian mixture distribution parameters, which can be used to sample new asset returns for monte-carlo simulation. a multivariate gaussian mixture model (gmm) is a weighted sum of $K$ different components, each following a distinct multivariate normal distribution as shown in equation (5):

$$p(y_t) = \sum_{i=1}^{K} \phi_i \, \mathcal{N}(y_t; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{K} \phi_i = 1 \quad (5)$$

a gmm by its nature does not assume a single normal distribution, but naturally models a random variable as being the interleave of different (multivariate) normal distributions. in our model, we interpret $K$ as the number of regimes and $\phi_i$ explains how much each regime contributes to the (current) output. in other words, $\phi_i$ can be seen as the probability that we are in regime $i$. in this sense the gmm output provides a suitable level of interpretability for the use case of regime based modelling. with regard to the neural network regime switching model, we extend the notion of a gaussian mixture by conditioning $\phi_i$ via a yet undefined neural network $f$ on the historic asset returns within a certain window of a certain size. we call this window the receptive field and denote its size by $R$:

$$\phi_{i,t} = f_i\big(y_{t-R+1}, \ldots, y_t\big) \quad (6)$$

this extension makes the gaussian mixture weights dependent on the (recent) history of the time varying asset returns. note that we only condition $\phi$ on the historical returns. the other parameters of the gaussian mixture $(\mu_i, \Sigma_i)$ are modelled as unconditioned, yet optimizable parameters of the model. this basically means we assume the parameters of the gaussians to be constant over time (per regime). this is in contrast to the standard mdn, where $(\mu_i, \Sigma_i)$ are also conditioned on the input and therefore can change over time.
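a minimal numpy sketch may make this output structure concrete. this is our own illustration rather than the authors' implementation: the network $f$ is stubbed with a single linear layer plus softmax, and all parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 2     # number of assets
K = 2     # number of regimes
R = 10    # receptive field: days of return history fed to the network

# time-invariant, learnable per-regime parameters (mu_i, Sigma_i)
mu = np.array([[0.0005, 0.0002],        # "bull" regime means
               [-0.0010, 0.0004]])      # "bear" regime means
sigma = np.array([np.eye(N) * 1e-4,     # "bull" regime covariance
                  np.eye(N) * 9e-4])    # "bear" regime covariance

# stand-in for the neural network f: one linear layer followed by a softmax
W = rng.normal(scale=0.1, size=(K, R * N))
b = np.zeros(K)

def regime_weights(window):
    """phi_t = softmax(f(y_{t-R+1}, ..., y_t)): mixture weights conditioned on history."""
    z = W @ window.reshape(-1) + b
    z -= z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_next_return(window):
    """draw one simulated next-step return vector from the conditional mixture."""
    phi = regime_weights(window)
    i = rng.choice(K, p=phi)            # pick a regime according to phi
    return rng.multivariate_normal(mu[i], sigma[i])

window = rng.normal(scale=0.01, size=(R, N))   # dummy return history
print(regime_weights(window))                  # e.g. two weights summing to one
print(sample_next_return(window))              # one simulated next-day return vector
```

the design choice the sketch makes visible is that only the mixture weights react to the return history, while the per-regime gaussians stay fixed, mirroring the hmm's time invariant $(\mu_s, \Sigma_s)$.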
1 keeping these remaining parameters unconditional is crucial to allow for a fair comparison between the neural networks and the hmm, which also exhibits time invariant parameters $(\mu_s, \Sigma_s)$ alongside its regime shift probabilities. following graves (2013), we define the sequence probability given by the network and the corresponding sequence loss as shown in equations (7) and (8), respectively:

$$\Pr(y) = \prod_{t=1}^{T} \Pr\big(y_{t+1} \mid y_{1}, \ldots, y_{t}\big) \quad (7)$$

$$\mathcal{L}(y) = -\sum_{t=1}^{T} \log \Pr\big(y_{t+1} \mid y_{1}, \ldots, y_{t}\big) \quad (8)$$

since financial markets operate in weekly cycles with many investors shying away from exposure to substantial leverage during the illiquid weekend period, we are not surprised to observe that model training is more stable when choosing the predictive distribution to not only be responsible for the next day, but for the next 5 days (hann and steuer, 1995). we call this forward looking window the lookahead. this is also practically aligned with the overall investment process, in which we want to appropriately model the upcoming allocation period, which usually spans multiple days. it also fits with the intuition that regimes do not switch daily but have stability at least for a week. the extended sequence probability and sequence loss are denoted accordingly in equations (9) and (10):

$$\Pr(y) = \prod_{t=1}^{T} \Pr\big(y_{t+1}, \ldots, y_{t+5} \mid y_{1}, \ldots, y_{t}\big) \quad (9)$$

$$\mathcal{L}(y) = -\sum_{t=1}^{T} \log \Pr\big(y_{t+1}, \ldots, y_{t+5} \mid y_{1}, \ldots, y_{t}\big) \quad (10)$$

an important feature of the neural network regime model is how it simulates future returns. we follow graves' (2013) approach and conduct sequential sampling from the network. when we want to simulate a path of returns for the next n business days, we do this according to the algorithm displayed in figure 2. in accordance with gkx (2020) we first focus our analysis on traditional "feed-forward" neural networks before engaging in more sophisticated neural network architectures for time series analysis within the neural network regime model. the traditional model of neural networks, also called the multi-layer perceptron, consists of an "input layer" which contains the raw input predictors and one or more "hidden layers" that combine input signals in a nonlinear way and an "output layer", which aggregates the output of the hidden layers into a final predictive signal. the nonlinearity of the hidden layers arises from the application of nonlinear "activation functions" on the combined signals. we visualise the traditional feed forward neural network and its input layers in figure 4. we set up our network structure in alignment with gkx's (2020) best performance neural network 'nn3'. the setup of our network is thus given with 3 hidden layers with a decreasing number of hidden units (32, 16, 8). since we want to capture the temporal aspect of our time series data, we condition the network output on at least a receptive field of 10 days. even though the receptive field of the network is not very high in this case, the dense structure of the network results in a very high number of parameters (1698 in total, including the gmm parameters). in between layers, we make use of the activation function tanh. convolutional neural networks (cnns) can also be applied within the proposed neural network regime switching model. recently, cnns gained popularity for time series analysis, as for example van den oord et al. (2015) successfully applied convolutional neural networks on time series data for generating audio waveforms, achieving state-of-the-art text-to-speech and music generation. their adaptation of convolutional neural networks - called wavenet - has been shown to be able to capture long ranging dependencies on sequences very well. in its essence, a wavenet consists of multiple layers of stacked convolutions along the time axis.
crucial features of these convolutions are that they have to be causal and dilated. causal means that the output of a convolution only depends on past elements of the input sequence. dilated convolutions are ones that exhibit "holes" in their respective kernel, which effectively means that their filter size increases while being dilated with zeros in between. wavenet typically is constructed with an increasing dilation factor (doubling in size) in each (hidden) layer. by doing so, the model is capable of capturing an exponentially growing number of elements from the input sequence depending on the number of hidden convolutional layers in the network. the number of captured sequence elements is called the receptive field of the network (and in this sense is equal to the receptive field defined for the neural network regime model). 1 the convolutional neural network (cnn), due to its structure of stacked dilated convolutions, has a much greater receptive field than the simple feed forward network and needs far fewer weights to be trained. we restricted the number of hidden layers to 3 to illustrate the idea; our network structure has 7 hidden layers. each hidden layer furthermore exhibits a number of channels, which are not visualized here. figure 5 illustrates the network's basic structure as a combination of stacked causal convolutions with a dilation factor of d = 2. while the backing model presented in this investigation is inspired by wavenet, we restrict the model to the basic layout, using a causal structure and increasing dilation between layers. the output layer comprises the regime predictive distributions by applying a softmax function to the hidden layers' outputs. our network consists of 6 hidden layers, each layer having 3 channels. the convolutions each have a kernel size of 3. in total, the network exhibits 242 weights (including gmm parameters), and the receptive field has a size of 255 days. as graves (2013) was very successful in applying lstms for generating sequences, we also adapt this approach for the neural network regime switching model. originally introduced by hochreiter and schmidhuber (1997), a main characteristic of lstms - which are a sub class of recurrent neural networks - is their purpose-built memory cells, which allow them to capture long range dependencies in the data. from a model perspective, lstms differ from other neural network architectures in that they are applied recurrently (see figure 6). the output from a previous sequence of the network function serves - in combination with the next sequence element - as input for the next application of the network function. in this sense, the lstm can be interpreted as being similar to an hmm, in that there is a hidden state which conditions the output distribution. however, the lstm hidden state not only depends on its previous states, but it also captures long term sequence dependencies through its recurrent nature. perhaps most notably, the receptive field size of an lstm is not bounded architecture-wise as in the case of the simple feed forward network and the cnn. instead, the lstm's receptive field depends solely on the lstm's ability to memorize the past input. in our architecture we have one lstm layer with a hidden state size of 5. in total, the model exhibits 236 parameters (including the gmm parameters). the potential of lstms was noted by cpz (2020: 6) who state that "lstms are designed to find patterns in time series data and … are among the most successful commercial ais".
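figure 2 itself is not reproduced here, but the sequential sampling it refers to can be sketched in a few lines. this is again our own hedged illustration: `sample_next_return` is the hypothetical helper from the earlier sketch, and compounding daily returns into a total return over the horizon is our assumption, since the text only states that a total weekly return is calculated per path.

```python
import numpy as np

def simulate_paths(window, sample_next_return, n_days=5, n_paths=10_000):
    """monte-carlo simulation of total returns over the next n_days by sequential sampling.

    window:             array (R, N) with the most recent R daily asset returns
    sample_next_return: callable mapping an (R, N) window to one simulated (N,) return
    returns:            array (n_paths, N) of simulated total returns over the horizon
    """
    n_assets = window.shape[1]
    totals = np.empty((n_paths, n_assets))
    for p in range(n_paths):
        w = window.copy()
        daily = []
        for _ in range(n_days):
            r = sample_next_return(w)            # condition on the rolling window
            daily.append(r)
            w = np.vstack([w[1:], r])            # slide the receptive field forward
        totals[p] = np.prod(1.0 + np.array(daily), axis=0) - 1.0
    return totals

# usage (the paper simulates 100,000 paths per model and back-test date):
# totals = simulate_paths(window, sample_next_return, n_paths=100_000)
# var_99, var_95 = np.percentile(totals[:, 0], [1, 5])   # 5-day 99% / 95% var thresholds
```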
3 assessment procedure

we obtain daily price data for stock and bond indices globally and for three major markets (i.e. eu, uk, us) to study the presented regime based neural network approaches on a variety of stock markets and bond markets. for each stock market, we focus on one major stock index. for bond markets, we further distinguish between long term bond indices (7-10 years) and short term bond indices (1-3 years). the markets in scope are the global, us, euro area and uk markets. the data dates back to at least january 1990 and ends with august 2018, which means covering almost 30 years of market development. hence, the data also accounts for crises like the dot-com bubble in the early 2000s as well as the financial crisis of 2008. this is especially important for testing the regime based approaches. the price indices are given as total return indices (i.e. dividends treated as being reinvested) to properly reflect market development. the data is taken from refinitiv's datastream. descriptive statistics are displayed in table 1, whereby panel a displays a daily frequency and panel b a weekly frequency. mean returns for equities exceed the returns for bonds, whereby the longer bonds return more than the shorter ones. equities naturally have a much higher standard deviation and a far worse minimum return. in fact, equity returns in all four regions lose substantially more money than bond returns even at the 25th percentile, which highlights that the holy grail of asset allocation is the ability to predict equity market drawdowns. furthermore, equity markets tend to be quite negatively skewed as expected, while short bonds experience a positive skewness, which reflects previous findings (albuquerque, 2012, kozhan et al. 2013) and the inherent differential in the riskiness of both assets' payoffs. [insert table 1 about here] the back testing is done on a weekly basis via a moving window approach. at each point in time, the respective model is fitted by providing the last 2,000 days (which is roughly 8 years) as training data. we choose this long range window because neural networks are known to need big datasets as inputs and it is reasonable to assume that a window of over eight years simultaneously includes times of (at least relative) crisis and times of market growth. covering both bull and bear markets in the training sample is crucial to allow the model to "learn" these types of regimes. 1 for all our models we set the number of regimes to $K = 2$. as we back test an allocation strategy with a weekly re-allocation, we set the lookahead for the neural network regime models to 5 days. we further configured the back testing dates to always align with the end of a business week (i.e. fridays). the classic approach does not need any configuration; model fitting is the same as computing the sample mean and sample covariance of the asset returns within the respective window. the hmm also does not need any more configuration, as the baum-welch algorithm is guaranteed to converge the parameters to a local optimum with respect to the likelihood function (baum, 1970). for the neural network regime models, additional data processing is required to learn network weights that lead to meaningful regime probabilities and distribution parameters. an important pre-processing step is input normalization, as it is considered good practice for neural network training (bishop, 1995). for this purpose, we normalize the input data by $y' = (y - \operatorname{mean}(y)) / \operatorname{var}(y)$.
in other words, we demean the input data and scale them by their variance, but without removing the interactions between the assets. we train the network by using the adamax optimizing algorithm (kingma & ba, 2014) while at the same time applying weight decay to reduce overfitting (krogh & hertz, 1992). the learning rate and number of epochs configured for training vary depending on the model. in general, estimating the parameters of a neural network model is a non-convex optimization problem. thus, the optimization algorithm might become stuck in an infeasible local optimum. in order to mitigate this problem, it is common practice to repeat the training multiple times, starting off having different (usually randomly chosen) parameter initializations, and then averaging over the resulting models or picking the best in terms of loss. in this paper, we follow a best-out-of-5 approach; that means each training is done five times with varying initialization and the best one is selected for simulation. the initialization strategy, which we will show in section 4.1, further mitigates this problem by starting off from an economically reasonable parameter set. we observe that the in-sample regime probabilities learned by the neural network regime switching models, as compared to those estimated by the hmm based regime switching model, generally show comparable results in terms of distribution and temporal dynamics. when we set $K = 2$, the model fits two regimes, with nearly invariably one having a positive corresponding equity mean and low volatility, and the other experiencing a low or negative equity mean and high volatility. these regimes can be interpreted as bull and bear market, respectively. the respective in-sample regime probabilities over time also show strong alignment with growth and drawdown phases. this holds true for the vast majority of seeds and hence indicates that the neural network regime model is a valid practical alternative for regime modelling when compared to a hidden markov model. after training the model for a specific point in time, we start a monte carlo simulation of asset returns for the next 5 days (one week - monday to friday). for the purpose of calculating statistically solid quantiles of the resulting distribution, we simulate 100,000 paths for each model. we do this for at least 1,093 (emu) and at most 1,250 (globally) points in time within the back-test history window. as soon as we have simulated all return paths, we calculate a total (weekly) return for each path. the generated weekly returns follow a non-trivial distribution, which arises from the respective model and its underlying temporal dynamics. based on the simulations we compute quantiles for value at risk estimations. for example, the 0.01 and 0.05 percentiles of the resulting distribution represent the 99% and 95% - 5 day - var metrics, respectively. we evaluate the quality of our value at risk estimations by counting the number of breaches of the asset returns. in case the actual return is below the estimated var threshold, we count this as a breach. assuming an average performing model, it is e.g. reasonable to expect 5% breaches for a 95% var measurement. we compared the breaches of all models with each other. we classify a model as being superior to another model if its number of var breaches is lower than that of the compared model. a comparison value comp = 1.0 (= 0.0) indicates that the row model is superior (inferior) to the column model. we performed significance tests by applying paired t-tests.
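a compact sketch of this evaluation logic (our own illustration; the breach series are hypothetical and scipy's paired t-test is assumed, since the paper does not name its implementation):

```python
import numpy as np
from scipy import stats

def var_threshold(simulated_totals, level=0.95):
    """var threshold as the (1 - level) percentile of the simulated weekly returns."""
    return np.percentile(simulated_totals, 100.0 * (1.0 - level))

def breach_indicators(realized_returns, var_thresholds):
    """1 whenever the realized weekly return falls below the estimated threshold, else 0."""
    return (np.asarray(realized_returns) < np.asarray(var_thresholds)).astype(int)

# var_95 = var_threshold(totals[:, 0], level=0.95)   # using simulated totals from the earlier sketch

# hypothetical per-week breach indicators for two competing models over ten back-test weeks
breaches_model_a = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
breaches_model_b = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

print("breach frequency a:", breaches_model_a.mean())   # 0.3
print("breach frequency b:", breaches_model_b.mean())   # 0.1

# paired t-test on the week-by-week breach indicators of the two models
t_stat, p_value = stats.ttest_rel(breaches_model_a, breaches_model_b)
print(t_stat, p_value)
```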
we further evaluated a dominance value, which is defined as shown in equation (11). in our view, the three most crucial design features of neural networks in finance, where the sheer number of hidden layers appears less helpful due to the low signal-to-noise ratio (gkx, 2020), are: the amount of input data, the initializing information and the incentive function. big input data is important for neural networks, as they need to consume sufficient evidence also of rarer empirical features to ensure that their nonlinear abilities in fitting virtually any functional form are used in a relevant instead of an exotic manner. similarly, the initialization of input parameters should be based as much as possible on empirically established estimates to ensure that the gradient descent inside the neural network takes off from a suitable point of departure, thereby substantially reducing the risk that the neural network confuses itself into irrelevant local minima. on the output side, every neural network is trained according to an incentive (i.e. loss) function. it is this particular loss function which determines the direction of travel for the neural network, which has no other ambition than to minimize its loss as best as possible. hence, if the loss function only represents one of several practically relevant parameters, the neural network may arrive at bizarre outcomes for those parameters not included in its incentive function. in our case, for instance, the baseline incentive is just estimation accuracy, which could lead to forecasts dominated much more by a single regime than ever observed in practice. in other words, after a long bull market, the neural network could "conclude" that bear markets do not exist. metaphorically spoken, a unidimensional loss function in a neural network has little decency (marcus, 2018). commencing with the initialization and the incentive functions, we will assess our three neural networks in the following vis-a-vis the classic and hmm approaches, where each of the three networks is once displayed with an advanced design feature and once with a naïve design feature. if no specific initialization strategy for a neural network is defined, initialization occurs entirely at random, normally via computer-generated random numbers. where established econometric approaches use naïve priors (i.e. the mean), neural networks originally relied on brute-force computing power and a bit of luck. hence, it is unsurprising that initializations are nowadays a common research topic in core machine learning fields such as image classification or machine translation (glorot & bengio, 2010). however, we are not aware of any systematic application of initialized neural networks in the field of finance. hence, we compare naïve neural networks, which are not initialized, with neural networks that have been initialized with the best available prior. in our case, the best available prior for μ, σ of the model is the equivalent hmm estimation based on the same window. 1 such initialization is feasible, since the structure of the neural network, due to its similarity with respect to μ, σ, is broadly comparable with the hmm. in other words, we make use of already trained parameters from the hmm training as starting parameters for the neural network training. in this sense, initialized neural networks are not only flexible in their functional form, they are also adaptable to "learn" from the best established model in the field if suitably supervised by the human data scientists.
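a minimal sketch of this initialization idea, assuming the regime distributions are gaussian and using hmmlearn's GaussianHMM as the "best available prior"; the network below is only a placeholder module whose per-regime mean and volatility parameters are seeded with the fitted hmm values (the paper's actual architectures are not reproduced):

import numpy as np
import torch
from hmmlearn.hmm import GaussianHMM

def hmm_prior(returns: np.ndarray, k: int = 2):
    # fit a k-regime gaussian hmm on the window and return per-regime means and stds
    hmm = GaussianHMM(n_components=k, covariance_type="diag", n_iter=200)
    hmm.fit(returns)
    mu = hmm.means_                      # shape (k, n_assets)
    sigma = np.sqrt(hmm.covars_)         # diagonal stds, shape (k, n_assets)
    return mu, sigma

class RegimeNet(torch.nn.Module):
    # placeholder regime model: learnable per-regime mu/sigma (temporal part omitted)
    def __init__(self, k, n_assets):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(k, n_assets))
        self.log_sigma = torch.nn.Parameter(torch.zeros(k, n_assets))

    def initialize_from_hmm(self, mu, sigma):
        with torch.no_grad():
            self.mu.copy_(torch.tensor(mu, dtype=torch.float32))
            self.log_sigma.copy_(torch.log(torch.tensor(sigma, dtype=torch.float32)))

# usage sketch: seed the network with the hmm estimates from the same 2,000-day window
window = np.random.default_rng(2).normal(0, 0.01, size=(2000, 4))
mu, sigma = hmm_prior(window)
net = RegimeNet(k=2, n_assets=4)
net.initialize_from_hmm(mu, sigma)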
metaphorically spoken, our neural networks can stand on the shoulders of the giant that the hmm is for regime-based estimations. table 2 presents the results by comparing breaches between the two classic approaches (mean/variance, hmm) and the non-initialized and hmm-initialized neural networks across all four regions. panels a and b display the 1% var threshold for equities and long bonds, respectively, while panels c and d show the equivalent comparison for 5% var thresholds. 2 note that for model training we apply a best-out-of-5 strategy as described in section 3.2. that means we repeat the training five times, starting off with random parameter initializations each time. in case of the presented hmm-initialized model, we apply the same strategy, with the exception that μ, σ of the model are initialized identically for each of the five iterations. all residual parameters are initialized randomly as fits best according to the neural network part of the model. three findings are observable: first, not a single var threshold estimation process in a single region and in either of the two asset classes was able to uphold its promise in that an estimated 1% var threshold should be breached no more than 1% of the time. this is very disappointing and quite alarming for institutional investors such as pension funds and insurers, since it implies that all approaches, established and machine learning based, fail to sufficiently capture downside tail risks and hence underestimate 1% var thresholds. the vast majority of approaches estimate var thresholds that are breached in more than 2% of the cases, and the lstm fails entirely if not initialised. in fact, even the best method, the hmm for us equities, estimates var thresholds which are breached in 1.34% of the cases. second, when inspecting the ability of our eight methods to estimate 5% var thresholds, the result remains bad but is less catastrophic. the mean/variance approach, the hmm and the initialised lstm display cases where their var thresholds were breached in less than the expected 5% of weeks. the mean/variance and hmm approaches meet their thresholds in 3 out of 8 cases and the initialised lstm in 1 out of 8. overall, this is still a disappointing performance, especially for the feed forward neural network and the cnn. 1 even though we initialize μ, σ from hmm parameters, we still have weights to be initialized arising from the temporal neural network part of the model. we do this on a per-layer level by sampling uniformly, where i is the number of input units for this layer. 2 we focus our discussion of results on the equities and long bonds since these have more variation, lower skewness and hence more risk. results for the short bonds are available upon request from the contact author. third, when comparing the initialised with the non-initialised neural networks, the performance is like day versus night. the non-initialised neural networks always perform worse and the lstm performs entirely dismally without a suitable prior. when comparing across all eight approaches, the hmm appears most competitive, which means that we either have to further advance the design of our neural networks or their marginal value add beyond classic econometric approaches appears nonexistent. to advance the design of our neural networks further, we aim to balance their utility functions to avoid the extreme, unrealistic results possible in the univariate case.
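as an aside on footnote 1 above: the sampling range for the remaining weights did not survive the extraction. a common convention, and the one assumed in the sketch below rather than taken from the text, is to draw each weight uniformly from [-1/sqrt(i), 1/sqrt(i)], where i is the layer's number of input units:

import torch

def init_uniform_by_fan_in(layer: torch.nn.Linear) -> None:
    # per-layer uniform initialization; the +/- 1/sqrt(fan_in) bound is an assumption
    fan_in = layer.in_features
    bound = 1.0 / fan_in ** 0.5
    torch.nn.init.uniform_(layer.weight, -bound, bound)
    if layer.bias is not None:
        torch.nn.init.uniform_(layer.bias, -bound, bound)

# usage: apply to the temporal (non mu/sigma) part of the regime network
temporal_part = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
for module in temporal_part:
    if isinstance(module, torch.nn.Linear):
        init_uniform_by_fan_in(module)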
[insert table 2 about here] whereas cpz (2020) regularize their neural networks via no-arbitrage conditions, we regularize via balancing the incentive function of our neural networks on multiple objectives. specifically, we extend the loss function to not only focus on the accuracy of point estimates but also give some weight to eventually achieving empirically realistic regime distributions (i.e. in our data sample across all four regions no regime displays more than 60% frequency on a weekly basis). this balanced extension of the loss function prevents the neural networks from arriving at bizarre outcomes such as the conclusion that bear markets (or even bull markets) barely exist. technically, such bizarre outcomes result from cases where the regime probabilities φi(t) tend to converge globally either to 0 or to 1 for all t, which basically means the neural network only recognises one regime. to balance the incentive function of the neural network and facilitate balancing between regime contributions, we introduce an additional regularization term reg into the loss function which penalizes unbalanced regime probabilities. the regularization term is displayed in equation (13) below. if the bear and bull market have equivalent regime probabilities, the term converges to 0.5, while it converges towards 1 the larger the imbalance between the two regimes becomes. substituting equation (13) into our loss function of equation (10) leads to equation (14) below, which doubles the point-estimation-based standard loss function in case of total regime balance inaccuracy but adds only 50% of the original loss function in case of full balance. conditioning the extension of the loss function on its origin is important to avoid biases due to diverging scales. setting the additional incentive function to initially have half the marginal weight of the original function also seems appropriate for comparability. the outcomes of balancing the incentive functions of our neural networks are displayed in table 3, where panels a-d are distributed as previously in table 2. the results are very encouraging, especially with regard to the lstm. the regularized lstm is in all 32 cases (i.e. 2 thresholds, 2 asset classes, 4 regions) better than the non-regularized lstm. for the 5% var thresholds, it reaches realized occurrences of less than 4% in half the cases. this implies that the regularized lstm can even be more cautious than required. the regularized lstm also sets a new record for the 1% var threshold. [insert table 4 about here] to measure how much value the regularized lstm can add compared to alternative approaches, we compute the annual accumulated costs of breaches as well as the average cost per breach. they are displayed in table 5 for the 5% var threshold. the regularized lstm is for both numbers in any case better than the classic approaches (mean/variance and hmm) and the difference is economically meaningful. for equities, the regularized lstm results in annual accumulated costs of 97-130 basis points less than the classic mean/variance approach, which would be up to over one billion us$ of avoided losses per annum for a > us$100 billion equity portfolio of a pension fund such as calpers or pggm. compared to the hmm approach, the regularized lstm avoids annual accumulated costs of 44-88 basis points, which is still a substantial amount of money for the vast majority of asset owners.
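equations (13) and (14) did not survive the extraction; the sketch below therefore only illustrates one functional form consistent with the described behaviour (reg equal to 0.5 when the two regime probabilities are balanced on average, approaching 1 under total imbalance, and the total loss scaling the point-estimation loss by 1 + reg), not the authors' exact formula:

import torch

def regime_balance_reg(phi: torch.Tensor) -> torch.Tensor:
    # phi: tensor of shape (T,) with bull-regime probabilities in [0, 1];
    # returns ~0.5 when regimes are balanced on average and ~1.0 when one regime
    # dominates; this specific form is an assumption chosen only to reproduce the
    # limiting behaviour described in the text
    return 0.5 + (phi.mean() - 0.5).abs()

def balanced_loss(point_loss: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    # total loss = point-estimation loss scaled by (1 + reg): +50% of the original
    # loss under full balance, doubled under full imbalance
    return point_loss * (1.0 + regime_balance_reg(phi))

# usage sketch: one year of weekly regime probabilities, a dummy standard loss value
phi = torch.sigmoid(torch.randn(260))
loss = balanced_loss(torch.tensor(0.12), phi)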
with respect to long bonds, where total returns are naturally lower, the regularized lstm's avoided annual costs against the mean/variance and the hmm approach range between 23-30 basis points, which is high for bond markets. [insert table 5 about here] these statistically and economically attractive results have, however, been achieved on the basis of 2,000 days of training data. such "big" amounts of data may not always be available for newer investment strategies. hence, it is natural to ask whether the performance of the regularized neural networks drops when they are fed with just half the data (i.e. 1,000 days). apart from reducing statistical power, a period of just over 4 years may also comprise less information on downside tail risks. indeed, the results displayed in table 6 show that in all combinations of var thresholds and asset classes, the regularized networks trained on 2,000 days substantially outperform and usually dominate their equivalently designed counterparts trained on half the data. hence, the attractive risk management features of hmm-initialised, balanced-incentive lstms are likely only available for established discretionary investment strategies where sufficient historical data is available, or for entirely rules-based approaches whose history can be replicated ex post with sufficient confidence. [insert table 6 about here] we further conduct an array of robustness tests and sensitivity analyses to challenge our results and the applicability of neural network based regime switching models. as a first robustness test, we extend the regularization in a manner that the balancing incentive function of equation (13) has the same marginal weight as the original loss function instead of just half the marginal weight. the performance of both types of regularized lstms is essentially equivalent. second, we study higher var thresholds such as 10% and find the results to be very comparable to the 5% var results. third, we estimate monthly instead of weekly var. accounting for the loss of statistical power in comparison tests due to the lower number of observations, the results are equivalent again. we conduct two sensitivity analyses. first, we set up our neural networks to be regularized by the two balancing incentive functions but without hmm initialisation. the results show that the regularization enhances performance compared to the naïve non-regularized and non-initialized models, but that both design features are needed to achieve the full performance. in other words, initialization and regularization seem to be additive design features in terms of neural network performance. second, we run analytical approaches with k > 2 regimes. adding a third or even fourth regime when asset prices only know two directions leads to substantial instability in the neural networks and tends to depreciate the quality of results. inspired by gkx (2020) and cpz (2020), we find our hmm-initialised, balanced-incentive lstm to outperform the single incentive rnn as well as any other neural network or established approach by statistically and economically significant levels. third, we halve our training data set of 2,000 days. we find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets. hence, we conclude that well designed neural networks, i.e.
a recurrent lstm neural network initialized with the best current evidence and balanced incentives, can potentially advance the protection offered to institutional investors by var thresholds through a reduction in threshold breaches. however, such advancements rely on the availability of a long data history, which may not always be available in practice when estimating asset management var thresholds.

table 1 (caption): descriptive statistics of the daily returns of the main equity index (equity), the main short sovereign bond with 1-3 years maturity (sb1-3y) and the main long sovereign bond with 7-10 years maturity (sb7-10). descriptive statistics include the sample length, the first three moments of the return distribution and 11 thresholds along the return distribution.

references (titles as extracted):
risks and portfolio decisions involving hedge funds
skewness in stock returns: reconciling the evidence on firm versus aggregate returns
can machines learn capital structure dynamics? working paper
international asset allocation with regime shifts
machine learning, human experts, and the valuation of real assets
machine learning versus economic restrictions: evidence from stock return predictability
a maximization technique occurring in the statistical analysis of probabilistic functions of markov chains
bond risk premia with machine learning
value-at-risk: a multivariate switching regime approach
econometric measures of connectedness and systemic risk in the finance and insurance sectors
neural networks for pattern recognition
deep learning in asset pricing
subsampled factor models for asset pricing: the rise of vasa
microstructure in the machine age
towards explaining deep learning: significance tests for multi-layer perceptrons
asset pricing with omitted factors
how to deal with small data sets in machine learning: an analysis on the cat bond market
understanding the difficulty of training deep feedforward neural networks
generating sequences with recurrent neural networks
autoencoder asset pricing models
much ado about nothing? exchange rate forecasting: neural networks vs. linear models using monthly and weekly data
long short-term memory
towards explainable ai: significance tests for neural networks
improving earnings predictions with machine learning. working paper
jorion, p. value at risk
characteristics are covariances: a unified model of risk and return
adam: a method for stochastic optimization
shrinking the cross-section
the skew risk premium in the equity index market
a simple weight decay can improve generalization
advances in financial machine learning
deep learning: a critical appraisal
frontiers in var forecasting and backtesting
dynamic semiparametric models for expected shortfall (and value-at-risk)
maschinelles lernen bei der entwicklung von wertsicherungsstrategien. zeitschrift für das gesamte kreditwesen
deep learning for mortgage risk
forecasting value at risk and expected shortfall using a semiparametric approach based on the asymmetric laplace distribution
forecast combinations for value at risk and expected shortfall
moments of markov switching models
verstyuk, s. 2020. modeling multivariate time series in economics: from auto-regressions to recurrent neural networks. working paper
fixup initialization: residual learning without normalization.
interantional conference on learning representations (iclr) paper acknowledgments: we are grateful for comments from theodor cojoianu, james hodson, juho kanniainen, qian li, yanan, andrew vivian, xiaojun zeng and participants at 2019 financial data science association conference in san francisco the international conference on fintech and financial data science at university college dublin (ucd). the views expressed in this manuscript are not necessarily shared by sociovestix labs, the technical expert group of dg fisma or warburg invest ag. authors are listed in alphabetical order, whereby hoepner serves as the contact author (andreas.hoepner@ucd.ie). any remaining errors are our own. key: cord-005090-l676wo9t authors: gao, chao; liu, jiming; zhong, ning title: network immunization and virus propagation in email networks: experimental evaluation and analysis date: 2010-07-14 journal: knowl inf syst doi: 10.1007/s10115-010-0321-0 sha: doc_id: 5090 cord_uid: l676wo9t network immunization strategies have emerged as possible solutions to the challenges of virus propagation. in this paper, an existing interactive model is introduced and then improved in order to better characterize the way a virus spreads in email networks with different topologies. the model is used to demonstrate the effects of a number of key factors, notably nodes’ degree and betweenness. experiments are then performed to examine how the structure of a network and human dynamics affects virus propagation. the experimental results have revealed that a virus spreads in two distinct phases and shown that the most efficient immunization strategy is the node-betweenness strategy. moreover, those results have also explained why old virus can survive in networks nowadays from the aspects of human dynamics. the internet, the scientific collaboration network and the social network [15, 32] . in these networks, nodes denote individuals (e.g. computers, web pages, email-boxes, people, or species) and edges represent the connections between individuals (e.g. network links, hyperlinks, relationships between two people or species) [26] . there are many research topics related to network-like environments [23, 34, 46] . one interesting and challenging subject is how to control virus propagation in physical networks (e.g. trojan viruses) and virtual networks (e.g. email worms) [26, 30, 37] . currently, one of the most popular methods is network immunization where some nodes in a network are immunized (protected) so that they can not be infected by a virus or a worm. after immunizing the same percentages of nodes in a network, the best strategy can minimize the final number of infected nodes. valid propagation models can be used in complex networks to predict potential weaknesses of a global network infrastructure against worm attacks [40] and help researchers understand the mechanisms of new virus attacks and/or new spreading. at the same time, reliable models provide test-beds for developing or evaluating new and/or improved security strategies for restraining virus propagation [48] . researchers can use reliable models to design effective immunization strategies which can prevent and control virus propagation not only in computer networks (e.g. worms) but also in social networks (e.g. sars, h1n1, and rumors). today, more and more researchers from statistical physics, mathematics, computer science, and epidemiology are studying virus propagation and immunization strategies. 
for example, computer scientists focus on algorithms and the computational complexities of strategies, i.e. how to quickly search a short path from one "seed" node to a targeted node just based on local information, and then effectively and efficiently restrain virus propagation [42] . epidemiologists focus on the combined effects of local clustering and global contacts on virus propagation [5] . generally speaking, there are two major issues concerning virus propagation: 1. how to efficiently restrain virus propagation? 2. how to accurately model the process of virus propagation in complex networks? in order to solve these problems, the main work in this paper is to (1) systematically compare and analyze representative network immunization strategies in an interactive email propagation model, (2) uncover what the dominant factors are in virus propagation and immunization strategies, and (3) improve the predictive accuracy of propagation models through using research from human dynamics. the remainder of this paper is organized as follows: sect. 2 surveys some well-known network immunization strategies and existing propagation models. section 3 presents the key research problems in this paper. section 4 describes the experiments which are performed to compare different immunization strategies with the measurements of the immunization efficiency, the cost and the robustness in both synthetic networks (including a synthetic community-based network) and two real email networks (the enron and a university email network), and analyze the effects of network structures and human dynamics on virus propagation. section 5 concludes the paper. in this section, several popular immunization strategies and typical propagation models are reviewed. an interactive email propagation model is then formulated in order to evaluate different immunization strategies and analyze the factors that influence virus propagation. network immunization is one of the well-known methods to effectively and efficiently restrain virus propagation. it cuts epidemic paths through immunizing (injecting vaccines or patching programs) a set of nodes from a network following some well-defined rules. the immunized nodes, in most published research, are all based on node degrees that reflect the importance of a node in a network, to a certain extent. in this paper, the influence of other properties of a node (i.e. betweenness) on immunization strategies will be observed. pastor-satorras and vespignani have studied the critical values in both random and targeted immunization [39] . the random immunization strategy treats all nodes equally. in a largescale-free network, the immunization critical value is g c → 1. simulation results show that 80% of nodes need to be immunized in order to recover the epidemic threshold. dezso and barabasi have proposed a new immunization strategy, named as the targeted immunization [9] , which takes the actual topology of a real-world network into consideration. the distributions of node degrees in scale-free networks are extremely heterogeneous. a few nodes have high degrees, while lots of nodes have low degrees. the targeted immunization strategy aims to immunize the most connected nodes in order to cut epidemic paths through which most susceptible nodes may be infected. for a ba network [2] , the critical value of the targeted immunization strategy is g c ∼ e −2 mλ . this formula shows that it is always possible to obtain a small critical value g c even if the spreading rate λ changes drastically. 
however, one of the limitations of the targeted immunization strategy is that it needs information about the global topology; in particular, the ranking of the nodes must be clearly defined. this is impractical and uneconomical for handling large-scale and dynamically evolving networks, such as p2p networks or email networks. in order to overcome this shortcoming, a local strategy, namely the acquaintance immunization [8, 16], has been developed. the motivation for the acquaintance immunization is to work without any global information. in this strategy, p% of nodes are first selected as "seeds" from a network, and then one or more of their direct acquaintances are immunized. because a node with a higher degree has more links in a scale-free network, it will be selected as a "seed" with a higher probability. thus, the acquaintance immunization strategy is more efficient than the random immunization strategy, but less efficient than the targeted immunization strategy. moreover, there is another issue which limits the effectiveness of the acquaintance immunization: it does not differentiate nodes, i.e. it randomly selects "seed" nodes and their direct neighbors [17]. another effective distributed strategy is the d-steps immunization [12, 17]. this strategy views decentralized immunization as a graph covering problem. that is, for a node vi, it looks for a node to be immunized that has the maximal degree within d steps of vi. this method only uses the local topological information within a certain range (e.g. the degree information of nodes within d steps). thus, the maximal acquaintance strategy can be seen as a 1-step immunization. however, it does not take into account domain-specific heuristic information, nor is it able to decide what the value of d should be in different networks. a brief python sketch of the acquaintance and d-steps selection rules is given at the end of this passage. the immunization strategies described in the previous section are all based on node degrees. the way different immunized nodes are selected is illustrated in fig. 1.

fig. 1 (caption): an illustration of different strategies. the targeted immunization will directly select v5 as an immunized node based on the degrees of nodes. suppose that v7 is a "seed" node. v6 will be immunized based on the maximal acquaintance immunization strategy, and v5 will be indirectly selected as an immunized node based on the d-steps immunization strategy, where d = 2.

fig. 2 (caption): an illustration of betweenness-based strategies. if we select one immunized node, the targeted immunization strategy will directly select the highest-degree node, v6. the node-betweenness strategy will select v5 as it has the highest node betweenness. the edge-betweenness strategy will select one of v3, v4 and v5 because the edges, l1 and l2, have the highest edge betweenness.

rather than simply removing the highest-degree nodes from a network, many approaches cut epidemic paths by means of increasing the average path length of a network, for example by partitioning large-scale networks based on betweenness [4, 36]. for a network, node (edge) betweenness refers to the number of shortest paths that pass through a node (edge). a higher value of betweenness means that the node (edge) links more adjacent communities and will be used frequently in network communications. although [19] have analyzed the robustness of a network against degree-based and betweenness-based attacks, the spread of a virus in a propagation model is not considered, so the effect of different measurements on virus propagation is not clear.
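a minimal networkx sketch of the two local strategies described above; the exact seed-selection and tie-breaking details of the cited papers are not reproduced here:

import random
import networkx as nx

def acquaintance_immunization(g: nx.Graph, p: float, maximal: bool = False):
    # pick p% of nodes as "seeds", then immunize one neighbour of each seed;
    # maximal=False gives the random-acquaintance variant, maximal=True the
    # maximal-acquaintance variant (immunize the seed's highest-degree neighbour)
    seeds = random.sample(list(g.nodes), max(1, int(p * g.number_of_nodes())))
    immunized = set()
    for s in seeds:
        neighbours = list(g.neighbors(s))
        if not neighbours:
            continue
        immunized.add(max(neighbours, key=g.degree) if maximal else random.choice(neighbours))
    return immunized

def d_steps_immunization(g: nx.Graph, seeds, d: int = 2):
    # for each seed, immunize the highest-degree node within d hops of it
    immunized = set()
    for s in seeds:
        ball = nx.single_source_shortest_path_length(g, s, cutoff=d)
        immunized.add(max(ball, key=g.degree))
    return immunized

# usage sketch on a synthetic scale-free network
g = nx.barabasi_albert_graph(1000, 3, seed=0)
print(len(acquaintance_immunization(g, 0.05, maximal=True)))
print(len(d_steps_immunization(g, random.sample(list(g.nodes), 50), d=2)))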
is it possible to restrain virus propagation, especially from one community to another, by immunizing nodes or edges which have higher betweenness. in this paper, two types of betweenness-based immunization strategies will be presented, i.e. the node-betweenness strategy and the edge-betweenness strategy. that is, the immunized nodes are selected in the descending order of node-and edge-betweenness, in an attempt to better understand the effects of the degree and betweenness centralities on virus propagation. figure 2 shows that if v 4 is immunized, the virus will not propagate from one part of the network to another. the node-betweenness strategy will select v 5 as an immunized node, which has the highest node betweenness, i.e. 41. the edge-betweenness strategy will select the terminal nodes of l 1 or l 2 (i.e. v 3 , v 4 or v 4 , v 5 ) as they have the highest edge betweenness. as in the targeted immunization, the betweenness-based strategies also require information about the global betweenness of a network. the experiments presented in this paper is to find a new measurement that can be used to design a highly efficient immunization strategy. the efficiency of these strategies is compared both in synthetic networks and in real-world networks, such as the enron email network described by [4] . in order to compare different immunization strategies, a propagation model is required to act as a test-bed in order to simulate virus propagation. currently, there are two typical models: (1) the epidemic model based on population simulation and (2) an interactive email model which utilizes individual-based simulation. lloyd and may have proposed an epidemic propagation model to characterize virus propagation, a typical mathematical model based on differential equations [26] . some specific epidemic models, such as si [37, 38] , sir [1, 30] , sis [14] , and seir [11, 28] , have been developed and applied in order to simulate virus propagation and study the dynamic characteristics of whole systems. however, these models are all based on the mean-filed theory, i.e. differential equations. this type of black-box modeling approach only provides a macroscopic understanding of virus propagation-they do not give much insight into microscopic interactive behavior. more importantly, some assumptions, such as a fully mixed (i.e. individuals that are connected with a susceptible individual will be randomly chosen from the whole population) [33] and equiprobable contacts (i.e. all nodes transmit the disease with the same probability and no account is taken of the different connections between individuals) may not be valid in the real world. for example, in email networks and instant message (im) networks, communication and/or the spread of information tend to be strongly clustered in groups or communities that have more closer relationships rather than being equiprobable across the whole network. these models may also overestimate the speed of propagation [49] . in order to overcome the above-mentioned shortcomings, [49] have built an interactive email model to study worm propagation, in which viruses are triggered by human behavior, not by contact probabilities. that is to say, the node will be infected only if a user has checked his/her email-box and clicked an email with a virus attachment. 
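returning to the node- and edge-betweenness strategies introduced at the beginning of this passage, their selection of immunized nodes can be sketched with networkx as follows; the tie-breaking rule (which terminal node of a top-ranked edge is immunized first) is an assumption, not taken from the paper:

import networkx as nx

def node_betweenness_immunization(g: nx.Graph, n_immunized: int):
    # immunize the n nodes with the highest node betweenness
    bc = nx.betweenness_centrality(g)
    return sorted(bc, key=bc.get, reverse=True)[:n_immunized]

def edge_betweenness_immunization(g: nx.Graph, n_immunized: int):
    # immunize terminal nodes of the edges with the highest edge betweenness
    ebc = nx.edge_betweenness_centrality(g)
    immunized = []
    for u, v in sorted(ebc, key=ebc.get, reverse=True):
        for node in (u, v):
            if node not in immunized:
                immunized.append(node)
            if len(immunized) == n_immunized:
                return immunized
    return immunized

# usage sketch: protect 5% of nodes in a synthetic scale-free network
g = nx.barabasi_albert_graph(1000, 3, seed=0)
protected = node_betweenness_immunization(g, int(0.05 * g.number_of_nodes()))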
thus, virus propagation in the email network is mainly determined by two behavioral factors: email-checking time intervals (t i ) and email-clicking probabilities (p i ), where i ∈ [1, n ] , n is the total number of users in a network. t i is determined by a user's own habits; p i is determined both by user security awareness and the efficiency of the firewall. however, the authors do not provide much information about how to restrain worm propagation. in this paper, an interactive email model is used as a test-bed to study the characteristics of virus propagation and the efficiency of different immunization strategies. it is readily to observe the microscopic process of worm propagating through this model, and uncover the effects of different factors (e.g. the power-law exponent, human dynamics and the average path length of the network) on virus propagation and immunization strategies. unlike other models, this paper mainly focuses on comparing the performance of degree-based strategies and betweenness-based strategies, replacing the critical value of epidemics in a network. a detailed analysis of the propagation model is given in the following section. an email network can be viewed as a typical social network in which a connection between two nodes (individuals) indicates that they have communicated with each other before [35, 49] . generally speaking, a network can be denoted as e = v, l , where v = {v 1 , v 2 , . . . , v n } is a set of nodes and l = { v i , v j | 1 ≤ i, j ≤ n} is a set of undirected links (if v i in the hit-list of v j , there is a link between v i and v j ). a virus can propagate along links and infect more nodes in a network. in order to give a general definition, each node is represented as a tuple . -id: the node identifier, v i .i d = i. -state: the node state: i f the node has no virus, danger = 1, i f the node has virus but not in f ected, in f ected = 2, i f the node has been in f ected, immuni zed = 3, i f the node has been immuni zed. -nodelink: the information about its hit-list or adjacent neighbors, i.e. v i .n odelink = { i, j | i, j ∈ l}. -p behavior : the probability that a node will to perform a particular behavior. -b action : different behaviors. -virusnum: the total number of new unchecked viruses before the next operation. -newvirus: the number of new viruses a node receives from its neighbors at each step. in addition, two interactive behaviors are simulated according to [49] , i.e. the emailchecking time intervals and the email-clicking probabilities both follow gaussian distributions, if the sample size goes to infinity. for the same user i, the email-checking interval t i (t) in [49] has been modeled by a poisson distribution, i.e. t i (t) ∼ λe −λt . thus, the formula for p behavior in the tuple can be written as p 1 behavior = click prob and p 2 behavior = checkt ime. -clickprob is the probability of an user clicking a suspected email, -checkrate is the probability of an user checking an email, -checktime is the next time the email-box will be checked, v i .p 2 behavior = v i .checkt ime = ex pgenerator(v i .check rate). b action can be specified as b 1 action = receive_email, b 2 action = send_email, and b 3 action = update_email. if a user receives a virus-infected email, the corresponding node will update its state, i.e. v i .state ← danger. if a user opens an email that has a virus-infected attachment, the node will adjust its state, i.e. v i .state ← in f ected, and send this virus email to all its friends, according to its hit-list. 
if a user is immunized, the node will update its state to v i .state ← immuni zed. in order to better characterize virus propagation, some assumptions are made in the interactive email model: -if a user opens an infected email, the node is infected and will send viruses to all the friends on its hit-list; -when checking his/her mailbox, if a user does not click virus emails, it is assumed that the user deletes the suspected emails; -if nodes are immunized, they will never send virus emails even if a user clicks an attachment. the most important measurement of the effectiveness of an immunization strategy is the total number of infected nodes after virus propagation. the best strategy can effectively restrain virus propagation, i.e. the total number of infected nodes is kept to a minimum. in order to evaluate the efficiency of different immunization strategies and find the relationship between local behaviors and global dynamics, two statistics are of particular interest: 1. sid: the sum of the degrees of immunized nodes that reflects the importance of nodes in a network 2. apl: the average path length of a network. this is a measurement of the connectivity and transmission capacity of a network where d i j is the shortest path between i and j. if there is no path between i and j, d i j → ∞. in order to facilitate the computation, the reciprocal of d i j is used to reflect the connectivity of a network: if there is no path between i and j, d −1 i j = 0. based on these definitions, the interactive email model given in sect. 2.3 can be used as a test-bed to compare different immunization strategies and uncover the effects of different factors on virus propagation. the specific research questions addressed in this paper can be summarized as follows: 1. how to evaluate network immunization strategies? how to determine the performance of a particular strategy, i.e. in terms of its efficiency, cost and robustness? what is the best immunization strategy? what are the key factors that affect the efficiency of a strategy? 2. what is the process of virus propagation? what effect does the network structure have on virus propagation? 3. what effect do human dynamics have on virus propagation? the simulations in this paper have two phases. first, a existing email network is established in which each node has some of the interactive behaviors described in sect. 2.3. next, the virus propagation in the network is observed and the epidemic dynamics are studied when applying different immunization strategies. more details can be found in sect. 4. in this section, the simulation process and the structures of experimental networks are presented in sects. 4.1 and 4.2. section 4.3 uses a number of experiments to evaluate the performance (e.g. efficiency, cost and robustness) of different immunization strategies. specifically, the experiments seek to address whether or not betweenness-based immunization strategies can restrain worm propagation in email networks, and which measurements can reflect and/or characterize the efficiency of immunization strategies. finally, sects. 4.4 and 4.5 presents an in-depth analysis in order to determine the effect of network structures and human dynamics on virus propagation. the experimental process is illustrated in fig. 3 . some nodes are first immunized (protected) from the network using different strategies. the viruses are then injected into the network in order to evaluate the efficiency of those strategies by comparing the total number of infected nodes. 
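the apl measure defined above uses reciprocal shortest-path lengths so that disconnected pairs contribute zero; the exact normalization of the (lost) formula is assumed below to be an average over all ordered node pairs, which makes the quantity coincide with networkx's global efficiency:

import networkx as nx

def apl_connectivity(g: nx.Graph) -> float:
    # average reciprocal shortest-path length over all ordered node pairs;
    # disconnected pairs contribute 0 (d_ij -> infinity, so 1/d_ij = 0), matching
    # the convention described in the text; the 1 / (N * (N - 1)) normalization
    # is an assumption
    n = g.number_of_nodes()
    total = 0.0
    for source, lengths in nx.all_pairs_shortest_path_length(g):
        total += sum(1.0 / d for target, d in lengths.items() if target != source)
    return total / (n * (n - 1))

g = nx.barabasi_albert_graph(200, 3, seed=0)
print(apl_connectivity(g))   # a higher value indicates a better connected network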
two methods are used to select the initially infected nodes: random infection and malicious infection, i.e. infecting the nodes with maximal degrees. the user behavior parameters are based on the definitions in sect. 2.3, where μp = 0.5, σp = 0.3, μt = 40, and σt = 20. since the process of email worm propagation is stochastic, all results are averaged over 100 runs. the virus propagation algorithm is specified in alg. 1; a simplified, runnable python sketch of this loop is given at the end of this passage. many common networks exhibit the scale-free phenomenon [2, 21], where node degrees follow a power-law distribution [42], i.e. the fraction of nodes having k edges, p(k), decays according to a power law p(k) ∼ k^(-α) (where α is usually between 2 and 3) [29]. recent research has shown that email networks also follow power-law distributions with a long tail [35, 49]. therefore, in this paper, three synthetic power-law networks and a synthetic community-based network are generated using the glp algorithm [6], in which the power-law exponent can be tuned. the three synthetic networks all have 1000 nodes with α = 1.7, 2.7, and 3.7, respectively. the statistical characteristics and visualization of the synthetic community-based network are shown in table 1 and fig. 4c, f, respectively. in order to reflect the characteristics of a real-world network, the enron email network 1, which is built by andrew fiore and jeff heer, and the university email network 2, which is compiled by the members of the university rovira i virgili (tarragona), will also be studied. the structure and degree distributions of these networks are shown in table 2 and fig. 4. in particular, the cumulative distributions are estimated with maximum likelihood using the method provided by [7]. the degree statistics are shown in table 9. in this section, a comparison is made of the effectiveness of different strategies in an interactive email model. experiments are then used to evaluate the cost and robustness of each strategy.

alg. 1 virus propagation algorithm
input: nodedata[nodenum] stores the topology of an email network; timestep is the system clock; v0 is the set of initially infected nodes.
output: simnum[timestep][k] stores the number of infected nodes in the network in the k-th simulation.
(1) for k = 1 to runtime // run 100 times to obtain an average value
(2)   nodedata[nodenum] ← initialize an email network as well as users' checking times and clicking probabilities;
(3)   nodedata[nodenum] ← choose immunized nodes based on different immunization strategies and adjust their states;
(4)   while timestep < endsimul // there are 600 steps in each run
(5)     for i = 1 to nodenum
(6)       if nodedata[i].checktime == 0
(7)         prob ← compute the probability of opening a virus-infected email based on the user's clickprob and virusnum
(8)         if send a virus to all friends according to its hit-list
(12)        endif
(13)      endif
(14)    endfor
(15)    for i = 1 to nodenum
(16)      update the next checktime based on the user's checkrate
(17)      nodedata

the immunization efficiency of the following immunization strategies is compared: the targeted and random strategies [39], the acquaintance strategy (random and maximal neighbor) [8, 16], the d-steps strategy (d = 2 and d = 3) [12, 17] (which is introduced in sect. 2.1), and the proposed betweenness-based strategies (node- and edge-betweenness). (bridges between different communities: 100; the whole network: α = 1.77, k = 8.34.) in the initial set of experiments, the proportion of immunized nodes (5, 10, and 30%) is varied in the synthetic networks and the enron email network.
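a compact python sketch of the simulation loop of alg. 1, using the behaviour parameters quoted above (gaussian clicking probabilities, exponential checking intervals); the node bookkeeping is simplified, and the steps lost from the extracted listing are filled in with the behaviour described in the surrounding text, so this is an illustration rather than the authors' exact implementation:

import random
import networkx as nx

HEALTHY, INFECTED, IMMUNIZED = "healthy", "infected", "immunized"

def simulate(g, immunized, seeds, steps=600, mu_p=0.5, sigma_p=0.3, mean_interval=40):
    click_prob = {v: min(max(random.gauss(mu_p, sigma_p), 0.0), 1.0) for v in g}
    next_check = {v: random.expovariate(1.0 / mean_interval) for v in g}
    state = {v: HEALTHY for v in g}
    inbox_viruses = {v: 0 for v in g}
    for v in immunized:
        state[v] = IMMUNIZED
    for v in seeds:                              # initially infected nodes (v0)
        if state[v] != IMMUNIZED:
            state[v] = INFECTED
            for nb in g.neighbors(v):
                inbox_viruses[nb] += 1
    for t in range(steps):
        for v in g:
            if next_check[v] <= t and inbox_viruses[v] > 0 and state[v] == HEALTHY:
                # assumed combination rule: probability of opening at least one
                # of the waiting virus emails, given the user's clicking probability
                p_open = 1 - (1 - click_prob[v]) ** inbox_viruses[v]
                if random.random() < p_open:
                    state[v] = INFECTED
                    for nb in g.neighbors(v):    # send the virus to the whole hit-list
                        inbox_viruses[nb] += 1
                inbox_viruses[v] = 0             # unopened suspected emails are deleted
            if next_check[v] <= t:
                next_check[v] = t + random.expovariate(1.0 / mean_interval)
    return sum(s == INFECTED for s in state.values())

g = nx.barabasi_albert_graph(1000, 3, seed=0)
print(simulate(g, immunized=[], seeds=random.sample(list(g.nodes), 2)))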
table 3 shows the simulation results in the enron email network which is initialized with two infected nodes. figure 5 shows the average numbers of infected nodes over time. tables 4, 5 , and 6 show the numerical results in three synthetic networks, respectively. the simulation results show that the node-betweenness immunization strategy yields the best results (i.e. the minimum number of infected nodes, f) except for the case where 5% of the nodes in the enron network are immunized under a malicious attack. the average degree of the enron network is k = 3.4. this means that only a few nodes have high degrees, others have low degrees (see table 9 ). in such a network, if nodes with maximal degrees are infected, viruses will rapidly spread in the network and the final number of infected nodes will be larger than in other cases. the targeted strategy therefore does not perform any better than the node-betweenness strategy. in fact, as the number of immunized nodes increases, the efficiency of the node-betweenness immunization increases proportionally there are two infected nodes with different attack modes. if there is no immunization, the final number of infected nodes is 937 with a random attack and 942 with a malicious attack, and ap l = 751.36(10 −4 ). the total simulation time t = 600 more than the targeted strategy. therefore, if global topological information is available, the node-betweenness immunization is the best strategy. the maximal s i d is obtained using the targeted immunization. however, the final number of infected nodes (f) is consistent with the average path length (ap l) but not with the s i d. that is to say, controlling a virus epidemic does not depend on the degrees of immunized nodes but on the path length of a whole network. this also explains why the efficiency of the node-betweenness immunization strategy is better than that of the targeted immunization strategy. the node-betweenness immunization selects nodes based on the average path length, while the targeted immunization strategy selects based on the size of degrees. a more in-depth analysis is undertaken by comparing the change of the ap l with respect to the different strategies used in the synthetic networks. the results are shown in fig. 6 . figure 7a , b compare the change of the final number of infected nodes over time, which correspond to fig. 6c , d, respectively. these numerical results validate the previous assertion that the average path length can be used as a measurement to design an effective immunization strategy. the best strategy is to divide the whole network into different sub-networks and increase the average path length of a network, hence cut the epidemic paths. in this paper, all comparative results are the average over 100 runs using the same infection model (i.e. the virus propagation is compared for both random and malicious attacks) and user behavior model (i.e. all simulations use the same behavior parameters, as shown in sect. 4.1). thus, it is more reasonable and feasible to just evaluate how the propagation of a virus is affected by immunization strategies, i.e. avoiding the effects caused by the stochastic process, the infection model and the user behavior. it can be seen that the edge-betweenness strategy is able to find some nodes with high degrees of centrality and then integrally divide a network into a number of sub-networks (e.g. v 4 in fig. 2) . however, compared with the nodes (e.g. v 5 in fig. 
2 ) selected by the node-betweenness strategy, the nodes with higher edge betweenness can not cut the epidemic paths as they can not effectively break the whole structure of a network. in fig. 2 , the synthetic community-based network and the university email network are used as examples to illustrate why the edge-betweenness strategy can not obtain the same immunization efficiency as the node-betweenness strategy. to select two nodes as immunized nodes from fig. 2 , the node-betweenness immunization will select {v 5 , v 3 } by using the descending order of node betweenness. however, the edge-betweenness strategy can select {v 3 , v 4 } or {v 4 , v 5 } because the edges, l 1 and l 2 , have the highest edge betweenness. this result shows that the node-betweenness strategy can not only effectively divide the whole network into two communities, but also break the interior structure of communities. although the edgebetweenness strategy can integrally divided the whole network into two parts, viruses can also propagate in each community. many networks commonly contain the structure shown in fig. 2 , for example, the enron email network and university email networks. table 7 and fig. 8 present the results of the synthetic community-based network. table 8 compares different strategies in the university email network, which also has some self-similar community structures [18] . these results further validate the analysis stated above. from the above experiments, the following conclusions can be made: tables 4-8 , ap l can be used as a measurement to evaluate the efficiency of an immunization strategy. thus, when designing a distributed immunization strategy, attentions should be paid on those nodes that have the largest impact on the apl value. 2. if the final number of infected nodes is used as a measure of efficiency, then the nodebetweenness immunization strategy is more efficient than the targeted immunization strategy. 3. the power-law exponent (α) affects the edge-betweenness immunization strategy, but has a little impact on other strategies. in the previous section, the efficiency of different immunization strategies is evaluated in terms of the final number of infected nodes when the propagation reaches an equilibrium state. by doing experiments in synthetic networks, synthetic community-based network, the enron email network and the university email network, it is easily to find that the node-betweenness immunization strategy has the highest efficiency. in this section, the performance of the different strategies will be evaluated in terms of cost and robustness, as in [20] . it is well known that the structure of a social network or an email network constantly evolves. it is therefore interesting to evaluate how changes in structure affect the efficiency of an immunization strategy. -the cost can be defined as the number of nodes that need to be immunized in order to achieve a given level of epidemic prevalence ρ. generally, ρ → 0. there are some parameters which are of particular interest: f is the fraction of nodes that are immunized; f c is the critical value of the immunization when ρ → 0; ρ 0 is the infection density when no immunization strategy is implemented; ρ f is the infection density with a given immunization strategy. figure 9 shows the relationship between the reduced prevalence ρ f /ρ 0 and f. it can be seen that the node-betweenness immunization has the lowest prevalence for the smallest number of protected nodes. 
the immunization cost increases as the value of α increases, i.e. in order to achieve epidemic prevalence ρ → 0, the node-betweenness immunization strategy needs 20, 25, and 30% of nodes to be immunized, respectively, in the three synthetic networks. this is because the node-betweenness immunization strategy can effectively break the network structure and increase the path length of a network with the same number of immunized nodes. -the robustness shows a plot of tolerance against the dynamic evolution of a network, i.e. the change of power-law exponents (α). figure 10 shows the relationship between the immunized threshold f c and α. a low level of f c with a small variation indicates that the immunization strategy is robust. the robustness is important when an immunization strategy is deployed into a scalable and dynamic network (e.g. p2p and email networks). figure 10 also shows the robustness of the d-steps immunization strategy is close to that of the targeted immunization; the node-betweenness strategy is the most robust. [49] have compared virus propagation in synthetic networks with α = 1.7 and α = 1.1475, and pointed out that initial worm propagation has two phases. however, they do not give a detailed explanation of these results nor do they compare the effect of the power-law exponent on different immunization strategies during virus propagation. table 9 presents the detailed degree statistics for different networks, which can be used to examine the effect of the power-law exponent on virus propagation and immunization strategies. first, virus propagation in non-immunized networks is discussed. figure 11a shows the changes of the average number of infected nodes over time; fig. 11b gives the average degree of infected nodes at each time step. from the results, it can be seen that 1. the number of infected nodes in non-immunized networks is determined by attack modes but not the power-law exponent. in figs. 11a , b, three distribution curves (α = 1.7, 2.7, and 3.7) overlap with each other in both random and malicious attacks. the difference between them is that the final number of infected nodes with a malicious attack is larger than that with a random attack, as shown in fig. 11a , reflecting the fact that a malicious attack is more dangerous than a random attack. 2. a virus spreads more quickly in a network with a large power-law exponent than that with a small exponent. because a malicious attack initially infects highly connected nodes, the average degree of the infected nodes decreases in a shorter time comparing to a random attack (t 1 < t 2). moreover, the speed and range of the infection is amplified by those highly connected nodes. in phase i, viruses propagate very quickly and infect most nodes in a network. however, in phase ii, the number of total infected nodes grows slowly (fig. 11a) , because viruses aim to infect those nodes with low degrees (fig. 11b) , and a node with fewer links is more difficult to be infected. in order to observe the effect of different immunization strategies on the average degree of infected nodes in different networks, 5% of the nodes are initially protected against random and malicious attacks. figure 12 shows the simulation results. from this experiment, it can be concluded that 1. the random immunization has no effect on restraining virus propagation because the curves of the average degree of the infected nodes are basically coincident with the curves in the non-immunization case. 2. comparing fig. 
12a , b, c and d, e, f, respectively, it can be seen that the peak value of the average degree is the largest in the network with α=1.7 and the smallest in the network with α=3.7. this is because the network with a lower exponent has more highly connected nodes (i.e. the range of degrees is between 50 and 80), which serve as amplifiers in the process of virus propagation. 3. as α increases, so does the number of infected nodes and the virus propagation duration (t 1 < t 2 < t 3). because a larger α implies a larger ap l , the number of infected nodes will increase; if the network has a larger exponent, a virus need more time to infect those nodes with medium or low degrees. fig. 14 the average number of infected nodes and the average degree of infected nodes, with respect to time when virus spreading in different networks. we apply the targeted immunization to protect 30% nodes in the network first, consider the process of virus propagation in the case of a malicious attack where 30% of the nodes are immunized using the edge-betweenness immunization strategy. there are two intersections in fig. 13a . point a is the intersection of two curves net1 and net3, and point b is the intersection of net2 and net1. under the same conditions, fig. 13a shows that the total number of infected nodes is the largest in net1 in phase i. corresponding to fig. 13b , the average degree of infected nodes in net1 is the largest in phase i. as time goes on, the rate at which the average degree falls is the fastest in net1, as shown in fig. 13b . this is because there are more highly connected nodes in net1 than in the others (see table 9 ). after these highly connected nodes are infected, viruses attempt to infect the nodes with low degrees. therefore, the average degree in net3 that has the smallest power-law exponent is larger than those in phases ii and iii. the total number of infected nodes in net3 continuously increases, exceeding those in net1 and net2. the same phenomenon also appears in the targeted immunization strategy, as shown in fig. 14. the email-checking intervals in the above interactive email model (see sect. 2.3) is modeled using a poisson process. the poisson distribution is widely used in many real-world models to statistically describe human activities, e.g. in terms of statistical regularities on the frequency of certain events within a period of time [25, 49] . statistics from user log files to databases that record the information about human activities, show that most observations on human behavior deviate from a poisson process. that is to say, when a person engages in certain activities, his waiting intervals follow a power-law distribution with a long tail [27, 43] . vazquez et al. [44] have tried to incorporate an email-sending interval distribution, characterized by a power-law distribution, into a virus propagation model. however, their model assumes that a user is instantly infected after he/she receives a virus email, and ignores the impact of anti-virus software and the security awareness of users. therefore, there are some gaps between their model and the real world. in this section, the statistical properties associated with a single user sending emails is analyzed based on the enron dataset [41] . the virus spreading process is then simulated using an improved interactive email model in order to observe the effect of human behavior on virus propagation. 
research results from the study of statistical regularities or laws of human behavior based on empirical data can offer a valuable perspective to social scientists [45, 47] . previous studies have also used models to characterize the behavioral features of sending emails [3, 13, 22] , but their correctness needs to be further empirically verified, especially in view of the fact that there exist variations among different types of users. in this paper, the enron email dataset is used to identify the characteristics of human email-handling behavior. due to the limited space, table 10 presents only a small amount of the employee data contained in the database. as can be seen from the table, the interval distribution of email sent by the same user is respectively measured using different granularities: day, hour, and minute. figure 15 shows that the waiting intervals follow a heavy-tailed distribution. the power-law exponent as the day granularity is not accurate because there are only a few data points. if more data points are added, a power-law distribution with long tail will emerge. note that, there is a peak at t = 16 as measured at an hour granularity. eckmann et al. [13] have explained that the peak in a university dataset is the interval between the time people leave work and the time they return to their offices. after curve fitting, see fig. 15 , the waiting interval exponent is close to 1.3, i.e. α ≈ 1.3 ± 0.5. although it has been shown that an email-sending distribution follows a power-law by studying users in the enron dataset, it is still not possible to assert that all users' waiting intervals follow a power-law distribution. it can only be stated that the distribution of waiting intervals has a long-tail characteristic. it is also not possible to measure the intervals between email checking since there is no information about login time in the enron dataset. however, combing research results from human web browsing behavior [10] and the effect of non-poisson activities on propagation in the barabasi group [44] , it can be found that there are similarities between the distributions of email-checking intervals and email-sending intervals. the following section uses a power-law distribution to characterize the behavior associated with email-checking in order to observe the effect human behavior has on the propagation of an email virus. based on the above discussions, a power-law distribution is used to model the email-checking intervals of a user i, instead of the poisson distribution used in [49] , i.e. t i (τ ) ∼ τ −α . an analysis of the distribution of the power-law exponent (α) for different individuals in web browsing [10] and in the enron dataset shows that the power-law exponent is approximately 1.3. in order to observe and quantitatively analyze the effect that the email-checking interval has on virus propagation, the email-clicking probability distribution (p i ) in our model is consistent with the one used by [49] , i.e. the security awareness of different users in the network follows a normal distribution, p i ∼ n (0.5, 0.3 2 ). figure 16 shows that following a random attack viruses quickly propagate in the enron network if the email-checking intervals follow a power-law distribution. the results are consistent with the observed trends in real computer networks [31] , i.e. viruses initially spread explosively, then enter a long latency period before becoming active again following user activity. 
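a small sketch of how power-law email-checking intervals with exponent α ≈ 1.3 can be sampled, side by side with the exponential (poisson-process) intervals of the original model; the lower cutoff tau_min and the absence of an upper cutoff are modelling assumptions, not taken from the paper:

import random

def power_law_interval(alpha: float = 1.3, tau_min: float = 1.0) -> float:
    # inverse-transform sampling of a waiting time tau with density ~ tau^(-alpha)
    # for tau >= tau_min (valid for alpha > 1)
    u = random.random()
    return tau_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))

def exponential_interval(mean: float = 40.0) -> float:
    # the poisson-process alternative used in the original interactive email model
    return random.expovariate(1.0 / mean)

# compare the two behaviours: the heavy tail produces rare but very long inactive gaps,
# which is what lets old viruses stay latent and break out again later
samples_pl = [power_law_interval() for _ in range(10000)]
samples_exp = [exponential_interval() for _ in range(10000)]
print(max(samples_pl), max(samples_exp))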
the explanation for this is that users frequently have a short period of focused activity followed by a long period of inactivity. thus, although old viruses may be killed by anti-virus software, they can still intermittently break out in a network. that is because some viruses are hidden by inactive users, and cannot be found by anti-virus software. when the inactive users become active, the virus will start to spread again. the effect of human dynamics on virus propagation in three synthetic networks is also analyzed by applying the targeted [9] , d-steps [17] and aoc-based strategy [24] . the numerical results are shown in table. 11 and fig. 17 . from the above experiments, the following conclusions can be made: 1. based on the enron email dataset and recent research on human dynamics, the emailchecking intervals in an interactive email model should be assigned based on a power-law distribution. 2. viruses can spread very quickly in a network if users' email-checking intervals follow a power-law distribution. in such a situation, viruses grow explosively at the initial stage and then grow slowly. the viruses remain in a latent state and await being activated by users. in this paper, a simulation model for studying the process of virus propagation has been described, and the efficiency of various existing immunization strategies has been compared. in particular, two new betweenness-based immunization strategies have been presented and validated in an interactive propagation model, which incorporates two human behaviors based on [49] in order to make the model more practical. this simulation-based work can be regarded as a contribution to the understanding of the inter-reactions between a network structure and local/global dynamics. the main results are concluded as follows: 1. some experiments are used to systematically compare different immunization strategies for restraining epidemic spreading, in synthetic scale-free networks including the community-based network and two real email networks. the simulation results have shown that the key factor that affects the efficiency of immunization strategies is apl, rather than the sum of the degrees of immunized nodes (sid). that is to say, immunization strategy should protect nodes with higher connectivity and transmission capability, rather than those with higher degrees. 2. some performance metrics are used to further evaluate the efficiency of different strategies, i.e. in terms of their cost and robustness. simulation results have shown that the d-steps immunization is a feasible strategy in the case of limited resources and the nodebetweenness immunization is the best if the global topological information is available. 3. the effects of power-law exponents and human dynamics on virus propagation are analyzed. more in-depth experiments have shown that viruses spread faster in a network with a large power-law exponent than that with a small one. especially, the results have explained why some old viruses can still propagate in networks up till now from the perspective of human dynamics. 
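a minimal sketch of how power-law distributed email-checking intervals with exponent α ≈ 1.3 could be drawn and contrasted with the memoryless (poisson) assumption; the truncation bounds, sample sizes and rates are illustrative assumptions rather than values from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def powerlaw_intervals(alpha=1.3, tau_min=1.0, tau_max=10_000.0, size=10_000):
    """draw waiting times with density ~ tau**(-alpha) on [tau_min, tau_max]."""
    u = rng.uniform(size=size)
    a = 1.0 - alpha
    return (tau_min**a + u * (tau_max**a - tau_min**a)) ** (1.0 / a)

def poisson_intervals(mean_interval, size=10_000):
    """exponential waiting times, i.e. the poisson-process counterpart."""
    return rng.exponential(mean_interval, size=size)

heavy = powerlaw_intervals()
memoryless = poisson_intervals(mean_interval=heavy.mean())
print("power-law: mean=%.1f, 99th pct=%.1f" % (heavy.mean(), np.percentile(heavy, 99)))
print("poisson  : mean=%.1f, 99th pct=%.1f" % (memoryless.mean(), np.percentile(memoryless, 99)))
```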
references:
1. the mathematical theory of infectious diseases and its applications
2. emergence of scaling in random networks
3. the origin of bursts and heavy tails in human dynamics
4. cluster ranking with an application to mining mailbox networks
5. 'small worlds' and the evolution of virulence: infection occurs locally and at a distance
6. on distinguishing between internet power law topology generators
7. power-law distribution in empirical data
8. efficient immunization strategies for computer networks and populations
9. halting viruses in scale-free networks
10. dynamics of information access on the web
11. a simple model for complex dynamical transitions in epidemics
12. distance-d covering problem in scale-free networks with degree correlation
13. entropy of dialogues creates coherent structure in email traffic
14. epidemic threshold in structured scale-free networks
15. on power-law relationships of the internet topology
16. improving immunization strategies
17. immunization of real complex communication networks
18. self-similar community structure in a network of human interactions
19. attack vulnerability of complex networks
20. targeted local immunization in scale-free peer-to-peer networks
21. the large scale organization of metabolic networks
22. probing human response times
23. periodic subgraph mining in dynamic networks. knowledge and information systems
24. autonomy-oriented search in dynamic community networks: a case study in decentralized network immunization
25. characterizing web usage regularities with information foraging agents
26. how viruses spread among computers and people
27. on universality in human correspondence activity
28. enhanced: simple rules with complex dynamics
29. network motifs: simple building blocks of complex networks
30. epidemics and percolation in small-world networks
31. code-red: a case study on the spread and victims of an internet worm
32. the structure of scientific collaboration networks
33. the spread of epidemic disease on networks
34. the structure and function of complex networks
35. email networks and the spread of computer viruses
36. partitioning large networks without breaking communities
37. epidemic spreading in scale-free networks
38. epidemic dynamics and endemic states in complex networks
39. immunization of complex networks
40. computer virus propagation models
41. the enron email dataset database schema and brief statistical report
42. exploring complex networks
43. modeling bursts and heavy tails in human dynamics
44. impact of non-poissonian activity patterns on spreading process
45. predicting the behavior of techno-social systems
46. a decentralized search engine for dynamic web communities
47. a twenty-first century science
48. an environment for controlled worm replication and analysis
49. modeling and simulation study of the propagation and defense of internet e-mail worms
chao gao is currently a phd student in the international wic institute, college of computer science and technology, beijing university of technology. he has been an exchange student in the department of computer science, hong kong baptist university. his main research interests include web intelligence (wi), autonomy-oriented computing (aoc), complex networks analysis, and network security. prof. liu is with the department of computer science at hong kong baptist university. he was a professor and the director of the school of computer science at the university of windsor, canada.
his current research interests include: autonomy-oriented computing (aoc), web intelligence (wi), and self-organizing systems and complex networks, with applications to: (i) characterizing working mechanisms that lead to emergent behavior in natural and artificial complex systems (e.g., phenomena in web science, and the dynamics of social networks and neural systems), and (ii) developing solutions to large-scale, distributed computational problems (e.g., distributed scalable scientific or social computing, and collective intelligence). prof. liu has contributed to the scientific literature in those areas, including over 250 journal and conference papers, and 5 authored research monographs, e.g., autonomy-oriented computing: from problem solving to complex systems modeling (kluwer academic/springer) and spatial reasoning and planning: geometry, mechanism, and motion (springer). prof. liu has served as the editor-in-chief of web intelligence and agent systems, an associate editor of ieee transactions on knowledge and data engineering, ieee transactions on systems, man, and cybernetics-part b, and computational intelligence, and a member of the editorial board of several other international journals. laboratory and is a professor in the department of systems and information engineering at maebashi institute of technology, japan. he is also an adjunct professor in the international wic institute. he has conducted research in the areas of knowledge discovery and data mining, rough sets and granular-soft computing, web intelligence (wi), intelligent agents, brain informatics, and knowledge information systems, with more than 250 journal and conference publications and 10 books. he is the editor-in-chief of web intelligence and agent systems and annual review of intelligent informatics, an associate editor of ieee transactions on knowledge and data engineering, data engineering, and knowledge and information systems, a member of the editorial board of transactions on rough sets. key: cord-048461-397hp1yt authors: coelho, flávio c; cruz, oswaldo g; codeço, cláudia t title: epigrass: a tool to study disease spread in complex networks date: 2008-02-26 journal: source code biol med doi: 10.1186/1751-0473-3-3 sha: doc_id: 48461 cord_uid: 397hp1yt background: the construction of complex spatial simulation models such as those used in network epidemiology, is a daunting task due to the large amount of data involved in their parameterization. such data, which frequently resides on large geo-referenced databases, has to be processed and assigned to the various components of the model. all this just to construct the model, then it still has to be simulated and analyzed under different epidemiological scenarios. this workflow can only be achieved efficiently by computational tools that can automate most, if not all, these time-consuming tasks. in this paper, we present a simulation software, epigrass, aimed to help designing and simulating network-epidemic models with any kind of node behavior. results: a network epidemiological model representing the spread of a directly transmitted disease through a bus-transportation network connecting mid-size cities in brazil. results show that the topological context of the starting point of the epidemic is of great importance from both control and preventive perspectives. conclusion: epigrass is shown to facilitate greatly the construction, simulation and analysis of complex network models. 
the output of model results in standard gis file formats facilitate the post-processing and analysis of results by means of sophisticated gis software. epidemic models describe the spread of infectious diseases in populations. more and more, these models are being used for predicting, understanding and developing control strategies. to be used in specific contexts, modeling approaches have shifted from "strategic models" (where a caricature of real processes is modeled in order to emphasize first principles) to "tactical models" (detailed representations of real situations). tactical models are useful for cost-benefit and scenario analyses. good examples are the foot-and-mouth epidemic models for uk, triggered by the need of a response to the 2001 epidemic [1, 2] and the simulation of pandemic flu in differ-ent scenarios helping authorities to choose among alternative intervention strategies [3, 4] . in realistic epidemic models, a key issue to consider is the representation of the contact process through which a disease is spread, and network models have arisen as good candidates [5] . this has led to the development of "network epidemic models". network is a flexible concept that can be used to describe, for example, a collection of individuals linked by sexual partnerships [6] , a collection of families linked by sharing workplaces/schools [7] , a collection of cities linked by air routes [8] . any of these scales may be relevant to the study and control of disease spread [9] . networks are made of nodes and their connections. one may classify network epidemic models according to node behavior. one example would be a classification based on the states assumed by the nodes: networks with discretestate nodes have nodes characterized by a discrete variable representing its epidemiological status (for example, susceptible, infected, recovered). the state of a node changes in response to the state of neighbor nodes, as defined by the network topology and a set of transmission rules. networks with continuous-state nodes, on the other hand, have node' state described by a quantitative variable (number of susceptibles, density of infected individuals, for example), modelled as a function of the history of the node and its neighbors. the importance of the concept of neighborhood on any kind of network epidemic model stems from its large overlap with the concept of transmission. in network epidemic models, transmission either defines or is defined/constrained by the neighborhood structure. in the latter case, a neighborhood structure is given a priori which will influence transmissibility between nodes. the construction of complex simulation models such as those used in network epidemic models, is a daunting task due to the large amount of data involved in their parameterization. such data frequently resides on large geo-referenced databases. this data has to be processed and assigned to the various components of the model. all this just to construct the model, then it still has to be simulated, analyzed under different epidemiological scenarios. this workflow can only be achieved efficiently by computational tools that can automate most if not all of these time-consuming tasks. in this paper, we present a simulation software, epigrass, aimed to help designing and simulating network-epidemic models with any kind of node behavior. without such a tool, implementing network epidemic models is not a simple task, requiring a reasonably good knowledge of programming. 
we expect that this software will stimulate the use and development of network models for epidemiological purposes. the paper is organized as follows: first we describe the software and how it is organized, with a brief overview of its functionality. then we demonstrate its use with an example. the example simulates the spread of a directly transmitted infectious disease in brazil through its transportation network. the velocity of spread of new diseases in a network of susceptible populations depends on their spatial distribution, size, susceptibility and patterns of contact. on a spatial scale, climate and environment may also impact the dynamics of geographical spread, as they introduce temporal and spatial heterogeneity. understanding and predicting the direction and velocity of an invasion wave is key for emergency preparedness. epigrass is a platform for network epidemiological simulation and analysis. it enables researchers to perform comprehensive spatio-temporal simulations incorporating epidemiological data and models for disease transmission and control in order to create complex scenario analyses. epigrass is designed towards facilitating the construction and simulation of large-scale metapopulational models. each component population of such a metapopulational model is assumed to be connected through a contact network which determines migration flows between populations. this connectivity model can be easily adapted to represent any type of adjacency structure. epigrass is entirely written in the python language, which contributes greatly to the flexibility of the whole system due to the dynamical nature of the language. the geo-referenced networks over which epidemiological processes take place can be very straightforwardly represented in an object-oriented framework. consequently, the nodes and edges of the geographical networks are objects with their own attributes and methods (figure 1). once the archetypal node and edge objects are defined with appropriate attributes and methods, a code representation of the real system can be constructed, where nodes (representing people or localities) and contact routes are instances of node and edge objects, respectively. the whole network is also an object with its own set of attributes and methods. in fact, epigrass also allows for multiple edge sets in order to represent multiple contact networks in a single model. figure 1. architecture of an epigrass simulation model: a simulation object contains the whole model and all other objects representing the graph, sites and edges; site objects contain model objects, which can be one of the built-in epidemiological models or a custom model written by the user. these features lead to a compact and hierarchical computational model consisting of a network object containing a variable number of node and edge objects. it also does not pose limitations on encapsulation, potentially allowing for networks within networks, if desirable. this representation can also be easily distributed over a computational grid or cluster, if the dependency structure of the whole model does not prevent it (this feature is currently being implemented and will be available in a future release of epigrass). for the end-user, this hierarchical, object-oriented representation is not an obstacle since it reflects the natural structure of the real system.
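the hierarchical, object-oriented representation described above might be sketched roughly as follows; the class and attribute names are illustrative guesses and do not reproduce the actual epigrass api.

```python
class Model:
    """intra-node dynamics, e.g. one of the built-in sir-family models."""
    def step(self, site, t):
        raise NotImplementedError

class Site:
    """a node of the geo-referenced network (a city, in the example)."""
    def __init__(self, name, population, model):
        self.name = name
        self.population = population
        self.model = model          # a Model instance run inside this site
        self.neighbours = []        # filled in by Edge objects

class Edge:
    """a contact/transportation route linking two sites."""
    def __init__(self, source, target, flow):
        self.source, self.target, self.flow = source, target, flow
        source.neighbours.append((target, flow))
        target.neighbours.append((source, flow))

class Simulation:
    """the whole network: owns every site and edge and drives the clock."""
    def __init__(self, sites, edges):
        self.sites, self.edges = sites, edges
    def run(self, steps):
        for t in range(steps):
            for site in self.sites:
                site.model.step(site, t)

class ConstantModel(Model):
    """trivial placeholder dynamics: does nothing at each step."""
    def step(self, site, t):
        pass

a = Site("city a", 500_000, ConstantModel())
b = Site("city b", 200_000, ConstantModel())
Simulation([a, b], [Edge(a, b, flow=120)]).run(steps=10)
```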
even after the model is converted into a code object, all of its component objects remain accessible to one another, facilitating the exchange of information between all levels of the model, a feature the user can easily include in his/her custom models. nodes and edges are dynamical objects in the sense that they can be modified at runtime, altering their behavior in response to user-defined events. in epigrass it is very easy to simulate any dynamical system embedded in a network; however, it was designed with epidemiological models in mind. this goal led to the inclusion of a collection of built-in epidemic models which can be readily used for the intra-node dynamics (the sir model family). epigrass users are not limited to basing their simulations on the built-in models: user-defined models can be developed in just a few lines of python code. all simulations in epigrass are done in discrete time. however, custom models may implement finer dynamics within each time step, by implementing ode models at the nodes, for instance. the epigrass system is driven by a graphical user interface (gui), which handles several input files required for model definition and manages the simulation and output generation (figure 2). at the core of the system lies the simulator. it parses the model specification files, contained in a text file (.epg file), and builds the network from site and edge description files (comma separated values text files, csv). the simulator then builds a code representation of the entire model, simulates it, and stores the results in the database or in a couple of csv files. this output will contain the full time series of the variables in the model. additionally, a map layer (in shapefile and kml format) is also generated with summary statistics for the model (figure 3). the results of an epigrass simulation can be visualized in different ways. a map with an animation of the resulting time series is available directly through the gui (figure 4). other types of static visualizations can be generated through gis software from the shapefiles generated. the kml file can also be viewed in google earth™ or google maps™ (figure 5). epigrass also includes a report generator module which is controlled through a parameter in the ".epg" file. epigrass is capable of generating pdf reports with summary statistics from the simulation. this module requires a latex installation to work. reports are most useful for general verification of expected model behavior and network structure. however, the latex source files generated by the module may serve as templates that the user can edit to generate a more complete document. figure 2. epigrass graphical user interface. figure 3. workflow for a typical epigrass simulation: this diagram shows all inputs and outputs typical of an epigrass simulation session. building a model in epigrass is very simple, especially if the user chooses to use one of the built-in models. epigrass includes 20 different epidemic models ready to be used (see the manual for a description of the built-in models). to run a network epidemic model in epigrass, the user is required to provide three separate text files (optionally, also a shapefile with the map layer): 1. node-specification file: this file can be edited on a spreadsheet and saved as a csv file. each row is a node and the columns are variables describing the node. 2. edge-specification file: this is also a spreadsheet-like file with one edge per row.
its columns contain the flow variables. 3. model-specification file: also referred to as the ".epg" file. this file specifies the epidemiological model to be run at the nodes, its parameters, the flow model for the edges, and general parameters of the simulation. the ".epg" file is normally modified from templates included with epigrass. the nodes and edges files, on the other hand, have to be built from scratch for every new network. details of how to construct these files, as well as examples, can be found in the documentation accompanying the software, which is available at the project's website [10]. in the example application, the spread of a respiratory disease through a network of cities connected by bus transportation routes is analyzed. the epidemiological scenario is one of the invasion of a new influenza-like virus. one may want to simulate the spread of this disease through the country by the transportation network to evaluate alternative intervention strategies (e.g. different vaccination strategies). in this problem, a network can be defined as a set of nodes and links where nodes represent cities and links represent transportation routes. some examples of this kind of model are available in the literature [8, 11]. one possible objective of this model is to understand how the spread of such a disease may be affected by the point of entry of the disease in the network. to that end, we may look at variables such as the speed of the epidemic, the number of cases after a fixed amount of time, the distribution of cases in time and the path taken by the spread. the example network was built from the 76 largest cities of brazil (>= 100 k inhabitants). the bus routes between those cities formed the connections between the nodes of the network. the number of edges in the network, derived from the bus routes, is 850. figure 4. epigrass animation output: sites are color coded (from red to blue) according to infection times; bright red is the seed site (in the ne). figure 5. epigrass output visualized on google earth. these bus routes are registered with the national agency of terrestrial transportation (antt), which provided the data used to parameterize the edges of the network. the epidemiological model used consisted of a metapopulation system with a discrete-time seir model (eq. 1). for each city, s_t is the number of susceptibles in the city at time t, e_t is the number of infected but not yet infectious individuals, i_t is the number of infectious individuals resident in the locality, n is the population residing in the locality (assumed constant throughout the simulation), n_t is the number of individuals visiting the locality, and θ_t is the number of visitors who are infectious. the parameters used were taken from lipsitch et al. (2003) [12] to represent a disease like sars, with an estimated basic reproduction number (r_0) of 2.2 to 3.6 (table 1). to simulate the spread of infection between cities, we used the concept of a "forest fire" model [13]. an infected individual, traveling to another city, acts as a spark that may trigger an epidemic in the new locality. this approach is based on the assumption that individuals commute between localities and contribute temporarily to the number of infected in the new locality, but not to its demography. implications of this approach are discussed in grenfell et al. (2001) [13].
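since the exact form of eq. (1) is not reproduced in this text, the sketch below only illustrates the general shape of a discrete-time seir update with visiting infectious individuals θ_t added to the force of infection; the parameter names and rates (beta, incubation and recovery rates) are assumptions and may differ from the published equations.

```python
def seir_step(s, e, i, n, n_t, theta_t, beta, sigma, gamma):
    """one discrete-time step of an assumed seir update for a single city.

    s, e, i : current susceptible, exposed and infectious counts
    n       : resident population (constant); n_t / theta_t : visitors / infectious visitors
    beta    : transmission rate, sigma : incubation rate, gamma : recovery rate
    """
    force = beta * s * (i + theta_t) / (n + n_t)   # expected new exposures this step
    new_e = min(force, s)
    new_i = sigma * e                              # exposed becoming infectious
    new_r = gamma * i                              # infectious recovering
    s_next = s - new_e
    e_next = e + new_e - new_i
    i_next = i + new_i - new_r
    return s_next, e_next, i_next

# example: a city of 500,000 residents with 10 local infectious and 5 infectious visitors
print(seir_step(s=499_990, e=0, i=10, n=500_000, n_t=1_000, theta_t=5,
                beta=0.6, sigma=1/4, gamma=1/5))
```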
the number of individuals arriving in a city (n_t) is based on the annual total number of passengers arriving through all bus routes leading to that city, as provided by the antt (brazilian national agency for terrestrial transportation). the annual number of passengers is used to derive an average daily number of passengers simply by dividing it by 365. stochasticity is introduced in the model at two points: the number of new cases is drawn from a poisson distribution whose intensity is given by the force of infection of the seir model, and the number of infected individuals visiting i is modelled as a binomial process with success probability i_{k,t-δ}/n_k, summed over all k neighbors, where n is the total number of passengers arriving from a given neighboring city, i_{k,t} and n_k are the current number of infectious individuals and the total population size of city k, respectively, and δ is the delay associated with the duration of each bus trip. the delay δ was calculated as the number of days (rounded down) that a bus, traveling at an average speed of 60 km/h, would take to complete a given trip. the lengths in kilometers of all bus routes were also obtained from the antt. vaccination campaigns in specific (or all) cities, with individual coverages for each campaign in each city, can be easily set up in epigrass. we use this feature to explore vaccination scenarios in this model (figures 6 and 7). figure 6. cost in vaccines applied vs. benefit in cases avoided, for a simulated epidemic starting at the highest-degree city (são paulo). figure 7. cost in vaccines applied vs. benefit in cases avoided, for a simulated epidemic starting at a relatively low-degree city (salvador). the files with this model's definition (the sites, edges and ".epg" files) are available as part of the additional files 1, 2 and 3 for this article. to determine the importance of the point of entry in the outcome of the epidemic, the model was run 500 times, randomizing the point of entry of the virus. the seeding site was chosen with a probability proportional to the log10 of its population size. these replicates were run using epigrass' built-in support for repeated runs with the option of randomizing the seeding site. for every simulation, statistics about each site, such as the time it got infected, and time series of incidence were saved. the time required for the epidemic to infect 50% of the cities was chosen as a global index of network susceptibility to invasion. to compare the relative exposure of cities to disease invasion, we also calculated the inverse of the time elapsed from the beginning of the epidemic until the city registered its first indigenous case, as a local measure of exposure. except for population size, all other epidemiological parameters were the same for all cities, that is, disease transmissibility and recovery rate. some positional features of each node were also derived: centrality, which is a measure derived from the average distance of a given site to every other site in the network; betweenness, which is the number of times a node figures in the shortest path between any other pair of nodes; and degree, which is the number of edges connected to a node. in order to analyze the path of the epidemic spread, we also recorded which cities provided the infectious cases which were responsible for the infection of each other city. if more than one source of infection exists, epigrass selects the city which contributed the largest number of infectious individuals at that time step as the most likely infector.
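two of the stochastic ingredients described in this passage, the delayed binomial draw of infectious visitors and the choice of the most likely infector, might be sketched as below; the poisson intensity is not reproduced, and all numbers in the toy example are assumptions rather than values from the study.

```python
import numpy as np

rng = np.random.default_rng(42)

def infectious_visitors(neighbours, t):
    """binomial draw of infectious passengers arriving from each neighbouring city."""
    theta, contributions = 0, {}
    for city, (daily_passengers, i_history, population, delay) in neighbours.items():
        i_lagged = i_history[max(t - delay, 0)]          # infectious 'delay' days ago
        arrived = rng.binomial(daily_passengers, min(i_lagged / population, 1.0))
        contributions[city] = arrived
        theta += arrived
    return theta, contributions

def most_likely_infector(contributions):
    """the neighbour that contributed the largest number of infectious visitors."""
    return max(contributions, key=contributions.get) if contributions else None

# toy example: two neighbours; delay = route length // 1440 km, i.e. days at 60 km/h
neighbours = {
    "rio de janeiro": (300, [0, 8, 20, 35], 6_000_000, 0),   # ~430 km  -> same day
    "salvador":       (150, [5, 15, 40, 60], 2_500_000, 1),  # ~1960 km -> 1 day
}
theta, contrib = infectious_visitors(neighbours, t=3)
print("infectious visitors:", theta, "| most likely infector:", most_likely_infector(contrib))
```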
at the end of the simulation, epigrass generates a file with the dispersion tree in graphml format, which can be read by a variety of graph plotting programs to generate the graphic seen in figure 8. the computational cost of running a single time step in an epigrass model is mainly determined by the cost of calculating the epidemiological models on each site (node). therefore, the time required to run models based on larger networks should scale linearly with the size of the network (the order of the graph), for simulations of the same duration. the model presented here took 2.6 seconds for a 100-day run on a 2.1 ghz cpu. a somewhat larger model with 343 sites and 8735 edges took 28 seconds for a 100-day simulation. very large networks may be limited by the amount of ram available. the authors are working on adapting epigrass to distribute processing among multiple cpus (in smp systems), or multiple computers in a cluster system. the memory demands can also be addressed by keeping the simulation objects in an object-oriented database during the simulation. steps in this direction are also being taken by the development team. the model presented here mainly served the purpose of illustrating the capabilities of epigrass for simulating and analyzing reasonably complex epidemic scenarios. it should not be taken as a careful and complete analysis of a real epidemic. despite that, some features of the simulated epidemic are worth discussing. for example, the spread speed of the epidemic, measured as the time taken to infect 50% of the cities, was found to be influenced by the centrality and degree of the entry node (figures 9 and 10). figure 9. effect of degree (a) and betweenness (b) of the entry node on the speed of the epidemic. figure 10. effect of betweenness of the entry node on the speed of the epidemic. the dispersion tree corresponding to the epidemic is greatly influenced by the degree of the point of entry of the disease in the network. figure 8 shows the tree for the dispersion from the city of salvador. figure 8. spread of the epidemic starting at the city of salvador, a city with relatively small degree (that is, a small number of neighbors); the number next to the boxes indicates the day when each city developed its first indigenous case. vaccination strategies must take into consideration network topology. figures 6 and 7 show cost-benefit plots for the three vaccination strategies investigated: uniform vaccination, top-3 degree sites only, and top-10 degree sites only. vaccination of higher-order sites offers cost/benefit advantages only in scenarios where the disease enters the network through one of these sites. epigrass greatly facilitates the simulation and analysis of complex network models. the output of model results in standard gis file formats facilitates the post-processing and analysis of results by means of sophisticated gis software. the non-trivial task of specifying the network over which the model will be run is left to the user.
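a hedged sketch of how the positional features (degree, betweenness, centrality) and the graphml dispersion tree mentioned above could be computed with networkx; the toy route data, the output file name and the use of closeness centrality as a stand-in for "centrality" are assumptions.

```python
import networkx as nx

# the city network: nodes are cities, edges are bus routes (toy data, route lengths in km)
g = nx.Graph()
g.add_weighted_edges_from([("sao paulo", "rio de janeiro", 430),
                           ("sao paulo", "salvador", 1960),
                           ("rio de janeiro", "salvador", 1640)],
                          weight="km")

degree = dict(g.degree())
betweenness = nx.betweenness_centrality(g)
closeness = nx.closeness_centrality(g)     # one possible reading of "centrality"

for city in g:
    print(city, degree[city], round(betweenness[city], 3), round(closeness[city], 3))

# dispersion tree: who most likely infected whom, exported for external plotting tools
tree = nx.DiGraph()
tree.add_edges_from([("salvador", "rio de janeiro"), ("rio de janeiro", "sao paulo")])
nx.write_graphml(tree, "dispersion_tree.graphml")
```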
but epigrass allows this structure to be provided as a simple list of sites and edges in text files, which can easily be constructed by the user using a spreadsheet, with no need for special software tools. besides invasion, network epidemiological models can also be used to understand patterns of geographical spread of endemic diseases [14][15][16][17]. many infectious diseases can only be maintained in an endemic state in cities with a population size above a threshold, or under appropriate environmental conditions (climate, availability of a reservoir, vectors, etc). the variables and the magnitudes associated with the endemicity threshold depend on the natural history of the disease [18]. these magnitudes may vary from place to place as they depend on the contact structure of the individuals. predicting which cities are sources for the endemicity and understanding the path of recurrent traveling waves may help us to design optimal surveillance and control strategies.
references:
1. modelling vaccination strategies against foot-and-mouth disease
2. optimal reactive vaccination strategies for a foot-and-mouth outbreak in the uk
3. strategy for distribution of influenza vaccine to high-risk groups and children
4. containing pandemic influenza with antiviral agents
5. space and contact networks: capturing the locality of disease transmission
6. interval estimates for epidemic thresholds in two-sex network models
7. applying network theory to epidemics: control measures for mycoplasma pneumoniae outbreaks
8. assessing the impact of airline travel on the geographic spread of pandemic influenza
9. modeling control strategies of respiratory pathogens
10. epigrass website
11. containing pandemic influenza at the source
12. transmission dynamics and control of severe acute respiratory syndrome
13. travelling waves and spatial hierarchies in measles epidemics
14. travelling waves in the occurrence of dengue haemorrhagic fever in thailand
15. modelling disease outbreaks in realistic urban social networks
16. on the dynamics of flying insects populations controlled by large scale information
17. large-scale spatial-transmission models of infectious disease
18. disease extinction and community size: modeling the persistence of measles
the authors would like to thank the brazilian research council (cnpq) for financial support. fcc contributed with the software development, model definition and analysis, as well as general manuscript conception and writing. ctc contributed with model definition and implementation, as well as with writing the manuscript. ogc contributed with data analysis and writing the manuscript. all authors have read and approved the final version of the manuscript. key: cord-027286-mckqp89v authors: ksieniewicz, paweł; goścień, róża; klinkowski, mirosław; walkowiak, krzysztof title: pattern recognition model to aid the optimization of dynamic spectrally-spatially flexible optical networks date: 2020-05-23 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50423-6_16 sha: doc_id: 27286 cord_uid: mckqp89v the following paper considers pattern recognition-aided optimization of a complex and relevant problem related to optical networks. for that problem, we propose a four-step dedicated optimization approach that makes use, among others, of a regression method. the main focus of this study is the construction of an efficient regression model and its application to the initial optimization problem.
we therefore perform extensive experiments using realistic network assumptions and then draw conclusions regarding the efficient configuration of the approach. according to the results, the approach performs best using a multi-layer perceptron regressor, whose prediction ability was the highest among all tested methods. according to cisco forecasts, global consumer traffic in the internet will grow on average with a compound annual growth rate (cagr) of 26% in the years 2017-2022 [3]. the increase in network traffic is a result of two main trends. firstly, the number of devices connected to the internet is growing due to the increasing popularity of new services, including the internet of things (iot). the second important trend influencing traffic in the internet is the popularity of bandwidth-demanding services such as video streaming (e.g., netflix) and cloud computing. the internet consists of many single networks connected together; however, the backbone connecting these various networks is formed by optical networks based on fiber connections. currently, the most popular technology in optical networks is wdm (wavelength division multiplexing), which is not expected to be efficient enough to support the increasing traffic in the near future. in the last few years, a new concept for optical networks has been deployed, i.e., the architecture of elastic optical networks (eons). however, in the perspective of the next decade, some new approaches must be developed to overcome the predicted "capacity crunch" of the internet. one of the most promising proposals is the spectrally-spatially flexible optical network (ss-fon), which combines space division multiplexing (sdm) technology [14], enabling parallel transmission of co-propagating spatial modes in suitably designed optical fibers such as multi-core fibers (mcfs) [1], with flexible-grid eons [4], which enable better utilization of the optical spectrum and distance-adaptive transmissions [15]. in mcf-based ss-fons, a challenging issue is the inter-core crosstalk (xt) effect that impairs the quality of transmission (qot) of optical signals and has a negative impact on overall network performance. in more detail, mcfs are susceptible to signal degradation as a result of the xt that happens between adjacent cores whenever optical signals are transmitted in an overlapping spectrum segment. addressing the xt constraints significantly complicates the optimization of ss-fons [8]. besides numerous advantages, new network technologies also bring challenging optimization problems, which require efficient solution methods. since the technologies and related problems are new, there are no benchmark solution methods to be directly applied, and hence many studies propose dedicated optimization approaches. however, due to the problems' high complexity, their performance still requires considerable effort [6, 8]. we therefore observe a trend to use artificial intelligence techniques (with a high emphasis on pattern recognition tools) in the field of optimization of communication networks. according to the literature surveys in this field [2, 10, 11, 13], researchers mostly focus on discrete labelled supervised and unsupervised learning problems, such as traffic classification. regression methods, which are in the scope of this paper, are mostly applied for traffic prediction and estimation of quality of transmission (qot) parameters such as delay or bit error rate. this paper extends our study initiated in [7].
we make use of pattern recognition models to aid optimization of dynamic mcf-based ss-fons in order to improve performance of the network in terms of minimizing bandwidth blocking probability (bbp), or in other words to maximize the amount of traffic that can be allocated in the network. in particular, an important topic in the considered optimization problem is selection of a modulation format (mf) for a particular demand, due to the fact that each mf provides a different tradeoff between required spectrum width and transmission distance. to solve that problem, we define applicable distances for each mf (i.e., minimum and maximum length of a routing path that is supported by each mf). to find values of these distances, which provide best allocation results, we construct a regression model and then combine it with monte carlo search. it is worth noting that this work does not address dynamic problems in the context of changing the concept over time, as is often the case with processing large sets, and assumes static distribution of the concept [9] . the main novelty and contribution of the following work is an in-depth analysis of the basic regression methods stabilized by the structure of the estimator ensemble [16] and assessment of their usefulness in the task of predicting the objective function for optimization purposes. in one of the previous works [7] , we confirmed the effectiveness of this type of solution using a regression algorithm of the nearest weighted neighbors, focusing, however, much more on the network aspect of the problem being analyzed. in the present work, the main emphasis is on the construction of the prediction model. its main purpose is: -a proposal to interpret the optimization problem in the context of pattern recognition tasks. the rest of the paper is organized as follows. in sect. 2, we introduce studied network optimization problem. in sect. 3, we discuss out optimization approach for that problem. next, in sect. 4 we evaluate efficiency of the proposed approach. eventually, sect. 5 concludes the work. the optimization problem is known in the literature as dynamic routing, space and spectrum allocation (rssa) in ss-fons [5] . we are given with an ss-fon topology realized using mcfs. the topology consists of nodes and physical link. each physical link comprises of a number of spatial cores. the spectrum width available on each core is divided into arrow and same-sized segments called slices. the network is in its operational state -we observe it in a particular time perspective given by a number of iterations. in each iteration (i.e., a time point), a set of demands arrives. each demand is given by a source node, destination node, duration (measured in the number of iterations) and bitrate (in gbps). to realize a demand, it is required to assign it with a light-path and reserve its resources for the time of the demand duration. when a demand expires, its resources are released. a light-path consists of a routing path (a set of links connecting demand source and destination nodes) and a channel (a set of adjacent slices selected on one core) allocated on the path links. the channel width (number of slices) required for a particular demand on a particular routing path depends on the demand bitrate, path length (in kilometres) and selected modulation format. each incoming demand has to be realized unless there is not enough free resources when it arrives. in such a case, a demand is rejected. 
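the bandwidth blocking probability objective defined above is straightforward to express in code; the demand record layout below is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass
class Demand:
    source: int
    target: int
    bitrate_gbps: float
    duration: int          # measured in iterations
    rejected: bool = False

def bandwidth_blocking_probability(demands):
    """summed bitrate of rejected demands divided by summed bitrate of all offered demands."""
    offered = sum(d.bitrate_gbps for d in demands)
    rejected = sum(d.bitrate_gbps for d in demands if d.rejected)
    return rejected / offered if offered else 0.0

demands = [Demand(1, 5, 400, 10), Demand(2, 7, 950, 4, rejected=True)]
print(f"bbp = {bandwidth_blocking_probability(demands):.3f}")
```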
please note that the light-paths selected in the i-th iteration affect the network state and the allocation possibilities in the next iterations. the objective function is defined here as the bandwidth blocking probability (bbp), calculated as the summed bitrate of all rejected demands divided by the summed bitrate of all offered demands. since we aim to support as much traffic as possible, the objective criterion should be minimized [5, 8]. the light-paths' allocation process has to satisfy three basic rssa constraints. first, each channel has to consist of adjacent slices. second, the same channel (i.e., the same slices and the same core) has to be allocated on each link included in a light-path. third, at each time point each slice on a particular physical link and a particular core can be used by at most one demand [8]. there are four modulation formats available for transmissions: 8-qam, 16-qam, qpsk and bpsk. each format is described by its spectral efficiency, which determines the number of slices required to realize a particular bitrate using that modulation. however, each modulation format is also characterized by the maximum transmission distance (mtd) which provides an acceptable value of optical signal to noise ratio (osnr) at the receiver side. more spectrally-efficient formats consume less spectrum, however, at the cost of shorter mtds. moreover, more spectrally-efficient formats are also vulnerable to xt effects, which can additionally degrade qot and lead to demands' rejection [7, 8]. therefore, the selection of the modulation format for each demand is a compromise between spectrum efficiency and qot. to answer that problem, we use the procedure introduced in [7] to select a modulation format for a particular demand and routing path. let m = 1, 2, 3, 4 denote the modulation formats ordered in increasing mtds (and, at the same time, in decreasing spectral efficiency). it means that m = 1 denotes 8-qam and m = 4 denotes bpsk. let mtd = [mtd_1, mtd_2, mtd_3, mtd_4] be the vector of mtds for modulations 8-qam, 16-qam, qpsk and bpsk, respectively. moreover, let atd = [atd_1, atd_2, atd_3, atd_4] (where atd_i <= mtd_i, i = 1, 2, 3, 4) be the vector of applicable transmission distances. for a particular demand and a routing path, we select the most spectrally-efficient modulation format i for which atd_i is greater than or equal to the selected path length and the xt effect is at an acceptable level. for each candidate modulation format, we assess the xt level based on the availability of adjacent resources (i.e., slices and cores) using the procedure proposed in [7]. it is important to note that we do not indicate atd_4 (for bpsk) since we assume that this modulation is able to support transmission on all candidate routing paths regardless of their length. please also note that when the xt level is too high for all modulation formats, the demand is rejected regardless of the light-paths' availability.
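the modulation-selection rule described above can be sketched as follows; the crosstalk check is reduced to a boolean placeholder because the actual xt assessment of [7] is not reproduced here, and the distances in the example are made up.

```python
MODULATIONS = ["8-qam", "16-qam", "qpsk", "bpsk"]   # m = 1..4, as ordered in the text

def select_modulation(path_length_km, atd, xt_acceptable):
    """pick the most spectrally-efficient format m whose atd covers the path.

    atd: [atd_1, atd_2, atd_3] applicable distances for m = 1..3 (bpsk is unbounded).
    xt_acceptable(m): placeholder for the crosstalk assessment of the original procedure.
    """
    for m, limit in enumerate(atd, start=1):
        if path_length_km <= limit and xt_acceptable(m):
            return MODULATIONS[m - 1]
    if xt_acceptable(4):
        return "bpsk"          # assumed to support any candidate path length
    return None                # xt too high for every format -> demand rejected

print(select_modulation(700, atd=[400, 900, 1800], xt_acceptable=lambda m: m >= 2))
```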
we therefore make use of regression methods and propose a scheme to find atd * depicted in fig. 1 . a representative set of 1000 different atd vectors is generated. then, for each of them we simulate allocation of demands in ss-fon (i.e., we solve dynamic rssa). for the purpose of demands allocation (i.e., selection of light-paths), we use a dedicated algorithm proposed in [7] . for each considered atd vector we save obtained bbp. based on that data, we construct a regression model, which predicts bbp based on an atd vector. having that model, we use monte carlo method to find atd * vector, which is recommended for further experiments. to solve an rssa instance for a particular atd vector, we use heuristic algorithm proposed in [7] . we work under the assumption that there are 30 candidate routing paths for each traffic demand (generated using dijkstra algorithm). since the paths are generated in advance and their lengths are known, we can use an atd vector and preselect for these paths modulation formats based on the procedure discussed in sect. 2. therefore, rssa is reduced to the selection of one of the candidate routing paths and a communication channel with respect to the resource availability and assessed xt levels. from the perspective of pattern recognition methods, the abstraction of the problem is not the key element of processing. the main focus here is the representation available to construct a proper decision model. for the purposes of considerations, we assume that both input parameters and the objective function take only quantitative and not qualitative values, so we may use probabilistic pattern recognition models to process them. if we interpret the optimization task as searching for the extreme function of many input parameters, each simulation performed for their combination may also be described as a label for the training set of supervised learning model. in this case, the set of parameters considered in a single simulation becomes a vector of object features (x n ), and the value of the objective function acquired around it may be interpreted as a continuous object label (y n ). repeated simulation for randomly generated parameters allows to generate a data set (x) supplemented with a label vector (y). a supervised machine learning algorithm can therefore gain, based on such a set, a generalization abilities that allows for precise estimation of the simulation result based on its earlier runs on the random input values. a typical pattern recognition experiment is based on the appropriate division of the dataset into training and testing sets, in a way that guarantees their separability (most often using cross-validation), avoiding the problem of data peeking and a sufficient number of repetitions of the validation process to allow proper statistical testing of mutual model dependencies hypotheses. for the needs of the proposal contained in this paper, the usual 5-fold cross validation was adopted, which calculates the value of the r 2 metric for each loop of the experiment. having constructed regression model, we are able to predict bbp value for a sample atd vector. please note that the time required for a single prediction is significantly shorter that the time required to simulate a dynamic rssa. the last step of our optimization procedure is to find atd * -vector providing lowest estimated bbp values. to this end, we use monte carlo method with a number of guesses provided by the user. 
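a compact, hedged sketch of the whole loop: simulated (atd, bbp) pairs become a supervised dataset, a regressor (an mlp is used here purely as an example) is validated with 5-fold cross-validation on the r^2 metric, and a monte carlo search over the fitted surrogate proposes atd*; the simulator is replaced by a stub and the mtd bounds are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
MTD = np.array([400.0, 900.0, 1800.0])        # illustrative upper bounds for atd_1..atd_3

def simulate_bbp(atd):
    """stub standing in for the dynamic rssa simulator of [7]."""
    return float(np.abs(atd - 0.7 * MTD).sum() / MTD.sum())

def random_atd(size):
    """random, increasing atd vectors respecting atd_i <= mtd_i."""
    return np.sort(rng.uniform(0.0, 1.0, (size, 3)), axis=1) * MTD

# steps 1-2: sample atd vectors and label them with simulated bbp
X = random_atd(1000)
y = np.array([simulate_bbp(x) for x in X])

# step 3: regression model validated by 5-fold cross-validation on r^2
model = MLPRegressor(hidden_layer_sizes=(100,), max_iter=2000, random_state=0)
print("cv r2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())

# step 4: monte carlo search over the cheap surrogate instead of the simulator
model.fit(X, y)
candidates = random_atd(1000)
best = candidates[np.argmin(model.predict(candidates))]
print("recommended atd*:", np.round(best, 1))
```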
the rssa problem was solved for two network topologies: dt12 (12 nodes, 36 links) and euro28 (28 nodes, 82 links). they model the deutsche telekom (german national) network and a european network, respectively. each network physical link comprises 7 cores, wherein each of the cores offers 320 frequency slices of 12.5 ghz width. we use the same network physical assumptions and xt levels and assessments as in [7]. traffic demands have randomly generated end nodes and bitrates uniformly distributed between 50 gbps and 1 tbps, with a granularity of 50 gbps. their arrivals follow a poisson process with an average arrival rate of λ demands per time unit. the demand duration is generated according to a negative exponential distribution with an average of 1/μ. the offered traffic load is λ/μ normalized traffic units (ntus). for each testing scenario, we simulate the arrival of 10^6 demands. four modulations are available (8-qam, 16-qam, qpsk, bpsk), wherein we use the same modulation parameters as in [7]. for each topology we have generated 9 different datasets, each consisting of 1000 samples of an atd vector and the corresponding bbp. the datasets differ in the xt coefficient (μ = 1·10^-9, indicated as "xt1", and μ = 2·10^-9, indicated as "xt2"; for more details we refer to [7]) and in the network link scaling factor (the multiplier used to scale the lengths of links, in order to evaluate whether different lengths of routing paths influence the performance of the proposed approach). for dt12 we use the following scaling factors: 0.4, 0.6, 0.8, ..., 2.0. for euro28 the values are as follows: 0.104, 0.156, 0.208, 0.260, 0.312, 0.364, 0.416, 0.468, 0.520. we indicate them as "sx.xxx", where x.xxx refers to the scaling factor value. using these datasets we can evaluate whether the xt coefficient (i.e., the level of vulnerability to xt effects) and/or the average link length influence the performance of the optimization approach. the experimental environment for the construction of predictive models, including the implementation of the proposed processing method, was implemented in python, following the guidelines of the state-of-the-art programming interface of the scikit-learn library [12]. statistical dependency assessment metrics for paired tests were calculated according to the wilcoxon test, using the implementation contained in the scipy module. each of the individual experiments was evaluated by the r^2 metric, a typical quality assessment measure for regression problems. the full source code, supplemented with the employed datasets, is publicly available in a git repository. five simple recognition models were selected as the base experimental estimators:
- knr: a k-nearest neighbors regressor with five neighbors, leaf size of 30 and the euclidean metric approximated by the minkowski distance,
- dknr: a knr regressor weighted by the distance from the closest patterns,
- mlp: a multilayer perceptron with one hidden layer of one hundred neurons, with the relu activation function and the adam optimizer,
- dtr: a cart tree with the mse split criterion,
- lin: the linear regression algorithm.
in this section we evaluate the performance of the proposed optimization approach. to this end, we conduct three experiments. experiment 1 focuses on the number of patterns required to construct a reliable prediction model. experiment 2 assesses the statistical dependence of the built models. eventually, experiment 3 verifies the efficiency of the proposed approach as a function of the number of guesses in the monte carlo search.
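the five base estimators listed above map onto scikit-learn roughly as follows; hyperparameters are taken from the description and everything else is left at assumed defaults.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

estimators = {
    "knr": KNeighborsRegressor(n_neighbors=5, leaf_size=30, metric="minkowski"),
    "dknr": KNeighborsRegressor(n_neighbors=5, leaf_size=30, metric="minkowski",
                                weights="distance"),
    "mlp": MLPRegressor(hidden_layer_sizes=(100,), activation="relu", solver="adam"),
    "dtr": DecisionTreeRegressor(criterion="squared_error"),   # mse split criterion
    "lin": LinearRegression(),
}

for name, regressor in estimators.items():
    print(name, regressor)
```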
the first experiment carried out as part of the approach evaluation is designed to verify how many patterns -and thus how many repetitions of simulations -must be passed to individual regression algorithms to allow the construction of a reliable prediction model. the tests were carried out on all five considered regressors in two stages. first, the range from 10 to 100 patterns was analyzed, and in the second, from 100 to 1000 patterns per processing. it is important to note that due to the chosen approach to cross-validation, in each case the model is built on 80% of available objects. the analysis was carried out independently on all available data sets, and due to the non-deterministic nature of sampling of available patterns, its results were additionally stabilized by repeating a choice of the objects subset five times. in order to allow proper observations, the results were averaged for both topologies. plots for the range from 100 to 1000 patterns were additionally supplemented by marking ranges of standard deviation of r 2 metric acquired within the topology and presented in the range from the .8 value. the results achieved for averaging individual topologies are presented in figs. 2 and 3 . for dt12 topology, mlp and dtr algorithms are competitively the best models, both in terms of the dynamics of the relationship between the number of patterns and the overall regression quality. the linear regression clearly stands out from the rate. a clear observation is also the saturation of the models, understood by approaching the maximum predictive ability, as soon as around 100 patterns in the data set. the best algorithms already achieve quality within .8, and with 600 patterns they stabilize around .95. the relationship between each of the recognition algorithms and the number of patterns takes the form of a logarithmic curve in which, after fast initial growth, each subsequent object gives less and less potential for improving the quality of prediction. this suggests that it is not necessary to carry out further simulations to extend the training set, because it will not significantly affect the predictive quality of the developed model. very similar observations may be made for euro28 topology, however, noting that it seems to be a simpler problem, allowing faster achievement of the maximum model predictive capacity. it is also worth noting here the fact that the standard deviation of results obtained by mlp is smaller, which may be equated with the potentially greater stability of the model achieved by such a solution. the second experiment extends the research contained in experiment 1 by assessing the statistical dependence of models built on a full datasets consisting of a thousand samples for each case. the results achieved are summarized in tables 1a and b. as may be seen, for the dt12 topology, the lin algorithm clearly deviates negatively from the other methods, in absolutely every case being a worse solution than any of the others, which leads to the conclusion that we should completely reject it from considering as a base for a stable recognition model. algorithms based on neighborhood (knr and dknr) are in the middle of the rate, in most cases statistically giving way to mlp and dtr, which would also suggest departing from them in the construction of the final model. the statistically best solutions, almost equally, in this case are mlp and dtr. for euro28 topology, the results are similar when it comes to lin, knr and dknr approaches. 
a significant difference, however, may be seen for the achievements of dtr, which in one case turns out to be the worst in the rate, and in many is significantly worse than mlp. these observations suggest that in the final model for the purposes of optimization lean towards the application of neural networks. what is important, the highest quality prediction does not exactly mean the best optimization. it is one of the very important factors, but not the only one. it is also necessary to be aware of the shape of the decision function. for this purpose, the research was supplemented with visualizations contained in fig. 4 . algorithms based on neighborhood (knn, dknn) and decision trees (dtr) are characterized by a discrete decision boundary, which in the case of visualization resembles a picture with a low level of quantization. in the case of an ensemble model, stabilized by cross-validation, actions are taken to reduce this property in order to develop as continuous a border as possible. as may be seen in the illustrations, compensation occurs, although in the case of knn and dknn leads to some disturbances in the decision boundary (interpreted as thresholding the predicted label value), and for the dtr case, despite the general correctness of the performed decisions, it generates image artifacts. such a model may still retain high predictive ability, but it has too much tendency to overfit and leads to insufficient continuity of the optimized function to perform effective optimization. clear decision boundaries are implemented by both the lin and mlp approaches. however, it is necessary to reject lin from processing due to the linear nature of the prediction, which (i ) in each optimization will lead to the selection of the extreme value of the analyzed range and (ii ) is not compatible with the distribution of the explained variable and must have the largest error in each of the optimas. summing up the observations of experiments 1 and 2, the mlp algorithm was chosen as the base model for the optimization task. it is characterized by (i ) statistically best predictive ability among the methods analyzed and (ii ) the clearest decision function from the perspective of the optimization task. the last experiment focuses on the finding of best atd vector based on the constructed regression model. to this end, we use monte carlo method with different number of guesses. tables 2 and 3 present the obtained results as a function of number of guesses, which changes from 10 1 up to 10 9 . the results quality increases with the number of guesses up to some threshold value. then, the results do not change at all or change only a little bit. according to the presented values, monte carlo method applied with 10 3 guesses provides satisfactory results. we therefore recommend that value for further experiments. the following work has considered the topic of employing pattern recognition methods to support ss-fon optimization process. for a wide pool of generated cases, analyzing two real network topologies, the effectiveness of solutions implemented by five different, typical regression methods was analyzed, starting from logistic regression and ending with neural networks. conducted experimental analysis shows, with high probability obtained by conducting proper statistical validation, that mlp is characterized by the greatest potential in this type of solutions. 
even with a relatively small pool of input simulations constructing a data set for learning purposes, interpretable in both the optimization and the machine learning problem space, simple networks of this type achieve both high-quality prediction as measured by the r² metric and a continuous decision space that creates the potential for conducting optimization. basing the model on the stabilization realized by an ensemble of estimators additionally allows us to reduce the influence of noise on the optimization, which, in state-of-the-art optimization methods, could show a tendency to select invalid optima burdened by the non-deterministic character of the simulator. further research, developing the ideas presented in this article, will focus on generalizing the presented model to a wider pool of network optimization problems. key: cord-148358-q30zlgwy authors: pang, raymond ka-kay; granados, oscar; chhajer, harsh; legara, erika fille title: an analysis of network filtering methods to sovereign bond yields during covid-19 date: 2020-09-28 journal: nan doi: nan sha: doc_id: 148358 cord_uid: q30zlgwy in this work, we investigate the impact of the covid-19 pandemic on sovereign bond yields amongst european countries. we consider the temporal changes in financial correlations using network filtering methods, which retain a subset of links within the correlation matrix and thereby give rise to a network structure. we use sovereign bond yield data from 17 european countries between 2010 and 2020 as an indicator of the economic health of countries. we find that the average correlation between sovereign bonds decreases within the covid-19 period, falling from the peak observed in the 2019-2020 period, and that this trend is reflected in all network filtering methods. we also find variations between the movements of the different network filtering methods under various network measures. the novel coronavirus disease 2019 (covid-19) epidemic caused by sars-cov-2 began in china in december 2019 and rapidly spread around the world. confirmed cases increased in different cities of china, japan, and south korea within the first days of january 2020, and the disease spread globally with new cases in iran, spain, and italy by the middle of february.
we focus on sovereign bonds during the covid-19 period to highlight the extent to which the pandemic has influenced the financial markets. in the last few years, bond yields across the euro-zone were decreasing under a range of european central bank (ecb) interventions, and overall remained stable compared with the german bund, a benchmark used for european sovereign bonds. these movements were disrupted during the covid-19 pandemic, which has affected the future trajectory of bond yields from highly impacted countries, e.g., spain and italy. however, in the last months, the european central banks intervened in financial and monetary markets to consolidate stability through an adequate supply of liquidity countering the possible margin calls and the risks of different markets and payment systems. these interventions played a specific role in sovereign bonds because, on the one side, supported the stability of financial markets and, on the other side, supported the governments' financial stability and developed a global reference interest rate scheme. understanding how correlations now differ and similarities observed in previous financial events are important in dealing with the future economic effects of covid19. we consider an analysis of sovereign bonds by using network filtering methods, which is part of a growing literature within the area of econophysics [29, 44, 30, 28, 17] . the advantages in using filtering methods is the extraction of a network type structure from the financial correlations between sovereign bonds, which allows the properties of centrality and clustering to be considered. in consequence, the correlation-based networks and hierarchical clustering methodologies allow us to understand the nature of financial markets and some features of sovereign bonds. it is not clear which approach should be used in analyzing sovereign bond yields, and so within this paper, we implement various filtering methods to the sovereign bond yield data and compare the resulting structure of different networks. our analysis shows that over the last decade, the mean correlation peaks in october 2019 and then decreases during the 2020 period, when covid-19 is most active in europe. these dynamics are reflected across all network filtering methods and represent the wide impact of covid-19 towards the spectrum of correlations, compared to previous financial events. we consider the network centrality of sovereign bonds within the covid-19 period, which remains consistent with previous years. these trends are distinctive between filtering methods and stem from the nature of correlations towards economic factors e.g., positive correlations show a stable trend in the individual centrality, compared with the volatile trends for negative correlations, where central nodes within these networks are less integrated in the euro-area. although there is a change in the magnitude of correlations, the overall structure relative to the central node is maintained within the covid-19 period. previous studies have used different methods to analyze historic correlations as random matrix theory to identify the distribution of eigenvalues concerning financial correlations [27, 39, 23] , the approaches from information theory in exploring the uncertainty within the financial system [20, 12] , multilayer network methods [1, 7, 46, 24, 18, 40] , and filtering methods. 
several authors have used network filtering methods to explain financial structures [31, 37], hierarchy and networks in financial markets [50], relations between financial markets and the real economy [34], volatility [51], interest rates [33], stock markets [21, 52, 53, 2], futures markets [8] or topological dynamics [45], to list a few. the comparison of filtering methods on market data has also been carried out for several financial instruments. birch, et al [10] compare filtering methods on the dax30 stocks. musmeci, et al [35] propose a multiplex visual network approach and consider data from multiple stock indexes. kukreti, et al [26] use s&p500 market data and combine entropy measures with a range of network filtering methods. aste, et al [5] compare network filtering methods on us equity market data and assess the dynamics using network measures. in order to evaluate the european sovereign bonds based on filtering methods, this work is organized as follows. in section 2, we describe the network filtering methods and present the data sets with some preliminary empirical analyses. in section 3, we apply the filtering methods to sovereign bond yields, analyze the trend of financial correlations over the last decade and consider aspects of the network topology. in section 4, we construct plots representing the covid-19 period for all methods and analyze the clustering between countries. in section 5, we discuss the results and future directions. we introduce a range of network filtering methods and consider a framework as in [31] for sovereign bond yields. we define n ∈ ℕ to be the number of sovereign bonds and y_i(t) the bond yield of the i-th sovereign bond at time t, where i ∈ {1, ..., n}. the correlation coefficients r_ij(t) ∈ [−1, 1] are defined using the pearson correlation as r_ij(t) = (⟨y_i y_j⟩ − ⟨y_i⟩⟨y_j⟩) / √[(⟨y_i²⟩ − ⟨y_i⟩²)(⟨y_j²⟩ − ⟨y_j⟩²)], with ⟨·⟩ denoting the average of the yield values over the window. the notion of distance d_ij ∈ [0, 2] is built from the entries r_ij of the correlation matrix r ∈ [−1, 1]^{n×n}, with d_ij = √(2(1 − r_ij)). a distance of d_ij = 0 represents perfectly positive correlation and d_ij = 2 represents perfectly negative correlation between bonds. the network filtering methods are then applied to the distance matrix d ∈ [0, 2]^{n×n}, where a subset of links (or edges) is chosen under each filtering method. the set of edges at time t is denoted {(i, j) ∈ e(t) : nodes i and j are connected} and is defined for each filtering method. we define the time frames of financial correlations as x, the set of observations with n columns and t rows. from the set of observations x, we consider windows of length 120, equal to six months of data values. we then displace consecutive windows by δ = 10 data points, equal to two weeks of data values, and discard previous observations until all data points are used. by displacing the data in this way, we can examine the time series trend across windows x. we verify the statistical reliability of the correlations using a non-parametric bootstrap approach as in efron [15], which is also used in tumminello, et al [48, 49]. we randomly choose rows equal in number to the window length t, allowing repeated rows to be chosen, compute the correlation matrix for this window x*_m, and repeat the procedure until m samples are generated, with m chosen as 10,000. the error between data points described in efron [15] is approximately (1 − ρ²)/√t, so that strongly positively and negatively correlated values ρ have the smallest errors.
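a minimal sketch of the windowing and bootstrap procedure, assuming the yields are held in a pandas dataframe with one column per country; filter_fn stands for any routine that maps a distance matrix to a set of edges (such as the spanning-tree builders sketched in the next section), and the reliability of a link is computed, as described later in the text, as the share of bootstrap replicas in which it reappears in the filtered network. the default of 1,000 replicas here is a placeholder for the 10,000 used in the paper.

```python
import numpy as np
import pandas as pd

def window_distances(yields: pd.DataFrame, window=120, step=10):
    """Yield (start index, correlation matrix, distance matrix) for each rolling window."""
    for start in range(0, len(yields) - window + 1, step):
        w = yields.iloc[start:start + window]
        r = w.corr()                      # pearson correlations r_ij
        d = np.sqrt(2.0 * (1.0 - r))      # d_ij = sqrt(2(1 - r_ij)), in [0, 2]
        yield start, r, d

def link_reliability(w: pd.DataFrame, filter_fn, n_samples=1000, seed=0):
    """Share of bootstrap replicas in which each link of the filtered network reappears."""
    rng = np.random.default_rng(seed)
    base_edges = filter_fn(np.sqrt(2.0 * (1.0 - w.corr())))
    counts = dict.fromkeys(base_edges, 0)
    for _ in range(n_samples):
        rows = rng.integers(0, len(w), size=len(w))        # resample rows with replacement
        d_star = np.sqrt(2.0 * (1.0 - w.iloc[rows].corr()))
        replica_edges = filter_fn(d_star)
        for i, j in base_edges:
            if (i, j) in replica_edges or (j, i) in replica_edges:
                counts[(i, j)] += 1
    return {edge: c / n_samples for edge, c in counts.items()}
```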
table 1 summarizes the filtering methods considered, together with the number of edges each retains:
- minimum spanning tree (mst), n − 1 edges [25]: a connected and undirected network for n nodes which minimizes the total edge weight.
- maximum spanning tree (mast), n − 1 edges [41]: a connected and undirected network for n nodes which maximizes the total edge weight.
- asset graph (ag), n − 1 edges [36]: choose the smallest n − 1 edges from the distance matrix.
- triangulated maximal filtering graph (tmfg), 3(n − 2) edges [32]: a planar filtered graph under an assigned objective function.
the minimum spanning tree (mst) method is a widely known approach which has been used within currency markets [22], stock markets [42, 43] and sovereign bond yields [13]. the mst from table 1 considers the smallest edges and prioritizes connections of high correlation to form a connected and undirected tree network. this approach can be constructed with a greedy algorithm, e.g. kruskal's or prim's algorithm, and satisfies the subdominant ultrametric distance property, i.e. d_ij ≤ max{d_ik, d_kj} ∀ i, j, k ∈ {1, ..., n}. a maximum spanning tree (mast) constructs a connected and undirected tree network with n − 1 edges by maximizing the total edge weight. analyses involving the mast have been used as comparisons to results seen within mst approaches [14, 19]. the mast approach is informative for connections of perfect anti-correlation between nodes, which are not observed within the mst. a network formed from asset graphs (ag) considers positive correlations between nodes above a given threshold. within the mst, some links of positive correlation are not considered in order to satisfy the properties of the tree network. all n − 1 highest correlations are considered in an ag, allowing for the formation of cliques not observed within an mst or mast network. the use of the ag has been considered in onnela, et al [38], which identifies clustering within stock market data. as the method only considers n − 1 links, some nodes within the ag may not be connected for the given threshold, and therefore how unconnected nodes would attach to the connected components remains unknown. the triangulated maximal filtering graph (tmfg) constructs a network of 3(n − 2) fixed edges for n nodes, similar to the planar maximal filtered graph (pmfg) [47], which has been used to analyze us stock trends [35]. the algorithm initially chooses a clique of 4 nodes, and edges are then added sequentially to optimize the objective function, e.g. the total edge weight of the network, until all nodes are connected. this approach is non-greedy in choosing edges and incorporates the formation of cliques within the network structure. a tmfg is also an approximate solution to the weighted maximal planar graph problem and is computationally faster than the pmfg. the resulting network includes more information about the correlation matrix than the spanning tree approaches, while still maintaining a level of sparsity between nodes. european sovereign debt has evolved over the last ten years, with several episodes affecting the convergence between bond yields. after the 2008 crisis, european countries experienced a period of financial stress starting in 2010 that affected bond yields: investors saw an excessive amount of sovereign debt and demanded higher interest rates in a setting of low economic growth and high fiscal deficits.
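three of the filtering methods in table 1 can be sketched directly with networkx on a distance matrix d; the tmfg requires a dedicated planar-filtering routine and is not reproduced here.

```python
import networkx as nx
import pandas as pd

def distance_graph(d: pd.DataFrame) -> nx.Graph:
    """Complete weighted graph whose edge weights are the distances d_ij."""
    g = nx.Graph()
    labels = list(d.columns)
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            g.add_edge(labels[a], labels[b], weight=float(d.iloc[a, b]))
    return g

def mst_edges(d):
    # minimise total edge weight: keeps the most positively correlated links
    return set(nx.minimum_spanning_tree(distance_graph(d)).edges())

def mast_edges(d):
    # maximise total edge weight: keeps the most anti-correlated links
    return set(nx.maximum_spanning_tree(distance_graph(d)).edges())

def asset_graph_edges(d):
    # the n - 1 smallest-distance links; cliques allowed, graph may be disconnected
    ranked = sorted(distance_graph(d).edges(data="weight"), key=lambda e: e[2])
    return {(u, v) for u, v, _ in ranked[:len(d) - 1]}
```

any of these functions can be passed as filter_fn to the bootstrap sketch given earlier.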
during 2010-2012, several european countries suffered downgrades in their bond ratings to junk status that affected investors' trust and fears of sovereign risk contagion resulting, in some cases, a differential of over 1,000 basis points in several sovereign bonds. after the introduction of austerity measures in giips countries, the bond markets returned to normality in 2015. the 2012 european debt crisis particularly revealed spillover effects between different sovereign bonds, which have been studied using various time series models e.g. var [11, 4] and garch [6] . the results showed that portugal, greece, and ireland have a greater domestic effect, italy and spain contributed to the spillover effects to other european bond markets and a core group of abfn (austria, belgium, france, and netherlands) countries had a lower contribution to the spillover effects, with some of the least impacted countries residing outside of the euro zone. during the sovereign debt crisis, public indebtedness increased after greece had to correct the public finance falsified data, and other countries created schemes to solve their public finance problems, especially, bank bailouts. in consequence, the average debt-to-gdp ratio across the euro-zone countries rose from 72% in 2006 to 119.5% in 2014, as well as the increase in sovereign credit risk [3, 9] . after the fiscal compact treaty went into effect at the start of 2013, which defined that fiscal principles had to be embedded in the national legislation of each country that signed the treaty, the yield of sovereign bonds started a correction, although some investors and institutions . four of the listed countries are part of the g7 and g20 economic groups (germany, france, italy and the uk). we consider sovereign bond yields with a 10 year maturity between january 2010 and june 2020. this data is taken from the financial news platform 1 . in total, there are 2,491 data values for each country with an average of 240 data points within 1 year. table 2 provides summary statistics of the 10y bond yield data. the results show greek yields to have the highest values across all statistical measures compared with other countries yields, particularly within the 2010-2011 (max yield of 39.9). in contrast, swiss bond yields exhibit the smallest mean and variance, with a higher than average positive skewness compared with other sovereign bonds. under the jb test for the normality of data distributions, all bond yield trends have a negligible p-value with non-gaussian distributions. the left skewed yield distributions (except for iceland), which represent an average decrease in yield values each year are high for giips countries compared with the uk, france, and germany, with flattening yield trends. we compute the correlation matrix for each window x with a displacement of δ between windows, and consider the mean and variance for the correlation matrix. we define the mean correlation r(t) given the correlations r ij for n sovereign bonds from figure 1 , we find that the mean correlation r(t) is highest at 0.95 in oct 2019. this suggests that a covid-19 impact was a continuation on the decrease of the mean correlation, and throughout the punitive lock down measures introduced by the majority of european countries in feb-mar 2020. 
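a sketch of the table 2 style summary statistics with the jarque-bera test, together with the mean correlation r(t) and its variance per window; since the displayed formula for r(t) did not survive extraction, it is taken here to be the average over the n(n − 1)/2 distinct pairwise correlations, which is consistent with the surrounding description.

```python
import numpy as np
import pandas as pd
from scipy import stats

def yield_summary(yields: pd.DataFrame) -> pd.DataFrame:
    """One row of summary statistics per country (columns of the yield dataframe)."""
    rows = {}
    for country, series in yields.items():
        clean = series.dropna()
        rows[country] = {
            "mean": clean.mean(),
            "variance": clean.var(),
            "skewness": clean.skew(),
            "max": clean.max(),
            "jb p-value": stats.jarque_bera(clean)[1],   # normality test p-value
        }
    return pd.DataFrame(rows).T

def mean_correlation(r: pd.DataFrame):
    """Mean r(t) and variance u(t) of the distinct off-diagonal correlations."""
    upper = r.values[np.triu_indices(len(r), k=1)]
    return upper.mean(), upper.var()
```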
the decreases in mean correlation are also observed in the 2012 period during the european debt crisis, in which several european countries received eu-imf bailouts to cope with government debt, and in 2016, under a combination of political events within the uk and the increased debt accumulation by italian banks. the variance u(t) follows a trend similar to the mean correlation, with the smallest variance of 0.002 in october 2019. within 2020, the variance between sovereign bonds increases and reflects the differences between the correlations of low- and high-yield bonds. we consider the normalized network length l(t), introduced in onnela, et al [36] as the normalized tree length. we refer to the measure as the normalized network length, as it is also computed for the ag and tmfg non-tree networks. the network length is the mean link weight over the subset of links e(t) present within the filtered network on the distance matrix at time t, l(t) = (1/|e(t)|) Σ_{(i,j)∈e(t)} d_ij(t). the plots in figure 2 represent the mean and variance of the network length. as each filtering method considers a different subset of weighted links, the normalized length l(t) is monotonic between the methods and decreases with an increased proportion of positively correlated links within the network. we highlight the movements in the normalized network length during the covid-19 period, which are reflected across all filtering methods. a similar movement is observed within 2016, but only towards a subset of correlations, in which the network length of the mast and tmfg increases compared with the mst and ag. the relative difference between the normalized network lengths is least evident in periods of low variance; this is observed in the 2019-2020 period, where the difference between all methods decreases. we find the variance is highest within the tmfg and lowest with the ag approach. the increased inclusion of links with a higher reliability error in the tmfg increases the variance, particularly within the 2014-2017 period. the variance of the mst is on average higher than that of the mast, but when considering only the highest correlated links in the ag, the variance decreases. we define c(t) as the degree centrality of the node of maximum degree at time t; this measure counts the number of direct links attached to a node. the mean occupation layer η(t) (mol), introduced in onnela, et al [36], is a measure of the centrality of the network relative to the central node υ(t). we define lev_i(t) as the level of node i, which is its distance from υ(t), where the central node and nodes unconnected to the central node have a level value of 0, so that η(t) = (1/n) Σ_i lev_i(t). we use the betweenness centrality to define the central node υ(t) for the mol. introduced in freeman [16], the betweenness b(t) of a node k considers the number of shortest paths σ_ij(k) between i and j which pass through k, relative to the total number of shortest paths σ_ij between i and j, b_k(t) = Σ_{i<j} σ_ij(k)/σ_ij. within the mst, the degree centrality ranges between 3 and 5 for euro-zone countries. the trend within the mst remains stable, where the central node under degree centrality is associated with multiple sovereign bonds, e.g., the netherlands 19%, portugal 10% and belgium 9% across all periods. the mast has the highest variation, with a centralized network structure in some periods, e.g., c(t) of 16, forming a star-shaped network structure. this is usually associated with greece, iceland and hungary, which are identified as the central node 55% of the time.
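the network measures above can be sketched as follows (networkx assumed): the normalized network length over the filtered edge set e(t), the maximum degree c(t), and the mean occupation layer computed relative to the betweenness-central node, with nodes unconnected to it assigned level 0 as in the text.

```python
import numpy as np
import networkx as nx

def normalized_network_length(d, edges):
    """Mean link weight of the filtered network; d is the distance DataFrame."""
    return float(np.mean([d.loc[i, j] for i, j in edges]))

def centrality_measures(nodes, edges):
    g = nx.Graph()
    g.add_nodes_from(nodes)
    g.add_edges_from(edges)
    betweenness = nx.betweenness_centrality(g)
    central = max(betweenness, key=betweenness.get)          # central node v(t)
    hops = nx.single_source_shortest_path_length(g, central)
    levels = [hops.get(node, 0) for node in g.nodes()]       # unreachable nodes -> level 0
    return {
        "max degree c(t)": max(dict(g.degree()).values()),
        "mean occupation layer": float(np.mean(levels)),
        "central node": central,
    }
```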
the degree centrality on average is naturally highest with the tmfg, under a higher network density, where the central nodes are identified as hungary and romania sovereign bonds, similar to the mast. the ag identifies the netherlands and belgium within the degree centrality, under a higher proportion of 25% and 13% compared with the mst. within figure 3 , the mol on average is smallest for the ag, because of the 0 level values from unconnected nodes, in which an unconnected node is present within 94% of considered windows. we find that the nodes within the tmfg are closest within the network, where the central node is directly or indirectly connected for all nodes, with an average path length of 1.1 across all periods. between the mst and mast, the mol is higher within the mast, where nodes within the network have a higher degree centrality. we analyze the temporal changes of sovereign bond yields between october 2019 and june 2020. the associated link weights on each filtering method for window x are the proportions in which the link appears within the correlation matrix, under the statistical reliability, across all samples m for the randomly sampled windows x * m . under the mst, austria has the highest degree centrality of 4. the network also exhibits clusters between southern european countries connected by spain, and the uk towards polish and german sovereign bond yields. within the network, there is a connection between all abfn countries, but countries within this group also facilitate the connecting component within giips countries, where belgium is connected with spain and irish sovereign bonds. the uk and eastern european countries remain on the periphery, with abfn countries occupying the core of the network structure. for the mast in figure 4 , there exists a high degree centrality for polish sovereign bonds between western european countries e.g., france and netherlands. this contrasts to the observed regional hub structure within the mst, with the existence of several sovereign bonds with high degree centrality in the network. the uk remains within the periphery of the mast structure when considering anti-correlations, and shows uk bond yields fluctuate less with movements of other european bonds compared with previous years. this is also observed for sovereign bonds for other countries with non-euro currencies such as czech republic, hungary, and iceland. we find nodes within the tmfg to have the highest degree in iceland at 13 and poland at 10. although the mst is embedded within the tmfg network structure, a high resemblance is observed to links from the mast, where 69% of links which are present within the mast are common in both networks. there is also the associated degree centrality of the mast, which is observed within the tmfg connected nodes. under the tmfg, nodes have a higher degree connectivity when considering an increased number of links, this is the case for the uk, which has 9 links compared with other spanning tree approaches. the ag exhibits three connected components between western european countries, southern european countries and the uk with eastern european countries. these unconnected nodes within the ag are associated with non-euro adopting countries, with the remaining countries connected in an individual component. by solely considering the most positive correlations, we observe the formation of 3-cliques between countries, which is prevalent within the western european group of 6 nodes. 
the average statistical reliability is highest at 0.92 within the mast and ag, 0.89 for the mst and 0.82 for the tmfg. under the tmfg, the increased inclusion of links with a lower magnitude in correlations decreases the reliability in link values. other filtering approaches which consider a smaller subset can still result in low reliability values between some nodes e.g. austria and romania at 0.51 in the mst, germany and netherlands at 0.47 in ag. under various constraints, we observe a commonality between sovereign bonds across network filtering methods. we find for tree networks, that euro-area countries have a high degree centrality and countries with non-euro currencies e.g. czech republic and the uk are predominately located within the periphery of the network. this is further observed within the ag, where cliques are formed between giips and abfn countries, which is distinctive during the covid-19 period compared with previous years. the anti-correlations within the mast inform the trends of the negative correlations between eastern european countries and other european countries. by considering the tmfg with an increased number of links for positive correlations, we find similarities with the mast degree centrality. as a response to the covid-19 pandemic, most countries implemented various socio-economic policies and business restrictions almost simultaneously. an immediate consequence was an increase in yield rates for these nations. the resulting upward co-movement and upward movements in other yield rates explain the decrease in the mean correlation in bond dynamics, coinciding with the pandemic outbreak. thus, understanding the dynamics of financial instruments in the euro area is important to assess the increased economic strain from events seen in the last decade. in this paper, we consider the movements of european sovereign bond yields for network filtering methods, where we particularly focus on the covid-19 period. we find that the impact of covid-banks starts to drop off, the market dynamics could adjust to economic performance and not its financial performance. in other words, the resulting dynamics could explain an increase in mean correlation in bond dynamics coinciding with the economic dynamics after the pandemic and the increment in yield rates. although we consider the sovereign bond yields with a 10y maturity as a benchmark, this research can be extended to sovereign bonds with different maturities (e.g., short term 1y, 2y or 5y, and long term 20y or 30y) because these bonds could reveal interesting effects and confirm that sovereign bonds are a good indicator to identify the economic impact of covid-19. as each sovereign bond has different yield and volatility trends, we considered using the zero-coupon curve to evaluate the full extent of covid-19 on sovereign bonds. multiplex interbank networks and systemic importance: an application to european data clustering stock markets for balanced portfolio construction the dynamics of spillover effects during the european sovereign debt turmoil sovereign bond yield spillovers in the euro zone during the financial and debt crisis correlation structure and dynamics in volatile markets spillover effects on government bond yields in euro zone. does full financial integration exist in european government bond markets? 
interbank markets and multiplex networks: centrality measures and statistical null models multi-scale correlations in different futures markets the geography of the great rebalancing in euro area bond markets during the sovereign debt crisis analysis of correlation based networks representing dax 30 stock price returns measuring bilateral spillover and testing contagion on sovereign bond markets in europe the entropy as a tool for analysing statistical dependences in financial time series sovereign debt crisis in the european union: a minimum spanning tree approach spanning trees and the eurozone crisis bootstrap methods: another look at the jackknife a set of measures of centrality based on betweenness comovements in government bond markets: a minimum spanning tree analysis using multiplex networks for banking systems dynamics modelling maximal spanning trees, asset graphs and random matrix denoising in the analysis of dynamics of financial networks multifractal diffusion entropy analysis on stock volatility in financial markets dynamic correlation network analysis of financial asset returns with network clustering currency crises and the evolution of foreign exchange market: evidence from minimum spanning tree correlation of financial markets in times of crisis multi-layered interbank model for assessing systemic risk on the shortest spanning subtree of a graph and the traveling salesman problem a perspective on correlation-based financial networks and entropy measures random matrix theory and financial correlations extracting the sovereigns' cds market hierarchy: a correlation-filtering approach portfolio optimization based on network topology complex networks and minimal spanning trees in international trade network hierarchical structure in financial markets network filtering for big data: triangulated maximally filtered graph interest rates hierarchical structure relation between financial market structure and the real economy: comparison between clustering methods the multiplex dependency structure of financial markets dynamic asset trees and black monday asset trees and asset graphs in financial markets clustering and information in correlation based financial networks random matrix approach to cross correlations in financial data the multi-layer network nature of systemic risk and its implications for the costs of financial crises universal and nonuniversal allometric scaling behaviors in the visibility graphs of world stock market indices pruning a minimum spanning tree on stock market dynamics through ultrametricity of minimum spanning tree causality networks of financial assets complexities in financial network topological dynamics: modeling of emerging and developed stock markets cross-border interbank networks, banking risk and contagion a tool for filtering information in complex systems spanning trees and bootstrap reliability estimation in correlation-based networks hierarchically nested factor model from multivariate data correlation, hierarchies, and networks in financial markets a cluster driven log-volatility factor model: a deepening on the source of the volatility clustering multiscale correlation networks analysis of the us stock market: a wavelet analysis network formation in a multi-asset artificial stock market key: cord-033557-fhenhjvm authors: saha, debdatta; vasuprada, t. m. 
title: reconciling conflicting themes of traditionality and innovation: an application of research networks using author affiliation date: 2020-10-09 journal: adv tradit med (adtm) doi: 10.1007/s13596-020-00515-w sha: doc_id: 33557 cord_uid: fhenhjvm innovation takes different forms: varying from path-breaking discoveries to adaptive changes that survive external shifts in the environment. our paper investigates the nature and process of innovation in the traditional knowledge system of ayurveda by tracing the footprints that innovation leaves in the academic research network of published papers from the pubmed database. traditional knowledge systems defy the application of standard measures of innovation such as patents and patent citations. however, the continuity in content of these knowledge systems, which are studied using modern publication standards prescribed by academic journals, indicate a kind of adaptive innovation that we track using an author-affiliation based measure of homophily. our investigation of this measure and its relationship with currently accepted standards of journal quality clearly shows how systems of knowledge can continue in an unbroken tradition without becoming extinct. rather than no innovation, traditional knowledge systems evolve by adapting to modern standards of knowledge dissemination without significant alteration in their content. one important platform for sharing knowledge, be it results of cutting-edge research or establishing old truths in a modern context, is journal publications (thyer 2008; edwards 2015; sandström and van den besselaar 2016) . medicinal sciences is of particular interest, as team collaboration is necessary to produce research outcomes (hall et al. 2008; gibbons 1994) . 1 of the existing data-sets providing details of academic collaborations and knowledge sharing in biosciences, pubmed is one of the foremost sources (falagas et al. 2008b; mcentyre and lipman 2001; anders and evans 2010) . with a collection of more than 30 million citations on biomedical literature, pubmed (maintained by the us government funded us national library of medicine and national institutes of health) offers a panorama of publications of diverse qualities and topics. of great interest is the simultaneous co-existence of research papers not only from the current mainstream of bio-medicine, but also other branches of medical knowledge, such as traditional medicine. 2 no two canons of knowledge can be as distinct from each other as bio-medicine and traditional medicine (baars and hamre 2017; mukharji 2016) , and yet academic collaborations conform to similar standards of dissemination of knowledge and is available in a common platform like pubmed. in terms of the character 1 the importance of team collaboration for producing quality research has been documented for other disciplines and across countries. see adams (2013) for a general discussion on the impact of international collaborations on knowledge sharing. 2 world health organization's report on traditional medicine (2000) defines traditional medicine as "the sum total of the knowledge, skills and practices based on the theories, beliefs and experiences indigenous to different cultures, whether explicable or not, used in the maintenance of health, as well as in the prevention, diagnosis, improvement or treatment of physical and mental illnesses." this definition finds resonance in fokunang et al. (2011). 
of the discipline, bio-medicine displays masculinity, 3 and low power distance 4 whereas traditional medicine strives to retain content untouched. 5 the former is marked by schumpeterian upheavals and stark innovations from time to time (such as the development of vaccines and novel drugs for treating new disease conditions, 6 ) whereas the latter pride in their continuity of knowledge handed down from generation to generation [see banerjee (2009) , shukla and sinclair (2009) and mathur (2003) ]. the simultaneous existence of research papers from both disciplines for journals conforming to uniform standards of publication automatically raises questions about the true nature of innovation in traditional knowledge systems like ayurveda. it is possible that it is an innovative discipline because it shares the same kind of research output space as bio-medicine publications. on the other hand, the nature of collaborations within the traditional knowledge journals might be 'non-innovative', despite publications in standard format journals. when knowledge systems adopt the platform of journal publications, the structure of information disseminated becomes a function of the standards and rules set by them. 7 there are specific structural restrictions, such as bibliographies of specific types (green 2000; masic 2013) , journal rankings (gonzález-pereira et al. 2010) ], double-blind peer review systems (albers et al. 2011) etc., that are imposed when knowledge is shared through journal publications. this brings us to our central query: when a medicinal system which is considered 'traditional' uses modern publication standards to disseminate knowledge, what kind of collaborative structures will be observed? how does a system that conforms with these modern publication standards insulate itself from dilution in terms of content and practices? to what extent will traditional knowledge systems engage with academic collaborations as observed in other mainstream disciplines? we contextualize our query by studying the publication network in ayurveda, a rich traditional medicinal system prevalent in south asia, and largely limit ourselves to the first two questions. there are other branches of traditional medicine, such as indigenous medicine of indians in the americas or tibetan/himalayan traditional medicine systems. in fact, in recent times, the coronavirus epidemic has shown the relevance of chinese traditional medicine. we have evidence of successful treatment of viral cases in wuhan, the centre of the outbreak. 8 the ministry of ayush, government of india, 9 has announced a taskforce (in early april 2020) with members from the indian council of medical research, the council of scientific and industrial research, the department of biotechnology, the ayush ministry and the who (see a discussion in https ://scien ce.thewi re.in/the-scien ces/minis try-of-ayush -taskforce -clini cal-trial s-herbs -proph ylact ics/), to investigate the potential of ayurvedic cures for coronavirus symptoms. as a prophylactic cure for covid-19, the taskforce has recommended clinical trial testing of some herbs, prominently ashwagandha (withania somnifera). this herb, which we research in detail in this paper, has been mentioned in recent times as a potential alternative to hydroxychloroquine. 
10 these efforts are in the initial stages, but the ayush ministry has established a clear protocol for registering ayurvedic formulations to establish efficacy in treating symptoms of covid-19 11 as well as warning alerts to all regarding unsubstantiated claims of efficacy of herbal cures. 12 traditional knowledge systems exist in modern times due to its continued relevance, despite its continued and steady referencing to historical repositories of information. 7 however, knowledge flows in a discipline are, by no means, only limited to journal publications, as books, project applications and grants (dahlander and mcfarland 2013) , web-and video logs and many other forms of online and open source platforms (yan 2014; chesbrough 2006; zucker et al. 2007 ) also contribute to its denouement. 8 see the report available at http://www.xinhu anet.com/engli sh/2020-03/13/c_13887 5501.htm, which mentions that 90% of the covid-19 patients were treated with chinese traditional medicine. 9 this ministry was established by the government of india as recently as 2014 and is the regulatory authority for alternative medicine disciplines, such as ayurveda, siddha, unani and homeopathy. 10 multiple medical blogs as well as new reports mention this: https ://www.expre sspha rma.in/covid 19-updat es/gover nment -to-condu ctrando mised -contr olled -clini cal-trial -of-ashwa gandh a/; https ://www. busin ess-stand ard.com/artic le/pti-stori es/covid -19-govt-to-condu ctrando mised -contr olled -clini cal-trial -of-ashwa gandh a-12005 07012 14_1.html; https ://times ofind ia.india times .com/life-style /healt h-fitne ss/home-remed ies/covid -19-minis try-of-ayush -start s-clini cal-trial s-for-ashwa gandh a-and-4-other -ayurv edic-herbs -here-is-what-youneed-to-know/photo story /75692 669.cms;https ://www.expre sspha rma.in/ayush /ashwa gandh a-can-be-effec tive-preve ntive -drug-again stcoron aviru s-iit-delhi -resea rch/. 11 https ://www.ayush .gov.in/docs/clini cal-proto col-guide line.pdf. 12 https ://www.ayush .gov.in/docs/121.pdf. within the space of journal publications, we have to pick the best measure to capture innovation. academic paper writing with multiple authors (as is generally the case in most disciplines) involves joint ventures between diverse researchers, who reflect on the research problem from different perspectives. we explore the nature of the interconnections between authors, as these reflect, in a reduced form, the simultaneous adaptation and continuity in the process of knowledge transmission using the platform of academic journals. we postulate that the nature of these interconnections, as captured by the notions of network density and homophily in a research network, have the potential to capture innovation in traditional knowledge systems. consider network density first. this measures the proportion of potential ties that are realized in an empirical network (newman 2010) . the more dense a network, the higher the number of potential ties that are actualized leading to larger flows of information. a sparse network leads to less information transmission as well as benefits and dangers of interconnections, as hearn et al. (2003) discusses. hence, in a densely connected network, with many cross-connections between researchers, while benefits of continuous knowledge is enhanced, the possibility of disruptive changes coming through the structure of the connections also become alive. 
this brings us to the issue of homophily in the research network and its relationship with adaptive innovation in networks with different densities. homophily, which is the literal equivalent for the idiom 'birds of a feather flock together', in a research network reveals the extent to which 'similar' researchers form collaborations. note that most of the literature on homophily relate to a study of different attributes of researchers, such as gender (shrum et al. 1988) , race or ethnicity (leszczensky and pink 2015) , language (pezzuti et al. 2018) etc. and interest (dahlander and mcfarland 2013) . the latter also differentiate between attribute and interest-based homophily of university researchers from their organizational foci (departments and research centers). 13 note that their investigation revolves around a specific issue of tie formation versus continuations in collaborations for a particular university in the us. when it comes to traditional medicine, universities are not the optimal institutional foci for academic research, as most mainstream medical colleges teach only bio-medicine (patwardhan and patwardhan 2017). traditional medicine is practiced in dedicated research centers and some specific universities, as well as by independent researchers who publish in international peer-reviewed journals such as journal of ayurveda and integrative medicine (j-aim with a scimago rank of 0.315) or journal of ayurveda (published by the national institute of ayurveda, jaipur, india) or ayu (open access journal published by the institute for post graduate teaching & research in ayurveda, gujarat ayurved university, india) as well as others of less repute [see kotecha (2015) for concerns regarding quality of publications in ayurveda]. 14 for our study, an appropriate measure of homophily in publications has to capture the homogeneity in the quality of information that is exchanged through academic research collaborations, as information transmission leads to the genesis of innovative ideas in the research space. the more homogeneous this exchange, the higher will be the self-referencing character of the transmitted knowledge. the challenge here is to understand how to measure similarity. we propose two ways for discussing similarity of connections in a research network: (i) a macro measure that tests for similarity in connections in the overall network and (ii) a micro measure that explores the presence of similarity in author connections for each academic paper in the overall research network. the latter measure is a marriage between organizational foci and homophily, which dahlander and mcfarland (2013) treat as two independent conditions for studying academic collaborations. our work is close to dunn et al. (2012) , who treat researchers in bio-medicine in terms of their relationship with the industry: either with industry affiliations or without these associations. this kind of bifurcation limits the analysis to a study of dyadic ties or collaborations only. we use a more flexible definition for affiliation by institution in order to accommodate collaborations between more than two authors. note that there is a trade-off between the network density and homophily: knowledge perpetuation in a densely connected network requires some form of similarity among agents exchanging information such that the content of the knowledge is not subject to drastic change. 
this has to be the case for traditional knowledge systems that have not become extinct, but continue to co-exist with other forms of knowledge canons. we couple our measures of homophily with a measure of quality of publications (the scimago journal rankings). modern publication standards, which equate publication quality using scimago-type of journal rankings [see gonzález-pereira et al. (2010) , falagas et al. (2008a) , cite], should yield a negative relationship between low innovation possibilities (as exhibited by high homophily) in research papers and the rank of the journal publishing such papers. put together, our query about appropriate measures for innovation within traditional knowledge systems indicate certain patterns in the empirical research network. we expect to see that tm/cam research networks would be marked by an integration into modern publication standards, while 13 feld (1981) define these organizational foci as institutions which may be social or legal entities around which collaborative activity is organized. 14 the ministry of ayush, government of india maintains a database of journal articles published in reputed journals at http://ayush porta l.nic.in/defau lt.aspx. retaining characteristics of continuity within connections between researchers in the network. more precisely, our prediction is that research networks in ayurveda would exhibit: i conformity with modern publication standards: negative relationship of low research potential in the research network (measured using homophily) and journal publication standard (measured using scimago rankings); j higher homophily in more densely connected networks: ensuring self-preservation of knowledge in the process of transmission and exchange. we study our predictions in two research networks specific to two specific natural herbs: 15 withania somnifera or ashwagandha and emblica officinalis or amla. 16 most of the papers investigate the properties and effects of these herbs in a stand-alone fashion, with hardly any evidence of academic research on the combined effects of these two common ayurvedic herbs. our results corroborate the pattern we predict that perpetuates knowledge through adaptation to modern standards in publication. the more densely connected research network (emblica officinalis or amla) shows a clear causal relationship between publication standard of a journal and the lack of homophily among author connections. there is clear evidence of overall homophily in the research network, when we investigate connections between pairs of authors using the q-measure of modularity. however, this macro measure does not indicate the mechanism through which homophily is likely to result in adaptive innovation in research networks. this is possible through our perpaper affiliation-based measure of homophily. the latter is our contribution to the literature on estimating measures of homophily that allows one to study supra-dyadic collaborations (research papers with more than two authors). as most papers in journals, particularly in the sciences, contain teams of more than three or four authors, our measure provides an alternative to existing measures which only study twoperson collaborations. the discussion in this paper is organized along the following lines: "innovation and traditional medicine: a framework for analysis" section discusses the theoretical framework for understanding adaptive innovation in ayurveda. 
"empirical methodology: measuring channels of adaptive innovation" section details the empirical methodology, including our proposed measures for capturing innovation in research networks in ayurveda, filtered by specific herbs. "empirical results" section discusses the data sources and the empirical results, while "conclusion" section concludes the paper with a discussion of our findings as well as limitations in the light of the theoretical perspective we propose. traditional medicine based on ayurveda deals with naturally occurring ingredients, mostly plant-based extracts (yuan et al. 2016; gangadharan 2010; samy et al. 2008 ). we provide a brief description of the knowledge system of ayurveda, before investigating its positioning in modern journal publications. ayurveda, which originated 5000 years ago in india, has adapted over the years and continues to be popularly accepted as a system for retaining health as well as curing diseases (jaiswal and williams 2017) . this popularity was not limited to india alone in earlier times. for instance, salema et al. (2002) , in his description of colonial pharmacies in the first global age between 1400-1800 ce, describes the widespread application of ayurvedic herbs as medicine in many parts of the world, starting with portuguese india. he mentions that medicines originating in india, with the agency of jesuit missionaries engaging in medicinal trade, became very important in the state-sponsored health care institutions of the portuguese colonies around the world. 17 not only medicines, research on indian medicines providing information about (i) the medicinal properties of substances from the indian sub-continent (ii) commercialization of these substances and (iii) market demand were published in the form of medical reports sponsored by the portuguese overseas council in lisbon. in fact, garcia de orta's colloquies on the samples and drugs of india, published in goa in 1563 ce, was the first printed publication on indian plants and medicines, as mentioned in salema et al. (2002) . garcia de orta was a pioneer in pharmacognosy and the first european writer on indian medicine. the outreach of this knowledge and the medicinal products covered a diverse set of regions: macau, timor, mozambique, brazil, sau tome and the continental portugal (to name a few as mentioned in salema et al. (2002) ). despite this spread, ayurvedic texts such as the charaka samhita (400-200 bce), the sushruta samhita (1200-600 bce), the ashtanga hridayam (500-600 ce), ashtanga sangraha (1110-1120 ce) etc., are studied by practitioners till date in the original or abridged versions. till 1820 ce, traditional medicine, and particularly ayurveda, was the prevalent and respected system of medicine in countries like india and sri lanka. it was during the period of increasing british colonization, that is, from 1820 to 1900 ce, which saw various advances in western medicine and a consequent but slow loss of reputation of traditional medicine (saini 2016) . history has shown that the advent of western medicine has relegated traditional systems of cure such as ayurveda to a subaltern space (banerjee 2009; ravishankar and shukla 2007; saini 2016; salema et al. 2002; patwardhan 2013) . the slew of standards for proving efficacy of cure, safety of cures (for example, conduct of clinical trials) coupled with recent advances in biotechnology has been at the forefront of pharmaceutical innovation in western medicine. 
therefore, a natural conclusion about the decay of traditional medicine in the face of competition from its newer counterpart is attributable to its self-perpetuating standards of adaptions. as opposed to the slew of drastic innovations delivered through the institution of clinical trials and other enforceable standards in bio-medicine, ayurveda adapted to the niche branch of 'traditionality' that did not incorporate similar institutions and standards. an academic collaboration network can be modelled as a finite collection of nodes (representing individual researchers), who are connected through co-authorship edges to form a simple graph g: where e is the set of edges (co-author connections) and v is the set of nodes (authors). a few features of this definition are in order. first, an author with no connections proxies for single-authored papers. a paper with only two authors will be represented by a single edge connecting two nodes. 18 a drawback of this representation is that there is no direct way of capturing a paper with more than two authors. one way around this is to break up the collaborations in the paper and treat them in a binary fashion: with three authors, consider first the link between the first and the second author, then the link between the second and the third author and at last, between the first and the third author. this loses out the (1) g = ⟨e, v⟩ flavour of the combined effect of knowledge sharing through a team of more than two people. an effective representation here requires a modification of the simple graph to a more general network structure such as a hypergraph [see newman (2018) ]. the existing literature investigating collaborations limit the discussion to dyadic connections. our proposed micromeasure is closest to freeman and huang (2015) , who investigate homophily 19 using author-ethnicity in 2.57 million scientific papers written in the us between 1985 and 2008. they find that high homophily results in a lower potential for innovation. however, in order to work with simple graphs, freeman and huang (2015) restrict research alliances only to the first and last authors of scientific publications assuming that they have the maximum responsibility. while this filter on the space of authors allows the overall network to retain a simple graph structure, the loss of information in the process is likely to result in an inability to answer the research question of interest. this is particularly so for us, as we assume that the composition of the research team itself reveals innovative potential. a side issue with ethnicity as the defining characteristic for authors in the process of knowledge sharing. traditional knowledge is likely to circulate among limited ethnicities. what might matter more are constraints imposed by the institutional affiliation of the researcher. our measure of homophily is based on affiliations of co-authors, rather than ethnicity. similarity in institutional affiliation of authors results in homophily, as similar resources (research budget, institutional characteristics and knowledge depositories, like access to research databases) are involved in producing research output. dahlander and mcfarland (2013) mention five separate factors in their study of tie formation and continuation in academics: institutional foci, attribute and interest-based homophily, cumulative advantages from tie formation, triadic closure (third party reinforcement) and reinforcement of successful collaborations (tie inertia) as separate factors. 
however, their empirical investigation of these factors also limits itself to dyadic collaborations. in the context of citations in physics journals, bramoullé et al. (2012) notes the presence of homophily and biases, particularly in the formation of new ties, but in a dyadic setup. for studying integration of traditional knowledge systems with modern publication standards, there is no existing theory. we make a weak assumption about incentives that drive co-author incentives to form connections with heterogeneity in institutional affiliations: 18 figures 3 and 4 in appendix 1 depict the ashwagandha and amla research networks as simple graphs. 19 similar ethnic identities of authors indicate high homophily in freeman and huang (2015) . assumption i successful publications in high quality journals drive collaboration incentives [tie inertia, as per dahlander and mcfarland (2013) ]. given a continuum of research journals in ayurveda, it is possible for a researcher to choose his/her research connection to publish papers in any journal in that continuum depending on his/her research grants. the less is the institutional support as well as lower are the benefits of publishing in high quality journals, the less will be the innovative potential in the overall research network. note here that there are no pressures or funding coming from the downstream commercial firms discovering drugs to support research incentives in this stage of research, unlike for bio-medicine [see dunn et al. (2012) on industrysponsored research in the latter]. it is the standards of research itself and an individual researcher's incentive constraints that determine the innovative potential of the research network. our first measure is network density of the herb-specific research network. this measure captures the proportion of potential connections that are actually present in the graph using the simple graph representation of the amla and ashwagandha research networks. network density varies between a maximum value of 1 and a minimum of 0. second, we work out the micro and macro measures of homophily in the two networks. the micro measure is based on the by-paper homophily index defined by freeman and huang (2015) . for a given paper j, we define this measure h j as the sum of the squares of the shares of each affiliation group among the authors of the paper: where n = number of authors; s i = the share of the i th affiliation in the authors of paper j. this measure is akin to the herfindahl-hirschman index (hhi) used to measure concentration in markets, as mentioned by freeman and huang (2015) . note that freeman and huang (2015) define this index based on the ethnic concentration of authors writing a paper. this is straightforward, as an author can be mapped to his or her ethnicity uniquely. we do not work with author ethnicity, as we feel the nature of information flows in collaborations are better captured using the resource constraints represented by institutional affiliations. the affiliation types we consider are university departments, research centres, government-sponsored think tanks etc. there is a variety of such institutions for each author; sometimes authors have multiple affiliations. due to this, we have to provide a tie-breaking rule for authors with multiple affiliations. 
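a sketch of the by-paper homophily index follows; the displayed formula did not survive extraction and is reconstructed here from the description as h_j = Σ_i s_i², the sum of squared affiliation shares among a paper's authors (an hhi-type measure). the input is assumed to be one already-resolved affiliation label per author; the tie-breaking rule that produces such labels is described next and sketched after it.

```python
from collections import Counter

def paper_homophily(affiliations):
    """affiliations: one resolved affiliation label per author of the paper."""
    n = len(affiliations)
    shares = [count / n for count in Counter(affiliations).values()]
    return sum(s ** 2 for s in shares)        # h_j = sum of squared shares

# two authors from different departments -> (1/2)^2 + (1/2)^2 = 0.5
print(paper_homophily(["panchakarma, uttarakhand", "panchakarma, gujarat"]))
# all authors from the same department -> maximum value 1.0
print(paper_homophily(["dept_x", "dept_x", "dept_x"]))
```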
as a baseline, we assume that in cases where authors have multiple affiliations in a paper, the relevant affiliation is the: 1 unique affiliation of any author that is not shared with any of the other author as the relevant affiliation; 2 if the earlier option is not possible (that is, there exists no unique affiliation for the author), then we select the first of the listed affiliations of the author. this tie-breaker assumption is, of course, a bit arbitrary. in a later section, we conduct a robustness check of our results by changing this assumption to see if the regression results hold. in either case, the least homophily is exhibited when all the authors have different institutional affiliations whereas the highest degree of homophily occurs when all the authors belong to the same department in the same institution. if all of the authors on a paper have the same affiliation (i.e., they belong to the same department in an institution), then h j equals 1.0, which is the maximum value of the homophily measure. if the paper has authors of different affiliations, then h j takes different discrete values for papers depending on the number of affiliations and number of authors on a paper. 20 next, we follow up the by-paper homophily measure with the homophily or assortative mixing in the overall herb networks of amla and ashwagandha. in this more macromeasure, we work with a simple graph characterization and therefore, and use coarser categories for affiliation. here, 20 we illustrate the calculation of h j for the general case first and then for cases 1. and 2. mentioned above. for the general case: consider the paper titled 'clinical efficacy of amalaki rasayana in the management of pandu (iron deficiency anemia)' co-authored by s. layeeq (department of panchakarma, uttarakhand ayurved university) and a.b. thakar (department of panchakarma, gujarat ayurved university) in the amla research network. here, h j is the sum of (1∕2) 2 + (1∕2) 2 which is equal to 1/2 since s i for each affiliation is 0.5. now, in the case of tie-breaker 1., for the paper from the ashwagandha research network titled 'antihyperalgesic effects of ashwagandha (withania somnifera root extract) in rat models of postoperative and neuropathic pain', two out of the four authors have multiple affiliations. all the authors are affiliated to korea food research institute, but two are additionally affiliated to the korea university of science and technology. thus, we consider the unique affiliation of the last two authors and h j is calculated as the sum of (2∕4) 2 + (2∕4) 2 and is equal to 0.5. in the case of tie-breaker 2., for the paper titled 'effects of withania somnifera and tinospora cordifolia extracts on the side population phenotype of human epithelial cancer cells: toward targeting multi-drug resistance in cancer' has six authors: n. maliyakkal, a. appadath beeran, s.a. balaji, n. udupa, s. ranganath pai, a. rangarajan. the first author is affiliated to the indian institute of science (iisc) as well as manipal university, the second, fourth and fifth authors are affiliated to manipal university and the third and sixth authors are affiliated to iisc. here, since there are no unique affiliations, we take that the first author is affiliated to iisc (first of the listed affiliations). thus, we calculate h j as the sum of (3∕6) 2 + (3∕6) 2 which is equal to 0.5. 
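the sketch below applies one possible reading of the baseline tie-breaker, chosen so that it reproduces the worked examples above (h_j = 0.5 in both tie-breaker cases); the exact treatment of affiliations shared only among multi-affiliated authors is our assumption, and the institution labels are shortened placeholders.

    from collections import Counter

    def hj(resolved):
        """hhi-style by-paper homophily over resolved affiliations."""
        n = len(resolved)
        return sum((c / n) ** 2 for c in Counter(resolved).values())

    def resolve_paper(paper):
        """apply the baseline tie-breaker to every author on a paper.
        `paper` is a list of affiliation lists, one per author: keep an
        affiliation not held by any single-affiliation co-author; otherwise
        fall back to the first listed affiliation (an interpretation that
        matches the worked examples in the text)."""
        resolved = []
        for i, affils in enumerate(paper):
            if len(affils) == 1:
                resolved.append(affils[0])
                continue
            taken = {a[0] for j, a in enumerate(paper) if j != i and len(a) == 1}
            unique = [a for a in affils if a not in taken]
            resolved.append(unique[0] if unique else affils[0])
        return resolved

    ashwagandha_paper = [["kfri"], ["kfri"], ["kfri", "ust"], ["kfri", "ust"]]
    cancer_paper = [["iisc", "manipal"], ["manipal"], ["iisc"], ["manipal"], ["manipal"], ["iisc"]]
    print(resolve_paper(ashwagandha_paper))        # ['kfri', 'kfri', 'ust', 'ust']
    print(hj(resolve_paper(ashwagandha_paper)))    # (2/4)^2 + (2/4)^2 = 0.5
    print(hj(resolve_paper(cancer_paper)))         # (3/6)^2 + (3/6)^2 = 0.5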
authors are divided into four categories: authors whose institutions are based in india, in sri lanka, in the rest of the world (neither india nor sri lanka), and authors with multiple institution/country affiliations. the separate categories for india and sri lanka are due to the fact that these countries historically have a cultural tradition of ayurveda. we calculate newman's specification [see newman (2010)] of the measure of modularity, q, based on affiliations, to ascertain the presence of homophily or assortative mixing in our networks: q = (1/2m) Σ_ij [a_ij − (k_i k_j)/(2m)] δ(c_i, c_j). here, a_ij is the element of the adjacency matrix between nodes i and j; k_i is the degree of node (author) i, i.e., the number of authors that are connected to node i; c_i is the type of node i, i.e., whether node i has an indian, sri lankan or foreign (other than indian or sri lankan) institutional affiliation, or multiple affiliations; m is the total number of edges in the network; and δ(c_i, c_j) is the kronecker delta, which is 1 when c_i = c_j, i.e., when nodes i and j are of the same type. this q measure has the advantage of comparing the presence of homophily relative to a counterfactual of what kind of connections would be present if, unlike our assumption i, authors randomly chose co-authors for writing research papers. the deliberate strategic choice in collaborative connections, assuming that it increases the chance of publishing in high quality journals, is captured by this measure through its two terms: the first term in the formula for q represents the actual level of assortative mixing in the empirical network, and the second term is the extent of this mixing that we would be likely to see if all the links in the network were created randomly. a positive value of q indicates significant assortative mixing and hence homophily in the network, whereas a near-zero value of q is indicative of very little homophily in the network. the publication standard of academic journals, whose relationship with homophily we study next, is measured using the scimago rank. our assumption is that a high scimago rank is indicative of a high quality of innovation. we use the scimago ranking since it is based on the idea that 'not all citations are equal'. the alternative measure, the average impact factor, is in fact highly correlated with the average scimago rank. the causal relationship we test predicts the manner in which the scimago rank of a journal (dependent variable) varies with our micro measure of homophily (independent variable), with additional controls. for this purpose, we conduct a quantile (percentile) regression, since the distribution of our dependent variable (the scimago ranking of journals) is skewed and not normally distributed. quantile regression is based on the estimation of conditional quantile functions, as against classical linear regression, which is based on minimizing sums of squared residuals. linear regression helps in estimating models for conditional means, whereas quantile regression estimates models for the conditional median as well as other conditional quantiles. further, quantile regression treats outliers and non-normal errors more robustly than ordinary least squares (ols) regression. we contrast our results against the standard ols regression results. we expect that the less 'homophilous' the author affiliations on a paper, the higher the innovative potential of the paper and the likelihood of publication in a higher ranked journal, and therefore the higher the scimago ranking. hence, we expect a negative relationship between h_j and the average scimago ranking.
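as an illustration of the q measure, the sketch below evaluates newman's formula directly on a toy graph with one affiliation category per node; the graph, the node names and the categories are hypothetical, and networkx is assumed only for graph bookkeeping (its community modularity function gives the same value for the induced partition).

    import itertools
    import networkx as nx

    def affiliation_modularity(g, category):
        """newman's modularity for a node attribute:
        q = (1/2m) * sum_ij [a_ij - k_i*k_j/(2m)] * delta(c_i, c_j),
        summed over ordered pairs of nodes of the same category."""
        two_m = 2.0 * g.number_of_edges()
        q = 0.0
        for i, j in itertools.product(g.nodes, repeat=2):
            if category[i] != category[j]:
                continue
            a_ij = 1.0 if g.has_edge(i, j) else 0.0
            q += a_ij - g.degree(i) * g.degree(j) / two_m
        return q / two_m

    # toy co-authorship network: two "indian" and two "foreign" affiliated authors
    g = nx.Graph([("a1", "a2"), ("a3", "a4"), ("a2", "a3")])
    cat = {"a1": "india", "a2": "india", "a3": "foreign", "a4": "foreign"}
    print(round(affiliation_modularity(g, cat), 3))   # 0.167: positive, i.e. assortative mixing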
we use data on research papers from pubmed database , which is maintained by the us national library of medicine and national institutes of health, for a five year period (30 july 2013 to 30 july 2018). it contains more than 28 million citations for biomedical literature from medline, life science journals, and online books. search string matters for all bibliometric research. we found that research papers which appear with the string search 'withania somnifera + ayurveda' are contained in 'ashwagandha + ayurveda' but not vice versa. hence, we used the former search string. for amla, we combined the searches 'amla + ayurveda' and 'emblica officinalis + ayurveda', that is for both traditional/ local name and scientific name, because the union set represents more papers than individual searches, and the brief overview of abstract also shows that the herb has been used in the analysis for the paper. we list information on articles, authors, and the country of institution of the author as well as authors' institutional affiliations. note that if the papers are not available online, we mark the authors' affiliation as not available. also, when an author has co-authored more than one paper, where for one paper the affiliation is given while for others it is not mentioned, then we take the affiliation which has been mentioned as relevant. the number of observations in the ashwagandha network is almost twice that of the amla network, though on an average, an author in each of the networks has the same degree. the graph density of the amla research network is higher (it is 0.035) compared to the ashwagandha network (for which graph density is 0.016). 23 these figures for graph density are extremely low, particularly in comparison with a complete graph (in which every pair of nodes is connected by a unique edge) with density equal to 1. however, relative to the amla network, ashwagangha has more research papers written over the five year period taken in consideration. continuity of knowledge, when many authors are involved in the overall research network, is ensured by: 1 similar per-paper homophily among authors by affiliation ( h j ) in the less dense ashwagandha network (average homophily score is 0.618) compared to the more densely connected amla network (average homophily score is 0.583). 2 higher variation in the quality of journals in the ashwagandha network (measured by the average sjr variable). its standard deviation in the ashwagandha network is relatively high at 0.635 compared to 0.469 for the amla network. 3 higher per-paper homophily ( h j ) in achieving higher quality publications; the value of the average scimago journal rank (sjr) is significantly higher at 0.97 for the ashwagandha network compared to 0.76 for the amla network. this implies compliance of research alliances note that the average scimago rank (average sjr), as shown in the histograms in figs. 1 and 2 respectively for ashwagandha and amla, are significantly skewed. most of the journal papers are clustered in intervals, as the bars of the histograms show in these figures. 24 not only is there an interval-specific clustering, the bulk of journals in both the networks have a low scimago rank. most of our observations (the highest density of journals) are bunched towards very low values on the x-axis, much below the mean average sjr. this can also be read off from the continuous line fitted to the histograms. 
though both the histograms look similar, the fitted line clearly shows that almost all the papers in the amla network are below an average rank of 2, whereas in the ashwagandha network, there is a small presence (less than 50%) of papers above the scimago rank of 2. this reveals that the overall quality of journals and therefore, papers and their innovative potential in the amla network is worse than in the ashwagandha network. this is, of course, beholden to our assumptions about inferences of innovative potential and high quality in papers as reflected by the average scimago (sjr) ranks. we point out here that the average sjr ranking is not a paper-level metric, that is, it will only change when the journal where the paper is published changes. we have papers that are published in the same journal and therefore, we have repeat values of the ranking scores. what we find is that for both networks, the median is less than the mean for all values of average scimago ranks and that low quality publications outnumber higher quality ones in our data. as mentioned earlier, there are multiple ways to measure homophily. other than our per-paper affiliation-based measure, we can comment on the extent of assortative mixing or modularity in the entire network (see the definition in "empirical methodology: measuring channels of adaptive innovation" section). we find existence of assortative mixing or homophily in the overall research networks we study, as the value of q (0.286 for the amla network and 0.425 for the ashwagandha network respectively) is higher than zero. the q measure shows a higher overall homophily in the ashwagandha network relative to the amla one. the simple graphs in figs. 3 and 4 in appendix 1 show the connections in these networks dyadically. these graphs reveal an empirical regularity seen in most modern publication networks: authors in disciplines like traditional medicine mostly work in small connected sub-graphs (indicating that collaborations are deliberate and non-random [see newman (2001) regarding limited number of collaborators in theoretical disciplines like high energy theory]. indian authors rarely form collaborations with sri lankan and other foreign authors. the presence of the latter type of authors is in predominance in the ashwagandha network than in the amla one: it is interesting that despite ayurveda's historical origins in india, foreign institutions outside the south asian region engage with the discipline. however, the nature of these academic endeavours is limited within their own cliques, giving rise to a higher q measure for the ashwagandha network than the amla network. 25 now, the two measures of homophily are not directly comparable, as their objectives are different. the q-measure works out whether connections formed in the network are strategic or random in the network as a whole (the first term in the formula for q works out the extent of strategic connections in the network relative to the second term capturing random connections). the micro measure works out homophily at the level of individual research papers in the network whereas for the calculation of the q-measure, authors are classified in terms of four affiliations (indian, sri lankan, others and multiple affiliations). a positive value of q shows that research links in the networks we study are made strategically, which supports our hypothesis that adaptive innovation works out through some form of homophily in the network. 
the last point that deserves a further exploration is the precise relationship between journal quality and homophily, which we work out in the next section. recall that the scimago rankings were highly skewed in both the networks. hence, we investigate the effect of homophily, after controlling for other network-specific features, on the scimago ranking of journals in each network using a quantile regression. the dependent variable is the average scimago journal ranking in the network and the independent regressor of interest is the homophily index. we control for the degree of corresponding author in the respective network and total number of references 26 and contrast our results with the ordinary least squares (ols) regression to see the effect of the skewness on the causal relationship between journal quality ranking and the independent variables. if skewness matters in the regression, then the ols regression (which predicts the effect of the independent regressor on the mean value of the independent variable, in the presence of other controls) would show a different pattern compared to the effect on the other quantiles. other than the 25th, the 50th (the median) and the 75th quantile, we consider a few other percentiles of the independent variable to depict the nonlinearity. we present in table 3 the results using h j , which is our micro measure of homophily based on freeman and huang (2015)'s methodology. we find that effect of homophily ( h j ) on the average scimago ranking works out differently for (i) different techniques of estimation techniques (ols as opposed to quantile regression) and (ii) different research networks, ashwagandha and amla. the goodness-of-fit measure for amla (as captured by the pseudo-r 2 values) are much lower as compared to those for the ashwagandha network (see appendix 2 for the results for the ashwagandha network). this could presumably be because of the relatively lower number of observations in the amla network. additionally, the homophily measure ( h j ) significantly lowers the quality of journals for the ols regression and the 55th, 60th, 65th and 70th percentiles. hence, the results of the ols average out the effect of homophily on quality of publications and the percentiles depict a comparatively accurate picture. for the ashwagandha network, we find that h j does not significantly impact the average scimago ranking in the ols regression as well as the different quantile regressions (refer to table 5 in appendix 2). the quantile regression, however, shows a non-linear impact on the different quantiles of the independent variable attributable to h j . for the 55th (or the 60th) percentile value of the independent variable, if h j increases by one unit the average sjr decreases by 0.746 (or by 0.752). these results hold at 10% level of significance. however, at the 75th quantile, the effect of h j is no longer significant. in "empirical methodology: measuring channels of adaptive innovation" section, we defined the affiliation-based micro homophily measure with two caveats for the case when authors of papers are affiliated to more than one department/institution. to check whether our results of the previous section hold, we change the two of our earlier assumptions regarding the tie-breaker on papers with multiple author affiliations as follows: i. instead of the unique affiliation for the author with multiple affiliations, we use the most common affiliation (that is, affiliation that is shared by at least one other co-author); ii. 
when there exists no common affiliation, then we take the last among the listed affiliations of the author with multiple affiliations, instead of the first one. we now redefine h j as h r and replicate our ols and quantile regressions. the results for the amla network are shown in table 4 . comparing table 4 with table 3 , we find that our results have not changed in any significant way. thus, we conclude that results using our micro-measure of homophily are robust to changes in the definition of h j . our paper provides a method to understand the nature of innovation (we term this adaptive innovation) that allows a canon of knowledge to not become extinct while ensuring continuity in content. 27 being traditional does not indicate rigidity. nijar (2013) studies customary law and its relationship with traditional knowledge and he observes that these systems are dynamic and exhibit flexibility through 'a process of natural indigenous resources management that embodies adaptive responses'. the presence of these adaptive responses allow for a specific type of dynamic pattern or innovation. hearn et al. (2003) discusses the role that innovation has in complex systems, which, we believe, carries over to traditional knowledge systems. their claim that it is, paradoxically, also true that innovation also requires some stability and security in the form of such things as organisational structure, discipline and focus. makes our research quest less blunt than whether innovation is possible within stable traditional systems of knowledge to a more nuanced search for how to understand the process of innovation in such systems and measure them. of the possible patterns that a complex system can exhibit, hearn et al. (2003) distinguishes four: i. self-referencing: a condition that leads to perpetuation and continuity in knowledge. ii. self-organization: which arises from exogenous changes resulting in adaptations to the existing body of knowledge. iii. self-transformation: that leads to drastic schumpeterian upheavals in established canons of knowledge, mostly through endogenous changes from within the system. iv. extinction: changes that result in complete demise of a system. of these four conditions, traditional medical systems display self-referencing, as processes and institutions that deal with these have resulted in preservation of knowledge for thousands of years. the fact that the last condition of extinction is not the case with traditional medicine, it must be the case that the institutional structures and interactions among practitioners over the years have adapted themselves [self-organization as per hearn et al. (2003) ], leading to selfperpetuation. the continuity of the structure of knowledge in disciplines like ayurveda also imply that schumpeterian innovations or drastic innovation, which would destroy channels for continuing embedded knowledge, are absent. this clearly shows that innovation is not antithetical to traditional knowledge systems, just that the processes of adaptation and change result in perpetuation in knowledge. while we do not expect to see drastic innovation that marks modern bio-medicine, a detailed study of these knowledge systems should reveal very nuanced forms of self-perpetuating adaptations. in the specific context of herb-specific academic paper networks in ayurveda, we find that a lower affiliation-based homophily is causally linked with higher publication ranking, as measured by the scimago ranks of journals publishing these papers. 
however, more diverse collaborations with low homophily are costly, as per our theory and the contentions of dahlander and mcfarland (2013) . simultaneously, low homophily breeds the possibility of content dilution in the knowledge system. therefore, as a natural response to retaining ties with low collaboration cost [as dahlander and mcfarland (2013) would argue], the research networks we study exhibit high levels of homophily, be it through the lens of assortative mixing or affiliation-based homophily measures. a resultant effect is that these ties allow continuity in the content and structure of knowledge itself, despite an adaptation to modern publication standards. this becomes an adaptation strategy for a traditional knowledge system that continues to persist at present with the retention of the basic structure of knowledge. our findings regarding institution-based homophily also resonate with the finding of dunn et al. (2012) that there is homophily among industry-affiliated researchers in bio-medicine. in comparison with non-industry-affiliated researchers, those with industry links publish more often and more so, with each other. this kind of perpetuation of connections seems to be the commonality of research themes that is necessary for research that has similar type of pharmaceutical industry-based funding. our result regarding similarity in institutional affiliations in publications, despite lowering of the quality of publications, indicates not only the ease of finding collaborators [as mentioned by dahlander and mcfarland (2013) ], but also the commonality of content that helps perpetuate knowledge. however, they find a continuation from the industry to research through collaboration links between industry-linked authors. in sharp contrast, the absence of institutions like clinical trials prevent any meaningful incentives for the drug manufacturing industry to invest in the research segment in ayurveda. note one problem with the publications space is its survival bias: we can only study successful collaboration, not the unsuccessful ones. this is a drawback of all studies that investigate collaborations through the space of academic publications [see dahlander and mcfarland (2013) ]. a different issue remains about our method of analysis: are research publications the appropriate space to look for adaptive innovation in traditional knowledge systems? undoubtedly, we use a modern standard and retrofit it to understand collaborative processes in traditional knowledge systems. these disciplines, which have survived many years of transitions are often best seen as lived traditions [see robbins and dewar (2011) for traditional indigenous medicine in the americas]. most practitioners of ayurveda still refer to the classic texts of charaka samhita as relevant texts in their practice. 28 in sum, if traditional medicine adapts itself to modern publication standards, the path it takes is no different from other disciplines that publish in such platforms, such the existence of a small cluster of connected authors in an otherwise sparsely connected network (see figs. 3 and 4 in appendix 1. 29 ) the modalities of the publication platform determine the quality of connections to a large extent when traditional knowledge finds these outlets for knowledge dissemination. what remains suspect is the overall engagement of traditional medicine in particular and traditional knowledge, in general, with modern publication standards. 
recent initiatives by the ministry of ayush, government of india, have resulted in the creation of a repository of modern journals with publications in ayurveda, just as pubmed is an international collection of such publications. however, the researcher and the practitioner are unlikely to be the same agent, as our 2018 survey found. the rise of a culture of knowledge dissemination through journals gives rise to the possibility of a disconnect in traditional disciplines between those who publish in journals and those who practice the discipline. what the two sets of individuals believe about innovation within the discipline is likely to be very different. we have limited our analysis to the space of academic journals in this paper. the overall engagement of ayurveda with modern publication standards and what it does to the discipline is part of our future research agenda and has not been addressed in this paper.
references:
- collaborations: the fourth age of research
- publication criteria and recommended areas of improvement within school psychology journals as reported by editors, journal board members, and manuscript authors
- comparison of pubmed and google scholar literature searches
- whole medical systems versus the system of conventional bio-medicine: a critical, narrative review of similarities, differences, and factors that promote the integration process
- growth rates of modern science: bibliometric analysis based on the number of publications and cited references
- homophily and long-run integration in social networks
- open innovation: a new paradigm for understanding industrial innovation
- ties that last: tie formation and persistence in research collaborations over time
- industry influenced evidence production in collaborative research communities: a network analysis
- dissemination of research results: on the path to practice change
- the direct and indirect impact of culture on innovation
- comparison of scimago journal rank indicator with journal impact factor
- comparison of pubmed, scopus, web of science, and google scholar: strengths and weaknesses
- the focused organization of social ties
- traditional medicine: past, present and future research and development prospects and integration in the national health system of cameroon
- collaborating with people like me: ethnic co-authorship within the united states
- quality of ingredients used in ayurvedic herbal preparations
- the new production of knowledge: the dynamics of science and research in contemporary societies
- locating sources in humanities scholarship: the efficacy of following bibliographic references
- moving the science of team science forward: collaboration and creativity
- phenomenological turbulence and innovation in knowledge systems
- a glimpse of ayurveda - the forgotten history and principles of indian traditional medicine
- ayurveda research publications: a serious concern
- keeping the doctor in the loop: ayurvedic pharmaceuticals in kerala
- ethnic segregation of friendship networks in school: testing a rational-choice argument of differences in ethnic homophily between classroom- and grade-level networks
- the importance of proper citation of references in biomedical articles
- who owns traditional knowledge?
- pubmed: bridging the information gap
- doctoring traditions: ayurveda, small technologies, and braided sciences
- scientific collaboration networks. i. network construction and fundamental results
- traditional knowledge systems, international law and national challenges: marginalization or emancipation?
- time for evidence-based ayurveda: a clarion call for action
- ayurveda education reforms in india
- does language homophily affect migrant consumers' service usage intentions?
- indian systems of medicine: a brief profile
- traditional indigenous approaches to healing and the modern welfare of traditional knowledge, spirituality and lands: a critical reflection on practices and policies taken from the canadian indigenous example
- physicians of colonial india (1757-1900)
- ayurveda at the crossroads of care and cure
- a compilation of bio-active compounds from ayurveda
- quantity and/or quality? the importance of publishing many papers
- friendship in school: gender and racial homophily
- becoming a traditional medicinal plant healer: divergent views of practicing and young healers on traditional medicinal plant knowledge skills in india
- covid-19: combining antiviral and anti-inflammatory treatments
- preparing research articles
- finding knowledge paths among scientific disciplines
- the traditional medicine and modern medicine from natural products
- minerva unbound: knowledge stocks, knowledge flows and new knowledge production
acknowledgements: we thank rinni sharma (doctoral student at uppsala university, sweden) for her assistance in data collection for this paper. we also thank dr. binoy goswami at the faculty of economics, south asian university for giving us comments on the questions in our primary survey.
funding: no external funding was received for our paper.
ethical statement: this article does not contain any studies with human participants or animals performed by any of the authors.
see table 5.
key: cord-025838-ed6itb9u authors: aljubairy, abdulwahab; zhang, wei emma; sheng, quan z.; alhazmi, ahoud title: siotpredict: a framework for predicting relationships in the social internet of things date: 2020-05-09 journal: advanced information systems engineering doi: 10.1007/978-3-030-49435-3_7 sha: doc_id: 25838 cord_uid: ed6itb9u
the social internet of things (siot) is a new paradigm that integrates social network concepts with the internet of things (iot). it boosts the discovery, selection and composition of services and information provided by distributed objects. in siot, searching for services is based on the utilization of the social structure resulting from the formed relationships. however, current approaches lack modelling and effective analysis of siot. in this work, we address this problem and specifically focus on modelling the evolution of the siot. as a growing number of iot objects with heterogeneous attributes join the social network, there is an urgent need to identify the mechanisms by which siot structures evolve. we model the siot over time and address the suitability of traditional analytical procedures to predict future relationships (links) in the dynamic and heterogeneous siot. specifically, we propose a framework, namely siotpredict, which includes three stages: i) collection of the raw movement data of iot devices, ii) generation of temporal sequence networks of the siot, and iii) prediction of the relationships among iot devices that are likely to occur. we have conducted extensive experimental studies to evaluate the proposed framework using real siot datasets, and the results show the superior performance of our framework. crawling the internet of things (iot) to discover services and information in a trusted-oriented way remains a long-standing challenge [21]. many solutions have been introduced to overcome the challenge.
however, due to the increasing number of iot objects in a tremendous rate, these solutions do not scale up. integrating social networking features into the internet of things (iot) paradigm has received an unprecedented amount of attention for the purpose of overcoming issues related to iot. there have been many attempts to integrate iot devices in social loops such as smart-its friend procedure [8] , blog-jects [3] , things that twitter [9] , and ericson project 1 . a new paradigm has emerged from this, called social internet of things (siot), and the key idea of this paradigm is to allow iot objects to establish relationships with each other independently with respect to the heuristics set by the owners of these objects [1, 2, 15, 18] . the perspective of siot is to incorporate the social behaviour of intelligent iot objects and allow them to have their own social networks autonomously. there are several benefits to the siot paradigm. first, siot can foster resource availability and enhance services discovery easily in a distributed manner using friends and friends of friends [1] , unlike traditional iot where search engines are employed to find services in a centralized way. second, the centralized manner of searching iot objects raises scalability issue, and siot overcomes the issue because each iot object can navigate the network structure of siot to reach other objects in a distributed way [2, 20, 21] . third, based on the social structure established among iot objects, things can inquire local neighbourhood for other objects to assess the reputation of these objects. fourth, siot enables objects to start new acquaintance where they can exchange information and experience. many research efforts have been devoted to realizing the siot paradigm. however, the majority of the research activities focused on identifying possible policies, methods and techniques for establishing relationships between smart devices autonomously and without any human intervention [1, 2] . in addition, several siot architectures have been proposed [2, 5, 17] . in spite of the intensive research attempts on siot, there are insufficient considerations to model and analyze the resulted siot networks. the nature of siot is dynamic because it can grow and change quickly over time where nodes (iot objects) and edges (relationships) appear or disappear. therefore, there is a growing interest in developing models that allow studying and understanding this evolving network, in particular, predicting the establishment of future links (relationships) [1, 15] . predicting future relationships among iot objects can be utilized for several applications such as service recommendation and service discovery. thus, there is a need for identifying the mechanisms by which siot structures evolve. this is a fundamental research question that has not been addressed in siot yet, and it forms the motivation for this work. however, the size and complexity of the siot network create a number of technical challenges. firstly, the nature of the resulted network structure is dynamic because smart devices can appear and disappear overtime and the existed relationships may vanish and new relationships may establish. secondly, siot is naturally structured as a heterogeneous graph with different types of entities and various relationships [2, 18] . finally, the size of siot network is mas-sive, and hence, it requires efficient and scalable methods. 
therefore, this paper focuses on modelling the siot network and study, in particular, the problem of predicting future relationships among iot objects. we study the possibility of relationship establishment among iot objects when there is co-occurrence meeting in time and space. our research question centers on how likely two iot objects could create a relationship between each other when they have been approximately on the same geographical location at the same time on multiple occasions. in our work, we develop the siotpredict framework, which includes three stages: i) collecting the raw movement data of iot devices, ii) generating temporal sequence networks of siot, and iii) predicting future relationships that may be established among things. the salient contributions of our study are summarized as follows: -designing and implementing the siotpredict framework for studying the siot network. the siotpredict framework consists of three main stages for i) collecting raw movement data of iot devices, ii) generating temporal sequence networks, and iii) predicting future relationships among things. to the best of our knowledge, our framework is the first on siot relationship prediction. -generating temporal sequence networks of siot. we develop two novel algorithms in the second stage of our framework. the first algorithm identifies the stays of iot objects and extracts the corresponding locations. the second algorithm, named sweep line time overlap, discovers when and where any two iot objects have met. -developing a bayesian nonparametric prediction model. we adopt the bayesian nonparametirc learning to build our prediction model. this model can adapt the new incoming observations due to the power representation and flexibility of bayesian nonparametric learning. -conducting comprehensive experiments to assess our framework. siotpredict has been evaluated by extensive experiments using real-world siot datasets [12] . the results demonstrate that our framework outperforms the existing methods. the rest of this paper is organized as follows. section 2 discusses the related works. section 3 presents heterogeneous graph modeling for social iot and introduces the siotpredict framework. the experimental results on real siot datasets are presented in sect. 4, and finally sect. 5 concludes the paper. siot is still in the infancy stage, and several efforts have been devoted to realizing the siot paradigm. most of the current research activities focused on identifying possible policies, methods and techniques for establishing relationships between smart devices autonomously and without any human intervention [2] . atzori et al. [2] proposed several relationships that can be established between iot objects as shown in table 1 . some of these relationships are static such as por and oor, which can usually be defined in advance. other relationships are dynamic and can be established when the conditions of the relationship are met. roopa et al. [18] defined more relationships that may be established among iot objects. nevertheless, current siot research lacks effective modelling and analysis of siot networks. however, in the context of iot, there are a few attempts to exploiting the relationships among smart devices and users for recommending things to users. yao et al. [23] proposed a hyper-graph based on users' social networks. they used existing relationships among users and their things to infer relationships among iot objects. 
they leveraged the resulting network for recommending things of interest to users. mashal et al. [14] modelled the relationships among users, objects, and services as a tripartite graph with hyper-edges between them, and then explored existing recommendation algorithms to recommend third-party services. nevertheless, these works are mainly based on users' existing relationships and the things they own. atzori et al. [1] emphasized modelling and analyzing the resulting social graphs (uncorrelated to human social networks) among smart objects in order to introduce proper network analysis algorithms. therefore, our work aims to model the siot network in order to allow studying relationship (link) prediction among iot objects, i.e., which relationships may form in the future. link prediction is considered one of the most essential problems in network analysis and has received much attention, in particular when anticipating the network structure at a future time. a large body of work has investigated link prediction from various angles, including similarity-based measures, algorithmic methods, and probabilistic and statistical methods [11, 13]. recently, there has been growing interest in developing probabilistic network models using bayesian nonparametric learning. bayesian nonparametric learning is capable of capturing the network evolution over different time steps by finding latent structure in observed data. latent class models such as stochastic blockmodels (sbs) [7] and mixed membership stochastic blockmodels (mmsb) [19] depend on the vertex-exchangeability perspective, where nodes are the target units to assign into clusters. however, these models suffer from generating dense networks, while most real-world networks tend to be sparse. to overcome this limitation, edge-exchangeable models have been proposed to deal with sparse networks [4, 22]. in this perspective, edges are the main units to assign into clusters. in this section, we describe the dynamic heterogeneous siot graph modelling and then present the details of our siotpredict framework. a dynamic, heterogeneous siot graph is composed of nodes and edges, where nodes represent iot devices and edges represent relationships that can be of multiple different types. the formal definition is as follows. an siot network can be considered as a temporal sequence of networks g^1, g^2, ..., g^t (as depicted in fig. 1a), where each snapshot g^t = ⟨e^t, v^t, x^t⟩: the edge set e^t = {e^t_1, ..., e^t_n} contains the n edges observed at time t, the vertex set v^t is the set of vertices that have participated in at least one edge up to t, and x^t represents the feature matrix at time t, where x_i is the attribute vector of node v_i. throughout the paper, we consider dynamic relationship establishment using siot data as a case study. figure 1b shows an example of the siot heterogeneous network. for this application, we assume that a heterogeneous siot graph has been obtained at time t from the siot data. given these data, we will predict the likelihood of a relationship (edge) being created between any two iot devices (nodes). this section explains our siotpredict framework for predicting future relationships in siot. figure 2 gives an overview of the siotpredict framework.
the framework includes three stages, namely: stage 1: collection of the raw movement data of iot devices; stage 2: generation of the temporal sequence networks of the siot; and stage 3: prediction of future relationships in the siot. in the following, we provide more details on these three stages. in the first stage of our framework, we collect the raw movement data of iot devices. we distinguish two types of iot devices: mobile and static. the coordinates of a static device (e.g., a light pole) are stationary and known, whereas the coordinates of a mobile device (e.g., a bus) are dynamic and change while the device is moving. we assume that mobile devices include gps technology, which provides the location coordinates of these devices along with a timestamp. we also assume that mobile iot devices send their location history records continuously (e.g., every 60 seconds). each record contains some important fields: (device id, latitude, longitude, and timestamp). definition 1. a location history record is represented by a point on the earth (latitude, longitude) and a timestamp. this record tells where an iot object is at a specific time (as illustrated in fig. 3). phase 1) identifying "stays" from the raw movement data and extracting "locations". we are interested in knowing where objects meet. therefore, we first need to identify the stays of all objects using their raw movement data, and then extract locations from these stays. this enables us to identify where and when iot objects have stayed. a stay is a sequence of n location history records, which can be represented by (longitude, latitude, start-time, end-time). longitude and latitude represent the averages of the longitude and latitude values in the sequence; start-time is the smallest timestamp in the sequence, and end-time is the largest timestamp (see fig. 3). definition 4. a location can be the latitude and longitude of one stay, or the average of the longitude and latitude of a group of stays, where the stays in the group are separated by less than or equal to a distance r (see fig. 3). we develop algorithm 1 for identifying stays, extracting locations and then labelling the identified stays with the extracted locations. the input of this algorithm is the raw movement data of the iot objects; the output is a list of identified stays labelled by the extracted locations. the time complexity of this algorithm is quadratic, since it is required to calculate the distances between every two observations in the raw movement data. the first step focuses on stay identification. this step identifies the stays of each iot object from the given raw movement data. first, we define the time period of a stay (for example, a value of stay period = 10 means that we need to identify whether an object stays at a place for 10 min). according to our assumption, each record in the raw movement data is sent every minute, so 10 records represent 10 min. the algorithm calculates the distance between the raw movement data records according to eq. 1, where d is the distance between two location history records, r is the radius of the sphere (earth), θ1 and θ2 are the latitudes of the two location history records, and λ1 and λ2 are their longitudes. it then groups records whose distances are less than or equal to a threshold r.
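eq. 1 itself is not reproduced in the text, so the sketch below uses the standard great-circle (haversine) distance, which is consistent with the variables d, r, θ1, θ2, λ1 and λ2 described above but is our assumption about the exact form; the sample coordinates are arbitrary points near santander.

    from math import radians, sin, cos, asin, sqrt

    def gps_distance(lat1, lon1, lat2, lon2, r=6371000.0):
        """great-circle (haversine) distance d in metres between two location
        history records; r is the earth's radius. the exact form of eq. 1 is an
        assumption consistent with the variables described in the text."""
        th1, th2, la1, la2 = map(radians, (lat1, lat2, lon1, lon2))
        h = sin((th2 - th1) / 2) ** 2 + cos(th1) * cos(th2) * sin((la2 - la1) / 2) ** 2
        return 2 * r * asin(sqrt(h))

    # two records roughly 0.001 degrees apart in latitude: about 111 m
    print(round(gps_distance(43.4623, -3.8100, 43.4633, -3.8100)))  # ~111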
the algorithm then checks each group (lines 18-27): if a group has a number of records larger than or equal to the value of the stay period, the algorithm takes the average of the latitude and longitude of the group; it also takes the smallest timestamp of the group to be the starting time of the stay and the largest timestamp to be the ending time of the stay. the second step targets location extraction (lines 28-40). the algorithm extracts the list of locations from the identified stays. since stays are represented by latitude and longitude, the algorithm calculates the distances among these stays and groups them using a threshold r. the algorithm takes the average of the latitude and longitude of the stays in a group to represent one location. if there is a stay that has not been grouped with any other stay, this stay can itself represent a location. finally, the third step focuses on labelling stays with locations: in this step, we label the identified stays with one of the extracted locations. phase 2) sweep line time overlap algorithm. we develop algorithm 2 to detect and report all overlapping periods occurring among the set of stays produced by algorithm 1. the purpose is to determine whether any two iot objects have met in a location at a particular time. this novel algorithm is named sweep line time overlap (slto). the slto algorithm is inspired by the sweep-line algorithm in geometry, which finds intersections between a group of line segments; the slto algorithm instead identifies whether there are overlapping periods among the stays of the objects at a location. the idea of this algorithm is to run a virtual sweep line parallel to the y-axis and move it from left to right in order to scan intervals on the x-axis. when this sweep line detects overlaps among stays (which look like line segments representing the stay periods of objects), it starts calculating whether there is an overlap and reporting these overlaps. there are two main steps. the first step focuses on storing all the intervals of the stays (lines 2-3). since our goal is to find the overlapping periods among the set of stays, the algorithm initializes two data structures: i) a priority queue q to store all the intervals of the stays obtained from algorithm 1 in sorted order, and ii) a sweep-line status s to scan the stays from left to right. the second step runs the sweep line s (lines 4-10). we take the interval end points from q one by one to allow the sweep line s to scan them. the sweep line detects the start of a stay and adds it to s. when it detects the end of a stay (in this case, the slto has finished scanning that stay), the algorithm checks the last element in s: if it is the start of this stay, then the algorithm removes the stay from s with no further action; if the last element in s is not the start of the finished stay, then one or more overlaps have been detected between this stay and the other active stays in s. algorithm 3 reports the overlaps discovered by slto: it calculates the length of each detected overlap as overlapperiod ← min(l1.end, l2.end) − max(l1.start, l2.start) and reports the overlap together with the ids of the overlapping objects and the overlap period. the time complexity of slto is o(n log n + l), since the stay time is only calculated when overlaps between objects exist (fig. 4). phase 3) generating the temporal networks of the siot. after obtaining the time-overlapping periods of the stays of iot objects, we are able to know the count of meetings that occurred between any two iot objects.
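the following sketch ties the phases together for a toy input: it counts, for every pair of objects, how many of their stays at the same location overlap in time, using the overlap period min(end) − max(start) reported by algorithm 3. the device and location ids are hypothetical, and a simple quadratic per-location comparison stands in for the slto sweep line.

    from collections import defaultdict

    def meeting_counts(stays):
        """count, per pair of objects, how often their stays at the same
        location overlap in time. each stay is (object_id, location_id, start, end)."""
        by_location = defaultdict(list)
        for stay in stays:
            by_location[stay[1]].append(stay)
        counts = defaultdict(int)
        for location_stays in by_location.values():
            for i in range(len(location_stays)):
                for j in range(i + 1, len(location_stays)):
                    o1, _, s1, e1 = location_stays[i]
                    o2, _, s2, e2 = location_stays[j]
                    overlap = min(e1, e2) - max(s1, s2)   # overlapperiod from algorithm 3
                    if o1 != o2 and overlap > 0:
                        counts[tuple(sorted((o1, o2)))] += 1
        return counts

    stays = [("bus_7", "loc_1", 100, 160), ("phone_3", "loc_1", 130, 200),
             ("bus_7", "loc_2", 300, 360), ("phone_3", "loc_1", 400, 450)]
    print(dict(meeting_counts(stays)))   # {('bus_7', 'phone_3'): 1}

these per-pair meeting counts and overlap lengths are exactly what the relationship rules of phase 3 are then checked against.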
according to this, we check the rules of the targeted relationship such as how many times they have met and the length of the interval period. if the rules are met, then we build the temporal sequence of the siot composed from this relationship to be used in our prediction model stage. after generating the temporal sequence networks in the second stage, our next step is to model each one of them using the bayesian non-parametric model [22] . this model allows combining structure elucidation with a predictive performance by clustering links (edges) rather than the nodes. our aim here is to predict links (relationships) between iot objects that are likely to occur in the subsequent snapshot of the network. therefore, the siot network is modelled as an exchangeable sequence of observed links (relationships) and that allows adapting the growth of the network over time. we assume that the siot network clusters into groups, and for this, we model each community using a mixture of dirichlet network distributions. the description of the model as follows: we model the relationships of the siot network using dirichlet distribution g, where δ θi is a delta function centered on θ i , and π i is the corresponding probability of an edge to exist at θ i , with i=1 ∞ π i = 1. the parameter γ controls the total number of nodes in the network. to model the size and number of clusters, we use a stick-breaking distribution gem (α) with concentration parameter α that controls the number of the clusters. the model places a distribution over all clusters, and it places per-cluster distribution over the nodes. to generate an edge, first, a cluster will be picked according to d. then, two nodes (devices) will be sampled according to g. the probability of predicting a link between any two objects is proportional to the product of the degree of these two devices. for the inference part that is based on the bayes' rule (eq. 3), we follow the same steps conducted in [22] to compute the distribution over the cluster assignment using the chinese restaurant process and evaluate the predictive distribution over the n th link, given the previous n−1 links. we perform inference using an markov chain monte carlo (mcmc) scheme [22] . we evaluated the effectiveness and efficiency of the siotpredict framework based on comprehensive experiments. in this section, we discuss the experimental design and report the results. we used the siot datasets 2 to evaluate the siotpredict framework. these datasets are based on real iot objects available in the city of santander and contain a description of iot objects. each object is represented by fields such as (device id, id user, device type, device brand, device model). the total number of iot objects is 16,216. 14,600 objects are from private users and 1,616 are from public services. the dataset includes the raw movement data of devices that are owned by users and the smart city. there are two kinds of devices: static devices and mobile devices. static devices are represented by fixed latitudes and longitudes. mobile devices are represented by latitudes, longitudes, and timestamps. the latitude and longitude values of mobile devices are dynamic. in addition, the dataset includes an adjacency matrix for siot relationship produced with some defined parameters. in table 2 , we only depict sor and sor2 relationships and their parameters to be used in our experiments. in this section, we explain the common metrics and the comparison methods. performance metrics. 
our performance metrics used in the experiments include accuracy, precision, recall, and f1 score. following the work on information diffusion in [6], we define accuracy as the ratio of correctly predicted edges to the total edges in the true network, precision as the fraction of edges in the predicted network that are also present in the true network, recall as the fraction of edges of the true network that are also present in the predicted network, and finally the f1 score as the weighted average of precision and recall. comparison methods. we compare siotpredict against the stochastic blockmodel (sb) [7] and the mixed membership stochastic blockmodel (mmsb) [19]. although the aforementioned models are not explicitly designed for link prediction, they can be modified for the prediction task using the above procedure of selecting the n highest probability edges [22]. in addition, these models suffer from the limitation of assuming a fixed number of vertices. furthermore, we also compared our approach with common link prediction methods [10]: resource allocation, adamic adar index, jaccard coefficient, xgboost, and common neighbor. (roc curves: (a) using nodes in the training set; (b) using nodes outside the training set; see figs. 5 and 6.) we modeled the siot network to predict future interactions among devices, and that enabled us to have a better understanding of the resulting network. based on the existing bayesian models, nodes are assigned to clusters, and these clusters control how the nodes establish relationships. figure 5 and fig. 6 show the performance of siotpredict against the other methods. we used a small network because of the nature of sb and mmsb, which do not scale very well on large networks [16]. we evaluated the performance of these models and methods in two ways. in the first experiment, we used the same nodes (i.e., iot objects) in the training set and the test set; that means there were no nodes in the test set outside the training set. we performed experiments in this way because sb, mmsb and other methods assume that the nodes in the test set are not outside the training set. in the second experiment, the nodes in the test set are outside the training set. for the overall performance, sb does not perform well against mmsb and our model. the reason is that sb assumes that nodes can only belong to one cluster, whereas mmsb performs better than sb because it relaxes this assumption by allowing nodes to belong to more than one cluster. however, sb, mmsb and the other common methods do not perform well compared to our model in both settings, as illustrated in fig. 5 and fig. 6. in particular, these methods perform poorly in the second setting (i.e., when the nodes in the test set are outside the training set) due to their limitation in dealing with new nodes. in contrast, our model delivers similar performance in both settings.
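for reference, the edge-set metrics used above can be computed as in the sketch below; the edge lists are hypothetical, and note that, taken literally, the accuracy definition in the text coincides with recall.

    def edge_metrics(predicted, true):
        """edge-set metrics: precision is the fraction of predicted edges present
        in the true network, recall the fraction of true edges recovered, f1 their
        harmonic mean, and accuracy the ratio of correctly predicted edges to the
        total edges in the true network."""
        predicted = {tuple(sorted(e)) for e in predicted}
        true = {tuple(sorted(e)) for e in true}
        hits = len(predicted & true)
        precision = hits / len(predicted) if predicted else 0.0
        recall = hits / len(true) if true else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        accuracy = hits / len(true) if true else 0.0
        return precision, recall, f1, accuracy

    pred = [("d1", "d2"), ("d2", "d3"), ("d1", "d4")]
    gold = [("d1", "d2"), ("d3", "d2")]
    print(edge_metrics(pred, gold))   # (0.667, 1.0, 0.8, 1.0) up to rounding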
the social internet of things (siot) can foster and enhance resource availability, service discovery, the assessment of object reputations, service composition, and the exchange of information and experience. in addition, the siot enables objects to establish new acquaintances, collaborate to achieve common goals, and exploit other objects' capabilities. therefore, instead of relying on a centralized search engine, the social structure resulting from the created relationships can be utilized to find the desired services. in this paper, we take the research line of siot in a new direction by proposing the siotpredict framework, which addresses the link prediction problem in the siot paradigm. this framework contains three stages: i) collecting the raw movement data of iot devices, ii) generating temporal sequence networks of the siot, and iii) predicting the links that are likely to form between iot objects in the future. ongoing work includes further assessment of the siotpredict framework and enhancement of the relationship prediction by considering the features of iot objects (e.g., the services offered by the objects).
references:
- from smart objects to social objects: the next evolutionary step of the internet of things
- the social internet of things (siot) - when social networks meet the internet of things: concept, architecture and network characterization
- a manifesto for networked objects - cohabiting with pigeons, arphids and aibos in the internet of things
- edge-exchangeable graphs and sparsity
- lysis: a platform for iot distributed applications over socially connected objects
- inferring networks of diffusion and influence
- stochastic blockmodels: first steps
- smart-its friends: a technique for users to easily establish connections between smart artefacts
- things that twitter: social networks and the internet of things
- the link-prediction problem for social networks
- link prediction in complex networks: a survey
- a dataset for performance analysis of the social internet of things
- a survey of link prediction in complex networks
- analysis of recommendation algorithms for internet of things
- friendship selection in the social internet of things: challenges and possible strategies
- bayesian models of graphs, arrays and other exchangeable random structures
- the cluster between internet of things and social networks: review and research challenges
- social internet of things (siot): foundations, thrust areas, systematic review and future directions
- estimation and prediction for stochastic blockmodels for graphs with latent block structure
- searching the web of things: state of the art, challenges, and solutions
- internet of things search engine
- nonparametric network models for link prediction
- things of interest recommendation by leveraging heterogeneous relations in the internet of things
key: cord-003297-fewy8y4a authors: wang, ming-yang; liang, jing-wei; mohamed olounfeh, kamara; sun, qi; zhao, nan; meng, fan-hao title: a comprehensive in silico method to study the qstr of the aconitine alkaloids for designing novel drugs date: 2018-09-18 journal: molecules doi: 10.3390/molecules23092385 sha: doc_id: 3297 cord_uid: fewy8y4a
a combined in silico method was developed to predict potential protein targets that are involved in cardiotoxicity induced by aconitine alkaloids and to study the quantitative structure–toxicity relationship (qstr) of these compounds. for the prediction research, a protein-protein interaction (ppi) network was built from the extraction of useful information about protein interactions connected with aconitine cardiotoxicity, based on nearly a decade of literature and the string database. the software cytoscape and the pharmmapper server were utilized to screen for essential proteins in the constructed network. the calcium-calmodulin-dependent protein kinase ii alpha (camk2a) and gamma (camk2g) were identified as potential targets. to obtain a deeper insight into the relationship between the toxicity and the structure of aconitine alkaloids, the present study utilized qsar models built in sybyl software that possess internal robustness and high external predictivity.
the molecular dynamics simulations carried out here demonstrated that the aconitine alkaloids bind stably to the receptor camk2g. in conclusion, this comprehensive method will serve as a tool for guiding the structural modification of the aconitine alkaloids and lead to better insight into the cardiotoxicity induced by compounds with similar structures and their derivatives.
the rhizomes and roots of aconitum species, a genus of the family ranunculaceae, are commonly used in the treatment of various illnesses such as collapse, syncope, rheumatic fever, joint pain, gastroenteritis, diarrhea, edema, bronchial asthma, and tumors. they are also used in the management of endocrine disorders such as irregular menstruation [1, 2] . however, the usefulness of the aconitine component is intertwined with toxicity once it is administered to a patient. several reports have emphasized that the misuse of this medicinal can result in severe cardio- and neurotoxicity [3] [4] [5] [6] [7] . our past research showed that the aconitine component is the main active ingredient of the root and rhizome of this species and is responsible for both the therapeutic and the toxic effects [8] . the medicinal has been tested for anticancer and dermatological activities: it was found to slow tumor growth and to help cure serious cases of dermatosis, and it also showed an effect on postoperative analgesia [9] [10] [11] [12] . however, a previous safety study revealed that aconitine toxicity is responsible for its restricted use in clinical settings. further studies are needed to explain the cause of aconitine toxicity and to establish whether its usefulness outweighs the toxicity. a combined network analysis and in silico study previously showed that the cardiotoxicity of aconitine is the primary cause of patient death, and that this toxicity is associated with several pivotal proteins such as the ryanodine receptors (ryr1 and ryr2), the gap junction α-1 protein (gja1), and the sodium-calcium exchanger (slc8a1) [9] [10] [11] [12] . however, among the existing studies on this medicinal, none has reported in detail the specific binding target protein linked to its toxicity.
protein-protein interactions (ppis) participate in many processes in living organisms, such as cellular communication, the immune response, and the control of gene expression [13, 14] . a systematic description of these interactions helps to elucidate the interrelationships among targets, and targeting ppis with small-molecule compounds is becoming an essential step in mechanism studies [14] . the present study was designed and undertaken to identify the critical proteins that can affect the cardiotoxicity of aconitine alkaloids. a ppi network built with the string database describes physiological contacts of high specificity established between protein molecules, derived from computational prediction, knowledge transfer between organisms, and interactions aggregated from other databases [15] . the analysis of a ppi network is based on its nodes and edges and is usually performed via cluster analysis and centrality measurements [16, 17] .
in cluster analysis, highly interconnected nodes and protein target nodes are grouped into sub-graphs, and the reliability of the ppi network is assessed from the content of each sub-graph [18] . centrality measurements quantify the relative weight of each protein target in the network [18] . hence, a ppi network of protein targets related to aconitine alkaloid cardiotoxicity should enable us to find the proteins most relevant to aconitine toxicity and to understand the mechanism at the network level. in our research, the evaluation and visualization of the essential cardiotoxicity-related proteins in the ppi network were performed with the clusterone and cytonca plugins in cytoscape 3.5, combined with the conventional pharmacophore matching technology built into the pharmmapper platform to find the potential protein targets.
structural modification of a familiar natural product, active compound, or clinical drug is an efficient route to designing a novel drug. the main purpose of structural modification is to reduce the toxicity of the target compound while enhancing its utility [19] . identifying the structure-function relationship is an essential step in drug discovery and design, and determining 3d protein structures is key to identifying the internal interactions in ligand-receptor complexes. x-ray crystallography and nmr were long the only accepted techniques for determining 3d protein structures; although the structures obtained by these two powerful techniques are accurate and reliable, they are time-consuming and costly to produce [20] [21] [22] [23] [24] . with the rapid development of structural bioinformatics and computer-aided drug design (cadd) techniques in the last decade, computational structures are becoming increasingly reliable, and the application of structural bioinformatics and cadd techniques can improve the efficiency of this process [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] . the ligand-based quantitative structure-toxicity relationship (qstr) and receptor-based docking technology are regarded as effective and useful tools for analyzing structure-function relationships [35] [36] [37] [38] . the contour maps around the aconitine alkaloids generated by comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) were combined with the ligand-substituent/amino-acid interactions obtained from the docking results to gain insight into the relationship between the structure of aconitine alkaloids and their toxicity. scoring functions were used to evaluate the docking results: the fit score in the moe software reflects the binding stability and affinity of the ligand-receptor complexes. when screening for the most likely cardiotoxicity target, the experimental data were combined with the fit score through the ndcg (normalized discounted cumulative gain); the more consistent a protein's ranking is with the experimental data, the more likely that protein is to be a cardiotoxicity target.
since the pioneering paper entitled "the biological functions of low-frequency phonons" [39] was published in 1977, many investigations of biomacromolecules from a dynamic point of view have been carried out. these studies have suggested that low-frequency (or terahertz frequency) collective motions do exist in proteins and dna [40] [41] [42] [43] [44] .
furthermore, many important biological functions in proteins and dna and their dynamic mechanisms, such as cooperative effects [45] , the intercalation of drugs into dna [42] , and the assembly of microtubules [46] , have been revealed by studying these low-frequency internal motions, as summarized in a comprehensive review [40] . some scientists have even applied this kind of low-frequency internal motion to medical treatments [47, 48] . investigation of the internal motions in biomacromolecules and their biological functions is deemed a "genuinely new frontier in biological physics," as announced in the mission statements of some biotech companies (see, e.g., vermont photonics). besides the static structural information of the ligand-receptor complex, dynamical information should therefore also be considered in the process of drug discovery [49, 50] . finally, molecular dynamics simulations were carried out to verify the binding affinity and stability between the aconitine alkaloids and the most likely target. the present study may be instrumental in our future work on the synergism and attenuation of aconitine alkaloids and on the exploitation of their clinical application potential. a flowchart of the procedures in our study is shown in figure 1 (the whole framework of the comprehensive in silico method for screening potential targets and studying the quantitative structure-toxicity relationship, qstr).
the 33 compounds were aligned under the superimposition of the common moiety on template compound 6. the statistical parameters for the database alignment-q 2 , r 2 , f, and see-are summarized in table 1 . the comfa model, with an optimal number of 6 components, presented a q 2 of 0.624, an r 2 of 0.966, an f of 124.127, and an see of 0.043; the contributions of the steric and electrostatic fields were 0.621 and 0.379, respectively. the comsia model, with an optimal number of 4 components, presented a q 2 of 0.719, an r 2 of 0.901, an f of 157.458, and an see of 0.116; the contributions of the steric, electrostatic, hydrophobic, hydrogen bond acceptor, and hydrogen bond donor fields were 0.120, 0.204, 0.327, 0.216, and 0.133, respectively. these statistical results show that the comfa and comsia qstr models of the aconitine alkaloids under the database alignment have adequate predictability. experimental and predicted pld 50 values for both the training set and the test set are shown in figure 2 ; the comfa ( figure 2a ) and comsia ( figure 2b ) models gave correlation coefficient (r 2 ) values of 0.9698 and 0.977, respectively, which demonstrates the internal robustness and high external predictivity of the qstr models. residuals vs. leverage williams plots of the aconitine qstr models are shown in figure 3a ,b. all standardized residuals fall between −3σ and 3σ, and all leverage values are less than h*, so the two models demonstrate good extensibility and predictability.
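the cross-validated q 2 and the conventional r 2 reported above can be illustrated with a schematic computation. the snippet below is a hedged sketch, not the sybyl implementation: scikit-learn's PLSRegression stands in for the pls engine, and the descriptor matrix and pld 50 vector are assumed inputs.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression


def loo_q2(X, y, n_components=4):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_total."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        model = PLSRegression(n_components=n_components).fit(X[mask], y[mask])
        y_hat = float(np.ravel(model.predict(X[i:i + 1]))[0])   # predict the left-out compound
        press += (y[i] - y_hat) ** 2
    ss_total = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / ss_total


def r2(y_true, y_pred):
    """Conventional (non-cross-validated) r^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# usage sketch with hypothetical arrays of field descriptors and pLD50 values:
# print(loo_q2(X_fields, pld50, n_components=4))
```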
under mesh (medical subject headings), a total of 491 articles were retrieved (261 from web of science and the rest from pubmed). after selecting the cardiotoxicity-related articles and excluding duplicates, 274 articles were used to extract the correlative proteins and pathways for building a ppi network in the string server. the correlative proteins and pathways are shown in table 2 . all proteins were taken as input proteins in the string database to find their direct and functional partners [51] , and the proteins and their partners were then imported into cytoscape 3.5 to generate the ppi network, which has 148 nodes and 872 edges ( figure 4 ).
table 2 . proteins related to aconitine alkaloid-induced cardiotoxicity extracted from the 274 articles.
gene | classification | frequency
ryr2 | ryanodine receptor 2 | 19
ryr1 | ryanodine receptor 1 | 15
gja1 | gap junction α-1 protein (connexin43) | 13
slc8a1 | sodium/calcium exchanger 1 | 11
atp2a1 | calcium transporting atpase fast twitch 1 | 9
kcnh2 | potassium voltage-gated channel h2 | 7
scn3a | sodium voltage-gated channel type 3 | 3
scn2a | sodium voltage-gated channel type 2 | 3
scn8a | sodium voltage-gated channel type 8 | 2
scn1a | sodium voltage-gated channel type 1 | 2
scn4a | sodium voltage-gated channel type 4 | 1
kcnj3 | potassium inwardly-rectifying channel j3 | 1
for the screening of the essential proteins in the ppi network, three centrality measurements (subgraph centrality, betweenness centrality, and closeness centrality) from cytonca were used to evaluate the weight of the nodes. after removing the central node "ac," the centrality measurements of the remaining 147 nodes were calculated by cytonca and documented in table s1 . the nodes in the top 10% of each of the three centrality measures are painted with a different color in figure 4a . to select the nodes with high values for all three centrality measures, the nodes carrying all three colors were overlapped and merged into the sub-networks shown in figure 4b . in the sub-networks, the voltage-gated calcium and sodium channels account for a large proportion, which is consistent with our clustering of the network (clusters 1, 2, and 9). all proteins in the sub-networks were combined with the prediction results of the pharmmapper server to obtain the potential targets of the cardiotoxicity induced by aconitine alkaloids ( figure 5a ,b). in the meantime, 2v7o (camk2g) and 2vz6 (camk2a) were identified as the potential targets with the higher fit scores.
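the node-screening step described above (keeping the nodes that rank in the top 10% of all three centrality measures and intersecting them) can be sketched as follows. this is an illustrative reconstruction, not the cytonca workflow itself; networkx's built-in centralities are used as stand-ins, and its closeness definition differs slightly from the one used in cytonca.

```python
import networkx as nx


def top_fraction(scores, fraction=0.10):
    """Nodes whose score lies in the top `fraction` of the ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k])


def screen_essential_nodes(G, fraction=0.10):
    """Intersect the top-ranked nodes of the three centrality measures."""
    measures = (nx.subgraph_centrality(G),
                nx.betweenness_centrality(G),
                nx.closeness_centrality(G))
    essential = top_fraction(measures[0], fraction)
    for m in measures[1:]:
        essential &= top_fraction(m, fraction)
    return essential

# usage sketch: G would be the 147-node network obtained after removing the central "ac" node
# print(screen_essential_nodes(G))
```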
all compounds were docked into the potential targets, and the resulting ndcg values are shown in table 3 . the docking studies gave ndcg values of 0.9122 for 2v7o and 0.8503 for 2vz6 (the detailed docking results are shown in table s2 ), which shows that the docking results for 2v7o are the most consistent with the experimental pld 50 ranking; the protein 2v7o was therefore used for the ligand interaction analysis.
table 3 . ranking results by experimental pld 50 and by fit score.
compound | experimental pld 50 rank | fit score rank (2v7o) | fit score rank (2vz6)
6 | 1 | 3 | 3
20 | 2 | 1 | 12
12 | 3 | 4 | 9
1 | 4 | 2 | 4
11 | 5 | 7 | 2
14 | 6 | 8 | 13
16 | 7 | 5 | 6
7 | 8 | 17 | 15
8 | 9 | 10 | 11
27 | 10 | 23 | 17
13 | 11 | 12 | 19
15 | 12 | 11 | 5
32 | 13 | 18 | 18
5 | 14 | 22 | 8
33 | 15 | 13 | 29
21 | 16 | 15 | 1
25 | 17 | 9 | 20
22 | 18 | 25 | 25
17 | 19 | 20 | 16
28 | 20 | 24 | 30
9 | 21 | 16 | 32
29 | 22 | 32 | 14
2 | 23 | 30 | 24
30 | 24 | 31 | 26
18 | 25 | 21 | 27
10 | 26 | 26 | 21
23 | 27 | 29 | 31
31 | 28 | 33 | 7
26 | 29 | 14 | 23
4 | 30 | 28 | 33
3 | 31 | 6 | 10
19 | 32 | 27 | 28
24 | 33 | 19 | 22
ndcg | 1 | 0.9122 | 0.8503
the 3d-qstr contour maps were used to visualize the information on the comfa and comsia model properties in three-dimensional space. these maps highlight the characteristics of the compounds that are crucial for activity and display the regions around the molecules where variations in activity are expected based on changes in their physicochemical properties [52] . the analysis of the favorable and unfavorable regions of the steric, electrostatic, hydrophobic, hbd, and hba fields contributes to understanding the relationship between the toxic activity of the aconitine alkaloids and their structure. steric and electrostatic contour maps of the comfa qstr model are shown in figure 6a ,b, respectively; hydrophobic, hbd, and hba contour maps of the comsia qstr model are shown in figure 6c -e. compound 6 has the most toxic activity, so it was chosen as the reference structure for the generation of the comfa and comsia contour maps.
in the comfa study, the steric contour map around compound 6 is shown in figure 6a . the yellow regions near r2, r7, and r6 surround substituents of the molecule and indicate that these positions are not suitable for sterically bulky functional groups. therefore, compounds 19, 24, and 26 (with pld 50 values of 1.17, 0.84, and 1.82, respectively), which carry bulky esterified moieties at positions r2 and r7, were less toxic than compounds 6 and 20 (with pld 50 values of 5.00 and 4.95), which are substituted by a small hydroxyl group, and compound 3 (with a pld 50 value of 1.44) is less toxic because of the esterified moiety at r6. the green regions, where sterically bulky groups favor toxicity, were located around r9.
the comfa electrostatic contour map is shown in figure 6b . the blue regions near the r2 and r7 substitutions reveal that electropositive groups at these positions favor toxicity; this is supported by the fact that the compounds with hydroxy at these two positions have higher pld 50 values than the compounds with acetoxy or no substituents. the red regions surrounding the molecular scaffold are not distinct, which indicates that there is no clear connection between electronegative groups and toxicity. the comsia hydrophobic contour map is shown in figure 6c .
the r2, r6, and r7 positions fall within the white regions, indicating that hydrophobic groups there are unfavorable for toxicity, so esterification of the hydrophilic hydroxyl or dehydroxylation decreases the toxicity, which is consistent with the steric and electrostatic contour maps. the yellow contour near r12 indicates that a hydrophilic hydroxy at this position is unfavorable to toxicity, which is supported by the lower pld 50 values of the aconitine alkaloids with hydroxy substituents at r12, such as compound 10.
the comsia hbd contour map is shown in figure 6d . the cyan regions at r2, r6, and r7 represent favorable positions for hbd atoms, which clearly agrees with the fact that the compounds with hydroxy in these regions show potent toxicity. a purple region was found near r12, indicating that an hbd atom (hydroxyl) at this position has an adverse effect on toxicity. the hba contour map is shown in figure 6e . the magenta region around the r1 substitution shows that this position is favorable for hba atoms, so compounds 13, 15, 32, and 33, which carry an hba atom at r1, exhibit more potent toxicity (with pld 50 values of 3.52, 3.30, 3.16, and 2.84) than the compounds with methoxymethyl substituents (compounds 19, 24, and 26, with pld 50 values of 1.17, 0.84, and 1.82). the red contours, where hba atoms are unfavorable for toxicity, are positioned around r2 and r6; these contours are well validated by the lower pld 50 values of the compounds with carbonyl groups at these positions.
the ppi network of aconitine alkaloid cardiotoxicity was divided into nine clusters using clusterone; the statistical parameters are shown in figure 5 . six clusters, namely clusters 1, 3, 4, 5, 7, and 9, which possess quality scores higher than 0.5, a density higher than 0.45, and a p-value less than 0.05, were selected for further analysis ( figure 7 ). clusters 1, 4, and 7 consist of proteins mainly involved in the effects of various calcium, potassium, and sodium channels: cluster 1 mainly consists of three channel types related to the cardiotoxicity of aconitine alkaloids, cluster 4 contains calcium and sodium channels and some channel exchangers (such as ryr1 and ryr2), and cluster 7 mainly consists of various potassium channels. all of these findings are consistent with previous research on the arrhythmogenic properties of aconitine alkaloids: aconitine binds to ion channels and affects their open state, and thus the corresponding ion influx into the cytosol [53] [54] [55] . the channel exchangers play a crucial role in maintaining ion transport and homeostasis inside and outside the cell. cluster 9 contains some regulatory proteins that can activate or repress the ion channels at the protein expression level. atp2a1, ryr2, ryr1, cacna1c, cacna1d, and cacna1s mediate the release of calcium, thereby playing a key role in triggering cardiac muscle contraction and maintaining calcium homeostasis [56, 57] .
aconitine may cause aberrant channel activation and lead to cardiac arrhythmia. clusters 3 and 5 consist of camp-dependent protein kinase (capk), cgmp-dependent protein kinase (cgpk), and guanine nucleotide-binding proteins (g proteins). it has not been fully established whether the cardiotoxicity induced by aconitine alkaloids is linked to capk, cgpk, and the g proteins; however, some studies have shown that the cardiotoxicity-related protein kcnj3 (a potassium inwardly-rectifying channel) is controlled by g proteins, and the cardiac sodium/calcium exchanger is said to be regulated by capk and cgpk [58, 59] . the clusterone result indicates that the constructed network is consistent with existing studies and that the network can therefore be used to screen essential proteins with the cytonca plugin.
the protein 2v7o belongs to the camkii (calcium/calmodulin (ca 2+ /cam)-dependent serine/threonine kinase ii) isozyme family, which plays a central role in cellular signaling by transmitting ca 2+ signals. the camkii enzymes transmit calcium ion (ca 2+ ) signals released inside the cell by regulating signal transduction pathways through phosphorylation: ca 2+ first binds to the small regulatory protein cam, and the ca 2+ /cam complex then binds to and activates the kinase, which in turn phosphorylates other proteins such as the ryanodine receptor and the sodium/calcium exchanger. these proteins are thus related to the cardiotoxicity induced by aconitine alkaloids [60] [61] [62] . excessive camkii activity has been observed in some structural heart diseases and arrhythmias [63] , and past findings demonstrate neuroprotection in neuronal cultures treated with camkii inhibitors immediately prior to excitotoxic activation of camkii [64] . the acute cardiotoxicity of the aconitine alkaloids is therefore possibly related to this target. based on the analysis of the ppi network above, camkii was selected as the potential target for further molecular docking and dynamics simulation.
the docking result for 2v7o is shown in figure 8a . compound 20 has the highest fit score, so it was selected as the template for conformational analysis. the mechanisms of camkii activation and inactivation are shown in figure 8b . compound 20 affects the normal energy metabolism of the myocardial cell by binding in the atp-competitive site ( figure 8c ). the inactive state of camkii is regulated by cask-mediated t306/t307 phosphorylation, and this state can be inhibited by the binding of compound 20 in the atp-competitive site. such binding shifts camkii toward the ca 2+ /cam-dependent active state, reached through structural rearrangement of the inhibitory helix caused by ca 2+ /cam binding and the subsequent autophosphorylation of t287 [65] ; this induces excessive camkii activity and a dynamic imbalance of calcium ions in the myocardial cell, eventually leading to heart disease and arrhythmias.
knowledge of the binding pocket of a receptor for its ligand is very important for drug design, particularly for conducting mutagenesis studies [28] . as has been reported in the past [66] , the binding pocket of a protein receptor for a ligand is usually defined by those residues that have at least one heavy atom within a distance of 5 å from a heavy atom of the ligand. such a criterion was originally used to define the binding pocket of atp in the cdk5-nck5a complex [20] , and it later proved very useful in identifying functional domains and stimulating the relevant truncation experiments. a similar approach has also been used to define the binding pockets of many other receptor-ligand interactions important for drug design [30, 31, 33, [67] [68] [69] [70] . the information on the binding pocket of camkii for the aconitine alkaloids will therefore serve as a guideline for designing drugs with similar scaffolds, particularly for conducting mutagenesis studies.
in figure 8a , the four compounds with the top fit scores-compounds 1, 6, 12, and 20-generated similar significant interactions with the amino acid residues around the atp-competitive binding pocket. the four compounds formed many van der waals interactions within the pocket through amino acid residues such as asp157, lys43, glu140, lys22, and leu143. the ligand-receptor interactions show that the hydroxy at r2 forms a side-chain donor interaction with asp157; in addition, the hydroxy groups at r6 and r7 form side-chain acceptor interactions with glu140 and ser26, respectively (docking results of compounds 6 and 12 in figure 8a ). these results correspond to the comfa and comsia contour maps: the small electropositive and hydrophilic groups at r2, r6, and r7 enhance the toxicity to a certain extent. there are also aromatic interactions between the phenyl group at r9 and the amino acid residues: the phenyl group at r9 forms aromatic interactions with leu20, leu142, and phe90, whereas a small hydroxyl group does not form any interaction with asp91, which demonstrates that the bulky phenyl group is crucial to this binding pattern and to toxicity. this largely matches the comfa steric contour map, in which r9 is ideal for sterically bulky groups. the methoxymethyl at r1 generates a backbone acceptor interaction with lys43, which corresponds to the comsia hba contour map, in which r1 is favorable for hba atoms.
( figure 8 : compound 20 docked into 2v7o; the atp-competitive pocket is painted green; the t287, t307, and t308 phosphorylation sites are painted green, orange, and yellow, respectively; the inhibitory helix is painted red.)
the result of the md simulation is shown in figure 9 ; the red curve represents the rmsd values of the docked protein. the rmsd reached 2.41 å at 1.4 ns and then remained between 2 and 2.5 å throughout the simulation of up to 5 ns, and the averaged rmsd value was 2.06 å. the md simulation therefore demonstrates that the ligand is stabilized in the active site.
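the rmsd curve discussed above can be reproduced schematically from any trajectory of atomic coordinates. the fragment below is a minimal sketch: it computes the rmsd of each frame against a reference frame from plain coordinate arrays, and leaves out the trajectory loading and structural superposition that tools such as vmd perform; the `frames` variable is a hypothetical placeholder.

```python
import numpy as np


def rmsd(frame, reference):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays."""
    diff = np.asarray(frame, float) - np.asarray(reference, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def rmsd_trajectory(frames, reference):
    """RMSD of every frame against a reference structure."""
    return [rmsd(f, reference) for f in frames]

# hypothetical usage once `frames` has been loaded from a trajectory:
# values = rmsd_trajectory(frames, frames[0])
# print(np.mean(values))   # average RMSD over the simulation
```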
finally, we combined the ligand-based 3d-qstr analysis with the structure-based molecular docking study to identify the moieties responsible for the cardiotoxicity mechanism of the aconitine alkaloids ( figure 10 : the crucial requirements of the cardiotoxicity mechanism obtained from the ligand-based 3d-qstr and the structure-based molecular docking study).
to build the ppi network of the aconitine alkaloids, literature from 1 january 2007 to 31 february 2017 was retrieved from pubmed (http://pubmed.cn/) and web of science (http://www.isiknowledge.com/) with the mesh words "aconitine" and "toxicity" and without language restriction. all documents about cardiotoxicity caused by aconitine alkaloids were collected. the proteins related to aconitine alkaloid cardiotoxicity over this decade were gathered and taken as input proteins for the string (https://string-db.org/) database [51, 71] , which was used to search for related proteins and pathways that had been reported. finally, all the proteins and their partners were recorded in excel in order to import the information and build a ppi network in the cytoscape software. cytoscape is a free, open-source, java application for visualizing molecular networks and integrating them with gene expression profiles [71, 72] . plugins are available for network and molecular profiling analyses, new layouts, additional file format support, making connections with databases, and searching within large networks [71] .
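as a sketch of the bookkeeping step described above (recording each protein and its string partners before importing them into cytoscape), the fragment below writes the interaction pairs to a simple two-column edge table that cytoscape can import; the interaction pairs, column names, and file name are illustrative placeholders only.

```python
import csv

# hypothetical interaction pairs gathered from the literature and the STRING database
interactions = [
    ("RYR2", "CAMK2G"),
    ("RYR2", "ATP2A1"),
    ("SLC8A1", "CAMK2G"),
]

with open("aconitine_ppi_edges.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])   # edge table header understood by Cytoscape's importer
    writer.writerows(interactions)
```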
cytonca is a plugin in cytoscape integrating calculation, evaluation, and visualization analysis for multiple centrality measures. there are eight centrality measurements provided by cytonca: betweenness, closeness, degree, eigenvector, local average connectivity-based, network, subgraph, and information centrality [74] . the primary purpose of the centrality analysis was to confirm the essential proteins in the pre-built ppi network. the three centrality measurements in the cytonca plugin-subgraph centrality, betweenness centrality, and closeness centrality-were used for evaluating and screening the essential protein in the merged target network. the subgraph centrality characterizes the participation of each node in all subgraphs in a network. smaller subgraphs are given more weight than larger ones, which makes this measurement an appropriate one for characterizing network properties. the subgraph centrality of node "u" can be calculated by [75] µ l (u) is the uth diagonal entry of the lth power of the weight adjacency matrix of the network. v 1 , v 2 , . . . , v n is be an orthonormal basis composed of r n composed by eigenvectors of a associated to the eigenvalues λ 1 , λ 2 , . . . , λ n v u v , which is the uth component of v v [75] . the betweenness centrality finds a wide range of applications in network theory. it represents the degree to which nodes stand between each other. betweenness centrality was devised as a general measure of centrality. it is applicable to a wide range of problems in network theory, including problems related to social networks, biology, transport, and scientific cooperation. the betweenness centrality of a node u can be calculated by [76] ρ (s, t) is the total number of shortest paths from node s to node ρ (s, u, t), which is the number of those paths that pass through u. closeness centrality of a node is a measure of centrality in a network, calculated as the sum of the length of the shortest paths between the node and all other nodes in the graph. thus, the more central a node is, the closer it is to all other nodes. the closeness centrality of a node u can be calculated by [77] |nu| is the number of node u's neighbors, and dist (u, v) is the distance of the shortest path from node u to node v. pharmmapper serves as a valuable tool for identifying potential targets for a novel synthetic compound, a newly isolated natural product, a compound with known biological activity, or an existing drug [78] . of all the aconitine alkaloids in this research, compounds 6, 12, and 20 exhibited the most toxic activity and were used for the potential target prediction. the mol2 format of three compounds was submitted to the pharmmapper server. the parameters of generate conformers and maximum generated conformations was set as on and 300, respectively. other parameters used default values. finally, the result of the clusterone and pharmmapper will be combined together to select the potential targets for the following docking study [78] . comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) are efficient tools in ligand-based drug design and are in use for contour map generation and identification of favorable and unfavorable regions in a moiety [52, 79] . 
comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) are efficient tools in ligand-based drug design and are used for contour map generation and the identification of favorable and unfavorable regions of a moiety [52, 79] . the comfa model consists of steric and electrostatic fields of the molecules that are correlated with the toxic activity, while the comsia model additionally includes hydrophobic and hydrogen bond donor (hbd)/hydrogen bond acceptor (hba) fields [80] , together with the steric and electrostatic fields correlated with the toxic activity. comfa and comsia were used to generate the 3d-qstr models [81] , and all molecular modeling and 3d-qstr generation were performed with sybyl x2.0.
the ld 50 values of the aconitine alkaloids in mice listed in table 4 were extracted from recent literature [70] , and the ld 50 values of all aconitine alkaloids were converted into pld 50 values. these pld 50 values were used as the dependent variable, while the comfa and comsia descriptors were used as independent variables. the sketch function of sybyl x2.0 was used to build the structures, and the charges were calculated by the gasteiger-huckel method. additionally, the tripos force field was used for the energy minimization of the aconitine alkaloid molecules [81] . the 31 molecules were divided into training and test sets in a ratio of 3:1; the division was done so that both datasets are balanced and contain both active and less active molecules [81] . the reliability of the 3d-qstr model depends on the database molecular alignment: the most toxic aconitine alkaloid (compound 6) was selected as the template molecule, and the tetradecahydro-2h-3,6,12-(epiethane [1,1,2] triyl)-7,9-methanonaphtho [2,3-b] azocine core was selected as the common moiety.
pls (partial least squares) techniques relate the field descriptors to the activity values and provide statistics such as [80] the leave-one-out (loo) values, the optimal number of components, the standard error of estimation (see), the cross-validated coefficient (q 2 ), and the conventional coefficient (r 2 ). these statistics are pivotal in the evaluation of the 3d-qstr model and can be obtained with the pls method [81] . a model is considered good when the q 2 value is above 0.5 and the r 2 value is above 0.6; the q 2 and r 2 values reflect a model's soundness, and the best model has the highest q 2 and r 2 values, the lowest see, and an optimal number of components [80, 82, 83] . in the comfa and comsia analyses, the optimal number of components, see, and q 2 were obtained by loo validation with "use sampls" turned on and components set to 5, while for the calculation of r 2 , "use sampls" was turned off and the column filtering was set to 2.0 kcal mol −1 in order to speed up the calculation without sacrificing information content [81] [82] [83] [84] . the numbers of components were therefore set to 6 and 4, respectively, which were the optimal numbers obtained from the sampls run. see and r 2 were used to assess the non-cross-validated models.
the applicability domain (ad) of the comfa and comsia models was confirmed by the williams plot of residuals vs. leverage. the leverage of a query chemical is proportional to its mahalanobis distance from the centroid of the training set [85, 86] . the leverages for a given dataset x are the diagonal elements of the leverage (hat) matrix h = x (x^t x)^{-1} x^t, where x is the model matrix and x^t is its transpose. the plot of standardized residuals vs. leverage values was drawn, and compounds with standardized residuals greater than three standard deviation units (±3σ) were considered outliers [85] .
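the leverage-based applicability-domain check described above follows directly from the hat matrix h = x (x^t x)^{-1} x^t and the critical leverage h* = 3p/n discussed in the next paragraph. the snippet below is a schematic sketch, not the software used in the study; the inputs (descriptor matrix, residuals, number of model variables) are assumed.

```python
import numpy as np


def leverages(X):
    """Diagonal of the hat matrix H = X (X^T X)^-1 X^T."""
    X = np.asarray(X, dtype=float)
    return np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)


def williams_plot_data(X, residuals, n_variables):
    """Leverage values, standardized residuals, and the critical leverage h* = 3p/n."""
    residuals = np.asarray(residuals, dtype=float)
    h = leverages(X)
    std_res = (residuals - residuals.mean()) / residuals.std()
    p = n_variables + 1                 # number of model variables plus one
    h_star = 3.0 * p / np.asarray(X).shape[0]
    return h, std_res, h_star

# compounds with |standardized residual| > 3 or leverage > h_star lie outside the applicability domain
```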
the critical leverage value is taken as h* = 3p/n, where p is the number of model variables plus one and n is the number of objects used to calculate the model; a leverage h > 3p/n means that the predicted response is not acceptable [85] [86] [87] .
moe (molecular operating environment) is a computer-aided drug design (cadd) software program that incorporates the functions of qsar, molecular docking, molecular dynamics, adme (absorption, distribution, metabolism, and excretion) prediction, and homology modeling; all of these functions are regarded as useful instruments in the field of drug discovery and biochemistry. the molecular docking and dynamics calculations were performed in the moe2016 software to assess the stability and affinity between the ligands and the predicted targets [88, 89] . the docking process involves the prediction of the ligand conformation and orientation within a targeted binding site, and docking analysis is an important step in this process: it has been widely used to study reasonable binding modes and to obtain information on the interactions between the amino acids in active protein sites and the ligands. the molecular docking analysis was carried out to determine the toxicity-related moieties of the aconitine alkaloids through the ligand-amino-acid interaction function in moe2015. the pdb files of 2v7o and 2vz6 were downloaded from the pdb (protein data bank) database (https://www.rcsb.org/), and the mol2 files of the compounds were taken from the sybyl qstr study. the structure preparation function in moe was used to minimize the energy and optimize the structure of the protein skeleton. based on the london dg score and induced fit refinement, all compounds were docked into the active site of every potential target, taking the score values as the scoring function [90] .
the dcg (discounted cumulative gain) algorithm was utilized to examine the consistency between the ranking by experimental pld 50 and the ranking obtained in our research (the fit scores of the docking study). the dcg is computed from the pld 50 -based relevance of the compounds in the ranked list, and the idcg (ideal dcg) is the dcg of the list ordered by pld 50 ; the closer the normalized discounted cumulative gain (ndcg) value is to 1, the better the consistency [91] .
preliminary md simulations for the model protein were performed using the program namd (nanoscale molecular dynamics, v 2.9), and all input files were generated using visual molecular dynamics (vmd). namd is freely available software designed for high-performance simulation of large biomolecular systems [92] . during the md simulation, minimization and equilibration of the original and docked proteins were carried out in a 15 å water box. a charmm22 force field file was applied for the energy minimization and equilibration, with gasteiger-huckel charges and boltzmann-distributed initial velocities [93, 94] . the integrator used a 2 fs time step with all bonds kept rigid; nonbonded interactions were evaluated every step and full electrostatics every 2 steps, with 10 steps per cycle [93] . the particle mesh ewald method was used for the electrostatic interactions of the simulation system under periodic boundary conditions, with a grid spacing of 1.0 å [94] . the pressure was maintained at 101.325 kpa using the langevin piston method, and the temperature was controlled at 310 k using langevin dynamics. covalent bonds between hydrogen and heavy atoms were constrained using the shake/rattle algorithm. finally, 5 ns md simulations of the original and docked proteins were carried out to compare and verify the binding affinity and stability of the ligand-receptor complex.
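the ndcg consistency check described above can be sketched as follows. the exact relevance weighting used by the authors is not given in the text, so the standard logarithmic dcg discount with the experimental pld 50 values as relevance scores is assumed here; the usage line is hypothetical.

```python
import numpy as np


def dcg(relevances):
    """Discounted cumulative gain of a relevance list (standard log2 discount assumed)."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))
    return float(np.sum(relevances / discounts))


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted ordering divided by the DCG of the ideal (sorted) ordering."""
    idcg = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / idcg if idcg > 0 else 0.0

# hypothetical usage: pLD50 values listed in the order produced by the docking fit scores
# print(ndcg(pld50_in_fit_score_order))
```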
a method combining network analysis with in silico modeling was carried out to illustrate the qstr and the toxicity mechanisms of the aconitine alkaloids. the 3d-qstr models built in sybyl showed internal robustness and high external predictivity, enabling the identification of the pivotal molecular moieties related to toxicity in the aconitine alkaloids. the comfa model had q 2 , r 2 , optimum component, and correlation coefficient (r 2 ) values of 0.624, 0.966, 6, and 0.9698, respectively, and the comsia model had q 2 , r 2 , optimum component, and correlation coefficient (r 2 ) values of 0.719, 0.901, 4, and 0.9770. the network was built with the cytoscape software and the string database, and its cluster analysis proved reliable. the 2v7o and 2vz6 proteins were identified as potential targets with the cytonca plugin and the pharmmapper server, and the interactions between the aconitine alkaloids and the key amino acids were examined in the docking study; the docking results for 2v7o are consistent with the experimental pld 50 values. the md simulation indicated that the aconitine alkaloids exhibit potent binding affinity and stability toward the receptor camk2g. finally, we combined the pivotal molecular moieties and the ligand-receptor interactions to establish the qstr of the aconitine alkaloids. this research serves as a guideline for further toxicity studies, including neuro-, reproductive, and embryo-toxicity. with a deeper understanding of the relationship between the toxicity and the structure of aconitine alkaloids, subsequent structural modification can be carried out to enhance their efficacy and to reduce their toxic side effects; such research can bring aconitine alkaloids closer to medical and clinical applications. in addition, as pointed out in past research [95] , user-friendly and publicly accessible web servers represent the future direction for reporting important computational analyses and findings [96] [97] [98] [99] [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] , and they have significantly enhanced the impact of computational biology on medical science [110, 111] . the research in this paper will serve as a foundation for constructing web servers for qstr studies and target identification of compounds. immunomodulating agents of plant origin.
i: preliminary screening chinese drugs plant origin aconitine poisoning: a global perspective ventricular tachycardia after ingestion of ayurveda herbal antidiarrheal medication containing aconitum fatal accidental aconitine poisoning following ingestion of chinese herbal medicine: a report of two cases five cases of aconite poisoning: toxicokinetics of aconitines a case of fatal aconitine poisoning by monkshood ingestion determination of aconitine and hypaconitine in gucixiaotong ye by capillary electrophoresis with field-amplified sample injection a clinical study in epidural injection with lappaconitine compound for post-operative analgesia therapeutic effects of il-12 combined with benzoylmesaconine, a non-toxic aconitine-hydrolysate, against herpes simplex virus type 1 infection in mice following thermal injury aconitine: a potential novel treatment for systemic lupus erythematosus aconitine-containing agent enhances antitumor activity of dichloroacetate against ehrlich carcinoma complex discovery from weighted ppi networks prediction and analysis of the protein interactome in pseudomonas aeruginosa to enable network-based drug target selection the string database in 2017: quality-controlled protein-protein association networks, made broadly accessible identification of functional modules in a ppi network by clique percolation clustering united complex centrality for identification of essential proteins from ppi networks the ppi network and cluster one analysis to explain the mechanism of bladder cancer the progress of novel drug delivery systems mitochondrial uncoupling protein 2 structure determined by nmr molecular fragment searching structural basis for membrane anchoring of hiv-1 envelope spike unusual architecture of the p7 channel from hepatitis c virus architecture of the mitochondrial calcium uniporter structure and mechanism of the m2 proton channel of influenza a virus computer-aided drug design using sesquiterpene lactones as sources of new structures with potential activity against infectious neglected diseases successful in silico discovery of novel nonsteroidal ligands for human sex hormone binding globulin in silico discovery of novel ligands for antimicrobial lipopeptides for computer-aided drug design structural bioinformatics and its impact to biomedical science coupling interaction between thromboxane a2 receptor and alpha-13 subunit of guanine nucleotide-binding protein prediction of the tertiary structure and substrate binding site of caspase-8 study of drug resistance of chicken influenza a virus (h5n1) from homology-modeled 3d structures of neuraminidases insights from investigating the interaction of oseltamivir (tamiflu)with neuraminidase of the 2009 h1 n1 swine flu virus prediction of the tertiary structure of a caspase-9/inhibitor complex design novel dual agonists for treating type-2 diabetes by targeting peroxisome proliferator-activated receptors with core hopping approach heuristic molecular lipophilicity potential (hmlp): a 2d-qsar study to ladh of molecular family pyrazole and derivatives fragment-based quantitative structure & ndash; activity relationship (fb-qsar) for fragment-based drug design investigation into adamantane-based m2 inhibitors with fb-qsar hp-lattice qsar for dynein proteins: experimental proteomics (2d-electrophoresis, mass spectrometry) and theoretic study of a leishmania infantum sequence the biological functions of low-frequency phonons: 2. 
cooperative effects low-frequency collective motion in biomacromolecules and its biological functions quasi-continuum models of twist-like and accordion-like low-frequency motions in dna collective motion in dna and its role in drug intercalation biophysical aspects of neutron scattering from vibrational modes of proteins biological functions of soliton and extra electron motion in dna structure low-frequency resonance and cooperativity of hemoglobin solitary wave dynamics as a mechanism for explaining the internal motion during microtubule growth designed electromagnetic pulsed therapy: clinical applications steps to the clinic with elf emf molecular dynamics study of the connection between flap closing and binding of fullerene-based inhibitors of the hiv-1 protease molecular dynamics studies on the interactions of ptp1b with inhibitors: from the first phosphate-binding site to the second one the cambridge structural database: a quarter of a million crystal structures and rising molecular similarity indices in a comparative analysis (comsia) of drug molecules to correlate and predict their biological activity single channel analysis of aconitine blockade of calcium channels in rat myocardiocytes conversion of the sodium channel activator aconitine into a potent alpha 7-selective nicotinic ligand aconitine blocks herg and kv1.5 potassium channels inactivation of ca 2+ release channels (ryanodine receptors ryr1 and ryr2) with rapid steps in [ca 2+ ] and voltage targeted disruption of the atp2a1 gene encoding the sarco(endo)plasmic reticulum ca 2+ atpase isoform 1 (serca1) impairs diaphragm function and is lethal in neonatal mice cyclic gmp-dependent protein kinase activity in rat pulmonary microvascular endothelial cells different g proteins mediate somatostatin-induced inward rectifier k + currents in murine brain and endocrine cells cardiac myocyte calcium transport in phospholamban knockout mouse: relaxation and endogenous camkii effects inhibition of camkii phosphorylation of ryr2 prevents induction of atrial fibrillation in fkbp12.6 knock-out mice regulation of ca 2+ and electrical alternans in cardiac myocytes: role of camkii and repolarizing currents the role of calmodulin kinase ii in myocardial physiology and disease excitotoxic neuroprotection and vulnerability with camkii inhibition structure of the camkiiδ/calmodulin complex reveals the molecular mechanism of camkii kinase activation a model of the complex between cyclin-dependent kinase 5 and the activation domain of neuronal cdk5 activator binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against sars an in-depth analysis of the biological functional studies based on the nmr m2 channel structure of influenza a virus molecular therapeutic target for type-2 diabetes novel inhibitor design for hemagglutinin against h1n1 influenza virus by core hopping method the string database in 2011: functional interaction networks of proteins, globally integrated and scored cytoscape: a software environment for integrated models of biomolecular interaction networks detecting overlapping protein complexes in protein-protein interaction networks cytonca: a cytoscape plugin for centrality analysis and evaluation of protein interaction networks subgraph centrality and clustering in complex hyper-networks ranking closeness centrality for large-scale social networks enhancing the enrichment of pharmacophore-based target prediction for the polypharmacological profiles of drugs comparative molecular 
field analysis (comfa). 1. effect of shape on binding of steroids to carrier proteins sample-distance partial least squares: pls optimized for many variables, with application to comfa a qsar analysis of toxicity of aconitum alkaloids recent advances in qsar and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design unified qsar approach to antimicrobials. 4. multi-target qsar modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks comfa qsar models of camptothecin analogues based on the distinctive sar features of combined abc, cd and e ring substitutions applicability domain for qsar models: where theory meets reality comparison of different approaches to define the applicability domain of qsar models molecular docking and qsar analysis of naphthyridone derivatives as atad2 bromodomain inhibitors: application of comfa, ls-svm, and rbf neural network concise applications of molecular modeling software-moe medicinal chemistry and the molecular operating environment (moe): application of qsar and molecular docking to drug discovery qsar models of cytochrome p450 enzyme 1a2 inhibitors using comfa, comsia and hqsar estimating a ranked list of human hereditary diseases for clinical phenotypes by using weighted bipartite network biomolecular simulation on thousands processors molecular dynamics and docking investigations of several zoanthamine-type marine alkaloids as matrix metaloproteinase-1 inhibitors salts influence cathechins and flavonoids encapsulation in liposomes: a molecular dynamics investigation review: recent advances in developing web-servers for predicting protein attributes irna-ai: identifying the adenosine to inosine editing sites in rna sequences iss-psednc: identifying splicing sites using pseudo dinucleotide composition irna-pseu: identifying rna pseudouridine sites ploc-mplant: predict subcellular localization of multi-location plant proteins by incorporating the optimal go information into general pseaac ploc-mhum: predict subcellular localization of multi-location human proteins via general pseaac to winnow out the crucial go information iatc-misf: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals psuc-lys: predict lysine succinylation sites in proteins with pseaac and ensemble random forest approach irnam5c-psednc: identifying rna 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition ikcr-pseens: identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier iacp: a sequence-based tool for identifying anticancer peptides ploc-meuk: predict subcellular localization of multi-label eukaryotic proteins by extracting the key go information into general pseaac iatc-mhyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals ihsp-pseraaac: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition irna-psecoll: identifying the occurrence sites of different rna modifications by incorporating collective effects of nucleotides into pseknc impacts of bioinformatics to medicinal chemistry an unprecedented revolution in medicinal chemistry driven by the progress of biological science this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license key: cord-198449-cru40qp4 authors: carballosa, 
alejandro; mussa-juane, mariamo; munuzuri, alberto p. title: incorporating social opinion in the evolution of an epidemic spread date: 2020-07-09 journal: nan doi: nan sha: doc_id: 198449 cord_uid: cru40qp4 attempts to control the epidemic spread of covid19 in the different countries often involve imposing restrictions on the mobility of citizens. recent examples demonstrate that the effectiveness of these policies strongly depends on the willingness of the population to adhere to them, and this is a parameter that is difficult to measure and control. we demonstrate in this manuscript a systematic way to check the mood of a society and a way to incorporate it into dynamical models of epidemic propagation. we exemplify the process considering the case of spain, although the results and methodology can be directly extrapolated to other countries. both the amount of interactions that an infected individual carries out while being sick and the reachability that this individual has within its network of human mobility have a key role in the propagation of highly contagious diseases. if we picture the population of a given city as a giant network of daily interactions, we would surely find highly clustered regions of interconnected nodes representing families, coworkers and circles of friends, but also several nodes that interconnect these different clustered regions acting as bridges within the network, representing simple random encounters around the city or perhaps people working at customer-oriented jobs. it has been shown that the most effective way to control the virulent spread of a disease is to break down the connectivity of these networks of interactions, by means of imposing social distancing and isolation measures on the population [1] . for these policies to succeed, however, the majority of the population needs to adhere to them willingly, since frequently these containment measures are not mandatory and significant parts of the population exploit some of the policies' gaps or even ignore them completely. in diseases with a high basic reproduction number, i.e., the expected number of new cases directly generated by one infected case, as is the case of covid19, these individuals represent an important risk for controlling the epidemic, as they form the main core of exposed individuals during quarantining policies. in case of getting infected, they can easily spread the disease to their nearest connections in their limited but ongoing everyday interactions, reducing the effectiveness of the social distancing constraints and helping the propagation of the virus. measures of containment and estimates of the degree of adhesion to these policies are especially important for diseases where there can be individuals that propagate the virus to a higher number of individuals than the average infected case. these are the so-called super-spreaders [2, 3] and are present in sars-like diseases such as covid19. recently, a class of super-spreaders was successfully incorporated in mathematical models [4] . regarding the usual epidemiological models based on compartments of populations, a viable option is to introduce a new compartment to account for the confined population [5] .
again, this approach would depend on the adherence of the population to the confinement policies. taking into account the rogue individuals that bypass the confinement measures is important to accurately characterize the infection curves and the prediction of short-term new cases of the disease, since these individuals can be responsible for a dramatic spread. here, we propose a method that quantitatively measures the state of public opinion and the degree of adhesion to an externally given policy. then, we incorporate it into a basic epidemic model to illustrate the effect of changes in the social network structure on the evolution of the epidemic. the process is as follows. we reconstruct a network describing the social situation of the spanish society at a given time based on data from social media. this network is like an x-ray of the social interactions of the population considered. then, a simple opinion model is incorporated into such a network, which allows us to extract a probability distribution of how likely the society is to follow new opinions (or political directions) introduced in the net. this probability distribution is later included in a simple epidemic model computed along with different complex mobility networks where the virus is allowed to spread. the framework of mobility networks allows the explicit simulation of entire populations down to the scale of single individuals, modelling the structure of human interactions, mobility and contact patterns. these features make them a promising tool to study epidemic spread (see [6] for a review), especially if we are interested in controlling the disease by means of altering the interaction patterns of individuals. at this point, we must highlight the difference between the two networks considered: one is collected from real data from social media and is used to sense the mood of the collective society, while the other is completely in-silico and proposed as a first approximation to the physical mobility of a population. the case study considered to exemplify our results is the situation in spain. this country was hard-hit by the pandemic, with a high death toll, and the government reacted by imposing a severe control of the population mobility that is still partially active. the policy worked and the epidemic is controlled; nevertheless, it has been difficult to estimate the level of adherence to those policies and its repercussions on the infection curve. this effect can also be decisive during the present transition to the so-called 'new normal'. the manuscript is organized as follows. in section 2 we describe the construction of the social network from scratch using free data from twitter; the opinion model is also introduced here, together with its coupling to the epidemiological model. section 3 contains the main findings and computations of the presented models, and section 4 a summary and a brief discussion of the results, with conclusions and future perspectives. in order to generate a social network, we use twitter. we downloaded several networks of connections (using the tool nodexl [7] ). introducing a word of interest, nodexl retrieves information on users that have tweeted a message containing the typed word and the connections between them. the topics of the different searches are irrelevant. in fact, we tried to choose neutral topics with the potential to engage many people independently of political commitment, age, or other distinctions.
the importance of each subnet is that it reveals who is following whom and allows us to build a more complete network of connections once all the subnets are put together. each one of the downloaded networks will have approximately 2000 nodes [8] . in this way, downloading as many of such subnets as possible gives us a more realistic map of the current situation of the spanish twitter network and, we believe, a realistic approximation to the social interactions nationwide. we intended to download diverse, politically inoffensive networks. 'junction' accounts are needed to make sure that all sub-networks overlap. junction accounts are accounts that are part of several subnets and guarantee the connection between them. if these junction accounts did not exist, isolated small local networks might appear. see the supplementary information for the word-of-interest networks downloaded and overlapped. twitter, as a social network, changes in time [9] , [10] , [11] and is strongly affected by the current socio-political situation, so important variations in its configuration are expected with time, specifically when a major crisis, such as the current one, is ongoing. taking this into consideration, we analyze two social networks corresponding to different moments in time. one represents the social situation in october 2019 (with n = 17665 accounts), which describes a pre-epidemic social situation, and another from april 2020 (with n = 24337 accounts), which describes the mandatory-confinement period of time. the networks obtained are directed and the links mark which nodes are following which. so, a node with high connectivity means it is following the opinions of many other nodes. the two social networks obtained with this protocol are illustrated in figure 1 . a first observation of their topologies demonstrates that they fit a scale-free network with a power-law connectivity distribution, with exponents 1.39 for the october'19 network and 1.77 for the april'20 network [12] . the significantly different exponents demonstrate the different internal dynamics of both networks. we generate the graphs in (a) and (b) using the algorithm force atlas 2 from gephi [13] . force atlas 2 is a force-directed algorithm that simulates a physical system to get the network organized through space relying on a balance of forces; nodes repulse each other as charged particles while links attract their nodes obeying hooke's law. thus, nodes that are more distant exchange less information. we consider a simple opinion model based on the logistic equation [14] that has also proved to be of use in other contexts [15, 16] . it is a two-variable dynamical model with logistic-type nonlinearities in the two opinion variables x and y, which account for the two different opinions. as x + y remains constant, we can use the normalization equation x + y = 1 and, thus, the system reduces to a single equation for x, dx/dt = f(x), where ε is a time rate that modifies the rhythm of evolution of the variable x and k is a coupling constant that controls the stationary value of x. this system has two fixed points, x_0 = 0 and a nonzero fixed point x* determined by ε and k, the latter being stable and x_0 = 0 unstable. we now consider that each node belongs to a network and the connections between nodes follow the distribution measured in the previous section.
on the network, the dynamic equation becomes [17] dx_i/dt = f(x_i) + (D/k_i) Σ_j l_ij x_j, with i = 1, …, n; that is, each of the nodes obeys the internal dynamics given by f(x_i) while being coupled to the rest of the nodes with a strength D/k_i, where D is a diffusive constant and k_i is the connectivity degree of node i (the number of nodes each node is interacting with, also named the outdegree). note that this is a directed, non-symmetrical network where a_ij = 1 means that node i is following the tweets from node j. l_ij is the laplacian matrix, the operator for diffusion in the discrete space. we can obtain the laplacian matrix from the connections established within the network as l_ij = a_ij − k_i δ_ij, a_ij being the adjacency matrix; notice that the mathematical definition of the laplacian matrix in some references has the opposite sign. we use the above definition, given by [17] , in parallelism with fick's law and in order to keep a positive sign in our diffusive system. now, we proceed as follows. we consider that all the accounts (nodes in our network) are initially at their stable fixed point x*, with a 10% random noise. then a subset of the accounts is forced to acquire a different opinion, x_i = 1 with a 10% random noise, and we let the system evolve following the dynamical equations above. in this case, accounts are sorted by their number of followers, which is easily controllable. as a result, some of the nodes shift their values closer to 1, which, in the context of this simplified opinion model, means that those nodes shifted their opinion towards that of the accounts leading the shift in opinion. this process is repeated in order to gain statistical significance and, as a result, it provides the probability distribution of nodes eager to change their opinion and adhere to the new politics. our epidemiological model is based on the classic sir model [18] and considers three different states for the population: susceptible (s), infected (i) and recovered or removed individuals (r), with the transitions sketched in figure 2 . here β represents the probability of infection and γ the probability of recovering. we assume that recovered individuals gain immunity and therefore cannot be infected again. we consider an extended model to account for the epidemic propagation where each node interacts with others in order to spread the virus. in this context we consider that each node belongs to a complex network whose topology describes the physical interactions between individuals. a node here means a single person or a set of individuals acting as a close group (i.e., families). the idea is that an infected node can spread the disease with a chance β to each of its connections with susceptible individuals; thus β becomes a control parameter for how many individuals an infected one can propagate the disease to at each time step. then, each infected individual has a chance γ of recovering from the disease. a first-order approach to a human mobility network is the watts-strogatz model [19] , given its ability to produce a clustered graph where nearest nodes have a higher probability of being interconnected while keeping some chance of interacting with distant nodes (as in an erdös-rényi random graph [20] ). according to this model, we generate a graph of n nodes, where each node is initially connected to its nearest neighbors in a ring topology and the connections are then randomly rewired with distant nodes with a probability p. the closer this probability is to 1, the more the graph resembles a fully random network, while for p = 0 it remains a purely diffusive network.
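as an illustration of the networked opinion dynamics just described, the sketch below integrates dx_i/dt = f(x_i) + (D/k_i) Σ_j l_ij x_j with a simple euler scheme. it is a minimal sketch, not the authors' code: the logistic stand-in f(x) = εx(1 − x), the parameter names (eps, D, x_star), the euler time step and the choice of seed accounts by follower count are assumptions made only for illustration.

```python
import numpy as np
import networkx as nx

def run_opinion_model(G, r=0.3, eps=1e-4, D=1e-4, x_star=0.5,
                      dt=1.0, steps=20000, noise=0.1, seed=None):
    """Euler integration of dx_i/dt = f(x_i) + (D / k_i) * sum_j L_ij x_j,
    with L = A - diag(k_out): the positive-sign (Fick's-law) convention."""
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    A = nx.to_numpy_array(G, nodelist=nodes)   # A[i, j] = 1 if i follows j
    k_out = A.sum(axis=1)                      # how many accounts i follows
    k_in = A.sum(axis=0)                       # how many followers i has
    L = A - np.diag(k_out)
    n = len(nodes)

    # all accounts start at the stable fixed point x*, with 10% random noise
    x = x_star * (1.0 + noise * (2.0 * rng.random(n) - 1.0))
    # a fraction r of the accounts, sorted by followers, gets the new opinion
    forced = np.argsort(-k_in)[: int(r * n)]
    x[forced] = 1.0 - noise * rng.random(forced.size)

    coupling = np.divide(D, k_out, out=np.zeros(n), where=k_out > 0)
    for _ in range(steps):
        f = eps * x * (1.0 - x)                # stand-in logistic nonlinearity
        x = x + dt * (f + coupling * (L @ x))
    return np.clip(x, 0.0, 1.0)                # final opinions, cf. figure 4
```

a histogram of the returned values, accumulated over repeated runs on the twitter network, would play the role of the opinion distribution that is fed into the epidemic model below.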
if we relate this ring-shaped network to a spatial distribution of individuals, when p is small the occurrence of random interactions with individuals far from our circle of neighbors is strongly restricted, mimicking a situation with strict mobility restrictions where we are only allowed to interact with the individuals from our neighborhood. this feature makes the watts-strogatz model an even more suitable choice for the purposes of our study, since it allows us to impose further mobility restrictions on our individuals in a simple way. on the other hand, the effects of clustering in small-world networks with epidemic models are important and have already been studied [21] [22] [23] [24] . the network is initialized setting an initial number of nodes as infected while the rest are in the susceptible state and, then, the simulation starts. at each time step, the chance that each infected individual spreads the disease to each of its susceptible connections is evaluated by means of a monte carlo method [25] . then, the chance of each infected individual recovering is evaluated at the end of the time step in the same manner. this process is repeated until the pool of infected individuals has decreased to zero or a stopping criterion is achieved. the following step in our modelling is to include the opinion model results from the previous section in the epidemic spread model just described. first, from the outcome x of the opinion model, we build a probability density p(x̄), where x̄ = 1 − x represents the disagreement with the externally given opinion. these opinion values are assigned to each of the nodes in the watts-strogatz network following the distribution p(x̄). next, we introduce a modified infection parameter, which varies depending on the opinion value of each node. it can be understood in terms of a weighted network modulated by the opinions: it is more likely that an infection occurs between two rogue individuals (higher values of x̄) than between two individuals who agree with the government confinement policies (x̄ zero or very close to zero). we introduce, then, the weight β'_ij = β · x̄_i · x̄_j, which accounts for the effective probability of infection between an infected node i and a susceptible node j. at each time step of the simulation, the infection chances are evaluated according to the value β'_ij of the connection and the process is repeated until the pool of infected individuals has decreased to zero or the stopping criterion is achieved. in figure 3 , we exemplify this process through a network diagram, where white, black and grey nodes represent susceptible, infected and recovered individuals respectively. black connections account for possible infections with chance β'_ij. to account for further complexity, this approach could be extrapolated to more complex epidemic models already presented in the literature [4, 6, 26] . nevertheless, for the sake of illustration, this model still preserves the main features of an epidemic spread without adding the additional complexity needed to account for real situations such as the covid19 case. following the previous protocol, we run the opinion model considering the two social networks analyzed. figure 4 shows the distribution of the final states of the variable x for the october'19 network (orange) and the april'20 network (green) when the new opinion is introduced in 30% of the total population (r = 30%). different percentages of the initial population r were considered, but the results are equivalent (see figure s1 in the supplementary information).
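the monte carlo procedure with the opinion-modulated weight β'_ij = β · x̄_i · x̄_j can be sketched as below. this is a hedged illustration rather than the authors' implementation: the default parameter values, the uniform draw of x̄ when no distribution is supplied, and the synchronous discrete-time update are assumptions.

```python
import numpy as np
import networkx as nx

def sir_with_opinions(n=10000, k=8, p_rewire=0.25, beta=0.9, gamma=0.1,
                      xbar=None, n_seeds=10, max_steps=10000, seed=None):
    """Discrete-time SIR on a Watts-Strogatz graph where infected node i
    infects a susceptible neighbour j with probability beta * xbar_i * xbar_j."""
    rng = np.random.default_rng(seed)
    G = nx.watts_strogatz_graph(n, k, p_rewire, seed=seed)
    if xbar is None:                        # disagreement values, e.g. rescaled
        xbar = rng.uniform(0.0, 0.3, n)     # to the 0-0.3 cutoff of the text
    state = np.zeros(n, dtype=int)          # 0 = S, 1 = I, 2 = R
    state[rng.choice(n, n_seeds, replace=False)] = 1
    curve = []
    for _ in range(max_steps):
        infected = np.flatnonzero(state == 1)
        if infected.size == 0:
            break                           # the outbreak has died out
        new_inf = set()
        for i in infected:
            for j in G.neighbors(i):
                if state[j] == 0 and rng.random() < beta * xbar[i] * xbar[j]:
                    new_inf.add(j)
        recovering = infected[rng.random(infected.size) < gamma]
        state[list(new_inf)] = 1            # infections take effect, then
        state[recovering] = 2               # recoveries close the time step
        curve.append(int(np.count_nonzero(state == 1)))
    return np.array(curve)                  # infection curve, cf. figure 6a
```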
figure 4 clearly shows that the population in april'20 is more eager to follow the new opinion (political guidelines) compared with the situation in october'19. in the pandemic scenario (network of april '20) it is noticeable that larger values of the opinion variable x are achieved, corresponding to the period of the quarantine. preferential states are also observed around x = 0, x = 0.5 and x = 1. note that the network of april'20 allows opinions to change more easily than in the case of october'19. during the health crisis in spain, the government imposed heavy restrictions on the mobility of the population. to better account for this situation, we rescaled the probability density of disagreement opinions p(x̄) to values between 0 and 0.3, leading to the probability densities of figure 5 . from here on, we shall refer to this maximum value of the rescaled probability density as the cutoff imposed on the probability density. note that this probability distribution is directly included in the mobility model as a probability to interact with other individuals; thus, this cutoff means that the government policy is enforced by reducing up to 70% of the interactions, and the remaining 30% is controlled by the population's decision to adhere to the official opinion. in figure 6 we summarize the main results obtained from the incorporation of the opinion model into the epidemiological one. we established four different scenarios: for the first one we considered a theoretical situation where we imposed that around 70% of the population will adopt social distancing measures, but leave the other 30% in a situation where they either have an opinion against the policies or they have to move around interacting with the rest of the network for any reason (this means x̄ = 0.3 for all the nodes). in contrast to this situation we introduce the opinion distribution of the social networks of april'20 and october'19. finally, we consider another theoretical population where at least 90% of the population will adopt social distancing measures (note that in a real situation, around 10% of the population occupies essential jobs and, thus, is still exposed to the virus). however, for the latter the outbreak of the epidemic does not occur, so there is no peak of infection. note that the first and the last ones are completely in-silico scenarios introduced for the sake of comparison. figure 6a shows the temporal evolution of the infected population in the first three of the above scenarios. the line in blue shows the results without including an opinion model and considering that 70% of the population blindly follows the government mobility restrictions while the remaining 30% continue interacting as usual. the orange line shows the evolution including the opinion model with the probability distribution derived for october'19. the green line is the evolution of the infected population considering the opinion model derived from the situation in april'20. note that the opinion model showed that the population in april'20 was more eager to follow changes in opinion than in october'19, and this is directly reflected in the curves in figure 6a . also note that as the population becomes more conscious and decides to adhere to the restriction-of-mobility policies, the maximum of the infection curve shifts in time and its intensity is diminished. this figure clearly shows that the state of opinion inferred from the social network analysis strongly influences the evolution of the epidemic.
the results from the first theoretical case (blue curve) show clearly that the disease reaches practically all the rogue individuals (around 30% of the total population, which we set with the rescaling of the probability density), while the other two cases with real data show that further agreement with the given opinion results in flatter curves of infection spreading. we have analyzed both the total number of infected individuals at the peak and its location in time, but, since our aim is to highlight the incorporation of the opinion model, we show in figures 6b and 6c the values of the maximum peak infection, as well as the delay in reaching this maximum, scaled with the corresponding values of the first case (blue line). we see that the difference in the degree of adhesion between the social networks yields a further reduction of approximately 12% in the number of infected individuals at the peak, and a further delay of around 20% in the time at which this peak takes place. note that for the april'20 social network, a reduction of almost 50% is obtained for the peak of infection, and a similar value is achieved for the time delay of the peak. this clearly reflects the fact that a higher degree of adhesion is important to flatten the infection curve. finally, in the latter theoretical scenario, where we impose a cutoff of x̄ = 0.1, the outbreak of the epidemic does not occur, and thus there is no peak of infection. this is represented in figures 6b and 6c as a dash-filled bar indicating the absence of the said peak. changing the condition on the cutoff imposed for the variable x̄ can be of interest to model milder or stronger confinement scenarios, such as the different policies applied in different countries. in figure 7 we show the infection peak statistics (maximum of the infection curve and time at maximum) for different values of the cutoffs and for both social opinion networks. in both cases, the values are scaled with those from the theoretical scenario with all individuals having their opinion at the cutoff value. both measurements (figures 7a and 7b) are inversely proportional to the value of the cutoff. this effect can be understood in terms of the obtained probability densities. for both networks (october'19 and april '20) we obtained that most of the nodes barely changed their opinion, and thus for increasing levels of the cutoff of x̄ these counts dominate the infection process, so the difference between both networks is reduced. on the other hand, this highlights the importance of rogue individuals in situations with increasing levels of confinement policies, since for highly contagious diseases each infected individual propagates the disease rapidly. each infected individual matters, and the fewer connections he or she has, the harder it is for the virus to spread among the exposed individuals. note that for all the scenarios, the social network of april'20 represents the optimum situation in terms of infection peak reduction and its time delay. the case of the cutoff x̄ = 0.2 is particularly interesting. all simulations run for this cutoff show an almost non-existent peak. this is represented in figure 7a as a reduction of almost 100% of the infection peak (the maximum value found on the infection curve was small but not zero); the corresponding time delay is shown in figure 7b. as discussed in the previous section, we are considering a watts-strogatz model for the mobility network.
this type of network is characterized by a probability of rewiring (as introduced in the previous section) that establishes the number of distant connections for each individual in the network. all previous results were obtained considering a probability of rewiring of 0.25. figure 8 shows the variation of the maximum of the infection curve and the time of the maximum versus this parameter. the observed trend indicates that the higher the clustering (thus, the lower the probability of rewiring) the more difficult it is for the disease to spread through the network. this result is supported by previous studies in the field, which show that clustering decreases the size of the epidemic and, in cases of extremely high clustering, it can die out within the clusters of population [21, 24] . this can be understood in terms of the average shortest path of the network [12] , which is a measure of the network topology that gives the average minimum number of steps required to travel between any two nodes of the network. starting from the ring topology, where only the nearest neighbors are connected, the average shortest path between any two opposite nodes is dramatically reduced with the random rewirings. remember that these new links can be understood as short-cuts or long-distance connections within the network. since the infection process can only occur along active links between nodes, it makes sense that the propagation is limited if fewer of these long-distance connections exist in the network. the average shortest path length decays extremely fast with increasing values of the random rewiring, and thus we see that the peak statistics are barely affected for random rewirings larger than 25%. if one is interested in further control of the disease, the connections with distant parts of the network must be minimized to values smaller than this fraction. regarding the performance of both opinion-biased epidemic cases, we found again a clear difference between the two of them. in the april'20 case, the outcome of the model always presents a more favorable situation for controlling the expansion of the epidemic, underlining the importance of personal adherence to isolation policies in controlling the evolution of the epidemic. we have parametrized the social situation of the spanish society at two different times with the data collected from a microblogging-based social medium (twitter.com). the topology of these networks combined with a simple opinion model provides us with an estimate of how likely this society is to follow new opinions and change its behavioral habits. the first analysis presented here shows that the social situation in october 2019 differs significantly from that of april 2020. in fact, we have found that the latter is more likely to accept opinions or directions and, thus, follow government policies such as social distancing or confinement. the output of these opinion models was used to tune the mobility in an epidemic model aiming to highlight the effect that the social 'mood' has on the pandemic evolution. the histogram of opinions was directly translated into a probability density of people choosing to follow or not the directions, modifying their exposure to infection by the virus. although we exemplify the results with an over-simplified epidemic model (sir), the same protocol can be implemented in more complicated epidemic models.
we show that the partial consensus of the social network, although not perfect, induces a significant impact on the infection curve, and that this impact is quantitatively stronger in the network of april 2020. our results can readily be included in more sophisticated models used to study the evolution of covid19. epidemic models generally fail to accurately include the effect of society and its opinions on the propagation of epidemics. we propose here a way to monitor, almost in real time, the mood of society and, therefore, include it in a dynamic epidemic model that is biased by the population's eagerness to follow the government policies. further analysis of the topology of the social network may also provide insights into how easily the network can be influenced and identify the critical nodes responsible for the collective behavior of the network. in order to check the statistical accuracy and relevance of our networks, we considered different scenarios with more or fewer subnets (each subnet corresponding to a single hashtag) and estimated the exponent of the scale-free-network fit. this result is illustrated in figure s1a for the october'19 case and in figure s1b for the april'20 case. note that as the number of subnets (hashtags) is increased, the exponent converges. for 1 subnet all the exponents were calculated, and for n subnets just one combination is possible, so no deviation is shown. the distribution of the final states of the variable x for the october'19 network (orange) and the april'20 network (green) when the new opinion is introduced by three different percentages of the total population (the r parameter) is shown in figure s2 . note that in all cases the results are qualitatively equivalent and, once included in the opinion model, the results are similar. figure s2 . distribution of the concentrations for the twitter network from october 2019 (orange) and april 2020 (green) for r = 20% (a), r = 30% (b) and r = 40% (c) of the initial accounts in the state x = 1 with a 10% noise (parameter values: 0.0001, 0.01, 0.0001, 0.01, 20000). figure s3 shows the evolution of the number of infected individuals with time for the epidemic model biased with the opinion model of april 2020. results for different values of the x̄ cutoff are shown. note how for x̄ = 0.2 the peak of infection vanishes, and the epidemic dies out due to its lack of ability to spread among the nodes. on the other hand, figure s4 shows, for different values of the cutoff on x̄, the comparison between the three cases presented in the main text (see figure 6 ): the theoretical scenario where the opinion is fixed at the cutoff value for all the nodes, and the epidemic model biased with the opinions of the october '19 and april '20 scenarios. see how the difference between the theoretical scenario and the opinion-biased models diminishes with growing values of the cutoff on x̄. finally, figure s5 shows the effect that higher values of the rewiring probability of the watts-strogatz model have on the time evolution of the infected individuals. as shown in the main text, lower values of the rewiring probability have an important impact on the peak of infection, while values above p = 0.3 barely change the statistics of the said peak, or fall within the error of the measurements.
sectoral effects of social distancing
one world, one health: the novel coronavirus covid-19 epidemic
the role of superspreaders in infectious disease
mathematical modeling of covid-19 transmission dynamics with a case study of wuhan
predictability: can the turning point and end of an expanding epidemic be precisely forecast?
epidemic processes on complex networks
nodexl: a free and open network overview, discovery and exploration add-in for excel
evolving centralities in temporal graphs: a twitter network analysis
analyzing temporal dynamics in twitter profiles for personalized recommendations in the social web
emerging topic detection on twitter based on temporal and social terms evaluation
gephi: an open source software for exploring and manipulating networks
recherches mathematiques sur la loi d'accroissement de la population
the coupled logistic map: a simple model for the effects of spatial heterogeneity on population dynamics
logistic map with memory from economic model
turing patterns in network-organized activator-inhibitor systems
mathematical epidemiology of infectious diseases: model building, analysis and interpretation
collective dynamics of small-world networks
on random graphs. publicationes mathematicae
epidemics and percolation in small-world networks
the effects of local spatial structure on epidemiological invasions
properties of highly clustered networks
critical behavior of propagation on small-world networks
metropolis, monte carlo and the maniac
infectious diseases in humans
this research is supported by the spanish ministerio de economía y competitividad and european regional development fund, research grant no. cov20/00617 and rti2018-097063-b-i00 aei/feder, ue; by xunta de galicia, research grant no. 2018-pg082, and the cretus strategic partnership, agrup2015/02, supported by xunta de galicia. all these programs are co-funded by feder (ue). we also acknowledge support from the portuguese foundation for science and technology (fct) within the project n. 147.
2. opinion distributions depending on the initial number of nodes with different opinion.
the list of hashtags used to construct both networks is in table 1 for the october'19 case (column on the left) and for the april'20 scenario (right column). all hashtags used were neutral in the sense of political bias or age. table 1 (hashtags for the october'19 and april'20 networks): #eleccionesgenerales28a #cuidaaquientecuida #eldebatedecisivolasexta #estevirusloparamosunidos #pactosarv #quedateconesp #rolandgarros #semanaencasayoigo #niunamenos #quedateencasa #selectividad2019 #superviviente2020 #anuncioeleccions28abril #autonomosabandonados #blindarelplaneta #renta2019 #diamundialdelabicicleta #encasaconsalvame #emergenciaclimatica27s #diamundialdelasalud #cuarentenaextendida #asinonuvigo #ahoratocalucharjuntos #house_party #encasaconsalvame apoyare_a_sanchez pleno_del_congreso
key: cord-010751-fgk05n3z authors: holme, petter title: objective measures for sentinel surveillance in network epidemiology date: 2018-08-15 journal: nan doi: 10.1103/physreve.98.022313 sha: doc_id: 10751 cord_uid: fgk05n3z assume one has the capability of determining whether a node in a network is infectious or not by probing it. then the problem of optimizing sentinel surveillance in networks is to identify the nodes to probe such that an emerging disease outbreak can be discovered early or reliably. whether the emphasis should be on early or reliable detection depends on the scenario in question.
we investigate three objective measures from the literature quantifying the performance of nodes in sentinel surveillance: the time to detection or extinction, the time to detection, and the frequency of detection. as a basis for the comparison, we use the susceptible-infectious-recovered model on static and temporal networks of human contacts. we show that, for some regions of parameter space, the three objective measures can rank the nodes very differently. this means sentinel surveillance is a class of problems, and solutions need to choose an objective measure for the particular scenario in question. as opposed to other problems in network epidemiology, we draw similar conclusions from the static and temporal networks. furthermore, we do not find one type of network structure that predicts the objective measures; i.e., which structure predicts them best depends both on the data set and the sir parameter values. infectious diseases are a big burden to public health. their epidemiology is a topic wherein the gap between the medical and theoretical sciences is not so large. several concepts of mathematical epidemiology-like the basic reproductive number or core groups [1] [2] [3] -have entered the vocabulary of medical scientists. traditionally, authors have modeled disease outbreaks in society by assuming any person to have the same chance of meeting anyone else at any time. this is of course not realistic, and improving this point is the motivation for network epidemiology: epidemic simulations between people connected by a network [4] . one can continue increasing the realism in the contact patterns by observing that the timing of contacts can also have structures capable of affecting the disease. studying epidemics on time-varying contact structures is the basis of the emerging field of temporal network epidemiology [5] [6] [7] [8] . one of the most important questions in infectious disease epidemiology is to identify people, or in more general terms, units, that would get infected early and with high likelihood in an infectious outbreak. this is the sentinel surveillance problem [9, 10] . it is the aspect of node importance that is most actively used in public health practice. typically, it works by selecting some hospitals (clinics, cattle farms, etc.) to screen, or more frequently test, for a specific infection [11] . defining an objective measure-a quantity to be maximized or minimized-for sentinel surveillance is not trivial. it depends on the particular scenario one considers and the means of intervention at hand. if the goal for society is to detect as many outbreaks as possible, it makes sense to choose sentinels to maximize the fraction of detected outbreaks [9] . if the objective rather is to discover outbreaks early, then one could choose sentinels that, if infected, are infected early [10, 12] . finally, if the objective is to stop the disease as early as possible, it makes sense to measure the time to extinction or detection (infection of a sentinel) [13] . see fig. 1 for an illustration. to restrict ourselves, we will focus on the case of one sentinel. if one has more than one sentinel, the optimal set will most likely not be the top nodes of a ranking according to the three measures above. their relative positions in the network also matter (they should not be too close to each other) [13] . in this paper, we study and characterize our three objective measures. we base our analysis on 38 empirical data sets of contacts between people.
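the three objective measures can be estimated for a candidate sentinel by post-processing many simulated outbreaks. in the sketch below, the per-run summary (a dictionary of infection times plus an extinction time) and the function name are illustrative choices, not the paper's code.

```python
import numpy as np

def objective_measures(runs, sentinel):
    """runs: list of (infection_times, extinction_time) pairs, one per
    simulated outbreak; infection_times maps node -> time of infection
    (nodes never infected are absent)."""
    t_x, t_d, detected = [], [], 0
    for infection_times, extinction_time in runs:
        if sentinel in infection_times:
            detected += 1
            t_det = infection_times[sentinel]
            t_d.append(t_det)
            t_x.append(min(t_det, extinction_time))   # detection or extinction
        else:
            t_x.append(extinction_time)               # never detected
    return {
        "time_to_detection_or_extinction": float(np.mean(t_x)),
        "time_to_detection": float(np.mean(t_d)) if t_d else float("inf"),
        "frequency_of_detection": detected / len(runs),
    }
```

ranking all nodes by each of the three returned quantities gives the three orderings whose agreement is studied in the rest of the paper.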
we analyze them both in temporal and static networks. the reason we use empirical contact data, rather than generative models, as the basis of this study is twofold. first, there are so many possible structures and correlations in temporal networks that one cannot tune them all in models [8] . it is also hard to identify the most important structures for a specific spreading phenomenon [8] . second, studying empirical networks makes this paper-in addition to elucidating the objective measures of sentinel surveillance-a study of human interaction. we can classify data sets with respect to how the epidemic dynamics propagate on them. as mentioned above, in practical sentinel surveillance, the network in question is rather one of hospitals, clinics or farms. one can, however, also think of sentinel surveillance of individuals, where high-risk individuals would be tested extra often for some diseases. in the remainder of the paper, we will describe the objective measures, the structural measures we use for the analysis, and the data sets, and we will present the analysis itself. we will primarily focus on the relation between the measures, secondarily on the structural explanations of our observations. assume that the objective of society is to end outbreaks as soon as possible. if an outbreak dies by itself, that is fine. otherwise, one would like to detect it so it could be mitigated by interventions. in this scenario, a sensible objective measure would be the time for a disease to either go extinct or be detected by a sentinel: the time to detection or extinction t x [13] . suppose that, in contrast to the situation above, the priority is not to save society from the epidemics as soon as possible, but just to detect outbreaks fast. this could be the case if one would want to get a chance to isolate a pathogen, or start producing a vaccine, as early as possible, maybe to prevent future outbreaks of the same pathogen at the earliest opportunity. then one would seek to minimize the time for the outbreak to be detected, conditioned on it being detected: the time to detection t d . for the time to detection, it does not matter how likely it is for an outbreak to reach a sentinel. if the objective is to detect as many outbreaks as possible, the corresponding measure should be the expected frequency of outbreaks to reach a node: the frequency of detection f d . note that for this measure a large value means the node is a good sentinel, whereas for t x and t d a good sentinel has a low value. this means that when we correlate the measures, a similar ranking between t x and f d or t d and f d yields a negative correlation coefficient. instead of considering the inverse times, or similar, we keep this feature and urge the reader to keep this in mind. there are many possible ways to reduce our empirical temporal networks to static networks. the simplest method would be to just include a link between any pair of nodes that has at least one contact during the course of the data set. this would, however, make some of the networks so dense that the static network structure of the node pairs most actively in contact would be obscured. for our purpose, we primarily want our network to span many types of network structures that can impact epidemics. without any additional knowledge about the epidemics, the best option is to threshold the weighted graph, where an edge (i, j) means that i and j had more than θ contacts in the data set.
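a minimal sketch of this thresholded projection, following the rule (made precise below) of omitting links below the median weight, could look as follows; the contact-list format (i, j, t) is an assumption about how the data are stored.

```python
import numpy as np
import networkx as nx
from collections import Counter

def threshold_static_graph(contacts):
    """contacts: iterable of (i, j, t). Project to a weighted static graph
    (weight = number of contacts per pair) and keep only links whose weight
    is not below the median weight theta."""
    weights = Counter(frozenset((i, j)) for i, j, _ in contacts if i != j)
    theta = np.median(list(weights.values()))
    G = nx.Graph()
    G.add_edges_from(tuple(pair) for pair, w in weights.items() if w >= theta)
    return G, theta
```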
in this work, we assume that we do not know what the per-contact transmission probability β is (this would anyway depend on both the disease and precise details of the interaction). rather, we scan through a very large range of β values. since we do that anyway, there is no need either to base the choice of θ on some epidemiological argument, or to rescale β after the thresholding. note that the rescaled β would be a non-linear function of the number of contacts between i and j. (assuming no recovery, for an isolated link with ν contacts, the transmission probability is 1 − (1 − β)^ν.) for our purpose the only thing we need is that the rescaled β is a monotonous function of β for the temporal network (which is true). to follow a simple principle, we omit all links with a weight less than the median weight θ. we simulate disease spreading by the sir dynamics, the canonical model for diseases that give immunity upon recovery [2, 14] . for static networks, we use the standard markovian version of the sir model [15] . that is, we assume that the disease spreads over links between susceptible and infectious nodes in an infinitesimal time interval dt with a probability β dt. then, an infectious node recovers after a time that is exponentially distributed with average 1/ν. the parameters β and ν are called infection rate and recovery rate, respectively. we can, without loss of generality, put ν = 1/t (where t is the duration of the sampling). for other ν values, the ranking of the nodes would be the same (but the values of t x and t d would be rescaled by a factor ν). we will scan an exponentially increasing progression of 200 values of β, from 10^−3 to 10. the code for the disease simulations can be downloaded [16] . for the temporal networks, we use a definition as close as possible to the one above. we assume an exponentially distributed duration of the infectious state with mean 1/ν. we assume a contact between an infectious and a susceptible node results in a new infection with probability β. in the case of temporal networks, one cannot reduce the problem to one parameter. like for static networks, we sample the parameter values in exponential sequences in the intervals 0.01 ≤ β ≤ 1 and 0.01 ≤ νt ≤ 1, respectively. for temporal networks, with our interpretation of a contact, β > 1 makes no sense, which explains the upper limit. furthermore, since temporal networks usually are effectively sparser (in terms of the number of possible infection events per time), the smallest β values will give similar results, which is the reason for the higher cutoff in this case. for both temporal and static networks, we assume the outbreak starts at one randomly chosen node. analogously, in the temporal case we assume the disease is introduced with equal probability at any time throughout the sampling period. for every data set and set of parameter values, we sample 10^7 runs of epidemic simulations. as motivated in the introduction, we base our study on empirical temporal networks. all networks that we study record contacts between people and fall into two classes: human proximity networks and communication networks. proximity networks are, of course, most relevant for epidemic studies, but communication networks can serve as a reference (and it is interesting to see how general the results are over the two classes). the data sets consist of anonymized lists of two identification numbers in contact and the time since the beginning of the contact.
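for concreteness, a temporal-network sir run close to the definition above can be sketched as below; the contact-list format, the use of python's standard rng, and taking the last recovery as the extinction time are assumptions of the sketch. its output matches the per-run summary used in the earlier objective-measure sketch.

```python
import random

def temporal_sir(contacts, n_nodes, beta, nu, t_total, rng=random):
    """contacts: time-ordered list of (i, j, t). Each contact between an
    infectious and a susceptible node transmits with probability beta;
    infectious periods are exponential with mean 1/nu; the outbreak starts
    at a random node at a random time in [0, t_total)."""
    t0 = rng.uniform(0.0, t_total)
    source = rng.randrange(n_nodes)
    inf_time = {source: t0}                       # node -> time of infection
    rec_time = {source: t0 + rng.expovariate(nu)}
    for i, j, t in contacts:                      # assumes contacts sorted by t
        if t < t0:
            continue
        for a, b in ((i, j), (j, i)):
            if (a in inf_time and t < rec_time[a]
                    and b not in inf_time and rng.random() < beta):
                inf_time[b] = t
                rec_time[b] = t + rng.expovariate(nu)
    extinction = max(rec_time.values())           # last recovery in this run
    return inf_time, extinction
```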
many of the proximity data sets we use come from the sociopatterns project [17] . these data sets were gathered by people wearing radio-frequency identification (rfid) sensors that detect proximity between 1 and 1.5 m. one such data set comes from a conference, hypertext 2009 (conference 1) [18] , another two from a primary school (primary school) [19] and five from a high school (high school) [20] , a third from a hospital (hospital) [21] , a fourth set of five data sets from an art gallery (gallery) [22] , a fifth from a workplace (office) [23] , and a sixth from members of five families in rural kenya [24] . the gallery data sets consist of several days, of which we use the first five. in addition to data gathered by rfid sensors, we also use data from the longer-range (around 10 m) bluetooth channel. the cambridge 1 [25] and 2 [26] datasets were measured by the bluetooth channel of sensors (imotes) worn by people in and around cambridge, uk. st andrews [27] , conference 2 [25] , and intel [25] are similar data sets tracing contacts at, respectively, the university of st. andrews, the conference infocom 2006, and the intel research laboratory in cambridge, uk. the reality [28] and copenhagen bluetooth [29] data sets also come from bluetooth data, but from smartphones carried by university students. in the romania data, the wifi channel of smartphones was used to log the proximity between university students [30] , whereas the wifi dataset links students of a chinese university that are logged onto the same wifi router. for the diary data set, a group of colleagues and their family members were self-recording their contacts [31] . our final proximity data set, the prostitution network, comes from self-reported sexual contacts between female sex workers and their male sex buyers [32] . this is a special form of proximity network since contacts represent more than just proximity. among the data sets from electronic communication, facebook comes from the wall posts at the social media platform facebook [33] . college is based on communication at a facebook-like service [34] . dating shows interactions at an early internet dating website [35] . messages and forum are similar records of interaction at a film community [36] . copenhagen calls and copenhagen sms consist of phone calls and text messages gathered in the same experiment as copenhagen bluetooth [29] . finally, we use four data sets of e-mail communication. one, e-mail 1, recorded all e-mails to and from a group of accounts [37] . the other three, e-mail 2 [38] , 3 [39] , and 4 [40] , recorded e-mails within a set of accounts. we list basic statistics-sizes, sampling durations, etc.-of all the data sets in table i . to gain further insight into the network structures promoting the objective measures, we correlate the objective measures with quantities describing the position of a node in the static networks. since many of our networks are fragmented into components, we restrict ourselves to measures that are well defined for disconnected networks. otherwise, in our selection, we strive to cover as many different aspects of node importance as we can. degree is simply the number of neighbors of a node. it is usually presented as the simplest measure of centrality and is one of the most discussed structural predictors of importance with respect to disease spreading [42] .
(centrality is a class of measures of a node's position in a network that try to capture what a "central" node is; i.e., ultimately centrality is not more well defined than the vernacular word.) it is also a local measure in the sense that a node is able to estimate its degree, which could be practical when evaluating sentinel surveillance in real networks. subgraph centrality is based on the number of closed walks a node is a member of. (a walk is a path that could be overlapping itself.) the number of closed walks of length λ from node i to itself is given by (a^λ)_ii, where a is the adjacency matrix and λ is the length of the walk. reference [43] argues that the best way to weigh walks of different lengths together is through the formula c_sg(i) = Σ_λ (a^λ)_ii / λ!. as mentioned, several of the data sets are fragmented (even though the largest connected component dominates components of other sizes). in the limit of high transmission probabilities, all nodes in the component of the infection seed will be infected. in such a case it would make sense to place a sentinel in the largest component (where the disease most likely starts). (table i lists basic statistics of the empirical temporal networks: n is the number of nodes, c is the number of contacts, t is the total sampling time, Δt is the time resolution of the data set, m is the number of links in the projected and thresholded static networks, and θ is the threshold.) closeness centrality builds on the assumption that a node that has, on average, short distances to other nodes is central [44] . here, the distance d(i, j) between nodes i and j is the number of links in the shortest paths between the nodes. the classical measure of closeness centrality of a node i is the reciprocal average distance between i and all other nodes. in a fragmented network, for all nodes, there will be some other node that it does not have a path to, meaning that the closeness centrality is ill defined. (assigning the distance infinity to disconnected pairs would give the closeness centrality zero for all nodes.) a remedy for this is, instead of measuring the reciprocal average of distances, measuring the average reciprocal distance [45] , c_c(i) = (1/(n − 1)) Σ_{j≠i} d^−1(i, j), where d^−1(i, j) = 0 if i and j are disconnected. we call this the harmonic closeness by analogy to the harmonic mean. vitality measures are a class of network descriptors that capture the impact of deleting a node on the structure of the entire network [46, 47] . specifically, we measure the harmonic closeness vitality, or harmonic vitality for short. this is the change of the sum of reciprocal distances of the graph (thus, by analogy to the harmonic closeness, well defined even for disconnected graphs), c_v(i) = Σ_{j≠k} d^−1(j, k) / Σ_{j≠k; j,k≠i} d^−1(j, k), where the denominator concerns the graph g with the node i deleted. if deleting i breaks many shortest paths, then the sum in the denominator decreases, and thus c_v(i) increases. a node whose removal disrupts many shortest paths would thus score high in harmonic vitality. our sixth structural descriptor is coreness. this measure comes out of a procedure called k-core decomposition. first, remove all nodes with degree k = 1. if this would create new nodes with degree one, delete them too. repeat this until there are no nodes of degree 1. then, repeat the above steps for larger k values. the coreness of a node is the last level at which it is present in the network during this process [48] . like for the static networks, in the temporal networks we measure the degree of the nodes.
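all of the static descriptors above are available in, or easily built from, networkx. the sketch below is illustrative rather than the paper's code; in particular, the ratio used for harmonic vitality is my reading of the textual definition (total reciprocal distance of g divided by that of g with the node removed), and the brute-force loop is slow for large graphs.

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

def structural_descriptors(G):
    """Degree, subgraph centrality, coreness, component size, harmonic
    closeness and harmonic vitality for an undirected, possibly fragmented
    graph G."""
    nodes = list(G.nodes())
    n = len(nodes)
    A = nx.to_numpy_array(G, nodelist=nodes)
    subgraph = dict(zip(nodes, np.diag(expm(A))))   # sum_k (A^k)_ii / k!
    degree = dict(G.degree())
    coreness = nx.core_number(G)
    comp_size = {v: len(c) for c in nx.connected_components(G) for v in c}
    # average reciprocal distance (harmonic closeness)
    harmonic = {v: c / (n - 1) for v, c in nx.harmonic_centrality(G).items()}

    def total_reciprocal_distance(H):
        return sum(nx.harmonic_centrality(H).values())

    whole = total_reciprocal_distance(G)
    vitality = {v: whole / max(total_reciprocal_distance(
                    nx.restricted_view(G, [v], [])), 1e-12) for v in nodes}
    return degree, subgraph, coreness, comp_size, harmonic, vitality
```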
to be precise, we define the degree as the number of distinct other nodes a node is in contact with within the data set. strength is the total number of contacts a node has participated in throughout the data set. unlike degree, it takes the number of encounters into account. temporal networks, in general, tend to be more disconnected than static networks. for node i to be connected to j in a temporal network there has to be a time-respecting path from i to j , i.e., a sequence of contacts increasing in time that (if time is projected out) is a path from i to j [7, 8] . thus two interesting quantities-corresponding to the component sizes of static networks-are the fraction of nodes reachable from a node by time-respecting paths forward (downstream component size) and backward in time (upstream component size) [49] . if a node only exists in the very early stage of the data, the sentinel will likely not be active by the time the outbreak happens. if a node is active only at the end of the data set, it would also be too late to discover an outbreak early. for these reasons, we measure statistics of the times of the contacts of a node. we measure the average time of all contacts a node participates in; the first time of a contact (i.e., when the node enters the data set); and the duration of the presence of a node in the data (the time between the first and last contact it participates in). we use a version of the kendall τ coefficient [50] to elucidate both the correlations between the three objective measures, and between the objective measures and network structural descriptors. in its basic form, the kendall τ measures the difference between the number of concordant (with a positive slope between them) and discordant pairs relative to all pairs. there are a few different versions that handle ties in different ways. we count a pair of points whose error bars overlap as a tie and calculate τ = (n_c − n_d) / (n_c + n_d + n_t), where n_c is the number of concordant pairs, n_d is the number of discordant pairs, and n_t is the number of ties. we start investigating the correlation between the three objective measures throughout the parameter space of the sir model for all our data sets. we use the time to detection or extinction as our baseline and compare the other two objective measures with that. in fig. 2 , we plot the τ coefficient between t x and t d and between t x and f d . we find that for low enough values of β, the τ for all objective measures coincide. for very low β the disease just dies out immediately, so the measures are trivially equal: all nodes would be equally good sentinels in all three aspects. for slightly larger β-for most data sets 0.01 < β < 0.1-both τ (t x , t d ) and τ (t x , f d ) are negative. this is a region where outbreaks typically die out early. for a node to have low t x , it needs to be where outbreaks are likely to survive, at least for a while. this translates to a large f d , while for t d , it would be beneficial to be as central as possible. if there are no extinction events at all, t x and t d are the same. for this reason, it is no surprise that, for most of the data sets, τ (t x , t d ) becomes strongly positive for large β values. the τ (t x , f d ) correlation is negative (of a similar magnitude), meaning that for most data sets the different methods would rank the possible sentinels in the same order. for some of the data sets, however, the correlation never becomes positive even for large β values (like copenhagen calls and copenhagen sms).
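the tie-handling variant of kendall's τ defined above can be computed directly from its definition; in this sketch, treating a pair as tied when the error bars overlap in either coordinate is an assumption about how the overlap rule is applied.

```python
import numpy as np

def kendall_tau_with_ties(x, dx, y, dy):
    """tau = (n_c - n_d) / (n_c + n_d + n_t); a pair of points whose error
    bars overlap (in x or in y) counts as a tie."""
    x, dx, y, dy = map(np.asarray, (x, dx, y, dy))
    n_c = n_d = n_t = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            tied = (abs(x[i] - x[j]) <= dx[i] + dx[j]
                    or abs(y[i] - y[j]) <= dy[i] + dy[j])
            if tied:
                n_t += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                n_c += 1
            else:
                n_d += 1
    return (n_c - n_d) / (n_c + n_d + n_t)
```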
these networks are the most fragmented ones, meaning that one sentinel would be unlikely to detect the outbreak (since it probably happens in another component). this makes t x rank the important nodes in a way similar to f d , but since diseases that do reach a sentinel do it faster in a small component than in a large one, t x and t d become anticorrelated. in fig. 3 , we perform the same analysis as in the previous section but for temporal networks. the picture is to some extent similar, but also much richer. just as for the case of static networks, τ (t x , f d ) is always nonpositive, meaning the time to detection or extinction ranks the nodes in a way positively correlated with the frequency of detection. furthermore, like the static networks, τ (t x , t d ) can be both positive and negative. this means that there are regions where t d ranks the nodes in the opposite way to t x . these regions of negative τ (t x , t d ) occur for low β and ν. for some data sets-for example the gallery data sets, dating, copenhagen calls, and copenhagen sms-the correlations are negative throughout the parameter space. among the data sets with a qualitative difference between the static and temporal representations, we find that prostitution and e-mail 1 both have strongly positive values of τ (t x , t d ) for large β values in the static networks but moderately negative values for temporal networks. in this section, we take a look at how network structures affect our objective measures. in fig. 4 , we show the correlation between our three objective measures and the structural descriptors as a function of β for the office data set. panel (a) shows the results for the time to detection or extinction. there is a negative correlation between this measure and traditional centrality measures like degree or subgraph centrality. this is because t x is a quantity one wants to minimize to find the optimal sentinel, whereas for all the structural descriptors a large value means that a node is a candidate sentinel node. we see that degree and subgraph centrality are the two quantities that best predict the optimal sentinel location, while coreness is also close (at around −0.65). this is in line with research showing that certain biological problems are better determined by degree than by more elaborate centrality measures [51] . overall, the τ curves are rather flat. this is partly explained by τ being a rank correlation. for t d [fig. 4(b)], most curves change behavior around β = 0.2. this is the region where larger outbreaks could happen, so one can understand there is a transition to a situation similar to t x [fig. 4(a)]. f d [fig. 4(c)] shows a behavior similar to t d in that the curves start changing order, and what was a correlation at low β becomes an anticorrelation at high β. this anticorrelation is a special feature of this particular data set, perhaps due to its pronounced community structure. nodes of degree 0, 1, and 2 have strictly increasing values of f d , but for some of the high-degree nodes (which all have f d close to one) the ordering gets anticorrelated with degree, which makes kendall's τ negative. since rank-based correlations are more principled for the skew-distributed quantities common in networks, we keep them. we are currently investigating what creates these unintuitive anticorrelations among the high-degree nodes in this data set. next, we proceed with an analysis of all data sets. we summarize plots like fig.
4 by the structural descriptor with the largest magnitude of the correlation |τ |; see fig. 5 . we can see that there is not one structural quantity that uniquely determines the ranking of nodes; there is not even one that dominates over the others. (1) degree is the strongest structural determinant of all objective measures at low β values. this is consistent with ref. [13] . (2) component size only occurs for large β. in the limit of large β, f d is only determined by component size (if we extended the analysis to even larger β, subgraph centrality would have the strongest correlation for the frequency of detection). (3) harmonic vitality is relatively better as a structural descriptor for t d , less so for t x and f d . t x and f d capture the ability of detecting an outbreak before it dies, so for these quantities one can imagine more fundamental quantities like degree and the component size are more important. (4) subgraph centrality often shows the strongest correlation for intermediate values of β. this is interesting, but difficult to explain since the rationale of subgraph centrality builds on cycle counts and there is no direct process involving cycles in the sir model. (5) harmonic closeness rarely gives the strongest correlation. if it does, it is usually succeeded by coreness and the data set is typically rather large. (6) data sets from the same category can give different results. perhaps college and facebook are the most conspicuous example. in general, however, similar data sets give similar results. the final observation could be extended. we see that, as β increases, one color tends to follow another. this is summarized in fig. 6 , where we show transition graphs of the different structural descriptors such that the size corresponds to their frequency in fig. 5 , and the size of the arrows shows how often one structural descriptor is succeeded by another as β is increased. for t x , the degree and subgraph centrality are the most important structural descriptors, and the former is usually succeeded by the latter. for t d , there is a common peculiar sequence of degree, subgraph centrality, coreness, component size, and harmonic vitality that is manifested as the peripheral, clockwise path of fig. 6(b) . finally, f d is similar to t x except that there is a rather common transition from degree to coreness, and harmonic vitality is, relatively speaking, a more important descriptor. in fig. 7 , we show the figure for temporal networks corresponding to fig. 5 . just like the static case, even though every data set and objective measure is unique, we can make some interesting observations. (1) strength is most important for small ν and β. this is analogous to degree dominating the static network at small parameter values. (2) upstream component size dominates at large ν and β. this is analogous to the component size of static networks. since temporal networks tend to be more fragmented than static ones [49] , this dominance at large outbreak sizes should be even more pronounced for temporal networks. (3) most of the variation happens in the direction of larger ν and β. in this direction, strength is succeeded by degree which is succeeded by upstream component size. (4) like the static case, as seen in the analysis of figs. 5 and 7 , t x and f d are qualitatively similar compared to t d . (5) temporal quantities, such as the average and first times of a node's contacts, are commonly the strongest predictors of t d .
(6) when a temporal quantity is the strongest predictor of t x and f d it is usually the duration. it is understandable that this has little influence on t d , since the ability to be infected at all matters for these measures; a long duration is beneficial since it covers many starting times of the outbreak. (7) similar to the static case, most categories of data sets give consistent results, but some differ greatly (facebook and college is yet again a good example). the bigger picture these observations paint is that, for our problem, the temporal and static networks behave rather similarly, meaning that the structures in time do not matter so much for our objective measures. at the same time, there is not only one dominant measure for all the data sets. rather are there several structural descriptors that correlate most strongly with the objective measures depending on ν and β. in this paper, we have investigated three different objective measures for optimizing sentinel surveillance: the time to detection or extinction, the time to detection (given that the detection happens), and the frequency of detection. each of these measures corresponds to a public health scenario: the time to detection or extinction is most interesting to minimize if one wants to halt the outbreak as quickly as possible, and the frequency of detection is most interesting if one wants to monitor the epidemic status as accurately as possible. the time to detection is interesting if one wants to detect the outbreak early (or else it is not important), which could be the case if manufacturing new vaccine is relatively time consuming. we investigate these cases for 38 temporal network data sets and static networks derived from the temporal networks. our most important finding is that, for some regions of parameter space, our three objective measures can rank nodes very differently. this comes from the fact that sir outbreaks have a large chance of dying out in the very early phase [52] , but once they get going they follow a deterministic path. for this reason, it is thus important to be aware of what scenario one is investigating when addressing the sentinel surveillance problem. another conclusion is that, for this problem, static and temporal networks behave reasonably similarly (meaning that the temporal effects do not matter so much). naturally, some of the temporal networks respond differently than the static ones, but compared to, e.g., the outbreak sizes or time to extinction [53] [54] [55] , differences are small. among the structural descriptors of network position, there is no particular one that dominates throughout the parameter space. rather, local quantities like degree or strength (for the temporal networks) have a higher predictive power at low parameter values (small outbreaks). for larger parameter values, descriptors capturing the number of nodes reachable from a specific node correlate most with the objective measures rankings. also in this sense, the static network quantities dominate the temporal ones, which is in contrast to previous observations (e.g., refs. [53] [54] [55] ). for the future, we anticipate work on the problem of optimizing sentinel surveillance. an obvious continuation of this work would be to establish the differences between the objective metrics in static network models. to do the same in temporal networks would also be interesting, although more challenging given the large number of imaginable structures. 
yet an open problem is how to distribute sentinels if there are more than one. it is known that they should be relatively far away [13] , but more precisely where should they be located? references: modern infectious disease epidemiology; infectious diseases in humans; temporal network epidemiology; a guide to temporal networks; principles and practices of public health surveillance; stochastic epidemic models and their statistical analysis; pretty quick code for regular (continuous time, markovian) sir on networks, github.com/pholme/sir; proceedings, acm sigcomm 2006-workshop on challenged networks (chants); crawdad dataset st_andrews/sassy; third international conference on emerging intelligent data and web technologies; proc. natl. acad. sci. usa; proceedings of the 2nd acm workshop on online social networks, wosn '09; proceedings of the tenth acm international conference on web search and data mining, wsdm '17; proceedings of the 14th international conference; networks: an introduction; network analysis: methodological foundations; distance in graphs. we thank sune lehmann for providing the copenhagen data sets. this work was supported by jsps kakenhi grant no. jp 18h01655. key: cord-027719-98tjnry7 authors: said, abd mlak; yahyaoui, aymen; yaakoubi, faicel; abdellatif, takoua title: machine learning based rank attack detection for smart hospital infrastructure date: 2020-05-31 journal: the impact of digital technologies on public health in developed and developing countries doi: 10.1007/978-3-030-51517-1_3 sha: doc_id: 27719 cord_uid: 98tjnry7 in recent years, many technologies have been racing to deliver the best service for human beings. emerging internet of things (iot) technologies gave birth to the notion of smart infrastructures such as smart grids, smart factories or smart hospitals. these infrastructures rely on interconnected smart devices collecting real-time data in order to improve existing procedures and systems capabilities. a critical issue in smart infrastructures is the protection of information, which may be more valuable than physical assets. therefore, it is extremely important to detect and deter any attack or breach of the network system for information theft. one of these attacks is the rank attack, which is carried out by an intruder node in order to attract legitimate traffic to it and then steal the personal data of different persons (both patients and staff in hospitals). in this paper, we propose an anomaly based rank attack detection system against an iot network using support vector machines. as a use case, we are interested in the healthcare sector and in particular in smart hospitals, which are multifaceted with many challenges such as service resilience, assets interoperability and sensitive information protection. the proposed intrusion detection system (ids) is implemented and evaluated using the contiki cooja simulator. results show a high detection accuracy and low false positive rates. nowadays, the deployment of the internet of things (iot), where many objects are connected to internet cloud services, has become highly recommended in many applications in various sectors. a highly important concept in the iot is wireless sensor networks or wsns, where end nodes rely on sensors that can collect data from the environment to ensure tasks such as surveillance or monitoring for wide areas [7] . this capability gave birth to the notion of smart infrastructures such as smart metering systems, smart grids or smart hospitals.
in such infrastructures, end devices collecting data are connected to intermediate nodes that forward data in order to reach border routers using routing protocols. these end nodes are in general limited in terms of computational resources, battery and memory capacities. also, their number is growing exponentially. therefore, new protocols are proposed under the iot paradigm to optimize energy consumption and computations. two of these protocols are considered the de facto protocols for the internet of things (iot): rpl (routing protocol for low power and lossy networks) and 6lowpan (ipv6 over low power wireless personal area network). these protocols are designed for constrained devices in recent iot applications. routing is a key part of the ipv6 stack that remains to be specified for 6lowpan networks [6] . rpl provides a mechanism whereby multipoint-to-point traffic from devices inside the low-power and lossy networks (llns) towards a central control point as well as point-to-multipoint traffic from the central control point to the devices inside the lln are supported [8, 9] . rpl involves many concepts that make it a flexible protocol, but also rather complex [10] : • dodag (destination oriented directed acyclic graph): a topology similar to a tree that optimizes routes between the sink and other nodes for both data collection and distribution traffic. each node within the network has an assigned rank, which increases as the nodes move away from the root node. the nodes forward packets using the lowest rank as the route selection criterion. • dis (dodag information solicitation): used to solicit a dodag information object from rpl nodes. • dio (dodag information object): used to construct and maintain the dodag and to periodically refresh the information of the nodes on the topology of the network. • dao (destination advertisement object): used by nodes to propagate destination information upward along the dodag in order to update the information of their parents. with the enormous number of devices that are now connected to the internet, a new solution was proposed: 6lowpan, a lightweight protocol that defines how to run ip version 6 (ipv6) over low data rate, low power, small footprint radio networks as typified by the ieee 802.15.4 radio [11] . in smart infrastructures, the huge amount of sensitive data exchanged among these modules and over radio interfaces needs to be protected. therefore, detecting any network or device breach becomes a high priority challenge for researchers due to the resource constraints of the devices (low processing power, battery power and memory size). the rank attack is one of the best-known rpl attacks, where the attacker attracts other nodes to establish routes through it by advertising a false rank. this way, intruders collect all the data that pass through the network [12] . for this reason, developing specific security solutions for the iot is essential to let users catch all the opportunities it offers. one of the defense lines designed for detecting attackers is intrusion detection systems (ids) [13] . in this paper, we propose a centralized anomaly-based ids for smart infrastructures. we chose the o-svm (one class support vector machines) algorithm for its low energy consumption compared to other machine learning algorithms for wireless sensor networks (wsns) [20] . as a use case, we are interested in smart hospital infrastructures.
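to make the rank mechanism and the rank attack described above concrete, the following toy python sketch shows how nodes that always prefer the neighbour advertising the lowest rank are drawn towards a malicious node that advertises a falsely low rank; node names, rank values and the selection rule are illustrative assumptions, not actual rpl code:

```python
# toy model of rank-based preferred-parent selection in an rpl-like dodag
# (illustrative only: real rpl uses objective functions, trickle timers, etc.)

def choose_parent(advertised_ranks):
    """pick the neighbour advertising the lowest rank as preferred parent."""
    return min(advertised_ranks, key=advertised_ranks.get)

# honest ranks grow with the distance from the root (rank 1)
honest_ranks = {"root": 1, "A": 2, "B": 3}

# a leaf node hears advertisements from its neighbours A and B
print(choose_parent({"A": honest_ranks["A"], "B": honest_ranks["B"]}))  # -> "A"

# a malicious mote M joins and falsely advertises a rank close to the root
advertised = {"A": honest_ranks["A"], "B": honest_ranks["B"], "M": 1}
print(choose_parent(advertised))  # -> "M": traffic is now routed through the attacker
```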
smart hospitals have a wide range of resources that are essential to maintain their operations and the safety of patients, employees and the building itself [1, 2] , such as the following: • remote care assets: medical equipment for tele-monitoring and tele-diagnosis. • networked medical devices: wearable mobile devices (heartbeat bracelets, wireless temperature counters, glucose measuring devices...) or equipment installed to collect health-service-related data. • networking equipment: standard equipment providing connectivity between different equipment (transmission medium, routers, gateways...). • data: both clinical and patient data, and staff data, which are considered the most critical asset, stored in huge datasets or private clouds. • building and facilities: sensors distributed in the hospital building that monitor patient safety (temperature sensors for patient rooms and operating theaters and gas sensors are among the sensors used). we target a common iot architecture that can be considered for smart hospitals. in such an architecture, there are mainly three types of components: sensing nodes: composed of remote care assets, networked medical devices and different sensors. these sensors send different types of data and information (patient and staff data, medical equipment status...). they are linked to microcontrollers and radio modules to transmit these data to the processing unit [3] . edge router: an edge router or border router is a specialized router residing at the edge or boundary of a network. this node ensures the connectivity of its network with external networks, such as a wide area network or the internet. an edge router uses an external border gateway protocol, which is used extensively over the internet to provide connectivity with remote networks. instead of providing communication with an internal network, which the core router already manages, a gateway may provide communication with different networks and autonomous systems [4] . interface module and database: this module is the terminal of the network; it contains all the data collected from the different nodes of the network and analyzes this information in order to ensure the safety of patients and improve the healthcare system. figure 1 [5] presents the typical iot e-health architecture, where sensors are distributed (medical equipment, room sensors and others) and send data to the iot gateway. on one hand, this gives the medical supervisor the opportunity to monitor the patient's health status. on the other hand, these data are saved into databases for further analysis. the rest of the paper is structured as follows. section 2 presents the related work. section 3 presents the rank attack scenario. section 4 presents our proposed approach. section 5 presents our main results and sect. 6 concludes the paper and presents its perspectives. rpl protocol security, especially in the healthcare domain, is a crucial aspect for preserving personnel data. the node rank is an important parameter for an rpl network. it can be used for route optimization, loop prevention, and topology maintenance. in fact, the rank attack can decrease the network performance in terms of packet delivery rate (pdr) to almost 60% [23] . different solutions have been proposed to detect and mitigate rpl attacks, such as a rank authentication mechanism, proposed in [24] , that avoids falsely announced ranks by using a cryptographic technique. however, this technique is not very efficient because of its high computational cost and energy consumption.
authors in [25] propose a monitoring node (mn) based scheme, but it is also not efficient because using a large network of mns causes a communication overhead. in [26] , the authors propose an ids called "svelte" that can only be used for detection of the simple rank attack and has a high false alarm rate. a host-based ids was proposed in [27] . this ids uses a probabilistic scheme, but such schemes are discouraged by rfc6550 for resource constrained networks. routing choice "rc" was proposed by zhang et al. [28] . it is not directly related to the rank attack but is based on false preferred parent selection. it has a high communication overhead in rpl networks. a trusted platform module (tpm) approach was proposed by seeber et al. [29] . it introduces an overlay network of tpm nodes for the detection of network attacks. the securerpl (srpl) technique [30] protects the rpl network from the rank attack; however, it is characterized by high energy consumption. therefore, anomaly based solutions using machine learning permit a more efficient detection. the authors of [22] compared several unsupervised machine learning approaches based on the local outlier factor, near neighbors, the mahalanobis distance and svms for intrusion detection. their experiments showed that o-svm is the most appropriate technique to detect selective forwarding and jamming attacks. we rely on these results in our choice of o-svm. the rank attack is one of the well-known attacks against the routing protocol for low power and lossy networks (rpl) in the network layer of the internet of things. the rank in the rpl protocol, as shown in fig. 2 , is the physical position of the node with respect to the border router and neighbor nodes [12] . since our network is dynamic due to the mobility of its nodes (sensors moving with patients...), the rpl protocol periodically reformulates the dodag. as shown in fig. 3 , an attacker may insert a malicious mote into the network to attract other nodes to establish routes through it by advertising false ranks while the reformulation of the dodag is done [14] . by default, rpl has security mechanisms to mitigate external attacks, but it cannot mitigate internal attacks efficiently. in that case, the rank attack is considered one of the most dangerous attacks in dynamic iot networks, since the attacker either controls an existing node in the dodag (making it an internal attack on rpl) or identifies the network and inserts his own malicious node, which then acts as the attack node, as shown in fig. 4 . the key features required for our solution are to be adaptive, lightweight, and able to learn from the past. we design an iot ids and we implement and evaluate it as the authors did in [18, 20] . placement choice: one of the important decisions in intrusion detection is the placement of the ids in the network. we use a centralized approach by installing the ids at the border router. therefore, it can analyze all the packets that pass through it. the choice of a centralized ids was made to avoid the placement of ids modules in constrained devices, which would require more storage and processing capabilities [15, 16] ; however, these devices have limited resources. detection method choice: an intrusion detection system (ids) is a tool or mechanism to detect attacks against a system or a network by analyzing the activity in the network or in the system itself. once an attack is detected, an ids may log information about it and/or report an alarm [15, 16] .
broadly speaking, we choose an anomaly based detection mechanism: it tries to detect anomalies in the system by determining the ordinary behavior and using it as a baseline. any deviation from that baseline is considered an anomaly. this technique has the ability to detect almost any attack and adapt to new environments. we chose support vector machines (svm) as an anomaly based machine learning technique. an svm is a discriminating classifier formally defined by a separating hyperplane. given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. in two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side. it uses a mathematical function named the kernel to reformulate data. after these transformations, it defines an optimal borderline between the labels. mainly, it does some extremely complex data transformations to find how to separate the data based on the labels or outputs defined. the concept of the svm learning approach is based on the definition of the optimal separating hyperplane (fig. 5 ) [21] which maximizes the margin of the training data [17, 18] . the choice of this machine learning algorithm is due to one important point: it works well with structured data, such as tables of values, compared to other algorithms. we implement the ids in the smart iot gateway shown in fig. 1 . to investigate the effectiveness of our proposed ids, we implement three scenarios of rank attack using the contiki-cooja simulator [19] . we assess how our ids module can detect them. we present next the simulation setup and evaluation metrics, and we discuss the results achieved. our simulation scenario consists of a total of 11 motes spread across an area of 200 × 200 m (simulating the area of a hospital where different sensors are placed in every area to control the patient rooms). the topology is shown in fig. 6 using four scenarios. there is one sink (mote id:0 with green dot) and 10 senders (yellow motes from id:1 to id:10). every mote sends a packet to the sink at the rate of 1 packet every 1 min. we implement the centralized anomaly based ids at the root mote (the sink) and we collect and analyze the network data. table 1 summarizes the used simulation parameters. we run four simulation scenarios for 1 h (fig. 6 ): • scenario 1: iot network without malicious motes. • scenario 2: iot network with 1 randomly placed malicious mote. • scenario 3: iot network with 2 randomly placed malicious motes. • scenario 4: iot network with 4 randomly placed malicious motes. to evaluate the accuracy of the proposed ids, we rely on the energy consumption parameter. we collect power tracking data per mote in terms of radio on energy, radio transmission tx energy, radio reception rx energy and radio interfered int energy. in order to calculate these metrics we used the formula of [31] (eq. 1, table 2 ) as follows: energy(mj) = (transmit * 19.5ma + listen * 21.8ma we used data containing 1000 instances of consumed energy values for each node in the network. figure 7 depicts the evolution of the power tracking of each node in the four scenarios: • scenario 1: when we have normal behavior in the network, all the sensors show regular energy consumption in terms of receiving (node 0) and sending (nodes from 1 to 10). we use this simulation to collect the training data for the proposed ids. • scenarios 2, 3 and 4: for those scenarios, we observe high sending values for the malicious motes.
this is explained by the fact that when a malicious mote joins the network, it asks the other motes to recreate the dodag tree and also to send the data that they have, in order to steal as much data as it can. that is why it has high receiving values too. the other motes do not distinguish that this is a malicious mote; therefore, they recreate the dodag tree and send their information through the malicious node. we used the first simulation scenario as the dataset for our ids, describing the normal behavior of the network. this 1 h of information was enough to detect the malicious activities of the rank attack. meanwhile, each time we add a malicious mote, the anomaly detection rate increases, as shown in fig. 8 . in each simulation with malicious motes, the proposed ids indicates the anomaly detection ratio, which increases each time another malicious mote is added. this aims to determine the impact of the number of malicious motes compared to the normal behavior of the system. in this paper, we propose an intrusion detection system (ids) for smart hospital infrastructure data protection. the chosen ids is centralized and anomaly based, using the machine learning algorithm o-svm. simulation results show the efficiency of the approach through a high detection accuracy, which becomes more pronounced when the number of malicious nodes increases. as future work, we are interested in developing a machine learning based ids for the detection of more rpl attacks. furthermore, we aim to extend this solution to anomaly detection in iot systems composed not only of wsn networks but also of cloud-based services.
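a minimal sketch of the detection pipeline described in this section: a one-class svm is trained on per-mote energy features from the attack-free scenario and then used to flag anomalous behavior in the attack scenarios. the feature layout, the synthetic numbers, the scaling step and the o-svm hyperparameters are illustrative assumptions, not the exact settings of the experiments:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# stand-in for the scenario-1 (attack-free) power-tracking features:
# one row per observation, columns = [tx energy, rx energy, radio-on energy, interfered energy]
normal = rng.normal(loc=[5.0, 4.0, 20.0, 0.5], scale=0.3, size=(1000, 4))

# stand-in for an attack scenario: some motes show inflated tx/rx energy
attack = np.vstack([
    rng.normal(loc=[5.0, 4.0, 20.0, 0.5], scale=0.3, size=(900, 4)),  # still-normal motes
    rng.normal(loc=[9.0, 8.0, 28.0, 0.8], scale=0.5, size=(100, 4)),  # malicious-mote behavior
])

scaler = StandardScaler().fit(normal)
ids = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(scaler.transform(normal))

pred = ids.predict(scaler.transform(attack))  # +1 = normal, -1 = anomaly
print("anomaly detection ratio:", np.mean(pred == -1))
```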
references: smart hospital based on internet of things; smart hospital care system; middleware challenges for wireless sensor networks; study of the gateway of wireless sensor networks; smart e-health gateway: bringing intelligence to internet-of-things based ubiquitous healthcare systems; rpl in a nutshell: a survey; evaluating and analyzing the performance of rpl in contiki; rpl: ipv6 routing protocol for low-power and lossy networks; security considerations in the ip-based internet of things; an implementation and evaluation of the security features of rpl; ipv6 over low-power wireless personal area networks (6lowpans): overview, assumptions, problem statement, and goals; rank attack using objective function in rpl for low power and lossy networks; intrusion detection systems for wireless sensor networks: a survey; routing attacks and countermeasures in the rpl-based internet of things; a survey of intrusion detection in internet of things; active learning for wireless iot intrusion detection; genetic algorithm to improve svm based network intrusion detection system; proceedings of the 2009 international workshop on information security and application; iot emulation with cooja; read: reliable event and anomaly detection system in wireless sensor networks; rbf kernel based support vector machine with universal approximation and its application; a comparative study of anomaly detection techniques for smart city wireless sensor networks; the impacts of internal threats towards routing protocol for low power and lossy network performance; vera-version number and rank authentication in rpl; specification-based ids for securing rpl from topology attacks; svelte: real-time intrusion detection in the internet of things; secure parent node selection scheme in route construction to exclude attacking nodes from rpl network; intrusion detection system for rpl from routing choice intrusion; towards a trust computing architecture for rpl in cyber physical systems; a secure routing protocol based on rpl for internet of things; impact of rpl objective functions on energy consumption in ipv6 based wireless sensor networks. key: cord-163462-s4kotii8 authors: chaoub, abdelaali; giordani, marco; lall, brejesh; bhatia, vimal; kliks, adrian; mendes, luciano; rabie, khaled; saarnisaari, harri; singhal, amit; zhang, nan; dixit, sudhir title: 6g for bridging the digital divide: wireless connectivity to remote areas date: 2020-09-09 journal: nan doi: nan sha: doc_id: 163462 cord_uid: s4kotii8 in telecommunications, network sustainability as a requirement is closely related to equitably serving the population residing at locations that can most appropriately be described as remote. the first four generations of mobile communication ignored the remote connectivity requirements, and the fifth generation is addressing it as an afterthought. however, sustainability and its social impact are being positioned as key drivers of sixth generation's (6g) standardization activities. in particular, there has been a conscious attempt to understand the demands of remote wireless connectivity, which has led to a better understanding of the challenges that lie ahead. in this perspective, this article overviews the key challenges associated with constraints on network design and deployment to be addressed for providing broadband connectivity to rural areas, and proposes novel approaches and solutions for bridging the digital divide in those regions. in 2018, 55% of the global population lived in urban areas.
further, 67% of the total world's population had a mobile subscription, but only 3.9 billion people were using the internet, leaving 3.7 billion unconnected, with many of those living in remote or rural areas [1] . people in these regions are not part of the information era, and this digital segregation imposes several restrictions on their daily lives. children growing up without access to the latest communication technologies and online learning tools are unlikely to be competitive in the job and commercial markets. unreliable internet connections also hinder people in remote areas from benefiting from online commerce and engaging in the digital world, thereby compounding already existing social and economic inequalities. however, rural areas are now becoming more and more attractive, as the new coronavirus (covid-19) pandemic has shown, since it has reshaped our living preferences and pushed many people to work remotely from wherever makes them most comfortable [2] . such agglomerations where people live and work are referred to as "oases" in this paper. wireless connectivity in rural areas is expected to have a significant economic impact too. hence, the use of technology in farms and mines will increase the productivity and open new opportunities for local communities. technology will also provide better education, higher quality entertainment, increased digital social engagement, enhanced business opportunities, higher income, and efficient health systems to those living in the most remote zones. despite these premises, advances in the communication standards towards provisioning of wireless broadband connectivity to remote regions have been, so far, relegated to the very bottom, if not entirely ignored. the fundamental challenges are low return on investment, inaccessibility that hinders deployment and regular maintenance of network infrastructures, and lack of favorable spectrum and critical infrastructure such as backhaul and power grid. in these regards, despite being in its initial stages, the 6th generation (6g) of wireless networks is building upon the leftover from the previous generations [3] , and will be developed by taking into account the peculiarities of the remote and rural sector, with the objective of providing connectivity for all and reaching digital inclusion [4] .
specifically, the research community should ensure that this critical market segment is not overlooked in favor of the more appealing research areas such as artificial intelligence (ai), machine learning (ml), terahertz communications, 3d augmented reality (ar)/virtual reality (vr), and haptics. boosting remote connectivity can start by addressing spectrum availability issues. licenced spectrum in sub-1 ghz, in fact, is a cumbersome and costly resource, and may require new frequency reuse strategy in remote regions because of their unique requirements. utilization of locally unexploited frequencies and unlicensed bands judiciously may help in reducing the overall cost, thereby making remote connectivity a viable business opportunity. advanced horizontal and vertical spectrum sharing models, along with enhanced co-existence schemes, are two other powerful solutions to improve signal reach in these areas. innovative business and regulator models may be suitable, to encourage new players, such as community-based micro-operators, to build and operate the local networks. local, flexible and pluralistic spectrum licensing could be the way forward to boost the remote market. another issue is that remote areas may not have ample connectivity to the power sources. hence, it is imperative that 6g solutions for remote areas are designed as self-reliant in terms of their power/energy requirements, and/or with the capability to scavenge from the surrounding, possibly scarce, resources. governments can assuage this situation to an extent by making it attractive for the profit-wary service providers to deploy solutions in remote areas. revised government policies and appropriate business models should be parallelly explored as they have direct implications on the technology requirements. environmentally-friendly thinking should also be included throughout the chain of energy consumed from mining to manufacturing and recycling. moreover, the abundant renewable sources need to be integrated into power systems at all scales for sustainable energy provisioning. remote maintenance of network infrastructures and incorporation of some degrees of self-healing capability is also very important since it might be difficult to access remote areas due to a difficult terrain, harsh weather, or lack of transportconnectivity. suitable specifications for fault tolerance and fallback mechanisms need therefore to be incorporated. based on the above introduction, the objective of this article is two-fold: (i) highlight the challenges that hinder progress in the development and deployment of solutions for catering to the remote areas, and (ii) suggest novel approaches to address those challenges. in particular, the paper targets the 6g mobile standard, such that these important issues are considered into the design process from the very beginning. we deliberately skip a detailed literature survey, because a clear and comprehensive review is provided in [4] . we focus, rather, on discussing the requirements and the corresponding challenges, and proposing novel approaches to address some of those issues. a summary of these challenges and possible solutions is shown in fig 1. the rest of the article is organized as follows. sec. ii discusses the question of how future 6g can deliver affordable connectivity to remote users. sec. iii provides a range of promising technical solutions capable of facilitating access to broadband connectivity in remote locations. sec. 
iv promotes the use of a variety of dynamic spectrum access schemes and suggests how they can evolve to meet the surging needs in the unconnected areas. sec. v presents approaches for integrating infrastructure sharing, renewable sources, and emerging energy-efficient technologies to boost optimal and environmentally friendly power provision. sec. vi presents innovative ways to simplify maintenance operations in hardto-reach zones. finally, the conclusions are summarized in sec. vii. one of the biggest impediments to connecting the unconnected part of the world is the high costs involved and the prevailing low income of the target population. fortunately, there are many affordable emerging alternatives in 6g which may bring new possibilities, as enumerated in this section. dedicated remote-centred connectivity layer. besides 5g's typical service pillars (i.e., embb, ulrrc, and mmtc), 6g should introduce a fourth service grade with basic connectivity target key performance indicators (kpis). however, this remote mode cannot be just a plain version of the urban 6g, since it has to be tailored to the specificities of the remote sector. some kpis relevant to remote connectivity scenarios like coverage and cost-effectiveness need to be expanded, whereas the new service class needs more relaxed constraints in terms of some conventional 5g performance metrics like throughput and latency. this novel service class should have its dedicated slice and endowed with specific and moderate levels of edge and caching capabilities: the involved data can then be processed on edge, local or central data centers for better scalability, as illustrated in fig. 2 . accordingly, such connectivity services can be charged at reduced prices. local access in remote areas can be designed to aggregate multiple and heterogeneous rats. remote streams can then be split over one or more rats, thus allowing flexibility and providing the highest performance possible at minimal cost in everyday life and work. at the same time, digitalization in remote areas calls for large coverage solutions (e.g., tv or gsm white spaces (wss)) to increase the number of users within a base station and helps reduce the network deployment and management costs, albeit at some performance trade-offs. radio frequency (rf) solutions can be complemented by the emerging optical wireless communications (owcs). in particular, short range visible light communications (vlcs) category operating over the visible spectrum can boost the throughput in indoor, fronthaul and underwater environments (see fig. 3 ) while serving the intuitive goal of illumination making it a cost-efficient technology. low-cost networking and end-user devices. one way to reduce cost is the exploitation of legacy infrastructure. tv stations can be shared with mobile network operators (mnos) to provide both tower and electricity. the latest developments in wireless communications can be applied in outdoor power line communication (plc) to provide high data rate connectivity over the high and medium voltages power lines, increasing the capability of the backhaul networks in remote areas. existing base stations and the already-installed fibers alongside roads or embedded inside electrical cables can also serve as a backhaul solution for connectivity in rural regions. end-user devices and modems should also be affordable and usable everywhere, i.e., when people move or travel to different places under harsh conditions. 
therefore, the possibility to use off-the-shelf equipment at both the user's and network's sides is important and integration with appropriate software stacks is welcome to reduce capital and operational expenditures (capex and opex). the remote infrastructure is likely to be deployed by small internet service providers (isps) and the cost of specialized hardware equipment is an issue to be overcome. open source approaches allow mnos to choose common hardware from any vendor and implement the radio access network (ran) and core functionalities using software defined radio (sdr) and software defined networking (sdn) frameworks. moreover, virtualized and cloudified network functions may reduce infrastructure, maintenance and upgrade costs [5] . these solutions are especially interesting for new players building the remote network from scratch, to foster the inter-operability and cost-effectiveness of hardware and software. however, this field still requires further research and development work before commercial deployment. in remote areas in order to provide long-lived broadband connectivity, a minimum service quality must be continuously guaranteed. in this perspective, this section reviews potential solutions to promote resilient service accessibility in rural areas. multi-hop network elasticity. the access network has, over generations, become multi-hop to provide flexibility in the architecture design, despite some increase in complexity. given the typical geographic, topographic, and demographic constraints of present scenarios, performance levels (e.g., coverage, latency, and bandwidth) of individual hops can be made adaptive. the idea is to extend performance elasticity beyond air-interface to include other hops in the ran. the same approach can be brought to backhaul connections (see fig. 2 ). similarly, rural cell boundaries experiencing poor coverage can reap the elasticity benefits through the use of device-to-device (d2d) communications as depicted in fig. 3 . network protocols should be extended to include static-(e.g., location-based) besides temporal-quality adaptation to handle variations in channel quality over time. wireless backhaul solutions. service accessibility in rural areas involves prohibitive deployment expenditures for network operators and requires high-capacity backhaul connections for several different use cases. fig. 2 provides a comprehensive overview of potential backhaul solutions envisioned in this paper to promote remote connectivity. on one side, laying more fiber links substantially boost broadband access in those areas, but at the expense of increased costs. plc connections, on the other side, provide ease of reach at lower costs making use of ubiquitous wired infrastructures as a physical medium for data transmission, but some inherent challenges related to harsh channel conditions and connected loads are still to be overcome. fig. 2 illustrates also how, even though the use of conventional microwave and satellite links can fulfill the performance requirements of hard-to-reach zones, emerging long-range wireless technologies, such as tv and gsm ws systems, are capable of delivering the intended service over longer distances with less power while penetrating through difficult terrain like mountains and lakes. another recent trend is building efficient cost-effective backhaul links using software-defined technology embedded into off-the-shelf multi-vendor hardware to connect the unconnected remote communities (e.g., oasis 1 in fig. 2) . 
recently, the research community has also investigated integrated access and backhaul (iab) as a solution to replace fiber-like infrastructures with self-configuring easier-to-deploy relays operating through wireless backhaul using part of the access link radio resources [6] . for example, the tv ws tower in fig. 2 may use the tv spectrum holes to provide both access to oasis 3 and connection to the backhaul link for oasis 4. iab has lower complexity as compared to fiber-like networks and facilitates site installation in rural areas where cable buildout is difficult and costly. the potential of the iab paradigm is magnified when wireless backhaul is realized at millimeter waves (mmwaves), thus exploiting a much larger bandwidth than in sub-6-ghz systems. moreover, mmwave iab enables multiplexing the access and backhaul data within the same bands, thereby removing the need for additional hardware and/or spectrum license costs. nowadays, free space optical (fso) links are being considered as a powerful full-duplex and license-free alternative to increase network footprint in isolated areas with challenging terrains. however, fso units are very sensitive to optical misalignment. for instance, the hop1 fso unit depicted in fig. 2 should be permanently and perfectly aligned with the fso unit installed in the hop3 location. in-depth research in spherical receivers and beam scanning is hence needed to improve the capability of intercepting laser lights emanating from multiple angles. physical-layer solutions for front/mid/backhaul. even though wireless backhauling can reduce deployment costs, service accessibility in rural regions still requires a minimum number of fiber infrastructures to be already deployed. fiber capacity can hence be increased if existing wavelength division multiplexing networks are migrated to elastic optical networks (eons) by technology upgradation at nodes; the outdated technology of urban regions may then be reused to establish connectivity in under-served rural regions without significant investment. besides backhaul, midhaul and fronthaul should also be improved by ai/ml-based solutions providing cognitive capabilities for prudent use of available licensed and unlicensed spectrum [7] . this is especially useful in remote areas where the sparse distribution of users may result in spectrum holes. the unlicensed spectrum, in particular, can provide significant cost-savings for service delivery and improve network elasticity. new possibilities including evolved multiple access schemes and waveforms, like non-orthogonal multiple access (noma) for mmtc, should be investigated; this technology is particularly interesting for internet of things (iot) services where some sensors are close to and some far away from a base station [8] . ai/ml can be also exploited to control physical and link layers for smooth and context-aware modulation and coding schemes (mcss) transitions, even though this approach would need to be lightweight to reduce cost and maintenance, and optimized for the intended market segment. non-terrestrial network solutions. network densification in rural areas is complicated by the heterogeneous terrain that may be encountered when installing fibers between cellular stations. 
to solve this issue, 6g envisions the deployment of non-terrestrial networks (ntns) where air/spaceborne platforms like unmanned aerial vehicles (uavs), high altitude platform stations (hapss), and satellites provide ubiquitous global connectivity when terrestrial infrastructures are unavailable [9] . potential beneficiaries of this trend are shown in fig. 3 , including inter-regional transport, farmlands, ships, mountainous areas, and remote maintenance facilities. the evolution towards ntns will be favored by architectural advancements in the aerial/space industry (e.g., through solid-state lithium batteries and gallium nitride technologies), new spectrum developments (e.g., by transitioning to mmwave and optical bands), and novel antenna designs (e.g., through reconfigurable phased/inflatable/fractal antennas realized with metasurface material). despite these premises, however, there are still various challenges that need to be addressed, including those related to latency and coverage constraints. ntns can also provide remote-ready, low-cost (yet robust), and long-range backhaul solutions for terrestrial devices with no wired backhaul. self-organizing networks (sons). to explicitly address the problem of network outages (e.g., due to backhaul failure), which are very common in remote locations, 6g should transition towards sons implementing network slicing, dynamic spectrum management, edge computing, and zero-touch automation functionalities. this approach provides extra degrees of freedom for combating service interruptions, and improves network robustness. in this context, ai/ml can help both the radio access and backhaul networks to self-organize and self-configure themselves, e.g., to discover each other, coordinate, and manage the signaling and data traffic. we now present some promising solutions to address spectrum availability issues, which currently pose a serious impediment to broadband connectivity in remote areas. leveraging cognitive radio networks. one of the major barriers for network deployment in rural areas is spectrum licensing, since participation in spectrum auctions is typically difficult, from an economic point of view, for small isps. in this perspective, new licensing schemes can foster the cognitive radio approach, allowing local isps to deploy networks in areas where large operators are not interested in providing their service [10] . spectrum awareness mechanisms, e.g., geolocation databases and spectrum sensing, can be used to inform network providers about vacant spectrum in a given area, as well as providing protection against unauthorized transmissions and unpredictable propagation conditions. for instance, fig. 3 shows how tv and gsm ws towers can expand the connectivity beyond the rural households to reach more distant locations like farms and wilderness areas. spectrum co-existence. sub-6 ghz frequencies remain critical for remote connectivity thanks to their favourable propagation properties and wide reach. in these crowded bands, spectrum re-farming and inter/intra-operator spectrum sharing can considerably increase spectrum availability [10] . nevertheless, coverage gaps and low throughput in the legacy bands call for advanced multi-connectivity schemes to combine frequencies above and below 6 ghz. using advanced carrier aggregation techniques in 6g systems, the resource scheduling unit can choose the optimal frequency combination(s) according to service requirements, device capabilities, and network conditions.
the proposed model offers a scalable bandwidth that maintains service continuity in case of connectivity loss in those spectrum bands that are more sensitive to surrounding relief, atmospheric effects, and water absorption: for example fig. 3 illustrates a scenario in which vital facilities in rural communities enjoy permanent connectivity using the lower bands in case of communication failure on the higher bands. likewise, multi-connectivity provides diversity, improved system resilience, and situation awareness by establishing multiple links from separate sources to one destination. this aggregation can be achieved at various protocol and/or architecture levels ranging from the radio link up to the core network, allowing effortless deployments of elastic networks in areas difficult to access. utilizing unlicensed bands. a combination of licensed and unlicensed bands has been acknowledged by many stan-dardization organizations to improve network throughput and capacity in unserved/under-served rural areas, as depicted in fig. 3 . while the fcc has recently released 1.2 ghz in the precious 6 ghz bands to expand the unlicensed spectrum, the huge bandwidth available at millimeter-and terahertzwave bands will further support uplink and downlink split, in addition to hybrid spectrum sharing solutions that can adaptively orchestrate network operations in the licensed and unlicensed bands. high frequencies require line of sight (los) for proper communication, complicating harmonious operation with lower bands. accordingly, time-frequency synchronization, as well as control procedures and listening mechanisms, like listen-before-talk (lbt), need to evolve towards more cooperative and distributed protocols to avoid misleading spectrum occupancy. the management of uncoordinated competing users in unlicensed bands will emerge as important issue, and it needs to be addressed in 6g networks. regional licenses and micro-operators. deployment of terrestrial networks for remote areas is challenging due to terrain, lack of infrastructure and personnel. network operators would then rather roam their services from telecommunication providers already operating in those areas than building their own infrastructure. however, such an approach may entail the need for advanced horizontal (between operators of the same priorities) and vertical (when stakeholders of various priorities coexist) spectrum/infrastructure sharing frameworks. solutions like license shared access (lsa, in europe) and spectrum access system (sas, in the us) are mature examples of such an approach with two-tiers and three-tiers of users, respectively. this can evolve to include n-tiers of users belonging to m different mnos. an example of a fourtiered access is provided in fig. 3 . from the top, we find the e-safety services with the highest priority, a tier-2 layer devoted to e-learning sessions and e-government transactions, a middle-priority tier-3 layer for iot use cases that generate sporadic traffic, and a final lower-priority tier-4 layer that uses the remainder of the available spectrum (e.g., for ecommerce services). such solutions, however, need to be supported by innovative business and regulatory models to motivate new market entrants (e.g., micro-operators, which are responsible for last-mile service delivery and infrastructure management) to offer competitive and affordable services in remote zones [11] . 
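as a toy illustration of the n-tier prioritized access sketched above, the following python sketch grants channels to higher-priority tiers first and leaves lower tiers unserved when spectrum is scarce; the tier names, channel counts and preemption rule are hypothetical and are not part of any lsa/sas specification:

```python
# toy prioritized spectrum allocator: lower tier number = higher priority
# tier 1 = e-safety, 2 = e-learning/e-government, 3 = iot, 4 = e-commerce (as in the example above)
# (illustrative only; real lsa/sas frameworks involve databases, sensing and policy rules)

def allocate(requests, n_channels):
    """requests: list of (tier, user); returns {user: granted channel index or None}."""
    ordered = sorted(requests, key=lambda r: r[0])  # serve the highest-priority tiers first
    grants = {}
    for position, (tier, user) in enumerate(ordered):
        grants[user] = position if position < n_channels else None  # lower tiers may be left out
    return grants

demo = [(4, "shop-kiosk"), (1, "ambulance-link"), (3, "soil-sensor"), (2, "school-stream")]
print(allocate(demo, n_channels=3))
# -> only the tier-4 user is left without a channel when spectrum is scarce
```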
power supply is among the highest expenses of mnos and a major bottleneck for ensuring reliable connectivity in remote areas. mnos' profitability and reliable powering can be improved following (a combination of) these solutions, as summarized in fig. 4 . infrastructure sharing. local communication/power operators, as well as various stakeholders such as companies, manufacturers, governmental authorities and standardization bodies, should build an integrated design which entails a joint network development process right from the installation phase. in particular, the different players should cooperate to avoid deploying several power plants for different use cases, thus saving precious (already limited) economic resources for other types of expenses. efficient and optimal energy usage. the 6g remote area solutions should be energy efficient and allow base and relay stations to minimize power consumption while guaranteeing affordable yet sufficient service for residents [12] . in particular, energy efficiency should target iot sensors' design and deployment, since the use of a massive number of iot devices, e.g., to boost farming and other activities such as environmental monitoring, is expected to significantly increase in the near future. at the moment, these efforts have been made after the standardization work was completed, but 6g should include efficient use of energy during the standardization process itself. techniques like cell zooming relying on power control and adaptive coverage can be reused at various network levels for flexible, energy-saving front/mid/backhaul layouts. ai/ml techniques can be very helpful in these scenarios. for example, the traffic load statistics on each node can be monitored to choose the optimal cell sleeping and on/off switching strategies to deliver increased power efficiency in all the involved steps of communication. technological breakthroughs. in addition to obvious energy sources such as solar, wind, and hydraulics, energy harvesting from ambient resources (e.g., electromagnetic signals, vibration, movement, sound, and heat) could provide a viable and efficient solution by enabling energy-constrained nodes to scavenge energy while performing simultaneous wireless information and power transfer [13] . another recent advancement promoting energy-efficient wireless operations is the use of intelligent reflecting surfaces (irss), equipped with a large number of passive elements smartly coordinated to reflect any incident signal to its intended destination without necessitating any rf chain. although still in its infancy, this technology offers significant advantages in making the propagation conditions in harsh remote areas more favorable with substantial energy savings [14] . turning to intelligent and affordable maintenance, operations, administration and management (oam) functionalities and dedicated maintenance for each network component are of paramount importance to overall system performance and user experience in traditional commercial 4g/5g networks. this comes at the expense of complicated and costly tasks, especially in hard-to-reach areas. in this section we present innovative ideas to enable intelligent and cost-effective maintenance in 6g networks deployed in rural regions. network status sensing and diagnosing. traditionally, the oam system is adopted for network status monitoring with a major drawback, i.e., manual post-processing and reporting time delay due to huge amounts of gathered data.
to enable an intelligent and predictive maintenance, network diagnostics relying on ai-based techniques is advised [15] . with the development of edge computing technologies, multi-level sensing can be employed to achieve near-real time processing and multi-dimensional information collection within a tolerable reporting interval. for instance, processing operations related to short-term network status could be mostly done at the edge node to ensure fast access to this vital information in rural zones. network layout planning and maintenance. as mentioned in the previous sections, the network in remote areas is mainly composed of cost-effective nodes along the path from the access to the core parts (e.g., radio, centralized and distributed units, iab-donors and relays) that need to be organized in either single or multiple hops. in this situation, the whole system will be harmed if one of these nodes experiences an accidental failure. to enhance the resilience of such networks, more flexible and intelligent network layout maintenance is required. more precisely, using evolved techniques such as sons (see sec. iii), the link among each couple of nodes within the network can be permanently controlled and dynamically substituted or restored in case of an outage (see fig. 5 ). additionally, since a big part of the next generation mobile network is virtualized, appropriate tools or even a dedicated server may be needed for automatic software updates monitoring, periodic backups and scheduled maintenance to avoid or at least minimize the need for on-site intervention in those remote facilities. automatic fallback mechanisms can also be scheduled to downgrade the connectivity to another technology under bad network conditions, e.g., by implementing appropriate multi-connectivity schemes, as described in sec. iv. network performance optimization. network optimization in rural areas should take into account remote-specific requirements and constraints. for example, access to the edge resources, which are finite and costly and can be rapidly exhausted, should be optimized taking into consideration the intended services, terminal capabilities, and charging policy of the network and its operator(s). a summary of the maintenance life cycle in remote and rural areas is shown in fig. 5 . in particular, after intelligently building and processing relevant system information data sets, maintenance and repair activities (e.g. system updates or operational parameters optimization) can be performed remotely and safely using 3d virtual environments such as ar and vr. the problem of providing connectivity to rural areas will be a pillar of future 6g standardization activities. in this article we discuss the challenges and possible approaches to addressing the needs of the remote areas. it is argued that such service should be optimized for providing a minimum fallback capability, while still providing full support for spatiotemporal service scalability and graceful quality degradation. we also give insights on the constraints on network design and deployment for rural connectivity solutions. we claim that optimally integrating ntn and fso technologies along the path from the end-point to the core element using open software built on the top of off-the-shelf hardware can provide low-cost broadband solutions in extremely harsh and inaccessible environments, and can be the next disruptive technology for 6g remote connectivity. 
integration of outdated technology should also be provisioned, so that it may be innovatively used to serve remote areas. such provisions should extend to integrating open and off-the-shelf solutions to fully benefit from the resulting cost advantages. spectrum, regulatory, and standardization issues are also discussed because of their importance for achieving the goal of remote area connectivity. it is fair to say that including remote connectivity requirements in the 6g standardization process will lead to more balanced and universal social as well as digital equality.
references:
-6g white paper on connectivity for remote areas
-the covid-19 pandemic and its implications for rural economies
-toward 6g networks: use cases and technologies
-a key 6g challenge and opportunity - connecting the base of the pyramid: a survey on rural connectivity
-wireless personal communications
-integrated access and backhaul in 5g mmwave networks: potential and challenges
-the roadmap to 6g: ai empowered wireless networks
-application of non-orthogonal multiple access in wireless sensor networks for smart agriculture
-a comprehensive simulation platform for space-air-ground integrated network
-5g technology: towards dynamic spectrum sharing using cognitive radio networks
-business models for local 5g micro operators
-closing the coverage gap - how innovation can drive rural connectivity
-a critical review of roadway energy harvesting technologies
-towards smart and reconfigurable environment: intelligent reflecting surface aided wireless network
-mechanical fault diagnosis and prediction in iot based on multi-source sensing data fusion
key: cord-103150-e9q8e62v authors: mishra, shreya; srivastava, divyanshu; kumar, vibhor title: improving gene-network inference with graph-wavelets and making insights about ageing associated regulatory changes in lungs date: 2020-11-04 journal: biorxiv doi: 10.1101/2020.07.24.219196 sha: doc_id: 103150 cord_uid: e9q8e62v using a gene-regulatory-network based approach for single-cell expression profiles can reveal unprecedented details about the effects of external and internal factors. however, noise and batch effects in sparse single-cell expression profiles can hamper correct estimation of dependencies among genes and regulatory changes. here we devise a conceptually different method using graph-wavelet filters for improving gene-network (gwnet) based analysis of the transcriptome. our approach improved the performance of several gene-network inference methods. most importantly, gwnet improved consistency in the prediction of gene-regulatory-networks using single-cell transcriptomes even in the presence of batch effects. consistency of predicted gene-networks enabled reliable estimates of changes in the influence of genes not highlighted by differential-expression analysis. applying gwnet to the single-cell transcriptome profile of lung cells revealed biologically relevant changes in the influence of pathways and master-regulators due to ageing. surprisingly, the regulatory influence of ageing on pneumocytes type ii cells showed noticeable similarity with patterns due to the effect of novel coronavirus infection in the human lung.
inferring gene-regulatory-networks and using them for system-level modelling is being widely used for understanding the regulatory mechanism involved in disease and development. the interdependencies among variables in the network is often represented as weighted edges between pairs of nodes, where edge weights could represent regulatory interactions among genes. gene-networks can be used for inferring causal models [1] , designing and understanding perturbation experiments, comparative analysis [2] and drug discovery [3] . due to wide applicability of network inference, many methods have been proposed to estimate interdependencies among nodes. most of the methods are based on pairwise correlation, mutual information or other similarity metrics among gene expression values, provided in a different condition or time point. however, resulting edges are often influenced by indirect dependencies owing to low but effective background similarity in patterns. in many cases, even if there is some true interaction among a pair of nodes, its effect and strength is not estimated properly due to noise, background-pattern similarity and other indirect dependencies. hence recent methods have started using alternative approaches to infer more confident interactions. such alternative approach could be based on partial correlations [4] or aracne's method of statistical threshold of mutual information [5] . 1 single-cell expression profiles often show heterogeneity in expression values even in a homogeneous cell population. such heterogeneity can be exploited to infer regulatory networks among genes and identify dominant pathways in a celltype. however, due to the sparsity and ambiguity about the distribution of gene expression from single-cell rna-seq profiles, the optimal measures of gene-gene interaction remain unclear. hence recently, sknnider et al. [6] evaluated 17 measures of association to infer gene co-expression based network. in their analysis, they found two measures of association, namely phi and rho as having the best performance in predicting co-expression based gene-gene interaction using scrna-seq profiles. in another study, chen et al. [7] performed independent evaluation of a few methods proposed for genenetwork inference using scrna-seq profiles such as scenic [8] , scode [9] , pidc [10] . chen et al. found that for single-cell transcriptome profiles either generated from experiments or simulations, these methods had a poor performance in reconstructing the network. performance of such methods can be improved if gene-expression profiles are denoised. thus the major challenge of handling noise and dropout in scrna-seq profile is an open problem. the noise in single-cell expression profiles could be due to biological and technical reasons. the biological source of noise could include thermal fluctuations and a few stochastic processes involved in transcription and translation such as allele specific expression [11] and irregular binding of transcription factors to dna. whereas technical noise could be due to amplification bias and stochastic detection due to low amount of rna. raser and o'shea [12] used the term noise in gene expression as measured level of its variation among cells supposed to be identical. 
raser and o'shea categorised potential sources of variation in geneexpression in four types : (i) the inherent stochasticity of biochemical processes due to small numbers of molecules; (ii) heterogeneity among cells due to cell-cycle progression or a random process such as partitioning of mitochondria (iii) subtle micro-environmental differences within a tissue (iv) genetic mutation. overall noise in gene-expression profiles hinders in achieving reliable inference about regulation of gene activity in a cell-type. thus, there is demand for pre-processing methods which can handle noise and sparsity in scrna-seq profiles such that inference of regulation can be reliable. the predicted gene-network can be analyzed further to infer salient regulatory mechanisms in a celltype using methods borrowed from graph theory. calculating gene-importance in term of centrality, finding communities and modules of genes are common downstream analysis procedures [2] . just like gene-expression profile, inferred gene network could also be used to find differences in two groups of cells(sample) [13] to reveal changes in the regulatory pattern caused due to disease, environmental exposure or ageing. in particular, a comparison of regulatory changes due to ageing has gained attention recently due to a high incidence of metabolic disorder and infection based mortality in the older population. especially in the current situation of pandemics due to novel coronavirus (sars-cov-2), when older individuals have a higher risk of mortality, a question is haunting researchers. that question is: why old lung cells have a higher risk of developing severity due to sars-cov-2 infection. however, understanding regulatory changes due to ageing using gene-network inference with noisy single-cell scrna-seq profiles of lung cells is not trivial. thus there is a need of a noise and batch effect suppression method for investigation of the scrna-seq profile of ageing lung cells [14] using a network biology approach. here we have developed a method to handle noise in gene-expression profiles for improving genenetwork inference. our method is based on graphwavelet based filtering of gene-expression. our approach is not meant to overlap or compete with existing network inference methods but its purpose is to improve their performance. hence, we compared other output of network inference methods with and without graph-wavelet based pre-processing. we have evaluated our approach using several bulk sample and single-cell expression profiles. we further investigated how our denoising approach influences the estimation of graph-theoretic properties of gene-network. we also asked a crucial question: how the gene regulatory-network differs between young and old individual lung cells. further, we compared the pattern in changes in the influence of genes due to ageing with differential expression in covid infected lung. our method uses a logic that cells (samples) which are similar to each other, would have a more similar expression profile for a gene. hence, we first make a network such that two cells are connected by an edge if one of them is among the top k nearest neighbours (knn) of the other. after building knn-based network among cells (samples), we use graph-wavelet based approach to filter expression of one gene at a time (see fig. 1 ). for a gene, we use its expression as a signal on the nodes of the graph of cells. we apply a graph-wavelet transform to perform spectral decomposition of graph-signal. 
after graph-wavelet transformation, we choose the threshold for wavelet coefficients using sureshrink and bayesshrink or a default percentile value determined after thorough testing on multiple data-sets. we use the retained values of the coefficient for inverse graph-wavelet transformation to reconstruct a filtered expression matrix of the gene. the filtered gene-expression is used for gene-network inference and other down-stream process of analysis of regulatory differences. for evaluation purpose, we have calculated inter-dependencies among genes using 5 different co-expression measurements, namely pearson and spearman correlations, φ and ρ scores and aracne. the biological and technical noise can both exist in a bulk sample expression profile ( [12] ). in order to test the hypothesis that graph-based denoising could improve gene-network inference, we first evaluated the performance of our method on bulk expression data-set. we used 4 data-sets made available by dream5 challenge consortium [15] . three data-sets were based on the original expression profile of bacterium escherichia coli and the single-celled eukaryotes saccharomyces cerevisiae and s aureus. while the fourth data-set was simulated using in silico network with the help of genenetweaver, which models molecular noise in transcription and translation using chemical langevin equation [16] . the true positive interactions for all the four data-sets are also available. we compared graph fourier based low passfiltering with graph-wavelet based denoising using three different approaches to threshold the waveletcoefficients. we achieved 5 -25 % improvement in score over raw data based on dream5 criteria [15] with correlation, aracne and rho based network prediction. with φ s based gene-network prediction, there was an improvement in 3 out of 4 dream5 data-sets ( fig. 2a) . all the 5 network inference methods showed improvement after graphwavelet based denoising of simulated data (in silico) from dream5 consortium ( fig. 2a) . moreover, graph-wavelet based filtering had better performance than chebyshev filter-based low pass filtering in graph fourier domain. it highlights the fact that even bulk sample data of gene-expression can have noise and denoising it with graph-wavelet after making knn based graph among samples has the potential to improve gene-network inference. moreover, it also highlights another fact, well known in the signal processing field, that wavelet-based filtering is more adaptive than low pass-filtering. in comparison to bulk samples, there is a higher level of noise and dropout in single-cell expression profiles. dropouts are caused by non-detection of true expression due to technical issues. using low-pass filtering after graph-fourier transform seems to be an obvious choice as it fills in a background signal at missing values and suppresses high-frequency outlier-signal [17] . however, in the absence of information about cell-type and cellstates, a blind smoothing of a signal may not prove to be fruitful. hence we applied graph-wavelet based filtering for processing gene-expression dataset from the scrna-seq profile. we first used scrna-seq data-set of mouse embryonic stem cells (mescs) [18] . in order to evaluate network inference in an unbiased manner, we used gene regulatory interactions compiled by another research group [19] . our approach of graph-wavelet based pre-processing of mesc scrna-seq data-set improved the performance of gene-network inference methods by 8-10 percentage (fig. 2b) . 
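the thresholding options mentioned above (a hard percentile cutoff, and soft thresholding in the spirit of bayesshrink or visushrink) can be sketched for a single vector of wavelet coefficients as follows; these are the standard wavelet-shrinkage estimators, sureshrink is omitted for brevity, and the details may differ from the gwnet implementation.

```python
import numpy as np

def hard_threshold(c, q=70.0):
    """gwnet-style default rule: zero coefficients below the q-th percentile of |c|."""
    t = np.percentile(np.abs(c), q)
    return np.where(np.abs(c) >= t, c, 0.0)

def soft_threshold(c, t):
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def bayes_shrink(c):
    """bayesshrink-style threshold: sigma_noise**2 / sigma_signal (standard estimator)."""
    sigma = np.median(np.abs(c)) / 0.6745          # robust noise estimate
    sigma_x = np.sqrt(max(np.var(c) - sigma**2, 0.0))
    t = np.abs(c).max() if sigma_x == 0 else sigma**2 / sigma_x
    return soft_threshold(c, t)

def visu_shrink(c):
    """universal (visushrink) threshold: sigma * sqrt(2 * log(n))."""
    sigma = np.median(np.abs(c)) / 0.6745
    return soft_threshold(c, sigma * np.sqrt(2.0 * np.log(len(c))))
```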
however, most often, the gold-set of interaction used for evaluation of gene-network inference is incomplete, which hinders the true assessment of improvement. figure 1 : the flowchart of gwnet pipeline. first, a knn based network is made between samples/cell. a filter for graph wavelet is learned for the knn based network of samples/cells. gene-expression of one gene at a time is filtered using graph-wavelet transform. filtered gene-expression data is used for network inference. the inferred network is used to calculate centrality and differential centrality among groups of cells. figure 2 : improvement in gene-network inference by graph-wavelet based denoising of gene-expression (a) performance of network inference methods using bulk gene-expression data-sets of dream5 challenge. three different ways of shrinkage of graph-wavelet coefficients were compared to graph-fourier based low pass filtering. the y-axis shows fold change in area under curve(auc) for receiver operating characteristic curve (roc) for overlap of predicted network with golden-set of interactions. for hard threshold, the default value of 70% percentile was used. (b) performance evaluation using single-cell rna-seq (scrna-seq) of mouse embryonic stem cells (mescs) based network inference after filtering the gene-expression. the gold-set of interactions was adapted from [19] (c) comparison of graph wavelet-based denoising with other related smoothing and imputing methods in terms of consistency in the prediction of the gene-interaction network. here, phi (φ s ) score was used to predict network among genes. for results based on other types of scores see supplementary figure s1 . predicted networks from two scrna-seq profile of mesc were compared to check robustness towards the batch effect. hence we also used another approach to validate our method. for this purpose, we used a measure of overlap among network inferred from two scrna-seq data-sets of the same cell-type but having different technical biases and batch effects. if the inferred networks from both data-sets are closer to true gene-interaction model, they will show high overlap. for this purpose, we used two scrnaseq data-set of mesc generated using two different protocols(smartseq and drop-seq). for comparison of consistency and performance, we also used a few other imputation and denoising methods proposed to filter and predict the missing expression values in scrna-seq profiles. we evaluated 7 other such methods; graph-fourier based filtering [17] , magic [20] , scimpute [21] , dca [22] , saver [23] , randomly [24] , knn-impute [25] . graphwavelet based denoising provided better improvement in auc for overlap of predicted network with known interaction than other 7 methods meant for imputing and filtering scrna-seq profiles (supplementary figure s1a ). similarly in comparison to graph-wavelet based denoising, the other 7 methods did not provided substantial improvement in auc for overlap among gene-network inferred by two data-sets of mesc (fig. 2c , supplementary figure s1b ). however, graph wavelet-based filtering improved the overlap between networks inferred from different batches of scrna-seq profile of mesc even if they were denoised separately (fig. 2c , supplementary figure s1b ). with φ s based edge scores the overlap among predicted gene-network increased by 80% due to graph-wavelet based denoising (fig. 2c ). 
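the two evaluation strategies used here (overlap of ranked edges with a gold-standard interaction set, and consistency between networks inferred from two batches) can be sketched as follows. the function names and the number of reference edges are illustrative assumptions, and a standard auroc is used, whereas the dream5-style evaluation described in the methods replaces the false-positive-rate axis with the number of top-ranked edges.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def edge_auc(score, gold):
    """auroc for ranked edges of a predicted network (genes x genes score matrix)
    against a binary gold-standard adjacency (e.g. known interactions or ppi);
    the gold set must contain both positive and negative edges."""
    iu = np.triu_indices_from(score, k=1)
    return roc_auc_score(gold[iu].astype(int), np.abs(score[iu]))

def batch_consistency(score_a, score_b, top=10000):
    """overlap between two predicted networks (e.g. smart-seq vs drop-seq mesc):
    treat the top edges of network b as a reference and score network a against it."""
    iu = np.triu_indices_from(score_b, k=1)
    ref = np.zeros_like(score_b, dtype=bool)
    order = np.argsort(-np.abs(score_b[iu]))[:top]
    ref[iu[0][order], iu[1][order]] = True
    return edge_auc(score_a, ref)
```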
the improvement in overlap among networks inferred from two batches hints that graph-wavelet denoising is different from imputation methods and has the potential to substantially improve gene-network inference using their expression profiles. improved gene-network inference from single-cell profile reveal agebased regulatory differences improvement in overlap among inferred genenetworks from two expression data-set for a cell type also hints that after denoising predicted networks are closer to true gene-interaction profiles. hence using our denoising approach before estimat-ing the difference in inferred gene-networks due to age or external stimuli could reflect true changes in the regulatory pattern. such a notion inspired us to compare gene-networks inferred for young and old pancreatic cells using their scrna-seq profile filtered by our tool [26] . martin et al. defined three age groups, namely juvenile ( 1month-6 years), young adult (21-22 years) and aged (38-54 years) [26] . we applied graph-wavelet based denoising of pancreatic cells from three different groups separately. in other words, we did not mix cells from different age groups while denoising. graph-wavelet based denoising of a singlecell profile of pancreatic cells caused better performance in terms of overlap with protein-protein interaction (ppi) (fig. 3a , supplementary figure s2a ). even though like chen et al. [7] we have used ppi to measure improvement in genenetwork inference, it may not be reflective of all gene-interactions. hence we also used the criteria of increase in overlap among predicted networks for same cell-types to evaluate our method for scrnaseq profiles of pancreatic cells. denoising scrnaseq profiles also increased overlap between inferred gene-network among pancreatic cells of the old and young individuals (fig. 3b , supplementary figure s2b ). we performed quantile normalization of original and denoised expression matrix taking all 3 age groups together to bring them on the same scale to calculate the variance of expression across cells of every gene. the old and young pancreatic alpha cells had a higher level of median variance of expression of genes than juvenile. however, after graph-wavelet based denoising, the variance level of genes across all the 3 age groups became almost equal and had similar median value (fig. 3c ). notice that, it is not trivial to estimate the fraction of variances due to transcriptional or technical noise. nonetheless, graph-wavelet based denoising seemed to have reduced the noise level in single-cell expression profiles of old and young adults. differential centrality in the co-expression network has been used to study changes in the influence of genes. however, noise in single-cell expression profiles can cause spurious differences in centrality. hence we visualized the differential degree of genes in network inferred using young and old cells scrna-seq profiles. the networks inferred from non-filtered expression had a much higher number of non-zero differential degree values in comparison to the de-noised version (fig. 3d, supplementary figure s2c ). thus denoising seems to reduce differences among centrality, which could be due to randomness of noise. next, we analyzed the properties of genes whose variance dropped most due to graphwavelet based denoising. surprisingly, we found that top 500 genes with the highest drop in variance due to denoising in old pancreatic beta cells were significantly associated with diabetes mellitus and hyperinsulinism. 
whereas, top 500 genes with the highest drop in variance in young pancreatic beta cells had no or insignificant association with diabetes (fig. 3e) . a similar trend was observed with pancreatic alpha cells (supplementary figure s2d ) . such a result hint that ageing causes increase in stochasticity of the expression level of genes associated with pancreas function and denoising could help in properly elucidating their dependencies with other genes. improvement in gene-network inference for studying regulatory differences among young and old lung cells. studying cell-type-specific changes in regulatory networks due to ageing has the potential to provide better insight about predisposition for disease in the older population. hence we inferred genenetwork for different cell-types using scrna-seq profiles of young and old mouse lung cells published by kimmel et al. [14] .the lower lung epithelia where a few viruses seem to have the most deteriorating effect consists of multiple types of cells such as bronchial epithelial and alveolar epithelial cells, fibroblast, alveolar macrophages, endothelial and other immune cells. the alveolar epithelial cells, also called as pneumocytes are of two major types. the type 1 alveolar (at1) epithelial cells for major gas exchange surface of lung alveolus has an important role in the permeability barrier function of the alveolar membrane. type 2 alveolar cells (at2) are the progenitors of type 1 cells and has the crucial role of surfactant production. at2 cells ( or pneumocytes type ii) cells are a prime target of many viruses; hence it is important to understand the regulatory patterns in at2 cells, especially in the context of ageing. we applied our method of denoising on scrnaseq profiles of cells derived from old and young mice lung [14] . graph wavelet based denoising lead to an increase in consistency among inferred genenetwork for young and old mice lung for multiple cell-types (fig. 4a) . graph-wavelet based denoising also lead to an increase in consistency in predicted gene-network from data-sets published by two different groups (fig. 4b) . the increase in overlap of gene-networks predicted for old and young cells scrna-seq profile, despite being denoised separately, hints about a higher likelihood of predicting true interactions. hence the chances of finding gene-network based differences among old and young cells were less likely to be dominated by noise. we studied ageing-related changes in pagerank centrality of nodes(genes). since pagerank centrality provides a measure of "popularity" of nodes, studying its change has the potential to highlight the change in the influence of genes. first, we calculated differential pagerank of genes among young and old at2 cells (supporting file-1) and performed gene-set enrichment analysis using enrichr [27] . the top 500 genes with higher pagerank in young at2 cells had enriched terms related to integrin signalling, 5ht2 type receptor mediated signalling, h1 histamine receptor-mediated signalling pathway, vegf, cytoskeleton regulation by rho gtpase and thyrotropin activating receptor signalling (fig. 4c) . we ignored oxytocin and thyrotropin-activating hormone-receptor mediated signalling pathways as an artefact as the expression of oxytocin and trh receptors in at2 cells was low. moreover, genes appearing for the terms "oxytocin receptor-mediated signalling" and "thyrotropin activating hormone-mediated signalling" were also present in gene-set for 5ht2 type receptormediated signalling pathway. 
we found literature support for activity in at2 cells for most of the enriched pathways. however, there were very few studies which showed their differential importance in old and young cells; for example, bayer et al. demonstrated mrna expression of several 5-htr, including 5-ht2, 5-ht3 and 5-ht4, in alveolar epithelial type ii (at2) cells and their role in calcium ion mobilization. similarly, chen et al. [28] showed that a histamine 1 receptor antagonist reduced pulmonary surfactant secretion from adult rat alveolar at2 cells in primary culture. the vegf pathway is active in at2 cells, and it is known that ageing has an effect on vegf-mediated angiogenesis in lung. moreover, vegf-based angiogenesis is known to decline with age [29]. figure 3 (caption fragment): for comparing two networks it is important to reduce differences due to noise; hence the plot here shows similarity of predicted networks before and after graph-wavelet based denoising. the results shown here are for the correlation-based co-expression network, while similar results are shown using the ρ score in supplementary figure s2. (c) variances of expression of genes across single-cells before and after denoising (filtering) are shown here. variances of genes in a cell-type were calculated separately for 3 different stages of ageing (young, adult and old). the variance (estimate of noise) is higher in older alpha and beta cells compared to young; however, after denoising the variance of genes in all ageing stages becomes nearly equal. (d) effect of noise on estimated differential centrality is shown here. the difference in the degree of genes in networks estimated for old and young pancreatic beta cells is shown; the number of non-zero differential-degree values estimated using denoised expression is lower than for networks based on unfiltered expression. (e) enriched panther pathway terms for the top 500 genes with the highest drop in variance after denoising in old and young pancreatic beta cells. we further performed gene-set enrichment analysis for genes with increased pagerank in older mice at2 cells. for the top 500 genes with higher pagerank in old at2 cells, the terms which appeared among the 10 most enriched in both kimmel et al. and angelids et al. data-sets were t cell activation, b cell activation, cholesterol biosynthesis and fgf signaling pathway, angiogenesis and cytoskeletal regulation by rho gtpase (fig. 4d). thus, there was 60% overlap in results from kimmel et al. and angelids et al. data-sets in terms of enrichment of pathway terms for genes with higher pagerank in older at2 cells (supplementary figure s3a, supporting file-2, supporting file-3). overall, in our analysis, inflammatory response genes showed higher importance in older at2 cells. the increase in the importance of cholesterol biosynthesis genes, hand in hand with a higher inflammatory response, points towards the influence of ageing on the quality of pulmonary surfactants released by at2 cells. al saedy et al. recently showed that a high level of cholesterol amplifies defects in surface activity caused by oxidation of pulmonary surfactant [30]. we also performed enrichr based analysis of differentially expressed genes in old at2 cells (supporting file-4). for genes up-regulated in old at2 cells compared to young, terms which reappeared were cholesterol biosynthesis, t cell and b cell activation pathways, angiogenesis and inflammation mediated by chemokine and cytokine signalling.
whereas few terms like ras pathway, jak/stat signalling and cytoskeletal signalling by rho gt-pase did not appear as enriched for genes upregulated in old at2 cells ( figure 3b , supporting file-4). however previously, it has been shown that the increase in age changes the balance of pulmonary renin-angiotensin system (ras), which is correlated with aggravated inflammation and more lung injury [31] . jak/stat pathway is known to be involved in the oxidative-stress induced decrease in the expression of surfactant protein genes in at2 cells [32] . overall, these results indicate that even though the expression of genes involved in relevant pathways may not show significant differences due to ageing, but their regulatory influence could be changing substantially. in order to further gain insight, we analyzed the changes in the importance of transcription factors in ageing at2 cells. among top 500 genes with higher pagerank in old at2 cells, we found several relevant tfs. however, to make a stringent list, we considered only those tfs which had nonzero value for change in degree among gene-network for old and young at2 cells. overall, with kimmel at el. data-set, we found 46 tfs with a change in pagerank and degree (supplementary table-1) due to ageing for at2 cells (fig. 4e) . the changes in centrality (pagerank and degree) of tfs with ageing was coherent with pathway enrichment results. such as etv5 which has higher degree and pagerank in older cells, is known to be stabilized by ras signalling in at2 cells [33] . in the absence of etv5 at2 cell differentiate to at1 cells [33] . another tf jun (c-jun) having stronger influence in old at2 cells, is known to regulate inflammation lung alveolar cells [34] . we also found jun to be having co-expression with jund and etv5 in old at2 cell (supplementary figure s4) . jund whose influence seems to increase in aged at2 cells is known to be involved in cytokine-mediated inflammation. among the tfs stat 1-4 which are involved in jak/stat signalling, stat4 showed higher degree and pagerank in old at2. androgen receptor(ar) also seem to have a higher influence in older at2 cells (fig. 4e ). androgen receptor has been shown to be expressed in at2 cells [35] . we further performed a similar analysis for the scrna-seq profile of interstitial macrophages(ims) in lungs and found literature support for the activity of enriched pathways (supporting file-5). whereas gene-set enrichment output for important genes in older ims had some similarity with results from at2 cells as both seem to have higher pro-inflammatory response pathway such as t cell activation and jak/stat signalling. however, unlike at2 cells, ageing in ims seem to cause an increase in glycolysis and pentose phosphate pathway. higher glycolysis and pentose phosphate pathway activity levels have been previously reported to be involved in the pro-inflammatory response in macrophages by viola et al. [36] . in our results, ras pathway was not enriched significantly for genes with a higher importance in older macrophages. such results show that the pro-inflammatory pathways activated due to aging could vary among different cell-types in lung. for the same type of cells, the predicted networks for old and young cells seem to have higher overlap after graph-wavelet based filtering. the label "raw" here means that, both networks (for old and young) were inferred using unfiltered scrna-seq profiles. wheres the same result from denoised scrna-seq profile is shown as filtered. 
networks were inferred using correlation-based co-expression. in current pandemic due to sars-cov-2, a trend has emerged that older individuals have a higher risk of developing severity and lung fibrosis than the younger population. since our analysis revealed changes in the influence of genes in lung cells due to ageing, we compared our results with expression profiles of lung infected with sars-cov-2 published by blanco-melo et al. [37] . recently it has been shown that at2 cells predominantly express ace2, the host cell surface receptor for sars-cov-2 attachment and infection [38] . thus covid infection could have most of the dominant effect on at2 cells. we found that genes with significant upregulation in sars-cov-2 infected lung also had higher pagerank in gene-network inferred for older at2 cells (fig. 5a) . we also repeated the process of network inference and calculating differential centrality among old and young using all types of cells in the lung together (supporting file-6). we performed gene-set enrichment for genes up-regulated in sars-cov-2 infected lung. majority of the 7 panther pathway terms enriched for genes up-regulated in sars-cov-2 infected lung also had enrichment for genes with higher pagerank in old lung cells (combined). total 6 out of 7 significantly enriched panther pathways for genes up-regulated in covid-19 infected lung, were also enriched for genes with higher pagerank in older at2 cells in either of the two data-sets used here (5 in angelids et al., 3 in kimmel et al. data-based results). among the top 10 enriched wikipathway terms for genes up-regulated in covid infected lung, 7 has significant enrichment for genes with higher pagerank in old at2 cells (supporting file-7). however, the term type-ii interferon signalling did not have significant enrichment for genes with higher pagerank in old at2 cells. we further investigated enriched motifs of transcription factors in promoters of genes up-regulated in covid infected lungs (supplementary methods). for promoters of genes up-regulated in covid infected lung top two enriched motifs belonged to irf (interferon regulatory factor) and ets family tfs. notice that etv5 belong to sub-family of ets groups of tfs. further analysis also revealed that most of the genes whose expression is positively cor-related with etv5 in old at2 cells is up-regulated in covid infected lung. in contrast, genes with negative correlation with etv5 in old at2 cells were mostly down-regulated in covid infected lung. a similar trend was found for stat4 gene. however, for erg gene with higher pagerank in young at2 cell, the trend was the opposite. in comparison to genes with negative correlation, positively correlated genes with erg in old at2 cell, had more downregulation in covid infected lung. such trend shows that a few tfs like etv5, stat4 with higher pagerank in old at2 cells could be having a role in poising or activation of genes which gain higher expression level on covid infection. inferring regulatory changes in pure primary cells due to ageing and other conditions, using singlecell expression profiles has tremendous potential for various applications. such applications could be understanding the cause of development of a disorder or revealing signalling pathways and master regulators as potential drug targets. hence to support such studies, we developed gwnet to assist biologists in work-flow for graph-theory based analysis of single-cell transcriptome. 
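the differential centrality analysis described above can be sketched with networkx; keeping only a fixed number of top-weighted edges and the choice top = 10000 are illustrative assumptions rather than the gwnet defaults.

```python
import numpy as np
import networkx as nx

def differential_pagerank(score_old, score_young, genes, top=10000):
    """difference in pagerank of each gene between networks inferred for old and
    young cells; scores are co-expression matrices (genes x genes)."""
    def pagerank_of(score):
        iu = np.triu_indices_from(score, k=1)
        vals = np.abs(score[iu])
        keep = np.argsort(-vals)[:top]               # keep only the strongest edges
        g = nx.Graph()
        g.add_nodes_from(range(len(genes)))
        g.add_weighted_edges_from(
            (int(iu[0][k]), int(iu[1][k]), max(float(vals[k]), 1e-12)) for k in keep)
        return nx.pagerank(g, weight="weight")
    p_old, p_young = pagerank_of(score_old), pagerank_of(score_young)
    return {genes[i]: p_old[i] - p_young[i] for i in range(len(genes))}
```

the resulting dictionary can then be sorted to pick, for example, the 500 genes with the largest increase in pagerank in old cells for enrichment analysis; differential degree can be computed analogously from the thresholded adjacency matrices.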
gwnet improves inference of regulatory interactions among genes using a graph-wavelet based approach to reduce noise due to technical issues or cellular biochemical stochasticity in gene-expression profiles. we demonstrated the improvement in gene-network inference using our filtering approach with 4 benchmark data-sets from the dream5 consortium and several single-cell expression profiles. using 5 different ways of inferring networks, we showed how our approach for filtering gene-expression can help gene-network inference methods. our results of comparison with other imputation and smoothing methods and with graph-fourier based filtering showed that graph-wavelet is more adaptive to changes in the expression level of genes with a changing neighborhood of cells. thus graph-wavelet based denoising is a conceptually different approach for pre-processing of gene-expression profiles. there is a huge body of literature on inferring gene-networks from bulk gene-expression profiles and utilizing them to find differences among two groups of samples. figure caption (fragment): shown for erg, which has higher pagerank in young at2 cells. most of the genes which had a positive correlation with etv5 and stat4 expression in old murine at2 cells were up-regulated in covid infected lung, whereas for erg the trend is the opposite: genes positively correlated with erg in old at2 cells had more down-regulation than genes with negative correlation. such results hint that tfs whose influence (pagerank) increases during ageing could be involved in activating or poising the genes up-regulated in covid infection. however, applying classical procedures on single-cell transcriptome profiles has not proved to be effective. our method seems to resolve this issue by increasing consistency and overlap among gene-networks inferred using expression from different sources (batches) for the same cell-type, even if each data-set was filtered independently. such an increase in overlap among predicted networks from independently processed data-sets from different sources hints that the estimated dependencies among genes come closer to true values after graph-wavelet based denoising of expression profiles. having network predictions closer to true values increases the reliability of comparing regulatory patterns among two groups of cells. moreover, recently chow and chen [39] have shown that age-associated genes identified using bulk expression profiles of the lung are enriched among those induced or suppressed by sars-cov-2 infection. however, they did not perform their analysis with a systems-level approach. our analysis highlighted ras and jak/stat pathways to be enriched for genes with stronger influence in old at2 cells and genes up-regulated in covid infected lung. ras/mapk signalling is considered essential for self-renewal of at2 cells [33]. similarly, the jak/stat pathway is known to be activated in the lung during injury [40] and to influence surfactant quality [32]. we have used murine aging-lung scrna-seq profiles; however, our analysis provides an important insight that regulatory patterns and master-regulators in old at2 cells are in such a configuration that they could be predisposing these cells to a higher level of ras and jak/stat signalling. androgen receptor (ar), which has been implicated in male pattern baldness and in the increased risk of males towards covid infection [41], had higher pagerank and degree in old at2 cells. however, further investigation is needed to associate ar with severity of covid infection due to ageing.
on the other hand, in young at2 cells, we find a high influence of genes involved in histamine h1 receptor-mediated signalling, which is known to regulate allergic reactions in lungs [42]. another benefit of our approach of analysis is that it can highlight a few specific targets for further study towards therapeutics. for example, a kinase that binds and phosphorylates c-jun, called jnk, is being tested in clinical trials for pulmonary fibrosis [43]. androgen deprivation therapy has been shown to provide partial protection against sars-cov-2 infection [44]. along the same line, our analysis hints that etv5 could also be considered as a drug target to reduce the effect of ageing-induced ras pathway activity in the lung. we used the term noise in gene-expression according to its definition by several researchers such as raser and o'shea [12]: the measured level of variation in gene-expression among cells supposed to be identical. hence we first made a base-graph (network) in which supposedly identical cells are connected by edges. for every gene, we use this base-graph and apply the graph-wavelet transform to get an estimate of the variation of its expression in every sample (cell) with respect to other connected samples, at different levels of graph-spectral resolution. for this purpose, we first calculated distances among samples (cells). to get a better estimate of distances among samples (cells), one can perform dimension reduction of the expression matrix using tsne [45] or principal component analysis. we considered every sample (cell) as a node in the graph and connected two nodes with an edge only when one of them was among the k nearest neighbours of the other. here we decide the value of k in the range of 10-50, based on the number of samples (cells) in the expression data-sets. thus we calculated the preliminary adjacency matrix using k-nearest neighbours (knn) based on the euclidean distance metric between samples of the expression matrix. we used this adjacency matrix to build a base-graph, in which each vertex corresponds to a sample and edge weights correspond to the euclidean distance between samples. the weighted graph g built using the knn based adjacency matrix comprises a finite set of vertices v corresponding to cells (samples), a set of edges e denoting connections between samples (if they exist), and a weight function giving non-negative weights to connections between cells (samples). this weighted graph can also be defined by an n × n (n being the number of cells) weighted adjacency matrix a, where a_ij = 0 if there is no edge between cells i and j, and a_ij = weight(i, j) otherwise. the degree of a cell in the graph is the sum of the weights of the edges incident on that cell, and the diagonal degree matrix d of this graph has entries d_ii = d(i), with all off-diagonal entries equal to 0. the non-normalized graph laplacian operator l for a graph is defined as l = d − a. the normalized form of the graph laplacian operator is defined as l_norm = d^(−1/2) (d − a) d^(−1/2) = i − d^(−1/2) a d^(−1/2). both laplacian operators produce different eigenvectors [46]. however, we have used the normalized form of the laplacian operator for the graph between cells. the graph laplacian is further used for graph fourier transformation of signals on nodes (see supplementary methods) [47] [46]. for filtering in the fourier domain, we used a chebyshev filter on the gene-expression profile: we took the expression of one gene at a time, considered it as a signal, and projected it onto the raw graph object (where each vertex corresponds to a sample) [17].
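a minimal numpy/scikit-learn sketch of the base-graph and normalized laplacian construction described above; the choice k = 20, the symmetrization rule, and the use of raw euclidean distances as edge weights follow the text where stated and are otherwise assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def cell_graph_laplacian(expr, k=20):
    """expr: cells x features matrix (e.g. after pca/t-sne); returns the symmetric
    knn adjacency, the normalized laplacian, and its eigendecomposition."""
    # knn graph with euclidean distances as edge weights; an edge is kept if either
    # cell is among the k nearest neighbours of the other (union, via maximum)
    dist = kneighbors_graph(expr, n_neighbors=k, mode="distance").toarray()
    A = np.maximum(dist, dist.T)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.diag(deg) - A                          # non-normalized laplacian l = d - a
    L_norm = d_inv_sqrt @ L @ d_inv_sqrt          # normalized laplacian
    eigvals, eigvecs = np.linalg.eigh(L_norm)     # graph fourier basis: λ_l, u_l
    return A, L_norm, eigvals, eigvecs

# tiny usage on random data: 200 cells embedded in 10 principal components
A, L_norm, lam, U = cell_graph_laplacian(np.random.default_rng(0).normal(size=(200, 10)))
f_hat = U.T @ np.random.default_rng(1).normal(size=200)   # fourier transform of one gene signal
```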
we took the forward fourier transform of the signal, filtered it using a chebyshev filter in the fourier domain, and then inverse-transformed it to calculate the filtered expression. the same procedure was repeated for every gene, finally giving us the filtered gene expression. the spectral graph wavelet transform entails choosing a non-negative real-valued kernel function g which behaves as a band-pass filter, similarly to the fourier transform. the re-scaled kernel function of the graph laplacian gives a wavelet operator which eventually produces graph wavelet coefficients at each scale. using continuous functional calculus, one can define a function of a self-adjoint operator on the basis of the spectral representation of the graph; for a graph with a finite-dimensional laplacian, this can be achieved using the eigenvalues and eigenvectors of the laplacian l [47]. the wavelet operator is given by t_g = g(l), and t_g f gives the wavelet coefficients for a signal f at scale = 1. this operator acts on the eigenvectors u_l as t_g u_l = g(λ_l) u_l. hence, for any graph signal f, the operator t_g acts by adjusting each graph fourier coefficient, so that the fourier coefficient of t_g f at λ_l is g(λ_l) f̂(λ_l), where f̂(λ_l) is the fourier coefficient of f for eigenvalue λ_l, and the inverse fourier transform gives (t_g f)(n) = Σ_l g(λ_l) f̂(λ_l) u_l(n). the wavelet operator at every scale s is given as t_g^s = g(sl). these wavelet operators are localized to obtain individual wavelets by applying them to δ_n, with δ_n being a signal with value 1 on vertex n and zero otherwise [47]. thus, considering the coefficients w_f(s, ·) = t_g^s f at every scale, the inverse transform can be obtained (in the least-squares sense) as f = ( Σ_s (t_g^s)* t_g^s )^(−1) Σ_s (t_g^s)* w_f(s, ·). here, instead of filtering in the fourier domain, we took wavelet coefficients of each gene-expression signal at different scales. thresholding was applied at each scale to filter the wavelet coefficients. we applied both hard and soft thresholding on the wavelet coefficients; for soft thresholding, we implemented the well-known methods sureshrink and bayesshrink. finding an optimal threshold for wavelet coefficients for denoising linear signals and images has remained a subject of intensive research. we evaluated both soft and hard thresholding approaches and tested an information-theoretic criterion known as the minimum description length (mdl) principle. using our tool gwnet, the user can choose from multiple options for finding the threshold, such as visushrink, sureshrink and mdl. here, we have used hard thresholding for most of the data-sets, as proper soft thresholding of graph-wavelet coefficients is itself a topic of intensive research and may need further fine-tuning. one can also choose the hard-threshold value based on the best overlap between the predicted gene-network and protein-protein interactions (ppi). while applying it to multiple data-sets, we observed that the threshold cutoffs estimated by the mdl criterion and by the best overlap of the predicted network with known interactions and ppi were in the range of the 60th-70th percentile. for comparing predicted networks from multiple data-sets, we needed a uniform percentile cutoff to threshold graph-wavelet coefficients. hence, for uniform analysis of several data-sets, we set the default threshold value to the 70th percentile; in default mode, wavelet coefficients with absolute value below the 70th percentile are set to zero. the gwnet tool is flexible, and any network inference method can be plugged into it for making regulatory inferences using a graph-theoretic approach. here, for single-cell rna-seq data, we have used gene-expression values in the form of fpkm (fragments per kilobase of exon model per million reads mapped).
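a compact numpy sketch of the multi-scale wavelet filtering and the default 70th-percentile hard threshold described above; the specific kernel g, the scales, the extra low-pass (scaling) band, and the least-squares synthesis are illustrative choices and may differ from the exact gwnet kernels. eigvals and eigvecs are assumed to come from the eigendecomposition of the normalized laplacian (for instance as returned by the previous sketch).

```python
import numpy as np

def sgwt_denoise(eigvals, eigvecs, signal, scales=(2.0, 1.0, 0.5), q=70.0):
    """denoise one gene's expression vector (a signal over cells) with graph wavelets."""
    lmax = float(eigvals.max())
    g = lambda x: x * np.exp(1.0 - x)                # simple band-pass kernel, g(0) = 0
    h = lambda x: np.exp(-(x / (0.3 * lmax)) ** 2)   # low-pass (scaling) kernel, h(0) = 1
    kernels = [h] + [(lambda x, s=s: g(s * x)) for s in scales]

    f_hat = eigvecs.T @ signal                       # graph fourier transform of the signal
    bands = []
    for j, k in enumerate(kernels):
        c = eigvecs @ (k(eigvals) * f_hat)           # wavelet coefficients at this scale
        if j > 0:                                    # hard-threshold the band-pass scales only
            thr = np.percentile(np.abs(c), q)        # default: 70th percentile
            c = np.where(np.abs(c) >= thr, c, 0.0)
        bands.append(c)

    # least-squares synthesis: combine all scales back into one fourier spectrum
    num = np.zeros_like(f_hat)
    den = np.zeros_like(f_hat)
    for k, c in zip(kernels, bands):
        kl = k(eigvals)
        num += kl * (eigvecs.T @ c)
        den += kl ** 2
    return eigvecs @ (num / np.maximum(den, 1e-12))

# tiny synthetic usage: a path graph of 100 "cells" with one noisy gene signal
n = 100
A = np.zeros((n, n)); i = np.arange(n - 1); A[i, i + 1] = A[i + 1, i] = 1.0
d = A.sum(1); D_isqrt = np.diag(1.0 / np.sqrt(d))
lam, U = np.linalg.eigh(D_isqrt @ (np.diag(d) - A) @ D_isqrt)
noisy = np.sin(np.linspace(0, 3, n)) + 0.3 * np.random.default_rng(0).standard_normal(n)
denoised = sgwt_denoise(lam, U, noisy)
```

leaving the low-pass band unthresholded preserves the smooth trend of the signal, while hard thresholding of the band-pass scales removes isolated high-frequency fluctuations across neighbouring cells.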
we pre-processed single-cell gene expression by quantile normalization and log transformation. to start with, we used spearman and pearson correlation to achieve a simple estimate of the inter-dependencies among genes. we also used aracne (algorithm for the reconstruction of accurate cellular networks) to infer networks among genes. aracne first computes mutual information for each gene-pair. then it considers all possible triplets of genes and applies the data processing inequality (dpi) to remove indirect interactions. according to the dpi, if gene i and gene j do not interact directly with each other but show dependency via gene k, the following inequality holds: i(g_i, g_j) ≤ min( i(g_i, g_k), i(g_k, g_j) ), where i(g_i, g_j) represents the mutual information between gene i and gene j. aracne also removes interactions with mutual information less than a particular threshold eps; we used a fixed eps value. recently, skinnider et al. [6] showed the superiority of two measures of proportionality, rho (ρ) and phi (φ_s) [48], for estimating gene co-expression networks using single-cell transcriptome profiles. hence we also evaluated the benefit of graph-wavelet based denoising of gene-expression with the measures of proportionality ρ and φ_s. the measure of proportionality φ can be defined as φ(g_i, g_j) = var(g_i − g_j) / var(g_i), where g_i is the vector containing log values of expression of gene i across multiple samples (cells) and var() represents the variance function. the symmetric version of φ can be written as φ_s(g_i, g_j) = var(g_i − g_j) / var(g_i + g_j), whereas rho can be defined as ρ(g_i, g_j) = 1 − var(g_i − g_j) / ( var(g_i) + var(g_j) ). to estimate both measures of proportionality, ρ and φ_s, we used the 'propr' package 2.0 [49]. the networks inferred from filtered and unfiltered gene-expression were compared to the ground truth. the ground truth for the dream5 challenge data-sets was already available, while for single-cell expression we assembled the ground truth from the hippie (human integrated protein-protein interaction reference) database [50]. we considered all possible edges in the network and sorted them based on the significance of their edge weights. we calculated the area under the receiver operating characteristic curve for both raw and filtered networks by comparing against edges in the ground truth. the receiver operating characteristic is a standard performance evaluation metric from the field of machine learning, which has been used in the dream5 evaluation method with some modifications. the modification here is that, for the x-axis, instead of the false-positive rate we used the number of edges sorted according to their weights. for evaluation, all possible edges, sorted by their weights, are taken from the gene-networks inferred from filtered and raw expression. we calculated improvement by measuring the fold change between raw and filtered scores. we compared the results of our approach of graph-wavelet based denoising with other methods meant for imputation or reducing noise in scrna-seq profiles. for comparison, we used graph-fourier based filtering [17], magic [20], scimpute [21], dca [22], saver [23], randomly [24] and knn-impute [25]. brief descriptions and the corresponding parameters used for the other methods are given in the supplementary methods. the bulk gene-expression data used here for evaluation was downloaded from the dream5 portal (http://dreamchallenges.org/project/dream-5network-inference-challenge/). the single-cell expression profile of mesc generated using different protocols [18] was downloaded from the geo database (geo id: gse75790).
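the proportionality measures and the aracne-style dpi pruning described above can be sketched as follows; the binned mutual-information estimator, the number of bins, and the eps default are illustrative simplifications of the actual aracne implementation, and the triplet loop is written for clarity rather than speed.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def proportionality(logx):
    """rho and symmetric phi for a genes x cells matrix of log expression values."""
    v = logx.var(axis=1, ddof=1)
    c = np.cov(logx)                                    # covariance between gene vectors
    rho = 1.0 - (v[:, None] + v[None, :] - 2.0 * c) / (v[:, None] + v[None, :])
    phi_s = (1.0 - rho) / (1.0 + rho)                   # equals var(gi - gj) / var(gi + gj)
    return rho, phi_s

def aracne_dpi(expr, bins=8, eps=0.0):
    """crude aracne-style network: binned mutual information plus dpi pruning."""
    n_genes = expr.shape[0]
    disc = np.array([np.digitize(e, np.histogram_bin_edges(e, bins)[1:-1]) for e in expr])
    mi = np.zeros((n_genes, n_genes))
    for i, j in combinations(range(n_genes), 2):
        mi[i, j] = mi[j, i] = mutual_info_score(disc[i], disc[j])
    mi[mi < eps] = 0.0
    keep = mi > 0
    for i, j, k in combinations(range(n_genes), 3):     # dpi: drop the weakest edge of each full triplet
        trio = [(mi[i, j], (i, j)), (mi[i, k], (i, k)), (mi[j, k], (j, k))]
        if all(w > 0 for w, _ in trio):
            _, (a, b) = min(trio)
            keep[a, b] = keep[b, a] = False
    return mi * keep
```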
single-cell expression profiles of pancreatic cells from individuals of different age groups were downloaded from the geo database (geo id: gse81547). the scrna-seq profile of murine aging lung published by kimmel et al. [14] is available with geo id: gse132901, while the aging lung scrna-seq data published by angelids et al. [51] is available with geo id: gse132901. the code for graph-wavelet based filtering of gene-expression is available at http://reggen.iiitd.edu.in:1207/graphwavelet/index.html. the codes are present at https://github.com/reggenlab/gwnet/ and supporting files are present at https://github.com/reggenlab/gwnet/tree/master/supporting_files.
references:
[1] an integrative approach for causal gene identification and gene regulatory pathway inference
[2] single-cell transcriptomics unveils gene regulatory network plasticity
[3] chemogenomic profiling of plasmodium falciparum as a tool to aid antimalarial drug discovery
[4] supervised, semi-supervised and unsupervised inference of gene regulatory networks
[5] reverse engineering cellular networks
[6] evaluating measures of association for single-cell transcriptomics
[7] evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data
[8] scenic: single-cell regulatory network inference and clustering
[9] scode: an efficient regulatory network inference algorithm from single-cell rna-seq during differentiation
[10] gene regulatory network inference from single-cell data using multivariate information measures
[11] characterizing noise structure in single-cell rna-seq distinguishes genuine from technical stochastic allelic expression
[12] noise in gene expression: origins, consequences, and control, science
[13] comparative assessment of differential network analysis methods
[14] murine single-cell rna-seq reveals cell-identity- and tissue-specific trajectories of aging
[15] wisdom of crowds for robust gene network inference
[16] genenetweaver: in silico benchmark generation and performance profiling of network inference methods
[17] enhancing experimental signals in single-cell rna-sequencing data using graph signal processing
[18] comparative analysis of single-cell rna sequencing methods
[19] a gene regulatory network in mouse embryonic stem cells
[20] recovering gene interactions from single-cell data using data diffusion
[21] an accurate and robust imputation method scimpute for single-cell rna-seq data
[22] single-cell rna-seq denoising using a deep count autoencoder
[23] saver: gene expression recovery for single-cell rna sequencing
[24] a random matrix theory approach to denoise single-cell data
[25] missing value estimation methods for dna microarrays
[26] single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns
[27] enrichr: interactive and collaborative html5 gene list enrichment analysis tool
[28] histamine stimulation of surfactant secretion from rat type ii pneumocytes
[29] aging impairs vegf-mediated, androgen-dependent regulation of angiogenesis
[30] dysfunction of pulmonary surfactant mediated by phospholipid oxidation is cholesterol-dependent
[31] age-dependent changes in the pulmonary renin-angiotensin system are associated with severity of lung injury in a model of acute lung injury in rats
[32] mapk and jak-stat signaling pathways are involved in the oxidative stress-induced decrease in expression of surfactant protein genes
[33] transcription factor etv5 is essential for the maintenance of alveolar type ii cells, proceedings of the national academy of sciences of the united states of america
[34] targeted deletion of jun/ap-1 in alveolar epithelial cells causes progressive emphysema and worsens cigarette smoke-induced lung inflammation
progressive emphysema and worsens cigarette smoke-induced lung inflammation androgen receptor and androgen-dependent gene expression in lung the metabolic signature of macrophage responses imbalanced host response to sars-cov-2 drives development of covid-19 single cell rna sequencing of 13 human tissues identify cell types and receptors of human coronaviruses the aging transcriptome and cellular landscape of the human lung in relation to sars-cov-2 jak-stat pathway activation in copd, the european androgen hazards with covid-19 the h1 histamine receptor regulates allergic lung responses late breaking abstract -evaluation of the jnk inhibitor, cc-90001, in a phase 1b pulmonary fibrosis trial androgen-deprivation therapies for prostate cancer and risk of infection by sars-cov-2: a population-based study (n = 4532) visualizing data using t-sne discrete signal processing on graphs: frequency analysis wavelets on graphs via spectral graph theory how should we measure proportionality on relative gene expression data? propr: an r-package for identifying proportionally abundant features using compositional data analysis hippie v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks an atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics we thank dr gaurav ahuja for providing us valuable advice on analysis of single-cell expression profile of ageing cells. none declared.vibhor kumar is an assistant professor at iiit delhi, india. he is also an adjunct scientist at genome institute of singapore. his interest include genomics and signal processing.divyanshu srivastava completed his thesis on graph signal processing for masters degree at computational biology department in iiit delhi, india. he has applied graph signal processing on protein structures and gene-expression data-sets.shreya mishra is a phd student at computational biology department in iiit delhi, india. her interest include data sciences and genomics. • we found that graph-wavelet based denoising of gene-expression profiles of bulk samples and singlecells can substantially improve gene-regulatory network inference.• more consistent prediction of gene-network due to denoising lead to reliable comparison of predicted networks from old and young cells to study the effect of ageing using single-cell transcriptome.• our analysis revealed biologically relevant changes in regulation due to aging in lung pneumocyte type ii cells, which had similarity with effects of covid infection in human lung.• our analysis highlighted influential pathways and master regulators which could be topic of further study for reducing severity due to ageing. key: cord-027851-95bsoea2 authors: wang, daojuan; schøtt, thomas title: coupling between financing and innovation in a startup: embedded in networks with investors and researchers date: 2020-06-25 journal: int entrep manag j doi: 10.1007/s11365-020-00681-y sha: doc_id: 27851 cord_uid: 95bsoea2 innovation may be a basis for starting a business, and financing is typically needed for starting. innovation and financing may conceivably be negatively related, or be unrelated, or plausibly be beneficially related. these possible scenarios frame the questions: what is the coupling between innovation and financing at inception, and what is the embeddedness of coupling in networks around the entrepreneur, specifically networks with investors and researchers? 
these questions are addressed with a globally representative sample of entrepreneurs interviewed at inception of their business. innovation and financing are found to be decoupled, typically; less frequently to be loosely coupled, and rarely to be tightly coupled. coupling is promoted by networking with both investors and researchers, with additive effects and with a synergy effect. by ascertaining coupling and its embeddedness in networks as a way for building capability in a startup, the study contributes to empirically supported theorizing about capability building. innovation may be the basis for starting a business. an entrepreneur may innovate something that is novel to some potential customers, or may use a technique that has not been used earlier, or may be doing something that only few other businesses are doing. in these ways, innovation may be a foundation for the startup. much scholarship studies innovation. a stream of research focuses on innovation as a basis for starting a business (e.g. weiblen and chesbrough 2015; colombelli et al. 2016; colombelli and quatraro 2017) . financing may be needed for starting. indeed, typically, a startup requires some financing. an entrepreneur may need little financing, but occasionally requires much financing. typically, an entrepreneur has some funds of their own for investing in the startup. frequently, an entrepreneur also requires some funding from other sources, such as family and friends. an entrepreneur often borrows from a loan organization, and sometimes obtains venture capital. a stream of research focuses on financing of startups (e.g. van osnabrugge and robinson 2000; hsu 2004; mason and stark 2004; croce et al. 2017) . innovation and financing may be unrelated in a startup. innovation may be accomplished before starting a business, or may be completed on a shoestring budget, and in these situations innovation and financing are unrelated. furthermore, potential investors may shy away from the riskiness of supporting innovation and prefer to invest in routine production, and also in this situation there is no relation between financing and innovation in a startup. in the extreme, if innovation is pursued without financing, and if a copy-cat startup attracts financing, then innovation and financing are even negatively related. conversely, innovation may require financing and the entrepreneur may obtain it, and an investor may prefer to finance an innovative rather than a routine startup. in such a situation, innovation and financing will go hand in hand and be coupled beneficially. these scenariosa negative coupling between innovation and financing, or no coupling between them, or a beneficial coupling between themrepresent a gap in our understanding of startups. this gap frames our first question for research: what is the coupling between innovation and financing at inception? a second issue is the sources of their coupling. an entrepreneur has networks that channel, enable and constrain the endeavor. an entrepreneur may seek financing by networking with formal and informal investors. an entrepreneur may pursue innovation by networking with researchers and inventors. and an entrepreneur may couple financing and innovation by networking with both investors and researchers. investors and researchers are not substitutable partners. rather, investors and researchers provide complementary resources, and there may even be a synergy between their inputs into a startup. 
such embeddedness of coupling is another gap in our understanding of new ventures. this gap frames our second question: what is the embeddedness of coupling in the networks around an entrepreneur, specifically the networks with investors and researchers? by addressing these two gaps in understanding a startup, this study makes several specific contributions. first, coupling of innovation and financing at inception is a way of building capability in the new business, and the study thus contributes to our understanding of capability building. second, this study fills a gap by investigating whether and how these two important elementsinnovation and financingare channeled, enabled and constrained by different networksespecially networks with the investors and researchersaround the entrepreneur. third, by focusing on the business founding stage, we overcome the methodological problems caused by hindsight bias, memory decay, and survivorship bias, which pervade retrospective studies. the following first reviews the theoretical background as a basis for developing hypotheses, then describes the research design, reports analyses, and concludes by relating findings to the literature of entrepreneurship, venture capital and investment. the phenomenon of financing and innovation being interrelated is conceptualized as a coupling. the concept of coupling is classical in studies of organizations (weick 1976; orton and weick 1990) . elements of an organization have a coupling, in that they tend to occur together and to be connected, intertwined, reciprocal, reinforcing, and mutually sustaining within the organization. the coupling has strength; it may be loose, in that the elements are rather independent of one another, or it may be tight, in that the elements are highly interdependent. loose coupling is often found in an educational organization, whereas tight coupling is more frequent in a firm (ibid.). here we apply the concept of coupling to the intertwining between two elements of a startup: financing and innovation. coupling is tight if the startup is financing and innovating simultaneously. coupling is loose if the startup is pursuing one of the two, but hardly the other. the elements may be termed decoupled if the startup is pursuing one of the two, but not the other at all. finally, of course, a startup may be without ambition of any kind; not pursuing any of the endeavors. an established business may benefit from self-reinforcing dynamics between innovation and financing. accomplished innovation may attract investors, and, reciprocally, financing is a means for innovation. for a nascent entrepreneur, however, the dynamics between financing and innovation are quite different. the entrepreneur is in the process of starting. there cannot yet be any reciprocal interaction between market feedback and the entrepreneur's learning and capability development. no market returns from sales can be employed to strengthen innovation capability. moreover, the business opportunity pursued by the entrepreneur is still up for evaluation and modification; the business is still an idea. nevertheless, at this formative stage, there are interactions between the entrepreneur and stakeholders, e.g., potential investors, inventors, and incubators. through this interaction, the entrepreneur modifies ideas and visions for the business and anticipates feasibility, outcomes and attractiveness (lichtenstein et al. 2006) . 
in this realm, the entrepreneur shapes strategic aspirations, and confidence in achieving specific strategic goals. ambition for financing and ambition for innovation tend to co-evolve as they build on similar underlying organizational strengths. hence, we may reasonably expect that in some resource environments around the entrepreneur there will be a presence of ambition for financing and simultaneously also ambition for innovation. moreover, it is reasonable to theorize that the more the entrepreneur is exposed to such environments, the more likely the entrepreneur's aspirations are to include elements of financing and innovation. this external influence on the entrepreneur's aspiration formation processes is represented by ansoff (1987) in his model of paradigmatic complexity of the strategy formation process. at a more general level, it is also represented in ajzen's model of planned behavior, in which stakeholders' resources, norms and expectations shape the entrepreneur's intentions, goals and aspirations (liñán and chen 2009) . a similar prediction can be made from social comparison theory (boyd and vozikis 1994) . because the entrepreneur's business is not yet tested in the market, the entrepreneur may rely on modeling and imitation as a source of self-efficacy. in this process beliefs in own capabilities, and thereby aspirations, will be assessed by successes and failures of similar others (wood and bandura 1989) . finally, the entrepreneur's aspirations may be influenced by persuasion through encouragement, even in situations where encouragement is given on unrealistic grounds (ibid.). these mechanisms may work through direct relationships held by the entrepreneur or indirectly as the entrepreneur observes and interprets stimuli from the wider environment. coupling may be pursued as a strategy. as a strategy it may be partly based on an analysis of strengths, weaknesses, opportunities and threats, swot, as business students learn and managers and owners apply. an entrepreneur can hardly estimate any of the elements with reasonable validity and reliability. but the entrepreneur is likely to discuss such matters with others, listen to them, and take their advice into consideration when pursuing financing and innovation. thus the coupling is likely to be influenced by the network around the entrepreneur, the network of people giving the entrepreneur advice on the new business. as suggested by literature on entrepreneurial opportunity and alertness, it is a social process to create and grow a new venture, entailing efforts by entrepreneurs to use their networks to mobilize and deploy resources to exploit an identified opportunity and achieve the success (ebbers 2014; adomako et al. 2018) . besides, "an important part of the nascent entrepreneurial process is a continuing evaluation of the opportunity, resulting in learning and changes in beliefs" (mccann and vroom 2015) . the pursuit of coupling is thus embedded in the advice network, which channels, enables and constrains beliefs and strategy, specifically pursuit of coupling of financing and innovation. the people giving advice to the entrepreneur are often drawn from a wide spectrum, both from the private sphere of family and friends and from the public sphere comprising the work-place, the professions, the market and the international environment (jensen and schøtt 2017 ). an entrepreneur's networking in the private sphere and networking in the public sphere differ in their consequences for the startup. 
networking in the public sphere promotes, whereas networking in the private sphere impedes, such business endeavors as innovation, exporting and expectations for growth (schøtt and sedaghat 2014; ashourizadeh and schøtt 2015; schøtt and cheraghi 2015). we here consider how such networking influences coupling of financing and innovation.

an entrepreneur's networking in the private sphere of family and friends may shape the coupling between innovation and financing through its influence on the ambition of the entrepreneur. the entrepreneur's family is often putting its wealth at risk in the startup, and is likely to be cautious and to caution the entrepreneur against being overly self-efficacious, overly optimistic about business opportunities, and overly risk-willing. furthermore, due to mutual trust, frequent contacts, intimacy and reciprocal commitments in such relationships (granovetter 1973, 1985; greve 1995; anderson et al. 2005), their influence tends to be deep and significant. when private sphere networking constrains the entrepreneurial mindset, the entrepreneur becomes less ambitious and will pursue less financing or less innovation, and will be especially reluctant to pursue financing for innovation. conversely, an entrepreneur without such a constraining network in the private sphere will plausibly feel rather free, and will more wishfully think of own capability, of own efficacy, of opportunities, and of risks, and consequently will be more ambitious and therefore also pursue both financing and innovation. family members and friends tend to move within the same circles as the entrepreneur (anderson et al. 2005). they know each other and are likely to have a high degree of social, cultural, educational and professional homophily (granovetter 1973, 1985; greve 1995). the members within such a network are likely to possess or access much overlapping information, and multiple redundant ties therefore often add little value when an entrepreneur is seeking novel resources/information and financing. the consideration concerning the private sphere leads us to hypothesize, hypothesis 1: networking within the private sphere reduces coupling between financing and innovation.

public sphere networking shaping coupling

an entrepreneur's networking for advice in the public sphere is drawn from the workplace, professions, market and the international environment. these formal and informal advisors are mostly business people and business-related people. they are likely to be more self-efficacious, optimistic about opportunities, and risk-willing than the entrepreneur's private sphere network. they are likely to influence the entrepreneur to be more self-efficacious, optimistic and risk-willing, and thereby more ambitious and more likely to pursue both financing and innovation. apart from such positive mindset influence, a diverse set of persons working in different public contexts with quite different knowledge bases, experiences, mental patterns, and associations enables the entrepreneur to access a broad array of nonredundant novel ideas and expanded financing opportunities (hsu 2005; burt 2004; dyer et al. 2008).
particularly, some critical contacts in the public sphere, such as venture capitalists, successful entrepreneurs, and business incubators, not only directly bring the nascent entrepreneur valuable suggestions, creative ideas, and financial resources simultaneously, but also play the role of business referrals and endorsements and further broaden the entrepreneur's opportunities for acquiring and enhancing innovation and financing capabilities (van osnabrugge and robinson 2000; mason and stark 2004; löfsten and lindelöf 2005; cooper and park 2008; ramos-rodríguez et al. 2010; croce et al. 2017) , generating a "snowballing effect". these arguments thus lead us to specify, hypothesis 2: networking in the public sphere promotes coupling between financing and innovation. investors, especially venture capitalists and angel investors, often appreciate and encourage innovation with financial support (kortum and lerner 2000; engel and keilbach 2007; bertoni et al. 2010) . investors frequently bring the entrepreneurs more than purely financial capital, such as their technical expertise, market knowledge, customer resources, strategic advices, and network augmentation (sapienza and de clercq 2000; mason and stark 2004; brown et al. 2018) . investors, angel investors and vcs like to syndicate their investments with others, and to share the investment risk and strengthen evaluating and monitoring capacities (kaplan and strömberg 2004; wong et al. 2009; brown et al. 2018 ), which will expand and strengthen their financial and innovation support. as observed by brown et al. (2018) , a key feature of the entrepreneurs who use equity crowdfunding is their willingness to innovate and they are very proficient at combining financial resources from different sources and drawing on the networks to alleviate and overcome their internal resource constraints. therefore, networking with these investors is likely to spur and enable the entrepreneur in risktaking and innovative behavior. meanwhile, being in the investors' circle, the entrepreneur is easily identified and accessed. in the networking process, the actors learn more about each other, trust emerges from repeated interactions, and then stimulates closer interpersonal interaction and mitigates the fear of opportunistic behaviors caused by information asymmetry (jensen and meckling 1976; de bettignies and brander 2007) . moreover, the endorsement by reputable investors can send a favorable signal to the investment market about the entrepreneur and the project, and attract more investors to join (see zip case by steier and greenwood 2000) . especially, as found by van osnabrugge and robinson (2000) , angel investors often have entrepreneurial and business operation experience, and have empathy for an innovative entrepreneur, and have the passion to help, and perform less due diligence but invest more by instincts. altogether, this may enhance the matching opportunity between innovative ideas and funding needs and investment desire, leading to a coupling between innovation and financing. therefore, we hypothesize, hypothesis 3: networking with potential investors promotes coupling between financing and innovation. timmons and bygrave (1986, p.170) identified a shared view between founders of innovative ventures and venture capitalists that "the roots of new technological opportunities depend upon a continuing flow of knowledge from basic research". thus researchers and inventors are generators and carriers of knowledge, intellectual property, and patents. 
by networking with them, the entrepreneur may acquire these innovative resources. codified and tacit knowledge is transferred in different ways, notably through education, consulting, and r&d-based project cooperation, and conversations. indeed, the benefits of networking with researchers or inventors is expressed in arrangements in innovation systems, such as the triple helix model (etzkowitz 2003) ; science parks (löfsten and lindelöf 2005) , entrepreneurial universities, incubators, research-based spin-offs, open innovation (etzkowitz 2003; rothaermel et al. 2007; enkel et al. 2009 ), and industrial ph.d. projects. these models, polices, organizational formats and education programs are proposed with the same strategic intention: to provide a nurturing environment, and link talent, technology, capital and know-how to spur innovation and commercialization of technology. networking with researchers and inventors not only enables the entrepreneur to tap into a broader research community, but may also sends a signal to the market about the quality and veracity of the project and its knowledge foundation, and may reduce the investors' worries about their investment (hsu 2004; murray 2004) , especially for an early-stage entrepreneur without established reputation and performance record, and particularly when the venture is innovative. therefore, we propose: hypothesis 4:networking with researchers promotes coupling between financing and innovation. networking with both investors and researchers can generate synergy leading to further coupling of innovation and financing as elaborated in the following. as argued above, networking with investors and with researchers or inventors separately can provide the entrepreneur with both financial resources, knowledge and talents for innovation. when the entrepreneur networks with both investors and researchers, the resources obtained from the two parties may generate an additional "positive loop effect", which means more sophisticated innovation brought by the ties with researchers and inventors attract more capital, and more capital available for r&d further enhance innovation aspiration, which again attract more capital and then more r&d investment, and then enhance innovation; in mutual reinforcement. moreover, networking with both an investor and a researcher, implies that when legitimacy is obtained from one of the two, this sends a signal to the other encouraging the other to bestow legitimacy on the entrepreneur, which may attract further financing and ideas for innovation. we may call this a "reinforced signaling effect". the ties with researchers, investors, and their network contacts help open up more relations for acquiring additional funds and knowledge like "reinforced snowball effect". timmons and bygrave (1986) had observed that there were geographical oases for incubating a bulk of innovative technological ventures, where the founders, entrepreneurs, technologists, and investors cluster. using a longitudinal case study, calia et al. (2007) illustrate how a technological innovation network (with the involvement of universities, venture investors, and banks) enables a case company to establish its business and to survive and grow. these synergies suggest an effect that is over and above the two separate effects of networking with investors and networking with researchers, hypothesis 5: networking with both investors and researchers further enhances coupling between financing and innovation. the hypothesized effects are illustrated in fig. 
1 . the world's entrepreneurs are surveyed by the global entrepreneurship monitor (bosma 2013) . in most countries covered in the period 2009 to 2014 the survey included questions about networking, financing and innovation. sampling gem samples adults in two stages. the first stage occurs when a country is included, namely when a national team is formed and joins gem to conduct the survey in its country. hereby 50 countries were covered where the essential questions were asked. these countries are drawn from a diversity of regions, cultures, economies, and levels of development, and form a sample of countries which is fairly representative of the countries around the world. the second stage of sampling is the fairly random sampling of adults within a country, and then identifying the starting entrepreneurs. entrepreneurs at inception are identified as those who are currently trying to start a business, have taken action to start, will own all or part of the business, and have not yet received, or just begun to receive, some kind of compensation. by this identification of entrepreneurs, this sample is 10,582 entrepreneurs who reported their networking, financing, and innovation. representativeness of sampling enables generalization to the world's starting entrepreneurs and their startups. financing of the startup was measured by asking the entrepreneur, how much money, in total, will be required to start this new business? please include both loans and equity/ownership investments. the amount is recorded in the local currency, an amount from 0 upward. to make this comparable across countries, the amount is normalized by dividing by the median for the country's responding entrepreneurs. then, to reduce the skew, we take the logarithm (first adding 1), a measure that runs from 0 for no financing, and then upward. this indicator of financing enters into the measurement of coupling. innovativeness in the startup was indicated by asking three questions, have the technologies or procedures required for this product or service been available for less than a year, or between one to five years, or longer than five years? will all, some, or none of your potential customers consider this product or service new and unfamiliar? right now, are there many, few, or no other businesses offering the same products or services to your potential customers? the answer to each question is here coded 0, 1, 2 for increasing innovativeness. the three measures are inter-correlated positively. the three measures are averaged as an index of innovation, running from 0 to 2. this index of innovation enters into the measurement of coupling. two business practices, here innovation and financing, are coupled in so far as they are pursued jointly. the coupling between two practices in a business is indicated by their co-occurrence at inception of the business. coupling between innovation and financing is high to the extent that innovation is high and financing is high. conversely, coupling is low when either of them is low. when the occurrence of each practice is measured on a scale from 0 upward, the coupling of the two practices is indicated by the product of the two measures: if financing is 0 or if innovation is 0, then coupling is 0. conversely, if both financing is high and innovation is high, then coupling is very high. the scale has no intrinsic meaning, so, for analyses, the measure of coupling is standardized. validity can be ascertained. 
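as a sketch of how the coupling measure described above could be assembled from survey responses, the snippet below normalizes financing within country, builds the innovation index from the three items coded 0 to 2, and takes their standardized product. this is a minimal pandas illustration; the column names and the data frame layout are assumptions for clarity, not the gem variable names.

```python
import numpy as np
import pandas as pd

def coupling_index(df):
    """df: one row per entrepreneur with columns
    'country', 'money_required', 'tech_new', 'customers_new', 'few_competitors'
    (the last three already coded 0, 1, 2 for increasing innovativeness)."""
    out = df.copy()
    # financing: divide by the country median, then log(1 + x) to reduce skew
    med = out.groupby("country")["money_required"].transform("median")
    out["financing"] = np.log1p(out["money_required"] / med)
    # innovation: average of the three items, running from 0 to 2
    out["innovation"] = out[["tech_new", "customers_new", "few_competitors"]].mean(axis=1)
    # coupling: product of the two, standardized because the scale has no intrinsic meaning
    raw = out["financing"] * out["innovation"]
    out["coupling"] = (raw - raw.mean()) / raw.std()
    return out
```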
coupling expectedly correlates positively with expectation for growth, as an indication of performance at inception. growth-expectation is indicated as expected number of persons working for the business when five years old (transformed logarithmically to reduce skew). the correlation is positive (.26 with p < .0005) confirming validity of the operationalization of coupling. the network around an entrepreneur is indicated by asking the entrepreneur to report on getting advice, various people may give you advice on your new business. have you received advice from any of the following? your spouse or life-companion? your parents? other family or relatives? friends? current work colleagues? a current boss? somebody in another country? somebody who has come from abroad? somebody who is starting a business? somebody with much business experience? a researcher or inventor? a possible investor? a bank? a lawyer? an accountant? a public advising services for business? a firm that you collaborate with? a firm that you compete with? a supplier? a customer? networking in the private sphere is measured as number of advisors among the four: spouse, parent, other family, and friends, a measure going from 0 to 3. networking with a researcher or inventor is measured dichotomously, 1 if advised by a researcher or inventor, and 0 if not. networking with a possible investor is measured dichotomously, likewise, 1 if advised by a possible investor, and 0 if not. the network with others in the public sphere is measured as number of advisors among the other 14, a measure going from 0 to 14 (jensen and schøtt 2017) . validity can be assessed. in the theoretical section we argued that private sphere networking is associated negatively, and public sphere networking is associated positively, with self-efficacy and opportunity-perception. these correlations all turn out to be as expected indicating validity of the operationalization of networks. the analysis controls for attributes of the entrepreneur and the business. gender is coded 0 for males and 1 for females. age is measured in years. education is indicated in years of schooling. motive for starting the business is either seeing a business opportunity or necessity to make a living, coded 1 and 0, respectively. owners is number of owners, transformed logarithmically to reduce skew. we also control for macro-level context in two respects, national wealth as gni per capita, and the elaboration of the national entrepreneurial eco-system, measured as the mean of the framework conditions measured by gem in its national expert survey (bosma 2013) . the population is the world's entrepreneurs, where a respondent is surveyed in a country. the data are thus hierarchical with two levels, individuals nested within countries. the country should be taken into account, both because level of activity, e.g. networking and innovation, differs among countries, and because behavior is similar within each country. these circumstances of country are taken into account in hierarchical linear modeling (snijders and bosker 2012) . hierarchical linear modeling is otherwise very similar to linear regression. notably, the effect of a condition is tested and estimated by a coefficient. hierarchical linear modeling is used in table 3 . the sample of 10,582 starting entrepreneurs is described by correlations, table 1 . furthermore, among the entrepreneurs, 9% were networking with a researcher or inventor, and 13% were networking with a potential investor. 
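a two-level model of this kind can be fitted with a random intercept for country, so that individuals are nested within countries. the formula below is only a sketch of such a specification, including the investor-by-researcher interaction that is used in the second model reported next; it is not the authors' estimation code, and the variable names (taken from the hypothetical data frame in the previous sketch) are assumptions.

```python
import statsmodels.formula.api as smf

# df: one row per entrepreneur, with the coupling index, networking measures,
# individual controls and country-level controls, plus a 'country' column
model = smf.mixedlm(
    "coupling ~ private_net + public_net + investor + researcher"
    " + investor:researcher + gender + age + education + opportunity"
    " + owners + gni_per_capita + eco_system",
    data=df,
    groups=df["country"],  # random intercept for country
)
result = model.fit()
print(result.summary())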
although these two kinds of networking are not common, they are not rare. these two kinds of networking tend to go hand in hand, unsurprisingly, and are also correlated with networking with others in the public sphere and networking in the private sphere, but none of these correlations are high. the correlations among variables of interest and between variables of interest and control variables are mostly weak, indicating that there is no problem of multicollinearity in the analysis.

coupling of innovation and financing is high to the extent that innovation is high and financing is high. conversely, coupling is low when either of them is low. to see whether coupling is typical, we cross-tabulate the startups according to their innovation and their financing, table 2. coupling is high in the startups where both innovation and financing are high, the bold-faced 12% in table 2. conversely, coupling is low in the startups where either innovation or financing is low, the italicized 10 + 12 + 9 + 8 + 8%. in between, coupling is medium where one is medium and the other is medium or high, the 12 + 14 + 15% in table 2. (numerical independent variables are standardized, then centered within country.) the table does not clearly display a tendency for innovation and financing to go hand in hand. indeed, the correlation between financing and innovation is .06 (p < .0005). thus there is a weak tendency for innovation and financing to co-occur, a coupling that is loose rather than tight.

coupling is affected by the various kinds of networks around the entrepreneur, we hypothesized. effects on coupling are estimated in the hierarchical linear model, table 3. hypothesis 1 is that coupling is affected negatively by networking in the private sphere. the effect is tested in the first model in table 3. this effect is negative, thus supporting hypothesis 1. hypothesis 2 is that coupling is affected positively by networking in the public sphere, with advisors other than investors and researchers. this effect is tested in the first model. the effect is positive, thus supporting hypothesis 2. hypothesis 3 is that coupling is affected positively by networking with potential investors. this effect is positive, thus supporting hypothesis 3. hypothesis 4 is that coupling is affected positively by networking with researchers. this effect is positive, supporting hypothesis 4. the effects of investors and of researchers are substantial, and the effects of networking in the public sphere and networking in the private sphere are of notable magnitude. hypothesis 5 is that coupling is affected positively by networking with investors together with researchers, as a synergy effect that is in addition to the separate effect of investors and the separate effect of researchers. this is tested by expanding the model to include the interaction term, the product of the dichotomy for networking with investors and the dichotomy for networking with researchers. the effect of the interaction is estimated in the second model in table 3. the interaction effect is positive, thus supporting hypothesis 5. the effect is actually of a magnitude that is quite substantial. in short, the five hypotheses are all supported.

the analyses have addressed the two research questions. what is the coupling between innovation and financing at inception? what is the embeddedness of coupling in the networks around the entrepreneur, specifically the networks with investors and researchers?
the questions have been addressed by a survey of a globally representative sample of entrepreneurs at inception of their startup. the representativeness of sampling implies that findings can be generalized to the world's starting entrepreneurs. the next sections discuss our findings concerning, first, coupling as a phenomenon, and, second, embeddedness of coupling in networks. coupling as a phenomenon was found to be infrequent, in that a typical startup does not pursue both financing and innovation. often, a startup is either innovative or well financed. rather few startups are both highly innovative and well financed. across startups, innovativeness and financing are positively correlated, but only weakly, indicating that innovation and financing have a coupling that is loose rather than tight (section 4.2). coupling between innovation and financing is a capability. pursuing such coupling in a startup is building an organizational capability. coupling goes beyond the capability to innovate and goes beyond the capability to finance starting. coupling is a competitive advantage in the competition among startups, a competition to enter the market, survive, expand and grow. coupling in a startup correlates positively with expectation to grow (section 3.2.3), indicating the benefit to be expected from coupling, and thus indicating that coupling is a competitive advantage. it is theoretically surprising to find that coupling is so loose, when coupling is a competitive advantage. but empirically it is less surprising, when we bear in mind that, typically at inception, financing is not invested in innovation, but is invested in production. a loose coupling could also be caused by information asymmetry, where the entrepreneurs who have creative idea and innovation capability cannot be identified by the investors. such interpretation can find some evidence in the study by shu et al. (2018) . alternatively, it could also be the entrepreneurs who have financing capability or financial resources lack the incentives, energy or capability to polish their ideas or projects but are eager to start the new ventures. these could be the so-called necessitydriven or desperate entrepreneurial activities, which are in contrast to opportunitydriven actions (mühlböck et al. 2018; fernández-serrano et al. 2018) . the study by mühlböck et al. (2018) , using the data from the global entrepreneurship monitor (gem), has provided some evidence. they observed that many entrepreneurs sprung up during the outbreak of the economic crisis, but these businesses were started even without (or with a negative) perception of business opportunities and entrepreneurial skills. the authors term this phenomenon as "nons-entrepreneurship" driven by necessity, meaning there are no other options for a job but only to start their own businesses. usually in such cases, the institutional environment is favorable. besides, according to their findings, there is a considerable share of such individuals among early stage entrepreneurs. additionally, we suspect that the coupling is very loose at the inception, because the competitive advantage of coupling has not yet taken effect at inception. the coupling will have an effect only later, we expect, namely as the startup competes in the market, for survival, expansion and growth. therefore, coupling is appropriately considered strategic building of capabilities. future research may advance our knowledge regarding such loose coupling. 
from these findings, we may learn at least two practical lessons. first, financing and innovation do not go hand in hand at the inception, although this is actually important for a new venture to succeed. the failure rate of entrepreneurial firms is high mainly due to the resource scarcity and financial constrains (colombo et al. 2014) . such loose coupling, as discussed above, could be caused by low participation willingness of the capital owners or a lack of effective channels for two sides to identify/know each other. policymakers may give special attention to these, and design some mechanisms, set up the rules, or provide the supports to attract or guide the capital into the inception phase and reduce the potential problem of information asymmetry between two sides. besides, with aforementioned potential reason of the presence of necessity-driven entrepreneurs that causes the loose coupling, both policymakers and investors are suggested to distinguish between the necessity-(especially desperate) and opportunitybased entrepreneurs and take actions. as considered by mühlböck et al. (2018) and confirmed by fernández-serrano et al. (2018) , those desperate or necessity-driven entrepreneurs with a lesser feasibility and skills may be less successful and thus less beneficial for the economy than opportunity-driven entrepreneurs. second, entrepreneurs, and especially nascent entrepreneurs, should pay attention to create such coupling, and networking can be an efficient way as will be discussed below. a recent study by rezaeizadeh et al. (2017) found interpersonal skills for networking is one of the top competences that the entrepreneurs should possess, and they suggest such competence development be included in university education. meanwhile, they suggest that continuous training programs with a network of proactive peers, engaged academics, and a wider business community will help sustain and develop entrepreneurial intentions and behaviors, as well as expand the entrepreneurs' networks. below, we discuss the network influence in more detail. coupling is channeled, enabled and constrained by networks around the entrepreneur. on a broader level, we may say networking capability is one of the important organizational capabilities, especially in the increasingly knowledge-intensive and turbulent economic environment, since different networks represent different conduits of information and resources that the organization can constantly access. thereby, the organization can become more flexible and adaptive. as also advocated by windsperger et al. (2018, p.671) , entrepreneurial networks should be used by the firms "to complement their resources and capabilities in order to realize static and dynamic efficiency advantages". networking is typically thought of as inherently beneficialthe more, the merrierbut some networking may be a waste of time and energy, and some networking may even be detrimental, so networking has its "dark side" (klyver et al. 2011) . networking in the private sphere was here found to be detrimental for coupling, as hypothesized. this finding is consistent with earlier studies, showing negative effects of networking in the private sphere upon outputs such as innovation, exporting, and expectation for growth of the business (schøtt and sedaghat 2014; ashourizadeh and schøtt 2015; cheraghi et al. 2014 ). more generally, whereas networking in the private sphere is beneficial for legitimacy and emotional support (liu et al. 
2019) , networking in the private sphere seems detrimental for outputs. network research should not presume that a network is homogenous (as presumed in the most common measure of an actor's social capital as number of contacts), but should distinguish between the dark side and the bright side of a network (klyver et al. 2011) . on the bright side, we found that an entrepreneur's networking in the public sphere i.e. in the workplace, professions, market, and international environmentis beneficial for coupling between innovation and financing. drawing advices from a wide spectrum in the public sphere, a wide spectrum of knowledgeable specialists (also apart from researchers and investors), enables the entrepreneur to combine various kinds of knowledge, information, and resources, which is beneficial for the simultaneous pursuits of innovation and financing. an entrepreneur's networking with a potential investor was also found to benefit the coupling between financing and innovation in the startup, as expected. as also expected, networking with a researcher benefits the coupling. over and above these two additive effects, coupling was found to be further enhanced by simultaneously networking with an investor and with a researcher, discerned as an interaction effect in a multivariate model. networking with an investor and networking with a researcher are not substitutable for one another, and their effects do not simply add up. rather, there is a synergy effect, a further enhancing effect over and above the two separate effects of networking with an investor and networking with a researcher. the theory of competitive advantage through structural holes in the network around an actor can help us understanding the synergy benefit (burt 1992a, b) . a focal actor has a structural hole in the network of contacts, when two contacts are not interrelated. the hole between the two implies that they cannot combine something from one with another thing from the other. the focal actor, however, can acquire something from one and another thing from the other, and can thereby combine the two things and, following schumpeterian thinking, the combination constitutes a competitive advantage in the competition among actors for new things. the literal meaning of 'entrepreneur' is going in between and taking a benefit, and in our study the entrepreneur is going between an investor and a researcher, and combining advice or investment from the former with advice or new idea from the latter, and thereby promotes a coupling of financing and innovation, a synergy that builds a capability and a competitive advantage. from the resource-based view (barney 1991; grant 1991) and the dynamic capabilities perspective (teece et al. 1997 ), a firm's resources and capabilities will determine its competitive advantage and value creation, and a firm needs to constantly adapt, renew, reconfigure and re-create its resources and capabilities to the volatile and competitive environment, so that a competitive advantage can be developed and maintained. however, the entrepreneurial firms, especially those at formation stage run by nascent entrepreneurs, usually lack the strategic resources and capabilities at the beginning, e.g., financial resources and financing capabilities, innovation resources and capabilities, business management skills, and have lesser competitive disadvantages. 
furthermore, the emergence and development of a new venture is a dynamic process with many uncertainties, requiring different resources, information, and knowledge at different time points (hayter 2016; steier and greenwood 2000) . different relationship networks, especially the professional ones discussed in this study, can provide new ventures with opportunities for continually accessing needed resources, forming a basis that enables coupling of financing and innovation, synergy creation from integrating various resources, develop and sustain the new venture's competitive advantages, and gain profit (see also davidsson and honig 2003; batjargal and liu 2004) . along the same lines, holding a relational governance view of competitive advantage, dyer and singh (1998) argue for the critical resources that enable the firm's competitive advantage to extend beyond firm boundaries and are embedded in inter-firm resources and routines, including such components as relation-specific assets, knowledge-sharing routines and complementary resources/capabilities. in summary, we may say networking and coupling capabilities are two crucial capabilities for the nascent entrepreneurs, on top of the others, for identifying, pursuing and creating market opportunities, and for attaining and sustaining the new ventures' competitive advantages. joining the discussion of the influence of strong vs. weak ties (or private vs. public networks) on the entrepreneurs, results of this study, falling in line with some of the research (granovetter 1973; davidsson and honig 2003; afandi et al. 2017) , further remind the entrepreneurs to be aware of potential detrimental effect of being overembedded in the private sphere network that is bringing information and resource redundancy and social obligation. rather, they are well advised to actively and judiciously pursue, develop, and maintain public sphere networking, especially the professional networks with the investors and researchers/inventors, which enable and promote the coupling between innovation and financing, and capability development in these regards. the entrepreneur network capability framework developed by shu et al. (2018) can be a good reference, four dimensions comprising network orientation, network building, network maintenance, and network coordination. network orientation should be in the first place, which means a person should be willing to develop and depend on social networks in own daily socialization, believe, pay special attention to and act on the norms of dependence, cooperation, and reciprocation. in terms of the orientation, as discussed, this study suggests the importance and benefits of widening and diversifying the entrepreneurs' social relations, especially being in and crossing different professional communities. however, most of the entrepreneurs may be not aware of this. for instance, a study of university spin-off by hayter (2016) found that early-stage academic entrepreneurs have their contacts mainly within academic communities that are typically located in their home institutions, and such homophilous ties would further constrain the entrepreneurial development. with clear orientation, the entrepreneur shall monitor surroundings and make effective investment to establish and expand the networks. however, as reminded by semrau and werner (2012) , it is not a good idea to extend the network size without boundaries because there is an opportunity cost of time and the cost can surpass the benefits that the networks can bring. 
our study further suggests that it is worthwhile to invest in developing the contacts at least in two communities, i.e., with capital holders and knowledgeable and new ideas generators, due to the unique and mutual-reinforced synergistic contributions to founding the new venture, as discussed earlier. it can happen that the nascent entrepreneurs have sufficient personal or family wealth to self-finance the start-up process. however, the entrepreneurs ought to remember that sometimes it is not the "capital" itself that makes the success of a new venture, but the capital-associated resources that help, i.e., from the sources providing capital. . the entrepreneurs ought to think about the other benefits that the investors could bring, such as commercialization competences, business management skills, reputation, more diverse network access, synergistic effect, as shown in several studies (van osnabrugge and robinson 2000; hsu 2004; mason and stark 2004; croce et al. 2017) . further, while network maintenance is to ensure stable and long-term exchange relationships with them, network cooperation is to manage multiple and dynamic relationships, and to mobilize and integrate resources. moreover, these results may also be relevant for well-established organizations that seek to enhance their innovation and financing capabilities and gain a competitive advantage, suggesting that strategically developing, managing and utilizing the bridging social ties may be an efficient way. at the individual level, this may encompass designing an incentive scheme and training program to improve the employees' entrepreneurial spirit, networking awareness and capability. at the organizational level, the firms should strategically manage inter-organizational relationships, both formal and informal, and build systems that can monitor the surroundings, and thereby identify and evaluate new business opportunities outside the organizational boundaries. relevant concepts, models and strategies can be, e.g., cooperative entrepreneurship (rezazadeh & nobari, 2018) and open innovation (enkel et al. 2009 ). as concluded by rezazadeh and nobari (2018) , cooperative entrepreneurship is likely to lead to improvement of firms' agility, customer relationship management, learning, innovation, and sensing capabilities. from a public policy perspective, the above results have important policy implications, stemming essentially from the contribution to innovation coming from networking with researchers, inventors and investors. if innovation and entrepreneurial businesses are important for economic development and for people's life, the study clearly suggests that public policy should be designed to encourage, facilitate and support business networking activities, researcher-business collaboration, and investorentrepreneur connections. besides, university education should be another focus by the policy-makers, since it can be an efficient way or a starting point to foster people's entrepreneurial spirits, develop the students' entrepreneurial competences, especially their networking and relationship management capabilities, and even provide some opportunities for them to develop their networks which may enable them to be an entrepreneur in the future. some strategies and models can be, as documented, the university-based incubation programs, entrepreneurship education programs, researchbased spin-off, and building entrepreneurial universities (clarysse and moray 2004; rothaermel et al. 2007; budyldina 2018) . 
our research design was to investigate coupling at inception of the startup. this design has the advantages of avoiding attrition when startups are abandoned and avoiding retrospection if interviews were to be conducted later. but the cross-sectional focus on inception implies that the fate of a startup and its coupling are unknown. coupling is presumably yielding a competitive advantage, but at inception this is not enacted. another limitation is that the data are from around 2014, so we have observed the same constraints confronted by other scholars of entrepreneur and entrepreneurship (e.g., mühlböck et al. 2018; fernández-serrano et al. 2018) . entrepreneurial behavior has changed since networking was surveyed by gem, and organizing is changing even more with the covid-19 pandemic. the limitations suggest further research on coupling. coupling appears important as a strategy for building capability and competitive advantage. therefore, an important research question is, what is the effect of coupling in a startup upon its ability to compete, survive, expand and grow? an indication of the effect of coupling upon growth was seen in the substantial correlation between coupling and expectation for growth of the business (section 3.2.3). but, of course, effects of coupling are far better ascertained through longitudinal research. the current covid-19 pandemic is an eco-systemic intervention that is changing competition and organizational behavior. based on our findings, we hypothesize current exits to be especially prevalent among entrepreneurs without coupling of financing and innovation, and we hypothesize that success is especially likely for entrepreneurs with a tight coupling between innovation and financing. such hypotheses may well be tested with some of the surveys that are underway in the wake of the pandemic. entrepreneurial alertness and new venture performance: facilitating roles of networking capability social capital and entrepreneurial process the role of family members in entrepreneurial networks: beyond the boundaries of the family firm the emerging paradigm of strategic behavior. 
strategic management journal exporting embedded in culture and transnational networks around entrepreneurs firm resources and sustained competitive advantage entrepreneurs' access to private equity in china: the role of social capital venture capital investments and patenting activity of high-tech start-ups: a micro-econometric firm-level analysis the global entrepreneurship monitor (gem) and its impact on entrepreneurship research the influence of self-efficacy on the development of entrepreneurial intentions and actions working the crowd: improvisational entrepreneurship and equity crowdfunding in nascent entrepreneurial ventures entrepreneurial universities and regional contribution structural holes the social structure of competition structural holes and good ideas innovation networks: from technological development to business model reconfiguration growth-expectations among women entrepreneurs: embedded in networks and culture in algeria, morocco, tunisia and in belgium and france a process study of entrepreneurial team formation: the case of a researchbased spin-off green start-ups and local knowledge spillovers from clean and dirty technologies to be born is not enough: the key role of innovative startups ownership structure, horizontal agency costs and the performance of high-tech entrepreneurial firms the impact of incubator' organizations on opportunity recognition and technology innovation in new, entrepreneurial high-technology ventures how business angel groups work: rejection criteria in investment evaluation the role of social and human capital among nascent entrepreneurs financing entrepreneurship: bank finance versus venture capital the relational view: cooperative strategy and sources of interorganizational competitive advantage entrepreneur behaviors, opportunity recognition, and the origins of innovative ventures networking behavior and contracting relationships among entrepreneurs in business incubators firm-level implications of early stage venture capital investment -an empirical investigation open r&d and open innovation: exploring the phenomenon innovation in innovation: the triple helix of university-industry-government relations efficient entrepreneurial culture: a cross-country analysis of developed countries the strength of weak ties economic action and social structure: the problem of embeddedness the resource-based theory of competitive advantage: implications for strategy formulation networks and entrepreneurship -an analysis of social relations, occupational background, and use of contacts during the establishment process constraining entrepreneurial development: a knowledge-based view of social networks among academic entrepreneurs what do entrepreneurs pay for venture capital affiliation? 
formation of industrial innovation mechanisms through the research institute theory of the firm: managerial behavior, agency costs and ownership structure components of the network around an actor characteristics, contracts, and actions: evidence from venture capitalist analyses social networks and new venture creation: the dark side of networks assessing the contribution of venture capital to innovation measuring emergence in the dynamics of new venture creation development and cross-cultural application of a specific instrument to measure entrepreneurial intentions women's experiences of legitimacy, satisfaction and commitment as entrepreneurs: embedded in gender hierarchy and networks in private and business spheres r&d networks and product innovation patterns-academic and nonacademic new technology-based firms on science parks what do investors look for in a business plan? a comparison of the investment criteria of bankers, venture capitalists and business angels opportunity evaluation and changing beliefs during the nascent entrepreneurial process desperate entrepreneurs: no opportunities, no skills the role of academic inventors in entrepreneurial firms: sharing the laboratory life loosely coupled systems: a reconceptualization what you know or who you know? the role of intellectual and social capital in opportunity recognition core entrepreneurial competencies and their interdependencies: insights from a study of irish and iranian entrepreneurs, university students and academics antecedents and consequences of cooperative entrepreneurship: a conceptual model and empirical investigation university entrepreneurship: a taxonomy of the literature venture capitalist-entrepreneur relationships in technology-based ventures. enterprise and innovation management studies gendering pursuits of innovation: embeddedness in networks and culture innovation embedded in entrepreneurs' networks and national educational systems: a global study the two sides of the story: network investments and new venture creation building networks into discovery: the link between entrepreneur network capability and entrepreneurial opportunity discovery multilevel analysis: an introduction to basic and advanced multilevel modeling entrepreneurship and the evolution of angel financial networks dynamic capabilities and strategic management venture capital's role in financing innovation for economic growth angel investing: matching startup funds with startup companies-the guide for entrepreneurs and individual investors engaging with startups to enhance corporate innovation educational organizations as loosely coupled systems governance and strategy of entrepreneurial networks: an introduction angel finance: the other venture capital. strategic change social cognitive theory of organizational management publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations key: cord-269711-tw5armh8 authors: ma, junling; van den driessche, p.; willeboordse, frederick h. title: the importance of contact network topology for the success of vaccination strategies date: 2013-05-21 journal: journal of theoretical biology doi: 10.1016/j.jtbi.2013.01.006 sha: doc_id: 269711 cord_uid: tw5armh8 abstract the effects of a number of vaccination strategies on the spread of an sir type disease are numerically investigated for several common network topologies including random, scale-free, small world, and meta-random networks. 
these strategies, namely, prioritized, random, follow links and contact tracing, are compared across networks using extensive simulations with disease parameters relevant for viruses such as pandemic influenza h1n1/09. two scenarios for a network sir model are considered. first, a model with a given transmission rate is studied. second, a model with a given initial growth rate is considered, because the initial growth rate is commonly used to impute the transmission rate from incidence curves and to predict the course of an epidemic. since a vaccine may not be readily available for a new virus, the case of a delay in the start of vaccination is also considered in addition to the case of no delay. it is found that network topology can have a larger impact on the spread of the disease than the choice of vaccination strategy. simulations also show that the network structure has a large effect on both the course of an epidemic and the determination of the transmission rate from the initial growth rate. the effect of delay in the vaccination start time varies tremendously with network topology. results show that, without the knowledge of network topology, predictions on the peak and the final size of an epidemic cannot be made solely based on the initial exponential growth rate or transmission rate. this demonstrates the importance of understanding the topology of realistic contact networks when evaluating vaccination strategies. the importance of contact network topology for the success of vaccination strategies for many viral diseases, vaccination forms the cornerstone in managing their spread and the question naturally arises as to which vaccination strategy is, given practical constraints, the most effective in stopping the disease spread. for evaluating the effectiveness of a vaccination strategy, it is necessary to have as precise a model as possible for the disease dynamics. the widely studied key reference models for infectious disease epidemics are the homogeneous mixing models where any member of the population can infect or be infected by any other member of the population; see, for example, anderson and may (1991) and brauer (2008) . the advantage of a homogeneous mixing model is that it lends itself relatively well to analysis and therefore is a good starting point. due to the homogeneity assumption, these models predict that the fraction of the population that needs to be vaccinated to curtail an epidemic is equal to 1à1=r 0 , where r 0 is the basic reproduction number (the average number of secondary infections caused by a typical infectious individual in a fully susceptible population). however, the homogeneous mixing assumption poorly reflects the actual interactions within a population, since, for example, school children and office co-workers spend significant amounts of time in close proximity and therefore are much more likely to infect each other than an elderly person who mostly stays at home. consequently, efforts have been made to incorporate the network structure into models, where individuals are represented by nodes and contacts are presented by edges. in the context of the severe acute respiratory syndrome (sars), it was shown by meyers et al. (2005) that the incorporation of contact networks may yield different epidemic outcomes even for the same basic reproduction number r 0 . for pandemic influenza h1n1/09, pourbohloul et al. (2009) and davoudi et al. (2012) used network theory to obtain a real time estimate for r 0 . 
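for a rough sense of scale, plugging the r 0 = 1.5 used later in this paper into this homogeneous-mixing threshold gives a critical vaccination fraction of p_c = 1 - 1/r 0 = 1 - 1/1.5 ≈ 0.33, that is, about one third of the population; the network results below illustrate why such a single number can be misleading once contact structure is taken into account.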
numerical simulations have shown that different networks can yield distinct disease spread patterns; see, for example, bansal et al. (2007) , miller et al. (2012) , and section 7.6 in keeling and rohani (2008) . to illustrate this difference for the networks and parameters we use, the effect of different networks on disease dynamics is shown in fig. 1 . descriptions of these networks are given in section 2 and appendix b. at the current stage, most theoretical network infectious disease models incorporate, from a real world perspective, idealized random network structures such as regular (all nodes have the same degree), erd + os-ré nyi or scale-free random networks where clustering and spatial structures are absent. for example, volz (2008) used a generating function formalism (an alternate derivation with a simpler system of equations was recently found by miller, 2011) , while we used the degree distribution in the effective degree model presented in lindquist et al. (2011) . in these models, the degree distribution is the key network characteristic for disease dynamics. from recent efforts (ma et al., 2013; volz et al., 2011; moreno et al., 2003; on incorporating degree correlation and clustering (such as households and offices) into epidemic models, it has been found that these may significantly affect the disease dynamics for networks with identical degree distributions. fig. 2 shows disease dynamics on networks with identical degree distribution and disease parameters, but with different network topologies. clearly, reliable predictions of the epidemic process that only use the degree distribution are not possible without knowledge of the network topology. such predictions need to be checked by considering other topological properties of the network. network models allow more precise modeling of control measures that depend on the contact structure of the population, such as priority based vaccination and contact tracing. for example, shaban et al. (2008) consider a random graph with a pre-specified degree distribution to investigate vaccination models using contact tracing. kiss et al. (2006) compared the efficacy of contact tracing on random and scale-free networks and found that for transmission rates greater than a certain threshold, the final epidemic size is smaller on a scale-free network than on a corresponding random network, while they considered the effects of degree correlations in kiss et al. (2008) . cohen et al. (2003) (see also madar et al., 2004) considered different vaccination strategies on scale-free networks and found that acquaintance immunization is remarkably effective. miller and hyman (2007) considered several vaccination strategies on a simulation of the population of portland oregon, usa, and found it to be most effective to vaccinate nodes with the most unvaccinated susceptible contacts, although they found that this strategy may not be practical because it requires considerable computational resources and information about the network. bansal et al. (2006) took a contact network using data from vancouver, bc, canada, considered two vaccination strategies, namely mortality-and morbidity-based, and investigated the detrimental effect of vaccination delays. and found that, on realistic contact networks, vaccination strategies based on detailed network topology information generally outperform random vaccination. however, in most cases, contact network topologies are not readily available. 
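one simple way to reproduce the kind of comparison behind fig. 2 (identical degree distribution, different topology) is to take a structured graph and randomise it with degree-preserving edge swaps. the sketch below only illustrates that idea; the choice of a watts-strogatz graph, the swap counts and the clustering comparison are assumptions made here for illustration, not taken from the paper.

```python
import networkx as nx

# A structured graph: Watts-Strogatz small world (high clustering, few shortcuts).
structured = nx.watts_strogatz_graph(n=2000, k=6, p=0.05, seed=1)

# Copy it and scramble with double edge swaps, which keep every node's degree
# fixed but destroy clustering and spatial structure.
randomised = structured.copy()
m = randomised.number_of_edges()
nx.double_edge_swap(randomised, nswap=10 * m, max_tries=100 * m, seed=1)

# Identical degree sequences ...
assert sorted(d for _, d in structured.degree()) == \
       sorted(d for _, d in randomised.degree())

# ... but very different local topology, e.g. average clustering.
print("clustering, structured :", nx.average_clustering(structured))
print("clustering, randomised :", nx.average_clustering(randomised))
```

the degree sequence is untouched by the swaps, so any difference in simulated spreading between the two graphs must come from higher-order structure such as clustering, which is the point made above about fig. 2.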
thus, how different network topologies affect various vaccination strategies remains of considerable interest. to address this question, we explore two scenarios to compare percentage reduction by vaccination on the final size of epidemics across various network topologies. first, various network topologies are considered with the disease parameters constant, assuming that these have been independently estimated. second, different network topologies are used to fit to the observed incidence curve (number of new infections in each day), so that their disease parameters are different yet they all line up to the same initial exponential growth phase of the epidemic. vaccines are likely lacking at the outbreak of an emerging infectious disease (as seen in the 2009 h1n1 pandemic, conway et al., 2011) , and thus can only be given after the disease is already widespread. we investigate numerically whether network topologies affect the effectiveness of vaccination strategies started with a delay after the disease is widespread; for example, a 40 day delay as in the second wave of the 2009 influenza pandemic in british columbia, canada (office of the provincial health officer, 2010). details of our numerical simulations are given in appendix a. this paper is structured as follows. in section 2, a brief overview of the networks and vaccination strategies (more details are provided in appendices b and c) is given. in section 3, we investigate the scenario where the transmission rate is fixed, while in section 4 we investigate the scenario where the growth rate of the incidence curve is fixed. to this end, we compute the incidence curves and reductions in final sizes (total number of infections during the course of the epidemic) due to vaccination. for the homogeneous mixing model, these scenarios are identical (ma and earn, 2006) , but as will be shown, when taking topology into account, they are completely different. we end with conclusions in section 5. . on all networks, the average degree is 5, the population size is 200,000, the transmission rate is 0.06, the recovery rate is 0.2, and the initial number of infectious individuals is set to 100. both graphs represent the same data but the left graph has a semi-log scale (highlighting the growth phase) while the right graph has a linear scale (highlighting the peak). (b)) on networks with identical disease parameters and degree distribution (as shown in (a)). the network topologies are the random, meta-random, and near neighbor networks. see appendix b for details of the constructions of these networks. detailed network topologies for human populations are far from known. however, this detailed knowledge may not be required when the main objective is to assert the impact that topology has on the spread of a disease and on the effects of vaccination. it may be sufficient to consider a number of representative network topologies that, at least to some extent, can be found in the actual population. here, we consider the four topologies listed in table 1 , which we now briefly describe. in the random network, nodes are connected with equal probability yielding a poisson degree distribution. in a scale-free network, small number of nodes have a very large number of links and large number of nodes have a small number of links such that the degree distribution follows a power law. small world (sw) networks are constructed by adding links between randomly chosen nodes on networks in which nodes are connected to the nearest neighbors. 
the last network considered is what we term a meta-random network where random networks of various sizes are connected with a small number of interlinks. all networks are undirected with no self loops or multiple links. the histograms of the networks are shown in table 2 , and the details of their construction are given in appendix b. the vaccination strategies considered are summarized in table 3 . in the random strategy, an eligible node is randomly chosen and vaccinated. in the prioritized strategy, nodes with the highest degrees are vaccinated first, while in the follow links strategy, inspired by notions from social networks, a randomly chosen susceptible node is vaccinated and then all its neighbors and then its neighbor's neighbors and so on. finally, in contact tracing, the neighbors of infectious nodes are vaccinated. for all the strategies, vaccination is voluntary and quantity limited. that is, only susceptibles who do not refuse vaccination are vaccinated and each day only a certain number of doses is available. in the case of (relatively) new viral diseases, the supply of vaccines will almost certainly be constrained, as was the case for the pandemic influenza h1n1/09 virus. also in the case of mass vaccinations, there will be resource limitations with regard to how many doses can be administered per day. the report (office of the provincial health officer, 2010) states that the vaccination program was prioritized and it took 3 weeks before the general population had access to vaccination. thus we assume that a vaccination program can be completed in 4-6 weeks or about 40 days, this means that for a population of 200,000, a maximum of 5000 doses a day can be used. for each strategy for each time unit, first a group of eligible nodes is identified and then up to the maximum number of doses is dispensed among the eligible nodes according to the strategy chosen. more details of the vaccination strategies and their motivations are given in appendix c. to study the effect of delayed availability of vaccines during an emerging infectious disease, we compare the effect of vaccination programs starting on the first day of the epidemic with those vaccination programs starting on different days. these range from 5 to 150 days after the start of the epidemic, with an emphasis on a 40 day delay that occurred in british columbia, canada, during the influenza h1n1/2009 pandemic. when a node is vaccinated, the vaccination is considered to be ineffective in 30% of the cases (bansal et al., 2006) . in such cases, the vaccine provides no immunity at all. for the 70% of the nodes for which the vaccine will be effective, a two week span to reach full immunity is assumed (clark et al., 2009) . during the two weeks, we assume that the immunity increases linearly starting with 0 at the time of vaccination reaching 100% after 14 days. the effect of vaccination strategies has been studied (see, for example, conway et al., 2011) using disease parameter values estimated in the literature. however, network topologies were not the focus of these studies. in section 3, the effect of vaccination strategies on various network topologies is compared with a fixed per link transmission rate. the per link transmission rate b is difficult to obtain directly and is usually derived as a secondary quantity. to determine b, we pick the basic reproduction number r 0 ¼ 1:5 and the recovery rate g ¼ 0:2, which are close to that of the influenza a h1n1/09 virus; see, for example, pourbohloul et al. (2009 ), tuite et al. 
(2010). in the case of the homogeneous mixing sir model, the basic reproduction number is given by r 0 = t/g, where t is the per-node transmission rate.
[table 1: illustration of the different types of networks used in this paper.]
[table 2: degree histograms of the networks in table 1 with 200,000 nodes.]
our parameter values yield t = 0.3. for networks, t = b⟨k⟩. with the assumption that the average degree ⟨k⟩ = 5, the above gives the per-link transmission rate b = 0.06. the key parameters are summarized in table 4. in section 3, we use this transmission rate to compare the incidence curves for the networks in table 1 with the vaccination strategies in table 3. some of the most readily available data in an epidemic are the number of reported new cases per day. these cases generally display exponential growth in the initial phase of an epidemic and a suitable model therefore needs to match this initial growth pattern. the exponential growth rates are commonly used to estimate disease parameters (chowell et al., 2007; lipsitch et al., 2003). in section 4, we consider the effects of various network topologies on the effectiveness of vaccination strategies for epidemics with a fixed exponential growth rate. the basic reproduction number r 0 = 1.5 and the recovery rate g = 0.2 yield an exponential growth rate l = t - g = 0.1 for the homogeneous mixing sir model. we tune the transmission rate for each network topology to give this initial growth rate. in this section, the effectiveness of vaccination strategies on various network topologies is investigated for a given set of parameters, which are identical for all the simulations. the values of the disease parameters are chosen based on what is known from influenza h1n1/09. qualitatively, these chosen parameters should provide substantial insight into the effects topology has on the spread of a disease. unless indicated otherwise the parameter values listed in table 4 are used. the effects of the vaccination strategies summarized in table 3 when applied without delay are shown in fig. 3. for reference, fig. 1 shows the incidence curves with no vaccination. since the disease dies out in the small world network (see fig. 1), vaccination is not needed in this network for the parameter values taken. especially in the cases of the random and meta-random networks, the effects of vaccination are drastic while for the scale-free network they are still considerable. what is particularly notable is that when comparing the various outcomes, topology has as great if not a greater impact on the epidemic than the vaccination strategy. besides the incidence curves, the final sizes of epidemics and the effect vaccination has on these are also of great importance. table 5 shows the final sizes and the reductions in the final sizes for the various networks on which the disease can survive (for the chosen parameter values) with vaccination strategies for the cases where there is no delay in the vaccination. fig. 4 and table 6 show the incidence curves and the reductions in final sizes for the same parameters as used in fig. 3 and table 5 but with a delay of 40 days in the vaccination. as can be expected for the given parameters, a delay has the biggest effect for the scale-free network. in that case, the epidemic is already past its peak and vaccinations only have a minor effect. for the random and meta-random networks, the
[table 3: illustration of vaccination strategies.]
susceptible nodes are depicted by triangles, infectious nodes by squares, and the vaccinated nodes by circles. the average degree in these illustrations has been reduced to aid clarity. the starting point for contact tracing is labeled as a while the starting point for the follow links strategy is labeled as b. the number of doses dispensed in this illustration is 3. random follow links contact tracing table 3 for the network topologies in table 1 given a fixed transmission rate b. there is no delay in the vaccination and parameters are equal to those used in fig. 1 . to further investigate the effects of delay in the case of random vaccination, we compute reductions in final sizes for delays of 5, 10, 15,y,150 days, in random, scale-free, and meta-random networks. fig. 5 shows that, not surprisingly, these reductions diminish with longer delays. however, the reductions are strongly network dependent. on a scale-free network, the reduction becomes negligible as the delay approaches the epidemic peak time, while on random and meta-random networks, the reduction is about 40% with the delay at the epidemic peak time. this section clearly shows that given a certain transmission rate b, the effectiveness of a vaccination strategy is impossible to predict without having reliable data on the network topology of the population. next, we consider the case where instead of the transmission rate, the initial growth rate is given. we line up incidence curves on various network topologies to a growth rate l predicted by a homogeneous mixing sir model with the basic reproduction number r 0 ¼ 1:5 and recovery rate g ¼ 0:2 (in this case with exponential, l ¼ ðr 0 à1þg ¼ 0:1). table 7 summarizes the transmission rates that yield this exponential growth rate on the corresponding network topologies. the initial number of infectious individuals for models on each network topology needs to be adjusted as well so that the curves line up along the homogeneous mixing sir incidence curve for 25 days. as can be seen from the table, the variations in the parameters are indeed very large, with the transmission rate for the small world network being nearly 8 times the value of the transmission rate for the scale-free network. the incidence curves corresponding to the parameters in table 7 are shown in fig. 6 . as can clearly be seen, for these parameters, the curves overlap very well for the first 25 days, thus showing indeed the desired identical initial growth rates. however, it is also clear that the curves diverge strongly later on, with the epidemic on the small world network being the most severe. these results show that the spread of an epidemic cannot be predicted on the basis of having a good estimate of the growth rate alone. in addition, comparing figs. 1 and 6, a higher transmission rate yields a much larger final size and a longer epidemic on the meta-random network. the effects of the various vaccination strategies for the case of a given growth rate are shown in fig. 7 . given the large differences in the transmission rates, it may be expected that the final sizes show significant differences as well. this is indeed the case as can be seen in table 8 , which shows the percentage reduction in final sizes for the various vaccination strategies. with no vaccination, the final size of the small world network is more than 3 times that of the scale-free network, but for all except the follow links vaccination strategy the percentage reduction on the small world network is greater. 
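the paper does not spell out how the per-link transmission rates in table 7 were obtained, so the following is one plausible way to reproduce the calibration: repeatedly simulate the early phase on the network of interest, fit the slope of the log incidence over the first 25 days, and bisect on the transmission rate until the slope matches the target growth rate of 0.1 per day. the simulate_incidence helper (assumed to return the daily new infections of one stochastic run, for instance built from the simulation sketch given with appendix a below), the fitting window and the bisection bounds are assumptions made here for illustration.

```python
import numpy as np

def estimate_growth_rate(simulate_incidence, beta, days=25, runs=20):
    """Average early exponential growth rate of the daily incidence.

    `simulate_incidence(beta)` is assumed to return the list of daily new
    infections from one stochastic run on the network of interest.
    """
    rates = []
    for _ in range(runs):
        inc = np.asarray(simulate_incidence(beta)[:days], dtype=float)
        t = np.arange(len(inc))
        keep = inc > 0                      # log of zero incidence is undefined
        slope, _intercept = np.polyfit(t[keep], np.log(inc[keep]), 1)
        rates.append(slope)
    return float(np.mean(rates))

def calibrate_beta(simulate_incidence, target=0.1, lo=0.01, hi=0.5, tol=1e-3):
    """Bisection on the per-link transmission rate until the simulated
    initial growth rate matches `target` (0.1 per day in the paper)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if estimate_growth_rate(simulate_incidence, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```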
the effects of a 40-day delay in the start of the vaccination are shown in fig. 8 and table 9 . besides the delay, all the parameters are identical to those in fig. 7 and table 8 . the delay has the largest effect on the final sizes of the small world network, increasing it by a factor of 20-30 except in the follow links case. on a scale-free network, the delay renders all vaccination strategies nearly ineffective. these results also confirm the importance of network topology in disease spread even when the incidence curves have identical initial growth. the initial stages of an epidemic are insufficient to estimate the effectiveness of a vaccination strategy on reducing the peak or final size of an epidemic. the relative importance of network topology on the predictability of incidence curves was investigated. this was done by considering whether the effectiveness of several vaccination strategies is impacted by topology, and whether the growth in the daily incidences has a network topology independent relation with the disease transmission rate. it was found that without a fairly detailed knowledge of the network topology, initial data cannot predict epidemic progression. this is so for both a given transmission rate b and a given growth rate l. for a fixed transmission rate and thus a fixed per link transmission probability, given that a disease spreads on a network with a fixed average degree, the disease spreads fastest on scale-free networks because high degree nodes have a very high probability to be infected as soon as the epidemic progresses. in turn, once a high degree node is infected, on average it passes on the infection to a large number of neighbors. the random and meta-random networks show identical initial growth rates because they have the same local network topology. on different table 1 without vaccination for the case where the initial growth rate is given. the transmission rates and initial number of infections for the various network topologies are given in table 7 , while the remaining parameters are the same as in fig. 1 meta-random network fig. 7 . the effects of the vaccination strategies for different topologies when the initial growth rate is given. the transmission rates b are as indicated in table 7 , while the remaining parameters are identical to those in fig. 6 . network topologies, diseases respond differently to parameter changes. for example, on the random network, a higher transmission rate yields a much shorter epidemic, whereas on the metarandom network, it yields a longer one with a more drastic increase in final size. these differences are caused by the spatial structures in the meta-random network. considering that a metarandom network is a random network of random networks, it is likely that the meta-random network represents a general population better than a random network. for a fixed exponential growth rate, the transmission rate needed on the scale-free network to yield the given initial growth rate is the smallest, being about half that of the random and the meta-random networks. hence, the per-link transmission probability is the lowest on the scale-free network, which in turn yields a small epidemic final size. for different network topologies, we quantified the effect of delay in the start of vaccination. we found that the effectiveness of vaccination strategies decreases with delay with a rate strongly dependent on network topology. 
this emphasizes the importance of the knowledge of the topology, in order to formulate a practical vaccination schedule. with respect to policy, the results presented seem to warrant a significant effort to obtain a better understanding of how the members of a population are actually linked together in a social network. consequently, policy advice based on the rough estimates of the network structure should be viewed with caution. this work is partially supported by nserc discovery grants (jm, pvdd) and mprime (pvdd). we thank the anonymous reviewers for their constructive comments. the nodes in the network are labeled by their infectious status, i.e. susceptible, infectious, vaccinated, immune, refusing vaccination (but susceptible), and vaccinated but susceptible (the vaccine is not working), respectively. the stochastic simulation is initialized by first labeling all the nodes as susceptible and then randomly labeling i 0 nodes as infectious. then, before the simulation starts, 50% of susceptible nodes are labeled as refusing vaccination but susceptible. during the simulation, when a node is vaccinated, the vaccine has a probability of 30% to be ineffective. if it is not effective, the node remains fully susceptible, but will not be vaccinated again. if it is effective, then the immunity is built up linearly over a certain period of time, taken as 2 weeks. we assume that infected persons generally recover in about 5 days, giving a recovery rate g ¼ 0:2. the initial number of infectious individuals i 0 is set to 100 unless otherwise stated, to reduce the number of runs where the disease dies out due to statistical fluctuations. all simulation results presented in sections 4 and 5 are averages of 100 runs, each with a new randomly generated network of the chosen topology. the parameters in the simulations are shown in table 4 . the population size n was chosen to be sufficiently large to be representative of a medium size town and set to n ¼ 200,000, while the degree average is taken as /ks ¼ 5 with a maximum degree m¼ 100 (having a maximum degree only affects the scalefree network since the probability of a node having degree m is practically zero for the other network types). when considering a large group of people, a good first approximation is that the links between these people are random. although it is clear that this cannot accurately represent the population since it lacks, for example, clustering and spatial aggregation (found in such common contexts as schools and work places), it may be possible that if the population is big enough, most if not all nonrandom effects average out. furthermore, random networks lend themselves relatively well to analysis so that a number of interesting (and testable) properties can be derived. as is usually the case, the random network employed here originates from the concepts first presented rigorously by erd + os and ré nyi (1959). our random networks are generated as follows: (1) we begin by creating n unlinked nodes. (2) in order to avoid orphaned nodes, without loss of generality, first every node is linked to another uniformly randomly chosen node that is not a neighbor. (3) two nodes that are not neighbors and not already linked are uniformly randomly selected. if the degree d of both the nodes is less than the maximum degree m, a link is established. if one of the nodes has maximum degree m, a new pair of nodes is uniformly randomly selected. (4) step 3 is repeated n â /ksàn times. 
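the node-status bookkeeping described in appendix a above translates into a fairly small discrete-time simulation step; the sketch below is a minimal reading of it. the way partial immunity is assumed to scale the per-link infection probability, and the use of a tiny positive value to flag nodes whose immunity is still ramping up, are implementation choices made here rather than details given in the paper. vaccination itself (who receives the daily doses, and the 30% failure rate) is sketched together with the strategies of appendix c below.

```python
import random

S, I, R = "S", "I", "R"                      # susceptible, infectious, recovered

def init_states(G, i0=100, refuse_frac=0.5, rng=random):
    """Label all nodes susceptible, seed i0 infectious nodes, mark refusers."""
    state = {v: S for v in G}
    for v in rng.sample(list(G), i0):
        state[v] = I
    refuses = {v for v in G
               if state[v] == S and rng.random() < refuse_frac}
    immunity = {v: 0.0 for v in G}           # 0 = none, 1 = full protection
    vaccinated = set()
    return state, refuses, immunity, vaccinated

def one_day(G, state, immunity, beta=0.06, gamma=0.2, ramp_days=14, rng=random):
    """Advance the epidemic by one day and return the daily incidence."""
    # Immunity of effectively vaccinated nodes ramps up linearly over 14 days.
    for v in immunity:
        if immunity[v] > 0.0:
            immunity[v] = min(1.0, immunity[v] + 1.0 / ramp_days)
    new_inf, new_rec = set(), []
    for v in G:
        if state[v] != I:
            continue
        if rng.random() < gamma:
            new_rec.append(v)
        for u in G[v]:
            # Partial immunity is assumed to scale down the per-link
            # infection probability; this is a reading, not a stated rule.
            if state[u] == S and rng.random() < beta * (1.0 - immunity[u]):
                new_inf.add(u)
    for u in new_inf:
        state[u] = I
    for v in new_rec:
        state[v] = R
    return len(new_inf)
```

a full run would call one_day repeatedly, hand the daily dose budget to one of the vaccination strategies of appendix c, and average the resulting incidence curves over many realisations, as described above.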
when considering certain activities in a population, such as the publishing of scientific work or sexual contact, it has been found that the links are often well described by a scale-free network structure where the relationship between the degree and the number of nodes that have this degree follows a negative power law; see, for example, the review paper by albert and barabási (2002). scale-free networks can easily be constructed with the help of a preferential attachment. that is to say, the network is built up step by step and new nodes attach to existing nodes with a probability that is proportional to the degree of the existing nodes. our network is constructed with the help of preferential attachment, but two modifications are made in order to render the scale-free network more comparable with the other networks investigated here. first, the maximum degree is limited to m not by restricting the degree from the outset but by first creating a scale-free network and then pruning all the nodes with a degree larger than m. second, the number of links attached to each new node is either two or three dependent on a certain probability that is set such that after pruning the average degree is very close to that of the random network (i.e. ⟨k⟩ = 5). our scale-free network is generated as follows: (1) start with three fully connected nodes and set the total number of links l = 3. (2) create a new node. with a probability of 0.3, add 2 links. otherwise add 3 links. for each of these additional links to be added find a node to link to as outlined in step 3. (3) loop through the list of nodes and create a link with probability d/(2l), where d is the degree of the currently considered target node. (4) increase l by 2 or 3 depending on the choice in step 2. (5) repeat n - 3 times steps 2 and 3. (6) prune nodes with a degree > m. small world networks are characterized by the combination of a relatively large number of local links with a small number of non-local links. consequently, there is in principle a very large number of possible small world networks. one of the simplest ways to create a small world network is to first place nodes sequentially on a circle and couple them to their neighbors, similar to the way many coupled map lattices are constructed (willeboordse, 2006), and to then create some random short cuts. this is basically also the way the small world network used here is generated. the only modification is that the coupling range (i.e. the number of neighbors linked to) is randomly varied between 2 and 3 in order to obtain an average degree equal to that of the random network (i.e. ⟨k⟩ = 5). we also use periodic boundary conditions, which as such is not necessary for a small world network but is commonly done. the motivation for studying small world networks is that small groups of people in a population are often (almost) fully linked (such as family members or co-workers) with some connections to other groups of people. our small world network is generated as follows: (1) create n new unlinked nodes with index i = 1 . . . n. (2) with a probability of 0.55, link to neighboring and second neighboring nodes (i.e. create links i ↔ i-1, i ↔ i+1, i ↔ i-2, i ↔ i+2). otherwise, also link up to the third neighboring nodes (i.e. create links i ↔ i-1, i ↔ i+1, i ↔ i-2, i ↔ i+2, i ↔ i-3, i ↔ i+3). periodic boundary conditions are used (i.e. the left nearest neighbor of node 1 is node n while the right nearest neighbor of node n is node 1).
(3) create the 'large world' network by repeating step 2 for each node. (4) with a probability of 0.05 add a link to a uniformly randomly chosen node excluding self and nodes already linked to. (5) create the small world network by carrying out step 4 for each node. in the random network, the probability for an arbitrary node to be linked to any other arbitrary node is constant and there is no clear notion of locality. in the small world network on the other hand, tightly integrated local connections are supplemented by links to other parts of the network. to model a situation in between where randomly linked local populations (such as the populations of villages in a region) are randomly linked to each other (for example, some members of the population of one village are linked to some members of some other villages), we consider a meta-random network. when increasing the number of shortcuts, a meta-random network transitions to a random network. it can be argued that among the networks investigated here, a meta-random network is the most representative of the population in a state, province or country. our meta-random network is generated as follows: (1) create n new unlinked nodes with index i ¼ 1 . . . n. (2) group the nodes into 100 randomly sized clusters with a minimum size of 20 nodes (the minimum size was chosen such that it is larger than /ks, which equals five throughout, to exclude fully linked graphs). this is done by randomly choosing 99 values in the range from 1 to n to serve as cluster boundaries with the restriction that a cluster cannot be smaller than the minimum size. (3) for each cluster, create an erd + os-ré nyi type random network. (4) for each node, with a probability 0.01, create a link to a uniformly randomly chosen node of a uniformly randomly chosen cluster excluding its own cluster. the network described in this subsection is a near neighbor network and therefore mostly local. nevertheless, there are some shortcuts but shortcuts to very distant parts of the network are not very likely. it could therefore be called a medium world network (situated between small and large world networks). the key feature of this network is that despite being mostly local its degree distribution is identical to that of the random network. our near neighbor network is generated as follows: (1) create n new unlinked nodes with index i ¼ 1 . . . n. (2) for each node, set a target degree by randomly choosing a degree with a probability equal to that for the degree distribution of the random network. (3) if the node has reached its target degree, continue with the next node. if not continue with step 4. (4) with a probability of 0.5, create a link to a node with a smaller index, otherwise create a link to a node with a larger index (using periodic boundary conditions). (5) starting at the nearest neighbor by index and continuing by decreasing (smaller indices) or increasing (larger indices) the index one by one while skipping nodes already linked to, search for the nearest node that has not reached its target degree yet and create a link with this node. (6) create the network by repeating steps 3-5 for each node. for all the strategies, vaccination is voluntary and quantity limited. that is to say only susceptibles who do not refuse vaccination are vaccinated and each day only a certain number of doses is available. 
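this voluntary, dose-limited dispensing is the part that all four strategies share, so it is sketched once here; the strategy-specific selection rules are sketched after their step lists below. the data structures (a state dict plus sets of refusers and already vaccinated nodes, as in the appendix a sketch above) and the function names are assumptions for illustration, not the paper's code.

```python
import random

def eligible_nodes(G, state, refuses, vaccinated):
    """Susceptibles that neither refuse vaccination nor were vaccinated before."""
    return [v for v in G if state[v] == "S"
            and v not in refuses and v not in vaccinated]

def dispense(G, state, refuses, vaccinated, immunity,
             doses_per_day, strategy, rng=random):
    """One day of vaccination: pick up to `doses_per_day` eligible nodes
    according to `strategy` and apply the (possibly failing) vaccine."""
    pool = eligible_nodes(G, state, refuses, vaccinated)
    if len(pool) > doses_per_day:
        pool = strategy(G, state, pool, doses_per_day, rng)
    for v in pool:
        vaccinated.add(v)
        if rng.random() >= 0.3:          # vaccine effective in 70% of cases
            immunity[v] = 1e-9           # starts the 14-day linear ramp

def random_strategy(G, state, pool, doses, rng):
    """The random strategy: every eligible node is equally likely to be picked."""
    return rng.sample(pool, doses)
```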
for each strategy for each time unit, first a group of eligible nodes is identified and then up to the maximum number of doses is dispensed among the eligible nodes according to the strategy chosen. in this strategy, nodes with the highest degrees are vaccinated first. the motivation for this strategy is that high degree nodes on average can be assumed to transmit a disease more often than low degree nodes. numerically, the prioritized vaccination strategy is implemented as follows: (1) for each time unit, start at the highest degree (i.e. consider nodes with degree d¼m) and repeat the steps below until either the number of doses per time step or the total number of available doses is reached. (2) count the number of susceptible nodes for degree d. (3) if the number of susceptible nodes with degree d is zero, set d ¼ dà1 and return to step 2. (4) if the number of susceptible nodes with degree d is smaller than or equal to the number of available doses, vaccinate all the nodes, then set d ¼ dà1 and continue with step 2. otherwise continue with step 5. (5) if the number of susceptible nodes with degree d is greater than the number of currently available doses, randomly choose nodes with degree d to vaccinate until the available number of doses is used up. (6) when all the doses are used up, end the vaccination for the current time unit and continue when the next time unit arrives. in practice prioritizing on the basis of certain target groups such as health care workers or people at high risk of complications can be difficult. prioritizing on the basis of the number of links is even more difficult. how would such individuals be identified? one of the easiest vaccination strategies to implement is random vaccination. numerically, the random vaccination strategy is implemented as follows: (1) for each time unit, count the total number of susceptible nodes. (2) if the total number of susceptible nodes is smaller than or equal to the number of doses per unit time, vaccinate all the susceptible nodes. otherwise do step 3. (3) if the total number of susceptible nodes is larger than the number of doses per unit time, randomly vaccinate susceptible nodes until all the available doses are used up. one way to reduce the spread of a disease is by splitting the population into many isolated groups. this could be done by vaccinating nodes with links to different groups. however given the network types studied here, breaking links between groups is not really feasible since besides the random cluster network, there is no clear group structure in the other networks. another approach is the follow links strategy, inspired by notions from social networks, where an attempt is made to split the population by vaccinating the neighbors and the neighbor's neighbors and so on of a randomly chosen susceptible node. numerically, the follow links strategy is implemented as follows: (1) count the total number of susceptible nodes. (2) if the total number of susceptible nodes is smaller than or equal to the number of doses per unit time, vaccinate all the susceptible nodes. (3) if the total number of susceptible nodes is greater than the number of available doses per unit time, first randomly choose a susceptible node, label it as the current node, and vaccinate it. (4) vaccinate all the susceptible neighbors of the current node. (5) randomly choose one of the neighbors of the current node. (6) set the current node to the node chosen in step 5. 
(7) continue with steps 4-6 until all the doses are used up or no available susceptible neighbor can be found. (8) if no available susceptible neighbor can be found in step 7, randomly choose a susceptible node from the population and continue with step 4. contact tracing was successfully used in combating the sars virus. in that case, everyone who had been in contact with an infectious individual was isolated to prevent a further spread of the disease. de facto, this kind of isolation boils down to removing links rendering the infectious node degree 0, a scenario not considered here. here contact tracing tries to isolate an infectious node by vaccinating all its susceptible neighbors. numerically, the contact tracing strategy is implemented as follows: (1) count the total number of susceptible nodes. (2) if the total number of susceptible nodes is smaller than or equal to the number of doses per unit time, vaccinate all the susceptible nodes. (3) count only those susceptible nodes that have an infectious neighbor. (4) if the number of susceptible nodes neighboring an infectious node is smaller than or equal to the number of doses per unit time, vaccinate all these nodes. (5) if the number of susceptible nodes neighboring an infectious node is greater than the number of available doses repeat step 6 until all the doses are used up. (6) randomly choose an infectious node that has susceptible neighbors and vaccinate its neighbors until all the doses are used up. statistical mechanics of complex networks infectious diseases of humans a comparative analysis of influenza vaccination programs when individual behaviour matters: homogeneous and network models in epidemiology compartmental models in epidemiology comparative estimation of the reproduction number for pandemic influenza from daily case notification data trial of 2009 influenza a (h1n1) monovalent mf59-adjuvanted vaccine efficient immunization strategies for computer networks and populations vaccination against 2009 pandemic h1n1 in a population dynamical model of vancouver, canada: timing is everything early real-time estimation of the basic reproduction number of emerging infectious diseases. phys. rev. x 2, 031005. erd + os modeling infectious diseases in humans and animals infectious disease control using contact tracing in random and scale-free networks the effect of network mixing patterns on epidemic dynamics and the efficacy of disease contact tracing effective degree network disease models transmission dynamics and control of severe acute respiratory syndrome generality of the final size formula for an epidemic of a newly invading infectious disease effective degree household network disease models immunization and epidemic dynamics in complex networks network theory and sars: predicting outbreak diversity a note on a paper by erik volz: sir dynamics in random networks effective vaccination strategies for realistic social networks edge based compartmental modelling for infectious disease spread epidemic incidence in correlated complex networks office of the provincial health officer, 2010. 
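with the dispensing scaffold sketched above, the prioritized and contact tracing step lists reduce to selection rules. the two functions below are rough readings of those steps, with random tie-breaking inside a degree class and surplus doses simply left unused when fewer contacts can be traced than doses are available; both of these details are assumptions made here rather than statements from the paper.

```python
def prioritized_strategy(G, state, pool, doses, rng):
    """Highest-degree eligible nodes first, ties broken at random."""
    pool = list(pool)
    rng.shuffle(pool)                          # randomise within a degree class
    pool.sort(key=lambda v: G.degree(v), reverse=True)
    return pool[:doses]

def contact_tracing_strategy(G, state, pool, doses, rng):
    """Vaccinate eligible neighbours of infectious nodes, up to the daily cap."""
    pool = set(pool)
    chosen = []
    infectious = [v for v in G if state[v] == "I"
                  and any(u in pool for u in G[v])]
    rng.shuffle(infectious)
    for v in infectious:
        for u in G[v]:
            if len(chosen) >= doses:
                return chosen
            if u in pool:
                pool.discard(u)
                chosen.append(u)
    return chosen                              # surplus doses stay unused today
```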
b.c.s response to the h1n1 pandemic initial human transmission dynamics of the pandemic (h1n1) 2009 virus in north america dynamics and control of diseases in networks with community structure a high-resolution human contact network for infectious disease transmission networks, epidemics and vaccination through contact tracing estimated epidemiologic parameters and morbidity associated with pandemic h1n1 influenza sir dynamics in random networks with heterogeneous connectivity effects of heterogeneous and clustered contact patterns on infectious disease dynamics dynamical advantages of scale-free networks key: cord-303197-hpbh4o77 authors: humboldt-dachroeden, sarah; rubin, olivier; frid-nielsen, snorre sylvester title: the state of one health research across disciplines and sectors – a bibliometric analysis date: 2020-06-06 journal: one health doi: 10.1016/j.onehlt.2020.100146 sha: doc_id: 303197 cord_uid: hpbh4o77 there is a growing interest in one health, reflected by the rising number of publications relating to one health literature, but also through zoonotic disease outbreaks becoming more frequent, such as ebola, zika virus and covid-19. this paper uses bibliometric analysis to explore the state of one health in academic literature, to visualise the characteristics and trends within the field through a network analysis of citation patterns and bibliographic links. the analysis focuses on publication trends, co-citation network of scientific journals, co-citation network of authors, and co-occurrence of keywords. the bibliometric analysis showed an increasing interest for one health in academic research. however, it revealed some thematic and disciplinary shortcomings, in particular with respect to the inclusion of environmental themes and social science insights pertaining to the implementation of one health policies. the analysis indicated that there is a need for more applicable approaches to strengthen intersectoral collaboration and knowledge sharing. silos between the disciplines of human medicine, veterinary medicine and environment still persist. engaging researchers with different expertise and disciplinary backgrounds will facilitate a more comprehensive perspective where the human-animal-environment interface is not researched as separate entities but as a coherent whole. further, journals dedicated to one health or interdisciplinary research provide scholars the possibility to publish multifaceted research. these journals are uniquely positioned to bridge between fields, strengthen interdisciplinary research and create room for social science approaches alongside of medical and natural sciences. one health joins the three interdependent sectors -animal health, human health, and ecosystems -with the goal to holistically address health issues such as zoonotic diseases or antimicrobial resistance (1) . in 2010, the food and agriculture organization (fao), the world organisation for animal health (oie) and the world health organization (who) engaged in a tripartite collaboration to ensure a multisectoral perspective to effectively manage and coordinate a one health approach. one health is defined as "an approach to address a health threat at the human-animal-environment interface based on collaboration, communication, and coordination across all relevant sectors and disciplines, with the ultimate goal of achieving optimal health outcomes for both people and animals; a one health approach is applicable at the subnational, national, regional, and global level" (2). 
this paper uses bibliometric analysis to explore the state of one health in academic literature, to visualise between the disciplines of human medicine, veterinary medicine and environment still persist -even in the face of the one health approach. the data for the bibliometric analysis is drawn from the web of science (wos). the wos is arguably one of the largest academic multidisciplinary databases, and it contains more than 66,9 million contributions from the natural sciences (science citation index expanded), social sciences (social sciences citation index) and humanities (arts & humanities citation index) (7). the broad scope of the database aligns well with the one health concept's cross-disciplinary approach. the analytical period is demarcated by the first one health publication included in the wos in 1998 and it ends in december 2019. the search term "one health" was applied to compile the first crude sample of articles that mention the concept of one health in their title, keywords or abstract. the basic assumption is that articles conducting one health research ( whether conceptually, methodologically and/or empirically) would as a minimum have mentioned "one health" in the abstract, title or keywords. the literature search resulted in 2.004 english articles, see flow chart in figure 1. however, this sample also included a sizable group of articles that just made use of "one health" in a sentence such as "one health district" or "one health professional". to restrict the sample to contributions only pertaining to the concept of one health, two subsequent screening measures were taken. first, 587 contributions which used one health as a keyword were automatically included in the the bibliometric analysis was conducted with the bibliometrix package for the r programming language. the analysis focuses on: 1) publication trends, 2) co-citation network of scientific journals, 3) co-citation network of authors, and 4) co-occurrence of keywords. the publication trend is outlined using both absolute and relative number of one health publications. the co-citation networks of scientific journals provide information on the disciplinary structure of the field of one health while the co-citation network of authors disaggregates further to the citation patterns of individual authors. the co-citation network of journals shows the relation between the publications within the outlets. for example, when a publication within journal a cites publications within journals b and c, it indicates that journals b and c share similar characteristics. the more journals citing both b and c, the stronger their similarity. to minimise popularity bias among frequently cited journals, co-citation patterns are normalised through the jaccard index. the jaccard index measures the similarity between journals b and c as the intersection of journals citing both b and c, divided by the total number of journals that cited b and c individually (8, 9) . like the co-citation network of journals, the co-citation network for authors measures the similarity of authors in terms of how often they are cited by other authors , also normalised through the jaccard index. when author a cites both authors b and c, it signifies that b and c share similar characteristics. the study also investigates the co-occurrence of keywords to identify the content of one health publications. here, co-occurrence measures the similarity of keywords based on the number of times they occur together in different articles. 
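the jaccard normalisation described above can be computed directly from a table of which journals cite which; the sketch below assumes a toy dictionary structure for that table rather than the actual web of science export used in the study.

```python
from itertools import combinations
from collections import defaultdict

def jaccard_cocitation(citations):
    """Jaccard-normalised co-citation similarity between cited journals.

    `citations` maps each citing journal to the set of journals it cites,
    e.g. {"journal a": {"journal b", "journal c"}, ...} (a toy structure
    assumed here, not the study's data format).
    """
    citers = defaultdict(set)                  # cited journal -> set of citers
    for citing, cited_set in citations.items():
        for cited in cited_set:
            citers[cited].add(citing)

    sim = {}
    for b, c in combinations(sorted(citers), 2):
        union = citers[b] | citers[c]
        if union:
            sim[(b, c)] = len(citers[b] & citers[c]) / len(union)
    return sim

# toy example: journals A and D cite both B and C, journal E cites only B
example = {"A": {"B", "C"}, "D": {"B", "C"}, "E": {"B"}}
print(jaccard_cocitation(example)[("B", "C")])   # 2 common citers / 3 overall = 0.67
```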
it provides information on the main other topical keywords linked to one health and can thus be used to gauge the knowledge structure of the field. here, the articles keywords plus are the unit of analysis. wos automatically generates keywords plus based on the words or phrases appearing most frequently in an articles bibliography. keywords plus are more fruitful for bibliometric analyses than author keywords, as they convey more general themes, methods and research techniques (10) . disciplinary clusters within the networks, illustrated by the colours in figures 3 to 5, are identified empirically applying the louvain clustering algorithm. louvain is a hierarchical clustering algorithm that attempts to maximise modularity, measured by the density of edges between nodes within communities and sparsity between nodes across communities. the nodes represent the aggregated citations of the academic journals and the edges, the line between two nodes, display the relation between the journals. the shorter the path between the nodes the stronger their relation. node size indicates "betweenness centrality" in the network, which is a measure of the number of shortest paths passing through each node (11) . betweenness centrality estimates the importance of a node on the flow of information through the network, based on the assumption that information generally flows through the most direct communicative pathways. for example, the one health publications in our sample relating to ebola have more than tripled after 2016. one might, therefore, expect to observe a similar spike in one health publications that study the covid-19 outbreak in 2020. while the use of the one health concept has increased, the co-citation network shows that the increase is mostly driven by the sectors of human and veterinary medicine, evidenced by their centrality in terms of information flows within the network. relations to other clusters. the area of parasitology is also mostly co-cited in its own area. here, most aggregated citations are rooted in the journal plos neglected tropical diseases. in these last two clusters, microbiology and parasitology, the journals cover topics mainly exclusively pertaining to medical or biological sciences. the most active one health scholars, publishing more than ten articles over the last 12 years, are from the field of veterinary research. of the top six researchers, five have a veterinary background (jakob zinsstag, jonathan rushton, esther schelling, barbara häsler and bassirou bonfoh). while degeling is the only researcher of the top six with an education in the social sciences, the remaining five veterinarian scholars do touch upon social science themes within their publications, relating to systemic or conceptual approaches, sociopolitical dimensions and knowledge integration (e.g. zinsstag and schelling (14) ; häsler (15) ; rushton (16) . five of the six most productive researchers work in europe and three of them are associated with the same institute, namely the swiss tropical and public health institute (zinsstag, schelling and bonfoh) (17) .there has been some cooperation across institutes and department as evidence by the coauthorships of zinsstag and häsler, häsler and rushton, rushton and zinsstag (e.g. (18) (19) (20) ). figure 4 illustrates the co-citation network of authors. four clusters of authors emerged in the network (green: zoonoses and epidemiology; blue: biodiversity and ecohealth; purple: animal health, public health; red: policy-related disciplines). 
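the clustering and centrality steps described here are available off the shelf; the sketch below assumes the co-citation similarities from the previous sketch as edge weights and a networkx version recent enough to ship louvain_communities (the standalone python-louvain package offers the same algorithm).

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities  # networkx >= 2.8

def cluster_and_rank(cocitation_edges):
    """Louvain communities and betweenness centrality for a co-citation network.

    `cocitation_edges` is assumed to be an iterable of
    (journal_a, journal_b, similarity) triples, e.g. the Jaccard scores
    from the previous sketch.
    """
    G = nx.Graph()
    G.add_weighted_edges_from(cocitation_edges)

    # Modularity-maximising communities (the coloured clusters in the figures).
    communities = louvain_communities(G, weight="weight", seed=1)

    # Betweenness centrality as a proxy for information flow through a node,
    # mirroring how node size is described for the networks in this study.
    # Shortest paths are treated as unweighted here; to use the similarities,
    # they would first have to be converted into distances.
    betweenness = nx.betweenness_centrality(G)
    return communities, betweenness
```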
academic scholars are mainly found in the green, blue and purple clusters, whereas the authors of the red clusters are mainly represented by organisations such as the who, cdc, perspectives from the environmental and ecological sector have been neglected within one health research (24, 25) . further, the co-occurrence network of keywords illustrated that research into one health is mainly undertaken in the medical science cluster with the most connections to the other clusters. this indicates that a majority of articles is constructed around medical themes, and that there is most interdisciplinary research across areas in the medical science cluster. however, few keywords indicate research into administrative or anthropological approaches to examine the management of one health. making these thematic perspectives more central to the network could strengthen the one health approach regarding implementation and institutionalisation. one health initiatives and projects that specifically promote mixed methods studies and engage researchers with various expertise could facilitate implementing comprehensive initiatives. here, a gap in the one health research could be addressed, facilitating not only quantitative but a qualitative research to comprehensively approach the multifaceted issues implied in one health topics (26) . there is no shortage of existing outlets, frameworks and approaches that promote interdisciplinary research. already in 2008, a strategic framework was developed by the tripartite collaborators, as well as the un system influenza coordination, unicef and the world bank, outlining approaches for collaboration, to prevent crises, to govern disease control and surveillance programmes (27) . rüegg et al. developed a handbook to adapt, improve and optimise one health activities could also provide some guidance on how to strengthen future one health activities and evaluate already ongoing one health initiatives (18) . coker et al. produced a conceptual framework for one health, which can be used to develop a strong research the fao-oie-who collaboration -sharing responsibilities and coordinating global activities to address health risks at the animal-human-ecosystems interfaces -a tripartite concept note applied informetrics for digital libraries: an overview of foundations, problems and current approaches transdisciplinary and social-ecological health frameworks-novel approaches to emerging parasitic and vector-borne diseases posthumanist critique and human health: how nonhumans (could) figure in public health research citation index is not critically important to veterinary pathology on the normalization and visualization of author co-citation data: salton's cosine versus the jaccard index similarity measures in scientometric research: the jaccard index versus salton's cosine formula. 
information processing & management comparing keywords plus of wos and author keywords: a case study of patient adherence research fast unfolding of communities in large networks ebola outbreak distribution in west africa ebola virus disease) reporting and surveillance -zika virus from "one medicine" to "one health" and systemic approaches to health and well-being knowledge integration in one health policy formulation, implementation and evaluation towards a conceptual framework to support one-health research for policy on emerging zoonoses swiss tph -swiss tropical and public health institute integrated approaches to health: a handbook for the evaluation of one health a review of the metrics for one health benefits a blueprint to evaluate one health. front public health implementing a one health approach to emerging infectious disease: reflections on the socio-political, ethical and legal dimensions overcoming challenges for designing and a framework for one health research. one health the growth and strategic functioning of one health networks: a systematic analysis. the lancet planetary health qualitative research for one health: from methodological principles to impactful applications. front vet sci contributing to one world, one health* -a strategic framework for reducing risks of infectious diseases at the animal -human-ecosystems interface birds of a feather: homophily in social networks homophily in co-autorship networks key: cord-034824-eelqmzdx authors: guo, chungu; yang, liangwei; chen, xiao; chen, duanbing; gao, hui; ma, jing title: influential nodes identification in complex networks via information entropy date: 2020-02-21 journal: entropy (basel) doi: 10.3390/e22020242 sha: doc_id: 34824 cord_uid: eelqmzdx identifying a set of influential nodes is an important topic in complex networks which plays a crucial role in many applications, such as market advertising, rumor controlling, and predicting valuable scientific publications. in regard to this, researchers have developed algorithms from simple degree methods to all kinds of sophisticated approaches. however, a more robust and practical algorithm is required for the task. in this paper, we propose the enrenew algorithm aimed to identify a set of influential nodes via information entropy. firstly, the information entropy of each node is calculated as initial spreading ability. then, select the node with the largest information entropy and renovate its l-length reachable nodes’ spreading ability by an attenuation factor, repeat this process until specific number of influential nodes are selected. compared with the best state-of-the-art benchmark methods, the performance of proposed algorithm improved by 21.1%, 7.0%, 30.0%, 5.0%, 2.5%, and 9.0% in final affected scale on cenew, email, hamster, router, condmat, and amazon network, respectively, under the susceptible-infected-recovered (sir) simulation model. the proposed algorithm measures the importance of nodes based on information entropy and selects a group of important nodes through dynamic update strategy. the impressive results on the sir simulation model shed light on new method of node mining in complex networks for information spreading and epidemic prevention. complex networks are common in real life and can be used to represent complex systems in many fields. 
for example, collaboration networks [1] are used to cover the scientific collaborations between authors, email networks [2] denote the email communications between users, protein-dna networks [3] help people gain a deep insight on biochemical reaction, railway networks [4] reveal the structure of railway via complex network methods, social networks show interactions between people [5, 6] , and international trade network [7] reflects the products trade between countries. a deep understanding and controlling of different complex networks is of great significance in information spreading and network connectivity. on one hand, by using the influential nodes, we can make successful advertisements for products [8] , discover drug target candidates, assist information weighted networks [54] and social networks [55] . however, the node set built by simply assembling the nodes and sorting them employed by the aforementioned methods may not be comparable to an elaborately selected set of nodes due to the rich club phenomenon [56] , namely, important nodes tend to overlap with each other. thus, lots of methods aim to directly select a set of nodes are proposed. kempe et al. defined the problem of identifying a set of influential spreaders in complex networks as influence maximization problem [57] , and they used hill-climbing based greedy algorithm that is within 63% of optimal in several models. greedy method [58] is usually taken as the approximate solution of influence maximization problem, but it is not efficient for its high computational cost. chen et al. [58] proposed newgreedy and mixedgreedy method. borgatti [59] specified mining influential spreaders in social networks by two classes: kpp-pos and kpp-neg, based on which he calculated the importance of nodes. narayanam et al. [60] proposed spin algorithm based on shapley value to deal with information diffusion problem in social networks. although the above greedy based methods can achieve relatively better result, they would cost lots of time for monte carlo simulation. so more heuristic algorithms were proposed. chen et al. put forward simple and efficient degreediscount algorithm [58] in which if one node is selected, its neighbors' degree would be discounted. zhang et al. proposed voterank [61] which selects the influential node set via a voting strategy. zhao et al. [62] introduced coloring technology into complex networks to seperate independent node sets, and selected nodes from different node sets, ensuring selected nodes are not closely connected. hu et al. [63] and guo et al. [64] further considered the distance between independent sets and achieved a better performance. bao et al. [65] sought to find dispersive distributed spreaders by a heuristic clustering algorithm. zhou [66] proposed an algorithm to find a set of influential nodes via message passing theory. ji el al. [67] considered percolation in the network to obtain a set of distributed and coordinated spreaders. researchers also seek to maximize the influence by studying communities [68] [69] [70] [71] [72] [73] . zhang [74] seperated graph nodes into communities by using k-medoid method before selecting nodes. gong et al. [75] divided graph into communities of different sizes, and selected nodes by using degree centrality and other indicators. chen et al. [76] detected communities by using shrink and kcut algorithm. later they selected nodes from different communities as candidate nodes, and used cdh method to find final k influential nodes. 
recently, some novel methods based on node dynamics have been proposed which rank nodes to select influential spreaders [77, 78] .şirag erkol et al. made a systematic comparison between methods focused on influence maximization problem [79] . they classify multiple algorithms to three classes, and made a detailed explanation and comparison between methods. more algorithms in this domain are described and classified clearly by lü et al. in their review paper [80] . most of the non-greedy strategy methods suffer from a possibility that some spreaders are so close that their influence may overlap. degreediscount and voterank use iterative selection strategy. after a node is selected, they weaken its neighbors' influence to cope with the rich club phenomenon. however, these two algorithms roughly induce nodes' local information. besides, they do not further make use of the difference between nodes when weakening nodes' influence. in this paper, we propose a new heuristic algorithm named enrenew based on node's entropy to select a set of influential nodes. enrenew also uses iterative selection strategy. it initially calculates the influence of each node by its information entropy (further explained in section 2.2), and then repeatedly select the node with the largest information entropy and renovate its l-length reachable nodes' information entropy by an attenuation factor until specific number of nodes are selected. experiments show that the proposed method yields the largest final affected scale on 6 real networks in the susceptible-infected-recovered (sir) simulation model compared with state-of-the-art benchmark methods. the results reveal that enrenew could be a promising tool for related work. besides, to make the algorithm practically more useful, we provide enrenew's source code and all the experiments details on https://github.com/yangliangwei/influential-nodes-identification-in-complex-networksvia-information-entropy, and researchers can download it freely for their convenience. the rest of paper is organized as follows: the identifying method is presented in section 2. experiment results are analyzed and discussed in section 3. conclusions and future interest research topics are given in section 4. the best way to measure the influence of a set of nodes in complex networks is through propagation dynamic process on real life network data. a susceptible infected removed model (sir model) is initially used to simulate the dynamic of disease spreading [23] . it is later widely used to analyze similar spreading process, such as rumor [81] and population [82] . in this paper, the sir model is adopted to objectively evaluate the spreading ability of nodes selected by algorithms. each node in the sir model can be classified into one of three states, namely, susceptible nodes (s), infected nodes (i), and recovered nodes (r). at first, set initial selected nodes to infected status and all others in network to susceptible status. in each propagation iteration, each infected node randomly choose one of its direct neighbors and infect it with probability µ. in the meantime, each infected node will be recovered with probability β and won't be infected again. in this study, λ = µ β is defined as infected rate, which is crucial to the spreading speed in the sir model. apparently, the network can reach a steady stage with no infection after enough propagation iterations. 
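a minimal simulation of the sir variant described above (each infected node picks one random neighbour and tries to infect it with probability µ, then recovers with probability β) could look as follows; the graph, seed set and parameters are placeholders chosen for the example, not those used in the paper.

```python
import random
import networkx as nx

def sir_spread(G, seeds, mu, beta, max_steps=10_000):
    """Run the SIR process and return the final affected scale (n_R / n)."""
    infected = set(seeds)
    recovered = set()
    steps = 0
    while infected and steps < max_steps:
        newly_infected, newly_recovered = set(), set()
        for v in infected:
            neighbours = list(G.neighbors(v))
            if neighbours:
                u = random.choice(neighbours)          # pick one direct neighbour at random
                if u not in infected and u not in recovered and random.random() < mu:
                    newly_infected.add(u)
            if random.random() < beta:                 # recover; never infected again
                newly_recovered.add(v)
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
        steps += 1
    return len(recovered | infected) / G.number_of_nodes()

G = nx.erdos_renyi_graph(1000, 0.01, seed=0)
print(sir_spread(G, seeds=[0, 1, 2], mu=0.3, beta=0.2))
```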
to enable information to spread widely in networks, we set µ = 1.5 µ_c, where µ_c = ⟨k⟩ / (⟨k²⟩ − ⟨k⟩) [83] is the spreading threshold of sir and ⟨k⟩ is the average degree of the network. when µ is smaller than µ_c, spreading in sir affects only a small range or may not spread at all; when it is much larger than µ_c, nearly all methods affect the whole network, which makes comparison meaningless. thus, we select µ around µ_c in the experiments. during the sir propagation described above, enough information can be obtained to evaluate the impact of the initially selected nodes in the network; the metrics derived from this procedure are explained in section 2.4. the influential nodes selection algorithm proposed in this paper is named enrenew, deduced from the concept of the algorithm: enrenew introduces entropy and renews the nodes' entropy through an iterative selection process. enrenew is inspired by the voterank algorithm proposed by zhang et al. [61], where the influential nodes are selected in an iterative voting procedure. voterank assigns each node a voting ability and a score. initially, each node's voting ability towards its neighbors is 1. after a node is selected, its direct neighbors' voting ability is decreased by 1/⟨k⟩, where ⟨k⟩ = 2m/n is the average degree of the network. voterank thus roughly assigns all nodes in the graph the same voting ability and attenuation factor, which ignores the nodes' local information. to overcome this shortcoming, we propose a heuristic algorithm named enrenew, described as follows. in information theory, information quantity measures the information brought about by a specific event, and information entropy is the expectation of the information quantity. these two concepts are introduced into complex networks in references [44-46] to calculate the importance of a node. the information entropy of any node v can be calculated by e_v = ∑_{u∈γ_v} h_uv with h_uv = −p_uv · log p_uv, where p_uv = d_u / ∑_{l∈γ_v} d_l (so that ∑_{l∈γ_v} p_lv = 1), γ_v indicates node v's direct neighbors, and d_u is the degree of node u. h_uv is the spreading ability provided from u to v, and e_v is node v's information entropy, indicating its initial importance, which is renewed as described in algorithm 1. a detailed calculation of node entropy is shown in figure 1, which shows how the red node's (node 1) entropy is computed: node 1 has four neighbors, nodes 2 to 5, and its information entropy is obtained by summing the contributions h_u1 over these neighbors. simply selecting the nodes with the largest degree as initial spreaders might not achieve good results, because most real networks exhibit an obvious clumping phenomenon, that is, high-impact nodes are often connected closely within the same community, so information cannot be copiously disseminated to the whole network. to manage this situation, after each high-impact node is selected, we renovate the information entropy of all nodes in its local scope and then select the node with the highest information entropy; the process is shown in algorithm 1. e_⟨k⟩ = −⟨k⟩ · (1/⟨k⟩) · log(1/⟨k⟩), where ⟨k⟩ is the average degree of the network, and 1/2^(l−1) is the attenuation factor: the farther a node is from the selected node v, the smaller the impact on it will be. e_⟨k⟩ can be seen as the information entropy of any node in a ⟨k⟩-regular graph if ⟨k⟩ is an integer. from algorithm 1, we can see that after a new node is selected, the renewal of its l-length reachable nodes' information entropy depends on h and e_⟨k⟩, which reflect local structure information and global network information, respectively. 
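as a small illustration of the entropy definition above, the sketch below computes e_v for every node of a toy graph with networkx; the natural logarithm is an assumption here (the base only rescales the values).

```python
import math
import networkx as nx

def node_entropy(G):
    """Initial spreading ability e_v = sum_{u in N(v)} -p_uv * log(p_uv),
    with p_uv = d_u / sum_{l in N(v)} d_l taken over v's direct neighbours."""
    entropy = {}
    for v in G.nodes:
        neighbour_degree_sum = sum(G.degree(l) for l in G.neighbors(v))
        e_v = 0.0
        for u in G.neighbors(v):
            p_uv = G.degree(u) / neighbour_degree_sum
            e_v += -p_uv * math.log(p_uv)      # h_uv, the contribution of u to v
        entropy[v] = e_v
    return entropy

# toy example: a small star attached to a triangle
G = nx.Graph([(1, 2), (1, 3), (1, 4), (1, 5), (2, 3)])
print(node_entropy(G))
```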
compared with voterank, enrenew replaces the voting ability by the h value between connected nodes, which carries more local information than setting the voting ability directly to 1 as in voterank. at the same time, enrenew uses h/e_⟨k⟩ as the attenuation factor instead of 1/⟨k⟩ in voterank, thereby retaining global information. computational complexity (usually time complexity) describes the relationship between inputs of different scales and the running time of an algorithm. generally, brute force can solve most problems exactly, but it cannot be applied in most scenarios because of its intolerable time complexity, so time complexity is an important indicator of an algorithm's practicality. the analysis below shows that the algorithm can identify influential nodes in large-scale networks in limited time. the computational complexity of enrenew can be analyzed in three parts: initialization, selection and renewing. n, m and r represent the number of nodes, edges and initial infected nodes, respectively. at the start, enrenew takes o(n · ⟨k⟩) = o(m) to calculate the information entropy. node selection picks the node with the largest information entropy and requires o(n), which can be decreased to o(log n) if the entropies are stored in an efficient data structure such as a red-black tree. renewing the l-length reachable nodes' information entropy needs o(⟨k⟩^l) = o(m^l / n^l). as suggested in section 3.3, l = 2 yields impressive results with o(m² / n²). since the selection and renewing parts need to be performed r times to get enough spreaders, the final computational complexity is o(m + n) + o(r log n) + o(r ⟨k⟩²) = o(m + n + r log n + r m² / n²). in particular, when the network is sparse and r ≪ n, the complexity decreases to o(n). the algorithm's performance is measured through properties of the selected nodes, including their spreading ability and their location. spreading ability is measured by the infected scale at time t, f(t), and the final infected scale f(t_c), which are obtained from the sir simulation and widely used to measure the spreading ability of nodes [61, 84-88]. l_s is obtained from the selected nodes' location by measuring their dispersion [61]. the infected scale f(t) captures the influence scale at time t and is defined by f(t) = (n_i(t) + n_r(t)) / n, where n_i(t) and n_r(t) are the numbers of infected and recovered nodes at time t, respectively. at the same time step t, a larger f(t) indicates that more nodes are infected by the initial influential nodes, while a shorter time t indicates that the initial influential nodes spread faster in the network. f(t_c) is the final affected scale when the spreading reaches a stable state and reflects the final spreading ability of the initial spreaders: the larger the value, the stronger the spreading capacity of the initial nodes. it is defined by f(t_c) = n_r(t_c) / n, where t_c is the time when the sir propagation procedure reaches its stable state (no infected nodes remain). l_s is the average shortest path length of the initial infection set s. usually, with a larger l_s, the initial spreaders are more dispersed and can influence a larger range. it is defined by l_s = (1 / (|s| · (|s| − 1))) · ∑_{u,v∈s, u≠v} ℓ_{u,v}, where ℓ_{u,v} denotes the length of the shortest path from node u to v. if u and v are disconnected, the shortest path length is replaced by d_gc + 1, where d_gc is the largest diameter of the connected components. an example network, shown in figure 2, is used to illustrate the rationality of the nodes chosen by the proposed algorithm: the first three nodes selected by enrenew are distributed across three communities, while those selected by the other algorithms are not. 
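the iterative selection and renewal described above could be organised as in the sketch below, which reuses the node_entropy helper and the toy graph g from the previous sketch. the exact renewal formula is the one given in algorithm 1 of the paper and in the authors' repository; the update used here, subtracting an attenuated, e_⟨k⟩-normalised fraction of the node's entropy, is only an assumed reading of it.

```python
import math
from collections import deque

def enrenew_select(G, r, L=2):
    """Iteratively pick r spreaders: take the node with the largest entropy,
    then attenuate the entropy of its L-length reachable nodes."""
    entropy = node_entropy(G)                          # from the previous sketch
    k_avg = 2 * G.number_of_edges() / G.number_of_nodes()
    E_k = -k_avg * (1 / k_avg) * math.log(1 / k_avg)   # entropy in a <k>-regular graph
    selected = []
    for _ in range(r):
        v = max((u for u in entropy if u not in selected), key=entropy.get)
        selected.append(v)
        entropy[v] = float("-inf")                     # never pick it again
        # BFS up to L hops; remember each node's predecessor on the path from v
        dist, pred = {v: 0}, {}
        queue = deque([v])
        while queue:
            w = queue.popleft()
            if dist[w] == L:
                continue
            for u in G.neighbors(w):
                if u not in dist:
                    dist[u], pred[u] = dist[w] + 1, w
                    queue.append(u)
        for u, l in dist.items():
            if u == v or entropy[u] == float("-inf"):
                continue
            w = pred[u]
            denom = sum(G.degree(x) for x in G.neighbors(w))
            p_uw = G.degree(u) / denom
            h_uw = -p_uw * math.log(p_uw)
            # assumed renewal form: attenuated, E_k-normalised decay of u's entropy
            entropy[u] -= (1 / 2 ** (l - 1)) * (h_uw / E_k) * entropy[u]
    return selected

print(enrenew_select(G, r=2))
```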
we further ran the sir simulation on the example network with enrenew and the five other benchmark methods; the detailed results, obtained by averaging 1000 experiments, are shown in table 1 for an in-depth discussion. (figure 2: the example network consists of three communities at different scales; the first nine nodes selected by enrenew are marked red. the network typically shows the rich-club phenomenon, that is, nodes with large degree tend to be connected together.) table 1 shows the experiment results when choosing 9 nodes as the initial spreading set. the greedy method is usually used as the upper bound, but it is not efficient in large networks due to its high time complexity. enrenew and pagerank distribute 4 nodes in community 1, 3 nodes in community 2, and 1 node in community 3, a distribution that matches the community sizes. the nodes selected by the other algorithms, except for the greedy method, tend to cluster in community 1, which induces spreading within a high-density area and is not efficient for spreading over the entire network. enrenew and pagerank can adaptively allocate a reasonable number of nodes based on the size of each community, just as the greedy method does. nodes selected by enrenew have the second largest average distance after greedy, which indicates that enrenew tends to distribute nodes sparsely in the graph and aptly alleviates the adverse effect on spreading caused by the rich-club phenomenon. although enrenew's average distance is smaller than pagerank's, it reaches a higher final infected scale f(t_c); the result on pagerank also indicates that merely selecting nodes widely spread across the network may not lead to a larger influence range. enrenew performs the closest to greedy at a low computational cost, which shows the proposed algorithm's effectiveness in maximizing influence with a limited number of nodes. note: n and m are the total numbers of nodes and edges, respectively; ⟨k⟩ = 2m/n stands for the average node degree; k_max = max_{v∈v} d_v is the maximum degree in the network; and the average clustering coefficient c measures the degree of aggregation in the network, c = (1/n) ∑_{i=1}^{n} 2 i_i / (|γ_i| (|γ_i| − 1)), where i_i denotes the number of edges between the direct neighbors of node i. table 2 describes six different networks varying from small to large scale, which are used to evaluate the performance of the methods. cenew [89] is the edge list of the metabolic network of c. elegans. email [90] is an email user communication network. hamster [91] reflects friendship and family links between users of the website http://www.hamsterster.com, where nodes are web users and edges are relationships between them. the router network [92] reflects the internet topology at the router level. condmat (condensed matter physics) [93] is a collaboration network of authors of scientific papers from the arxiv submitted to condensed matter physics: a node represents an author, and an edge between two nodes shows that the two authors have published papers together. in the amazon network [94], each node represents a product, and an edge between two nodes indicates that the two products were frequently purchased together. 
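before moving to the experiments, a small helper for the dispersion metric l_s introduced earlier could look as follows (a sketch; the fallback value for disconnected pairs follows the definition given above).

```python
import itertools
import networkx as nx

def average_selected_distance(G, selected):
    """L_s: mean shortest-path length over all pairs of selected nodes,
    replacing the length by d_gc + 1 when a pair is disconnected."""
    d_gc = max(nx.diameter(G.subgraph(c)) for c in nx.connected_components(G))
    lengths = []
    for u, v in itertools.combinations(selected, 2):
        try:
            lengths.append(nx.shortest_path_length(G, u, v))
        except nx.NetworkXNoPath:
            lengths.append(d_gc + 1)
    return sum(lengths) / len(lengths)
```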
the results with varying parameter l from 1 to 4 on four networks are shown in figure 3 . it can be seen from figure 3 that, when l = 2, the method gets the best performance in four of the six networks. in network email, although the results when l = 3 and l = 4 are slightly better comparing with the case of l = 2, the running time increases sharply. besides, the three degrees of influence (tdi) theory [95] also states that a individual's social influence is only within a relatively small range. based on our experiments, we set the influence range parameter l at 2 in the preceding experiments. with specific ratio of initial infected nodes p, larger final affected scale f(t c ) means more reasonable of the parameter l. the best parameter l differs from different networks. in real life application, l can be used as an tuning parameter. many factors affect the final propagation scale in networks. a good influential nodes mining algorithm should prove its robustness in networks varying in structure, nodes size, initial infection set size, infection probability, and recovery probability. to evaluate the performance of enrenew, voterank , adaptive degree, k-shell, pagerank, and h-index algorithms are selected as benchmark methods for comparing. furthermore, greedy method is usually taken as upper bound on influence maximization problem, but it is not practical on large networks due to its high time computational complexity. thus, we added greedy method as upper bound on the two small networks (cenew and email). the final affected scale f(t c ) of each method on different initial infected sizes are shown in figure 4 . it can be seen that enrenew achieves an impressing result on the six networks. in the small network, such as cenew and email, enrenew has an apparent better result on the other benchmark methods. besides, it nearly reaches the upper bound on email network. in hamster network, it achieves a f(t c ) of 0.22 only by ratio of 0.03 initial infected nodes, which is a huge improvement than all the other methods. in condmat network, the number of affected nodes are nearly 20 times more than the initial ones. in a large amazon network, 11 nodes will be affected on average for one selected initial infected node. but the algorithm performs unsatisfactory on network router. all the methods did not yield good results due to the high sparsity structure of the network. in this sparse network, the information can hardly spread out with small number of initial spreaders. by comparing the 6 methods from the figure 4 , enrenew surpasses all the other methods on five networks with nearly all kinds of p varying from small to large. this result reveals that when the size of initial infected nodes varies, enrenew also shows its superiority to all the other methods. what is worth noticing is that enrenew performs about the same as other methods when p is small, but it has a greater improvement with the rise of initial infected ratio p. this phenomenon shows the rationality of the importance renewing process. the renewing process of enrenew would influence more nodes when p is larger. the better improvement of enrenew than other methods shows the renewing process reasonability redistributes nodes' importance. timestep experiment is made to assess the propagation speed when given a fixed number of initial infected nodes. the exact results of f(t) varying with time step t are shown in figure 5 . 
from the experiment, it can be seen that with same number of initial infected nodes, enrenew always reaches a higher peak than the benchmark methods, which indicates a larger final infection rate. in the steady stage, enrenew surpasses the best benchmark method by 21.1%, 7.0%, 30.0%, 5.0%, 2.5% and 9.0% in final affected scale on cenew, email, hamster, router, condmat and amazon networks, respectively. in view of propagation speed, enrenew reaches the peak at about 300th time step in cenew, 200th time step in email, 400th time step in hamster, 50th time step in router, 400th time step in condmat and 150th time step in amazon. enrenew always takes less time to influence the same number of nodes compared with other benchmark methods. from figure 5 , it can also be seen that k-shell also performs worst from the early stage in all the networks. nodes with high core value tend to cluster together, which makes information hard to dissipate. especially in the amazon network, after 100 timesteps, all other methods reach a f(t) of 0.0028, which is more than twice as large as k-shell. in contrast to k-shell, enrenew spreads the fastest from early stage to the steady stage. it shows that the proposed method not only achieve a larger final infection scale, but also have a faster infection rate of propagation. in real life situations, the infected rate λ varies greatly and has huge influence on the propagation procedure. different λ represents virus or information with different spreading ability. the results on different λ and methods are shown in figure 6 . from the experiments, it can be observed that in most of cases, enrenew surpasses all other algorithms with λ varying from 0.5 to 2.0 on all networks. besides, experiment results on cenew and email show that enrenew nearly reaches the upper bound. it shows enrenew has a stronger generalization ability comparing with other methods. especially, the enrenew shows its impressing superiority in strong spreading experiments when λ is large. generally speaking, if the selected nodes are widely spread in the network, they tend to have an extensive impact influence on information spreading in entire network. l s is used to measure dispersity of initial infected nodes for algorithms. figure 7 shows the results of l s of nodes selected by different algorithms on 6 different networks. it can be seen that, except for the amazon network, enrenew always has the largest l s , indicting the widespread of selected nodes. especially in cenew, enrenew performs far beyond all the other methods as its l s is nearly as large as the upper bound. in regard to the large-scale amazon network, the network contains lots of small cliques and k-shell selects the dispersed cliques, which makes k-shell has the largest l s . but other experimental results of k-shell show a poor performance. this further confirms that enrenew does not naively distribute selected nodes widely across the network, but rather based on the potential propagation ability of each node. figure 5 . this experiment compares different methods regard to spreading speed. each subfigure shows experiment results on one network. the ratio of initial infected nodes is 3% for cenew, email, hamster and router, 0.3% for condmat and 0.03% for amazon. the results are obtained by averaging on 100 independent runs with spread rate λ = 1.5 in sir. with the same spreading time t, larger f(t) indicates larger influence scale in network, which reveals a faster spreading speed. 
it can be seen from the figures that enrenew spreads noticeably faster than the other benchmark methods on all networks; on the small networks cenew and email, enrenew's spreading speed is close to the upper bound. (figure 6: this experiment tests the algorithms' effectiveness under different spreading conditions. each subfigure shows results on one network. the ratio of initial infected nodes is 3% for cenew, email, hamster and router, 0.3% for condmat, and 0.03% for amazon. the results are averaged over 100 independent runs. different infected rates λ of sir imitate different spreading conditions. enrenew obtains a larger final affected scale f(t_c) for the different λ than all the other benchmark methods, which indicates that the proposed algorithm generalizes better to different spreading conditions.) (figure 7: this experiment analyses the average shortest path length l_s of the nodes selected by the different algorithms. each subfigure shows results on one network; p is the ratio of initial infected nodes. generally speaking, a larger l_s indicates that the selected nodes are more sparsely distributed in the network. nodes selected by enrenew have the clearly largest l_s on five networks, showing that enrenew tends to select sparsely distributed nodes.) the influential nodes identification problem has been widely studied by scientists from computer science through to all disciplines [96-100], and various algorithms have been proposed to solve particular problems in this field. in this study, we proposed a new method named enrenew by introducing entropy into complex networks, and the sir model was adopted to evaluate the algorithms. experimental results on 6 real networks, varying from small to large in size, show that enrenew is superior to state-of-the-art benchmark methods in most cases. besides, with its low computational complexity, the presented algorithm can be applied to large-scale networks. the enrenew algorithm proposed in this paper can also be applied to rumor control, advertisement targeting, and many other related areas. however, for influential nodes identification, many challenges remain from different perspectives. from the perspective of network size, mining influential spreaders in large-scale networks efficiently is a challenging problem. in the area of time-varying networks, most networks are constantly changing, which poses the challenge of identifying influential spreaders that could shift with the changing topology. multilayer networks contain information from different dimensions with interactions between layers and have attracted a lot of research interest [101-103]; to identify influential nodes in multilayer networks, we need to further consider how to better combine information from the different layers and the relations between them. 
self-similar community structure in a network of human interactions insights into protein-dna interactions through structure network analysis statistical analysis of the indian railway network: a complex network approach social network analysis network analysis in the social sciences prediction in complex systems: the case of the international trade network the dynamics of viral marketing extracting influential nodes on a social network for information diffusion structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review efficient immunization strategies for computer networks and populations a study of epidemic spreading and rumor spreading over complex networks epidemic processes in complex networks unification of theoretical approaches for epidemic spreading on complex networks epidemic spreading in time-varying community networks suppression of epidemic spreading in complex networks by local information based behavioral responses efficient allocation of heterogeneous response times in information spreading process absence of influential spreaders in rumor dynamics a model of spreading of sudden events on social networks daniel bernoulli?s epidemiological model revisited herd immunity: history, theory, practice epidemic disease in england: the evidence of variability and of persistency of type infectious diseases of humans: dynamics and control thermodynamic efficiency of contagions: a statistical mechanical analysis of the sis epidemic model a rumor spreading model based on information entropy an algorithmic information calculus for causal discovery and reprogramming systems the hidden geometry of complex, network-driven contagion phenomena extending centrality the h-index of a network node and its relation to degree and coreness identifying influential nodes in complex networks identifying influential nodes in large-scale directed networks: the role of clustering collective dynamics of ?small-world?networks identification of influential spreaders in complex networks ranking spreaders by decomposing complex networks eccentricity and centrality in networks the centrality index of a graph a set of measures of centrality based on betweenness a new status index derived from sociometric analysis mutual enhancement: toward an understanding of the collective preference for shared information factoring and weighting approaches to status scores and clique identification dynamical systems to define centrality in social networks the anatomy of a large-scale hypertextual web search engine leaders in social networks, the delicious case using mapping entropy to identify node centrality in complex networks path diversity improves the identification of influential spreaders how to identify the most powerful node in complex networks? 
a novel entropy centrality approach a novel entropy-based centrality approach for identifying vital nodes in weighted networks node importance ranking of complex networks with entropy variation key node ranking in complex networks: a novel entropy and mutual information-based approach a new method to identify influential nodes based on relative entropy influential nodes ranking in complex networks: an entropy-based approach discovering important nodes through graph entropy the case of enron email database identifying node importance based on information entropy in complex networks ranking influential nodes in complex networks with structural holes ranking influential nodes in social networks based on node position and neighborhood detecting rich-club ordering in complex networks maximizing the spread of influence through a social network efficient influence maximization in social networks identifying sets of key players in a social network a shapley value-based approach to discover influential nodes in social networks identifying a set of influential spreaders in complex networks identifying effective multiple spreaders by coloring complex networks effects of the distance among multiple spreaders on the spreading identifying multiple influential spreaders in term of the distance-based coloring identifying multiple influential spreaders by a heuristic clustering algorithm spin glass approach to the feedback vertex set problem effective spreading from multiple leaders identified by percolation in the susceptible-infected-recovered (sir) model finding influential communities in massive networks community-based influence maximization in social networks under a competitive linear threshold model a community-based algorithm for influence blocking maximization in social networks detecting community structure in complex networks via node similarity community structure detection based on the neighbor node degree information community-based greedy algorithm for mining top-k influential nodes in mobile social networks identifying influential nodes in complex networks with community structure an efficient memetic algorithm for influence maximization in social networks efficient algorithms for influence maximization in social networks local structure can identify and quantify influential global spreaders in large scale social networks identifying influential spreaders in complex networks by propagation probability dynamics systematic comparison between methods for the detection of influential spreaders in complex networks vital nodes identification in complex networks sir rumor spreading model in the new media age stochastic sir epidemics in a population with households and schools thresholds for epidemic spreading in networks a novel top-k strategy for influence maximization in complex networks with community structure identifying influential spreaders in complex networks based on kshell hybrid method identifying key nodes based on improved structural holes in complex networks ranking nodes in complex networks based on local structure and improving closeness centrality an efficient algorithm for mining a set of influential spreaders in complex networks the large-scale organization of metabolic networks the koblenz network collection the network data repository with interactive graph analytics and visualization measuring isp topologies with rocketfuel graph evolution: densification and shrinking diameters defining and evaluating network communities based on ground-truth the spread of obesity in a large 
social network over 32 years identifying the influential nodes via eigen-centrality from the differences and similarities of structure tracking influential individuals in dynamic networks evaluating influential nodes in social networks by local centrality with a coefficient a survey on topological properties, network models and analytical measures in detecting influential nodes in online social networks identifying influential spreaders in noisy networks spreading processes in multilayer networks identifying the influential spreaders in multilayer interactions of online social networks identifying influential spreaders in complex multilayer networks: a centrality perspective we would also thank dennis nii ayeh mensah for helping us revising english of this paper. the authors declare no conflict of interest. key: cord-143847-vtwn5mmd authors: ryffel, th'eo; pointcheval, david; bach, francis title: ariann: low-interaction privacy-preserving deep learning via function secret sharing date: 2020-06-08 journal: nan doi: nan sha: doc_id: 143847 cord_uid: vtwn5mmd we propose ariann, a low-interaction framework to perform private training and inference of standard deep neural networks on sensitive data. this framework implements semi-honest 2-party computation and leverages function secret sharing, a recent cryptographic protocol that only uses lightweight primitives to achieve an efficient online phase with a single message of the size of the inputs, for operations like comparison and multiplication which are building blocks of neural networks. built on top of pytorch, it offers a wide range of functions including relu, maxpool and batchnorm, and allows to use models like alexnet or resnet18. we report experimental results for inference and training over distant servers. last, we propose an extension to support n-party private federated learning. the massive improvements of cryptography techniques for secure computation over sensitive data [15, 13, 28] have spurred the development of the field of privacy-preserving machine learning [45, 1] . privacy-preserving techniques have become practical for concrete use cases, thus encouraging public authorities to use them to protect citizens' data, for example in covid-19 apps [27, 17, 38, 39] . however, tools are lacking to provide end-to-end solutions for institutions that have little expertise in cryptography while facing critical data privacy challenges. a striking example is hospitals which handle large amounts of data while having relatively constrained technical teams. secure multiparty computation (smpc) is a promising technique that can efficiently be integrated into machine learning workflows to ensure data and model privacy, while allowing multiple parties or institutions to participate in a joint project. in particular, smpc provides intrinsic shared governance: because data are shared, none of the parties can decide alone to reconstruct it. this is particularly suited for collaborations between institutions willing to share ownership on a trained model. use case. the main use case driving our work is the collaboration between healthcare institutions such as hospitals or clinical research laboratories. such collaboration involves a model owner and possibly several data owners like hospitals. as the model can be a sensitive asset (in terms of intellectual property, strategic asset or regulatory and privacy issues), standard federated learning [29, 7] that does not protect against model theft or model retro-engineering [24, 18] is not suitable. 
to data centers, but are likely to remain online for long periods of time. last, parties are honestbut-curious, [20, chapter 7.2.2] and care about their reputation. hence, they have little incentive to deviate from the original protocol, but they will use any information available in their own interest. contributions. by leveraging function secret sharing (fss) [9, 10] , we propose the first lowinteraction framework for private deep learning which drastically reduces communication to a single round for basic machine learning operations, and achieves the first private evaluation benchmark on resnet18. • we build on existing work on function secret sharing to design compact and efficient algorithms for comparison and multiplication, which are building blocks of neural networks. they are highly modular and can be assembled to build complex workflows. • we show how these blocks can be used in machine learning to implement operations for secure evaluation and training of arbitrary models on private data, including maxpool and batchnorm. we achieve single round communication for comparison, convolutional or linear layers. • last, we provide an implementation 1 and demonstrate its practicality both in lan (local area network) and wan settings by running secure training and inference on cifar-10 and tiny imagenet with models such as alexnet [31] and resnet18 [22] . related work. related work in privacy-preserving machine learning encompasses smpc and homomorphic encryption (he) techniques. he only needs a single round of interaction but does not support efficiently non-linearities. for example, ngraph-he [5] and its extensions [4] build on the seal library [44] and provide a framework for secure evaluation that greatly improves on the cryptonet seminal work [19] , but it resorts to polynomials (like the square) for activation functions. smpc frameworks usually provide faster implementations using lightweight cryptography. minionn and deepsecure [34, 41] use optimized garbled circuits [50] that allow very few communication rounds, but they do not support training and alter the neural network structure to speed up execution. other frameworks such as sharemind [6] , secureml [36] , securenn [47] or more recently falcon [48] rely on additive secret sharing and allow secure model evaluation and training. they use simpler and more efficient primitives, but require a large number of rounds of communication, such as 11 in [47] or 5 + log 2 (l) in [48] (typically 10 with l = 32) for relu. aby [16] , chameleon [40] and more recently aby 3 [35] mix garbled circuits, additive or binary secret sharing based on what is most efficient for the operations considered. however, conversion between those can be expensive and they do not support training except aby 3 . last, works like gazelle [26] combine he and smpc to make the most of both, but conversion can also be costly. works on trusted execution environment are left out of the scope of this article as they require access to dedicated hardware [25] . data owners which cannot afford these secure enclaves might be reluctant to use a cloud service and to send their data. notations. all values are encoded on n bits and live in z 2 n . note that for a perfect comparison, y + α should not wrap around and become negative. because y is in practice small compared to the n-bit encoding amplitude, the failure rate is less than one comparison in a million, as detailed in appendix c.1. security model. 
we consider security against honest-but-curious adversaries, i.e., parties following the protocol but trying to infer as much information as possible about the others' input or function share. this is a standard security model in many smpc frameworks [6, 3, 40, 47] and is aligned with our main use case: parties that did not follow the protocol would face a major backlash for their reputation if they got caught. the security of our protocols relies on the indistinguishability of the function shares, which informally means that the shares received by each party are computationally indistinguishable from random strings. a formal definition of the security is given in [10]. regarding malicious adversaries, i.e., parties who would not follow the protocol: since all the data available to them are random, they cannot get any information about the inputs of the other parties, including the parameters of the evaluated functions, unless the parties reconstruct some shared values; the later and the fewer values are reconstructed, the better. as mentioned by [11], our protocols could be extended to guarantee security with abort against malicious adversaries using mac authentication [15], which means that the protocol would abort if parties deviated from it. our algorithms for private equality and comparison are built on top of the work of [10], so the security assumptions are the same as in that article; however, our protocols achieve higher efficiency by specializing in the operations needed for neural network evaluation or training. we start by describing private equality, which is slightly simpler and gives useful hints about how comparison works. the equality test consists in comparing a public input x to a private value α. evaluating the input using the function keys can be viewed as walking a binary tree of depth n, where n is the number of bits of the input (typically 32). among all the possible paths, the path from the root down to α is called the special path. figure 1 illustrates this tree and provides a compact representation which is used by our protocol, where we do not detail branches for which all leaves are 0. evaluation goes as follows: the two evaluators are each given a function key which includes a distinct initial random state (s, t) ∈ {0,1}^λ × {0,1}. each evaluator starts from the root, and at each step i goes down one node in the tree and updates its state depending on the bit x[i], using a common correction word cw^(i) ∈ {0,1}^{2(λ+1)} from the function key. at the end of the computation, each evaluator outputs t. as long as x[i] = α[i], the evaluators stay on the special path, and because the input x is public and common, they both follow the same path. if a bit x[i] ≠ α[i] is met, they leave the special path and should output 0; otherwise, they stay on it all the way down, which means that x = α and they should output 1. the main idea is that while they are on the special path, the evaluators should have states (s_0, t_0) and (s_1, t_1) respectively, such that s_0 and s_1 are i.i.d. and t_0 ⊕ t_1 = 1. when they leave it, the correction word should make s_0 = s_1, still indistinguishable from random, and t_0 = t_1, which ensures t_0 ⊕ t_1 = 0. each evaluator outputs its t_j and the result is given by t_0 ⊕ t_1. 
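before the formal description, the sketch below illustrates how two evaluators could consume such equality keys in the online phase: the shared input [[y]] is masked with the preprocessed [[α]], the masked value x = y + α is revealed, and each party locally evaluates its key to obtain an additive share of 1{y = 0}. the helpers fss_eq_keygen and fss_eq_eval are hypothetical stand-ins for algorithms 1 and 2, implemented here as an insecure functional mock purely so the flow runs; only the surrounding message pattern reflects the protocol described in the text.

```python
import random

N_BITS = 32
MOD = 1 << N_BITS

def share(x):
    """Additively share x in Z_{2^n}: x = x0 + x1 mod 2^n."""
    x0 = random.randrange(MOD)
    return x0, (x - x0) % MOD

# --- hypothetical FSS primitives (INSECURE mocks standing in for Algorithms 1 and 2) ---
def fss_eq_keygen(alpha):
    """Return two function keys; a real DPF key would hide alpha from both parties."""
    s0 = random.randrange(MOD)
    return ("k0", s0), ("k1", s0, alpha)

def fss_eq_eval(j, key, x_public):
    """Party j's additive share (mod 2^n) of 1{x_public == alpha}."""
    if j == 0:
        return key[1]
    _, s0, alpha = key
    return ((1 if x_public == alpha else 0) - s0) % MOD
# ---------------------------------------------------------------------------------------

def private_equality_online(y_shares, alpha_shares, keys):
    """Single online round: reveal x = y + alpha, then evaluate the keys locally."""
    masked_shares = [(y_shares[j] + alpha_shares[j]) % MOD for j in (0, 1)]
    x_public = sum(masked_shares) % MOD              # revealed to both parties
    return [fss_eq_eval(j, keys[j], x_public) for j in (0, 1)]

# toy run: privately test whether the shared value y equals 0
y, alpha = 0, random.randrange(MOD)
keys = fss_eq_keygen(alpha)
out = private_equality_online(share(y), share(alpha), keys)
print(sum(out) % MOD)   # reconstructs to 1 iff y == 0
```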
the formal description of the protocol is given below and is composed of two parts: first, in algorithm 1, the keygen algorithm consists of a preprocessing step to generate the function keys, and then, in algorithm 2, eval is run by the two evaluators to perform the equality test; it takes as input the private share held by each evaluator and the function key that they have received. they use a pseudorandom generator g : {0,1}^λ → {0,1}^{2(λ+1)}, where the output set is seen as {0,1}^{λ+1} × {0,1}^{λ+1}, and operations modulo 2^n implicitly convert n-bit strings back and forth into integers. intuitively, the correction words cw^(i) are built from the expected state of each evaluator on the special path, i.e., the state that each should have at each node i if it is on the special path given some initial state. during evaluation, a correction word is applied by an evaluator only when it has t = 1; hence, on the special path, the correction is applied by only one evaluator at each bit. (algorithm 1: keygen — key generation for equality to α.) if at step i the evaluator stays on the special path, the correction word compensates the current states of both evaluators by xor-ing them with themselves and re-introduces a pseudorandom value s (either s^r_0 ⊕ s^r_1 or s^l_0 ⊕ s^l_1), which means the xor of their states is now (s, 1) while those states remain indistinguishable from random. on the other hand, if x[i] ≠ α[i], the new state takes the other half of the correction word, so that the xor of the two evaluators' states is (0, 0). from there, they have the same states and both have either t = 0 or t = 1; they will continue to apply the same corrections at each step and their states will remain the same, with t_0 ⊕ t_1 = 0. a final computation is performed to obtain shares [[t]] modulo 2^n of the result bit t = t_0 ⊕ t_1 ∈ {0,1}, which is shared modulo 2. from the privacy point of view, when the seed s is (truly) random, g(s) also looks like a random bit-string (it is a pseudorandom bit-string). each half is used either in the correction word or in the next state, but not both. therefore, the correction words cw^(i) do not contain information about the expected states, and for j = 0, 1, the output key k_j is independently uniformly distributed with respect to α and k_{1−j}, in a computational way. as a consequence, at the end of the evaluation, for j = 0, 1, t_j also follows a distribution independent of α. until the shared values are reconstructed, even a malicious adversary cannot learn anything about α nor about the inputs of the other player. function keys should be sent to the evaluators in advance, which requires one extra communication of the size of the keys. we use the trick of [10] to reduce the size of each correction word in the keys from 2(1 + λ) to (2 + λ), by reusing the pseudorandom λ-bit string dedicated to the state used when leaving the special path for the state used when staying on it, since for the latter state the only constraint is the pseudo-randomness of the bitstring. our major contribution to the function secret sharing scheme concerns comparison (which allows us to tackle non-polynomial activation functions for neural networks): we build on the idea of the equality test to provide a compact and efficient protocol whose structure is very close to the previous one. instead of seeing the special path as a simple path, it can be seen as the frontier of the zone in the tree where x ≤ α. 
to evaluate x ≤ α, we could evaluate all the paths on the left of the special path and then sum up the results, but this is highly inefficient as it requires exponentially many evaluations. our key idea here is to evaluate all these paths at the same time, noting that each time one leaves the special path, it falls either on the left side (i.e., x < α) or on the right side (i.e., x > α). hence, we only need to add an extra step at each node of the evaluation, where, depending on the bit value x[i], we output a leaf label which is 1 only if x[i] < α[i] and all previous bits are identical. only one label among the final label (which corresponds to x = α) and the leaf labels can be equal to one, because only a single path can be taken. therefore, the evaluators return the sum of all the labels to get the final output. the full description of the comparison protocol is detailed in appendix a, together with a detailed explanation of how it works. we now apply these primitives to a private deep learning setup in which a model owner interacts with a data owner. the data and the model parameters are sensitive and are secret shared to be kept private. the shape of the input and the architecture of the model are however public, which is a standard assumption in secure deep learning [34, 36]. all our operations are modular and follow this additive sharing workflow: inputs are provided secret shared and are masked with random values before being revealed; the disclosed value is then consumed together with preprocessed function keys to produce a secret shared output. each operation is independent of all surrounding operations, which is known as circuit-independent preprocessing [11] and implies that key generation can be fully outsourced without having to know the model architecture. this results in a fast runtime execution with very efficient online communication: a single round of communication and a message size equal to the input size for comparison. preprocessing is performed by a trusted third party which builds the function keys. this is a valid assumption in our use case, as such a third party would typically be an institution concerned about its image, and it is very easy to check that the preprocessed material is correct using a cut-and-choose technique [51]. matrix multiplication (matmul). as mentioned by [11], multiplication fits in this additive sharing workflow. we use beaver triples [2]: given a precomputed triple of shares ([[a]], [[b]], [[c]]) with c = a·b, the parties reveal the masked values x − a and y − b and combine them locally with the triple to obtain shares of x·y. matrix multiplication is identical but uses matrix beaver triples [14]. the relu activation function is supported as a direct application of our comparison protocol, which we combine with a pointwise multiplication. convolution can be computed as a single matrix multiplication using an unrolling technique as described in [12] and illustrated in figure 3 in appendix c.2. the argmax operator used in classification to determine the predicted label can also be computed in a constant number of rounds using pairwise comparisons, as shown by [21]. the main idea is, given a vector (x_0, . . . , x_{m−1}), to compute the matrix m ∈ ℝ^{(m−1)×m} where each row m_i = (x_{(i+1) mod m}, . . . , x_{(i+m+1) mod m}). then, each element of column j is compared to x_j, which requires m(m − 1) parallel comparisons. a column j where all elements are lower than x_j indicates that j is a valid result for the argmax. maxpool can be implemented by combining these two methods: the matrix is first unrolled as in figure 3 and the maximum of each row is then computed using parallel pairwise comparisons. 
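before moving on, the additive sharing workflow and the beaver-triple multiplication mentioned above can be illustrated in plain python over z_{2^32}; in this sketch the triple is produced in the clear by a helper standing in for the trusted preprocessing, so it shows the arithmetic only, not the framework's actual implementation.

```python
import random

MOD = 1 << 32  # values live in Z_{2^32}

def share(x):
    x0 = random.randrange(MOD)
    return x0, (x - x0) % MOD

def reconstruct(shares):
    return sum(shares) % MOD

def beaver_triple():
    """Preprocessing (trusted third party): shares of a, b and c = a*b."""
    a, b = random.randrange(MOD), random.randrange(MOD)
    return share(a), share(b), share((a * b) % MOD)

def beaver_mul(x_sh, y_sh, triple):
    """Online phase: one round where the parties reveal x - a and y - b,
    then locally combine them with the triple to obtain shares of x*y."""
    a_sh, b_sh, c_sh = triple
    delta = reconstruct([(x_sh[j] - a_sh[j]) % MOD for j in (0, 1)])   # x - a, public
    eps = reconstruct([(y_sh[j] - b_sh[j]) % MOD for j in (0, 1)])     # y - b, public
    z_sh = []
    for j in (0, 1):
        z = (c_sh[j] + delta * b_sh[j] + eps * a_sh[j]) % MOD
        if j == 0:  # the public cross term delta*eps is added by one party only
            z = (z + delta * eps) % MOD
        z_sh.append(z)
    return z_sh

x_sh, y_sh = share(7), share(6)
print(reconstruct(beaver_mul(x_sh, y_sh, beaver_triple())))  # 42
```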
more details on maxpool and an optimization when the kernel size k equals 2 are given in appendix c.3. batchnorm is implemented using an approximate division with newton's method as in [48]: given an input x = (x_0, . . . , x_{m−1}) with mean µ and variance σ², we return γ · θ̂ · (x − µ) + β. the variables γ and β are learnable parameters, and θ̂ is the estimated inverse of √(σ² + ε), with ε ≪ 1, computed iteratively using θ_{i+1} = θ_i · (3 − (σ² + ε) · θ_i²) / 2. more details can be found in appendix c.4. more generally, for more complex activation functions such as softmax, we can use polynomial approximation methods, which achieve acceptable accuracy despite involving a higher number of rounds [37, 23, 21]. table 1 summarizes the online communication cost of each operation, and shows that basic operations such as comparison have a very efficient online communication. we also report results from [48], which achieve good experimental performance. these operations are sufficient to evaluate real-world models in a fully private way. to also support private training of these models, we need to perform a private backward pass. as we overload operations such as convolutions or activation functions, we cannot use the built-in autograd functionality of pytorch; therefore, we have developed a custom autograd functionality, where we specify how to compute the derivatives of the operations that we have overloaded. backpropagation also uses the same basic blocks as those used in the forward pass. this 2-party protocol between a model owner and a data owner can be extended to an n-party federated learning protocol where several clients contribute their data to a model owned by an orchestrator server. this approach is inspired by secure aggregation [8], but we do not consider clients to be phones here, which means we are less concerned with parties dropping out before the end of the protocol. in addition, we do not reveal the updated model at each aggregation or at any stage, hence providing better privacy than secure aggregation. at the beginning of the interaction, the server, which owns the model, initializes it and builds n pairs of additive shares of the model parameters. for each pair i, it keeps one of the shares and sends the other one to the corresponding client i. then, the server runs the training procedure in parallel with all the clients until the aggregation phase starts. aggregation of the server shares is straightforward, as the n shares it holds can simply be averaged locally, but the clients have to average their shares together to get a client share of the aggregated model. one possibility is that clients broadcast their shares and compute the average locally. however, to prevent a client colluding with the server from reconstructing the model contributed by a given client, they hide their shares using masking. this can be done using correlated random masks: client i generates a seed and sends it to client i + 1 while receiving one from client i − 1. client i then generates a random mask m_i using its own seed and another mask m_{i−1} using the seed of client i − 1, and publishes its share masked with m_i − m_{i−1}. as the masks cancel each other out, the computation remains correct. we follow a setup very close to [48] and assess the inference and training performance of several networks on the datasets mnist [33], cifar-10 [30], 64×64 tiny imagenet and 224×224 tiny imagenet [49, 42], presented in appendix d.1. 
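as a quick sanity check of the newton iteration used for batchnorm, the snippet below runs θ_{i+1} = θ_i · (3 − (σ² + ε) · θ_i²)/2 in the clear and compares it to 1/√(σ² + ε); the number of iterations and the initial guess are arbitrary choices for the example (in the private setting every operation would of course run on shares in fixed precision).

```python
def inv_sqrt_newton(a, theta0, iters=5):
    """Newton's method for 1/sqrt(a): theta <- theta * (3 - a * theta^2) / 2."""
    theta = theta0
    for _ in range(iters):
        theta = theta * (3 - a * theta * theta) / 2
    return theta

sigma2, eps = 4.0, 1e-5
a = sigma2 + eps
approx = inv_sqrt_newton(a, theta0=0.4)   # the initial guess must be close enough to converge
print(approx, a ** -0.5)                  # both ~0.4999987
```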
more precisely, we assess 5 networks as in [48] : a fully-connected network (network-1), a small convolutional network with maxpool (network-2), lenet [32] , alexnet [31] and vgg16 [46] . furthermore, we also include resnet18 [22] which to the best of our knowledge has never been studied before in private deep learning. the description of these networks is taken verbatim from [48] and is available in appendix d.2. our implementation is written in python. to use our protocols that only work in finite groups like z 2 32 , we convert our input values and model parameters to fixed precision. to do so, we rely on the pysyft library [43] protocol. however, our inference runtimes reported in table 2 compare favourably with existing work including [34-36, 47, 48] , in the lan setting and particularly in the wan setting thanks to our reduced number of communication rounds. for example, our implementation of network-1 is 2× faster than the best previous result by [35] in the lan setting and 18× faster in the wan setting compared to [48] . for bigger networks such as alexnet on cifar-10, we are still 13× faster in the wan setting than [48] . results are given for a batched evaluation, which allows parallelism and hence faster execution as in [48] . for larger networks, we reduce the batch size to have the preprocessing material (including the function keys) fitting into ram. test accuracy. thanks to the flexibility of our framework, we can train each of these networks in plain text and need only one line of code to turn them into private networks, where all parameters are secret shared. we compare these private networks to their plaintext counterparts and observe that the accuracy is well preserved as shown in table 3 . if we degrade the encoding precision, which by default considers values in z 2 32 , and the fixed precision which is by default of 3 decimals, performance degrades as shown in appendix b. training. we can either train from scratch those networks or fine tune pre-trained models. training is an end-to-end private procedure, which means the loss and the gradients are never accessible in plain text. we use stochastic gradient descent (sgd) which is a simple but popular optimizer, and support both hinge loss and mean square error (mse) loss, as other losses like cross entropy which is used in clear text by [48] cannot be computed over secret shared data without approximations. we report runtime and accuracy obtained by training from scratch the smaller networks in table 4 . note that because of the number of epochs, the optimizer and the loss chosen, accuracy does not match best known results. however, the training procedure is not altered and the trained model will be strictly equivalent to its plaintext counterpart. training cannot complete in reasonable time for larger networks, which are anyway available pre-trained. note that training time includes the time spent building the preprocessing material, as it cannot be fully processed in advance and stored in ram. discussion. for larger networks, we could not use batches of size 128. this is mainly due to the size of the comparison function keys which is currently proportional to the size of the input tensor, with a multiplication factor of nλ where n = 32 and λ = 128. optimizing the function secret sharing protocol to reduce those keys would lead to massive improvements in the protocol's efficiency. 
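the fixed-point conversion mentioned above can be sketched as follows; this is a minimal stand-alone illustration with base-10 scaling (3 decimals) and two's-complement-style wrapping into z_(2^32), not the actual pysyft encoder, whose implementation details may differ.

```python
RING = 2**32          # encoding space Z_{2^32}
PRECISION = 3         # fixed precision: 3 decimal places
SCALE = 10**PRECISION

def encode(value: float) -> int:
    """Encode a real value as a fixed-point integer in the ring."""
    return int(round(value * SCALE)) % RING

def decode(element: int) -> float:
    """Decode back to a float, mapping the upper half of the ring to negatives."""
    if element >= RING // 2:
        element -= RING
    return element / SCALE

assert decode(encode(3.141)) == 3.141
assert decode(encode(-2.5)) == -2.5
assert decode((encode(1.25) + encode(-0.75)) % RING) == 0.5  # addition of encodings
```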
our implementation actually has more communication than is theoretically necessary according to table 1, suggesting that the experimental results could be further improved. as we build on top of pytorch, using machines with gpus could also potentially result in a massive speed-up, as an important fraction of the execution time is dedicated to computation. last, the accuracies presented in table 3 and table 4 do not match state-of-the-art performance for the models and datasets considered. this is not due to flaws in our protocol but to the simplified training procedure we had to use. supporting losses such as the logistic loss, more complex optimizers like adam and dropout layers would be an interesting follow-up. one can observe the great similarity of structure between the comparison protocol given in algorithms 3 and 4 and the equality protocol from algorithms 1 and 2: the equality test is performed in parallel with an additional value out_i at each node, which holds a share of 0 when the evaluator stays on the special path or has already left it at a previous node, and a share of α[i] at the node where it leaves the special path. this means that if α[i] = 1, leaving the special path implies that x[i] = 0 and hence x ≤ α, while if α[i] = 0, leaving implies x[i] = 1, so x > α and the output should be 0. the final share out_(n+1) corresponds to the previous equality test. note that all these computations are performed modulo 2^n, while the bitstrings s_j are λ-bit strings. we have studied the impact of lowering the encoding space of the input to our function secret sharing protocol from z_(2^32) to z_(2^k) with k < 32. finding the lowest k guaranteeing good performance is an interesting challenge, as the function key size is directly proportional to it. this has to be done together with reducing the fixed precision from 3 decimals down to 1 decimal to ensure private values aren't too big, which would otherwise result in a higher failure rate in our private comparison protocol. we have reported in table 5 our findings on network-1, which is pre-trained and then evaluated in a private fashion. table 5: accuracy (in %) of network-1 given different precision and encoding spaces. what we observe is that 3 decimals of precision is the most appropriate setting to have an optimal precision while allowing to slightly reduce the encoding space down to z_(2^24) or z_(2^28). because this is not a massive gain, and in order to keep the failure rate in comparison very low, we have kept z_(2^32) for all our experiments. c implementation details. our comparison protocol can fail if y + α wraps around and becomes negative. we can't act on α, because it must be completely random to act as a perfect mask and to make sure the revealed x = y + α mod 2^n does not leak any information about y, but the smaller y is, the lower the error probability will be. [11] suggests a method which uses 2 invocations of the protocol to guarantee perfect correctness, but because it incurs an important runtime overhead, we rather show that the failure rate of our comparison protocol is very small and is reasonable in contexts that tolerate a few mistakes, as in machine learning. more precisely, we quantify it on real world examples, namely on network-2 and on the 64×64 tiny imagenet version of vgg16, with a fixed precision of 3 decimals, and find respective failure rates of 1 in 4 million comparisons and 1 in 100 million comparisons. such error rates do not affect the model accuracy, as table 3 shows.
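the label bookkeeping described above (a share of 0 while on or after leaving the special path, a share of α[i] at the node where the path is left, plus a final equality label) can be checked in plaintext. the sketch below works on revealed integers rather than shares, and the function name is ours.

```python
def leq_from_path_labels(x: int, alpha: int, n: int = 8) -> int:
    """Walk the bits of x along the special path defined by alpha; emit 0 while on
    the path (or after having left it), emit alpha[i] at the node where the path is
    left, and add a final label for x == alpha. The sum of the labels equals
    [x <= alpha]. In the real protocol each label is additively secret shared."""
    out = []
    on_path = True                      # all more-significant bits agreed so far
    for i in range(n - 1, -1, -1):      # most significant bit first
        x_i, a_i = (x >> i) & 1, (alpha >> i) & 1
        out.append(a_i if (on_path and x_i != a_i) else 0)
        if x_i != a_i:
            on_path = False
    out.append(1 if on_path else 0)     # final label: the equality test x == alpha
    return sum(out)

assert all(leq_from_path_labels(x, a) == int(x <= a)
           for x in range(64) for a in range(64))
```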
figure 4 illustrates how maxpool uses ideas from matrix unrolling and argmax computation. notations present in the figure are consistent with the explanation of argmax using pairwise comparison in section 4.3. the m × m matrix is first unrolled to an m² × k² matrix. it is then expanded on k² layers, each of which is shifted by a step of 1. next, m²k²(k² − 1) pairwise comparisons are applied simultaneously between the first layer and the other ones, and for each x_i we sum the results of its k² − 1 comparisons and check whether the sum equals k² − 1. we multiply this boolean by x_i and sum up along a line (like x_1 to x_4 in the figure). last, we restructure the matrix back to its initial structure. in addition, when the kernel size k is 2, rows are only of length 4 and it can be more efficient to use a binary tree approach instead, i.e. compute the maximum of columns 0 and 1, of columns 2 and 3, and then the max of the two results: it requires log₂(k²) = 2 rounds of communication and only approximately (k² − 1)(m/s)² comparisons, compared to a fixed 3 rounds and approximately k⁴(m/s)² comparisons. interestingly, average pooling can be computed locally on the shares without interaction because it only involves mean operations, but we didn't replace maxpool operations with average pooling to avoid distorting existing neural network architectures. the batchnorm layer is the only one in our implementation which uses a polynomial approximation. moreover, compared to [48], the approximation is significantly coarser, as we don't make any costly initial approximation and we reduce the number of iterations of the newton method from 4 to only 3. the typical relative error can be up to 20%, but as the primary purpose of batchnorm is to normalise data, having rough approximations here is not an issue and doesn't affect learning capabilities, as our experiments show. however, it is a limitation for using pre-trained networks: we observed on alexnet adapted to cifar-10 that training the model with a standard batchnorm and evaluating it with our approximation resulted in poor results, so we had to train it with the approximated layer. this section is taken almost verbatim from [48]. we select 4 datasets popularly used for training image classification models: mnist [33], cifar-10 [30], 64×64 tiny imagenet and 224×224 tiny imagenet [49]. mnist mnist [33] is a dataset of handwritten digits. it consists of 60,000 images in the training set and 10,000 in the test set. each image is a 28×28 pixel image of a handwritten digit along with a label between 0 and 9. we evaluate network-1, network-2, and the lenet network on this dataset. cifar-10 cifar-10 [30] consists of 50,000 images in the training set and 10,000 in the test set. it is composed of 10 different classes (such as airplanes, dogs, horses etc.) and there are 6,000 images of each class, with each image being a colored 32×32 image. we perform private training of alexnet and inference of vgg16 on this dataset. tiny imagenet tiny imagenet [49] consists of two datasets of 100,000 training samples and 10,000 test samples with 200 different classes. the first dataset is composed of colored 64×64 images and we use it with alexnet and vgg16. the second is composed of colored 224×224 images and is used with resnet18. we have selected 6 models for our experiments. network-1 a 3-layered fully-connected network with relu used in secureml [36].
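for illustration, a network of this shape can be written in standard pytorch as below; the hidden width of 128 and the mnist input/output sizes are placeholders of ours, as the exact dimensions are listed in appendix d of [48].

```python
import torch
import torch.nn as nn

class Network1(nn.Module):
    """3-layer fully-connected network with ReLU, in the spirit of network-1.
    Layer widths here are illustrative placeholders, not the published ones."""
    def __init__(self, hidden=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(28 * 28, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 10),
        )

    def forward(self, x):
        return self.layers(x.flatten(start_dim=1))

logits = Network1()(torch.randn(4, 1, 28, 28))  # batch of 4 MNIST-shaped inputs
print(logits.shape)  # torch.Size([4, 10])
```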
network-2 a 4-layered network selected in minionn [34] with 2 convolutional and 2 fullyconnected layers, which uses maxpool in addition to relu activation. lenet this network, first proposed by lecun et al. [32] , was used in automated detection of zip codes and digit recognition. the network contains 2 convolutional layers and 2 fully connected layers. alexnet alexnet is the famous winner of the 2012 imagenet ilsvrc-2012 competition [31] . it has 5 convolutional layers and 3 fully connected layers and it can batch normalization layer for stability and efficient training. vgg16 vgg16 is the runner-up of the ilsvrc-2014 competition [46] . vgg16 has 16 layers and has about 138m parameters. resnet18 resnet18 [22] is the runner-up of the ilsvrc-2015 competition. it is a convolutional neural network that is 18 layers deep, and has 11.7m parameters. it uses batch normalisation and we're the first private deep learning framework to evaluate this network. model architectures of network-1 and network-2, together with lenet, and the adaptations for cifar-10 of alexnet and vgg16 are precisely depicted in appendix d of [48] . note that in the cifar-10 version alexnet, authors have used the version with batchnorm layers, and we have kept this choice. for the 64×64 tiny imagenet version of alexnet, we used the standard architecture from pytorch to have a pretrained network. it doesn't have batchnorm layers, and we have adapted the classifier part as illustrated in figure 5 . note also that we permute relu and maxpool where applicable like in [48] , as this is strictly equivalent in terms of output for the network and reduces the number of comparisons. more generally, we don't proceed to any alteration of the network behaviour except with the approximation on batchnorm. this improves usability of our framework as it allows to take a pre-trained neural network from a standard deep learning library like pytorch and to encrypt it generically with a single line of code. privacy-preserving machine learning: threats and solutions efficient multiparty protocols using circuit randomization optimizing semi-honest secure multiparty computation for the internet ngraph-he2: a high-throughput framework for neural network inference on encrypted data ngraph-he: a graph compiler for deep learning on homomorphically encrypted data sharemind: a framework for fast privacypreserving computations towards federated learning at scale: system design practical secure aggregation for privacy-preserving machine learning function secret sharing function secret sharing: improvements and extensions secure computation with preprocessing via function secret sharing high performance convolutional neural networks for document processing faster fully homomorphic encryption: bootstrapping in less than 0.1 seconds private image analysis with mpc. 
accessed 2019-11-01 multiparty computation from somewhat homomorphic encryption aby-a framework for efficient mixed-protocol secure two-party computation a survey of secure multiparty computation protocols for privacy preserving genetic tests model inversion attacks that exploit confidence information and basic countermeasures cryptonets: applying neural networks to encrypted data with high throughput and accuracy foundations of cryptography deep residual learning for image recognition accuracy and stability of numerical algorithms deep models under the gan: information leakage from collaborative deep learning chiron: privacy-preserving machine learning as a service {gazelle}: a low latency framework for secure neural network inference an efficient multi-party scheme for privacy preserving collaborative filtering for healthcare recommender system overdrive: making spdz great again federated learning: strategies for improving communication efficiency the cifar-10 dataset imagenet classification with deep convolutional neural networks gradient-based learning applied to document recognition mnist handwritten digit database oblivious neural network predictions via minionn transformations aby3: a mixed protocol framework for machine learning secureml: a system for scalable privacy-preserving machine learning an improved newton iteration for the generalized inverse of a matrix, with applications information technology-based tracing strategy in response to covid-19 in south korea-privacy controversies privacy-preserving contact tracing of covid-19 patients chameleon: a hybrid secure computation framework for machine learning applications deepsecure: scalable provably-secure deep learning imagenet large scale visual recognition challenge a generic framework for privacy preserving deep learning privacy-preserving deep learning very deep convolutional networks for large-scale image recognition securenn: efficient and private neural network training falcon: honest-majority maliciously secure framework for private deep learning tiny imagenet challenge how to generate and exchange secrets the cut-and-choose game and its application to cryptographic protocols we would like to thank geoffroy couteau, chloé hébant and loïc estève for helpful discussions throughout this project. we are also grateful for the long-standing support of the openmined community and in particular its dedicated cryptography team, including yugandhar tripathi, s p sharan, george-cristian muraru, muhammed abogazia, alan aboudib, ayoub benaissa, sukhad joshi and many others.this work was supported in part by the european community's seventh framework programme (fp7/2007-2013 grant agreement no. 339563 -cryptocloud) and by the french project fui anblic. the computing power was graciously provided by the french company arkhn. key: cord-312817-gskbu0oh authors: witte, carmel; hungerford, laura l.; rideout, bruce a.; papendick, rebecca; fowler, james h. title: spatiotemporal network structure among “friends of friends” reveals contagious disease process date: 2020-08-06 journal: plos one doi: 10.1371/journal.pone.0237168 sha: doc_id: 312817 cord_uid: gskbu0oh disease transmission can be identified in a social network from the structural patterns of contact. however, it is difficult to separate contagious processes from those driven by homophily, and multiple pathways of transmission or inexact information on the timing of infection can obscure the detection of true transmission events. 
here, we analyze the dynamic social network of a large, and near-complete population of 16,430 zoo birds tracked daily over 22 years to test a novel “friends-of-friends” strategy for detecting contagion in a social network. the results show that cases of avian mycobacteriosis were significantly clustered among pairs of birds that had been in direct contact. however, since these clusters might result due to correlated traits or a shared environment, we also analyzed pairs of birds that had never been in direct contact but were indirectly connected in the network via other birds. the disease was also significantly clustered among these friends of friends and a reverse-time placebo test shows that homophily could not be causing the clustering. these results provide empirical evidence that at least some avian mycobacteriosis infections are transmitted between birds, and provide new methods for detecting contagious processes in large-scale global network structures with indirect contacts, even when transmission pathways, timing of cases, or etiologic agents are unknown. avian mycobacteriosis is a bacterial disease that has long been considered contagious, passing indirectly between birds through the fecal-oral route [1, 2] . however, recent long-term studies in well-characterized cohorts have found low probabilities of disease acquisition among exposed birds [3, 4] and multiple strains and species of mycobacteria associated with single outbreaks [5] [6] [7] [8] . these findings suggest pre-existing environmental reservoirs of potentially san diego zoo global houses one of the largest, breeding bird populations in the world, historically averaging over 3,000 birds at any given time across two facilities, the san diego zoo and san diego zoo safari park (collectively referred to as san diego zoo global, sdzg). birds are frequently moved among enclosures for breeding, behavior or other management reasons, as well as imported from or exported to other institutions. this creates a dynamic network of contacts over time that varies individual exposure to environments and other birds. the source population included 16,837 birds present at sdzg between 1 january 1992-1 june 2014 that were at least 6 months old and present for at least seven days. all birds in this population were under close keeper observation and veterinary care during the entire study period and received complete post-mortem exams if they died. birds in this population had documented dates of hatch, acquisition, removal, and death. we excluded a small number of birds (n = 437) because they had incomplete information on enclosure histories. the 16,430 remaining birds had near-complete enclosure tracking over time with move-in and move-out dates for each occupied enclosure. all management data were stored in an electronic database. thus, the population represents a group of birds for which 1) a near-complete social network could be assembled from housing records that tracked dynamic movement over time, and 2) avian mycobacteriosis disease status could be determined for any bird that died. all historic data in these retrospective analyses were originally collected for medical activities and animal management purposes unrelated to the present study. for these reasons, the san diego zoo global institutional animal care and use committee exempted our study from review for the ethical use of animals in research. 
if a bird in the source population died, a board-certified veterinary pathologist conducted a thorough post-mortem exam that included histopathology on complete sets of tissues, unless advanced autolysis precluded evaluation. if lesions suggestive of avian mycobacteriosis were observed, then special stains (ziehl-neelsen or fite-faraco) were used to confirm presence of acid-fast bacilli. occasionally, clinical presentation permitted antemortem diagnosis based on tissue biopsy. for this study, any bird with acid-fast bacilli present in tissues was considered positive for avian mycobacteriosis at the date of diagnosis. birds were classified as 'infected' on their date of diagnosis or 'uninfected' on their date of death if the post-mortem examination showed no evidence of disease. birds were also classified as 'uninfected' on their date of export if they were still apparently healthy. birds that were still alive on the study end date of 6/1/2014, were followed for up to the assumed minimum incubation period (further described below; e.g., six months or through 11/28/2014) to determine final disease status. the network was defined based on the subset of birds that qualified as subjects and their friends (network nodes), and the connections between them (network edges). study subjects included all birds from the source population with complete information on history of exposure to other birds. this included both birds that hatched in the population, as well as birds imported from elsewhere. if a bird was imported, then it must have been present for a duration equal to or greater than the maximum incubation period (further defined below); those that were present for less time were not included as a study subject because they could have been infected prior to importation. any bird that directly shared an enclosure with a subject for at least seven days was considered a "friend". thus, the same bird could serve as both a subject as well as a friend for other birds, as illustrated in fig 1. spatial connections between subjects and friends were determined through cross-referencing enclosure move-in and move-out dates of all birds. contact occurring in a few enclosures, including hospital and quarantine enclosures, could not be determined and was therefore excluded. exposures that could lead to potential transmission of mycobacteriosis would be those which occurred within the incubation period of the subject (fig 1) . however, the distribution of the true incubation period for avian mycobacteriosis is unknown. as a starting point, minimum incubation period, i.e., the minimum time for an exposure to result in detectable disease, was set to six months. this was based on early literature from experimental studies that mimicked natural transmission [27, 28] . this is also consistent with our own data where the earliest case in the population occurred at 182 days of age [3] . the maximum incubation period was set to two years. early studies reported deaths occurring up to 12-14 months after infection [27] [28] [29] ; however, some authors reviewed by feldman [1] considered it possible that the disease progression could take years. for subjects that were classified as non-infected, this same interval (two years to six months prior to death or censoring) was used to identify contact with friends. 
for example, if a subject died on january 1, 2005, it would be connected to all friends with which it shared an enclosure for at least seven days within the time window of two years until six months prior to the subject's death, or between january 1, 2003 and july 1, 2004. exposures of subjects to friends that could lead to potential disease transmission would also be those which occurred within the friends' infectious periods, when the bacteria could spread to other birds (fig 1). the period of shedding during which a bird is infectious for other birds is unknown and no estimates were available for a naturally occurring disease course. therefore, friends were assumed to be infectious for the maximum incubation time, or two years, as a starting point. exposure of the subject to friends that were not infected was considered for the same two-year period prior to the friend's final date in the study.
fig 1 caption: diagram of potential transmission relationships and connectivity of birds in the network. the figure represents three example birds, assessed for the potential for each to have acquired infection from the other. each bird, or "subject", was defined to have an incubation period, initially set to the period occurring six to 24 months before the bird's final date in the study. any other bird that shared an enclosure with the subject during its incubation period was defined as a "friend" if the two birds shared the space during the second bird's infectious period. a friend's infectious period was initially set to the period occurring two years prior to its final date in the study. thus, the figure shows the incubation and infectious periods for each bird in the larger bars, while the smaller bars show the overlapping period when the other two birds would be defined as its infectious "friends". the network edges were created from identifying the spatial and temporal overlap of potential incubation and infectious periods of subjects and friends in the study population. https://doi.org/10.1371/journal.pone.0237168.g001
fig 2a and 2b illustrate network assembly over time for an example subject and its friends. the transmission network was graphed using the kamada-kawai [30] algorithm and all visualizations and analyses were performed using r software, package: igraph [31].
fig 2b caption (in part): clustering of disease associated with all indirect contacts that could influence the subject's disease status (based on timing of contact). friends-of-friends: same environment. clustering of disease associated with influential indirect contacts that were exposed to the same enclosure/environment. friends-of-friends: contagion. clustering of disease associated with influential indirect contacts that were never exposed to the same environment. this evaluation is key for removing the confounding effects of the environment and testing for a contagious process. friends-of-friends: homophily. clustering of disease associated with friends of friends that were never exposed to the same environment and could not have transmitted disease to the subject based on the timing of the connection. this reverse-time placebo test evaluates our data for the presence of homophily, or whether disease clustering can be explained by similarities among connected birds. https://doi.org/10.1371/journal.pone.0237168.g002
an initial network was structured to include all connections of seven days or more between birds that occurred during their lifetimes.
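a minimal sketch of this edge construction is given below, assuming a pandas table of enclosure stays with hypothetical column names (bird_id, enclosure, move_in, move_out) and a mapping from each bird to its final study date; the seven-day overlap rule and the 6-24 month and two-year windows follow the text, but the study's actual pipeline was not published in this form.

```python
import pandas as pd

MIN_OVERLAP = pd.Timedelta(days=7)

def contact_edges(housing, final_dates,
                  incubation=(pd.DateOffset(months=24), pd.DateOffset(months=6)),
                  infectious=pd.DateOffset(years=2)):
    """Build (subject, friend) edges: the two birds share an enclosure for at least
    seven days, inside the subject's incubation window and the friend's infectious
    window. Column names and data layout are hypothetical; the pairwise self-merge
    is quadratic per enclosure and meant only as a sketch."""
    edges = []
    stays = housing.merge(housing, on="enclosure", suffixes=("_subj", "_frnd"))
    stays = stays[stays.bird_id_subj != stays.bird_id_frnd]
    for row in stays.itertuples(index=False):
        subj_end = final_dates[row.bird_id_subj]
        frnd_end = final_dates[row.bird_id_frnd]
        win_start = max(row.move_in_subj, row.move_in_frnd,
                        subj_end - incubation[0],   # 24 months before subject's final date
                        frnd_end - infectious)      # 2 years before friend's final date
        win_end = min(row.move_out_subj, row.move_out_frnd,
                      subj_end - incubation[1],     # 6 months before subject's final date
                      frnd_end)
        if win_end - win_start >= MIN_OVERLAP:
            edges.append((row.bird_id_subj, row.bird_id_frnd))
    return pd.DataFrame(edges, columns=["subject", "friend"]).drop_duplicates()
```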
from this, the transmission network used in the analyses was constructed by refining connectivity based on the subjects' incubation periods and friends' infectious periods as described above. network topology was characterized by size (number of nodes and edges), average path length, and transitivity (probability that two connected birds both share a connection with another bird). to evaluate statistically whether or not disease status of a subject is predicted by the disease status of its friend, we calculated the probability of mycobacteriosis in a subject given exposure to an additional infected friend relative to the probability of mycobacteriosis in a subject exposed to an additional non-infected friend, i.e., the relative risk (rr). to determine significance of the rr, the observed rr was compared to the distribution of the same rr calculation on 1000 randomly generated null networks where the network topology and disease prevalence were preserved, but the disease status was randomly shuffled to different nodes [15, 32] . if the observed rr fell outside the range of permuted values between the 2.5 th and 97.5 th percentiles, i.e., the null 95% confidence interval (ci), then we rejected the null hypothesis that the observed relationship was due to chance alone. reported p-values were estimated from the null 95% ci. we evaluated the relative risk of disease transmission through five types of shared relationships between subjects and their friends ( fig 2b) . each evaluation targeted different groups of subject-friend pairs that varied in degrees of separation as well as spatial and temporal characteristics of network edges. risk (also referred to as "clustering") of disease associated with directly connected birds, or "friends". this analysis examined all pairs of birds where the subject was in direct contact with its friend during the subject's defined incubation period and the friend's infectious period (illustrated in fig 1) . the rr estimate includes the combined risk from direct exposure to both other infected birds and a common environmental source. this analysis examined whether associations persisted among the indirectly connected friends, as observed in other contagious processes [15] . to identify these friends of friends, we constructed a matrix of shortest paths between all subject-friend pairs that never directly shared an enclosure but were indirectly connected through an intermediary bird. before estimating the rr and conducting the random permutation tests, the data were limited to each subject's set of "influential" nodes, or the friends of friends connected by pathways that respect time ordering along which disease could propagate [33] . in other words, the friend of friend shared an enclosure with an intermediary bird before the intermediary bird contacted the subject. the estimated rr includes the indirect risk of disease from both contagion and exposure to a common environmental source. risk of disease transmission associated with influential friends of friends sharing an environment with their subject. this analysis examined associations with the subset of all influential friends of friends, where both birds were in the same enclosure but not at the same time. for example, bird a shares an enclosure with c. if a moves out and b subsequently moves in, then b is exposed to a via c. importantly, both a and b also were exposed to the same environment. associations in this group would reflect a combination of risk due to common environmental exposure and contagion. 
for contagion, we evaluated associations with the influential friends of friends that were never in the same enclosure as their subject. from the earlier example, if bird c is moved from an enclosure with bird a to an enclosure with bird b, then b is exposed to a via c. that is, a can transmit infection to b, even though they never shared an enclosure. case clustering could not be attributed to exposure to the same environment because the subject and its friends of friends were never housed in the same enclosure. this evaluation also ensured correct temporal alignment between exposure to an infectious agent and disease outcome in the subject. this comparison was key for removing confounding effects of environmental exposure and testing for a contagious process. although disease clustering among friends of friends could represent a contagious process, there is a possibility that some of the association could be explained by homophily, i.e., that connected birds could be more alike than the general bird population in terms of species, behavior, susceptibility, enclosure characteristics, etc. [19] . this could make both birds more likely to acquire infection from any source and manifest as clustering on a network at degrees of separation. we tested the network for the presence of homophily using a reverse-time placebo test. for this test, we evaluated disease clustering between a subject and its friends of friends from different enclosures that could not have transmitted infection based on the timing of the contact. for the tests of contagion, we described how b could be exposed to a via c; however, in that same example, the reverse would not be true. b could not transmit infection to a because disease transmission is time-dependent. for our reverse-time placebo test, we evaluated whether the infection status of b predicted the infection status in a. if so, then it would suggest homophily is present and driving disease clustering. sensitivity analyses were performed to compare differences in rr estimates while varying model assumptions. we varied subjects' incubation time (testing a minimum of three months and a maximum of one, three, four and five years) and friends' infectious time (two years, one year, and six months). we also refined network edges to evaluate associations in subsets of data where biases were minimized. this included limiting the friends to those whose exposure to the subject was exclusively outside of the two-year infectious window. it also included refining network edges to contact between subjects and friends that occurred only in small enclosures where enclosure sharing may be a better proxy of true exposure. finally, we limited analyses to subjects and friends that died and received a post-mortem examination. the 16,430 birds in the source population consisted of 950 species and subspecies housed across 848 enclosures. mycobacteriosis was diagnosed in 275 of these birds (1.7%). the subset that qualified as study subjects included 13,409 of the birds, which represented 810 species and subspecies. subjects were housed across 837 different enclosures that varied in size, housing anywhere from one to over 200 birds at any given time. in total, 203 (1.5%) subjects developed mycobacteriosis. subjects were present in the study population for variable amounts of time with the median follow-up being 3.4 years (iqr: 1.4-7 years). on average, subjects moved between enclosures 4.4 times (sd: 4.1; range: 0-71), and were housed in three separate enclosures (sd: 2.5; range 1-26). 
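the statistical test described above can be sketched as follows; for brevity the rr is collapsed to an exposed (at least one infected friend) versus unexposed contrast rather than the per-additional-friend formulation, the function names are ours, and edge cases such as empty groups are only loosely handled.

```python
import random

def relative_risk(edges, infected):
    """RR of disease among subjects with at least one infected friend versus
    subjects with none. `edges` are (subject, friend) pairs, `infected` is a set."""
    friends = {}
    for subject, friend in edges:
        friends.setdefault(subject, set()).add(friend)
    exposed = [s for s, f in friends.items() if f & infected]
    unexposed = [s for s, f in friends.items() if not (f & infected)]
    if not exposed or not unexposed:
        return float("nan")
    p_exp = sum(s in infected for s in exposed) / len(exposed)
    p_unexp = sum(s in infected for s in unexposed) / len(unexposed)
    return p_exp / p_unexp if p_unexp else float("inf")

def null_interval(edges, infected, all_birds, n_perm=1000, seed=1):
    """Null 95% interval: keep the network fixed and randomly reassign which birds
    are infected, preserving the number of cases, as in the permutation test above."""
    rng = random.Random(seed)
    birds, k = list(all_birds), len(infected)
    rrs = sorted(relative_risk(edges, set(rng.sample(birds, k)))
                 for _ in range(n_perm))
    return rrs[int(0.025 * n_perm)], rrs[int(0.975 * n_perm)]
```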
the average time a subject spent with each friend was about ten months (314 days; sd: 201 days). the full network that included all subject-friend connections contained 2,492,438 edges, but we focused on the transmission network limited to plausible fecal-oral transmission routes based on sharing an enclosure for at least 7 days during the subjects' incubation periods and its friends' infectious periods. this transmission network included all 16,430 birds with 905,499 connections linking their temporal and spatial location. the median number of friends each subject contacted (network degree centrality), was 105 (iqr: 21-303; range: 0-1435). the network exhibited small world properties [34] with short paths (average path length = 3.8) and many cliques where groups of birds were all connected to each other (transitivity = 0.63). a portion of the network diagram that includes subjects infected with avian mycobacteriosis and their directly connected friends is shown (fig 3) . results from all five associations are shown in fig 4 and rr estimates with p-values are reported in table 1 . when we performed our test between the subject and its directly connected friends we found significant clustering of cases based on social network ties; the risk of mycobacteriosis given exposure to an infected friend was 7.0 times greater than the risk of mycobacteriosis given exposure to an uninfected friend (p<0.001). significant associations persisted among the friends of friends. the rr of disease given exposure to any influential, infected friend of friend, compared to exposure to an uninfected friend of friend, was 1.35 (p<0.001). when subset to just the influential friends of friends that shared the same environment, the rr was 1.47 (p = 0.004). importantly, the friends-of-friends contagion model identified a significant 31% increase in risk of infection among subjects that were exposed to an infected friend of friend compared to those exposed to an uninfected friend of friend (rr: 1.31, p = 0.004). we found no evidence of homophily with our reverse-time placebo test; i.e., there was no significant association when the friends of friends were limited to those who may have correlated traits, but could not have influenced the subject's disease status based on location and timing of their indirect connection (rr: 0.95; p = 0.586). results of sensitivity analyses for all five evaluated relationships are shown in table 1 . the sensitivity analyses did not yield drastically different findings than the analyses of the main network and the significance of most associations remained. generally, as the subjects' incubation periods increased, the magnitude of the rrs with the friends and friends of friends decreased. this same pattern was observed when connectivity was limited to that occurring two years prior to the friends' removal dates (i.e., outside of the friends' incubation windows). patterns of significance were mostly unchanged when the network was limited to just animals with post-mortem exams, and just birds housed in small enclosures. importantly, significant disease clustering in the test for contagion persisted in most examined network variations. the exception to this is when the subjects' maximum incubation periods or the friends' infectious periods became more narrowly defined. homophily was detected only when network edges were restricted to exposures outside friends' incubation periods when long time spans were present (rr: 1.10; p = 0.014). 
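the topology descriptors quoted above (degree, average path length, transitivity) are standard measures and can be computed, for example, with networkx; the snippet below runs on a small random graph purely to show the calls, since the real 16,430-node, 905,499-edge transmission network is not reproduced here.

```python
import networkx as nx

def summarize_topology(edge_list):
    """Size, median degree, average path length of the largest component, and
    transitivity for an undirected contact network given as node-pair edges."""
    g = nx.Graph(edge_list)
    degrees = sorted(d for _, d in g.degree())
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    return {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "median degree": degrees[len(degrees) // 2],
        "average path length": nx.average_shortest_path_length(giant),
        "transitivity": nx.transitivity(g),
    }

print(summarize_topology(nx.gnm_random_graph(200, 800, seed=2).edges()))
```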
our friends-of-friends network analysis suggests that avian mycobacteriosis can spread through bird social networks. although connected birds may acquire infection from exposure to common environmental sources and may share features that make them more likely to acquire disease through the environment, our friends-of-friends method detected statistically significant bird-to-bird transmission. one of the biggest challenges in determining whether bird-to-bird contagion is present for infectious agents that are also found in the environment, such as mycobacteria, is distinguishing the role of the environment. in one scenario, the environment serves as an intermediate collection place for mycobacteria being passed via (mostly) fecal contamination from an infected bird to one or more other birds, leading to infection spread in chain- or web-like patterns across a network [35]. alternatively, the environment may serve as the natural, independent reservoir of mycobacteria (e.g., biofilms in the water [36]), giving rise to opportunistic infection among birds that share the location. spatial and temporal disease clustering could represent either or both of these two infection routes. homophily, where connected individuals tend to be more alike in species or habitat needs than the general population and, therefore, may share the same disease susceptibility, could occur in both scenarios. for the directly connected birds in our study, the significantly elevated rr represented a combination of these three effects. examining the friends of friends rather than directly connected birds provided a means to disassociate exposure to another bird from exposure to that bird's environment. at two degrees of separation, the characteristics of network edges were more distinct, with temporal separation in potential transmission pathways and spatial separation in location. we exploited these pathways in a stepwise approach to calculate the rr of disease given exposure to friends of friends with different types of network ties. the subset of all influential friends of friends were temporally aligned to pass infection, but this group again represented a combined effect of multiple transmission pathways. because there was no evidence of significant homophily (further discussed below), we could use the network structure to test for the presence of contagion. among subjects who were connected to infected friends of friends in a different enclosure, the significant increase in risk for mycobacteriosis represents contagion. while this very specific subset of network edges allowed us to disentangle environmental and contagious transmission, it required two consecutive infections among a chain of related birds. this ignored most subjects and their friends of friends that shared enclosures, where both processes were possible and completely confounded. while our extensive, long-term set of connections in this network allowed detection of disease transmission using just this subset, the relative risks likely underestimate the true magnitude of bird-to-bird contagion.
fig 4 caption (in part): (n = 16,430). the estimated relative risk (rr) for each of five different relationships between subjects and friends that were directly and indirectly connected. evaluated relationships are described in the methods and fig 2b. significance of the estimate was determined by comparing the conditional probability of mycobacteriosis in the observed network with 1000 permutations of an identical network (with the topology and incidence of mycobacteriosis preserved) in which the same number of infected birds were randomly distributed. error bars show the null 95% confidence intervals generated from the random permutations. rrs that were outside of the null and significant are indicated with *. https://doi.org/10.1371/journal.pone.0237168.g004
our data show significant, directional clustering along the pathways on which disease could propagate; however, we did not find clustering when we reversed these pathways, where birds were connected but disease could not be transmitted, because an infection cannot be passed backwards through time. we applied our test of directionality, which is similar to those used by others [15], to evaluate whether homophily could be driving the observed associations. in this bird population, similar species with comparable habitat needs have always tended to be housed together. therefore, we would expect biases due to homophily to exert similar effects along all pathways of connectivity, regardless of time. it is well documented that homophily and contagion are confounded in social networks [37] [38] and we could not specifically adjust the rrs for unobserved homophily; however, many of the psychosocial factors that lead to homophily in human networks [19] [38] are not directly applicable to birds. while homophily might still be present, our data strongly suggest that it is not driving the observed clustering of disease between a subject and its friends of friends.
table 1 caption (in part): 1992-2014 (n = 16,430). subjects; network edges. the five evaluated relationships are described in detail in the methods and fig 2b. the calculated statistic is the probability that a subject has disease, given that its friend has disease, compared to the probability that a subject has disease given that its friend does not (i.e., rr). to determine whether the observed rr falls within the 2.5th and 97.5th percentile of the null distribution, the disease status was randomly shuffled in 1000 network permutations where the network structure and prevalence of mycobacteriosis was preserved. significant p values indicate that the observed rr fell outside of the null 95% ci and we reject the null hypothesis that the observed rr is due to chance alone. https://doi.org/10.1371/journal.pone.0237168.t001
historically, in experimental infection studies, birds have been shown to be susceptible to the infectious bacilli when directly administered, i.e., introduced intravenously, intramuscularly, intraperitoneally, subcutaneously, or orally [39] [40] [41] [42]. yet, the relevance of direct inoculation to natural transmission has always been tenuous. some studies have shown little to no transmission when healthy chickens were placed in contact with either diseased birds or their contaminated environments [43]. therefore, our study provides new evidence which supports bird-to-bird transmission in natural settings. our results also suggest that avian mycobacteriosis is not highly contagious, which is consistent with early experimental studies that conclude the bacteria must be given repeatedly over long periods of time to ensure infection [1]. the small world network structure that we identified for birds in the study population would predict epidemic-style outbreaks for diseases with facile and rapid transmission [34] [44]; however, most birds did not acquire infection even when directly linked to other positive birds for long periods. over time, we have not seen epidemics and the incidence of disease in this population is consistently low (1%) [3].
our network approach was elucidating in this particular scenario, enabling us to uncover subtle patterns of a contagious process. environmental mycobacteria are recognized as the cause for ntm infections in humans and other animals [9] [10] [11] . limited genetic and speciation data from managed avian populations have found multiple strains and species of mycobacteria attributed to single outbreaks [6] [7] [8] . in our bird population, several different species and genotypes of mycobacteria have also been identified [5, 45] . consequently, we know that some birds could not have passed the infection to each other. genetic data from mycobacterial isolates would be a more definitive method of identifying the transmission of infection within a shared environment. for the present study, our approach was to isolate and test for contagion when there is missing information on the specific etiologic agents and transmission pathways. additional studies using genetic data could refine relevant transmission pathways or highlight important environmental sources within the network. we took care in assembling our network to ensure that the edge construction between subjects and friends adhered to general recommendations for disease networks [26, 46, 47] . this included incorporating biologically meaningful time-periods relevant to mycobacterial disease ecology and the type of exposure needed for transmission. generally, mycobacteriosis is considered a chronic disease, with an incubation period that can last for months and possibly years [1, 2] . it is also thought that animals can insidiously shed the organisms for long periods of time and those organisms can potential stay viable in the environment for years [48, 49] . we know there is misclassification of exposure in this network, because the true extents of incubation and infectious periods are wide, variable, and unknown. in sensitivity analyses, our rr estimates were generally similar when we varied incubation and infectious periods (table 1) . we did find a significant rr when limiting network edges to those occurring before the friends' 2-year incubation period, which suggests that some contagious processes may occur before the 2-year window. we also found that evidence for contagion was lost when either the subject incubation period or friend infectious period was short (less than six months and less than one year, respectively). it is likely that the shorter incubation times did not allow sufficient overlap of risk periods between subjects and friends. the duration of exposure needed for transmission is also unknown, but birds can be housed together for a year or more and not acquire infection [1, 4] . generally, mathematical models show that increasing the intensity or duration of contact between individuals with an infectious disease increases the probability of a transmission event and this can be reflected in weighted networks [35, 50] . in the present study, we required a minimum of seven days together to establish a network link that could capture relevant, short-duration exposure; however, the majority of birds were together for longer, with the mean contact-days being about 10.5 months (314 days). further exploration of contact heterogeneity on network associations may provide additional insight into clinically relevant exposure, infectious periods, and incubation times. 
inferring contagion by testing for disease clustering in subsets of the network requires quite complete network ascertainment, very good information on location over time, knowledge of disease outcomes, and a large number of subjects and their connected friends over time. our zoo data were unique in this respect and represent an example of how network substructures can inform global disease processes. many of the issues that cause bias in network measures, such as node censoring [51] , network boundary specification [52] , or unfriending [53] are unlikely to have affected our findings due to the completeness of our data. while such data may currently be rare, large datasets with similar network resolution may become widely available in the future as the world becomes increasingly connected by technology. for example, many new public and private contact-tracing initiatives are taking advantage of mobile phone technology to digitally track covid-19. eventually, these may allow near-complete human disease transmission networks to be assembled. this makes our friends-of-friends social network approach using network substructures a viable option for informing indirect covid-19 transmission pathways and public policy. most epidemiologic studies that use a network approach focus on directly transmitted, infectious diseases [47] . social networks to investigate diseases transmitted through the environment are assembled less often because defining contact in the presence of environmental persistence or other important transmission routes, such as fomites or insects, can be challenging [26] . to our knowledge, this is the first application of a friends-of-friends method to determine whether global patterns of connectivity support a contagious process. similar approaches could be useful to investigate diseases of humans or animals when the network is complete and mobility patterns are known, but the disease etiology or transmission pathways are unknown. avian tuberculosis infections. baltimore: the williams & wilkins company diseases of poultry investigation of characteristics and factors associated with avian mycobacteriosis in zoo birds investigation of factors predicting disease among zoo birds exposed to avian mycobacteriosis molecular epidemiology of mycobacterium avium subsp. avium and mycobacterium intracellulare in captive birds pcr-based typing of mycobacterium avium isolates in an epidemic among farmed lesser white-fronted geese (anser erythropus) mycobacterium avium subsp. 
avium distribution studied in a naturally infected hen flock and in the environment by culture, serotyping and is901 rflp methods avian tuberculosis in naturally infected captive water birds of the ardeideae and threskiornithidae families studied by serotyping, is901 rflp typing, and virulence for poultry current epidemiologic trends of the nontuberculous mycobacteria (ntm) primary mycobacterium avium complex infections correlate with lowered cellular immune reactivity in matschie's tree kangaroos (dendrolagus matschiei) simian immunodeficiency virus-inoculated macaques acquire mycobacterium avium from potable water during aids the spread of obesity in a large social network over 32 years dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study social network sensors for early detection of contagious outbreaks social contagion theory: examining dynamic social networks and human behavior prolonged outbreak of mycobacterium chimaera infection after open-chest heart surgery geographic prediction of human onset of west nile virus using dead crow clusters: an evaluation of year 2002 data in new york state cancer cluster investigations: review of the past and proposals for the future birds of a feather: homophily in social networks relationships between mycobacterium isolates from patients with pulmonary mycobacterial infection and potting soils mycobacterium chimaera outbreak associated with heater-cooler devices: piecing the puzzle together contact networks in a wild tasmanian devil (sarcophilus harrisii) population: using social network analysis to reveal seasonal variability in social behaviour and its implications for transmission of devil facial tumour disease badger social networks correlate with tuberculosis infection integrating association data and disease dynamics in a social ungulate: bovine tuberculosis in african buffalo in the kruger national park influence of contact heterogeneity on tb reproduction ratio r 0 in a free-living brushtail possum trichosurus vulpecula population infectious disease transmission and contact networks in wildlife and livestock untersuchungen uber die tuberkulinkehllappenprobe beim huhn die empfanglichkeit des huhnes fur tuberkulose unter normalen haltungsbedingungen seasonal distribution as an aid to diagnosis of poultry diseases an algorithm for drawing general undirected graphs the igraph software package for complex network research a guide to null models for animal social network analysis network reachability of real-world contact sequences dynamics, and the small-world phenomenon mathematical models of infectious disease transmission surrounded by mycobacteria: nontuberculous mycobacteria in the human environment homophily and contagion are generically confoudnded in observational social network studies origins of homophily in an evolving social network contribution to the experimental infection of young chickens with mycobacterium avium morphological changes in geese after experimental and natural infection with mycobacterium avium serotype 2 a model of avian mycobacteriosis: clinical and histopathologic findings in japanese quail (coturnix coturnix japonica) intravenously inoculated with mycobacterium avium experimental infection of budgerigars (melopsittacus undulatus) with five mycobacterium species bulletin north dakota agricultural experimental station mathematical studies on human disease dynamics: emerging paradigms and challanges whole-genome analysis of mycobacteria from birds 
at the san diego zoo network transmission inference : host behavior and parasite life cycle make social networks meaningful in disease ecology networks and the ecology of parasite transmission: a framework for wildlife parasitology avian tuberculoisis: collected studies. technical bulletin of the north dakota agricultural experimental station field manual of wildlife diseases: general field procedures and diseases of birds. usgs-national wildlife health center epidemic processes in complex networks censoring outdegree compromises inferences of social network peer effects and autocorrelation social networks and health: models, methods, and applications the "unfriending" problem: the consequences of homophily in friendship retention for causal estimates of social influence we thank the many people from sdzg that made this work possible, including the disease investigations team and veterinary clinical staff for ongoing disease surveillance, and the animal management staff for tracking housing histories of birds. we thank caroline baratz, dave rimlinger, michael mace, and the animal care staff for assistance with historical enclosure data. we thank richard shaffer, florin vaida, and christina sigurdson for thoughtful comments on manuscript preparation. key: cord-259634-ays40jlz authors: marcelino, jose; kaiser, marcus title: critical paths in a metapopulation model of h1n1: efficiently delaying influenza spreading through flight cancellation date: 2012-05-15 journal: plos curr doi: 10.1371/4f8c9a2e1fca8 sha: doc_id: 259634 cord_uid: ays40jlz disease spreading through human travel networks has been a topic of great interest in recent years, as witnessed during outbreaks of influenza a (h1n1) or sars pandemics. one way to stop spreading over the airline network are travel restrictions for major airports or network hubs based on the total number of passengers of an airport. here, we test alternative strategies using edge removal, cancelling targeted flight connections rather than restricting traffic for network hubs, for controlling spreading over the airline network. we employ a seir metapopulation model that takes into account the population of cities, simulates infection within cities and across the network of the top 500 airports, and tests different flight cancellation methods for limiting the course of infection. the time required to spread an infection globally, as simulated by a stochastic global spreading model was used to rank the candidate control strategies. the model includes both local spreading dynamics at the level of populations and long-range connectivity obtained from real global airline travel data. simulated spreading in this network showed that spreading infected 37% less individuals after cancelling a quarter of flight connections between cities, as selected by betweenness centrality. the alternative strategy of closing down whole airports causing the same number of cancelled connections only reduced infections by 18%. in conclusion, selecting highly ranked single connections between cities for cancellation was more effective, resulting in fewer individuals infected with influenza, compared to shutting down whole airports. it is also a more efficient strategy, affecting fewer passengers while producing the same reduction in infections. the network of connections between the top 500 airports is available under the resources link on our website http://www.biological-networks.org. complex networks are pervasive and underlie almost all aspects of life. 
they appear at different scales and paradigms, from metabolic networks, the structural correlates of brain function, the threads of our social fabric and to the larger scale making cultures and businesses come together through global travel and communication [1] [2] [3] [4] [5] [6] . recently, these systems have been modelled and studied using network science tools giving us new insight in fields such as sociology, epidemics, systems biology and neuroscience. typically components such as persons, cities, proteins or brain regions are represented as nodes and connections between components as edges [6] [7] . many of these networks can be categorised by their common properties. two properties relevant to spreading phenomena are the modular and scale-free organization of real-world networks. modular network consist of several modules with relatively many connections within modules but few connections between modules. scale-free networks with highly connected nodes (hubs) where the probability of a node having k edges follows a power law k −γ [8] [9] . it is possible for a network to show both scale-free and modular properties, however the two features may also appear independently. the worldwide airline network observed in this study was found to be both scale-free and modular [10] . spreading in networks is a general topic ranging from communication over the internet [11] [12], phenomena in biological networks [13] , or the spreading of diseases within populations [14] . scale-free properties of airline networks are of interest in relation to the error and attack tolerance of these networks [5] [15] . for scale-free networks, the selective removal of hubs produced a much greater impact on structural network integrity, as measured through increases in shortest-path lengths, than simply removing randomly selected nodes [15] . structural network integrity can also be influenced by partially inactivating specific connections (edges) between nodes [16] [17] [18] . dynamical processes such as disease spreading over heterogeneous networks was also shown to be impeded by targeting the hubs [19] [20] , with similar findings for highest traffic airports in the case of sars epidemic spreading [5] . in contrast to predictions of scale-free models, recent studies of the airline network [21] demonstrated that the structural cohesiveness of the airline network did not arise from the high degree nodes, but it was in fact due to the particular community structure which meant some of the lesser connected airports had a more central role (indicated by an higher betweenness centrality, the ratio of all-pairs shortest paths crossing each node). here we expand on this finding further by considering a range of centrality measures for individual connections between cities, show that their targeted removal can improve on existing control strategies [5] for controlling influenza spreading and finally discuss the effect of the community structure on this control. to demonstrate the impact on influenza spreading caused by topological changes to the airline network, we run simulations using a stochastic metapopulation model of influenza [22] [23] where the worldwide network of commercial flights is used as the path for infected individuals traveling between cities (see fig. 1a with mexico city as starting node of an outbreak). for this, we observe individuals within cities that contain one of the 500 most frequently used airports worldwide (based on annual total passenger number). 
individuals within the model can be susceptible (s), infected (i), or removed (r). the number of infected individuals depends on the population of each city and the volume of traffic over airline connections between cities. note that the time course of disease spreading will also be influence by seasonality [23] ; however, only spreading in one season was tested here. the simulated epidemic starts in 1 july 2007 from a single city, mexico city in our case, and its evolution over the following year is recorded. we then consider the number of days necessary for the epidemic to reach its peak as well as the maximum number of infected individuals ( fig. 2a) . this procedure is then repeated following the removal of a percentage of connections ranked as by a range of distinct measures such as edge betweenness centrality, jaccard coefficient or difference and product of node degrees. finally we also test the effect of shutting down the most highly connected airports (hubs) up to the same level of cancelled connections. comparing single edge removal strategies against the previously proposed shutdown of whole nodes (airports) we find that removing selected edges has a greater impact on the spreading of influenza with a significantly smaller loss of connectivity between cities. for the global airline network only a smaller set of flights routes between cities would need to be stopped instead of cancelling all the flights from a set of airports to get the same reduction in spreading. in addition as demonstrated in [21] for structural cohesiveness and in [24] regarding dynamical epidemic spreading, it is the community structure and not the degree distribution that plays a critical role in facilitating spreading. our method of slowing down spreading by removing critical connections is efficient as it targets links between such communities. concerning the computational complexity, whereas some strategies are computationally costly for large or rapidly evolving networks, several edge removal strategies are as fast as hub removal while still offering much better spreading control. note that whereas we observed similar strategies in an earlier study [25] , the current work includes the following changes: first, simulations run at the level of individuals rather than simulating whether the disease has reached ('infected') airports. second, the spreading between cities, over the airline network, now depends on the number of seats in airline connections between cities. this gives a much more realistic estimate of the actual spreading pattern as not only the existence of a flight connection but the specific number of passengers that flow over that link is taken into account. third, the previous study used an si model that is suitable for early stages of epidemic spreading. however, in this study we use an sir model that allows us to observe the time course of influenza spreading up to one year after the initial outbreak. for the network used in the study, the top 500 cities worldwide with the highest traffic airports became the nodes and an edge connects two of such nodes if there is at least one scheduled passenger flight between them. edges are then weighted by the daily average passenger capacity for that route. spreading in this network can then show how a disease outbreak, e.g. h1n1 or sars influenza, can spread around the world [5] [23] . 
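the general structure of such a simulation can be sketched compactly in python. this is only a simplified, self-contained illustration of a stochastic seir metapopulation step on a passenger-weighted network; the chain-binomial within-city update, the expected-value travel step, and all parameter values are assumptions made for this example, not the authors' java/anylogic implementation.

```python
import numpy as np

def seir_metapopulation(W, pop, beta=0.34, sigma=0.5, mu=0.2,
                        seed_city=0, seed_exposed=100, days=365, seed=0):
    """Toy stochastic SEIR metapopulation simulation on an airline-style network.

    W[i, j] : average daily passenger seats from city i to city j (weighted edge)
    pop[i]  : population of city i
    beta, sigma, mu : infection (S->E), incubation (E->I) and recovery (I->R) rates
    Infectious individuals do not travel; exposed individuals travel with the flows.
    """
    rng = np.random.default_rng(seed)
    n = len(pop)
    S = pop.astype(float).copy()
    E, I, R = np.zeros(n), np.zeros(n), np.zeros(n)
    S[seed_city] -= seed_exposed
    E[seed_city] += seed_exposed
    peak = 0.0
    for _ in range(days):
        # within-city transitions (chain-binomial approximation)
        p_exp = 1.0 - np.exp(-beta * I / np.maximum(pop, 1.0))
        new_E = rng.binomial(S.astype(int), p_exp)
        new_I = rng.binomial(E.astype(int), 1.0 - np.exp(-sigma))
        new_R = rng.binomial(I.astype(int), 1.0 - np.exp(-mu))
        S -= new_E
        E += new_E - new_I
        I += new_I - new_R
        R += new_R
        # long-range transmission: exposed travellers move in proportion to seat capacity
        frac_exposed = E / np.maximum(pop, 1.0)
        moved = W * frac_exposed[:, None]        # expected exposed passengers per route
        E += moved.sum(axis=0) - moved.sum(axis=1)
        E = np.maximum(E, 0.0)                   # guard against over-drawing in this toy step
        peak = max(peak, I.sum())
    return peak

# a removal strategy is evaluated by zeroing the selected entries of W
# (cancelled routes) and comparing the resulting peak with the intact network
```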
as in previous studies [5] [26], we have used a similar methodology [22] where one city is the starting point for the epidemic and air travel between such cities offers the only transmission path for an infectious disease to spread between them. due to the relevance of the recent h1n1 (influenza a) epidemic we have used mexico city to be the epidemic starting point of our simulations. spreading simulations starting in mexico city with 100 exposed individuals were summarised by ninfectious the greatest number of infected individuals that were infectious at any time during the epidemic. spreading control strategies were evaluated by removing up to 25% of the flight routes and measuring the resulting decrease in ninfectious (see fig. 2a and methods). measures based on edge betweenness and jaccard coefficient were the two best predictors of critical edges (fig. 1a) . among the top intercontinental connections identified by betweenness centrality are flights from sao paulo (brazil) to beijing (china), sapporo (japan) to new york (usa) and montevideo (uruguay) to paris (france). after removing a quarter of all edges, both strategies showed a decrease in infected population of 37% for edge betweenness centrality and 23% for the jaccard coefficient, compared to only 18% for the hub removal strategy. (a) influenza spreading for mexico city as starting node, measured by the number of infected individuals over time on the intact network (blue) and after removing 25% of edges by hub removal (red) or edge betweenness (green). (b) maximum infected population following sequential edge elimination by betweenness centrality, jaccard coefficient, difference and product of degrees and hub removal (see methods). whereas in [23] a control strategy based on travel restrictions found that travel would need to be cut by 95% to significantly reduce the number of infected population, we observed that by removing connections ranked by edge betweenness this reduction to appeared after 18% of flight routes were cancelled (see fig. 2b ). to understand the underlying mechanism of these results we produced two rewired versions of the original network: one version preserved the degree distribution alone while another preserved both the latter and also the original community structure. applying the same spreading simulations on these rewired versions of the network showed that only on networks that preserved the original's community structure did we observe a significant reduction in infections when removing edges (see fig. 3 ) connecting nodes ranked by jaccard coefficient. for the 25% restriction level considered, betweenness centrality was the best measure even when no communities were present, offering a 41% reduction in infected cases in both types of network. this apparent advantage of betweenness even in networks without communities is due to its use of the capacity of each connection (edge weight), at 25% edge removal it will have removed most major high capacity connections from the network. jaccard is a purely structural measure and without knowledge of capacity. the presence of communities is then critical for its performance. at lower levels of damage we see that jaccard is better than edge betweenness centrality at reducing infected cases in networks with community structure. selecting specific edges for removal efficiently controls spreading in the airline network. 
although this was not tested directly, cancelling fewer flights might also lead to fewer passengers that are affected by these policies compared to the approach of cancelling mostly flights from highly connected nodes (hubs). with the same number of removed connections, edge removal strategies resulted in both a larger slowdown of spreading and a resulting much smaller number of infected individuals compared to hub removal strategies. edge betweenness was best at predicting critical edges that carried the greater traffic weighted by number of passengers traveling resulting in a large reduction in infectious population; however we also observed that removing edges ranked using the purely structural jaccard coefficient (see fig. 2a ) led to the greatest delay in reaching the peak of the epidemic. among the best predictor edge measures, due to a computational complexity of o(n 2 ), the jaccard coefficient is the fastest measure to calculate, making it particularly suitable for large networks or networks where the topology frequently changes. edge betweenness was the computationally most costly measure with o(n * e), for a network with n nodes and e edges. whereas hub removal was the worst strategy in this study, node centrality might lead to better results. indeed, previous findings [10] show that the most highly connected cities in the airline system do not necessarily have the highest node centrality. however, node centrality would be computationally as costly as edge betweenness. highly ranked connections predicted by edge measures were critical for the transmission of infections or activity and can be targeted individually with fewer disruptions for the overall network. in the transportation network studied, this means higher ranked individual connections could be cancelled instead of isolating whole cities from the rest of the world. results obtained from simulating the same spreading strategy over differently rewired versions of the airline network demonstrated the mechanism behind the performance of the jaccard predictor in slowing down spreading in networks that display a community structure, as is the case for spatially distributed real-world networks [27] [28] [29] . this is a good measure for these types of networks, given its good computational efficiency and the little information it requires to compute the critical links -it needs nothing else than to know the connections between nodes. the current study was testing different strategies and different percentages of removed edges leading to a large number of scenarios that had to be tested. therefore, several simplifications had to be performed whose role could be investigated in future studies. first, only one starting point, mexico city, for epidemics was tested. while this is in line with earlier studies using 1-3 starting points [5] [23] , it would be interesting to test whether there are exceptions to the outcomes presented here. second, spreading was observed only in one season, summer. previous work [23] has pointed out that the actual spreading pattern differs for different seasons. third, only the 500 airports with the largest traffic volume rather than all 3,968 airports were included in the simulation. while this was done in order to be comparable with the earlier study of hufnagel et al. [5] , tests on the larger dataset would be interesting. including airports with lower traffic volumes might preferable include national and regional airports within network modules. 
this could lead to a faster infection of regions; however, connections between communities would still remain crucial for the global spreading pattern. compared to our earlier study where the spreading of infection between airports rather than individuals was modelled [25] , edge betweenness could reduce the maximally infected population number more than targeting network hubs. the jaccard coefficient that showed very good performance in the earlier study [25] , however, did not perform better than the hub strategy. the difference and product of node degrees were poor strategies for both spreading models. this indicates that metapopulation models can lead to a different evaluation of flight cancellation strategies for slowing down influenza spreading. in conclusion, our results point to edge-based component removal for efficiently slowing spreading in airline and potentially other real-world networks. the network of connections between the top 500 airports is available under the resources link on our website http://www.biological-networks.org. note that distribution of the complete data set, including all airports and traffic volumes, is not allowed due to copyright restrictions. however, the complete dataset can be purchased directly from oag worldwide limited. as in other work [5] [10], we obtained scheduled flight data for one year provided by oag aviation solutions (luton, uk). this listed 1,341,615 records of worldwide flights operating from july 1, 2007 to july 30, 2008, which is estimated by oag to cover 99% of commercial flights. the records include the cities of origin and destination, days of operation, and the type of aircraft in service for that route. airports were uniquely identified by their iata code together with their corresponding cities. these cities became the nodes in the network. short-distance links corresponding to rail, boat, bus or limousine connections were removed from our data set. an edge connecting a pair of cities is present if at least one scheduled flight connected both airports. as in previous studies [5] , we used a sub-graph containing the 500 top airports that was obtained by selecting the airports with greater seat traffic combining incoming and outgoing routes. this subset of airports still represents at least 95% of the global traffic, and as demonstrated in [30] it includes sufficient information to describe the global spread of influenza. we are allowed to make the restricted data set of 500 airports available and you can download it under the resources link at http://www.biological-networks.org/ our analysis is based on the stochastic equation-based (seb) epidemic spreading model as used in [31] , simulating the spreading of influenza both within cities and at a global level through flights connecting the cities' local airports. within cities, a stochastically variable portion of the susceptible population establishes contact with infected individuals. this type of meta-population model accounts for 5 different states of individuals within cities: non-susceptible, susceptible, exposed, infectious, and removed (deceased). as we have not considered vaccination in this model we did not use the non-susceptible class in our study. movement of individuals between cities is determined deterministically from the daily average passenger seats on flights between cities. once infectious, individuals will not travel. we have assumed a moderate level of transmissibility between individuals, where r0 = 1.7, as also used in other influenza studies [31] [32] . 
note, however, that future epidemics of h5n1 and other viruses might have different r0 values. in [5] a similar model including stochastic local dynamics was used; however, it was focused on a specific outbreak of sars (severe acute respiratory syndrome) and hong kong was considered its starting point. five candidate measures for predicting critical edges in networks were tested. the measures are based on a range of different parameters including node similarity, degree, and all-pairs shortest paths. measures are taken only once from the intact network and are not recomputed after each removal step. edge betweenness centrality [33] [34] represents how many times that particular edge is part of the all-pairs shortest paths in the network. edge betweenness can show the impact of a particular edge on the overall characteristic path length of the network; a high value reveals an edge whose removal will quite likely increase the average number of steps needed for spreading. the jaccard similarity coefficient (or matching index [35] [36]) shows how similar the neighbourhood connectivity structure of two nodes is; for example, two nodes that share the exact same set of neighbours would have the maximum similarity coefficient of 1. a low coefficient reveals a connection between two different network structures that might represent a "shortcut" between remote regions, making such low jaccard coefficient edges a good target for removal. the absolute difference of degrees for the adjacent nodes is another measure of similarity of two nodes. a large value here indicates a connection between a network hub and a more sparsely connected region of the network. the product of the degrees of the nodes connected by the edge is high when both nodes are highly connected (hubs). for testing the absolute difference and product of degrees we also considered the opposite removal strategy (starting with the lowest values), but the results were consistently under-performing when compared to all other measures (not shown). finally, for the hub removal baseline, the most highly connected nodes are detected and each such node, and therefore all the edges of that node, is removed from the network. note that this is referred to as the 'hub removal strategy', whereas its impact is shown in relation to the number of edges which are removed after each node removal. original simulation code, as used in [23], was obtained from the midas project (research computing division, rti international). the simulator was developed in the java (sun microsystems, usa) programming language using the anylogic tm (version 5.5, xj technologies, usa) simulation framework to implement the dynamical model. network measures were implemented in custom matlab (r2008b, mathworks, inc., natick, usa) code. results were further processed in matlab. simulations were run in parallel on a 16-core hp proliant server, using the sun java 6 virtual machine. edge betweenness centrality was implemented using the algorithm by brandes [34]. links between cities in the network were considered to be directed; the network used included a total of 24,009 edges. mexico city was used as the starting node, as observed in the recent 2009 h1n1 pandemic. the starting date of the epidemic was assumed to be 1 july, and the pandemic evolution is simulated over the following 365 days, covering all the effects of seasonality as seen in both the southern and northern hemispheres. following the removal of each group of edges ranked by each control strategy, the spreading simulations were repeated.
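the five rankings described above can be reproduced with standard graph libraries. the sketch below uses networkx and is only an illustrative approximation of the study's matlab code; in particular, it ranks edges on an unweighted directed graph, whereas the study's betweenness computation took passenger capacities into account (which would require passing suitable edge weights to the betweenness routine).

```python
import networkx as nx

def rank_edges(G, top_k=None):
    """Rank the edges of directed graph G by the four edge measures used in the study."""
    # edge betweenness centrality (brandes' algorithm): fraction of all-pairs
    # shortest paths that pass through each edge; remove the highest values first
    betweenness = nx.edge_betweenness_centrality(G)

    jaccard, deg_diff, deg_prod = {}, {}, {}
    for u, v in G.edges():
        nu = set(nx.all_neighbors(G, u))
        nv = set(nx.all_neighbors(G, v))
        union = nu | nv
        # jaccard (matching index): neighbourhood similarity of the two endpoints;
        # LOW values flag likely "shortcuts" between remote regions, removed first
        jaccard[(u, v)] = len(nu & nv) / len(union) if union else 0.0
        deg_diff[(u, v)] = abs(G.degree(u) - G.degree(v))
        deg_prod[(u, v)] = G.degree(u) * G.degree(v)

    def ordered(scores, reverse=True):
        return sorted(scores, key=scores.get, reverse=reverse)[:top_k]

    return {
        "edge_betweenness": ordered(betweenness),
        "jaccard": ordered(jaccard, reverse=False),
        "degree_difference": ordered(deg_diff),
        "degree_product": ordered(deg_prod),
    }

def hub_removal_order(G):
    """Nodes sorted by degree, highest first (the hub removal baseline)."""
    return sorted(G.nodes(), key=G.degree, reverse=True)
```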
to test whether the mechanism of control arose from the particular community structure or degree distribution, we observed two different rewired versions of the original network. in one version only each individual node degree was maintained and the whole network was randomly rewired, destroying the original community structure. for the second, the original community structure was preserved but the sub-network within each community was rewired, so connections within the community were rearranged but the original inter-community links were preserved. both rewiring strategies preserved the original degree structure by the commonly used algorithm [37] in order to maintain the same number of passengers departing from each city and the number of passengers is only shuffled to different destinations. this way both strategies did not change in the number of passengers departing from each city, only the connectivity structure was modified. the original community structure was identified using an heuristic modularity optimization algorithm [38] which identified four distinct clusters. these are predominantly geographic: one for north and central america, including canada and hawaii, another for south america, a third including the greater part of china (except hong kong, macau and beijing) and finally a fourth including all other airports (fig. 1b) . twenty rewired networks were generated for each version of the rewiring algorithm and the daily average evolution of influenza, using the same spreading algorithm as above, was taken across these 20 networks. this was repeated after the removal of each group of edges. therefore each measure on each of the rewired lots combines 182,500 individual results. classes of small-world networks exploring complex networks statistical mechanics of complex networks the structure and function of complex networks forecast and control of epidemics in a globalized world scale-free networks: complex webs in nature and technology graph theory emergence of scaling in random networks villas boas pr. characterization of complex networks: a survey of measurements the worldwide air transportation network: anomalous centrality, community structure, and cities' global roles breakdown of the internet under intentional attack epidemic spreading in scale-free networks error and attack tolerance of complex networks attack vulnerability of complex networks edge vulnerability in neural and metabolic networks multiple weak hits confuse complex systems: a transcriptional regulatory network as an example infection dynamics on scale-free networks superspreading and the effect of individual variation on disease emergence modeling the world-wide airport network a mathematical model for the global spread of influenza controlling pandemic flu: the value of international air travel restrictions. plos one superspreading and the effect of individual variation on disease emergence reducing in fl uenza spreading over the airline network the role of the airline transportation network in the prediction and predictability of global epidemics modeling the internet's large-scale topology nonoptimal component placement, but short processing paths, due to long-distance projections in neural systems community analysis in social networks sampling for global epidemic models and the topology of an international airport network controlling pandemic flu: the value of international air travel restrictions. 
plos one strategies for mitigating an influenza pandemic a set of measures of centrality based on betweenness a faster algorithm for betweenness centrality computational methods for the analysis of brain connectivity graph theory methods for the analysis of neural connectivity patterns. neuroscience databases. a practical guide specificity and stability in topology of protein networks watts dj, strogatz sh. collective dynamics of 'small-world' networks on the evolution of random graphs we thank http://www.flightstats.com for providing location information for all airports and oag worldwide limited for providing the worldwide flight data for one year. supported by wcu program through the national research foundation of korea funded by the ministry of education, science and technology (r32-10142). marcus kaiser was also supported by the royal society (rg/2006/r2), the carmen e-science project (http://www.carmen.org.uk) funded by epsrc (ep/e002331/1), and (ep/g03950x/1). jose marcelino was supported by epsrc phd studentship (case/cna/06/25) with a contribution from e-therapeutics plc. the authors have declared that no competing interests exist. key: cord-134926-dk28wutc authors: dasgupta, anirban; sengupta, srijan title: scalable estimation of epidemic thresholds via node sampling date: 2020-07-28 journal: nan doi: nan sha: doc_id: 134926 cord_uid: dk28wutc infectious or contagious diseases can be transmitted from one person to another through social contact networks. in today's interconnected global society, such contagion processes can cause global public health hazards, as exemplified by the ongoing covid-19 pandemic. it is therefore of great practical relevance to investigate the network trans-mission of contagious diseases from the perspective of statistical inference. an important and widely studied boundary condition for contagion processes over networks is the so-called epidemic threshold. the epidemic threshold plays a key role in determining whether a pathogen introduced into a social contact network will cause an epidemic or die out. in this paper, we investigate epidemic thresholds from the perspective of statistical network inference. we identify two major challenges that are caused by high computational and sampling complexity of the epidemic threshold. we develop two statistically accurate and computationally efficient approximation techniques to address these issues under the chung-lu modeling framework. the second approximation, which is based on random walk sampling, further enjoys the advantage of requiring data on a vanishingly small fraction of nodes. we establish theoretical guarantees for both methods and demonstrate their empirical superiority. infectious diseases are caused by pathogens, such as bacteria, viruses, fungi, and parasites. many infectious diseases are also contagious, which means the infection can be transmitted from one person to another when there is some interaction (e.g., physical proximity) between them. today, we live in an interconnected world where such contagious diseases could spread through social contact networks to become global public health hazards. a recent example of this phenomenon is the covid-19 outbreak caused by the so-called novel coronavirus (sars-cov-2) that has spread to many countries zhu et al., 2020; wang et al., 2020; sun et al., 2020) . this recent global outbreak has caused serious social and economic repercussions, such as massive restrictions on movement and share market decline (chinazzi et al., 2020) . 
it is therefore of great practical relevance to investigate the transmission of contagious diseases through social contact networks from the perspective of statistical inference. consider an infection being transmitted through a population of n individuals. according to the susceptible-infected-recovered (sir) model of disease spread, the pathogen can be transmitted from an infected person (i) to a susceptible person (s) with an infection rate given by β, and an infected individual becomes recovered (r) with a recovery rate given by µ. this can be modeled as a markov chain whose state at time t is given by a vector (x t 1 , . . . , x t n ), where x t i denotes the state of the i th individual at time t, i.e., x t i ∈ {s, i, r}. for the population of n individuals, the state space of this markov chain becomes extremely large with 3 n possible configurations, which makes it impractical to study the exact system. this problem was addressed in a series of three seminal papers by kermack and mckendrick (kermack and mckendrick, 1927 , 1932 , 1933 . instead of modeling the disease state of each individual at at a given point of time, they proposed compartmental models, where the goal is to model the number of individuals in a particular disease state (e.g., susceptible, infected, recovered) at a given point of time. since their classical papers, there has been a tremendous amount of work on compartmental modeling of contagious diseases over the last ninety years (hethcote, 2000; van den driessche and watmough, 2002; brauer et al., 2012) . compartmental models make the unrealistic assumption of homogeneity, i.e., each individual is assumed to have the same probability of interacting with any other individual. in reality, individuals interact with each other in a highly heterogeneous manner, depending upon various factors such as age, cultural norms, lifestyle, weather, etc. the contagion process can be significantly impacted by heterogeneity of interactions rocha et al., 2011; galvani and may, 2005; woolhouse et al., 1997) , and therefore compartmental modeling of contagious diseases can lead to substantial errors. in recent years, contact networks have emerged as a preferred alternative to compartmental models (keeling, 2005) . here, a node represents an individual, and an edge between two nodes represent social contact between them. an edge connecting an infected node and a susceptible node represents a potential path for pathogen transmission. this framework can realistically represent the heterogeneous nature of social contacts, and therefore provide much more accurate modeling of the contagion process than compartmental models. notable examples where the use of contact networks have led to improvements in prediction or understanding of infectious diseases include bengtsson et al. (2015) and kramer et al. (2016) . consider the scenario where a pathogen is introduced into a social contact network and it spreads according to an sir model. it is of particular interest to know whether the pathogen will die out or lead to an epidemic. this is dictated by a set of boundary conditions known as the epidemic threshold, which depends on the sir parameters β and µ as well as the network structure itself. above the epidemic threshold, the pathogen invades and infects a finite fraction of the population. below the epidemic threshold, the prevalence (total number of infected individuals) remains infinitesimally small in the limit of large networks (pastor-satorras et al., 2015) . 
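to make the individual-level description concrete, the short sketch below simulates one trajectory of this markov chain on a given contact network (rather than under the homogeneous-mixing assumption), in discrete time, with per-contact transmission probability β and per-step recovery probability µ. it is only a minimal illustration of the process whose 3^n-state transition structure makes exact analysis impractical; the parameter values are arbitrary.

```python
import numpy as np

def simulate_individual_sir(A, beta=0.05, mu=0.1, init_infected=(0,), steps=200, seed=1):
    """One realisation of the individual-level SIR chain on adjacency matrix A.

    state[i] is 0 (S), 1 (I) or 2 (R); the full configuration lives in {S, I, R}^n.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    state = np.zeros(n, dtype=int)
    state[list(init_infected)] = 1
    counts = []
    for _ in range(steps):
        infected = state == 1
        pressure = A @ infected.astype(float)        # infectious neighbours per node
        p_infect = 1.0 - (1.0 - beta) ** pressure    # at least one successful contact
        new_inf = (state == 0) & (rng.random(n) < p_infect)
        new_rec = infected & (rng.random(n) < mu)
        state[new_inf] = 1
        state[new_rec] = 2
        counts.append(((state == 0).sum(), (state == 1).sum(), (state == 2).sum()))
        if not (state == 1).any():                   # the epidemic has died out
            break
    return np.array(counts)                          # columns: S(t), I(t), R(t)
```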
there is growing evidence that such thresholds exist in real-world host-pathogen systems, and intervention strategies are formulated and executed based on estimates of the epidemic threshold. (dallas et al., 2018; shulgin et al., 1998; wallinga et al., 2005; pourbohloul et al., 2005; meyers et al., 2005) . fittingly, the last two decades have seen a significant emphasis on studying epidemic thresholds of contact networks from several disciplines, such as computer science, physics, and epidemiology (newman, 2002; wang et al., 2003; colizza and vespignani, 2007; chakrabarti et al., 2008; gómez et al., 2010; wang et al., 2016 . see leitch et al. (2019) for a complete survey on the topic of epidemic thresholds. concurrently but separately, network data has rapidly emerged as a significant area in statistics. over the last two decades, a substantial amount of methodological advancement has been accomplished in several topics in this area, such as community detection (bickel and chen, 2009; zhao et al., 2012; rohe et al., 2011; sengupta and chen, 2015) , model fitting and model selection (hoff et al., 2002; handcock et al., 2007; krivitsky et al., 2009; wang and bickel, 2017; yan et al., 2014; bickel and sarkar, 2016; sengupta and chen, 2018) , hypothesis testing (ghoshdastidar and von luxburg, 2018; tang et al., 2017a,b; bhadra et al., 2019) , and anomaly detection (zhao et al., 2018; sengupta, 2018; komolafe et al., 2019) , to name a few. the state-of-the-art toolbox of statistical network inference includes a range of random graph models and a suite of estimation and inference techniques. however, there has not been any work at the intersection of these two areas, in the sense that the problem of estimating epidemic thresholds has not been investigated from the perspective of statistical network inference. furthermore, the task of computing the epidemic threshold based on existing results can be computationally infeasible for massive networks. in this paper, we address these gaps by developing a novel sampling-based method to estimate the epidemic threshold under the widely used chung-lu model (aiello et al., 2000) , also known as the configuration model. we prove that our proposed method has theoretical guarantees for both statistical accuracy and computational efficiency. we also provide empirical results demonstrating our method on both synthetic and real-world networks. the rest of the paper is organized as follows. in section 2, we formally set up the prob-lem statement and formulate our proposed methods for approximating the epidemic threshold. in section 3, we desribe the theoretical properties of our estimators. in section 4, we report numerical results from synthetic as well as real-world networks. we conclude the paper with discussion and next steps in section 5. definition and description λ(a) spectral radius of the matrix a d i degree of the node i of the network δ i expected degree of the node i of the network s(t), i(t), r(t) number of susceptible (s), infected (i), and recovered/removed (r) individuals in the population at time t β infection rate: probability of transmission of a pathogen from an infected individual to a susceptible individual per effective contact (e.g. contact per unit time in continuous-time models, or per time step in discrete-time models) µ recovery rate: probability that an infected individual will recover per unit time (in continuous-time models) or per time step (in discrete-time models) consider a set of n individuals labelled as 1, . . . 
, n, and an undirected network (with no self-loops) representing interactions between them. this can be represented by an n-by-n symmetric adjacency matrix a, where a(i, j) = 1 if individuals i and j interact and a(i, j) = 0 otherwise. consider a pathogen spreading through this contact network according to an sir model. from existing work (chakrabarti et al., 2008; gómez et al., 2010; prakash et al., 2010; wang et al., 2016), we know that the boundary condition for the pathogen to become an epidemic is given by β/µ > 1/λ(a), (1) where λ(a) is the spectral radius of the adjacency matrix a. the left hand side of equation (1) is the ratio of the infection rate to the recovery rate, which is purely a function of the pathogen and independent of the network. as this ratio grows larger, an epidemic becomes more likely, as new infections outpace recoveries. the right hand side of equation (1) is the inverse of the spectral radius of the adjacency matrix, which is purely a function of the network and independent of the pathogen. the larger the spectral radius, the more connected the network and the smaller this threshold, and therefore an epidemic becomes more likely. thus, the boundary condition in equation (1) connects the two aspects of the contagion process: the pathogen transmissibility, which is quantified by β/µ, and the social contact network, which is quantified by the spectral radius. if β/µ < 1/λ(a), the pathogen dies out, and if β/µ > 1/λ(a), the pathogen becomes an epidemic. given a social contact network, the inverse of the spectral radius of its adjacency matrix represents the epidemic threshold for the network. any pathogen whose transmissibility ratio is greater than this threshold is going to cause an epidemic, whereas any pathogen whose transmissibility ratio is less than this threshold is going to die out. therefore, a key problem in network epidemiology is to compute the spectral radius of the social contact network. realistic urban social networks that are used in modeling contagion processes have millions of nodes (eubank et al., 2004; barrett et al., 2008). to compute the epidemic threshold of such networks, we need to find the largest (in absolute value) eigenvalue of the adjacency matrix a. this is challenging for two reasons. 1. first, from a computational perspective, eigenvalue algorithms have computational complexity of ω(n²) or higher. for massive social contact networks with millions of nodes, this can become too burdensome. 2. second, from a statistical perspective, eigenvalue algorithms require the entire adjacency matrix for the full network of n individuals. it can be challenging or expensive to collect interaction data on n individuals of a massive population (e.g., an urban metropolis). furthermore, eigenvalue algorithms typically require the full matrix to be stored in the random-access memory of the computer, which can be infeasible for massive social contact networks that are too large to be stored. the first issue could be resolved if we could compute the epidemic threshold in a computationally efficient manner. the second issue could be resolved if we could compute the epidemic threshold using data on only a small subset of the population. in this paper, we aim to resolve both issues by developing two approximation methods for computing the spectral radius. to address these problems, let us look at the spectral radius, λ(a), from the perspective of random graph models. the statistical model is given by a ∼ p, which is short-hand for a(i, j) ∼ bernoulli(p(i, j)) for 1 ≤ i < j ≤ n.
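as a concrete illustration of the boundary condition in equation (1), the short python sketch below computes the spectral radius of an adjacency matrix and checks whether a pathogen with given rates β and µ exceeds the resulting threshold 1/λ(a). it is exactly the brute-force computation that the rest of the paper seeks to avoid for massive networks; the example network, the parameter values, and the function names are illustrative assumptions, not part of the original paper.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def epidemic_threshold(A):
    """Return 1/lambda(A), the epidemic threshold of the contact network A."""
    A = sp.csr_matrix(A, dtype=float)
    # largest eigenvalue of the symmetric, nonnegative adjacency matrix
    lam = eigsh(A, k=1, which="LA", return_eigenvectors=False)[0]
    return 1.0 / lam

def causes_epidemic(beta, mu, A):
    """True if beta/mu exceeds the network's epidemic threshold (equation (1))."""
    return beta / mu > epidemic_threshold(A)

# toy example: a sparse random contact network on n individuals
rng = np.random.default_rng(42)
n = 2000
upper = np.triu(rng.random((n, n)) < 0.005, k=1)
A = (upper | upper.T).astype(float)          # symmetric, no self-loops

print("threshold 1/lambda(A):", epidemic_threshold(A))
print("epidemic for beta=0.02, mu=0.20?", causes_epidemic(0.02, 0.20, A))
```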
under the model a ∼ p, λ(a) converges to λ(p) in probability under some mild conditions (chung and radcliffe, 2011; benaych-georges et al., 2019; bordenave et al., 2020). to make a formal statement regarding this convergence, we reproduce below a slightly paraphrased version (for notational consistency) of an existing result in this context. lemma 1 (theorem 1 of chung and radcliffe (2011)). let ∆ be the maximum expected degree, and suppose that for some ε > 0, ∆ > (4/9) log(2n/ε) for sufficiently large n. then with probability at least 1 − ε, for sufficiently large n, |λ(a) − λ(p)| ≤ 2 √(∆ log(2n/ε)). to make note of a somewhat subtle point: from an inferential perspective it is tempting to view the above result as a consistency result, where λ(p) is the population quantity or parameter of interest and λ(a) is its estimator. however, in the context of epidemic thresholds, we are interested in the random variable λ(a) itself, as we want to study the contagion spread conditional on a given social contact network. therefore, in the present context, the above result should not be interpreted as a consistency result. rather, we can use the convergence result in a different way. for massive networks, the random variable λ(a), which we wish to compute but find it infeasible to do so, is close to the parameter λ(p). suppose we can find a random variable t(a) which also converges in probability to λ(p), and is computationally efficient. since t(a) and λ(a) both converge in probability to λ(p), we can use t(a) as an accurate proxy for λ(a). this would address the first of the two issues described at the beginning of this subsection. furthermore, if t(a) can be computed from a small subset of the data, that would also solve the second issue. this is our central heuristic, which we are going to formalize next. so far, we have not made any structural assumptions on p; we have simply considered the generic inhomogeneous random graph model. under such a general model, it is very difficult to formulate a statistic t(a) which is cheap to compute and converges to λ(p). therefore, we now introduce a structural assumption on p, in the form of the well-known chung-lu model that was introduced by aiello et al. (2000) and subsequently studied in many papers (chung and lu, 2002; chung et al., 2003; decreusefond et al., 2012; pinar et al., 2012; zhang et al., 2017). for a network with n nodes, let δ = (δ_1, . . . , δ_n) be the vector of expected degrees. then under the chung-lu model, p(i, j) = δ_i δ_j / Σ_k δ_k for 1 ≤ i < j ≤ n. (2) this formulation preserves e[d_i] = δ_i, where d_i is the degree of the i-th node, and is very flexible with respect to degree heterogeneity. under model (2), note that rank(p) = 1, and we have λ(p) = Σ_i δ_i² / Σ_i δ_i. (3) recall that we are looking for some computationally efficient t(a) which converges in probability to λ(p). we now know that under the chung-lu model, λ(p) is equal to the ratio of the second moment to the first moment of the degree distribution. therefore, a simple estimator of λ(p) is given by the sample analogue of this ratio, i.e., t_1(a) = Σ_i d_i² / Σ_i d_i. we now want to demonstrate that approximating λ(a) by t_1(a) provides us with very substantial computational savings with little loss of accuracy. the approximation error can be quantified as e_1(a) = |t_1(a) − λ(a)| / λ(a), (4) and our goal is to show that e_1(a) → 0 in probability, while the computational cost of t_1(a) is much smaller than that of λ(a). we will show this both from a theoretical perspective and an empirical perspective.
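the simulation study summarized next is easy to reproduce in outline. the sketch below samples a chung-lu-type graph with p(i, j) = θ_i θ_j, computes the moment-based statistic t_1(a) from the observed degrees, and compares it with the spectral radius λ(a). it is an illustrative reconstruction rather than the authors' original code; a smaller n is used here only to keep the dense sampling step light.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(7)

def sample_chung_lu(theta):
    """Sample an undirected graph with independent edges, P(i, j) = theta_i * theta_j."""
    n = len(theta)
    prob = np.outer(theta, theta)
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return (upper | upper.T).astype(float)        # symmetric adjacency, zero diagonal

def t1(A):
    """Moment-based approximation: sum of squared degrees over sum of degrees."""
    d = A.sum(axis=1)
    return float((d ** 2).sum() / d.sum())

n = 2000                                          # the paper uses n = 5000 and 10000
theta = rng.uniform(0.0, 0.25, size=n)
A = sample_chung_lu(theta)
lam = eigsh(sp.csr_matrix(A), k=1, which="LA", return_eigenvectors=False)[0]
print("lambda(A) =", lam, "   T1(A) =", t1(A))    # the two values should be close
```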
we next describe the empirical results from a simulation study, and we postpone the theoretical discussion to section 3 for organizational clarity. we used n = 5000, 10000, and constructed a chung-lu random graph model where p(i, j) = θ_i θ_j. the model parameters θ_1, . . . , θ_n were uniformly sampled from (0, 0.25). then, we randomly generated 100 networks from the model, and computed λ(a) and t_1(a). the results are reported in table 2. the average runtime for the moment-based estimator, t_1(a), is only 0.07 seconds for n = 5000 and 0.35 seconds for n = 10000, whereas for the spectral radius, λ(a), it is 78.2 seconds and 606.44 seconds respectively, which makes the latter 1100-1700 times more computationally burdensome. the average error for t_1(a) is very small, and so is the sd of the errors. thus, even for moderately sized networks where n = 5000 or n = 10000, using t_1(a) as a proxy for λ(a) can reduce the computational cost to a great extent, and the corresponding loss in accuracy is very small. for massive networks where n is in the millions, this advantage of t_1(a) over λ(a) is even greater; however, the computational burden for λ(a) becomes so large that this case is difficult to illustrate using standard computing equipment. thus, t_1(a) provides us with a computationally efficient and statistically accurate method for finding the epidemic threshold. the first approximation, t_1(a), provides us with a computationally efficient method for finding the epidemic threshold. this addresses the first issue pointed out at the beginning of section 2.1. however, computing t_1(a) requires data on the degrees of all n nodes of the network. therefore, this does not solve the second issue pointed out at the beginning of section 2.1. we now propose a second alternative, t_2, to address the second issue. the idea behind this approximation is based on the same heuristic that was laid out in section 2.2. since λ(p) is a function of degree moments, we can estimate these moments using observed node degrees. in defining t_1(a), we used the observed degrees of all n nodes in the network. however, we can also estimate the degree moments by considering a small sample of nodes, based on random walk sampling. the algorithm for computing t_2 is given in algorithm 1.
algorithm 1 randomwalkestimate
1: procedure estimate(g, r, t*)
2: x ← 1, t ← 1.
3: while t ≤ t* do
4: x ← random neighbor of x, chosen uniformly; t ← t + 1.
5: v ← 0, i ← 1.
6: while i ≤ r do
7: v ← v + d_x, where d_x is the degree of the current node x.
8: x ← random neighbor of x, chosen uniformly; i ← i + 1.
9: return t_2 = v/r.
note that we only use (t* + r) randomly sampled nodes for computing t_2, which implies that we do not need to collect or store data on all n individuals. therefore this method overcomes the second issue pointed out at the beginning of section 2.1. the approximation error arising from this method can be defined as e_2(a) = |t_2(a) − λ(a)| / λ(a), (5) and we want to show that e_2(a) → 0 in probability, while the data-collection cost of t_2(a) is much less than that of t_1(a). in the next section, we are going to formalize this. in this section, we are going to establish that the approximation errors e_1(a) and e_2(a), defined in equations (4) and (5), converge to zero in probability. from theorem 2.1 of chung et al. (2003), we know that when the condition Σ_i δ_i² / Σ_i δ_i > √∆ log n (6) holds, then for any ε > 0, p(|λ(a)/λ(p) − 1| > ε) → 0 as n → ∞. therefore, under (6), it suffices to show that, for any ε > 0, p(|t_1(a)/λ(p) − 1| > ε) → 0 as n → ∞. (7) we would like to show that, under reasonable conditions, (7) holds; to this end we will show that, for any ε > 0, p(|m_1/e[m_1] − 1| > ε) → 0 and p(|m_2/e[m_2] − 1| > ε) → 0 as n → ∞, (8) where m_1 = Σ_i d_i and m_2 = Σ_i d_i² denote the first two empirical degree moments. we first prove that (8) implies (7).
suppose equation (8) holds, so that with probability tending to one we have (1 − ε) e[m_1] ≤ m_1 ≤ (1 + ε) e[m_1] and (1 − ε) e[m_2] ≤ m_2 ≤ (1 + ε) e[m_2]. note that m_2/m_1 is a strictly increasing function of m_2 and a strictly decreasing function of m_1. therefore, for outcomes belonging to the above event, ((1 − ε)/(1 + ε)) · e[m_2]/e[m_1] ≤ m_2/m_1 ≤ ((1 + ε)/(1 − ε)) · e[m_2]/e[m_1]. note that 1 − (1 − ε)/(1 + ε) = 2ε/(1 + ε) < 2ε, and (1 + ε)/(1 − ε) − 1 = 2ε/(1 − ε) < 4ε, given that ε < 1/2. now, fix ε > 0 and let ε' = ε/4. then, applying (8) with ε' in place of ε shows that t_1(a) = m_2/m_1 lies within a factor (1 ± ε) of e[m_2]/e[m_1] with probability tending to one; since e[m_2]/e[m_1] = (1 + o(1)) λ(p) when the average expected degree diverges, this yields (7). thus, proving (8) is sufficient for proving (7). next, we state and prove the theorem which will establish (8). theorem 2. if the average of the expected degrees goes to infinity, i.e., (1/n) Σ_i δ_i → ∞, and the spectral radius dominates log²(n), i.e., Σ_i δ_i² / Σ_i δ_i = ω(log² n), then for any ε > 0, the convergence statements in (8) hold. proof. we will use hoeffding's inequality (hoeffding, 1994) for the first part, and we begin by stating the inequality for the sum of bernoulli random variables. let b_1, . . . , b_m be m independent (but not necessarily identically distributed) bernoulli random variables, and s_m = Σ_{i=1}^m b_i. then for any t > 0, p(|s_m − e[s_m]| ≥ t) ≤ 2 exp(−2t²/m). in our case, m_1 = Σ_i d_i = 2 Σ_{1≤i<j≤n} a(i, j), and we know that {a(i, j) : 1 ≤ i < j ≤ n} are independent bernoulli random variables. fix ε > 0 and note that e[m_1] = Σ_i δ_i. recall that the normalized laplacian of g is l = i − d^{−1/2} a d^{−1/2}; it follows from the above that the spectral gap of the walk satisfies ε(q) = 1 − λ_2(q) = 1 − λ_2(d^{−1/2} a d^{−1/2}) = λ_{n−1}(l) = 1 − o(1). putting these together, we get the following corollary on the total number of node queries. corollary 6.1. for a graph generated from the expected degrees model, with probability 1 − 1/n, the number of nodes that algorithm 1 needs to query is bounded by a quantity of order 6 d_max/d_min; this is a loose bound, and better bounds can be derived for power-law degree distributions, for instance. thus, we have proved that the approximation error for t_2(a) goes to zero in probability. in addition, corollary 6.1 shows that the number of nodes that we need to query in order to have an accurate approximation is much smaller than n. furthermore, computing t_2 only requires node sampling and counting degrees, and therefore the runtime is much smaller than that of eigenvalue algorithms. therefore, t_2(a) is a computationally efficient and statistically accurate approximation of the epidemic threshold, while also requiring a much smaller data budget compared to t_1(a). in this section, we characterize the empirical performance of our sampling algorithm on two synthetic networks, one generated from the chung-lu model and the second generated from the preferential attachment model of barabási and albert.
table 3: statistics of the two synthetic datasets used.
data | nodes | edges | λ(a) | t_1(a)
chung-lu | 50k | 72k | 43.83 | 48.33
pref-attach | 50k | 250k | 37 | 32.8
our first dataset is a graph generated from the chung-lu model of expected degrees. we generated a power-law degree sequence (i.e., the fraction of nodes with degree d is proportional to d^(−β)) with exponent β = 2.5 and then generated a graph with this sequence as the expected degrees.
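the sampling procedure evaluated in this section can be sketched directly from algorithm 1. the python version below assumes an adjacency-list representation of a connected graph; it relies on the standard fact that a simple random walk visits nodes with probability proportional to their degree, so the average degree observed along the walk estimates Σ_i d_i² / Σ_i d_i. the thinning used in the experiments (sampling every 10th node) is omitted for brevity, and the parameter values are illustrative choices rather than the exact experimental settings.

```python
import random

def random_walk_estimate(adj, r=2000, t_star=1000, seed=3):
    """Estimate T2, the average degree seen by a random walk (cf. Algorithm 1).

    adj    : dict mapping each node to a list of its neighbours (connected graph)
    r      : number of steps whose degrees are averaged
    t_star : burn-in steps discarded so the walk can approach stationarity
    """
    rng = random.Random(seed)
    x = next(iter(adj))                 # arbitrary starting node
    for _ in range(t_star):             # burn-in: let the walk mix
        x = rng.choice(adj[x])
    v = 0.0
    for _ in range(r):
        v += len(adj[x])                # degree of the current node
        x = rng.choice(adj[x])
    return v / r                        # approaches sum(d_i^2) / sum(d_i)

# usage sketch: adj can be built from the edge list of one of the synthetic graphs;
# only t_star + r nodes are ever touched, a small fraction of the whole network
```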
these samples were then used to calculate t 2 (a). this experiment was repeated 10 times. these gave estimates t 1 2 , . . . , t 10 2 . we then calculate two relative errors ∀i ∈ {1, 2, . . . , 10}, we plot the averages of { t 1−t 2 i } and { λ−t 2 i } against the actual number of nodes seen by the random walk. note that the x-axis accurately reflect how many times the algorithm actually queried the network, not just the number of samples used. measuring the cost of uniform node sampling in this setting, for instance, would need to keep track of how many nodes are touched by a metropolis-hastings walk that implements the uniform distribution. figure 1 demonstrates the results. for the two synthetic networks, the algorithm is able to get a 10% approximation to the statistic t 1 (a) by exploring at most 10% of the network. with more samples from the random walk, the mean relative errors settle to around 4-5%. however, once we measure the mean relative errors with respect to λ(a), it becomes clearer that the estimator t 2 (a) does better when the graph is closer to the assumed (i.e. chung-lu) model. for the chung-lu graph, the mean error λ−t 2 essentially is very similar to t 1−t 2 , which is to be expected. for the preferential attachment graph too, it is clear that the estimate t 2 is able to achieve a better than 10% relative error approximation of λ(a). note that, if we were instead counting only the nodes whose degrees were actually used for estimation, the fraction of network used would be roughly 1 − 2% in all the cases, the majority of the node cost actually goes in making the random walk mix. in this work, we investigated the problem of computing sir epidemic thresholds of social contact networks from the perspective of statistical inference. we considered the two challenges that arise in this context, due to high computational and data-collection complexity of the spectral radius. for the chung-lu network generative model, the spectral radius can be characterized in terms of the degree moments. we utilized this fact to develop two approximations of the spectral radius. the first approximation is computationally efficient and statistically accurate, but requires data on observed degrees of all nodes. the second approximation retains the computationally efficiency and statistically accuracy of the first approximation, while also reducing the number of queries or the sample size quite substantially. the results seem very promising for networks arising from the chung-lu and preferential attachment generative models. there are several interesting and important future directions. the methods proposed in this paper have provable guarantees only under the chung-lu model, although it works very well under the preferential attachment model. this seems to indicate that the degree based approximation might be applicable to a wider class of models. on the other hand, this leaves open the question of developing a better "model-free" estimator, as well as asking similar questions about other network features. in this work we only considered the problem of accurate approximation of the epidemic threshold. from a statistical as well as a real-world perspective, there are several related inference questions. these include uncertainty quantification, confidence intervals, onesample and two-sample testing, etc. social interaction patterns vary dynamically over time, and such network dynamics can have significant impacts on the contagion process leitch et al. (2019) . 
in this paper we only considered static social contact networks, and in future we hope to study epidemic thresholds for time-varying or dynamic networks. we do realize that in the face of the current pandemic, while it is important to pursue research relevant to it, it is also important to be responsible in following the proper scientific process. we would like to state that in this work, the question of epidemic threshold estimation has been formalized from a theoretical viewpoint in a much used, but simple, random graph model. we are not yet at a position to give any guarantees about the performance of our estimator in real social networks. we do hope, however, that the techniques developed here can be further refined to work to give reliable estimators in practical settings. a random graph model for massive graphs emergence of scaling in random networks emergence of scaling in random networks episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks largest eigenvalues of sparse inhomogeneous erdős-rényi graphs using mobile phone data to predict the spatial spread of cholera a bootstrap-based inference framework for testing similarity of paired networks a nonparametric view of network models and newman-girvan and other modularities hypothesis testing for automated community detection in networks spectral radii of sparse random matrices. annales de l'institut henri poincare (b) probability and statistics mathematical models in population biology and epidemiology epidemic thresholds in real networks the effect of travel restrictions on the spread of the 2019 novel coronavirus the average distances in random graphs with given expected degrees eigenvalues of random power law graphs on the spectra of general random graphs. the electronic journal of combinatorics invasion threshold in heterogeneous metapopulation networks experimental evidence of a pathogen invasion threshold large graph limit for an sir process in random network with heterogeneous connectivity modelling disease outbreaks in realistic urban social networks dimensions of superspreading practical methods for graph two-sample testing discretetime markov chain approach to contact-based disease spreading in complex networks model-based clustering for social networks the mathematics of infectious diseases probability inequalities for sums of bounded random variables latent space approaches to social network analysis clinical features of patients infected with 2019 novel coronavirus in wuhan, china. the lancet the implications of network structure for epidemic dynamics containing papers of a mathematical and physical character contributions to the mathematical theory of epidemics. ii.the problem of endemicity contributions to the mathematical theory of epidemics. iii.further studies of the problem of endemicity statistical evaluation of spectral methods for anomaly detection in static networks spatial spread of the west africa ebola epidemic representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models toward epidemic thresholds on temporal networks: a review and open questions chernoff-type bound for finite markov chains network theory and sars: predicting outbreak diversity spread of epidemic disease on networks epidemic processes in complex networks the similarity between stochastic kronecker and chung-lu graph models modeling control strategies of respiratory pathogens got the flu (or mumps)? 
check the eigenvalue! simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts spectral clustering and the high-dimensional stochastic blockmodel anomaly detection in static networks using egonets spectral clustering in heterogeneous networks. statistica sinica a block model for node popularity in networks with community structure pulse vaccination strategy in the sir epidemic model early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. the lancet digital health a semiparametric two-sample hypothesis testing problem for random graphs a nonparametric two-sample hypothesis testing problem for random graphs reproduction numbers and subthreshold endemic equilibria for compartmental models of disease transmission a measles epidemic threshold in a highly vaccinated population a novel coronavirus outbreak of global health concern predicting the epidemic threshold of the susceptible-infected-recovered model unification of theoretical approaches for epidemic spreading on complex networks epidemic spreading in real networks: an eigenvalue viewpoint likelihood-based model selection for stochastic block models heterogeneities in the transmission of infectious agents: implications for the design of control programs model selection for degree-corrected block models random graph models for dynamic networks performance evaluation of social network anomaly detection using a moving windowbased scan method consistency of community detection in networks under degree-corrected stochastic block models a novel coronavirus from patients with pneumonia in china key: cord-206872-t6lr3g1m authors: huang, huawei; kong, wei; zhou, sicong; zheng, zibin; guo, song title: a survey of state-of-the-art on blockchains: theories, modelings, and tools date: 2020-07-07 journal: nan doi: nan sha: doc_id: 206872 cord_uid: t6lr3g1m to draw a roadmap of current research activities of the blockchain community, we first conduct a brief overview of state-of-the-art blockchain surveys published in the recent 5 years. we found that those surveys are basically studying the blockchain-based applications, such as blockchain-assisted internet of things (iot), business applications, security-enabled solutions, and many other applications in diverse fields. however, we think that a comprehensive survey towards the essentials of blockchains by exploiting the state-of-the-art theoretical modelings, analytic models, and useful experiment tools is still missing. to fill this gap, we perform a thorough survey by identifying and classifying the most recent high-quality research outputs that are closely related to the theoretical findings and essential mechanisms of blockchain systems and networks. several promising open issues are also summarized finally for future research directions. we wish this survey can serve as a useful guideline for researchers, engineers, and educators about the cutting-edge development of blockchains in the perspectives of theories, modelings, and tools. blockchains have been deeply diving into multiple applications that are closely related to every aspect of our daily life, such as cryptocurrencies, business applications, smart city, internet-of-things (iot) applications, and etc. in the following, before discussing the motivation of this survey, we first conduct a brief exposition of the state-of-the-art blockchain survey articles published in the recent few years. 
to identify the position of our survey, we first collect 66 state-of-the-art blockchain-related survey articles. the number of surveys in each category is shown in fig. 1. we see that the top-three popular topics of blockchain-related surveys are iot & iiot, consensus protocols, and security & privacy. we also classify those existing surveys and their chronological distribution in fig. 2, from which we discover that i) the number of surveys published each year increases dramatically, and ii) the diversity of topics also becomes greater over time. in detail, we summarize the publication years, topics, and other metadata of these surveys in table 1 and table 2. basically, those surveys can be classified into the following 7 groups, which are briefly reviewed as follows.

1.1.1 blockchain essentials. the first group is related to the essentials of the blockchain. a large number of consensus protocols, algorithms, and mechanisms have been reviewed and summarized in [1] [2] [3] [4] [5] [6] [7] [8]. for example, motivated by the lack of a comprehensive literature review regarding consensus protocols for blockchain networks, wang et al. [3] emphasized both the system design and the incentive mechanisms behind distributed blockchain consensus protocols such as byzantine fault tolerant (bft)-based protocols and nakamoto protocols. from a game-theoretic viewpoint, the authors also studied how such consensus protocols affect the consensus participants in blockchain networks. among the surveys of smart contracts [9] [10] [11], atzei et al. [9] paid attention to the security vulnerabilities and programming pitfalls that can be incurred in ethereum smart contracts. dwivedi et al. [10] performed a systematic taxonomy of smart-contract languages, while zheng et al. [11] conducted a survey on the challenges, recent technical advances and typical platforms of smart contracts. sharding techniques are viewed as promising solutions to the scalability and low-performance problems of blockchains. several survey articles [12, 13] provide systematic reviews of sharding-based blockchain techniques. for example, wang et al. [12] focused on the general design flow and critical design challenges of sharding protocols. next, yu et al. [13] mainly discussed the intra-consensus security, the atomicity of cross-shard transactions, and other advantages of sharding mechanisms. regarding scalability, chen et al. [14] analyzed the scalability technologies in terms of efficiency improvement and function extension of blockchains, while zhou et al. [15] compared and classified the existing scalability solutions.

the next group concerns the integration of blockchains with artificial intelligence (ai) and machine learning, which plays an important role in the performance, security, and health of blockchain systems and networks. for example, salah et al. [26] studied how blockchain technologies benefit key problems of ai. zheng et al. [27] proposed the concept of blockchain intelligence and pointed out the opportunities for these two fields to benefit each other. next, chen et al. [28] discussed the privacy-preserving and secure design of machine learning when blockchain techniques are introduced. liu et al. [29] reviewed the opportunities and applications of integrating blockchains and machine learning technologies in the context of communications and networking.
recently, game theoretical solutions [30] have also been reviewed as applied to blockchain security issues such as malicious attacks and selfish mining, as well as to resource allocation in the management of mining. both the advantages and disadvantages of game theoretical solutions and models were discussed.

the next group covers the combination of blockchains with cloud computing, edge computing, and networking. first, park et al. [31] discussed how to take advantage of blockchains in cloud computing with respect to security solutions. xiong et al. [32] then investigated how to facilitate blockchain applications in mobile iot and edge computing environments. yang et al. [33] identified various perspectives, including motivations, frameworks, and functionalities, when integrating blockchain with edge computing. nguyen et al. [34] presented a comprehensive survey on blockchain meeting 5g networks and beyond. the authors focused on the opportunities that blockchain can bring for 5g technologies, which include cloud computing, mobile edge computing, sdn/nfv, network slicing, d2d communications, 5g services, and 5g iot applications.

table 2. taxonomy of existing blockchain-related surveys (part 2). columns: category | ref. | year | topic
iot, iiot | christidis [35] | 2016 | blockchains and smart contracts for iot
iot, iiot | ali [36] | 2018 | applications of blockchains in iot
iot, iiot | fernandez [37] | 2018 | usage of blockchain for iot
iot, iiot | kouicem [38] | 2018 | iot security
iot, iiot | panarello [39] | 2018 | integration of blockchain and iot
iot, iiot | dai [40] | 2019 | blockchain for iot
iot, iiot | wang [41] | 2019 | blockchain for iot
iot, iiot | nguyen [42] | 2019 | integration of blockchain and cloud of things
iot, iiot | restuccia [43] | 2019 | blockchain technology for iot
iot, iiot | cao [44] | 2019 | challenges in distributed consensus of iot
iot, iiot | park [45] | 2020 | blockchain technology for green iot
iot, iiot | lao [46] | 2020 | iot applications in blockchain systems
iot, iiot | alladi [47] | 2019 | blockchain applications in industry 4.0 and iiot
iot, iiot | zhang [48] | 2019 | 5g beyond for iiot based on edge intelligence and blockchain
uav | alladi [49] | 2020 | blockchain-based uav applications
group-6: general applications | lu [50] | 2018 | functions, applications and open issues of blockchain
group-6: general applications | casino [51] | 2019 | current status, classification and open issues of blockchain apps
agriculture | bermeo [52] | 2018 | blockchain technology in agriculture
agriculture | ferrag [53] | 2020 | blockchain solutions to security and privacy for green agriculture
sdn | alharbi [54] | 2020 | deployment of blockchains for software defined networks
business apps | konst. [55] | 2018 | blockchain-based business applications
smart city | xie [56] | 2019 | blockchain technology applied in smart cities
smart grids | alladi [57] | 2019 | blockchain in use cases of smart grids
smart grids | aderibole [58] | 2020 | smart grids based on blockchain technology
file systems | huang [59] | 2020 | blockchain-based distributed file systems, ipfs, filecoin, etc.
space industry | torky [60] | 2020 | blockchain in space industry
covid19 | nguyen [61] | 2020 | combat covid-19 using blockchain and ai-based solutions
general & outlook | yuan [62] | 2016 | the state of the art and future trends of blockchain
general & outlook | zheng [63] | 2017 | architecture, consensus, and future trends of blockchains
general & outlook | zheng [64] | 2018 | challenges and opportunities of blockchain
general & outlook | yuan [65] | 2018 | blockchain and cryptocurrencies
general & outlook | kolb [66] | 2020 | core concepts, challenges, and future directions in blockchains

1.1.5 iot & iiot. the blockchain-based applications for the internet of things (iot) [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] and the industrial internet of things (iiot) [47, 48] have received the largest amount of attention from both academia and industry. for example, as a pioneer work in this category, christidis et al.
[35] provided a survey about how blockchains and smart contracts promote iot applications. later on, nguyen et al. [42] presented an investigation of the integration between blockchain technologies and the cloud of things, with in-depth discussion of backgrounds, motivations, concepts and architectures. recently, park et al. [45] emphasized the topic of introducing blockchain technologies to the sustainable ecosystem of green iot. for the iiot, zhang et al. [48] discussed the integration of blockchain and edge intelligence to empower a secure iiot framework in the context of 5g and beyond. in addition, when applying blockchains to unmanned aerial vehicles (uav), alladi et al. [49] reviewed numerous application scenarios covering both commercial and military domains such as network security, surveillance, etc.

the research areas covered by the existing surveys on blockchain-based applications include general applications [50, 51], agriculture [52, 53], software-defined networking (sdn) [54], business applications [55], smart city [56], smart grids [57, 58], distributed file systems [59], space industry [60], and covid-19 [61]. some of those surveys are reviewed as follows. lu et al. [50] performed a literature review on the fundamental features of blockchain-enabled applications; through the review, the authors aimed to outline the development trajectory of blockchain technologies. then, casino et al. [51] presented a systematic survey of blockchain-enabled applications in the context of multiple sectors and industries. both the current status and the prospective characteristics of blockchain technologies were identified.

summary of the survey-article review: through the brief review of the state-of-the-art surveys, we have found that blockchain technologies have been adaptively integrated into a growing range of application sectors. the blockchain theory and technology will bring substantial innovations, incentives, and a great number of application scenarios in diverse fields. based on the analysis of those survey articles, we believe that there will be more survey articles published in the near future, very likely in the areas of sharding techniques, scalability, interoperability, smart contracts, big data, ai technologies, 5g and beyond, edge computing, cloud computing, and many other fields.

in summary, based on the overview shown in table 1, table 2, fig. 1 and fig. 2, we would like to fill the gap described above with this article by emphasizing the cutting-edge theoretical studies, modelings, and useful tools for blockchains. particularly, we try to include the latest high-quality research outputs that have not been included by other existing survey articles. we believe that this survey can shed new light on the further development of blockchains. our survey presented in this article includes the following contributions.
• we conduct a brief classification of existing blockchain surveys to highlight the positioning of our literature review.
• we then present a comprehensive investigation of the state-of-the-art theoretical modelings, analytic models, performance measurements, and useful experiment tools for blockchains, blockchain networks, and blockchain systems.
• several promising directions and open issues for future studies are also envisioned.
the structure of this survey is shown in fig. 3 and organized as follows. section 2 introduces the preliminaries of blockchains.
section 3 summarizes the state-of-the-art theoretical studies that improve the performance of blockchains. in section 4, we then review various modelings and analytic models that help understand blockchains. diverse measurement approaches, datasets, and useful tools for blockchains are overviewed in section 5. we outlook the open issues in section 6. finally, section 7 concludes this article. blockchain is a promising paradigm for content distribution and distributed consensus over p2p networks. in this section, we present the basic concepts, definitions and terminologies of blockchains appeared in this article. manuscript submitted to acm 2.1 prime blockchain platforms 2.1.1 bitcoin. bitcoin is viewed as the blockchain system that executes the first cryptocurrency on this world. it builds upon two major techniques, i.e., nakamoto consensus and utxo model, which are introduced as follows. nakamoto consensus. to achieve an agreement of blocks, bitcoin adopts the nakamoto consensus, in which miners generate new blocks by solving a puzzle. in such a puzzle-solving process, also referred to as mining, miners need to calculate a nonce value that fits the required difficulty level [67] . through changing the difficulty, bitcoin system can maintain a stable rate of block-generation, which is about one block per 10 minutes. when a miner generates a new block, it broadcasts this message to all the other miners in the network. if others receive this new block, they add this block to their local chain. if all of the other miners receive this new block timely, the length of the main chain increases by one. however, because of the network delays, not always all the other miners can receive a new block in time. when a miner generates a block before it receives the previous one, a fork yields. bitcoin addresses this issue by following the rule of longest chain. utxo model. the unspent transaction output (utxo) model is adopted by cryptocurrencies like bitcoin, and other popular blockchain systems [68, 69] . a utxo is a set of digital money, each represents a chain of ownership between the owners and the receivers based on the cryptography technologies. in a blockchain, the overall utxos form a set, in which each element denotes the unspent output of a transaction, and can be used as an input for a future transaction. a client may own multiple utxos, and the total coin of this client is calculated by summing up all associated utxos. using this model, blockchains can prevent the double-spend [70] attacks efficiently. [71] is an open-source blockchain platform enabling the function of smart contract. as the token in ethereum, ether is rewarded to the miners who conducted computation to secure the consensus of the blockchain. ethereum executes on decentralized ethereum virtual machines (evms), in which scripts are running on a network consisting of public ethereum nodes. comparing with bitcoin, the evm's instruction set is believed turing-complete. ethereum also introduces an internal pricing mechanism, called gas. a unit of gas measures the amount of computational effort needed to execute operations in a transaction. thus, gas mechanism is useful to restrain the spam in smart contracts. ethereum 2.0 is an upgraded version based on the original ethereum. the upgrades include a transition from pow to proof-of-stake (pos), and a throughput-improving based on sharding technologies. eosio is another popular blockchain platform released by a company block.one on 2018. 
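before continuing with eosio, the nakamoto puzzle-solving loop described above can be made concrete with a minimal sketch. the block layout, the json serialization, and the leading-zero-bits difficulty encoding below are illustrative assumptions rather than bitcoin's exact wire format; only the shape of the search loop matters here.

```python
import hashlib
import json

def mine_block(prev_hash: str, txs: list, difficulty_bits: int) -> dict:
    """brute-force a nonce until the double-sha256 of the (illustrative) block header
    falls below a target derived from the required number of leading zero bits."""
    target = 1 << (256 - difficulty_bits)  # smaller target = harder puzzle
    nonce = 0
    while True:
        header = json.dumps({"prev": prev_hash, "txs": txs, "nonce": nonce},
                            sort_keys=True).encode()
        digest = hashlib.sha256(hashlib.sha256(header).digest()).hexdigest()
        if int(digest, 16) < target:
            return {"prev": prev_hash, "txs": txs, "nonce": nonce, "hash": digest}
        nonce += 1  # keep changing the nonce, as in the mining process described above

# toy usage with a low difficulty so the loop terminates quickly
block = mine_block("00" * 32, ["alice->bob:1"], difficulty_bits=16)
print(block["nonce"], block["hash"])
```

raising difficulty_bits makes the expected number of attempts grow exponentially, which is how a network can hold the block-generation rate roughly constant as the total hashing power changes.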
different from bitcoin and ethereum, the smart contracts of eosio don't need to pay transaction fees. its throughput is claimed to reach millions of transactions per second. furthermore, eosio also enables low block-confirmatoin latency, low-overhead bft finality, and etc. these excellent features has attracted a large-number of users and developers to quickly and easily deploy decentralized applications in a governed blockchain. for example, in total 89,800,000 eosio blocks have been generated in less than one and a half years since its first launching. the consensus mechanism in blockchains is for fault-tolerant to achieve an agreement on the same state of the blockchain network, such as a single state of all transactions in a cryptocurrency blockchain. popular proof-based consensus protocols include pow and pos. in pow, miners compete with each other to solve a puzzle that is difficult to produce a result but easy to verify the result by others. once a miner yields a required nonce value through a huge number of attempts, it gets paid a certain cryptocurrencies for creating a new block. in contrast, pos doesn't have miners. instead, the new block is forged by validators selected randomly within a committee. the probability to be chosen as a validator is linearly related to the size of its stake. pow and pos are both adopted as consensus protocols for the security of cryptocurrencies. the former is based on the cpu power, and the latter on the coin age. therefore, pos is with lower energy-cost and less likely to be attacked by the 51% attack. blockchain as a distributed and public database of transactions has become a platform for decentralized applications. despite its increasing popularity, blockchain technology faces the scalability problem: throughput does not scale with the increasing network size. thus, scalable blockchain protocols that can solve the scalability issues are still in an urgent need. many different directions, such as off-chain, dag, and sharding techniques, have been exploited to address the scalability of blockchains. here, we present several representative terms related to scalability. mathematically, a dag is a finite directed graph where no directed cycles exist. in the context of blockchain, dag is viewed as a revolutionized technology that can upgrade blockchain to a new generation. this is because dag is blockless, and all transactions link to multiple other transactions following a topological order on a dag network. thus, data can move directly between network participants. this results in a faster, cheaper and more scalable solution for blockchains. in fact, the bottleneck of blockchains mainly relies on the structure of blocks. thus, probably the blockless dag could be a promising solution to improve the scalability of blockchains substantially. technique. the consensus protocol of bitcoin, i.e., nakamoto consensus, has significant drawbacks on the performance of transaction throughput and network scalability. to address these issues, sharding technique is one of the outstanding approaches, which improves the throughput and scalability by partitioning the blockchain network into several small shards such that each can process a bunch of unconfirmed transactions in parallel to generate medium blocks. such medium blocks are then merged together in a final block. basically, sharding technique includes network sharding, transaction sharding and state sharding. 
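as a small illustration of transaction sharding, the following sketch assigns accounts to shards by hashing their addresses, which is the kind of naive address-based allocation criticized later in the discussion of sharded ethereum; the shard count and the address strings are arbitrary assumptions.

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_of(address: str) -> int:
    """map an account address to a shard id by hashing the address (naive sharding)."""
    return int(hashlib.sha256(address.encode()).hexdigest(), 16) % NUM_SHARDS

def is_cross_shard(tx: dict) -> bool:
    """a transaction is cross-shard when its sender and receiver land in different shards."""
    return shard_of(tx["from"]) != shard_of(tx["to"])

txs = [
    {"from": "0xaaa", "to": "0xbbb"},
    {"from": "0xaaa", "to": "0xccc"},
    {"from": "0xddd", "to": "0xddd"},
]
cross = sum(is_cross_shard(t) for t in txs)
print(f"{cross}/{len(txs)} transactions are cross-shard under naive address-based sharding")
```

under such placement many transactions tend to end up cross-shard, which is exactly the inefficiency that the workload-aware schemes discussed later (e.g., garet and optchain) try to reduce.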
one shortcoming of the sharding technique is that malicious network nodes residing in the same shard may collude with each other, resulting in security issues. therefore, sharding-based protocols exploit a reshuffling strategy to address such security threats. however, reshuffling brings cross-shard data migration. thus, how to efficiently handle cross-shard transactions becomes an emerging topic in the context of sharding blockchains.

3.1.1 throughput & latency. aiming to reduce the confirmation latency of transactions to milliseconds, hari et al. [72] proposed a high-throughput, low-latency, deterministic confirmation mechanism called accel for accelerating bitcoin's block confirmation. the key findings of this paper include how to identify singular blocks, and how to use singular blocks to reduce the confirmation delay. once the confirmation delay is reduced, the throughput increases accordingly. two obstacles have hindered the scalability of cryptocurrency systems. the first one is the low throughput, and the other is the requirement for every node to duplicate the communication, storage, and state representation of the entire blockchain network. wang et al. [73] studied how to solve the above obstacles. without weakening decentralization and security, the proposed monoxide technique offers a linear scale-out ability by partitioning the workload, while preserving the simplicity of the blockchain system and amplifying its capacity. the authors also proposed a novel chu-ko-nu mining mechanism, which ensures the cross-zone atomicity, efficiency and security of the system. in another line of work, prism [74] was proposed as a new blockchain protocol aiming to achieve a scalable throughput with the full security of bitcoin. however, the authors also admitted that although the proposed prism has a high throughput, its confirmation latency remains as large as 10 seconds since there is only a single voter chain in prism. a promising solution is to introduce a large number of such voter chains, each of which is not necessarily secure. even if every voter chain is attacked with a probability as high as 30%, the probability of successfully attacking half of all voter chains is still theoretically very low. thus, the authors believed that using multiple voter chains would be a good solution for reducing the confirmation latency while not sacrificing system security. considering that ethereum simply allocates transactions to shards according to their account addresses, rather than relying on the workload or the complexity of transactions, the resource consumption of transactions in each shard is unbalanced. in consequence, the network transaction throughput is affected and becomes low. to solve this problem, woo et al. [75] proposed a heuristic algorithm named garet, which is a gas consumption-aware relocation mechanism for improving throughput in sharding-based ethereum environments. in particular, the proposed garet can relocate the transaction workloads of each shard according to gas consumption. the experiment results show that garet achieves a higher transaction throughput and a lower transaction latency compared with existing techniques.

transactions generated in real time make the size of blockchains keep growing. for example, the storage efficiency of the original-version bitcoin has received much criticism since it requires each bitcoin peer to store the full transaction history. although some revised protocols advocate that only the full-size nodes store the entire copy of the whole ledger, the transactions still consume a large storage space in those full-size nodes.
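as a quick numeric aside before turning to storage, the multiple-voter-chain argument above can be checked with a simple binomial calculation; treating voter chains as independently compromised with probability 0.3 is a simplifying assumption used only for illustration.

```python
from math import comb

def prob_half_compromised(m: int, p: float = 0.3) -> float:
    """probability that at least half of m voter chains are compromised, assuming each
    chain is independently compromised with probability p (an illustrative model)."""
    k_min = (m + 1) // 2
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(k_min, m + 1))

for m in (1, 11, 101, 1001):
    print(f"m={m:<5} p(half the voter chains fall) ≈ {prob_half_compromised(m):.3e}")
```

even under this crude model, the success probability of attacking half of the chains collapses quickly as the number of voter chains grows, matching the intuition given above.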
to alleviate the storage problem noted above, several pioneer studies proposed storage-efficient solutions for blockchain networks. for example, by exploiting an erasure code-based approach, perard et al. [76] proposed a low-storage blockchain mechanism, aiming to achieve a low storage requirement for blockchains. the new low-storage nodes only have to store linearly encoded fragments of each block. the original blockchain data can be easily recovered by retrieving fragments from other nodes under the erasure-code framework. thus, this type of blockchain node allows blockchain clients to reduce their storage capacity. the authors also tested their system on a low-configuration raspberry pi to show its effectiveness, which demonstrates the possibility of running blockchains on iot devices.

table 3. latest theories of improving the performance of blockchains. columns: category | ref. | approach | description
throughput & latency | [72] | reduce confirmation delay | authors proposed a high-throughput, low-latency, deterministic confirmation mechanism, aiming to accelerate bitcoin's block confirmation.
throughput & latency | [73] | monoxide with chu-ko-nu mining | the proposed monoxide offers a linear scale-out by partitioning workloads; particularly, the chu-ko-nu mining mechanism enables the cross-zone atomicity, efficiency and security of the system.
throughput & latency | [74] | prism | authors proposed a new blockchain protocol, i.e., prism, aiming to achieve a scalable throughput with the full security of bitcoin.
throughput & latency | [75] | garet | authors proposed a gas consumption-aware relocation mechanism for improving throughput in sharding-based ethereum.
storage efficiency | [76] | erasure code-based | authors proposed a new type of low-storage blockchain nodes using erasure code theory to reduce the storage space of blockchains.
storage efficiency | [77] | jidar: data-reduction strategy | authors proposed a data reduction strategy for bitcoin, namely jidar, in which each node only has to store the transactions of interest and the related merkle branches from the complete blocks.
storage efficiency | [78] | segment blockchain | authors proposed a data-reduced storage mechanism named segment blockchain such that each node only has to store a segment of the blockchain.
reliability analysis | [79] | availability of blockchains | authors studied the availability of blockchain-based systems, where the read and write availabilities conflict with each other.
reliability analysis | [80] | reliability prediction | authors proposed h-brp to predict the reliability of blockchain peers by extracting their reliability parameters.

then, dai et al. [77] proposed jidar, which is a data reduction strategy for bitcoin. in jidar, each node only has to store the transactions of interest and the related merkle branches from the complete blocks. all nodes verify transactions collaboratively through a query mechanism. this approach seems very promising for the storage efficiency of bitcoin. however, their experiments show that the proposed jidar can only reduce the storage overhead of each peer by about 1% compared with the original bitcoin. following a similar idea, xu et al. [78] reduced the storage of blockchains using a segment blockchain mechanism, in which each node only needs to store a segment of the blockchain. the authors also proved that the proposed mechanism endures a failure probability of (ϕ/n)^m if an adversary colludes with fewer than ϕ of the n nodes and each segment is stored by m nodes. this theoretical result is useful for the storage design of blockchains when developing a particular segment mechanism towards data-heavy distributed applications.
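the bound (ϕ/n)^m quoted above is easy to explore numerically; the concrete values of ϕ, n and m below are illustrative, not taken from [78].

```python
def segment_failure_probability(phi: int, n: int, m: int) -> float:
    """(phi/n)**m: a stored segment is lost only if all m nodes holding it
    belong to the colluding set of (fewer than) phi out of n nodes."""
    return (phi / n) ** m

# illustrative numbers: 25% of 1000 nodes collude, each segment is replicated on m nodes
for m in (3, 5, 10):
    print(f"m={m:<3} failure probability <= {segment_failure_probability(250, 1000, m):.2e}")
```

this shows the practical knob: replicating each segment on a few more nodes drives the failure bound down geometrically, at the cost of extra storage per node.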
in public blockchains, system clients basically join the blockchain network through a third-party peer. thus, the reliability of the selected blockchain peer is critical to the security of clients in terms of both resource efficiency and monetary issues. to enable clients to evaluate and choose reliable blockchain peers, zheng et al. [80] proposed a hybrid reliability prediction model for blockchains named h-brp, which is able to predict the reliability of blockchain peers by extracting their reliability parameters.

summary of recent sharding and multiple-chain interoperability solutions. columns: category | ref. | keyword | description
sharding blockchains | [81] | rapidchain | authors proposed a new sharding-based protocol for public blockchains in which intra-committee communication does not grow linearly with the number of committee members.
sharding blockchains | [82] | sharper | authors proposed a permissioned blockchain system named sharper, which adopts sharding techniques to improve the scalability of cross-shard transactions.
sharding blockchains | [83] | d-gas | authors proposed a dynamic load balancing mechanism for ethereum shards, i.e., d-gas; it reallocates tx accounts by their gas consumption on each shard.
sharding blockchains | [84] | nrss | authors proposed a node-rating based sharding scheme, i.e., nrss, for blockchains, aiming to improve the throughput of committees.
sharding blockchains | [85] | optchain | authors proposed a new sharding paradigm, called optchain, mainly used for optimizing the placement of transactions into shards.
sharding blockchains | [86] | sharding-based scaling system | authors proposed an efficient shard-formation protocol that assigns nodes into shards securely, and a distributed transaction protocol that can guard against malicious byzantine-fault coordinators.
sharding blockchains | [87] | sschain | authors proposed a non-reshuffling structure called sschain, which supports both transaction sharding and state sharding while eliminating huge data migration across shards.
sharding blockchains | [88] | eumonia | authors proposed eumonia, a permissionless parallel-chain protocol for realizing a global ordering of blocks.
sharding blockchains | [89] | vulnerability to sybil attacks | authors systematically analyzed the vulnerability to sybil attacks in the protocol elastico.
sharding blockchains | [90] | n/2 bft sharding approach | authors proposed a new blockchain sharding approach that can tolerate up to 1/2 of the byzantine nodes within a shard.
sharding blockchains | [91] | cycledger | authors proposed a protocol, cycledger, to pave the way towards scalability, security and incentives for sharding blockchains.
interoperability of multiple-chain systems | [92] | interoperability architecture | authors proposed a novel interoperability architecture that supports cross-chain cooperation among multiple blockchains, and a novel monitor multiplexing reading (mmr) method for passive cross-chain communication.
interoperability of multiple-chain systems | [93] | hyperservice | authors proposed a programming platform that provides interoperability and programmability over multiple heterogeneous blockchains.
interoperability of multiple-chain systems | [94] | protocol move | authors proposed a programming model for smart-contract developers to create dapps that can interoperate and scale in a multiple-chain environment.
interoperability of multiple-chain systems | [95] | cross-cryptocurrency tx protocol | authors proposed a decentralized cryptocurrency exchange protocol enabling cross-cryptocurrency transactions based on smart contracts deployed on ethereum.
interoperability of multiple-chain systems | [16] | cross-chain comm. | authors conducted a systematic classification of cross-chain communication protocols.

one of the critical bottlenecks of today's blockchain systems is scalability. for example, the throughput of a blockchain does not scale as the network size grows. to address this dilemma, a number of scalability approaches have been proposed.
in this part, we conduct an overview of the most recent solutions with respect to sharding techniques, interoperability among multiple blockchains, and other solutions. some early-stage sharding blockchain protocols (e.g., elastico) improve the scalability by enforcing multiple groups of committees work in parallel. however, this manner still requires a large amount of communication for verifying every transaction linearly increasing with the number of nodes within a committee. thus, the benefit of sharding policy was not fully employed. as an improved solution, zamani et al. [81] proposed a byzantine-resilient sharding-based protocol, namely rapidchain, for permissionless blockchains. taking the advantage of block pipelining, rapidchain improves the throughput by using a sound intra-committee consensus. the authors also developed an efficient cross-shard verification method to avoid the broadcast messages flooding in the holistic network. to enforce the throughput scaling with the network size, gao et al. [96] proposed a scalable blockchain protocol, which leverages both sharding and proof-of-stake consensus techniques. their experiments were performed in an amazon ec2-based simulation network. although the results showed that the throughput of the proposed protocol increases following the network size, the performance was still not so high, for example, the maximum throughput was 36 transactions per second and the transaction latency was around 27 seconds. aiming to improve the efficiency of cross-shard transactions, amiri et al. [82] proposed a permissioned blockchain system named sharper, which is strive for the scalability of blockchains by dividing and reallocating different data shards to various network clusters. the major contributions of the proposed sharper include the related algorithm and protocol associated to such sharper model. in the author's previous work, they have already proposed a permissioned blockchain, while in this paper the authors extended it by introducing a consensus protocol in the processing of both intra-shard and cross-shard transactions. finally, sharper was devised by adopting sharding techniques. one of the important contributions is that sharper can be used in the networks where there are a high percentage of non-faulty nodes. furthermore, this paper also contributes a flattened consensus protocol w.r.t the order of cross-shard transactions among all involved clusters. considering that the ethereum places each group of transactions on a shard by their account addresses, the workloads and complexity of transactions in shards are apparently unbalanced. this manner further damages the network throughput. to address this uneven problem, kim et al. [83] proposed d-gas, which is a dynamic load balancing mechanism for ethereum shards. using such d-gas, the transaction workloads of accounts on each shard can be reallocated according to their gas consumption. the target is to maximize the throughput of those transactions. the evaluation results showed that the proposed d-gas achieved at most a 12% superiority of transaction throughput and a 74% lower transaction latency comparing with other existing techniques. the random sharding strategy causes imbalanced performance gaps among different committees in a blockchain network. those gaps yield a bottleneck of transaction throughput. thus, wang et al. 
[84] proposed a new sharding policy for blockchains named nrss, which exploits node rating to assess network nodes according to their performance in transaction verification. after such evaluation, all network nodes are reallocated to different committees, aiming to fill the previously imbalanced performance gaps. through experiments conducted on a local blockchain system, the results showed that nrss improves throughput by around 32% under sharding techniques.

sharding has been proposed mainly to improve the scalability and the throughput performance of blockchains. a good sharding policy should minimize cross-shard communication as much as possible. a classic design of sharding is transaction sharding. however, such transaction sharding exploits a random sharding policy, which leads to a dilemma in which most transactions are cross-shard. to this end, nguyen et al. [85] proposed a new sharding paradigm differing from random sharding, called optchain, which can minimize the number of cross-shard transactions. the authors achieved their goal through the following two aspects. first, they designed two metrics, named t2s-score (transaction-to-shard) and l2s-score (latency-to-shard), respectively. the t2s-score aims to measure how likely a transaction should be placed into a particular shard, while the l2s-score is used to measure the confirmation latency when placing a transaction into a shard. next, they utilized the well-known pagerank analysis to calculate the t2s-score and proposed a mathematical model to estimate the l2s-score. finally, how does the proposed optchain place transactions into shards based on the combination of t2s and l2s scores? in brief, they introduced another metric composed of both t2s and l2s, called the temporal fitness score. for a given transaction u and a shard s_i, optchain computes the temporal fitness score for the pair ⟨u, s_i⟩. then, optchain simply puts transaction u into the shard with the highest temporal fitness score (a toy sketch of this placement rule is given at the end of this subsection). similar to [85], dang et al. [86] proposed a new shard-formation protocol, in which the nodes of different shards are re-assigned into different committees to reach a certain safety degree. in addition, they also proposed a coordination protocol to handle cross-shard transactions while guarding against byzantine-fault malicious coordinators. the experiment results showed that the throughput achieves a few thousand tps in both a local cluster with 100 nodes and a large-scale google cloud platform testbed. considering that the reshuffling operations lead to huge data migration in sharding-based protocols, chen et al. [87] proposed sschain, a non-reshuffling structure that supports both transaction sharding and state sharding while eliminating huge data migration across shards.

although the existing sharding-based protocols, e.g., elastico, omniledger and rapidchain, have gained a lot of attention, they still have some drawbacks. for example, the mutual connections among all honest nodes require a large amount of communication resources. furthermore, there is no incentive mechanism driving nodes to participate in the sharding protocol actively. to solve those problems, zhang et al. [91] proposed cycledger, which is a protocol designed for sharding-based distributed ledgers towards scalability, reliable security, and incentives. the proposed cycledger is able to select a leader and a subset of nodes for each committee that handle the intra-shard consensus and the synchronization with other committees. a semi-commitment strategy and a recovery processing scheme were also proposed to deal with system crashes.
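as promised above, here is a toy sketch of optchain's score-based placement rule; cycledger's incentive design is picked up again in the next paragraph. the way the two scores are combined below (a weighted sum) is a placeholder assumption, since the paper computes the t2s-score with pagerank and estimates the l2s-score with a separate mathematical model.

```python
def temporal_fitness(t2s: float, l2s: float, alpha: float = 0.5) -> float:
    """combine a transaction-to-shard score (higher is better) and a latency-to-shard
    estimate (lower is better) into one value; the weighted form is illustrative only."""
    return alpha * t2s + (1 - alpha) / (1.0 + l2s)

def place_transaction(shard_scores: dict) -> int:
    """shard_scores maps shard id -> (t2s, l2s); return the shard with the best fitness."""
    return max(shard_scores, key=lambda s: temporal_fitness(*shard_scores[s]))

# toy usage: shard 1 is strongly connected to the transaction's parents but a bit slower
scores = {0: (0.10, 2.0), 1: (0.80, 3.5), 2: (0.05, 1.0)}
print("place the transaction into shard", place_transaction(scores))
```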
in addition, the authors also proposed a reputation-based incentive policy to encourage nodes behaving honestly. following the widespread adoption of smart contracts, the roles of blockchains have been upgraded from token exchanges into programmable state machines. thus, the blockchain interoperability must evolve accordingly. to help realize such new type of interoperability among multiple heterogeneous blockchains, liu et al. [93] proposed hyperservice, which includes two major components, i.e., a programming framework allowing developers to create crosschain applications; and a universal interoperability protocol towards secure implementation of dapps on blockchains. the authors implemented a 35,000-line prototype to prove the practicality of hyperservice. using the prototype, the end-to-end delays of cross-chain dapps, and the aggregated platform throughput can be measured conveniently. in an ecosystem that consists of multiple blockchains, interoperability among those difference blockchains is an essential issue. to help the smart-contract developers build dapps, fynn et al. [94] proposed a practical move protocol that works for multiple blockchains. the basic idea of such protocol is to support a move operation enabling to move objects and smart contracts from one blockchain to another. recently, to enable cross-cryptocurrency transactions, tian et al. [95] proposed a decentralized cryptocurrency exchange strategy implemented on ethereum through smart contracts. additionally, a great number of studies of cross-chain communications are included in [16] , in which readers can find a systematic classification of cross-chain communication protocols. new protocols [97] ouroboros praos authors proposed a new secure proof-of-stake protocol named ouroboros praos, which is proved secure in the semi-synchronous adversarial setting. [98] tendermint authors proposed a new bft consensus protocol for the wide area network organized by the gossip-based p2p network under adversarial conditions. [73] chu-ko-nu mining authors proposed a novel proof-of-work scheme, named chu-ko-nu mining, which incentivizes miners to create multiple blocks in different zones with only a single pow mining. [99] proof-of-trust (pot) authors proposed a novel proof-of-trust consensus for the online services of crowdsourcing. new [100] streamchain authors proposed to shift the block-based distributed ledgers to a new paradigm of stream transaction processing to achieve a low end-to-end latencies without much affecting throughput. in monoxide proposed by [73] , the authors have devised a novel proof-of-work scheme, named chu-ko-nu mining. this new proof protocol encourages a miner to create multiple blocks in different zones simultaneously with a single pow solving effort. this mechanism makes the effective mining power in each zone is almost equal to the level of the total physical mining power in the entire network. thus, chu-ko-nu mining increases the attack threshold for each zone to 50%. furthermore, chu-ko-nu mining can improve the energy consumption spent on mining new blocks because a lot of more blocks can be produced in each round of normal pow mining. the online services of crowdsourcing face a challenge to find a suitable consensus protocol. by leveraging the advantages of the blockchain such as the traceability of service contracts, zou et al. [99] proposed a new consensus protocol, named proof-of-trust (pot) consensus, for crowdsourcing and the general online service industries. 
basically, such pot consensus protocol leverages a trust management of all service participants, and it works as a hybrid blockchain architecture in which a consortium blockchain integrates with a public service network. conventionally, block-based data structure is adopted by permissionless blockchain systems as blocks can efficiently amortize the cost of cryptography. however, the benefits of blocks are saturated in today's permissioned blockchains since the block-processing introduces large batching latencies. to the distributed ledgers that are neither geo-distributed nor pow-required, istván et al. [100] proposed to shift the traditional block-based data structure into the paradigm of stream-like transaction processing. the premier advantage of such paradigm shift is to largely shrink the end-to-end latencies for permissioned blockchains. the authors developed a prototype of their concept based on hyperledger fabric. the results showed that the end-to-end latencies achieved sub-10 ms and the throughput was close to 1500 tps. permissioned blockchains have a number of limitations, such as poor performance, privacy leaking, and inefficient cross-application transaction handling mechanism. to address those issues, amiri et al. [101] proposed caper, which a permissioned blockchain that can well deal with the cross-application transactions for distributed applications. in particular, caper constructs its blockchain ledger using dag and handles the cross-application transactions by adopting three specific consensus protocols, i.e., a global consensus using a separate set of orders, a hierarchical consensus protocol, and a one-level consensus protocol. then, chang et al. [102] proposed an edge computing-based blockchain [105] architecture, in which edge-computing providers supply computational resources for blockchain miners. the authors then formulated a two-phase stackelberg game for the proposed architecture, aiming to find the stackelberg equilibrium of the theoretical optimal mining scheme. next, zheng et al. [103] proposed a new infrastructure for practical pow blockchains called axechain, which aims to exploit the precious computing power of miners to solve arbitrary practical problems submitted by system users. the authors also analyzed the trade-off between energy consumption and security guarantees of such axechain. this study opens up a new direction for pursing high energy efficiency of meaningful pow protocols. with the non-linear (e.g., graphical) structure adopted by blockchain networks, researchers are becoming interested in the performance improvement brought by new data structures. to find insights under such non-linear blockchain systems, chen et al. [104] performed a systematic analysis by taking three critical metrics into account, i.e., full verification, scalability, and finality-duration. the authors revealed that it is impossible to achieve a blockchain that enables those three metrics at the same time. any blockchain designers must consider the trade-off among such three properties. the graphs are widely used in blockchain networks. for example, merkel tree has been adopted by bitcoin, and several blockchain protocols, such as ghost [106] , phantom [107] , and conflux [108] , constructed their blocks using the directed acyclic graph (dag) technique. different from those generalized graph structures, we review the most recent studies that exploit the graph theories for better understanding blockchains in this part. 
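before reviewing those graph-based studies, the merkle tree mentioned above can be illustrated with a minimal root construction; the duplicate-last-leaf rule for odd levels follows bitcoin's convention, while the toy transaction hashes are arbitrary.

```python
import hashlib

def sha256d(data: bytes) -> bytes:
    """double sha-256, as used for bitcoin block and transaction hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(tx_hashes: list) -> bytes:
    """fold a list of transaction hashes pairwise into a single merkle root."""
    if not tx_hashes:
        raise ValueError("a block needs at least one transaction")
    level = list(tx_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate the last hash on odd levels
        level = [sha256d(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

txs = [sha256d(f"tx-{i}".encode()) for i in range(5)]
print(merkle_root(txs).hex())
```

changing any single leaf changes the root, which is what lets light clients verify a transaction's membership with only a logarithmic-size branch of hashes.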
since the transactions in blockchains are easily structured into graphs, the graph theories and graph-based data mining techniques are viewed as good tools to discover the interesting findings beyond the graphs of blockchain networks. some representative recent studies are reviewed as follows. leveraging the techniques of graph analysis, chen et al. [109] characterized three major activities on ethereum, i.e., money transfer, the creation of smart contracts, and the invocation of smart contracts. the major contribution of this paper is that it performed the first systematic investigation and proposed new approaches based on cross-graph analysis, which can address two security issues existing in ethereum: attack forensics and anomaly detection. particularly, w.r.t the graph theory, the authors mainly concentrated on the following two aspects: (1) graph construction: they identified four types of transactions that are not related to money transfer, smart contract creation, or smart contract invocation. (2) graph analysis: then, they divided the remaining transactions into three groups according to the activities they triggered, i.e., money flow grahp (mfg), smart contract creation graph (ccg) and contract invocation graph (cig). via this manner, the authors delivered many useful insights of transactions that are helpful to address the security issues of ethereum. similarly, by processing bitcoin transaction history, akcora et al. [110] and dixon et al. [111] modeled the transfer network into an extreme transaction graph. through the analysis of chainlet activities [112] in the constructed graph, they proposed to use garch-based forecasting models to identify the financial risk of bitcoin market for cryptocurrency users. an emerging research direction associated with blockchain-based cryptocurrencies is to understand the network dynamics behind graphs of those blockchains, such as the transaction graph. this is because people are wondering what the connection between the price of a cryptocurrency and the dynamics of the overlying transaction graph is. to answer such a question, abay et al. [113] proposed chainnet, which is a computationally lightweight method to learning the graph features of blockchains. the authors also disclosed several insightful findings. for example, it is the topological feature of transaction graph that impacts the prediction of bitcoin price dynamics, rather than the degree distribution of the transaction graph. furthermore, utilizing the mt. gox transaction history, chen et al. [114] also exploited the graph-based data-mining approach to dig the market manipulation of bitcoin. the authors constructed three graphs, i.e., extreme high graph (ehg), extreme low graph (elg), and normal graph (nmg), based on the initial processing of transaction dataset. then, they discovered many correlations between market manipulation patterns and the price of bitcoin. on the other direction, based on address graphs, victor et al. [115] studied the erc20 token networks through analyzing smart contracts of ethereum blockchain. different from other graph-based approaches, the authors focused on their attention on the address graphs, i.e., token networks. with all network addresses, each token network is viewed as an overlay graph of the entire ethereum network addresses. 
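the money flow graph (mfg) construction described above can be prototyped in a few lines with the third-party networkx package; the field names and toy transfers are assumptions, and the token-network analysis of [115] continues right after this sketch.

```python
# a minimal money-flow-style graph built from toy transfer records (illustrative fields).
import networkx as nx

transfers = [
    {"from": "0xa", "to": "0xb", "value": 5.0},
    {"from": "0xa", "to": "0xc", "value": 1.0},
    {"from": "0xb", "to": "0xc", "value": 2.5},
]

mfg = nx.DiGraph()
for t in transfers:
    # accumulate transferred value on the directed edge sender -> receiver
    if mfg.has_edge(t["from"], t["to"]):
        mfg[t["from"]][t["to"]]["value"] += t["value"]
    else:
        mfg.add_edge(t["from"], t["to"], value=t["value"])

# the kind of simple structural metrics such analyses start from
print("nodes:", mfg.number_of_nodes(), "edges:", mfg.number_of_edges())
print("out-degree of 0xa:", mfg.out_degree("0xa"))
```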
similar to [109] , the authors presented the relationship between transactions by exploiting graph-based analysis, in which the arrows can denote the invoking functions between transactions and smart contracts, and the token transfers between transactions as well. the findings presented by this study help us have a well understanding of token networks in terms of time-varying characteristics, such as the usage patterns of the blockchain system. an interesting finding is that around 90% of all transfers stem from the top 1000 token contracts. that is to say, only less than 10% of token recipients have transferred their tokens. this finding is contrary to the viewpoint proposed by [116] , where somin et al. showed that the full transfers seem to obey a power-law distribution. however, the study [115] indicated that those transfers in token networks likely do not follow a power law. the authors attributed such the observations to the following three possible reasons: 1) most of the token users don't have incentives to transfer their tokens. instead, they just simply hold tokens; 2) the majority of inactive tokens are treated as something like unwanted spam; 3) a small portion, i.e., approximately 8%, of users intended to sell their tokens to a market exchange. recently, zhao et al. [117] explored the account creation, account vote, money transfer and contract authorization activities of early-stage eosio transactions through graph-based metric analysis. their study revealed abnormal transactions like voting gangs and frauds. the latencies of block transfer and processing are generally existing in blockchain networks since the large number of miner nodes are geographically distributed. such delays increase the probability of forking and the vulnerability to malicious attacks. thus, it is critical to know how would the network dynamics caused by the block propagation latencies and the fluctuation of hashing power of miners impact the blockchain performance such as block generation rate. to find the connection between those factors, papadis et al. [118] developed stochastic models to derive the blockchain evolution in a wide-area network. their results showed us practical insights for the design issues of blockchains, for example, how to change the difficulty of mining in the pow consensus while guaranteeing an expected block generation rate or an immunity level of adversarial attacks. the authors then performed analytical studies and simulations to evaluate the accuracy of their models. this stochastic analysis opens up a door for us to have a deeper understanding of dynamics in a blockchain network. towards the stability and scalability of blockchain systems, gopalan et al. [119] also proposed a stochastic model for a blockchain system. during their modeling, a structural asymptotic property called one-endedness was identified. the authors also proved that a blockchain system is one-ended if it is stochastically stable. the upper and lower bounds of the stability region were also studied. the authors found that the stability bounds are closely related to the conductance of the p2p blockchain network. those findings are very insightful such that researchers can assess the scalability of blockchain systems deployed on large-scale p2p networks. 
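the effect of propagation latency discussed in [118] can be conveyed with a toy monte carlo experiment: blocks arrive as a poisson process, and a block found while the previous one is still propagating counts as a fork. the exponential inter-arrival assumption, the fixed delay, and the parameter values are illustrative and are not the stochastic models of [118] or [119].

```python
import random

def fork_fraction(block_interval_s: float, delay_s: float,
                  n_blocks: int = 100_000, seed: int = 1) -> float:
    """fraction of blocks found before the previous block has finished propagating."""
    rng = random.Random(seed)
    forks = 0
    for _ in range(n_blocks):
        gap = rng.expovariate(1.0 / block_interval_s)  # time until the next block anywhere
        if gap < delay_s:                              # previous block still in flight
            forks += 1
    return forks / n_blocks

for delay in (1, 5, 15, 60):
    print(f"propagation delay {delay:>2}s -> fork fraction ≈ {fork_fraction(600, delay):.4f}")
```

with a 10-minute block interval the fork fraction stays small even for tens of seconds of delay, but it grows quickly if the block interval is shortened, which is one reason mining difficulty and propagation latency have to be considered together.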
although sharding protocol is viewed as a very promising solution to solving the scalability of blockchains and adopted by multiple well-known blockchains such as rapidchain [81] , omniledger [69] , and monoxide [73] , the failure probability for a committee under sharding protocol is still unknown. to fill this gap, hafid et al. [120] [121] [122] proposed a stochastic model to capture the security analysis under sharding-based blockchains using a probabilistic approach. with the proposed mathematical model, the upper bound of the failure probability was derived for a committee. in particular, three probability inequalities were used in their model, i.e., chebyshev, hoeffding, and chvátal. the authors claim that the proposed stochastic model can be used to analyze the security of any sharding-based protocol. in blockchain networks, several stages of mining processing and the generation of new blocks can be formulated as queueing systems, such as the transaction-arrival queue, the transaction-confirmation queue, and the block-verification queue. thus, a growing number of studies are exploiting the queueing theory to disclose the mining and consensus mechanisms of blockchains. some recent representative works are reviewed as follows. to develop a queueing theory of blockchain systems, li et al. [123, 124] devised a batch-service queueing system to describe the mining and the creating of new blocks in miners' pool. for the blockchain queueing system, the authors exploited the type gi/m/1 continuous-time markov process. then, they derived the stable condition and the stationary probability matrix of the queueing system utilizing the matrix-geometric techniques. then, viewing that the confirmation delay of bitcoin transactions are larger than conventional credit card systems, ricci et al. [125] proposed a theoretical framework integrating the queueing theory and machine learning techniques to have a deep understanding towards the transaction confirmation time. the reason the authors chose the queueing theory for their study is that a queueing model is suitable to see insights into how the different blockchain parameters affect the transaction latencies. their measurement results showed that the bitcoin users experience a delay that is slightly larger than the residual time of a block confirmation. frolkova et al. [126] formulated the synchronization process of bitcoin network as an infinite-server model. the authors derived a closed-form for the model that can be used to capture the queue stationary distribution. furthermore, they also proposed a random-style fluid limit under service latencies. on the other hand, to evaluate and optimize the performance of blockchain-based systems, memon et al. [128] via graph analysis, authors extracted three major activities, i.e., money transfer, smart contracts creation, and smart contracts invocation. based mining [113] features of transaction graphs proposed an extendable and computationally efficient method for graph representation learning on blockchains. theories [114] market manipulation patterns authors exploited the graph-based data-mining approach to reveal the market manipulation evidence of bitcoin. [117] clustering coefficient, assortativity of tx graph authors exploited the graph-based analysis to reveal the abnormal transactions of eosio. token networks [115] token-transfer distributions authors studied the token networks through analyzing smart contracts of ethereum blockchain based on graph analysis. 
[110, 111] extreme chainlet activity authors proposed graph-based analysis models for assessing the financial investment risk of bitoin. blockchain network analysis [118] block completion rates, and the probability of a successful adversarial attack authors derived stochastic models to capture critical blockchain properties, and to evaluate the impact of blockchain propagation latency on key performance metrics. this study provides us useful insights of design issues of blockchain networks. stability analysis [119] time to consistency, cycle length, consistency fraction, age of information authors proposed a network model which can identify the stochastic stability of blockchain systems. failure probability analysis [120] [121] [122] failure probability of a committee, sums of upper-bounded hypergeometric and binomial distributions for each epoch authors proposed a probabilistic model to derive the security analysis under sharding blockchain protocols. this study can tell how to keep the failure probability smaller than a defined threshold for a specific sharding protocol. mining procedure and blockgeneration [123, 124] the average # of tx in the arrival queue and in a block, and average confirmation time of tx authors developed a makovian batch-service queueing system to express the mining process and the generation of new blocks in miners pool. blockconfirmation time [125] the residual lifetime of a block till the next block is confirmed authors proposed a theoretical framework to deeply understand the transaction confirmation time, by integrating the queueing theory and machine learning techniques. synchronization process of bitcoin network [126] stationary queue-length distribution authors proposed an infinite-server model with random fluid limit for bitcoin network. mining resources allocation [127] mining resource for miners, queueing stability authors proposed a lyapunov optimization-based queueing analytical model to study the allocation of mining resources for the pow-based blockchain networks. blockchain's theoretical working principles [128] # of tx per block, mining interval of each block, memory pool size, waiting time, # of unconfirmed tx authors proposed a queueing theory-based model to have a better understanding the theoretical working principle of blockchain networks. critical statistics metrics of blockchain networks, such as the number of transactions every new block, the mining interval of a block, transactions throughput, and the waiting time in memory pool, etc. next, fang et al. [127] proposed a queueing analytical model to allocate mining resources for the general pow-based blockchain networks. the authors formulated the queueing model using lyapunov optimization techniques. based on such stochastic theory, a dynamic allocation algorithm was designed to find a trade-off between mining energy and queueing delay. different from the aforementioned work [123] [124] [125] , the proposed lyapunov-based algorithm does not need to make any statistical assumptions on the arrivals and services. for the people considering whether a blockchain system is needed for his/her business, a notable fact is that blockchain is not always applicable to all real-life use cases. to help analyze whether blockchain is appropriate to a specific application scenario, wust et al. 
[129] provided the first structured analytical methodology and applied it to analyzing authors proposed the first structured analytical methodology that can help decide whether a particular application system indeed needs a blockchain, either a permissioned or permissionless, as its technical solution. exploration of [130] temporal information and the multiplicity features of ethereum transactions authors proposed an analytical model based on the multiplex network theory for understanding ethereum transactions. ethereum transactions [131] pending time of ethereum transactions authors conducted a characterization study of the ethereum by focusing on the pending time, and attempted to find the correlation between pending time and fee-related parameters of ethereum. modeling the competition over multiple miners [132] competing mining resources of miners of a cryptocurrency blockchain authors exploited the game theory to find a nash equilibria while peers are competing mining resources. a neat bound of consistency latency [133] consistency of a pow blockchain authors derived a neat bound of mining latencies that helps understand the consistency of nakamoto's blockchain consensus in asynchronous networks. network connectivity [134] consensus security authors proposed an analytical model to evaluate the impact of network connectivity on the consensus security of pow blockchain under different adversary models. how ethereum responds to sharding [135] balance among shards, # of tx that would involve multiple shards, the amount of data relocated across shards authors studied how sharding impact ethereum by firstly modeling ethereum through graph modeling, and then assessing the three metrics mentioned when partitioning the graph. required properties of sharding protocols [136] consistency and scalability authors proposed an analytical model to evaluate whether a protocol for sharded distributed ledgers fulfills necessary properties. vulnerability by forking attacks [137] hashrate power, net cost of an attack authors proposed fine-grained vulnerability analytical model of blockchain networks incurred by intentional forking attacks taking the advantages of large deviation theory. counterattack to double-spend attacks [70] robustness parameter, vulnerability probability authors studied how to defense and even counterattack the double-spend attacks in pow blockchains. limitations of pbftbased blockchains [138] performance of blockchain applications, persistence, possibility of forks authors studied and identified several misalignments between the requirements of permissioned blockchains and the classic bft protocols. three representative scenarios, i.e., supply chain management, interbank payments, and decentralized autonomous organizations. although ethereum has gained much popularity since its debut in 2014, the systematically analysis of ethereum transactions still suffers from insufficient explorations. therefore, lin et al. [130] proposed to model the transactions using the techniques of multiplex network. the authors then devised several random-walk strategies for graph representation of the transactions network. this study could help us better understand the temporal data and the multiplicity features of ethereum transactions. to better understand the network features of an ethereum transaction, sousa et al. [131] focused on the pending time, which is defined as the latency counting from the time a transaction is observed to the time this transaction is packed into the blockchain. 
the authors tried to find correlations between such pending time and fee-related parameters such as gas and gas price. surprisingly, their data-driven empirical analysis showed no clear correlation between those two factors. this finding is counterintuitive. to achieve a consensus about the state of blockchains, miners have to compete with each other by invoking a certain proof mechanism, say pow. such competition among miners is the key module of public blockchains such as bitcoin. to model the competition over multiple miners of a cryptocurrency blockchain, altman et al. [132] exploited game theory to find nash equilibria while peers compete for mining resources. the proposed approach helps researchers understand such competition well. however, the authors also mentioned that they did not study the punishment and cooperation between miners over repeated games. those open topics will be very interesting for future studies. to ensure the consistency of a pow blockchain in an asynchronous network, zhao et al. [133] performed an analysis and derived a neat bound around 2µ ln(µ/ν), where µ + ν = 1, with µ and ν denoting the fractions of computation power dominated by the honest and adversarial miners, respectively. such a neat bound on mining latencies is helpful for understanding the consistency of nakamoto's blockchain consensus in asynchronous networks. bitcoin's consensus security is built upon the assumption of an honest majority. under this assumption, the blockchain system is thought secure only if the majority of miners are honest while voting towards a global consensus. recent research suggests that network connectivity, the forks of a blockchain, and the mining strategy are major factors that impact the security of consensus in the bitcoin blockchain. to provide pioneering concrete modeling and analysis, xiao et al. [134] proposed an analytical model to evaluate the impact of network connectivity on the consensus security of pow blockchains. to validate the effectiveness of the proposed analytical model, the authors applied it to two adversary scenarios, i.e., honest-but-potentially-colluding and selfish mining. although sharding is viewed as a prevalent technique for improving the scalability of blockchain systems, several essential questions remain: what can we expect from, and what price must be paid for, introducing the sharding technique to ethereum? to answer those questions, fynn et al. [135] studied how sharding works for ethereum by modeling ethereum as a graph. by partitioning the graph, they evaluated the trade-off between edge-cut and balance. several practical insights have been disclosed. for example, three major components, e.g., computation, storage and bandwidth, play a critical role when partitioning ethereum; a good design of incentives is also necessary for adopting a sharding mechanism. as mentioned multiple times, the sharding technique is viewed as a promising solution for improving the scalability of blockchains. however, the properties of a sharded blockchain under a fully adaptive adversary are still unknown. to this end, avarikioti et al. [136] defined consistency and scalability for sharded blockchain protocols. the limitations of security and efficiency of sharding protocols were also derived. then, they analyzed these two properties in the context of multiple popular sharding-based protocols such as omniledger, rapidchain, elastico, and monoxide. several interesting conclusions have been drawn. for example, the authors thought that elastico and monoxide failed to guarantee the balance between consistency and scalability properties, while omniledger and rapidchain fulfill all requirements of a robust sharded blockchain protocol.
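to make the sharding trade-off discussed for [135, 136] concrete, the following python sketch places the vertices of a synthetic transaction graph into shards with a naive rule and reports the two quantities of interest, shard balance and cross-shard edge-cut. the graph, the placement rule and all parameters are assumptions made for illustration, not the method of the cited works.

```python
import networkx as nx

# Illustrative sketch: assign accounts of a synthetic transaction graph to
# shards and measure (i) balance across shards and (ii) the edge-cut, i.e.
# edges (transactions) that would cross shards. Everything here is a stand-in.
G = nx.barabasi_albert_graph(n=2000, m=2, seed=0)   # stand-in transaction graph
num_shards = 4

# naive placement: account i goes to shard i mod num_shards
shard_of = {v: v % num_shards for v in G.nodes}

sizes = [sum(1 for v in G.nodes if shard_of[v] == s) for s in range(num_shards)]
cross = sum(1 for u, v in G.edges if shard_of[u] != shard_of[v])

print("shard sizes (balance):", sizes)
print(f"cross-shard edges: {cross} of {G.number_of_edges()} "
      f"({cross / G.number_of_edges():.0%})")
```

a smarter, workload-aware partitioner would try to reduce the cross-shard fraction while keeping the sizes balanced, which is exactly the tension the graph-partitioning analyses above quantify.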
forking attacks have become a common threat faced by the blockchain market. related existing studies mainly focus on the detection of such attacks through transactions. however, this manner cannot prevent forking attacks from happening. to resist forking attacks, wang et al. [137] studied the fine-grained vulnerability of blockchain networks caused by intentional forks using large deviation theory. this study can help set the robustness parameters for a blockchain network, since the vulnerability analysis provides the correlation between the robustness level and the vulnerability probability. in detail, the authors found that it is much more cost-efficient to set the robustness-level parameters than to spend computational capability on lowering the attack probability. existing economic analysis [139] reported that attacks on pow mining-based blockchain systems can be cheap under a specific condition when renting sufficient hashrate capability. moroz et al. [70] studied how to defend against double-spend attacks in an interesting reverse direction. the authors found that the counterattack of victims can lead to a classic game-theoretic war-of-attrition model. this study showed that double-spend attacks on some pow-based blockchains are actually cheap. however, defending against or even counterattacking such double-spend attacks is possible when victims own the same capacity as the attacker. although bft protocols have attracted a lot of attention, there are still a number of fundamental limitations unaddressed when running blockchain applications on top of classical bft protocols. those limitations include one related to low performance, and two correlated with the gaps between the state machine replication and blockchain models (i.e., the lack of strong persistence guarantees and the occurrence of forks). to identify those limitations, bessani et al. [138] first studied them using a digital coin blockchain app called smartcoin and a popular bft replication library called bft-smart, then discussed how to tackle these limitations in a protocol-agnostic manner. the authors also implemented an experimental permissioned blockchain platform, namely smartchain. their evaluation results showed that smartchain can address the aforementioned limitations and significantly improve the performance of a blockchain application.

representative data-analytics and security studies include:
- cryptojacking detection [140] (features: hardware performance counters): the authors proposed a machine learning-based solution to prevent cryptojacking attacks.
- cryptojacking detection [141] (features: various system resource utilization): the authors proposed an in-browser cryptojacking detection approach (capjack), based on the latest capsnet.
- market-manipulation mining [114] (features: various graph characteristics of the transaction graph): the authors proposed a mining approach using the exchanges collected from the transaction networks.
- predicting volatility of bitcoin price [111] (features: various graph characteristics of extreme chainlets): the authors proposed a graph-based analytic model to predict the intraday financial risk of the bitcoin market.
- money-laundering detection [142] (features: various graph characteristics of the transaction graph): the authors exploited machine learning models to detect potential money-laundering activities from bitcoin transactions.
- ponzi-scheme analysis [143] (features: factors that affect scam persistence): the authors analyzed the demand and supply perspectives of ponzi schemes in the bitcoin ecosystem.
- ponzi-scheme detection [144, 145] (features: account and code features of smart contracts): the authors detected ponzi schemes on ethereum based on data mining and machine learning approaches.
- design problem of cryptoeconomic systems [146] (features: price of the xns token, subsidy of app developers): the authors presented a practical evidence-based example to show how data science and stochastic modeling can be applied to designing cryptoeconomic blockchains.
- pricing mining hardware [147] (features: miner revenue, asic value): the authors studied the correlation between the price of mining hardware (asics) and the value volatility of the underlying cryptocurrency.

web users face severe risks from cryptocurrency-hungry hackers. for example, cryptojacking attacks [148] have raised growing attention. in this type of attack, a mining script is embedded secretly by a hacker without notice from the user. when the script is loaded, mining begins in the background of the system and a large portion of hardware resources is requisitioned for mining. to tackle cryptojacking attacks, tahir et al. [140] proposed a machine learning-based solution, which leverages hardware performance counters as the critical features and can achieve high accuracy while classifying the parasitic miners. the authors also built their approach into a browser extension towards widespread real-time protection for web users. similarly, ning et al. [141] proposed capjack, an in-browser cryptojacking detector based on deep capsule network (capsnet) [149] technology. as mentioned previously, to detect potential manipulation of the bitcoin market, chen et al. [114] proposed graph-based mining to study the evidence from the transaction network built from the mt. gox transaction history. the findings of this study suggest that the cryptocurrency market requires regulation. to predict drastic price fluctuations of bitcoin, dixon et al. [111] studied the impact of extreme transaction graph (etg) activity on the intraday dynamics of bitcoin prices. the authors utilized chainlets [112] (subgraphs of the transaction graph) for developing their predictive models. the authors of [151] mentioned that money laundering conducted in the underground market can be detected using bitcoin mixing services. however, they did not present an essential anti-money-laundering strategy in their paper. in contrast, utilizing a transaction dataset collected over three years, hu et al. [142] performed in-depth detection for discovering money-laundering activities on the bitcoin network. to identify money-laundering transactions from regular ones, the authors proposed four types of classifiers based on graph features of the transaction graph, i.e., immediate neighbors, deepwalk embeddings, node2vec embeddings and decision tree-based classifiers. it is not common to introduce data science and stochastic simulation modeling into the design problem of cryptoeconomic engineering. laskowski et al. [146] presented a practical evidence-based example to show how this can be applied to designing cryptoeconomic blockchains.
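the following sketch illustrates, on purely synthetic data, the general idea behind hardware-performance-counter-based cryptojacking detection such as the approach surveyed above [140]: summarize a workload by counter-style features and train a standard classifier. the feature set, distributions and model below are placeholders and not the pipeline of the cited work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative only: synthetic "hardware counter" style features (cache-miss
# rate, branch-miss rate, instructions per cycle) for benign vs mining-like
# workloads, classified with an off-the-shelf model.
rng = np.random.default_rng(0)
n = 2000
benign = np.column_stack([rng.normal(3.0, 1.0, n),   # cache-miss rate
                          rng.normal(1.5, 0.5, n),   # branch-miss rate
                          rng.normal(1.2, 0.3, n)])  # IPC
mining = np.column_stack([rng.normal(6.0, 1.0, n),
                          rng.normal(0.8, 0.3, n),
                          rng.normal(2.0, 0.3, n)])
X = np.vstack([benign, mining])
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

in a real system the features would come from actual performance-counter sampling and labeled traces, and accuracy would have to be evaluated against realistic, adversarially varied workloads.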
yaish et al. [147] discussed the relationship between cryptocurrency mining and the market price of the special hardware (asics) that supports pow consensus. the authors showed that the decreasing volatility of bitcoin's price has a counterintuitive negative impact on the value of mining hardware. this is because miners are not financially incentivized to participate in mining when bitcoin becomes widely adopted and its volatility therefore decreases. this study also revealed that a mining hardware asic could be imitated by bonds and underlying cryptocurrencies such as bitcoins. although diverse blockchains have been proposed in recent years, very few efforts have been devoted to measuring the performance of different blockchain systems. thus, this part reviews representative studies of performance measurements for blockchains. the measurement metrics include throughput, security, scalability, etc. as pioneering work in this direction, gervais et al. [152] proposed a quantitative framework, with which they studied the security and performance of several pow blockchains, such as bitcoin, litecoin, dogecoin and ethereum. the authors focused on multiple metrics of the security model, e.g., stale block rate, mining power, mining costs, the number of block confirmations, propagation ability, and the impact of eclipse attacks. they also conducted extensive simulations for the four aforementioned blockchains with respect to the impact of block interval, the impact of block size, and throughput. via the evaluation of network parameters related to the security of pow blockchains, researchers can compare security performance objectively, and thus appropriately devise adversarial strategies and the security provisions of pow blockchains. representative measurement setups include:
- monoxide [73]: targets general mining-based blockchains, e.g., bitcoin and ethereum; metrics include tps, the overheads of cross-zone transactions, and the confirmation latency of transactions; monoxide was implemented in c++, rocksdb was used to store blocks and transactions, and the real-world testing system was deployed on a distributed configuration of 1,200 virtual machines, each with 8 cores and 32 gb memory, exploiting 48,000 blockchain nodes in total.
- prism [74]: targets general blockchains; metrics include throughput and confirmation latency, scalability under different numbers of clients, forking rate, and resource utilization (cpu, network bandwidth); the prism testbed was deployed on amazon ec2 instances, each with 16 cpu cores, 16 gb ram, a 400 gb nvme ssd, and a 10 gbps network interface, with 100 prism client instances in total connected into a random 4-regular graph topology.
- [75]: ethereum.

nasir et al. [153] conducted performance measurements and a discussion of two versions of hyperledger fabric. the authors focused on metrics including execution time, transaction latency, throughput, and scalability versus the number of nodes in blockchain platforms. several useful insights have been revealed for the two versions of hyperledger fabric. as already mentioned, in [73] the authors evaluated their proposed monoxide with respect to metrics including the scalability of tps as the number of network zones increases, the overhead of both cross-zone transactions and storage size, the confirmation latency of transactions, and the orphan rate of blocks. in [74], the authors performed rich measurements of their proposed new blockchain protocol prism under limited network bandwidth and cpu resources.
the performance evaluated includes the distribution of block propagation delays, the relationship between block size and mining rate, block size versus assembly time, the expected time to reach consensus on a block hash, the expected time to reach consensus on blocks, etc. later, zheng et al. [154] proposed a scalable framework for monitoring the real-time performance of blockchain systems. this work evaluated four popular blockchain systems, i.e., ethereum, parity [158], cryptape inter-enterprise trust automation (cita) [159] and hyperledger fabric [160]. related directions for eosio-oriented data analytics include i) data analysis based on off-chain data to provide off-chain user behavior for blockchain developers, ii) exploring new features of eosio data that differ from those of ethereum, and iii) conducting a joint analysis of eosio with other blockchains. kalodner et al. [164] proposed blocksci, which is designed as an open-source software platform for blockchain analysis. under the architecture of blocksci, the raw blockchain data is parsed to produce the core blockchain data, including the transaction graph, indexes and scripts, which are then provided to the analysis library. together with auxiliary data including p2p data, price data and user tags, a client can either query directly or read through a jupyter notebook interface. to evaluate the performance of private blockchains, dinh et al. [155] proposed a benchmarking framework, named blockbench, which can measure the data processing capability and the performance of various layers of a blockchain system. using blockbench, the authors then performed detailed measurements and analysis of three blockchains, i.e., ethereum, parity and hyperledger. the results disclosed some useful lessons about those three blockchain systems. for example, today's blockchains are not scalable with respect to data processing workloads, and several bottlenecks should be considered while designing different layers of a blockchain from the software engineering perspective. ethereum has received enormous attention on the mining challenges, the analytics of smart contracts, and the management of block mining. however, not so many efforts have been spent on information dissemination in the underlying peer-to-peer network; the authors of that line of work also made their simulator open-source on github. in this section, we envision the open issues and promising directions for future studies. 6.1.3 cross-shard performance. although a number of committee-based sharding protocols [69, 73, 81, 165] have been proposed, those protocols can only tolerate at most 1/3 adversaries. thus, more robust byzantine agreement protocols need to be devised. furthermore, all sharding-based protocols incur additional cross-shard traffic and latency because of cross-shard transactions. therefore, cross-shard performance, in terms of throughput, latency and other metrics, has to be well guaranteed in future studies. on the other hand, cross-shard transactions are inherent to cross-shard protocols. thus, the pros and cons of such correlations between different shards are worth investigating using certain modelings and theories such as graph-based analysis. 6.1.4 cross-chain transaction accelerating mechanisms. on cross-chain operations, [92] is essentially a pioneering step towards practical blockchain-based ecosystems. following the roadmap paved by [92], we anticipate that subsequent related investigations will appear in the near future.
for example, although the inter-chain transaction experiments achieved initial success, we believe that secure cross-chain transaction accelerating mechanisms are still on the way. in addition, further improvements are still required for the interoperability among multiple blockchains, such as decentralized load-balancing smart contracts for sharded blockchains. 6.1.5 ordering blocks for multiple-chain protocols. although multiple-chain techniques can improve throughput by exploiting the parallel mining of multiple chain instances, how to construct and manage the blocks of all chains in a globally consistent order is still a challenge for multiple-chain based scalability protocols and solutions. 6.1.6 hardware-assisted accelerating solutions for blockchain networks. to improve the performance of blockchains, for example to reduce the latency of transaction confirmation, some advanced network technologies, such as rdma (remote direct memory access) and high-speed network cards, can be exploited to accelerate data access among miners in blockchain networks. 6.1.7 performance optimization in different blockchain network layers. the blockchain network is built over p2p networks, which include several typical layers, such as the mac layer, routing layer, network layer, and application layer. the bft-based protocols essentially work at the network layer. in fact, performance improvements can be achieved by proposing various protocols, algorithms, and theoretical models for other layers of the blockchain network. 6.1.8 blockchain-assisted big-data networks. big data and blockchain have several performance characteristics that are contrary to each other. for example, big data is a centralized management technology with an emphasis on privacy preservation oriented to diverse computing environments. the data processed by big-data technology should ensure non-redundancy and an unstructured architecture in a large-scale computing network. in contrast, blockchain technology builds on a decentralized, transparent and immutable architecture, in which the data type is simple, and data is structured and highly redundant. furthermore, the performance of blockchains requires scalability and the off-chain computing paradigm. thus, how to integrate those two technologies and pursue mutual benefit is an open issue worth in-depth study. for example, potential research topics include how to design a suitable new blockchain architecture for big-data technologies, and how to break isolated data islands using blockchains while guaranteeing the privacy of big data.
• exploiting more general queueing theories to capture the real-world arrival process of transactions, the mining of new blocks, and other queueing-related blockchain phases.
• performing priority-based service policies while dealing with transactions and new blocks, to meet a predefined security or regulation level.
• developing more general probabilistic models to characterize the correlations among the multiple performance parameters of blockchain systems.
6.2.2 privacy-preserving for blockchains. from the previous overview, we observe that most of the existing works under this category discuss blockchain-based security and privacy-preserving applications. the fact is that security and privacy are also critical issues for the blockchain itself. for example, the privacy of transactions could be compromised by attackers.
however, dedicated studies focusing on those issues are still insufficient. 6.2.3 mechanisms for malicious miners. cryptojacking miners are reportedly present in web browsers according to [140]. this type of malicious code commandeers hardware resources, such as computational capability and memory, of web users. thus, anti-cryptojacking mechanisms and strategies need to be developed to protect normal browser users. 6.2.4 security issues of cryptocurrency blockchains. the security issues of cryptocurrency blockchains, such as double-spend attacks and frauds in smart contracts, have attracted growing attention from both industry and academia. however, little effort has been committed to theoretical investigations of the security issues of cryptocurrency blockchains. for example, the exploration of punishment and cooperation between miners over multiple chains is an interesting topic for cryptocurrency blockchains. thus, we expect to see broader perspectives on modeling the behaviors of both attackers and counterattackers in the context of monetary blockchain attacks. most beginners in the field of blockchain face a dilemma due to the lack of powerful simulation/emulation tools for verifying their new ideas or protocols. therefore, powerful simulation/emulation platforms that make it easy to deploy scalable testbeds for experiments would be very helpful to the research community. through a brief review of state-of-the-art blockchain surveys, we found that a dedicated survey focusing on theoretical modelings, analytical models and useful experiment tools for blockchains was still missing. to fill this gap, we conducted a comprehensive survey of the state of the art on blockchains, particularly from the perspectives of theories, modelings, and measurement/evaluation tools. the taxonomy of each topic presented in this survey aims to convey the new protocols, ideas, and solutions that can improve the performance of blockchains and help people understand blockchains at a deeper level. we believe our survey provides timely guidance on the theoretical insights of blockchains for researchers, engineers, educators, and general readers.
survey of consensus protocols on blockchain applications blockchain consensus algorithms: the state of the art and future trends a survey on consensus mechanisms and mining management in blockchain networks sok: a consensus taxonomy in the blockchain era a survey about consensus algorithms used in blockchain a survey on consensus mechanisms and mining strategy management in blockchain networks sok: consensus in the age of blockchains a survey of distributed consensus protocols for blockchain networks a survey of attacks on ethereum smart contracts blockchain-based smart-contract languages: a systematic literature review an overview on smart contracts: challenges, advances and platforms sok: sharding on blockchain survey: sharding in blockchains research on scalability of blockchain technology: problems and methods solutions to scalability of blockchain: a survey sok: communication across distributed ledgers a systematic literature review of blockchain cyber security a survey of blockchain from security perspective a survey of blockchain technology on security, privacy, and trust in crowdsourcing services the security of big data in fog-enabled iot applications including blockchain: a survey a survey on privacy protection in blockchain system a comprehensive survey on blockchain: working, security analysis, privacy threats and potential applications blockchain data analysis: a review of status, trends and challenges dissecting ponzi schemes on ethereum: identification, analysis, and impact blockchain for cloud exchange: a survey blockchain for ai: review and open research challenges blockchain intelligence: when blockchain meets artificial intelligence when machine learning meets blockchain: a decentralized, privacy-preserving and secure design blockchain and machine learning for communications and networking systems a survey on blockchain: a game theoretical perspective blockchain security in cloud computing: use cases, challenges, and solutions when mobile blockchain meets edge computing integrated blockchain and edge computing systems: a survey, some research issues and challenges blockchain for 5g and beyond networks: a state of the art survey blockchains and smart contracts for the internet of things applications of blockchains in the internet of things: a comprehensive survey a review on the use of blockchain for the internet of things internet of things security: a top-down survey blockchain and iot integration: a systematic survey blockchain for internet of things: a survey survey on blockchain for internet of things integration of blockchain and cloud of things: architecture, applications and challenges blockchain for the internet of things: present and future when internet of things meets blockchain: challenges in distributed consensus blockchain technology toward green iot: opportunities and challenges a survey of iot applications in blockchain systems: architecture, consensus, and traffic modeling blockchain applications for industry 4.0 and industrial iot: a review edge intelligence and blockchain empowered 5g beyond for the industrial internet of things applications of blockchain in unmanned aerial vehicles: a review blockchain: a survey on functions, applications and open issues a systematic literature review of blockchain-based applications: current status, classification and open issues blockchain in agriculture: a systematic literature review security and privacy for green iot based agriculture review blockchain solutions and challenges deployment of blockchain technology 
in software defined networks: a survey blockchain for business applications: a systematic literature review a survey of blockchain technology applied to smart cities: research issues and challenges blockchain in smart grids: a review on different use cases blockchain technology for smart grids: decentralized nist conceptual model when blockchain meets distributed file systems: an overview, challenges, and open issues blockchain in space industry: challenges and solutions blockchain and ai-based solutions to combat coronavirus (covid-19)-like epidemics: a survey blockchain: the state of the art and future trends an overview of blockchain technology: architecture, consensus, and future trends blockchain challenges and opportunities: a survey blockchain and cryptocurrencies: model, techniques, and applications core concepts, challenges, and future directions in blockchain: a centralized tutorial bitcoin: a peer-to-peer electronic cash system a secure sharding protocol for open blockchains omniledger: a secure, scale-out, decentralized ledger via sharding double-spend counterattacks: threat of retaliation in proof-of-work systems ethereum: a secure decentralised generalised transaction ledger accel: accelerating the bitcoin blockchain for high-throughput, low-latency applications monoxide: scale out blockchains with asynchronous consensus zones prism: scaling bitcoin by 10,000 x garet: improving throughput using gas consumption-aware relocation in ethereum sharding environments erasure code-based low storage blockchain node jidar: a jigsaw-like data reduction approach without trust assumptions for bitcoin system segment blockchain: a size reduced storage mechanism for blockchain on availability for blockchain-based systems selecting reliable blockchain peers via hybrid blockchain reliability prediction rapidchain: scaling blockchain via full sharding sharper: sharding permissioned blockchains over network clusters gas consumption-aware dynamic load balancing in ethereum sharding environments a node rating based sharding scheme for blockchain optchain: optimal transactions placement for scalable blockchain sharding towards scaling blockchain systems via sharding sschain: a full sharding protocol for public blockchain without data migration overhead eunomia: a permissionless parallel chain protocol based on logical clock on the feasibility of sybil attacks in shard-based permissionless blockchains an n/2 byzantine node tolerate blockchain sharding approach cycledger: a scalable and secure parallel protocol for distributed ledger via sharding towards a novel architecture for enabling interoperability amongst multiple blockchains hyperservice: interoperability and programmability across heterogeneous blockchains smart contracts on the move enabling cross-chain transactions: a decentralized cryptocurrency exchange protocol scalable blockchain protocol based on proof of stake and sharding ouroboros praos: an adaptively-secure, semi-synchronous proof-of-stake blockchain the latest gossip on bft consensus a proof-of-trust consensus protocol for enhancing accountability in crowdsourcing services streamchain: do blockchains need blocks caper: a cross-application permissioned blockchain incentive mechanism for edge computing-based blockchain axechain: a secure and decentralized blockchain for solving easily-verifiable problems nonlinear blockchain scalability: a game-theoretic perspective credit-based payments for fast computing resource trading in edge-assisted internet of things secure high-rate 
transaction processing in bitcoin phantom: a scalable blockdag protocol scaling nakamoto consensus to thousands of transactions per second understanding ethereum via graph analysis bitcoin risk modeling with blockchain graphs blockchain analytics for intraday financial risk modeling forecasting bitcoin price with graph chainlets chainnet: learning on blockchain graphs with topological features market manipulation of bitcoin: evidence from mining the mt. gox transaction network measuring ethereum-based erc20 token networks network analysis of erc20 tokens trading on ethereum blockchain exploring eosio via graph characterization stochastic models and wide-area network measurements for blockchain design and analysis stability and scalability of blockchain systems a probabilistic security analysis of sharding-based blockchain protocols a methodology for a probabilistic security analysis of sharding-based blockchain protocols new mathematical model to analyze security of sharding-based blockchain protocols blockchain queue theory markov processes in blockchain systems learning blockchain delays: a queueing theory approach a bitcoin-inspired infinite-server model with a random fluid limit toward low-cost and stable blockchain networks simulation model for blockchain systems using queuing theory do you need a blockchain? modeling and understanding ethereum transaction records via a complex network approach an analysis of the fees and pending time correlation in ethereum blockchain competition between miners: a game theoretic perspective an analysis of blockchain consistency in asynchronous networks: deriving a neat bound modeling the impact of network connectivity on consensus security of proof-of-work blockchain challenges and pitfalls of partitioning blockchains divide and scale: formalization of distributed ledger sharding protocols corking by forking: vulnerability analysis of blockchain from byzantine replication to blockchain: consensus is only the beginning the economic limits of bitcoin and the blockchain the browsers strike back: countering cryptojacking and parasitic miners on the web capjack: capture in-browser crypto-jacking by deep capsule network through behavioral analysis characterizing and detecting money laundering activities on the bitcoin network analyzing the bitcoin ponzi scheme ecosystem detecting ponzi schemes on ethereum: towards healthier blockchain technology exploiting blockchain data to detect smart ponzi schemes on ethereum evidence based decision making in blockchain economic systems: from theory to practice pricing asics for cryptocurrency mining a first look at browser-based cryptojacking dynamic routing between capsules data mining for detecting bitcoin ponzi schemes money laundering in the bitcoin network: perspective of mixing services on the security and performance of proof of work blockchains performance analysis of hyperledger fabric platforms a detailed and real-time performance monitoring framework for blockchain systems blockbench: a framework for analyzing private blockchains measuring ethereum network peers local bitcoin network simulator for performance evaluation using lightweight virtualization parity documentation cita technical whitepaper hyperledger fabric: a distributed operating system for permissioned blockchains performance monitoring xblock-eth: extracting and exploring blockchain data from etherem xblock-eos: extracting and exploring blockchain data from eosio blocksci: design and applications of a blockchain analysis platform the honey 
badger of bft protocols

key: cord-015967-kqfyasmu authors: tagore, somnath title: epidemic models: their spread, analysis and invasions in scale-free networks date: 2015-03-20 journal: propagation phenomena in real world networks doi: 10.1007/978-3-319-15916-4_1 sha: doc_id: 15967 cord_uid: kqfyasmu

the mission of this chapter is to introduce the concept of epidemic outbreaks in network structures, especially in the case of scale-free networks. the invasion phenomena of epidemics have been of tremendous interest among the scientific community over many years, due to their large-scale manifestation in real-world networks. this chapter seeks to make readers understand the critical issues involved in epidemics, such as propagation, spread and their combat, which can further be used to design synthetic and robust network architectures. the primary concern in this chapter is the concept of susceptible-infectious-recovered (sir) and susceptible-infectious-susceptible (sis) models and their implementation in scale-free networks, followed by developing strategies for identifying the damage caused in the network. the relevance of this chapter can be understood when the methods discussed here are related to contemporary networks for improving their performance in terms of robustness.

the patterns by which epidemics spread through groups are determined by the properties of the pathogen carrying it, the length of its infectious period and its severity, as well as by the network structures within the population. thus, accurately modeling the underlying network is crucial to understanding the spread as well as the prevention of an epidemic. moreover, implementing immunization strategies helps control and terminate these epidemics. for instance, random networks and small worlds display less variation in terms of neighbourhood sizes, whereas spatial networks have poisson-like degree distributions. moreover, as highly connected individuals are of greater importance for disease transmission, incorporating them into the current network is of utmost importance [4]. this is essential for capturing the complexities of disease spread. architecturally, scale-free networks are heterogeneous in nature and can be dynamically constructed by adding new individuals to the current network structure one at a time. this strategy is similar to naturally forming links, especially in the case of social networks. moreover, newly connected nodes or individuals link to the already existing ones (with larger numbers of connections) in a manner that is preferential in nature. this connectivity can be understood from a power-law plot of the number of contacts per individual, a property which is regularly observed in several other networks such as power grids and the world-wide-web, to name a few [14]. epidemiologists have worked hard on understanding the heterogeneity of scale-free networks for populations for a long time. highly connected individuals as well as hub participants have played essential roles in the spread and maintenance of infections and diseases. figure 1.1 illustrates the architecture of a system consisting of a population of individuals. it has several essential components, namely, nodes, links, newly connected nodes, hubs and sub-groups respectively. here, nodes correspond to individuals and their relations are shown as links.
similarly, newly connected nodes correspond to those which are recently added to the network, such as through the initiation of new relations between already existing and previously unknown individuals [24]. hubs are those nodes which are highly connected, such as individuals who are very popular among others and have many relations and/or friends (fig. 1.1 shows a synthetic scale-free network and its characteristics). lastly, sub-groups correspond to certain sections of the population which contain individuals with closely associated relationships, such as groups of nodes which are highly dense in nature, or which have a high clustering coefficient. furthermore, having a large number of contacts is important, as such individuals are at greater risk of infection and, once infected, can transmit it to others. for instance, hub individuals among such high-risk individuals help maintain sexually transmitted diseases (stds) in populations where the majority belong to long-term monogamous relationships, whereas in the case of the sars epidemic, a significant proportion of all infections were due to highly connected, high-risk individuals. furthermore, the preferential attachment model proposed by barabási and albert [4] established that the existence of individuals with large connectivity means that random vaccination alone cannot prevent epidemics. moreover, if there is an upper limit on the connectivity of individuals, random immunization can be performed to control the infection. likewise, the dynamics of infectious diseases has been extensively studied in the case of scale-free as well as small-world and random networks. in small-world networks, most of the nodes may not be direct neighbors, but can be reached from all other nodes via a small number of hops, i.e., the number of intermediate nodes between the start and terminating nodes. also, in these networks the distance, dist, between two random nodes increases proportionally to the logarithm of the number of nodes, tot, in the network [15], i.e., dist ∝ log tot (eq. 1.1). watts and strogatz [24] identified a class of small-world networks and categorized them as random graphs. these were classified on the basis of two independent features, namely, average shortest path length and clustering coefficient. as per the erdős-rényi model, random graphs have a small average shortest path length and a small clustering coefficient. watts and strogatz, on the other hand, demonstrated that various real-world networks have a small average shortest path length along with a clustering coefficient greater than expected at random. it has been observed that it is difficult to block and/or terminate an epidemic in scale-free networks with slowly decaying (heavy) tails. this has especially been seen in cases where network correlations among infections and individuals are absent. another reason for this effect is the presence of hubs, where infections can be sustained and can only be reduced by target-specific interventions [17]. it is well known that real-world networks ranging from social to computer networks are scale-free in nature, with a degree distribution that follows an asymptotic power law. these are characterized by a degree distribution following a power law, P(conn) ∝ conn^(−η), where conn is the number of connections of an individual and η is an exponent. barabási and albert [4] analyzed the topology of a portion of the world-wide-web and identified 'hubs'. these terminals had larger numbers of connections than others and the whole network followed a power-law distribution. they also found that these networks have heavy-tailed degree distributions and thus termed them 'scale-free'.
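a minimal sketch of the preferential-attachment construction described above, using networkx to grow a barabási-albert network and inspect its heavy-tailed degree distribution and short typical distances; the network size and attachment parameter are arbitrary choices for illustration.

```python
import random
import networkx as nx

# Grow a preferential-attachment (Barabasi-Albert) network and look at its
# hubs, heavy-tailed degree counts and small typical distances.
random.seed(42)
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

degrees = [d for _, d in G.degree()]
print("max degree (hub):", max(degrees), " mean degree:", sum(degrees) / len(degrees))

# crude heavy-tail check: how many nodes have degree >= k for a few k
for k in (3, 10, 30, 100):
    print(f"nodes with degree >= {k}:", sum(d >= k for d in degrees))

# typical distances stay small (~log tot): average BFS distance from a few sources
sources = random.sample(list(G.nodes), 5)
dists = [d for s in sources
         for d in nx.single_source_shortest_path_length(G, s).values()]
print("mean shortest-path length (sampled):", sum(dists) / len(dists))
```

the degree counts fall off slowly (a few very large hubs coexist with many low-degree nodes), while the sampled mean distance stays of the order of log tot, which is the combination of properties the chapter relies on.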
likewise, models of epidemic spread in static heavy-tailed networks have illustrated that a degree distribution with bounded moments (in particular a bounded second moment) results in low prevalence and/or termination of the infection for small infection rates [14]; only beyond a particular threshold does the prevalence become non-zero. similarly, it has been seen that for networks following a power law with a diverging second moment, such a threshold does not exist and the prevalence is non-zero for any infection rate. for this reason, epidemics are difficult to handle and terminate in static networks with power-law degree distributions. likewise, in various instances, networks are not static but dynamic (i.e., they evolve in time) via some rewiring processes, in which edges are detached and reattached according to some dynamic rule. steady states of rewiring networks have been studied in the past. more often, it has been observed that, depending on the average connectivity and rewiring rates, networks reach a scale-free steady state, with an exponent η expressed in terms of the dynamical rates [17]. the study of epidemics has always been of interest in areas where biological applications coincide with social issues. for instance, epidemics like influenza, measles, and stds can pass through large groups of individuals or populations, and/or persist over longer timescales at low levels. these might even experience sudden changes of increasing and decreasing prevalence. furthermore, in some cases, single infection outbreaks may have significant effects on a complete population group [1]. epidemic spreading can also occur on complex networks, with vertices representing individuals and links representing interactions among individuals. thus, the spreading of diseases can occur over a network of individuals just as the spreading of computer viruses occurs over the world-wide-web. the underlying network in epidemic models is considered to be static, while the individual states vary between infected and non-infected according to certain probabilistic rules. furthermore, the evolution of an infected group of individuals in time can be studied by focusing on the average density of infected individuals in the steady state. lastly, the spread as well as growth of epidemics can also be monitored by studying the architecture of the network of individuals as well as its statistical properties [2]. one of the essential properties of epidemic spread is its branching pattern, through which healthy individuals are infected over a period of time. this branching pattern of epidemic progression can be classified on the basis of infection initiation, spread and further spread (fig. 1.3) [5]. 1. infection initiation: if an infected individual comes in contact with a group of individuals, the infection is transmitted to each with a probability p, independently of one another. furthermore, if the same individual meets k others while being infected, these k individuals form the infected set. due to this random disease transmission from the initially infected individual, those directly connected to it get infected. if the infection in a branching process reaches an individual set and fails to infect healthy individuals, then termination of the infection occurs, which leads to no further progression and no infection of other healthy individuals. thus, there are two possibilities for an infection in a branching process model: either it reaches a site, infects no one further and dies out, or it continues to infect healthy individuals through contact processes.
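the vanishing-threshold behaviour sketched above can be illustrated numerically. under the standard degree-based mean-field approximation for uncorrelated networks, the epidemic threshold scales as ⟨conn⟩/⟨conn²⟩, which shrinks as a scale-free network grows. the following python snippet, with arbitrary sizes, is only an illustration of that scaling, not a result taken from the chapter.

```python
import networkx as nx

# Degree-based mean-field estimate of the epidemic threshold, <k>/<k^2>,
# for Barabasi-Albert networks of increasing size. For heavy-tailed degree
# distributions <k^2> keeps growing, so the threshold keeps shrinking.
for n in (1_000, 10_000, 100_000):
    G = nx.barabasi_albert_graph(n=n, m=3, seed=1)
    degs = [d for _, d in G.degree()]
    k1 = sum(degs) / len(degs)
    k2 = sum(d * d for d in degs) / len(degs)
    print(f"n={n:>7}  <k>={k1:.2f}  <k^2>={k2:.1f}  threshold ~ {k1 / k2:.4f}")
```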
the quantity which can be used to identify whether an infection persists or fades out is the basic reproductive number [6]. this basic reproductive number, τ, is the expected number of newly infected individuals caused by a single already infected individual. in the case where every individual meets k new people and infects each with probability p, the basic reproductive number can be written as τ = p · k. it is quite essential, as it helps identify whether or not an infection can spread through a population of healthy individuals. the concept of τ was first proposed by alfred lotka, and applied in the area of epidemiology by macdonald [13]. for non-complex population models, τ can be identified if information on the 'death rate' is present. thus, considering the death rate, d, and the birth rate, b, at the same time, τ can be expressed as the ratio b/d. moreover, τ can also be used to determine whether an infection will terminate, i.e., τ < 1, or become an epidemic, i.e., τ > 1. but it cannot be used for comparing different infections at the same time on the basis of multiple parameters. several methods, such as identifying eigenvalues, the jacobian matrix, birth rates, equilibrium states and population statistics, can be used to analyze and handle τ [18]. there are some standard branching models for analyzing the progress of an infection in a healthy population or network. the first one, the reed-frost model, considers a homogeneous closed set consisting of a total number of individuals, tot. let num designate the number of individuals susceptible to infection at time t = 0 and m_num the number of individuals infected by the infection at any time t [19]. here, eq. 1.7 applies in the case of a smaller population. it is assumed that an individual x is infected at time t, and any individual y comes in contact with x with probability a/num, where a > 0. likewise, if y is susceptible to infection then it becomes infected at time t + 1 and x is removed from the population (fig. 1.4a). in this figure, x (or v_1, marked with *) represents the infection start site, y (v_3) and v_2 are individuals that are susceptible to infection, num = 0, tot = 11, and m_num = 1. the second one, the 3-clique model, constructs a 3-clique sub-network randomly from an assigned set of tot individuals. here, each individual/vertex pair (v_i, v_j) is included as an edge with probability p_1, along with vertex triples included as triangles; g_1 and g_2 are two independent graphs, where g_1 is a bernoulli graph with edge probability p_1 and g_2 contains all possible triangles, each existing independently with probability p_2 (fig. 1.4b). in this figure, (v_1, ..., v_9) form the three 3-clique sub-networks with tot = 9, and g = g_1 ∪ g_2 ∪ g_3 respectively [21]. the third one, the household model, assumes that for a given set of tot individuals or vertices, g_1 is a bernoulli graph consisting of tot/b disjoint b-cliques, where b is much smaller than tot, with edge probability p_2. thus, the network g is formed as the superposition of the graphs g_1 and g_2, i.e., g = g_1 ∪ g_2. moreover, g_1 fragments the population into mutually exclusive groups, whereas g_2 describes the relations among individuals in the population. thus, g_1 does not allow any infection spread, as there are no connections between the groups. but when the relationship structure g_2 is added, the groups are linked together and the infection can now spread using relationship connections (fig. 1.4c). in this figure, tot = 10, where the individuals (v_1 to v_10) are linked on the basis of randomly assigned p_2 and b = 4; the resulting networks are shown in figs. 1.5b-d, respectively [23].
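the branching picture and the role of τ = p · k can be checked with a small monte carlo experiment: each infected individual meets k others and infects each with probability p, and we record how often the outbreak dies out. the parameter values below are arbitrary, and the simulation is only a sketch of the branching-process model described above, not code from the chapter.

```python
import numpy as np

# Branching-process sketch: offspring of each infected individual is
# Binomial(k, p), so the mean offspring number is tau = p * k. Outbreaks with
# tau < 1 always die out; with tau > 1 they die out only with some probability.
rng = np.random.default_rng(7)

def outbreak_dies_out(p, k, max_generations=50, cap=100_000):
    infected = 1
    for _ in range(max_generations):
        if infected == 0:
            return True
        if infected > cap:              # treat explosive growth as a large epidemic
            return False
        infected = rng.binomial(infected * k, p)
    return infected == 0

k = 10
for p in (0.05, 0.10, 0.20):            # tau = 0.5, 1.0, 2.0
    trials = 5000
    extinct = sum(outbreak_dies_out(p, k) for _ in range(trials))
    print(f"tau = {p * k:.1f}: outbreak died out in {extinct / trials:.1%} of runs")
```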
thus, it is essential to identify the conditions which result in an epidemic spreading in one network while only minimal isolated infections are present in other network components. moreover, depending on the parameters of the individual sub-networks and their internal connectivities, connecting them to one another may have only a marginal effect on the spread of the epidemic. thus, identifying the conditions that govern the spread of an epidemic process is very essential. in this case, two different interconnected network regimes can be distinguished, namely, strongly and weakly coupled. in the strongly coupled one, all modules are simultaneously either infection-free or part of an epidemic, whereas in the weakly coupled one a new mixed phase exists, where the infection is epidemic in only one module and not in the others [25]. generally, epidemic models consider contact networks to be static in nature, where all links exist throughout the course of the infection. moreover, many infections are contagious and spread at a rate faster than the rate at which the contact structure itself changes. but in cases like hiv, which spreads through a population over longer time scales, the course of the infection spread is heavily dependent on the properties of the contact individuals. the reason for this is that certain individuals may have fewer contacts at any single point in time, and their identities can shift significantly as the infection progresses [25]. thus, for modeling the contact network of such infections, transient contacts are considered, which may not last through the whole epidemic course but only for a particular amount of time. in such cases, it is assumed that the contact links are undirected. furthermore, the timing of individual contacts affects not only who has the potential to spread an infection, but the overall timing pattern also influences the severity of the epidemic spread. similarly, individuals may also be involved in concurrent partnerships, having two or more actively involved partners that overlap in time. thus, the concurrency pattern causes the infection to circulate vigorously through the network [22]. in the last decade, a considerable amount of work has been done in characterizing, analyzing and understanding the topological properties of networks. it has been established that scale-free behavior is one of the most fundamental concepts for understanding the organization of various real-world networks. this scale-free property has a resounding effect on all aspects of dynamic processes in the network, including percolation. likewise, for a wide range of scale-free networks, an epidemic threshold does not exist, and infections with a low spreading rate prevail over the entire population [10]. furthermore, properties of networks such as topological fractality correlate with many aspects of network structure and function. also, some recent developments have shown that the correlation between degree and betweenness centrality of individuals is extremely weak in fractal network models in comparison with non-fractal models [20]. likewise, it is seen that fractal scale-free networks are disassortative, making such scale-free networks more robust against targeted perturbations on hub nodes. moreover, one can also relate fractality to infection dynamics in the case of specifically designed deterministic networks. deterministic networks allow computing functional, structural as well as topological properties.
similarly, in the case of complex networks, determination of topological characteristics has shown that these are scale-free as well as highly clustered, but do not display small-world features. also, by mapping a standard susceptible-infected-recovered (sir) model to a percolation problem, one can find that there exists a certain finite epidemic threshold. in certain cases, the transmission rate needs to exceed a critical value for the infection to spread and prevail. this also indicates that fractal networks are robust to infections [11]. meanwhile, scale-free networks exhibit various essential characteristics such as a power-law degree distribution, a large clustering coefficient, and the large-world phenomenon, to name a few [16]. network analysis can be used to describe the evolution and spread of information in populations, along with understanding their internal dynamics and architecture. specifically, importance should be given to the nature of connections, and to whether a relationship between individuals x and y also provides a relationship between y and x. likewise, this information could be further utilized for identifying transitivity-based measures of cohesion (fig. 1.6). meanwhile, research on networks also provides some quantitative tools for describing and characterizing networks. the degree of a vertex is the number of connections of that vertex in the form of links; for instance, in the undirected graph of fig. 1.6a, degree(v_4) = 3 and degree(v_2) = 4. likewise, a shortest path is the minimum number of links that need to be traversed when traveling between two vertices. the diameter of a network is the maximum distance between any two vertices, i.e., the longest of the shortest walks. the radius of a network is the minimum eccentricity over all vertices, where the eccentricity of a vertex v_i is the greatest geodesic distance between v_i and any other vertex (the distance between two vertices being the number of edges in a shortest path connecting them); for instance, in fig. 1.6b the radius of the network is 2. betweenness centrality, g(v_i), is the number of shortest paths from all vertices to all others that pass through vertex v_i. similarly, closeness centrality, c(v_i), of a vertex v_i describes the total distance of v_i to all other vertices in the network, i.e., the sum of the shortest paths from v_i to all other vertices. lastly, stress centrality, s(v_i), is the simple accumulation of the number of shortest paths between all vertex pairs that pass through v_i, and is sometimes used interchangeably with betweenness centrality [14]. use of the 'adjacency matrix', a_{v_i v_j}, describing the connections within a population, is also common. likewise, various network quantities can be ascertained from the adjacency matrix. for instance, the average number of contacts per individual can be obtained by summing the entries of the adjacency matrix and dividing by the population size, and the powers of the adjacency matrix can be used to calculate measures of transitivity [14].
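as an illustration of the measures defined above, the following snippet computes them with networkx on a small toy graph; the edge list is invented for the example and is not the graph of fig. 1.6.

```python
import networkx as nx

# Compute the descriptive measures discussed in the text on a small toy graph.
G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)])

print("degrees:", dict(G.degree()))
print("shortest path 1 -> 6:", nx.shortest_path(G, 1, 6))
print("diameter:", nx.diameter(G), " radius:", nx.radius(G))
print("betweenness centrality:", nx.betweenness_centrality(G))
print("closeness centrality:", nx.closeness_centrality(G))

# adjacency matrix and mean number of contacts per individual
A = nx.to_numpy_array(G)
print("average contacts per individual:", A.sum() / A.shape[0])
```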
one of the key prerequisites of network analysis is initial data collection. for performing a complete mixing-network analysis of the individuals residing in a population, information on every relationship is essential. this makes handling the entire population very difficult and raises complicated network evaluation issues, because individuals have many contacts and recall problems are quite probable. moreover, evaluation of contacts requires certain information which may not always be readily available. likewise, in the case of epidemiological networks, connections are included if they describe relationships capable of permitting the transfer of infection. but in most cases, clarity in defining such relations is absent. thus, various types of relationships carry risks and judgments that need to be sorted out for understanding likely transmission routes. one can also consider weighted networks, in which links are not merely present or absent but are given scores or weights according to their strength [9]. furthermore, different infections are passed by different routes, and a mixing network is infection specific. for instance, a network used for hiv transmission is different from one used to examine influenza. similarly, in the case of airborne infections like influenza and measles, various networks need to be considered because differing levels of interaction are required to constitute a contact. the problems with network definition and measurement imply that any mixing networks obtained will depend on the assumptions and protocols of the data collection process. three main standard techniques can be employed to gather such information, namely, infection searching, complete contact searching and diary-based studies [9]. after an epidemic spread, major emphasis is laid on determining the source and spread of infection. thus, each infected individual is linked to the one from whom the infection was acquired as well as to those to whom the infection was transmitted. as all connections represent actual transmission events, infection searching methods do not suffer from problems with the link definition, but interactions not responsible for transmission are removed. thus, the networks observed have a closed, tree-like architecture, without loops, cliques or complete sub-graphs [15]. infection searching is a preliminary method for infectious diseases with low prevalence. it can be simulated using several mathematical techniques based on differential equations, control theory, etc., assuming a homogeneous mixing of the population. it can also be simulated in a manner in which infected individuals are identified and cured at a rate proportional to the number of neighbors they have, analogous to the infection process. but this does not allow comparing various infection-searching budgets, and thus a discrete-event simulation needs to be undertaken. moreover, a number of studies have shown that analyses based on realistic models of disease transmission in contact networks yield more accurate projections of infection spread than projections created using compartmental models [8]. furthermore, depending on the number of contacts of any infected individual, their susceptible neighbors are traced and removed. this is followed by identifying infection-searching techniques that yield different numbers of newly infected individuals as the disease spreads. contact searching identifies potential transmission contacts of an initially infected individual by revealing a new set of individuals who are prone to infection and can be the subject of further searching effort. nevertheless, it suffers from network definition issues, is time consuming, and depends on complete information about individuals and their relationships. it has been used as a control strategy, for instance in the case of stds.
the main objective of contact searching is identifying asymptomatically infected individuals, who are then either treated or quarantined. complete contact searching deals with identifying the susceptible and/or infected contacts of already infected individuals, conducting simulations and/or testing them for the degree of infection spread, treating them, as well as searching their neighbors for immunization. for instance, stds have been found to be difficult to control through immunization. the reason being that these have particularly long asymptomatic periods, during which the virus can replicate and the infection is transmitted to healthy, closely related neighbors. this is rapidly followed by severe effects, ultimately leading to the death of the affected individual. likewise, recognizing these infections as a global epidemic has led to the development of treatments that allow them to be managed by suppressing the replication of the infection for as long as possible. thus, complete contact searching acts as an essential strategy even in cases where the infection seems incurable [7]. diary-based studies consist of individuals recording contacts as they occur and allow a larger number of individuals to be sampled in detail. thus, a shift from the population-level approach of other tracing methods to the individual-level scale is possible. but this approach suffers from several disadvantages. for instance, the data collection is at the discretion of the subjects, and it is difficult for researchers to link this information into a comprehensive network, as the individuals identify contacts that are not uniquely recorded [3]. diary-based studies require the individuals to be part of some coherent group, residing in small communities. also, it is quite probable that this kind of study may result in a large number of disconnected sub-groups, with each of them representing some locally connected set of individuals. diary-based studies can be beneficial for identifying infected and susceptible individuals as well as the degree of infectivity. they also provide a comprehensive network for diseases that spread by point-to-point contact and can be used to investigate the patterns of infection spread. robustness is an essential connectivity property of power-law graphs. it means that power-law graphs are robust under random attack, but vulnerable under targeted attack. in recent studies, the robustness of power-law graphs under random and targeted attacks has been simulated, showing that power-law graphs are very robust under random errors but vulnerable when a small fraction of high-degree vertices or links is removed. furthermore, some studies have also shown that if vertices are deleted at random, then as long as any positive proportion remains, the graph induced on the remaining vertices has a component of the order of the total number of vertices [15]. it can often be observed that a network of individuals may be subject to sudden changes in the internal and/or external environment, due to some perturbation events. for this reason, a balance needs to be maintained against perturbations while being adaptable in the presence of changes, a property known as robustness. studies on the topological and functional properties of such networks have achieved some progress, but still provide a limited understanding of their robustness. furthermore, the more important a path is, the higher the chance that it has a backup path.
thus, removing a link or an individual from any sub-network may also lead to blocking the information flow within that sub-network. the robustness of a model can also be assessed by means of altering the various parameters and components associated with forming a particular link. robustness of a network can also be studied with respect to 'resilience', a method of analyzing the sensitivities of internal constituents under external perturbation, which may be random or targeted in nature [18]. basic disease models describe the number of individuals in a population that are susceptible, infected and/or recovered from a particular infection. for this purpose, various differential equation based models have been used to simulate the course of events during the infection spread. in this scenario, various details of the infection progression are neglected, along with the difference in response between individuals. models of infections can be categorized as sir and susceptible, infected, susceptible (sis) [9]. the sir model considers individuals to have long-lasting immunity, and divides the population into those susceptible to the disease (s), infected (i) and recovered (r). thus, the total number of individuals (t) considered in the population is t = s + i + r. the transition rate from s to i is κ and the recovery rate from i to r is ρ. thus, the sir model can be represented as ds/dt = −κ·s·i, di/dt = κ·s·i − ρ·i and dr/dt = ρ·i. likewise, the reproductivity (θ) of an infection can be identified as the average number of secondary instances a typical single infected instance will cause in a population with no immunity. it determines whether an infection spreads through a population: if θ < 1, the infection terminates in the long run; if θ > 1, the infection spreads in the population. the larger the value of θ, the more difficult it is to control the epidemic [12]. furthermore, the proportion of the population that needs to be immunized can be calculated as 1 − 1/θ, from which the point known as endemic stability can be identified. depending upon these values, immunization strategies can be initiated [6]. although the contact network in a general sir model can be arbitrarily complex, the infection dynamics can still be studied as well as modeled in a simple fashion. contagion probabilities are set to a uniform value, i.e., p, and contagiousness has a kind of 'on-off' property, i.e., an individual is equally contagious during each of the t_i steps while it has the infection. one can extend the idea that contagion is more likely between certain pairs of individuals or vertices by assigning a separate probability p_vi,vj to each pair of individuals or vertices vi and vj, for which vi is linked to vj in a directed contact network. likewise, other extensions of the contact model involve separating the i state into a sequence of early, middle, and late periods of the infection. for instance, this could be used to model an infection with a highly contagious incubation period, followed by a less contagious period while symptoms are being expressed [16]. in most cases, sir epidemics are thought of as dynamic processes, in which the network state evolves step by step over time. this captures the temporal dynamics of the infection as it spreads through a population. the sir model has been found to be suitable for infections that provide lifelong immunity, like measles. in this case, a property termed the force of infection exists, which is a function of the number of infectious individuals.
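as a small illustration, the sketch below numerically integrates the sir equations written above, with the transition rate κ and recovery rate ρ defined in the text; it assumes python with numpy and scipy, and the parameter values and initial conditions are arbitrary choices for illustration.

```python
# sketch: numerical integration of the sir model ds/dt = -kappa*s*i, di/dt = kappa*s*i - rho*i, dr/dt = rho*i
# assumes numpy and scipy; parameter values and initial conditions are illustrative only
import numpy as np
from scipy.integrate import odeint

def sir(y, t, kappa, rho):
    s, i, r = y
    return [-kappa * s * i, kappa * s * i - rho * i, rho * i]

kappa, rho = 0.3, 0.1           # transmission and recovery rates (theta = kappa/rho = 3)
y0 = [0.99, 0.01, 0.0]          # initial fractions susceptible, infected, recovered
t = np.linspace(0, 200, 400)
s, i, r = odeint(sir, y0, t, args=(kappa, rho)).T

print("peak infected fraction: %.3f" % i.max())
print("final epidemic size: %.3f" % r[-1])
```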
the force of infection also contains information about the interactions between individuals that lead to the transmission of infection. one can also have a static view of the epidemic by considering the sir model with t_i = 1. this means that, considering a point in an sir epidemic at which a vertex vi has just become infectious, it has one chance to infect each neighbor vj (since t_i = 1), with probability p. one can visualize the outcome of this probabilistic process and also assume that for each edge in the contact network a probability signifying the relationship is identified. the sis model can be represented by the same pair of rates, ds/dt = −κ·s·i + ρ·i and di/dt = κ·s·i − ρ·i; the removed state is absent in this case. moreover, after a vertex has passed through the infectious state, it reverts back to the susceptible state and can contract the infection again. due to this alternation between the s and i states, the model is referred to as the sis model. the mechanics of the sis model can be described as follows [2] (a minimal simulation of these steps is sketched below). 1. at the initial stage, some vertices are in the i state and all others are in the s state. 2. each vertex vi that enters the i state remains infected for a certain number of steps t_i. 3. during each of these t_i steps, vi has a probability p of passing the infection to each of its susceptible directly linked neighbors. 4. after t_i steps, vi no longer remains infected, and returns back to the s state. the sis model is predominantly used for simulating and understanding the progress of stds, where repeat infections occur, as with gonorrhoea. moreover, certain assumptions with regard to random mixing between individuals within each pair of sub-networks are present. in this scenario, the number of neighbors of each individual is considerably smaller than the total population size. such models generally avoid random-mixing assumptions by assigning each individual a specific set of contacts that they can infect. an sis epidemic can run for a long time, as it can cycle through the vertices multiple times. if at any time during the sis epidemic all vertices are simultaneously free of the infection, then the epidemic terminates forever, the reason being that no infected individuals exist that can pass the infection to others. if the network is finite in nature, a stage would arise when all attempts for further infection of healthy individuals simultaneously fail for t_i steps in a row. likewise, for contact networks whose structure is mathematically tractable, a particular critical value of the contagion probability p exists at which an sis epidemic undergoes a rapid shift from one that terminates quickly to one that persists for a long time. in this case, the critical value of the contagion probability depends on the structure of the problem set [1]. the patterns by which epidemics spread through vertex groups are determined by the properties of the pathogen, the length of its infectious period, its severity and the network structure. the paths for infection spread are given by the state of the population, i.e., the direct contacts that exist between the individuals or vertices. the functioning of a network system depends on the nature of the interactions between its individuals. this is essentially because of the effect of infection-causing individuals and the topology of the network. to analyze the complexity of epidemics, it is important to understand the underlying principles of their distribution throughout the history of their existence.
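the four-step sis mechanics listed above translate directly into a simulation; the sketch below assumes python with networkx, uses a random toy contact network, and the choices of p, t_i, the seeds and the network size are illustrative only.

```python
# sketch: discrete-time sis dynamics following the four steps described above
# assumes networkx; the contact network, p and t_i are illustrative choices
import random
import networkx as nx

random.seed(1)
g = nx.erdos_renyi_graph(200, 0.03)     # toy contact network
p, t_i = 0.2, 2                         # per-step infection probability and infectious period

infected = {v: 0 for v in random.sample(list(g.nodes()), 5)}  # vertex -> steps spent infected
for step in range(50):
    newly_infected = set()
    for v, age in list(infected.items()):
        for u in g.neighbors(v):
            if u not in infected and random.random() < p:
                newly_infected.add(u)
        infected[v] = age + 1
    # vertices that have completed t_i steps return to the susceptible state
    infected = {v: a for v, a in infected.items() if a < t_i}
    for u in newly_infected:
        infected[u] = 0
    print(step, len(infected))
    if not infected:                    # the epidemic terminates forever once no vertex is infected
        break
```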
in recent years it has been seen that the study of disease dynamics in social networks is relevant to the spread of viruses and the nature of diseases [9]. moreover, the pathogen and the network are closely intertwined: even within the same group of individuals, the contact networks for two different infections have different structures. this depends on the respective modes of transmission of the infections. for instance, for a highly contagious infection involving airborne transmission, the contact network includes a huge number of links, including any pair of individuals that are in contact with one another. likewise, for an infection requiring close contact, the contact network is much sparser, with fewer pairs of individuals connected by links [7]. immunization is a site percolation problem where each immunized individual is considered to be a site which is removed from the infected network. its aim is to shift the percolation threshold in a way that minimizes the number of infected individuals. the combined model of sir and immunization is regarded as a site-bond percolation model, and immunization is considered successful if the infected network is below a predefined percolation threshold. furthermore, immunizing randomly selected individuals requires targeting a large fraction, frac, of the entire population; for instance, some infections require 80-100 % immunization. meanwhile, target-based immunization of the hubs requires global information about the network in question, which is very difficult to obtain in certain cases, rendering it impractical in many settings [6]. likewise, social networks possess a broad distribution of the number of links, conn, connecting individuals, and analyzing them illustrates that a large fraction, frac, of the individuals needs to be immunized before the integrity of the infected network is compromised. this is essentially true for scale-free networks, where p(conn) ≈ conn^(−η), 2 < η < 3, and the network remains connected even after removal of most of its individuals or vertices. in this scenario, a random immunization strategy requires that most of the individuals be immunized before an epidemic is terminated [8]. for various infections, it may be difficult to reach a critical level of immunization for terminating the infection. in this case, each individual that is immunized is given immunity against the infection, but also provides protection to other healthy individuals within the population. based on the sir model, one can only achieve half of the critical immunization level, which reduces the level of infection in the population by half. a crucial property of immunization is that these strategies are not perfect and being immunized does not always confer immunity. in this case, the critical threshold applies to the portion of the total population that needs to be immunized. for instance, if the immunization fails to generate immunity in a portion, por, of those immunized, then to achieve herd immunity one needs to immunize a correspondingly larger portion of the population; here, im denotes the immunity strength. thus, if por is large it is difficult to remove the infection using this strategy, or only partial immunity is provided. immunization may also act in various manners: it reduces the susceptibility of an individual to a particular infection, may reduce subsequent transmission if the individual becomes infected, or it may increase recovery.
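the contrast between random and target-based (hub) immunization described above can be illustrated with a small percolation-style experiment; the sketch assumes python with networkx, uses a barabási-albert graph as a stand-in for a scale-free contact network, and the network size and immunized fraction are arbitrary illustrative choices.

```python
# sketch: random versus hub-targeted immunization as site percolation
# assumes networkx; a barabasi-albert graph stands in for a scale-free contact network
import random
import networkx as nx

def giant_component_after_immunization(g, immunized):
    h = g.copy()
    h.remove_nodes_from(immunized)          # immunized individuals are removed from the network
    if h.number_of_nodes() == 0:
        return 0
    return len(max(nx.connected_components(h), key=len))

random.seed(1)
g = nx.barabasi_albert_graph(1000, 2)
frac = 0.1
k = int(frac * g.number_of_nodes())

random_set = random.sample(list(g.nodes()), k)
hubs = sorted(g.nodes(), key=lambda v: g.degree(v), reverse=True)[:k]

print("random immunization, giant component:",
      giant_component_after_immunization(g, random_set))
print("targeted immunization, giant component:",
      giant_component_after_immunization(g, hubs))
```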
such immunization strategies require the immunized individuals to become infected and shift into a separate infected group, after which the critical immunization threshold (s_i) needs to be established. thus, if cil is the number of secondary infected individuals affected by an initial infectious individual, then s_i needs to be less than one, else it is not possible to remove the infection. but one also needs to note that an immunization works equally efficiently whether it reduces the transmission or susceptibility or increases the recovery rate. moreover, when the immunization strategy fails to generate any protection in a proportion por of those immunized, the remaining 1 − por are fully protected. in this scenario, it may not be possible to remove the infection using random immunization. thus, targeted immunization provides better protection than random-based immunization [13]. in case of homogeneous networks, the average degree fluctuates little and one can assume conn ≈ ⟨conn⟩, i.e., the number of links of each individual is approximately equal to the average degree. however, networks can also be heterogeneous. likewise, in a homogeneous network such as a random graph, p(conn) decays exponentially fast, whereas for heterogeneous networks it decays as a power law for large conn. the effect of heterogeneity on epidemic behavior has been studied in detail for many years for scale-free networks. these studies are mainly concerned with the stationary limit and the existence of an endemic phase. an essential result of this analysis is the expression for the basic reproductive number, which in this case is τ ∝ ⟨conn²⟩/⟨conn⟩. here, τ is proportional to the second moment of the degree distribution, which diverges for increasing network sizes [15]. it has been noticed that the degree of interconnection between individuals in all forms of networks is quite unprecedented. whereas interconnection increases the spread of information in social networks, it also contributes to the spread of infection throughout the healthy network, another exhaustively studied area. this rapid spreading is due to the low stringency of the infection's passage through the network. moreover, the nature of the initial sickness and the time of infection are unavailable most of the time, and the only available information is related to the evolution of the sick-reporting process. thus, given complete knowledge of the network topology, the objective is to determine whether the infection is an epidemic, or whether individuals have become infected via an independent infection mechanism that is external to the network and not propagated through the connected links. if one considers a computer network undergoing cascading failures due to worm propagation as well as random failures due to misconfiguration independent of infected nodes, there are two possible causes of the sickness, namely random and infectious spread. in random sickness, the infection spreads randomly and uniformly over the network and the network plays no role in spreading the infection; in infectious spread, the infection is caused by a contagion that spreads through the network, with individual nodes being infected by direct neighbors with a certain probability [6]. in random damage, each individual becomes infected with an independent probability ψ1. at time t, each infected individual reports damage with an independent probability ψ2. thus, on average, a fraction ψ = ψ1·ψ2 of the network reports being infected. it is already known that social networks possess a broad distribution of the number of links, k, originating from an individual.
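the dependence of the basic reproductive number on degree fluctuations, τ ∝ ⟨conn²⟩/⟨conn⟩, can be checked numerically; the sketch below assumes python with networkx and compares a homogeneous random graph with a heavier-tailed graph of similar average degree, with all sizes and parameters chosen for illustration.

```python
# sketch: degree heterogeneity inflates the ratio <k^2>/<k>, to which the basic
# reproductive number is proportional according to the text
# assumes networkx; graph sizes and parameters are illustrative
import networkx as nx

def second_moment_ratio(g):
    degrees = [d for _, d in g.degree()]
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(d * d for d in degrees) / len(degrees)
    return mean_k2 / mean_k

homogeneous = nx.erdos_renyi_graph(5000, 4 / 5000, seed=1)   # ~poisson degrees, <k> close to 4
heterogeneous = nx.barabasi_albert_graph(5000, 2, seed=1)    # power-law-like degrees, <k> close to 4

print("homogeneous  <k^2>/<k>: %.2f" % second_moment_ratio(homogeneous))
print("heterogeneous <k^2>/<k>: %.2f" % second_moment_ratio(heterogeneous))
```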
computer networks, both physical and logical, are also known to possess wide, scale-free distributions. studies of percolation on broad-scale networks show that a large fraction, fc, of the individuals needs to be immunized before the integrity of the network is compromised. this is particularly true for scale-free networks, where the percolation threshold tends to 1 and the network remains connected even after removal of most of its individuals [9]. when the hub individuals are targeted first, removal of just a fraction of them results in the breakdown of the network. this has led to the suggestion of targeted immunization of hubs. to implement this approach, the number of connections of each individual needs to be known. during infection spread, at time 0, a randomly selected individual in the network becomes infected. when a healthy individual becomes infected, a time is set for each outgoing link to an adjacent individual that is not infected, with expiration time exponentially distributed with unit average. upon expiration of a link's time, the corresponding individual becomes infected and in turn begins infecting its neighbors [7]. in general, for an epidemic to occur in a susceptible population the basic reproductive rate must be greater than 1. in many circumstances not all contacts will be susceptible to infection. in this case, some contacts remain immune, due to prior infection which may have conferred life-long immunity, or due to some previous immunization. therefore, not all individuals are infected and the average number of secondary infections decreases. similarly, the epidemic threshold in this case is the number of susceptible individuals within a population that is required for an epidemic to occur. likewise, herd immunity is the proportion of the population immune to a particular infection. if this proportion is achieved through immunization, then each case leads on average to at most one new case and the infection remains stable or declines within the population [6]. one of the simplest immunization procedures consists of the random introduction of immune individuals into the population to achieve a uniform immunization density. in this case, for a fixed spreading rate ξ, the relevant control parameter is the density of immune individuals present in the network, the immunity imm. at the mean-field level, the presence of a uniform immunity reduces ξ by a factor 1 − imm, i.e., the probability of identifying and infecting a susceptible and non-immune individual becomes ξ(1 − imm). for homogeneous networks one observes that, for a constant ξ, the stationary prevalence is finite for imm ≤ imm_c and vanishes for imm > imm_c; here imm_c is the critical immunization value, which depends on ξ, above which the density of infected individuals in the stationary state is null. thus, for a uniform immunization level larger than imm_c, the network is completely protected and no large epidemic outbreaks are possible. on the contrary, uniform immunization strategies on scale-free heterogeneous networks are totally ineffective. the presence of uniform immunization only locally depresses the infection's prevalence for any value of ξ, and it is difficult to identify any critical fraction of immunized individuals that ensures the eradication of the infection [2]. cascading, or epidemic, processes are those where the actions, infections or failures of certain individuals increase the susceptibility of others. this results in the successive spread of infections from a small set of initially infected individuals to a larger set.
initially developed as a way to study human disease propagation, cascades are useful models in a wide range of applications. the vast majority of work on cascading processes has focused on understanding how the graph structure of the network affects the spread of cascades. one can also focus on several critical issues for understanding the cascading features in a network, for which studying the architecture of the network is crucial [5]. the standard independent cascade epidemic model assumes that the network is a directed graph g = (v, e); for every directed edge between vi and vj, we say vi is a parent and vj is a child of the other vertex. a parent may infect a child along an edge, but the reverse cannot happen. let v denote the set of parents of a vertex vi; for convenience, vi ∈ v is included. epidemics proceed in discrete time, where all vertices are initially in the susceptible state. at time 0, each vertex independently becomes active with probability p_init. this set of initially active vertices is called the 'seeds'. in each time step, the active vertices probabilistically infect their susceptible children; if vertex vi is active at time t, it infects each susceptible child vj with probability p_vi,vj, independently. correspondingly, a vertex vj that is susceptible at time t becomes active in the next time step, i.e., at time t + 1, if any one of its parents infects it. finally, a vertex remains active for only one time slot, after which it becomes inactive, does not spread the infection further and cannot be infected again either [5]. thus, this is a kind of sir epidemic in which some vertices remain forever susceptible because the epidemic never reaches them, while others transition from susceptible, to active for one time step, to inactive (a minimal simulation of these dynamics is sketched after the summary below). in this chapter, we discussed some critical issues regarding epidemics and their outbursts in static as well as dynamic network structures. we mainly focused on sir and sis models as well as on key strategies for assessing the damage caused in networks. we also discussed the various modeling techniques for studying cascading failures. epidemics pass through populations and persist over long time periods. thus, efficient modeling of the underlying network plays a crucial role in understanding the spread and prevention of an epidemic. social, biological, and communication systems can be described as complex networks whose degree distribution follows a power law, p(conn) ≈ conn^(−η), for the number of connections conn of individuals, representing scale-free (sf) networks. we also discussed certain issues of epidemic spreading in sf networks characterized by complex topologies, with basic epidemic models describing the proportion of individuals susceptible, infected and recovered from a particular disease. likewise, we also explained the significance of the basic reproduction rate of an infection, which can be identified as the average number of secondary instances a typical single infected instance will cause in a population with no immunity. also, we explained how determining the complete nature of a network requires knowledge of every individual in a population and their relationships, as the problems with network definition and measurement depend on the assumptions of the data collection processes.
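the independent cascade dynamics described earlier in this section can be simulated in a few lines; the sketch below assumes python with networkx, replaces the per-edge probabilities p_vi,vj with a single uniform value for simplicity, and all parameter choices are illustrative.

```python
# sketch: standard independent cascade on a directed graph, as described above
# assumes networkx; p_init and the uniform edge probability are illustrative
import random
import networkx as nx

random.seed(1)
g = nx.gnp_random_graph(300, 0.02, directed=True)
p_init, p_edge = 0.02, 0.3

active = {v for v in g.nodes() if random.random() < p_init}   # the seeds
ever_active = set(active)
t = 0
while active:
    next_active = set()
    for v in active:                        # active vertices try to infect susceptible children
        for child in g.successors(v):
            if child not in ever_active and random.random() < p_edge:
                next_active.add(child)
    ever_active |= next_active
    active = next_active                    # a vertex stays active for exactly one time step
    t += 1

print("cascade lasted", t, "steps and reached", len(ever_active), "vertices")
```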
nevertheless, we also illustrated the importance of invasion resistance methods, with temporary immunity generating oscillations in localized parts of the network, with certain patches following large numbers of infections in concentrated areas. similarly, we also explained the significance of the two kinds of damage, namely random damage, where the damage spreads randomly and uniformly over the network and the network plays no role in spreading the damage, and infectious spread, where the damage spreads through the network, with one node infecting others with some probability.
references: infectious diseases of humans: dynamics and control; the mathematical theory of infectious diseases and its applications; a forest-fire model and some thoughts on turbulence; emergence of scaling in random networks; mathematical models used in the study of infectious diseases; spread of epidemic disease on networks; networks and epidemic models; network-based analysis of stochastic sir epidemic models with random and proportionate mixing; elements of mathematical ecology; intelligent information and database systems; propagation phenomenon in complex networks: theory and practice; relation between birth rates and death rates; the analysis of malaria epidemics; graph theory and networks in biology; mathematical biology; spread of epidemic disease on networks; the use of mathematical models in the epidemiology study of infectious diseases and in the design of mass vaccination programmes; forest-fire as a model for the dynamics of disease epidemics; on the critical behaviour of simple epidemics; sensitivity estimates for nonlinear mathematical models; ensemble modeling of metabolic networks; on analytical approaches to epidemics on networks; computational modeling in systems biology; collective dynamics of 'small-world' networks; unifying wildfire models from ecology and statistical physics
key: cord-314498-zwq67aph title: smart business networks: concepts and empirical evidence date: 2009-05-15 journal: decis support syst doi: 10.1016/j.dss.2009.05.002 sha: doc_id: 314498 cord_uid: zwq67aph
organizations are moving, or must move, from today's relatively stable and slow-moving business networks to an open digital platform where business is conducted across a rapidly-formed network with anyone, anywhere, anytime despite different business processes and computer systems. table 1 provides an overview of the characteristics of the traditional and new business network approaches [2]. the disadvantages and associated costs of the more traditional approaches are caused by the inability to provide relatively complex, bundled, and rapidly delivered products and services. the potential of the new business network approach is to create these types of products and services by combining business network insights with telecommunication capabilities. the "business" is no longer a self-contained organization working together with closely coupled partners. it is a participant in a number of networks where it may lead or act together with others. the "network" takes on additional layers of meaning, from the ict infrastructures to the interactions between businesses and individuals. rather than viewing the business as a sequential chain of events (a value chain), actors in a smart business network seek linkages that are novel and different, creating remarkable, "better than usual" results.
"smart" has a connotation with fashionable and distinguished and also with short-lived: what is smart today will be considered common tomorrow. "smart" is therefore a relative rather than an absolute term. smartness means that the network of co-operating businesses can create "better" results than other, less smart, business networks or other forms of business arrangement. to be "smart in business" is to be smarter than the competitors just as an athlete who is considered fast means is faster than the others. the pivotal question of smart business networks concerns the relationship between the strategy and structure of the business network on one hand and the underlying infrastructure on the other. as new technologies, such as rfid, allow networks of organizations almost complete insight into where its people, materials, suppliers and customers are at any point in time, it is able to organize differently. but if all other players in the network space have that same insight, the result of the interactions may not be competitive. therefore it is necessary to develop a profound understanding about the functioning of these types of business networks and its impact on networked decision making and decision support systems. the key characteristics of a smart business network are that it has the ability to "rapidly pick, plug, and play" to configure rapidly to meet a specific objective, for example, to react to a customer order or an unexpected situation (for example dealing with emergencies) [4] . one might regard a smart business network as an expectant web of participants ready to jump into action (pick) and combine rapidly (plug) to meet the requirements of a specific situation (play). on completion they are dispersed to "rest" while, perhaps, being active in other business networks or more traditional supply chains. this combination of "pick, plug, play and disperse" means that the fundamental organizing capabilities for a smart business network are: (1) the ability for quick connect and disconnect with an actor; (2) the selection and execution of business processes across the network; and (3) establishing the decision rules and the embedded logic within the business network. we have organized in june 2006 the second sbni discovery session that attracted both academics and executives to analyze and discover the smartness of business networks [1] . we received 32 submissions and four papers were chosen as the best papers that are suitable for this special issue. the four papers put forward new insights about the concept of smart business networks and also provide empirical evidence about the functioning and outcome of these business networks and its potential impact on networked decision making and decision support systems. the first paper deals with the fundamental organizing ability to "rapidly pick, plug, and play" to configure rapidly to meet a specific objective, in this case to find a solution to stop the outbreak of the severe acute respiratory syndrome (sars) virus. peter van baalen and paul van fenema show how the instantiation of a global crisis network of laboratories around the world cooperated and competed to find out how this deadly virus is working. the second paper deals with the business network as orchestrated by the spanish grupo multiasistencia. javier busquets, juan rodón, and jonathan wareham show how the smart business network approach with embedded business processes lead to substantial business advantages. 
the paper also shows the importance of information sharing in the business network and the design and set-up of the decision support and infrastructure. the third paper focuses on how buyer-seller relationships in online markets develop over time, e.g., how even in market relationships buyers and sellers connect (to form a contract and legal relationship), disconnect (by finishing the transaction) and later come back to each other (and form a relationship again). ulad radkevitch, eric van heck, and otto koppius identify four types of clusters in an online market for it services. empirical evidence reveals that these four portfolio clusters rely on either arm's-length relationships supported by reverse auctions, recurrent buying with negotiations, or a mixed mode using both exchange mechanisms almost equally (two clusters). the fourth paper puts forward the role and impact of intelligent agents and machine learning in networks and markets. the capability of agents to quickly execute tasks with other agents and systems offers business networks a potentially sustainable and profitable strategy to act faster and better. wolf ketter, john collins, maria gini, alok gupta, and paul schrater identify how agents are able to learn from historical data and detect different economic regimes, such as under-supply and over-supply in markets. therefore, agents are able to characterize the economic regimes of markets and forecast the next, future regime in the market to facilitate tactical and strategic decision making. they provide empirical evidence from the analysis of the trading agent competition for supply chain management (tac scm). we identify three important potential directions for future research. the first research stream deals with advanced network orchestration with distributed control and decision making. the first two papers indicate that network orchestration is a critical component of successful business networks. research on intelligent agents shows that distributed and decentralized decision making might provide smart solutions because it combines the local knowledge of actors and agents in the network with coordination and control of the network as a whole. agents can help to reveal business rules in business networks or proactively gather new knowledge about the business network, and they will empower the next generation of decision support systems. the second research stream deals with information sharing over and with network partners. for example, diederik van liere explores in his phd dissertation the concept of the "network horizon": the number of nodes that an actor can "see" from a specific position in the network [3]. most companies have a network horizon of "1": they know and exchange information with their suppliers and customers. however, what about the supplier of their suppliers, or the customer of their customers? then one develops a network horizon of "2". diederik van liere provides empirical evidence that with a larger network horizon a company can take a more advantageous network position, depending on the distribution of the network horizons across all actors and up to a certain saturation point. the results indicate that the expansion of the network horizon will be a crucial success factor for companies in the near future. future research will shed more light on this type of network analysis and its impact on network performance. the third research stream will focus on the network platform with a networked business operating system (bos).
most network scientists analyze the structure and dynamics of business networks independently of the technologies that enable them to perform. this line of work concentrates on what makes the network effective, the linked relationships between the actors, and how their intelligence is combined to reach the network's goals. digital technologies play a fundamental role in today's networks. they have facilitated improvements and fundamental changes in the ways in which organizations and individuals interact and combine, as well as revealing unexpected capabilities that create new markets and opportunities. the introduction of new networked business operating systems will become feasible, and these operating systems will go beyond the networked linking of traditional enterprise resource planning (erp) systems with customer relationship management (crm) software packages. implementation of a bos enables the portability of business processes and facilitates the end-to-end management of processes running across many different organizations in many different forms. it coordinates the processes among the networked businesses and its logic is embedded in the systems used by these businesses.
references: smart business network initiative; smart business networks: how the network wins; network horizon and dynamics of network positions
eric van heck holds the chair of information management and markets at rotterdam school of management, erasmus university, where he is conducting research and teaching on the strategic and operational use of information technologies for companies and markets. peter vervest is professor of business networks at the rotterdam school of management, erasmus university, and partner of d-age, corporate counsellors and investment managers for digital age companies.
firstly, we would like to thank the participants of the 2006 sbni discovery session that was held at the vanenburg castle in putten, the netherlands. inspiring sessions among academics and executives shed light on the characteristics and the functioning of smart business networks. secondly, we thank the reviewers of the papers for all their excellent reviews. we had an intensive review process and would like to thank the authors for their perseverance and hard work to create an excellent contribution to this special issue. we thank kevin desouza, max egenhofer, ali farhoomand, erwin fielt, shirley gregor, lorike hagdorn, chris holland, benn konsynski, kenny preiss, amrit tiwana, jacques trienekens, and dj wu for their excellent help in reviewing the papers. thirdly, we thank andy whinston for creating the opportunity to prepare this special issue of decision support systems on smart business networks.
second, we discuss how the integration and analysis of multiple types of omics data through integrative bioinformatics may facilitate the determination of more predictive and robust disease signatures, lead to a better understanding of (patho)physiological molecular mechanisms, and facilitate personalized medicine. third, we focus on network analysis and discuss how gene networks can be constructed from omics data and how these networks can be decomposed into smaller modules. we discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, and lead to predictive models. throughout, we provide several examples demonstrating how bioinformatics contributes to systems medicine and discuss future challenges in bioinformatics that need to be addressed to enable the advancement of systems medicine. systems medicine fi nds its roots in systems biology, the scientifi c discipline that aims at a systems-level understanding of, for example, biological networks, cells, organs, organisms, and populations. it generally involves a combination of wet-lab experiments and computational (bioinformatics) approaches. systems medicine extends systems biology by focusing on the application of systems-based approaches to clinically relevant applications in order to improve patient health or the overall well-being of (healthy) individuals [ 1 ] . systems medicine is expected to change health care practice in the coming years. it will contribute to new therapeutics through the identifi cation of novel disease genes that provide drug candidates less likely to fail in clinical studies [ 2 , 3 ] . it is also expected to contribute to fundamental insights into networks perturbed by disease, improved prediction of disease progression, stratifi cation of disease subtypes, personalized treatment selection, and prevention of disease. to enable systems medicine it is necessary to characterize the patient at various levels and, consequently, to collect, integrate, and analyze various types of data including not only clinical (phenotype) and molecular data, but also information about cells (e.g., disease-related alterations in organelle morphology), organs (e.g., lung impedance when studying respiratory disorders such as asthma or chronic obstructive pulmonary disease), and even social networks. the full realization of systems medicine therefore requires the integration and analysis of environmental, genetic, physiological, and molecular factors at different temporal and spatial scales, which currently is very challenging. it will require large efforts from various research communities to overcome current experimental, computational, and information management related barriers. in this chapter we show how bioinformatics is an essential part of systems medicine and discuss some of the future challenges that need to be solved. to understand the contribution of bioinformatics to systems medicine, it is helpful to consider the traditional role of bioinformatics in biomedical research, which involves basic and applied (translational) research to augment our understanding of (molecular) processes in health and disease. the term "bioinformatics" was fi rst coined by the dutch theoretical biologist paulien hogeweg in 1970 to refer to the study of information processes in biotic systems [ 4 ] . soon, the fi eld of bioinformatics expanded and bioinformatics efforts accelerated and matured as the fi rst (whole) genome and protein sequences became available. 
the signifi cance of bioinformatics further increased with the development of highthroughput experimental technologies that allowed wet-lab researchers to perform large-scale measurements. these include determining whole-genome sequences (and gene variants) and genome-wide gene expression with next-generation sequencing technologies (ngs; see table 1 for abbreviations and web links) [ 5 ] , measuring gene expression with dna microarrays [ 6 ] , identifying and quantifying proteins and metabolites with nmr or (lc/ gc-) ms [ 7 ] , measuring epigenetic changes such as methylation and histone modifi cations [ 8 ] , and so on. these, "omics" technologies, are capable of measuring the many molecular building blocks that determine our (patho)physiology. genome-wide measurements have not only signifi cantly advanced our fundamental understanding of the molecular biology of health and disease but table 1 abbreviations and websites have also contributed to new (commercial) diagnostic and prognostic tests [ 9 , 10 ] and the selection and development of (personalized) treatment [ 11 ] . nowadays, bioinformatics is therefore defi ned as "advancing the scientifi c understanding of living systems through computation" (iscb), or more inclusively as "conceptualizing biology in terms of molecules and applying 'informatics techniques' (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale" [ 12 ] . it is worth noting that solely measuring many molecular components of a biological system does not necessarily result in a deeper understanding of such a system. understanding biological function does indeed require detailed insight into the precise function of these components but, more importantly, it requires a thorough understanding of their static, temporal, and spatial interactions. these interaction networks underlie all (patho)physiological processes, and elucidation of these networks is a major task for bioinformatics and systems medicine . the developments in experimental technologies have led to challenges that require additional expertise and new skills for biomedical researchers: • information management. modern biomedical research projects typically produce large and complex omics data sets , sometimes in the order of hundreds of gigabytes to terabytes of which a large part has become available through public databases [ 13 , 14 ] sometimes even prior to publication (e.g., gtex, icgc, tcga). this not only contributes to knowledge dissemination but also facilitates reanalysis and metaanalysis of data, evaluation of hypotheses that were not considered by the original research group, and development and evaluation of new bioinformatics methods. the use of existing data can in some cases even make new (expensive) experiments superfl uous. alternatively, one can integrate publicly available data with data generated in-house for more comprehensive analyses, or to validate results [ 15 ] . in addition, the obligation of making raw data available may prevent fraud and selective reporting. the management (transfer, storage, annotation, and integration) of data and associated meta-data is one of the main and increasing challenges in bioinformatics that needs attention to safeguard the progression of systems medicine. • data analysis and interpretation . 
bioinformatics data analysis and interpretation of omics data have become increasingly complex, not only due to the vast volumes and complexity of the data but also as a result of more challenging research questions. bioinformatics covers many types of analyses including nucleotide and protein sequence analysis, elucidation of tertiary protein structures, quality control, pre-processing and statistical analysis of omics data, determination of genotype-phenotype relationships, biomarker identification, evolutionary analysis, analysis of gene regulation, reconstruction of biological networks, text mining of literature and electronic patient records, and analysis of imaging data. in addition, bioinformatics has developed approaches to improve the experimental design of omics experiments to ensure that the maximum amount of information can be extracted from the data. many of the methods developed in these areas are of direct relevance for systems medicine as exemplified in this chapter. clearly, new experimental technologies have to a large extent turned biomedical research into a data- and compute-intensive endeavor. it has been argued that production of omics data has nowadays become the "easy" part of biomedical research, whereas the real challenges currently comprise information management and bioinformatics analysis. consequently, next to the wet-lab, the computer has become one of the main tools of the biomedical researcher. bioinformatics enables and advances the management and analysis of large omics-based datasets, thereby directly and indirectly contributing to systems medicine in several ways (fig. 1): 3. quality control and pre-processing of omics data. pre-processing typically involves data cleaning (e.g., removal of failed assays) and other steps to obtain quantitative measurements that can be used in downstream data analysis. 4. (statistical) data analysis methods of large and complex omics-based datasets. this includes methods for the integrative analysis of multiple omics data types (subheading 5), and for the elucidation and analysis of biological networks (top-down systems medicine; subheading 6). systems medicine comprises top-down and bottom-up approaches. the former represents a specific branch of bioinformatics, which distinguishes itself from bottom-up approaches in several ways [3, 19, 20]. top-down approaches use omics data to obtain a holistic view of the components of a biological system and, in general, aim to construct system-wide static functional or physical interaction networks such as gene co-expression networks and protein-protein interaction networks. in contrast, bottom-up approaches aim to develop detailed mechanistic and quantitative mathematical models for sub-systems. these models describe the dynamic and nonlinear behavior of interactions between known components to understand and predict their behavior upon perturbation. however, in contrast to omics-based top-down approaches, these mechanistic models require information about chemical/physical parameters and reaction stoichiometry, which may not be available and require further (experimental) efforts. both the top-down and bottom-up approaches result in testable hypotheses and new wet-lab or in silico experiments that may lead to clinically relevant findings. biomedical research and, consequently, systems medicine are increasingly confronted with the management of continuously growing volumes of molecular and clinical data, results of data analyses and in silico experiments, and mathematical models. due
due fig. 1 the contribution of bioinformatics ( dark grey boxes ) to systems medicine ( black box ). (omics) experiments, patients, and public repositories provide a wide range of data that is used in bioinformatics and systems medicine studies to policies of scientifi c journals and funding agencies, omics data is often made available to the research community via public databases. in addition, a wide range of databases have been developed, of which more than 1550 are currently listed in the molecular biology database collection [ 14 ] providing a rich source of biomedical information. biological repositories do not merely archive data and models but also serve a range of purposes in systems medicine as illustrated below from a few selected examples. the main repositories are hosted and maintained by the major bioinformatics institutes including ebi, ncbi, and sib that make a major part of the raw experimental omics data available through a number of primary databases including genbank [ 21 ] , geo [ 22 ] , pride [ 23 ] , and metabolights [ 24 ] for sequence, gene expression, ms-based proteomics, and ms-based metabolomics data, respectively. in addition, many secondary databases provide information derived from the processing of primary data, for example pathway databases (e.g., reactome [ 25 ] , kegg [ 26 ] ), protein sequence databases (e.g., uniprotkb [ 27 ] ), and many others. pathway databases provide an important resource to construct mathematical models used to study and further refi ne biological systems [ 28 , 29 ] . other efforts focus on establishing repositories integrating information from multiple public databases. the integration of pathway databases [ 30 -32 ] , and genome browsers that integrate genetic, omics, and other data with whole-genome sequences [ 33 , 34 ] are two examples of this. joint initiatives of the bioinformatics and systems biology communities resulted in repositories such as biomodels, which contains mathematical models of biochemical and cellular systems [ 35 ] , recon 2 that provides a communitydriven, consensus " metabolic reconstruction " of human metabolism suitable for computational modelling [ 36 ] , and seek, which provides a platform designed for the management and exchange of systems biology data and models [ 37 ] . another example of a database that may prove to be of value for systems medicine studies is malacards , an integrated and annotated compendium of about 17,000 human diseases [ 38 ] . malacards integrates 44 disease sources into disease cards and establishes gene-disease associations through integration with the well-known genecards databases [ 39 , 40 ] . integration with genecards and cross-references within malacards enables the construction of networks of related diseases revealing previously unknown interconnections among diseases, which may be used to identify drugs for off-label use. another class of repositories are (expert-curated) knowledge bases containing domain knowledge and data, which aim to provide a single point of entry for a specifi c domain. contents of these knowledge bases are often based on information extracted (either manually or by text mining) from literature or provided by domain experts [ 41 -43 ] . finally, databases are used routinely in the analysis, interpretation, and validation of experimental data. 
as an example of such routine use, the gene ontology (go) provides a controlled vocabulary of terms for describing gene products, and is often used in gene set analysis to evaluate expression patterns of groups of genes instead of those of individual genes [44]; it has, for example, been applied to investigate hiv-related cognitive disorders [45] and polycystic kidney disease [46]. several repositories such as mir2disease [47], peroxisomedb [41], and mouse genome informatics (mgi) [43] include associations between genes and disorders, but only provide very limited phenotypic information. phenotype databases are of particular interest to systems medicine. one well-known phenotype repository is the omim database, which primarily describes single-gene (mendelian) disorders [48]. clinvar is another example and provides an archive of reports and evidence of the relationships among medically important human variations found in patient samples and phenotypes [49]. clinvar complements dbsnp (for single-nucleotide polymorphisms) [50] and dbvar (for structural variations) [51], which both provide only minimal phenotypic information. the integration of these phenotype repositories with genetic and other molecular information will be a major aim for bioinformatics in the coming decade enabling, for example, the identification of comorbidities, determination of associations between gene (mutations) and disease, and improvement of disease classifications [52]. it will also advance the definition of the "human phenome," i.e., the set of phenotypes resulting from genetic variation in the human genome. to increase the quality and (clinical) utility of the phenotype and variant databases as an essential step towards reducing the burden of human genetic disease, the human variome project coordinates efforts in standardization, system development, and (training) infrastructure for the worldwide collection and sharing of genetic variations that affect human health [53, 54]. to implement and advance systems medicine to the benefit of patients' health, it is crucial to integrate and analyze molecular data together with de-identified individual-level clinical data complementing general phenotype descriptions. patient clinical data refers to a wide variety of data including basic patient information (e.g., age, sex, ethnicity), outcomes of physical examinations, patient history, medical diagnoses, treatments, laboratory tests, pathology reports, medical images, and other clinical outcomes. inclusion of clinical data allows the stratification of patient groups into more homogeneous clinical subgroups. availability of clinical data will increase the power of downstream data analysis and modeling to elucidate molecular mechanisms, and to identify molecular biomarkers that predict disease onset or progression, or which guide treatment selection. in biomedical studies clinical information is generally used as part of patient and sample selection, but some omics studies also use clinical data as part of the bioinformatics analysis (e.g., [9, 55]). however, in general, clinical data is unavailable from public resources or only provided on an aggregated level. although good reasons exist for making clinical data available (subheading 2.2), ethical and legal issues comprising patient and commercial confidentiality, and technical issues are the most immediate challenges [56, 57].
this potentially hampers the development of systems medicine approaches in a clinical setting since sharing and integration of clinical and nonclinical data is considered a basic requirement [1]. biobanks [58] such as bbmri [59] provide a potential source of biological material and associated (clinical) data but these are, generally, not publicly accessible, although permission to access data may be requested from the biobank provider. clinical trials provide another source of clinical data for systems medicine studies, but these are generally owned by a research group or sponsor and not freely available [60], although ongoing discussions may change this in the future ([61] and references therein). although clinical data is not yet available on a large scale, the bioinformatics and medical informatics communities have been very active in establishing repositories that provide clinical data. one example is the database of genotypes and phenotypes (dbgap) [62] developed by the ncbi. study metadata, summary-level (phenotype) data, and documents related to studies are publicly available. access to de-identified individual-level (clinical) data is only granted after approval by an nih data access committee. another example is the cancer genome atlas (tcga), which also provides individual-level molecular and clinical data through its own portal and the cancer genomics hub (cghub). clinical data from tcga is available without any restrictions but part of the lower level sequencing and microarray data can only be obtained through a formal request managed by dbgap. medical patient records provide an even richer source of phenotypic information, and have already been used to stratify patient groups, discover disease relations and comorbidity, and integrate these records with molecular data to obtain a systems-level view of phenotypes (for a review see [63]). on the one hand, this integration facilitates refinement and analysis of the human phenome to, for example, identify diseases that are clinically uniform but have different underlying molecular mechanisms, or which share a pathogenetic mechanism but with different genetic cause [64]. on the other hand, using the same data, a phenome-wide association study (phewas) [65] would allow the identification of unrelated phenotypes associated with specific shared genetic variant(s), an effect referred to as pleiotropy. moreover, it makes use of information from medical records generated in routine clinical practice and, consequently, has the potential to strengthen the link between biomedical research and clinical practice [66]. the power of phenome analysis was demonstrated in a study involving 1.5 million patient records, not including genotype information, comprising 161 disorders. in this study it was shown that disease phenotypes form a highly connected network suggesting a shared genetic basis [67]. indeed, later studies that incorporated genetic data resulted in similar findings and confirmed a shared genetic basis for a number of different phenotypes. for example, a recent study identified 63 potentially pleiotropic associations through the analysis of 3144 snps that had previously been implicated by genome-wide association studies (gwas) as mediators of human traits, and 1358 phenotypes derived from patient records of 13,835 individuals [68].
this demonstrates that phenotypic information extracted manually or through text mining from patient records can help to more precisely define (relations between) diseases. another example comprises the text mining of psychiatric patient records to discover disease correlations [52]. here, mapping of disease genes from the omim database to information from medical records resulted in protein networks suspected to be involved in psychiatric diseases. integrative bioinformatics comprises the integrative (statistical) analysis of multiple omics data types. many studies demonstrated that using a single omics technology to measure a specific molecular level (e.g., dna variation, expression of genes and proteins, metabolite concentrations, epigenetic modifications) already provides a wealth of information that can be used for unraveling molecular mechanisms underlying disease. moreover, single-omics disease signatures which combine multiple (e.g., gene expression) markers have been constructed to differentiate between disease subtypes to support diagnosis and prognosis. however, no single technology can reveal the full complexity and details of molecular networks observed in health and disease, due to the many interactions across these levels. a systems medicine strategy should ideally aim to understand the functioning of the different levels as a whole by integrating different types of omics data. this is expected to lead to biomarkers with higher predictive value, and to novel disease insights that may help to prevent disease and to develop new therapeutic approaches. integrative bioinformatics can also facilitate the prioritization and characterization of genetic variants associated with complex human diseases and traits identified by gwas, in which hundreds of thousands to over a million snps are assayed in a large number of individuals. although such studies lack the statistical power to identify all disease-associated loci [69], they have been instrumental in identifying loci for many common diseases. however, it remains difficult to prioritize the identified variants and to elucidate their effect on downstream pathways ultimately leading to disease [70]. consequently, methods have been developed to prioritize candidate snps based on integration with other (omics) data such as gene expression, dnase hypersensitive sites, histone modifications, and transcription factor-binding sites [71]. the integration of multiple omics data types is far from trivial and various approaches have been proposed [72-74]. one approach is to link different types of omics measurements through common database identifiers. although this may seem straightforward, in practice it is complicated as a result of technical and standardization issues as well as a lack of biological consensus [32, 75-77]. moreover, the integration of data at the level of the central dogma of molecular biology with, for example, metabolite data is even more challenging due to the indirect relationships between genes, transcripts, and proteins on the one hand and metabolites on the other hand, precluding direct links between the database identifiers of these molecules. statistical data integration [72] is a second commonly applied strategy, and various approaches have been applied for the joint analysis of multiple data types (e.g., [78, 79]). one example of statistical data integration is provided by a tcga study that measured various types of omics data to characterize breast cancer [80].
in this study 466 breast cancer samples were subjected to whole-genome and -exome sequencing, and to snp arrays, to obtain information about somatic mutations, copy number variations, and chromosomal rearrangements. microarrays and rna-seq were used to determine mrna and microrna expression levels, respectively. reverse-phase protein arrays (rppa) and dna methylation arrays were used to obtain data on protein expression levels and dna methylation, respectively. simultaneous statistical analysis of different data types via a "cluster-of-clusters" approach, using consensus clustering on a multi-omics data matrix, revealed that four major breast cancer subtypes could be identified. this showed that the intrinsic subtypes (basal, luminal a and b, her2) that had previously been determined using gene expression data only could be largely confirmed in an integrated analysis of a large number of breast tumors. single-level omics data has extensively been used to identify disease-associated biomarkers such as genes, proteins, and metabolites. in fact, these studies led to more than 150,000 papers documenting thousands of claimed biomarkers; however, it is estimated that fewer than 100 of these are currently used in routine clinical practice [81]. integration of multiple omics data types is expected to result in more robust and predictive disease profiles, since these better reflect disease biology [82]. further improvement of these profiles may be obtained through the explicit incorporation of interrelationships between various types of measurements, such as microrna-mrna target relations, or gene methylation-microrna relations (based on a common target gene). this was demonstrated for the prediction of short-term and long-term survival from serous cystadenocarcinoma tcga data [83]. according to the recent casym roadmap: "human disease can be perceived as perturbations of complex, integrated genetic, molecular and cellular networks and such complexity necessitates a new approach." [84]. in this section we discuss how (approximations to) these networks can be constructed from omics data and how these networks can be decomposed into smaller modules. then we discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, lead to predictive diagnostic and prognostic models, and help to further subclassify diseases [55, 85] (fig. 2). network-based approaches will provide medical doctors with molecular-level support to make personalized treatment decisions. in a top-down approach the aim of network reconstruction is to infer the connections between the molecules that constitute a biological network. network models can be created using a variety of mathematical and statistical techniques and data types. early approaches for network inference (also called reverse engineering) used only gene expression data to reconstruct gene networks. here, we discern three types of gene network inference algorithms, using methods based on (1) correlation-based approaches, (2) information-theoretic approaches, and (3) bayesian networks [86]. co-expression networks are an extension of commonly used clustering techniques, in which genes are connected by edges in a network if the amount of correlation of their gene expression profiles exceeds a certain value. co-expression networks have been shown to connect functionally related genes [87].
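the "cluster-of-clusters" analysis mentioned above is a form of consensus clustering. a minimal sketch of the idea is given below, assuming a synthetic multi-omics matrix obtained by concatenating standardized feature blocks; the subsampling fraction, number of runs and clustering algorithm are illustrative choices, not those of the tcga analysis.

```python
# consensus ("cluster-of-clusters") sketch: repeatedly cluster subsamples,
# record how often each pair of samples co-clusters, then cluster the
# resulting consensus matrix. purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n_samples, k, n_runs = 100, 4, 50

# synthetic "multi-omics" matrix: two standardized feature blocks concatenated
X = np.hstack([rng.normal(size=(n_samples, 200)), rng.normal(size=(n_samples, 50))])

co_cluster = np.zeros((n_samples, n_samples))
counts = np.zeros((n_samples, n_samples))
for run in range(n_runs):
    idx = rng.choice(n_samples, size=int(0.8 * n_samples), replace=False)
    labels = KMeans(n_clusters=k, n_init=10, random_state=run).fit_predict(X[idx])
    for a_pos, a in enumerate(idx):
        for b_pos, b in enumerate(idx):
            counts[a, b] += 1
            if labels[a_pos] == labels[b_pos]:
                co_cluster[a, b] += 1

consensus = np.divide(co_cluster, counts, out=np.zeros_like(co_cluster), where=counts > 0)
np.fill_diagonal(consensus, 1.0)
# hierarchical clustering on 1 - consensus, used as a distance matrix
dist = squareform(1.0 - consensus, checks=False)
final_labels = fcluster(linkage(dist, method="average"), t=k, criterion="maxclust")
print("consensus cluster sizes:", np.bincount(final_labels)[1:])
```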
note that connections in a co-expression network correspond to either direct (e.g., transcription factor-gene and protein-protein) or indirect (e.g., proteins participating in the same pathway) interactions. in one of the earliest examples of this approach, pair-wise correlations were calculated between gene expression profiles and the level of growth inhibition caused by thousands of tested anticancer agents, for 60 cancer cell lines [88]. removal of associations weaker than a certain threshold value resulted in networks consisting of highly correlated genes and agents, called relevance networks, which led to targeted hypotheses for potential single-gene determinants of chemotherapeutic susceptibility. information-theoretic approaches have been proposed in order to capture nonlinear dependencies assumed to be present in most biological systems and that cannot be captured by correlation-based distance measures. these approaches often use the concept of mutual information, a generalization of the correlation coefficient which quantifies the degree of statistical (in)dependence. an example of a network inference method that is based on mutual information is aracne, which has been used to reconstruct the human b-cell gene network from a large compendium of human b-cell gene expression profiles [89]. in order to discover regulatory interactions, aracne removes the majority of putative indirect interactions from the initial mutual information-based gene network using a theorem from information theory, the data processing inequality. this led to the identification of myc as a major hub in the b-cell gene network and of a number of novel myc target genes, which were experimentally validated. whether information-theoretic approaches are more powerful in general than correlation-based approaches is still a subject of debate [90]. bayesian networks allow the description of statistical dependencies between variables in a generic way [91, 92]. bayesian networks are directed acyclic networks in which the edges of the network represent conditional dependencies; that is, nodes that are not connected represent variables that are conditionally independent of each other. a major bottleneck in the reconstruction of bayesian networks is their computational complexity. moreover, bayesian networks are acyclic and cannot capture the feedback loops that characterize many biological networks. when time-series rather than steady-state data is available, dynamic bayesian networks provide a richer framework in which cyclic networks can be reconstructed [93]. gene (co-)expression data only offers a partial view on the full complexity of cellular networks. consequently, networks have also been constructed from other types of high-throughput data. for example, physical protein-protein interactions have been measured on a large scale in different organisms, including human, using affinity capture-mass spectrometry or yeast two-hybrid screens, and have been made available in public databases such as biogrid [94]. regulatory interactions have been probed using chromatin immunoprecipitation sequencing (chip-seq) experiments, for example by the encode consortium [95]. using probabilistic techniques, heterogeneous types of experimental evidence and prior knowledge have been integrated to construct functional association networks for human [96], mouse [97], and, most comprehensively, more than 1100 organisms in the string database [98].
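a compact sketch of the two ideas just described follows: build a relevance network by thresholding a pairwise association measure, then prune presumed indirect edges with an aracne-style data processing inequality step. absolute correlation is used here as a stand-in for mutual information, and the data and threshold are invented.

```python
# relevance network + aracne-style pruning (illustrative sketch).
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples = 30, 200
expr = rng.normal(size=(n_genes, n_samples))

# association matrix: absolute pearson correlation as a simple stand-in
# for mutual information
assoc = np.abs(np.corrcoef(expr))
np.fill_diagonal(assoc, 0.0)

threshold = 0.2
edges = {(i, j) for i in range(n_genes) for j in range(i + 1, n_genes)
         if assoc[i, j] >= threshold}

# data processing inequality: in every fully connected triplet, drop the
# weakest edge, assumed to reflect an indirect interaction
to_remove = set()
for i, j, k in itertools.combinations(range(n_genes), 3):
    triplet = [(i, j), (i, k), (j, k)]
    if all(e in edges for e in triplet):
        weakest = min(triplet, key=lambda e: assoc[e])
        to_remove.add(weakest)

pruned = edges - to_remove
print(f"edges before DPI: {len(edges)}, after DPI: {len(pruned)}")
```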
functional association networks can help predict novel pathway components, generate hypotheses for the biological functions of a protein of interest, or identify disease-related genes [97]. the prior knowledge required for these approaches is, for example, available in curated biological pathway databases, and via protein associations predicted using text mining, based on their co-occurrence in abstracts or even full-text articles. many more integrative network inference methods have been proposed; for a review see [99]. the integration of gene expression data with chip data [100] or transcription factor-binding motif data [101] has proven particularly fruitful for inferring transcriptional regulatory networks. recently, li et al. [102] described the results from a regression-based model that predicts gene expression using encode (chip-seq) and tcga data (mrna expression data complemented with copy number variation, dna methylation, and microrna expression data). this model infers the regulatory activities of expression regulators and their target genes in acute myeloid leukemia samples. eighteen key regulators were identified, whose activities clustered consistently with cytogenetic risk groups. bayesian networks have also been used to integrate multi-omics data. the combination of genotypic and gene expression data is particularly powerful, since dna variations represent naturally occurring perturbations that affect gene expression, detected as expression quantitative trait loci (eqtl). cis-acting eqtls can then be used as constraints in the construction of directed bayesian networks to infer causal relationships between nodes in the network [103]. large multi-omics datasets consisting of hundreds or sometimes even thousands of samples are available for many commonly occurring human diseases, such as most tumor types (tcga), alzheimer's disease [104], and obesity [105]. however, a major bottleneck for the construction of accurate gene networks is that the number of gene networks that are compatible with the experimental data is several orders of magnitude larger still. in other words, top-down network inference is an underdetermined problem with many possible solutions that explain the data equally well, and individual gene-gene interactions are characterized by a high false-positive rate [99]. most network inference methods therefore try to constrain the number of possible solutions by making certain assumptions about the structure of the network. perhaps the most commonly used strategy to harness the complexity of the gene network inference problem is to analyze experimental data in terms of biological modules, that is, sets of genes that have strong interactions and a common function [106]. there is considerable evidence that many biological networks are modular [107]. module-based approaches effectively constrain the number of parameters to estimate and are in general also more robust to the noise that characterizes high-throughput omics measurements. a detailed review of module-based techniques is outside the scope of this chapter (see, for example, [108]), but we would like to mention a few examples of successful and commonly used modular approaches. weighted gene co-expression network analysis (wgcna) decomposes a co-expression network into modules using clustering techniques [109]. modules can be summarized by their module eigengene, a weighted average expression profile of all gene members of a given module.
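in wgcna the module eigengene is, concretely, the first principal component of the standardized expression profiles of the module's genes. the sketch below illustrates this on synthetic data (it does not call the wgcna package itself) and also shows the correlation of the eigengene with an external sample trait.

```python
# module eigengene sketch: summarize a co-expression module by the first
# principal component of its member genes' standardized expression profiles.
import numpy as np

def module_eigengene(expr_module):
    """expr_module: genes x samples array for one module."""
    z = (expr_module - expr_module.mean(axis=1, keepdims=True)) / expr_module.std(axis=1, keepdims=True)
    # svd of the samples-by-genes matrix; first left singular vector spans samples
    u, s, vt = np.linalg.svd(z.T, full_matrices=False)
    eigengene = u[:, 0]                      # one value per sample
    # orient the eigengene so it correlates positively with average expression
    if np.corrcoef(eigengene, z.mean(axis=0))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

rng = np.random.default_rng(3)
shared = rng.normal(size=100)                          # latent module activity over 100 samples
module = shared + 0.5 * rng.normal(size=(40, 100))     # 40 co-expressed genes
eg = module_eigengene(module)
trait = shared + rng.normal(size=100)                  # external sample trait
print("eigengene-trait correlation:", round(np.corrcoef(eg, trait)[0, 1], 2))
```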
eigengenes can then be correlated with external sample traits to identify modules that are related to these traits. parikshak et al. [110] used wgcna to extract modules from a co-expression network constructed using fetal and early postnatal brain development expression data. next, they established that several of these modules were enriched for genes and rare de novo variants implicated in autism spectrum disorder (asd). moreover, the asd-associated modules are also linked at the transcriptional level, and 17 transcription factors were found acting as putative co-regulators of asd-associated gene modules during neocortical development. wgcna can also be used when multiple omics data types are available. one example of such an approach involved the integration of transcriptomic and proteomic data from a study investigating the response to sars-cov infection in mice [111]. in this study wgcna-based gene and protein co-expression modules were constructed and integrated to obtain module-based disease signatures. interestingly, the authors found several cases of identifier-matched transcripts and proteins that correlated well with the phenotype, but which showed poor correlation or anticorrelation across these two data types. moreover, the highest correlating transcripts and peptides were not the most central ones in the co-expression modules. vice versa, the transcripts and proteins that defined the modules were not those with the highest correlation to the phenotype. at the very least this shows that integration of omics data affects the nature of the disease signatures. identification of active modules is another important integrative modular technique. here, experimental data in the form of molecular profiles is projected onto a biological network, for example a protein-protein interaction network. active modules are those subnetworks that show the largest change in expression for a subset of conditions, and are likely to contain key drivers or regulators of the processes perturbed in the experiment. active modules have, for example, been used to find a subnetwork that is overexpressed in a particularly aggressive lymphoma subtype [112] and to detect significantly mutated pathways [113]. some active module approaches integrate various types of omics data. one example of such an approach is paradigm [114], which translates pathways into factor graphs, a class of models that belongs to the same family of models as bayesian networks, and determines sample-specific pathway activity from multiple functional genomic datasets. paradigm has been used in several tcga projects, for example in the integrated analysis of 131 urothelial bladder carcinomas [55]. paradigm-based analysis of copy number variations and rna-seq gene expression, in combination with a propagation-based network analysis algorithm, revealed novel associations between mutations and gene expression levels, which subsequently resulted in the identification of pathways altered in bladder cancer. the identification of activating or inhibiting gene mutations in these pathways suggested new targets for treatment. moreover, this effort clearly showed the benefits of screening patients for the presence of specific mutations to enable personalized treatment strategies. often, published disease signatures cannot be replicated [81] or provide hardly any additional biological insight. here too, (modular) network-based approaches have been proposed to alleviate these problems.
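a toy version of the active-module idea is sketched below: per-gene activity scores are projected onto an interaction network and a connected subnetwork with high scores is grown greedily from the best-scoring seed. the scoring, stopping rule and random graph are deliberately simple stand-ins for the dedicated methods cited above.

```python
# greedy active-module search on a protein-protein-interaction-like network.
import networkx as nx
import numpy as np

rng = np.random.default_rng(4)
G = nx.erdos_renyi_graph(200, 0.03, seed=4)
# per-gene activity score, e.g. -log10 p-value from a differential expression test
scores = {n: rng.exponential(1.0) for n in G.nodes}
for n in list(G.nodes)[:10]:          # plant a high-scoring region
    scores[n] += 3.0

def grow_module(G, scores, seed, max_size=15):
    module = {seed}
    while len(module) < max_size:
        frontier = set(nx.node_boundary(G, module))
        if not frontier:
            break
        best = max(frontier, key=lambda n: scores[n])
        if scores[best] < 1.0:        # stop when no strong neighbor remains
            break
        module.add(best)
    return module

seed = max(scores, key=scores.get)
module = grow_module(G, scores, seed)
print("module size:", len(module),
      "mean score:", round(np.mean([scores[n] for n in module]), 2))
```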
a common characteristic of most methods is that the molecular activity of a set of genes is summarized on a per-sample basis. the summarized gene set scores are then used as features in prognostic and predictive models. relevant gene sets can be based on prior knowledge and correspond to canonical pathways, gene ontology categories, or sets of genes sharing common motifs in their promoter regions [115]. gene set scores can also be determined by projecting molecular data onto a biological network and summarizing scores at the level of subnetworks for each individual sample [116]. while promising in principle, it is still a subject of debate whether gene set-based models outperform gene-based ones [117]. the comparative analysis of networks across different species is another commonly used approach to constrain the solution space. patterns conserved across species have been shown to be more likely to be true functional interactions [107] and to harbor useful candidates for human disease genes [118]. many network alignment methods have been developed in the past decade to identify commonalities between networks. these methods in general combine sequence-based and topological constraints to determine the optimal alignment of two (or more) biological networks. network alignment has, for example, been applied to detect conserved patterns of protein interaction in multiple species [107, 119] and to analyze the evolution of co-expression networks between humans and mice [120, 121]. network alignment can also be applied to detect diverged patterns [120] and may thus lead to a better understanding of similarities and differences between animal models and human in health and disease. information from model organisms has also been fruitfully used to identify more robust disease signatures [122-125]. sweet-cordero and co-workers [122] used a gene signature identified in a mouse model of lung adenocarcinoma to uncover an orthologous signature in human lung adenocarcinoma that was not otherwise apparent. bild et al. [123] defined gene expression signatures characterizing several oncogenic pathways of human mammary epithelial cells. they showed that these signatures predicted pathway activity in mouse and human tumors. predictions of pathway activity correlated well with the sensitivity to drugs targeting those pathways and could thus serve as a guide to targeted therapies. a generic approach, pathprint, for the integration of gene expression data across different platforms and species at the level of pathways, networks, and transcriptionally regulated targets was recently described [126]. the authors used their method to identify four stem cell-related pathways conserved between human and mouse in acute myeloid leukemia, with good prognostic value in four independent clinical studies. we reviewed a wide array of different approaches showing how networks can be used to elucidate integrated genetic, molecular, and cellular networks. however, in general no single approach will be sufficient, and combining different approaches in more complex analysis pipelines will be required. this is fittingly illustrated by the diggit (driver-gene inference by genetical-genomics and information theory) algorithm [127]. in brief, diggit identifies candidate master regulators from an aracne gene co-expression network integrated with copy number variations that affect gene expression.
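the per-sample gene set scoring just described can be approximated by, for example, averaging z-scored expression over the member genes of each set and using the resulting set-level scores as features in a standard classifier. the sketch below does exactly that on random data with invented gene sets, so the cross-validated accuracy should hover around chance; it is an illustration of the idea, not of any specific published method.

```python
# per-sample gene set scores as features for an outcome classifier (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_samples, n_genes = 150, 500
expr = rng.normal(size=(n_samples, n_genes))
outcome = rng.integers(0, 2, size=n_samples)          # e.g. good vs. poor prognosis

# hypothetical gene sets (indices into the expression matrix), e.g. pathways
gene_sets = {f"set_{k}": rng.choice(n_genes, size=25, replace=False) for k in range(40)}

z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
set_scores = np.column_stack([z[:, idx].mean(axis=1) for idx in gene_sets.values()])

model = LogisticRegression(max_iter=1000)
acc = cross_val_score(model, set_scores, outcome, cv=5).mean()
print("cross-validated accuracy on random data (expected ~0.5):", round(acc, 2))
```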
this method combines several previously developed computational approaches and was used to identify causal genetic drivers of human disease in general, and of glioblastoma, breast cancer, and alzheimer's disease in particular. this enabled the identification of klhl9 deletions as upstream activators of two previously established master regulators in a specific subtype of glioblastoma. systems medicine is one of the steps necessary to make improvements in the prevention and treatment of disease through systems approaches that will (a) elucidate (patho)physiologic mechanisms in much greater detail than currently possible, (b) produce more robust and predictive disease signatures, and (c) enable personalized treatment. in this context, we have shown that bioinformatics has a major role to play. bioinformatics will continue its role in the development, curation, integration, and maintenance of (public) biological and clinical databases to support biomedical research and systems medicine. the bioinformatics community will strengthen its activities in various standardization and curation efforts that have already resulted in minimum reporting guidelines [128], data capture approaches [75], data exchange formats [129], and terminology standards for annotation [130]. one challenge for the future is to remove errors and inconsistencies in data and annotation from databases and to prevent new ones from being introduced [32, 76, 131-135]. an equally important challenge is to establish, improve, and integrate resources containing phenotype and clinical information. to achieve this objective it seems reasonable that bioinformatics and health informatics professionals team up [136-138]. traditionally, health informatics professionals have focused on hospital information systems (e.g., patient records, pathology reports, medical images), data exchange standards (e.g., hl7), medical terminology standards (e.g., the international classification of disease (icd), snomed), medical image analysis, the analysis of clinical data, clinical decision support systems, and so on. bioinformatics, on the other hand, has mainly focused on molecular data, but it shares many approaches and methods with health informatics. integration of these disciplines is therefore expected to benefit systems medicine in various ways [139]. integrative bioinformatics approaches clearly have added value for systems medicine as they provide a better understanding of biological systems, result in more robust disease markers, and prevent the (biological) bias that could result from using single-omics measurements. however, such studies, and the scientific community in general, would benefit from improved strategies to disseminate and share data, which typically will be produced at multiple research centers (e.g., https://www.synapse.org ; [140]). integrative studies are expected to increasingly facilitate personalized medicine approaches such as that demonstrated by chen and coworkers [141]. in their study they presented a 14-month "integrative personal omics profile" (ipop) for a single individual, comprising genomic, transcriptomic, proteomic, metabolomic, and autoantibody data. from the whole-genome sequence data an elevated risk for type 2 diabetes (t2d) was detected, and subsequent monitoring of hba1c and glucose levels revealed the onset of t2d, despite the fact that the individual lacked many of the known non-genetic risk factors. subsequent treatment resulted in a gradual return to the normal phenotype.
this shows that the genome sequence can be used to determine disease risk in a healthy individual and allows selecting and monitoring specific markers that provide information about the actual disease status. network-based approaches will increasingly be used to determine the genetic causes of human diseases. since the effect of a genetic variation is often tissue- or cell-type specific, a large effort is needed in constructing cell-type-specific networks both in health and in disease. this can be done using data already available, an approach taken by guan et al. [142]. the authors proposed 107 tissue-specific networks in mouse via their generic approach for constructing functional association networks, using low-throughput, highly reliable tissue-specific gene expression information as a constraint. one could also generate new datasets to facilitate the construction of tissue-specific networks. examples of such approaches are tcga and the genotype-tissue expression (gtex) project. the aim of gtex is to create a data resource for the systematic study of genetic variation and its effect on gene expression in more than 40 human tissues [143]. regardless of the way networks are constructed, it will become more and more important to offer a centralized repository where networks from different cell types and diseases can be stored and accessed. nowadays, these networks are difficult to retrieve and are scattered in supplementary files with the original papers, links to accompanying web pages, or are even not available at all. a resource similar to what the systems biology community has created with the biomodels database would be a great leap forward. there have been some initial attempts at building databases of network models, for example the cellcircuits database [123] ( http://www.cellcircuits.org ) and the causal biological networks (cbn) database of networks related to lung disease [144] ( http://causalbionet.com ). however, these are only small-scale initiatives and a much larger and coordinated effort is required. another main bottleneck for the successful application of network inference methods is their validation. most network inference methods to date have been applied to one or a few isolated datasets and were validated using some limited follow-up experiments, for example via gene knockdowns, using prior knowledge from databases and literature as a gold standard, or by generating simulated data from a mathematical model of the underlying network [145, 146]. however, the strengths and weaknesses of network inference methods across cell types, diseases, and species have hardly been assessed. notable exceptions are collaborative competitions such as the dialogue on reverse engineering assessment and methods (dream) [147] and the industrial methodology for process verification (improver) [146]. these centralized initiatives propose challenges in which individual research groups can participate and to which they can submit their predictions, which can then be independently validated by the challenge organizers. several dream challenges in the area of network inference have been organized, leading to a better insight into the strengths and weaknesses of individual methods [148]. another important contribution of dream is that a crowd-based approach integrating predictions from multiple network inference methods was shown to give good and robust performance across diverse data sets [149].
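the crowd-based integration found to be robust in the dream challenges essentially aggregates the edge rankings produced by several inference methods. a minimal rank-averaging sketch on synthetic scores is shown below; real applications would use the actual per-method confidence lists rather than random numbers.

```python
# "wisdom of crowds" network inference: average the rank each method assigns
# to every candidate edge and keep the top of the aggregated ranking.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(6)
n_edges, n_methods = 1000, 5

# each column: confidence scores for the same candidate edges from one method
method_scores = rng.random(size=(n_edges, n_methods))

# rank within each method (higher score -> better rank), then average ranks
ranks = np.column_stack([rankdata(-method_scores[:, m]) for m in range(n_methods)])
aggregate = ranks.mean(axis=1)

top = np.argsort(aggregate)[:50]        # 50 best-supported edges across methods
print("top community edges (indices):", top[:10])
```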
also in the area of systems medicine, challenge-based competitions may offer a framework for independent verification of model predictions. systems medicine promises a more personalized medicine that effectively exploits the growing amount of molecular and clinical data available for individual patients. solid bioinformatics approaches are of crucial importance for the success of systems medicine. however, really delivering on the promises of systems medicine will require an overall change of research approach that transcends the current reductionist approach and results in a tighter integration of clinical, wet-lab, and computational groups adopting a systems-based approach. past, current, and future success of systems medicine will accelerate this change. the road from systems biology to systems medicine participatory medicine: a driving force for revolutionizing healthcare understanding drugs and diseases by systems biology the roots of bioinformatics in theoretical biology sequencing technologies - the next generation exploring the new world of the genome with dna microarrays spectroscopic and statistical techniques for information recovery in metabonomics and metabolomics next-generation technologies and data analytical approaches for epigenomics gene expression profiling predicts clinical outcome of breast cancer diagnostic tests based on gene expression profile in breast cancer: from background to clinical use a multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer what is bioinformatics? a proposed definition and overview of the field the importance of biological databases in biological discovery the 2014 nucleic acids research database issue and an updated nar online molecular biology database collection reuse of public genome-wide gene expression data experimental design for gene expression microarrays learning from our gwas mistakes: from experimental design to scientific method efficient experimental design and analysis strategies for the detection of differential expression using rna-sequencing impact of yeast systems biology on industrial biotechnology the nature of systems biology gene expression omnibus: microarray data storage, submission, retrieval, and analysis the proteomics identifications (pride) database and associated tools: status in 2013 metabolights--an open-access general-purpose repository for metabolomics studies and associated meta-data the reactome pathway knowledgebase data, information, knowledge and principle: back to metabolism in kegg activities at the universal protein resource (uniprot) path2models: large-scale generation of computational models from biochemical pathway maps precise generation of systems biology models from kegg pathways pathguide: a pathway resource list pathway commons, a web resource for biological pathway data consensus and conflict cards for metabolic pathway databases the ucsc genome browser database: 2014 update biomodels database: a repository of mathematical models of biological processes a community-driven global reconstruction of human metabolism the seek: a platform for sharing data and models in systems biology malacards: an integrated compendium for diseases and their annotation genecards version 3: the human gene integrator in-silico human genomics with genecards peroxisomedb 2.0: an integrative view of the global peroxisomal metabolome the mouse age phenome knowledgebase and disease-specific inter-species age mapping searching the mouse genome informatics (mgi) resources
for information on mouse biology from genotype to phenotype gene-set approach for expression pattern analysis systems analysis of human brain gene expression: mechanisms for hiv-associated neurocognitive impairment and common pathways with alzheimer's disease systems biology approach to identify transcriptome reprogramming and candidate microrna targets during the progression of polycystic kidney disease mir2disease: a manually curated database for microrna deregulation in human disease a new face and new challenges for online mendelian inheritance in man (omim(r)) clinvar: public archive of relationships among sequence variation and human phenotype searching ncbi's dbsnp database dbvar and dgva: public archives for genomic structural variation using electronic patient records to discover disease correlations and stratify patient cohorts on not reinventing the wheel beyond the genomics blueprint: the 4th human variome project meeting comprehensive molecular characterization of urothelial bladder carcinoma open clinical trial data for all? a view from regulators clinical trial data as a public good biobanking for europe whose data set is it anyway? sharing raw data from randomized trials sharing individual participant data from clinical trials: an opinion survey regarding the establishment of a central repository ncbi's database of genotypes and phenotypes: dbgap mining electronic health records: towards better research applications and clinical care phenome connections phewas: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations mining the ultimate phenome repository probing genetic overlap among complex human phenotypes systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data finding the missing heritability of complex diseases systems genetics: from gwas to disease pathways a review of post-gwas prioritization approaches when one and one gives more than two: challenges and opportunities of integrative omics the model organism as a system: integrating 'omics' data sets principles and methods of integrative genomic analyses in cancer toward interoperable bioscience data critical assessment of human metabolic pathway databases: a stepping stone for future integration the bridgedb framework: standardized access to gene, protein and metabolite identifier mapping services integration of transcriptomics and metabonomics: improving diagnostics, biomarker identification and phenotyping in ulcerative colitis a multivariate approach to the integration of multi-omics datasets comprehensive molecular portraits of human breast tumours bring on the biomarkers assessing the clinical utility of cancer genomic and proteomic data across tumor types incorporating inter-relationships between different levels of genomic data into cancer clinical outcome prediction the casym roadmap: implementation of systems medicine across europe molecular classification of cancer: class discovery and class prediction by gene expression monitoring how to infer gene networks from expression profiles coexpression analysis of human genes across many microarray data sets discovering functional relationships between rna expression and chemotherapeutic susceptibility using relevance networks reverse engineering of regulatory networks in human b cells comparison of co-expression measures: mutual information, correlation, and model-based indices using bayesian networks to analyze expression data probabilistic graphical
models: principles and techniques. adaptive computation and machine learning inferring gene networks from time series microarray data using dynamic bayesian networks the biogrid interaction database: 2013 update architecture of the human regulatory network derived from encode data reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes a genome-wide functional network for the laboratory mouse string v9.1: protein-protein interaction networks, with increased coverage and integration advantages and limitations of current network inference methods computational discovery of gene modules and regulatory networks a semi-supervised method for predicting transcription factor-gene interactions in escherichia coli regression analysis of combined gene expression regulation in acute myeloid leukemia integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks integrated systems approach identifies genetic nodes and networks in late-onset alzheimer's disease a survey of the genetics of stomach, liver, and adipose gene expression from a morbidly obese cohort an introduction to systems biology: design principles of biological circuits from signatures to models: understanding cancer using microarrays integrative approaches for finding modular structure in biological networks weighted gene coexpression network analysis: state of the art integrative functional genomic analyses implicate specific molecular pathways and circuits in autism multi-omic network signatures of disease identifying functional modules in protein-protein interaction networks: an integrated exact approach algorithms for detecting significantly mutated pathways in cancer inference of patient-specific pathway activities from multi-dimensional cancer genomics data using paradigm pathway-based personalized analysis of cancer network-based classification of breast cancer metastasis current composite-feature classification methods do not outperform simple single-genes classifiers in breast cancer prognosis prediction of human disease genes by human-mouse conserved coexpression analysis a comparison of algorithms for the pairwise alignment of biological networks cross-species analysis of biological networks by bayesian alignment graphalignment: bayesian pairwise alignment of biological networks an oncogenic kras2 expression signature identified by cross-species gene-expression analysis oncogenic pathway signatures in human cancers as a guide to targeted therapies interspecies translation of disease networks increases robustness and predictive accuracy integrated cross-species transcriptional network analysis of metastatic susceptibility pathprinting: an integrative approach to understand the functional basis of disease identification of causal genetic drivers of human disease through systems-level analysis of regulatory networks promoting coherent minimum reporting guidelines for biological and biomedical investigations: the mibbi project data standards for omics data: the basis of data sharing and reuse biomedical ontologies: a functional perspective pdb improvement starts with data deposition what we do not know about sequence analysis and sequence databases annotation error in public databases: misannotation of molecular function in enzyme superfamilies improving the description of metabolic networks: the tca cycle as example more than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of
sequence homology biomedical and health informatics in translational medicine amia board white paper: definition of biomedical informatics and specification of core competencies for graduate education in the discipline synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care elixir: a distributed infrastructure for european biological data enabling transparent and collaborative computational analysis of 12 tumor types within the cancer genome atlas personal omics profiling reveals dynamic molecular and medical phenotypes tissue-specific functional networks for prioritizing phenotype and disease genes the genotype-tissue expression (gtex) project on crowd-verification of biological networks inference and validation of predictive gene networks from biomedical literature and gene expression data verification of systems biology research in the age of collaborative competition dialogue on reverse-engineering assessment and methods: the dream of high-throughput pathway inference revealing strengths and weaknesses of methods for gene network inference wisdom of crowds for robust gene network inference

we would like to thank dr. aldo jongejan for his comments that improved the text.

key: cord-186031-b1f9wtfn authors: caldarelli, guido; nicola, rocco de; petrocchi, marinella; pratelli, manuel; saracco, fabio title: analysis of online misinformation during the peak of the covid-19 pandemics in italy date: 2020-10-05 journal: nan doi: nan sha: doc_id: 186031 cord_uid: b1f9wtfn

during the covid-19 pandemic, we also experienced another dangerous pandemic, one based on misinformation. narratives about the origin and cure of the disease, disconnected from fact-checking, intertwined with pre-existing political fights. we collect a database of twitter posts and analyse the topology of the networks of retweeters (users re-broadcasting the same elementary piece of information, or tweet), and validate its structure with methods from the statistical physics of networks. furthermore, by using commonly available fact-checking software, we assess the reputation of the pieces of news exchanged. by using a combination of theoretical and practical weapons, we are able to track down the flow of misinformation in a snapshot of the twitter ecosystem. thanks to the presence of verified users, we can also assign a polarization to the network nodes (users) and see the impact of low-quality information producers and spreaders on the twitter ecosystem. propaganda and disinformation have a history as old as mankind, and the phenomenon becomes particularly strong in difficult times, such as wars and natural disasters. the advent of the internet and social media has amplified and accelerated the spread of biased and false news, and has made targeting specific segments of the population possible [7]. for this reason the vice-president of the european commission with responsibility for policies on values and transparency, vȇra yourová, announced, at the beginning of june 2020, a european democracy action plan, expected by the end of 2020, in which web platform admins will be called to greater accountability and transparency, since 'everything cannot be allowed online' [16]. manufacturers and spreaders of online disinformation have been particularly active also during the covid-19 pandemic period (e.g., writing about bill gates' role in the pandemic or about masks killing children [2, 3]).
this, alongside the real pandemic [17], has led to the emergence of a new virtual disease: the covid-19 infodemics. in this paper, we shall consider the situation in italy, one of the most affected countries in europe, where the virus struck in a devastating way between the end of february and the end of april [1] (in italy, since the beginning of the pandemic and at the time of writing, almost 310k persons have contracted the covid-19 virus; of these, more than 35k have died; source: http://www.protezionecivile.gov.it/, accessed september 28, 2020). in such a sad and uncertain time, propaganda has worked hard: one of the most followed pieces of fake news was published by sputnik italia, receiving 112,800 likes, shares and comments on the most popular social media. 'the article falsely claimed that poland had not allowed a russian plane with humanitarian aid and a team of doctors headed to italy to fly over its airspace', the ec vice-president yourová said. actually, studies regarding dis/misinformation diffusion on social media seldom analyse its actual impact. in the exchange of messages on online platforms, a great number of interactions do not carry any information relevant to the understanding of the phenomenon: as an example, randomly retweeting viral posts does not contribute insights into the sharing activity of the account. two main weapons can be used to determine dis/misinformation propagation: the analysis of the content (semantic approach) and the analysis of the communities sharing the same piece of information (topological approach). while the content of a message can be analysed on its own, the presence of some troublesome structure in the pattern of news producers and spreaders (i.e., in the topology of contacts) can be detected only through dedicated instruments. indeed, for real in-depth analyses, the properties of the real system should be compared with a proper null model. recently, entropy-based null models have been successfully employed to filter out random noise from complex networks and to focus the attention on non-trivial contributions [10, 26]. essentially, the method consists of defining a 'network benchmark' that has some of the (topological) properties of the real system, but is completely random in all other respects. then, every observation that does not agree with the model, i.e., that cannot be explained by the topological properties of the benchmark, carries non-trivial information. notably, being based on the shannon entropy, the benchmark is unbiased by definition. in the present paper, using entropy-based null models, we analyse a tweet corpus related to the italian debate on covid-19 during the two months of maximum crisis in italy. after cleaning the system of random noise, by using the entropy-based null model as a filter, we have been able to highlight different communities. interestingly enough, these groups, besides including several official accounts of ministries, health institutions, and -online and offline -newspapers and newscasts, encompass four main political groups. while at first sight this may sound surprising -the pandemic debate was more on a scientific than on a political ground, at least in the very first phase of its abrupt diffusion -it might be due to pre-existing echo chambers [18].
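the filtering step can be illustrated with a simplified stand-in for the entropy-based null model used in the paper: for every pair of users, the number of tweets they both retweeted is compared with what a random benchmark would predict, and only links whose overlap is statistically surprising are kept. the hypergeometric null and the synthetic retweet matrix below are simplifications of the maximum-entropy machinery, not the authors' exact procedure.

```python
# validated projection of a users x tweets retweet matrix (simplified sketch):
# keep a user-user link only if their co-retweet overlap is unlikely under a
# hypergeometric null, after a benjamini-hochberg correction.
from itertools import combinations
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(7)
n_users, n_tweets = 60, 500
M = (rng.random((n_users, n_tweets)) < 0.05).astype(int)   # 1 = user retweeted tweet

degrees = M.sum(axis=1)
pvals, pairs = [], []
for i, j in combinations(range(n_users), 2):
    overlap = int(M[i] @ M[j])
    # probability of at least this overlap if user j's retweets were random
    p = hypergeom.sf(overlap - 1, n_tweets, degrees[i], degrees[j])
    pvals.append(p)
    pairs.append((i, j))

# benjamini-hochberg false discovery rate at 5%
order = np.argsort(pvals)
sorted_p = np.array(pvals)[order]
m = len(pvals)
passed = np.nonzero(sorted_p <= 0.05 * np.arange(1, m + 1) / m)[0]
k = passed[-1] + 1 if passed.size else 0
validated = [pairs[idx] for idx in order[:k]]
print(f"{len(validated)} validated links out of {m} possible pairs")
```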
the four political groups are found to perform completely different activities on the platform, to interact differently from one another, and to post and share reputable and non-reputable sources of information with great differences in the number of occurrences. in particular, the accounts from the right wing community interact, mainly in terms of retweets, with the same accounts that interact with the mainstream media. this is probably due to the strong visibility given by the mainstream media to the leaders of that community. moreover, the right wing community is more numerous and more active, even relative to the number of accounts involved, than the other communities. interestingly enough, newly formed political parties, such as the one of the former italian prime minister matteo renzi, quickly imposed their presence on twitter and in the online political debate, with strong activity. furthermore, the different political parties use different sources for getting information on the spreading of the pandemic. to detect the impact of dis/misinformation in the debate, we consider the news sources shared among the accounts of the various groups. with a hybrid annotation approach, based on independent fact-checking organisations and human annotation, we categorised such sources as reputable or non-reputable (in terms of the credibility of the published news and the transparency of the sources). notably, we found that one group of accounts spreads information from non-reputable sources with a frequency almost 10 times higher than that of the other political groups. and we are afraid that, due to the extent of the online activity of the members of this community, the spreading of such a volume of non-reputable news could deceive public opinion. we collected circa 4.5m tweets in italian, from february 21st to april 20th, 2020 [2]. details about the political situation in italy during the period of data collection can be found in the supplementary material, section 1.1: 'evolution of the covid-19 pandemics in italy'. the data collection was keyword-based, with keywords related to the covid-19 pandemic. twitter's streaming api returns any tweet containing the keyword(s) in the text of the tweet, as well as in its metadata. it is worth noting that it is not always necessary to have each permutation of a specific keyword in the tracking list. for example, the keyword 'covid' will return tweets that contain both 'covid19' and 'covid-19'. table 1 lists a subset of the considered keywords and hashtags. some hashtags overlap, because an included keyword is a sub-string of another one, but we included both for completeness. the left panel of fig. 1 shows the network obtained by following the projection procedure described in section 5.1. the network resulting from the projection procedure will be called, in the rest of the paper, the validated network. the term validated should not be confused with the term verified, which instead denotes a twitter user who has passed the formal authentication procedure of the social platform. in order to get the communities of verified twitter users, we applied the louvain algorithm [5] to the validated network. such an algorithm, despite being one of the most popular, is also known to be order-dependent [19]. to get rid of this bias, we apply it iteratively n times (n being the number of nodes), reshuffling the order of the nodes each time. finally, we select the partition with the highest modularity.
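that repeated-louvain step can be sketched as follows, assuming networkx 2.8 or later; here the run-to-run variability is induced through the random seed (a stand-in for reshuffling the node order), and the karate club graph replaces the validated network.

```python
# run louvain several times and keep the partition with the highest modularity.
import networkx as nx

G = nx.karate_club_graph()            # toy stand-in for the validated network

best_partition, best_q = None, -1.0
for seed in range(G.number_of_nodes()):
    communities = nx.community.louvain_communities(G, seed=seed)
    q = nx.community.modularity(G, communities)
    if q > best_q:
        best_partition, best_q = communities, q

print(f"best modularity {best_q:.3f} with {len(best_partition)} communities")
```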
the network presents a strong community structure, composed of four main subgraphs. when analysing the 4 emerging communities, we find that they correspond to:

1. right wing parties and media (in steel blue);
2. center-left wing (in dark red);
3. 5 stars movement (m5s, in dark orange);
4. institutional accounts (in sky blue).

details about the political situation in italy during the period of data collection can be found in the supplementary material, section 1.2: 'italian political situation during the covid-19 pandemics'. this partition into four subgroups, once examined in more detail, presents a richer substructure, described in the right panel of fig. 1. starting from the center-left wing, we can find a darker red community, including various ngos and various left-oriented journalists, vips and pundits. a slightly lighter red sub-community turns out to be composed of the main politicians of the italian democratic party (pd), as well as of representatives from the european parliament (italian and others) and some eu commissioners. the violet red group is mostly composed of the representatives of italia viva, a new party founded by the former italian prime minister matteo renzi (december 2014 - february 2016). in golden red we can find the subcommunity of catholic and vatican groups. finally, the dark violet red and light tomato subcommunities consist mainly of journalists. in turn, the orange (m5s) community also shows a clear partition into substructures. in particular, the dark orange subcommunity contains the accounts of politicians, parliament representatives and ministers of the m5s, and journalists. in aquamarine, we can find the official accounts of some private and public, national and international, health institutes. finally, in the light slate blue subcommunity we can find various italian ministers as well as the italian police and army forces. similar considerations apply to the steel blue community. in steel blue, we find the subcommunity of center-right and right wing parties (such as forza italia, lega and fratelli d'italia). in the following, this subcommunity will be called fi-l-fdi, recalling the initials of the political parties contributing to this group. the sky blue subcommunity includes the national federations of various sports, the official accounts of athletes and sport players (mostly soccer) and their teams. the teal subcommunity contains the main italian news agencies; in this subcommunity there are also the accounts of many universities. the firebrick subcommunity contains accounts related to the as roma football club; analogously, in dark red we find the official accounts of ac milan and its players. the slate blue subcommunity is mainly composed of the official accounts of radio and tv programs of mediaset, the main private italian broadcasting company. finally, the sky blue community is mainly composed of italian embassies around the world. for the sake of completeness, a more detailed description of the composition of the subcommunities in the right panel of figure 1 is reported in the supplementary material, section 1.3: 'composition of the subcommunities in the validated network of verified twitter users'. here, we report a series of analyses related to the domain names, hereafter simply called domains, that mostly appear in the tweets of the validated network of verified users. the domains have been tagged according to their degree of credibility and transparency, as indicated by the independent software toolkit newsguard (https://www.newsguardtech.com/).
the details of this procedure are reported below. as a first step, we considered the network of verified accounts, whose communities and sub-communities are shown in fig. 1. on this topology, we labelled all domains that had been shared at least 20 times (counting both tweets and retweets). table 2 shows the tags associated with the domains. in the rest of the paper, we shall be interested in quantifying the reliability of news sources publishing during the period of interest. thus, for our analysis, we will not consider those sources corresponding to social networks, marketplaces, search engines, institutional sites, etc. tags r, ∼r and nr in table 2 are used only for news sites, be they newspapers, magazines, or tv or radio social channels, and they stand for reputable, quasi reputable, and not reputable, respectively. label unc is assigned to those domains with fewer than 20 occurrences in our tweets and retweets dataset. in fact, the labeling procedure is a hybrid one. as mentioned above, we relied on newsguard, a plugin resulting from the joint effort of journalists and software developers, aiming at evaluating news sites according to nine criteria concerning credibility and transparency. for evaluating the credibility level, the metrics consider whether the news source regularly publishes false news, does not distinguish between facts and opinions, or does not correct wrongly reported news. for transparency, instead, the tool takes into account whether the owners, founders or authors of the news source are publicly known, and whether advertisements are easily recognizable [3]. after combining the individual scores obtained from the nine criteria, the plugin associates to a news source a score from 1 to 100, where 60 is the minimum score for the source to be considered reliable. when reporting the results, the plugin provides details about the criteria that passed the test and those that did not. in order to have a sort of no-man's land and not to be too abrupt in the transition between reputability and non-reputability, when the score was between 55 and 65 we considered the source to be quasi reputable, ∼r. it is worth noting that not all the domains in the dataset under investigation were evaluated by newsguard at the time of our analysis. for those not evaluated automatically, the annotation was made by three tech-savvy researchers, who assessed the domains by using the same criteria as newsguard. table 3 gives statistics about the number and kind of tweets (tw = pure tweet; rt = retweet), the number of urls and distinct urls (dist url), and the number of domains and users in the validated network of verified users. we clarify what we mean by these terms with an example: a domain for us corresponds to the so-called 'second-level domain' name [4], i.e., the name directly to the left of .com, .net, and any other top-level domain. the url, instead, maintains here its standard definition [5], and an example is http://www.example.com/index.html. table 4 shows the outcome of the domain annotation, according to the scores of newsguard or to those assigned by the three annotators when scores were not available from newsguard. at first glance, the majority of the news domains belong to the reputable category. the second highest percentage is that of the untagged domains, unc. in fact, in our dataset there are many domains that occur only a few times.
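the labelling rule described above can be written down compactly. the sketch below combines a naive second-level-domain extraction (which ignores country-code suffixes such as .co.uk) with the score thresholds used for the r / ∼r / nr tags and the 20-occurrence cut-off for unc; the example scores and urls are invented, and domains without a newsguard score were in fact handled by the human annotators rather than tagged automatically.

```python
# map newsguard-style scores and occurrence counts to the r / ~r / nr / unc tags.
from urllib.parse import urlparse

def second_level_domain(url):
    # simplification: takes the last two host labels, ignoring suffixes like .co.uk
    host = urlparse(url).netloc.lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def tag_domain(score, occurrences, min_occurrences=20):
    if occurrences < min_occurrences:
        return "UNC"
    if score is None:
        return "UNC"            # in the paper, these cases go to the human annotators
    if score >= 65:
        return "R"
    if score >= 55:             # 55-65 band treated as quasi reputable
        return "~R"
    return "NR"

# invented examples
print(second_level_domain("http://www.example.com/index.html"))   # example.com
print(tag_domain(score=80, occurrences=150))                       # R
print(tag_domain(score=58, occurrences=40))                        # ~R
print(tag_domain(score=30, occurrences=25))                        # NR
print(tag_domain(score=90, occurrences=5))                         # UNC
```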
for example, there are 300 domains that appear in the dataset only once. fig. 2 shows the trend of the number of tweets and retweets containing urls posted by the verified users of the validated projection during the period of data collection, as in [9].
[3] newsguard rating process: https://www.newsguardtech.com/ratings/rating-process-criteria/
[4] https://en.wikipedia.org/wiki/domain_name
[5]
table 4: annotation results over all the domains in the whole dataset - validated network of verified users.
going on with the analysis, table 5 shows the percentage of the different types of domains for the 4 communities identified in the left plot of fig. 1. it is worth observing that the steel blue community (both politicians and media) is the most active one, even if it is not the most represented: the number of its users is lower than that of the center-left community (the biggest one in terms of numbers), but the number of its posts containing a valid url is almost double that of the second most active community. interestingly, the activity of the verified users of the steel blue community is more focused on content production (see the 'only tweets' sub-table) than on sharing (see the 'only retweets' sub-table). in fact, retweets represent almost 14.6% of all posts from the media and right wing community, while in the case of the center-left community they represent 34.5%. this effect is observable even in the average number of 'only tweets' posts per verified user: right-wing and media users have an average of 88.75 original posts, against 34.27 for center-left-wing users. these numbers are probably due to the presence in the former community of the most accessed italian media, which tend to spread their (original) pieces of news on the twitter platform. interestingly, the presence of urls from non-reputable sources in the steel blue community is, in the case of original tweets (only tweets), more than 10 times higher than the second highest score in the same field. it is worth noting that, in the case of the dark orange and sky blue communities, which are smaller both in terms of users and in number of posts, the presence of non-classified sources is quite strong (it represents nearly 46% of retweeted posts for both communities), as is the frequency of posts linking to social network content. interestingly enough, the verified users of both groups seem to focus slightly more on the same domains: there are, on average, 1.59 and 1.80 posts for each url domain for the dark orange and sky blue communities respectively, and, on average, 1.26 and 1.34 posts for the steel blue and the dark red communities. the right plot in fig. 1 reports a fine-grained division of the communities: the four largest communities have been further divided into sub-communities, as mentioned in subsection 3.1. here, we focus on the urls shared in the purely political sub-communities in table 7. broadly speaking, we examine the contribution of the different political parties, as represented on twitter, to the spread of mis/disinformation and propaganda. table 7 clearly shows that the vast majority of the news coming from sources considered scarcely reputable or non-reputable is tweeted and retweeted by the steel blue political sub-community (fi-l-fdi). notably, the percentage of non-reputable sources shared by the fi-l-fdi accounts is more than 4 times the percentage of their community (the steel blue one), and more than 30 times that of the second community in the nr ratio ranking.
for all the political sub-communities the incidence of social network links is much higher than in their original communities. looking at table 8 , even if the number of users in each political sub-community is much smaller, some peculiar behaviours can be still be observed. again, the center-right and right wing parties, while representing the least represented ones in terms of users, are much more active than the other groups: each (verified) user is responsible, on average of almost 81.14 messages, while the average is 23.96, 22.12 and 15.29 for m5s, iv and pd, respectively. it is worth noticing that italia viva, while being a recently founded party, is very active; moreover, for them the frequency of quasi reputable sources is quite high, especially in the case of only tweets posts. the impact of uncategorized sources is almost constant for all communities in the retweeting activity, while it is particularly strong for the m5s. finally, the posts by the center left communities (i.e., italia viva and the democratic party) tend to have more than one url. specifically, every post containing at least a url, has, on average, 2.05 and 2.73 urls respectively, against the 1.31 of movimento 5 stelle and 1.20 for the center-right and right wing parties. to conclude the analysis on the validated network of verified users, we report statistics about the most diffused hashtags in the 4 political sub-communities. fig. 3 focuses on wordclouds, while fig. 4 reports the data under an histograms form. actually, from the various hashtags we can derive important information regarding the communications of the various political discursive communities and their position towards the management of the pandemics. first, it has to be noticed that the m5s is the greatest user of hashtags: their two most used hashtags have been used almost twice the most used hashtags used by the pd, for instance. this heavy usage is probably due to the presence in this community of journalists and of the official account of il fatto quotidiano, a newspaper explicitly supporting the m5s: indeed, the first two hashtags are "#ilfattoquotidiano" and "#edicola" (kiosk, in italian). it is interesting to see the relative importance of hashtags intended to encourage the population during the lockdown: it is the case of "#celafaremo" (we will make it), "#iorestoacasa" (i am staying home), "#fermiamoloinsieme" (let's stop it together ): "#iorestoacasa" is present in every community, but it ranks 13th in the m5s verified user community, 29th in the fi-l-fdi community, 2nd in the italia viva community and 10th in the pd one. remarkably, "#celafaremo" is present only in the m5s group, as "#fermiamoloinsieme" can be found in the top 30 hashtags only in the center-right and right wing cluster. the pd, being present in various european institutions, mentions more european related hashtags ("#europeicontrocovid19", europeans against covid-19 ), in order to ask for a common reaction of the eu. the center-right and right wing community has other hashtags as "#forzalombardia" (go, lombardy! ), ranking the 2nd, and "#fermiamoloinsieme", ranking 10th. 
what is, nevertheless, astonishing, is the presence among the most used hashtags of all communities of the name of politicians from the same group ('interestingly '#salvini" is the first used hashtag in the center right and right wing community, even if he did not perform any duty in the government), tv programs ("#mattino5", "#lavitaindiretta", "#ctcf", "#dimartedì"), as if the main usage of hashtags is to promote the appearance of politicians in tv programs. finally, the hashtags used by fi-l-fdi are mainly used to criticise the actions of the government, e.g., "#contedimettiti" (conte, resign! ). fig. 5 shows the structure of the directed validated projection of the retweet activity network, as outcome of the procedure recalled in section 3 of the supplementary material. as mentioned in section 4 of the supplementary material, the affiliation of unverified users has been determined using the tags obtained by the validated projected network of the verified users, as immutable label for the label propagation of [23] . after label propagation, the representation of the political communities in the validated retweet network changes dramatically with respect to the case of the network of verified users: the center-right and right wing community is the most represented community in the whole network, with 11063 users (representing 21.1% of all the users in the validated network), followed by italia viva users with 8035 accounts (15.4% of all the accounts in the validated network). the impact of m5s and pd is much more limited, with, respectively, 3286 and 564 accounts. it is worth noting that this result is unexpected, due to the recent formation of italia viva. as in our previous study targeting the online propaganda [8] , we observe that the most effective users in term of hub score [21] are almost exclusively from the center-right and right wing party: considering the first 100 hubs, only 4 are not from this group. interestingly, 3 out of these 4 are verified users: roberto burioni, one of the most famous italian virologists, ranking 32nd, agenzia ansa, a popular italian news agency, ranking 61st, and tgcom24, the popular newscast of a private tv channel, ranking 73rd. the fourth account is an online news website, ranking 88th: this is a not verified account which belongs to a not political community. remarkably, in the top 5 hubs we find 3 of the top 5 hubs already found when considered the online debate on migrations from northern africa to italy [8] : in particular, a journalist of a neo-fascist online newspaper (non verified user), an extreme right activist (non verified user) and the leader of fratelli d'italia giorgia meloni (verified user), who ranks 3rd in the hub score. matteo salvini (verified user), who was the first hub in [8] , ranks 9th, surpassed by his party partner claudio borghi, ranking 6th. the first hub in the present network is an extreme right activist, posting videos against african migrants to italy and accusing them to be responsible of the contagion and of violating lockdown measures. table 9 shows the annotation results of all the domains tweeted and retweeted by users in the directed validated network. the numbers are much higher than those shown in table 2 , but the trend confirms the previous results. the majority of urls traceable to news sources are considered reputable. the number of unclassified domains is higher too. in fact, in this case, the annotation was made considering the domains occurring at least 100 times. 
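the hub-score ranking discussed above can be reproduced with the hits algorithm; the sketch below (not the authors' code) uses the networkx implementation on a toy edge list, with edges oriented from the author to the accounts that significantly retweet it, as in the directed validated network.

```python
import networkx as nx

# hypothetical edges of the directed validated retweet network: author -> retweeter
edges = [("account_a", "user_1"), ("account_a", "user_2"), ("account_a", "user_3"),
         ("account_b", "user_2"), ("account_c", "user_3")]
g = nx.DiGraph(edges)

hubs, authorities = nx.hits(g, max_iter=1000, normalized=True)
top_hubs = sorted(hubs, key=hubs.get, reverse=True)
print(top_hubs[:3])   # accounts whose posts are most effectively spread by others
```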
table 9 annotation results over all the domains - directed validated network. table 10 reports statistics about posts, urls, distinct urls, users and verified users in the directed validated network. noticeably, by comparing these numbers with those of table 3, reporting statistics about the validated network of verified users, we can see that here the number of retweets is much higher, and the trend is the opposite: verified users tend to tweet more than retweet (46277 vs 17190), while users in the directed validated network, which also includes non verified users, have a number of retweets 3.5 times higher than the number of their tweets. fig. 6 shows the trend of the number of tweets containing urls over the period of data collection. since we are analysing a bigger network than the one considered in section 3.2, we have numbers that are one order of magnitude greater than those shown in fig. 2; the highest peak, after the discovery of the first cases in lombardy, corresponds to more than 68,000 posts containing urls, whereas the analogous peak in fig. 2 corresponds to 2,500 posts. apart from the order of magnitude, the two plots feature similar trends: higher traffic before the beginning of the italian lockdown, and a settling down as the quarantine went on [6]. table 11 shows the core of our analysis, that is, the distribution of reputable and non reputable news sources in the directed validated network, consisting of both verified and non-verified users. again, we focus directly on the 4 political sub-communities identified in the previous subsection. two of the sub-communities are part of the center-left wing community, one is associated with the 5 stars movement, and the remaining one represents the center-right and right wing communities. in line with previous results on the validated network of verified users, the table clearly shows how the vast majority of the news coming from sources considered scarcely reputable or non reputable are tweeted and retweeted by the center-right and right wing communities; 98% of the domains tagged as nr are shared by them. as shown in table 12, the activity of fi-l-fdi users is again extremely high: on average there are 89.3 retweets per account in this community, against the 66.4 of m5s, the 48.4 of iv and the 21.8 of pd. the right wing contribution to the debate is extremely high, even in absolute numbers, due to the large number of users in this community. it is worth mentioning that the frequency of non reputable sources in this community is really high (at about 30% of the urls in the only tweets) and comparable with that of the reputable ones (see table 11, only tweets). [6] the low peaks for february 27 and march 10 are due to an interruption in the data collection, caused by a connection breakdown. table 11 domains annotation per political sub-communities - directed validated network. in the other sub-communities, pd users are more focused on un-categorised sources, while users from both italia viva and movimento 5 stelle are mostly tweeting and retweeting reputable news sources. the fi-l-fdi contribution to the spread of non reputable content is high not only in relative terms and in number of users, but also in absolute numbers: out of the over 1m tweets, more than 320k tweets refer to a nr url. actually, the political competition still shines through the hashtag usage even for the other communities: it is the case, for instance, of italia viva. in the top 30 hashtags we can find '#salvini', '#lega', but also '#papeete' [7], '#salvinisciacallo' (salvini jackal) and '#salvinimmmerda' (salvini asshole).
on the other hand, in italia viva hashtags supporting the population during the lockdown are used: '#iorestoacasa', '#restoacasa' (i am staying home), '#restiamoacasa' (let's stay home). criticisms towards the management of lombardy health system during the pandemics can be deduced from the hashtag '#commissariamtelalombardia' (put lombardy under receivership) and '#fontana' (the lega administrator of the lombardy region). movimento 5 stelle has the name of the main leader of the opposition '#salvini', as first hashtag and supports criticisms to the lombardy administration with the hashtags '#fontanadimettiti' (fontana, resign! ) and '#gallera', the health and welfare minister of the lombardy region, considered the main responsible for the bad management of the pandemics. nevertheless, it is possible to highlight even some hashtags encouraging the population during the lock down, as the above mentioned '#iorestoacasa', '#restoacasa' and '#restiamoacasa'. it is worth mentioning that the government measures, and the corresponding m5s campaigns, are accompanied specific hashtags: '#curaitalia' is the name of one of the decree of the prime minister to inject liquidity in the italian economy, '#acquistaitaliano' (buy italian products! ), instead, advertise italian products to support the national economy. as a final task, over the whole set of tweets produced or shared by the users in the directed validated network, we counted the number of times a message containing a url was shared by users belonging to different political communities, although without considering the semantics of the tweets. namely, we ignored whether the urls were shared to support or to oppose the presented arguments. table 14 shows the most tweeted (and retweeted) nr domains shared by the political communities presented in table 7 , the number of occurrences is reported next to each domain. the first nr domains for fi-l-fdi in table 14 are related to the right, extreme right and neo-fascist propaganda, as it is the case of imolaoggi.it, ilprimatonazionale.it and voxnews.info, recognised as disinformation websites by newsguard and by the two main italian debunker websites, bufale.net and butac.it. as shown in the table, some domains, although in different number of occurrences, are present under more than one column, thus shared by users close to different political communities. this could mean, for some subgroups of the community, a retweet with the aim of supporting the opinions expressed in the original tweets. however, since the semantics of the posts in which these domains are present were not investigated, the retweets of the links by more than one political community could be due to contrast, and not to support, the opinions present in the original posts. despite the fact that the results were achieved for a specific country, we believe that the applied methodology is of general interest, being able to show trends and peculiarities whenever information is exchanged on social networks. in particular, when analysing the outcome of our investigation, some features attracted our attention: 1 persistence of clusters wrt different discussion topics: in caldarelli et al. [8] , we focused on tweets concerned with immigration, an issue that has been central in the italian political debate for years. here, we discovered that the clusters and the echo chambers that have been detected when analysing tweets about immigration are almost the same as those singled out when considering discussions concerned with covid-19. 
this may seem surprising, because a discussion about covid-19 may not be exclusively political, but also medical, social, economic, etc.. from this we can argue that the clusters are political in nature and, even when the topic of discussion changes, users remain in their cluster on twitter. (indeed, journalists and politicians use twitter for information and political propaganda, respectively). the reasons political polarisation and political vision of the world affect so strongly also the analysis of what should be an objective phenomenon is still an intriguing question. 2 persistence of online behavioral characteristics of clusters: we found that the most active, lively and penetrating online communities in the online debate on covid-19 are the same found in [8] , formed in a almost purely political debate such as the one represented by the right of migrants to land on the italian territory. 3 (dis)similarities amongst offline and online behaviours of members and voters of parties: maybe less surprisingly, the political habits is also reflected in the degree of participation to the online discussions. in particular, among the parties in the centre-left-wing side, a small party (italia viva) shows a much more effective social presence than the larger party of the italian centre-left-wing (partito democratico), which has many more active members and more parliamentary representation. more generally, there is a significant difference in social presence among the different political parties, and the amount of activity is not at all proportional to the size of the parties in terms of members and voters. 4 spread of non reputable news sources: in the online debate about covid-19, many links to non reputable (defined such by newsguard, a toolkit ranking news website based on criteria of transparency and credibility, led by veteran journalists and news entrepreneurs) news sources are posted and shared. kind and occurrences of the urls vary with respect to the corresponding political community. furthermore, some of the communities are characterised by a small number of verified users that corresponds to a very large number of acolytes which are (on their turn) very active, three times as much as the acolytes of the opposite communities in the partition. in particular, when considering the amount of retweets from poorly reputable news sites, one of the communities is by far (one order of magnitude) much more active than the others. as noted already in our previous publication [8] , this extra activity could be explained by a more skilled use of the systems of propaganda -in that case a massive use of bot accounts and a targeted activity against migrants (as resulted from the analysis of the hub list). our work could help in steering the online political discussion around covid-19 towards an investigation on reputable information, while providing a clear indication of the political inclination of those participating in the debates. more generally, we hope that our work will contribute to finding appropriate strategies to fight online misinformation. while not completely unexpected, it is striking to see how political polarisation affects also the covid-19 debate, giving rise to on-line communities of users that, for number and structure, almost closely correspond to their political affiliations. this section recaps the methodology through which we have obtained the communities of verified users (see section 3.1). this methodology has been designed in saracco et al. 
[25] and applied in the field of social networks for the first time in [4, 8]. for the sake of completeness, the supplementary material, section 3, recaps the methodology through which we have obtained the validated retweet activity network shown in section 3.3. in section 4 of the supplementary material, the detection of the affiliation of unverified users is described. in the supplementary material, the interested reader will also find additional details about 1) the definition of the null models (section 5); 2) a comparison among various label propagation approaches for the political affiliation of unverified users (section 6); and 3) a brief state of the art on fact checking organizations and literature on false news detection (section 7). many results in the analysis of online social networks (osn) show that users are highly clustered in groups of opinions [1, 11-15, 22, 28, 29]; indeed those groups display some peculiar behaviours, such as the echo chamber effect [14, 15]. following the example of references [4, 8], we make use of this users' clustering in order to detect discursive communities, i.e. groups of users interacting among themselves by retweeting on the same (covid-related) subjects. remarkably, our procedure does not rely on the analysis of the text shared by the various users, but is simply based on the retweeting activity among users. in the present subsection we will examine how the discursive communities of verified twitter users can be extracted. on twitter there are two distinct categories of accounts: verified and unverified users. verified users have a tick next to the screen name: the platform itself, upon request from the user, has a procedure to check the authenticity of the account. verified accounts are owned by politicians, journalists or vips in general, as well as the official accounts of ministers, newspapers, newscasts, companies and so on; for those kinds of users, the verification procedure guarantees the identity of their account and reduces the risk of malicious accounts tweeting in their name. non verified accounts are for standard users: in this second case, we cannot trust any information provided by the users. the information carried by verified users has been studied extensively in order to have a sort of anchor for the related discussion [4, 6, 8, 20, 27]. to detect the political orientation we consider the bipartite network represented by verified (on one layer) and unverified (on the other layer) accounts: a link connects the verified user v with the unverified one u if v was retweeted by u at least once, or vice versa. to extract the similarity of users, we compare the commonalities with a bipartite entropy-based null-model, the bipartite configuration model (bicm [24]). the rationale is that two verified users that share many links to the same unverified accounts probably have similar visions, as perceived by the audience of unverified accounts. we then apply the method of [25], graphically depicted in fig. 8, in order to get a statistically validated projection of the bipartite network of verified and unverified users. in a nutshell, the idea is to compare the amount of common linkage measured on the real network with the expectations of an entropy-based null model fixing (on average) the degree sequence: if the associated p-value is so low that the overlaps cannot be explained by the model, i.e.
such that it is not compatible with the degree sequence expectations, they carry non trivial information and we project the related information onto the (monopartite) projection of verified users. the interested reader can find the technical details about this validated projection in [25] and in the supplementary information. the data that support the findings of this study are available from twitter, but restrictions apply to the availability of these data, which were used under license. 1 italian socio-political situation during the period of data collection. in the present subsection we present some crucial facts for the understanding of the social context in which our analysis is set. this subsection is divided into two parts: the contagion evolution and the political situation. these two aspects are closely related. a first covid-19 outbreak was detected in codogno, lodi, lombardy region, on february 19th [1]. on the very next day, two cases were detected in vò, padua, veneto region. on february 22nd, in order to contain the contagions, the national government decided to put in quarantine 11 municipalities, 10 in the area around lodi and one in vò, near padua [2]. nevertheless, the number of contagions rose to 79, hitting 5 different regions; one of the infected persons in vò died, representing the first registered italian covid-19 victim [3]. on february 23rd there were already 229 confirmed cases in italy. the first lockdown should have lasted until the 6th of march, but due to the still increasing number of contagions in northern italy, the italian prime minister giuseppe conte intended to extend the quarantine zone to almost all of northern italy on sunday, march 8th [4]: travel to and from the quarantine zone was limited to cases of extreme urgency. a draft of the decree announcing the expansion of the quarantine area appeared on the website of the italian newspaper corriere della sera on the late evening of saturday the 7th, causing some panic in the interested areas [5]: around 1000 people, living in milan but coming from southern regions, took trains and planes to reach their places of origin [6] [7]. [1] prima lodi, "paziente 1, il merito della diagnosi va diviso... per due", 8th june 2020. [2] italian gazzetta ufficiale, "decreto-legge 23 febbraio 2020, n. 6". the date is intended to be the very first day of validity of the decree. [3] il fatto quotidiano, "coronavirus, è morto il 78enne ricoverato nel padovano. 15 contagiati in lombardia, un altro in veneto", 22nd february 2020. [4] bbc news, "coronavirus: northern italy quarantines 16 million people", 8th march 2020. [5] the guardian, "leaked coronavirus plan to quarantine 16m sparks chaos in italy", 8th march 2020. in any case, the new quarantine zone covered the entire lombardy region and, partially, 4 other regions. remarkably, close to bergamo, lombardy region, a new outbreak was discovered and the possibility of defining a new quarantine area on march 3rd was considered: this opportunity was later abandoned, due to the new northern italy quarantine zone of the following days. this delay seems to have caused a strong increase in the number of contagions, making the bergamo area the most affected one, in percentage, of the entire country [8]; at the time of writing, there are investigations regarding the responsibility of this choice. on march 9th, the lockdown was extended to the whole country, making italy the first country in the world to decide on a national quarantine [9].
travel was restricted to emergency reasons or to work; all business activities that were not considered essential (unlike, e.g., pharmacies and supermarkets) had to close. until the 21st of march, lockdown measures became progressively stricter all over the country. starting from the 14th of april, some retail activities, such as children's clothing shops, reopened. a first fall in the number of deaths was observed on the 20th of april [10]. a limited reopening started with the so-called "fase 2" (phase 2) on the 4th of may [11]. from the very first days of march, the limited capacity of the intensive care departments to take care of covid-infected patients led to a re-organization of italian hospitals, leading, e.g., to the opening of new intensive care departments [12]. moreover, new forms of communication with the relatives of the patients were proposed, new criteria for intubating patients were developed, and, in the extreme crisis, in the most affected areas, the emergency management had to give priority for hospitalisation to patients with a higher probability of recovery [13]. outbreaks were mainly present in hospitals [19]. unfortunately, healthcare workers were infected by covid-19 [14]. this contagion resulted in a relatively high number of fatalities: by the 22nd of april, 145 covid deaths were registered among doctors. due to the pressure on the intensive care capacity, the healthcare personnel too was subject to extreme stress, especially in the most affected zones [15]. on august 8th, 2019, the leader of lega, the main italian right wing party, announced the withdrawal of his support for the government of giuseppe conte, which had been formed after a post-election coalition between the lega and the m5s; a new government supported by the m5s and the pd followed, and matteo renzi subsequently formed a new center-left party, italia viva (italy alive, iv), due to some discord with the pd. despite the split, italia viva continued to support the current government, having some of its representatives among the ministers and undersecretaries, but often marking its distance from both pd and m5s. due to the great impact that matteo salvini and giorgia meloni - leader of fratelli d'italia, a right wing party - have on social media, they started a massive campaign against the government the day after its inauguration. the regions of lombardy, veneto, piedmont and emilia-romagna experienced the highest number of contagions during the pandemic; among those, the first three are administered by right and center-right wing parties, the fourth one by the pd. the disagreement between the regions and the central government in the management of the pandemic was the occasion to exacerbate the political debate (in italy, regions have a quite wide autonomy for healthcare). the regions administered by the right wing parties criticised the centrality of the decisions regarding the lockdown, while the national government criticised the health management (in lombardy the healthcare system has a peculiar organisation, in which the private sector is supported by public funding) and its ineffective measures to reduce the number of contagions. the debate was exploited even at the national level: the opposition criticized the financial origin of the support to the various economic sectors. moreover, the role of the european union in providing funding to help the italian economy recover after the pandemic was debated. here, we detail the composition of the communities shown in figure 1 of the main text.
we remind the reader that, after applying the louvain algorithm to the validated network of verified twitter users, we could observe 4 main communities, corresponding to: 1) right wing parties and media (in steel blue); 2) center left wing (dark red); 3) 5 stars movement (m5s), in dark orange; 4) institutional accounts (in sky blue). starting from the center-left wing, we can find a darker red community, including various ngos (the italian chapters of unicef, medecins sans frontieres, action aid, emergency, save the children, etc.), various left oriented journalists, vips and pundits [16]. finally, we can find in this group political movements ('6000sardine') and politicians on the left of the pd (such as beppe civati, pietro grasso, ignazio marino) or on the left current of the pd (laura boldrini, michele emiliano, stefano bonaccini). a slightly lighter red sub-community turns out to be composed of the main politicians of the italian democratic party (pd), as well as of representatives from the european parliament (italian and others) and some eu commissioners. the violet red group is mostly composed of the representatives of the newly founded italia viva and of the former italian prime minister matteo renzi (december 2014 - february 2016), also former secretary of the pd. in golden red we can find the subcommunity of catholic and vatican groups. finally, the dark violet red and light tomato subcommunities are composed mainly of journalists. interestingly enough, the dark violet red one also contains accounts related to the city of milan (the mayor, the municipality, the public services account) and the spokesperson of the chinese ministry of foreign affairs. in turn, also the orange (m5s) community shows a clear partition into substructures. in particular, the dark orange subcommunity contains the accounts of politicians, parliament representatives and ministers of the m5s, journalists, and the official account of il fatto quotidiano, a newspaper supporting the 5 stars movement. interestingly, since one of the main leaders of the movement, luigi di maio, is also the italian minister of foreign affairs, we can find in this subcommunity also the accounts of several italian embassies around the world, as well as the accounts of the italian representatives at nato, ocse and oas. in aquamarine, we can find the official accounts of some private and public, national and international, health institutes (such as the italian istituto superiore di sanità, literally the italian national institute of health, the world health organization, and the fondazione veronesi), the minister of health roberto speranza, and some foreign embassies in italy. finally, in the light slate blue subcommunity we can find various italian ministers as well as the italian police and army forces. similar considerations apply to the steel blue community. in steel blue, we find the subcommunity of center right and right wing parties (forza italia, lega and fratelli d'italia). the presidents of the regions of lombardy, veneto and liguria, administered by center right and right wing parties, can be found here. (in the following, this subcommunity will be referred to as fi-l-fdi, recalling the initials of the political parties contributing to this group.) the sky blue subcommunity includes the national federations of various sports, the official accounts of athletes and sport players (mostly soccer) and their teams, as well as sport journals, newscasts and journalists.
the teal subcommunity contains the main italian news agencies, some of the main national and local newspapers, newscasts and their journalists. [16] such as the cartoonists makkox and vauro, the singers marracash, frankiehinrg, ligabue and the il volo vocal band, and journalists from repubblica (ezio mauro, carlo verdelli, massimo giannini) and from the la7 tv channel (ricardo formigli, diego bianchi). in this subcommunity there are also the accounts of many universities; interestingly enough, it also includes all the local public service newscasts. the firebrick subcommunity contains accounts related to the as roma football club; analogously, the dark red one contains the official accounts of ac milan and its players. the slate blue subcommunity is mainly composed of the official accounts of radio and tv programs of mediaset, the main private italian broadcasting company, together with singers and musicians. other smaller subcommunities include other sport federations and sports pundits. finally, the sky blue community is mainly composed of italian embassies around the world. the navy subpartition also contains the official accounts of the president of the republic, of the italian minister of defense and of the eu commissioner for economy and former prime minister, paolo gentiloni. in the study of every phenomenon, it is of utmost importance to distinguish the relevant information from the noise. here, we recall a framework to obtain a validated monopartite retweet network of users: the validation accounts for the information carried not only by the activity of the users, but also by the virality of their messages. the method is represented pictorially in fig. 1. we define a directed bipartite network in which one layer is composed of accounts and the other one of tweets. an arrow connecting a user u to a tweet t represents the user u writing the message t. the arrow in the opposite direction means that the user u is retweeting the message t. to filter out the random noise from this network, we make use of the directed version of the bicm, i.e. the bipartite directed configuration model (bidcm [15]). the projection procedure is then analogous to the one presented in the previous subsection: it is pictorially displayed in fig. 1. briefly, consider the pair of users $u_0$ and $u_1$ and the number of messages written by $u_0$ and shared by $u_1$. then, calculate the distribution of the same measure according to the bidcm: if the related p-value is statistically significant, i.e. if the number of $u_0$'s tweets shared by $u_1$ is much higher than expected by the bidcm, we project a (directed) link from $u_0$ to $u_1$. summarising, the comparison of the observations on the real network with the bidcm makes it possible to uncover all contributions that cannot be explained by the constraints of the null-model. using the technique described in subsection 5.1 of the main text, we are able to assign a community to almost all verified users, based on the perception of the unverified users. due to the fact that the identities of verified users are checked by twitter, we have the possibility of controlling our groups. indeed, as we will show in the following, the network obtained via the bipartite projection provides a reliable description of the closeness of opinions and roles in the social debate. how can we use this information in order to infer the orientation of non verified users?
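the directed bipartite account-post network described above can be assembled in a few lines; the sketch below is our own illustration (hypothetical data, not the authors' code), with authorship edges oriented account -> post and retweet edges oriented post -> account.

```python
import networkx as nx

# hypothetical records: (tweet_id, author, list of accounts retweeting it)
tweets = [
    ("t1", "alice", ["bob", "carol"]),
    ("t2", "bob",   ["alice"]),
    ("t3", "alice", ["carol"]),
]

g = nx.DiGraph()
for tweet_id, author, retweeters in tweets:
    g.add_edge(author, tweet_id)          # authorship: account -> post
    for rt in retweeters:
        g.add_edge(tweet_id, rt)          # retweet: post -> account

# the number of u0's messages shared by u1 is the number of length-2 paths
# u0 -> post -> u1; the validation step compares this count with the bidcm.
shared = sum(1 for t in g.successors("alice") if g.has_edge(t, "carol"))
print(shared)   # 2
```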
in the reference [6] we used the tags obtained for both verified and unverified users in the bipartite network described in subsection 5.1 of the main text and propagated those labels across the network. figure 1 schematic representation of the projection procedure for a bipartite directed network. a) an example of a real directed bipartite network. for the actual application, the two layers represent twitter accounts (turquoise) and posts (gray). a link from a turquoise node to a gray one represents that the post has been written by the user; a link in the opposite direction represents a retweet by the considered account. b) the bipartite directed configuration model (bidcm) ensemble is defined. the ensemble includes all the link realisations, once the number of nodes per layer has been fixed. c) we focus our attention on nodes i and j and count the number of directed common neighbours (in magenta both the nodes and the links to their common neighbours), i.e., the number of posts written by i and retweeted by j. subsequently, d) we compare this measure on the real network with the one on the ensemble: if this overlap is statistically significant with respect to the bidcm, e) we have a link from i to j in the projected network. in a recent analysis, we observed that other approaches are more stable [16]: in the present manuscript we make use of the most stable algorithm. we use the label propagation as proposed in [22] on the directed validated network. in the present appendix we recall the main steps for the definition of an entropy based null model; the interested reader can refer to the review [8]. we start by reviewing the bipartite configuration model [23], which has been used for detecting the network of similarities of verified users. we are then going to examine the extension of this model to bipartite directed networks [15]. finally, we present the general methodology to project the information contained in a bipartite network (directed or undirected), as developed in [24]. let us consider a bipartite network $G^*_{bi}$, in which the two layers are $L$ and $\Gamma$. define $\mathcal{G}_{bi}$ as the ensemble of all possible graphs with the same number of nodes per layer as in $G^*_{bi}$. it is possible to define the entropy related to the ensemble as [20] $S = -\sum_{G_{bi} \in \mathcal{G}_{bi}} P(G_{bi}) \ln P(G_{bi})$ (1), where $P(G_{bi})$ is the probability associated with the instance $G_{bi}$. now we want to obtain the maximum entropy configuration, constraining some relevant topological information regarding the system. for the bipartite representation of verified and unverified users, a crucial ingredient is the degree sequence, since it is a proxy of the number of interactions (i.e. tweets and retweets) with the other class of accounts. thus in the present manuscript we focus on the degree sequence. let us then maximise the entropy (1), constraining the average over the ensemble of the degree sequence. it can be shown [24] that the probability distribution over the ensemble factorises as $P(G_{bi}) = \prod_{i \in L, \alpha \in \Gamma} p_{i\alpha}^{m_{i\alpha}} (1 - p_{i\alpha})^{1 - m_{i\alpha}}$ (2), where $m_{i\alpha}$ are the entries of the biadjacency matrix describing the bipartite network under consideration and $p_{i\alpha}$ is the probability of observing a link between the nodes $i \in L$ and $\alpha \in \Gamma$.
the probability $p_{i\alpha}$ can be expressed in terms of the lagrangian multipliers $x$ and $y$ for nodes on the $L$ and $\Gamma$ layers, respectively, as $p_{i\alpha} = \frac{x_i y_\alpha}{1 + x_i y_\alpha}$ (3). in order to obtain the values of $x$ and $y$ that maximize the likelihood of observing the real network, we need to impose the following conditions [13, 26]: $k_i^* = \sum_{\alpha \in \Gamma} \frac{x_i y_\alpha}{1 + x_i y_\alpha}$ for every $i \in L$ and $k_\alpha^* = \sum_{i \in L} \frac{x_i y_\alpha}{1 + x_i y_\alpha}$ for every $\alpha \in \Gamma$, where the * indicates quantities measured on the real network. actually, the real network is sparse: the bipartite network of verified and unverified users has a connectance $\rho \simeq 3.58 \times 10^{-3}$. in this case the formula (3) can be safely approximated with the chung-lu configuration model, i.e. $p_{i\alpha} \simeq \frac{k_i k_\alpha}{m}$, where $m$ is the total number of links in the bipartite network. in the present subsection we will consider the extension of the bicm to directed bipartite networks and highlight the peculiarities of the network under analysis in this representation. the adjacency matrix describing a directed bipartite network of layers $L$ and $\Gamma$ has a peculiar block structure, once nodes are ordered by layer membership (here the nodes on the $L$ layer first): $A = \begin{pmatrix} O & M \\ N & O \end{pmatrix}$, where the $O$ blocks represent null matrices (indeed they describe links connecting nodes inside the same layer: by construction they are exactly zero) and $M$ and $N$ are non zero blocks, describing links connecting nodes on layer $L$ with those on layer $\Gamma$ and vice versa. in general $M$ and $N$ do not coincide (i.e. $M \neq N^T$), otherwise the network is not distinguishable from an undirected one. we can apply the same machinery of the section above, but extending the degree sequence to a directed degree sequence, i.e. considering the in- and out-degrees $k_i^{out} = \sum_{\alpha \in \Gamma} m_{i\alpha}$ and $k_i^{in} = \sum_{\alpha \in \Gamma} n_{i\alpha}$ for nodes on the layer $L$ (here $m_{i\alpha}$ and $n_{i\alpha}$ represent respectively the entries of the matrices $M$ and $N$), and analogously for nodes on the layer $\Gamma$. the definition of the bipartite directed configuration model (bidcm, [15]), i.e. the extension of the bicm above, follows closely the same steps described in the previous subsection. interestingly enough, the probabilities relative to the presence of links from $L$ to $\Gamma$ are independent of the probabilities relative to the presence of links from $\Gamma$ to $L$. if $q_{i\alpha}$ is the probability of observing a link from node $i$ to node $\alpha$ and $q_{\alpha i}$ the probability of observing a link in the opposite direction, we have $q_{i\alpha} = \frac{x_i^{out} y_\alpha^{in}}{1 + x_i^{out} y_\alpha^{in}}$ and $q_{\alpha i} = \frac{x_i^{in} y_\alpha^{out}}{1 + x_i^{in} y_\alpha^{out}}$, where $x_i^{out}$ and $x_i^{in}$ are the lagrangian multipliers relative to the node $i \in L$, respectively for the out- and the in-degrees, and $y_\alpha^{out}$ and $y_\alpha^{in}$ are the analogous multipliers for $\alpha \in \Gamma$. in the present application we have some simplifications: the bipartite directed network representation describes users (on one layer) writing and retweeting posts (on the other layer). if users are on the layer $L$ and posts on the opposite layer and $m_{i\alpha}$ represents the user $i$ writing the post $\alpha$, then $k_\alpha^{in} = 1$ for every $\alpha \in \Gamma$, since each message cannot have more than one author. notice that, since our constraints are conserved on average, we are considering, in the ensemble of all possible realisations, even instances in which $k_\alpha^{in} > 1$ or $k_\alpha^{in} = 0$, which are, otherwise stated, non physical; nevertheless the average is constrained to the right value, i.e. 1. the fact that $k_\alpha^{in}$ is the same for every $\alpha$ allows for a great simplification of the probability per link on $M$: $q_{i\alpha} = \frac{k_i^{out}}{N_\Gamma}$ (9), where $N_\Gamma$ is the total number of nodes on the $\Gamma$ layer. the simplification in (9) is extremely helpful in the projected validation of the bipartite directed network [2]. the information contained in a bipartite network, directed or undirected, can be projected onto one of the two layers.
the rationale is to obtain a monopartite network encoding the non trivial interactions among the two layers of the original bipartite network. the method is pretty general, once we have a null model in which the probabilities per link are independent, as is the case for both the bicm and the bidcm [24]. the first step is represented by the definition of a bipartite motif that may capture the non trivial similarity (in the case of an undirected bipartite network) or flux of information (in the case of a directed bipartite network). this quantity can be captured by the number of $V$-motifs between users $i$ and $j$ [11, 23], $V_{ij} = \sum_{\alpha \in \Gamma} m_{i\alpha} m_{j\alpha}$, or by its directed extension (note that $V_{ij} = V_{ji}$). we compare the abundance of these motifs with the null models defined above: all motifs that cannot be explained by the null model, i.e. whose p-values are statistically significant, are validated into the projection on one of the layers [24]. in order to assess the statistical significance of the observed motifs, we calculate the distribution associated with the various motifs. for instance, the expected value for the number of v-motifs connecting $i$ and $j$ in an undirected bipartite network is $\langle V_{ij} \rangle = \sum_{\alpha \in \Gamma} p_{i\alpha} p_{j\alpha}$, where the $p_{i\alpha}$ are the probabilities of the bicm. analogously, in the directed case, $\langle V_{i \to j} \rangle = \sum_{\alpha \in \Gamma} q_{i\alpha}\, q_{\alpha j} = \frac{k_i^{out}}{N_\Gamma} \sum_{\alpha \in \Gamma} q_{\alpha j}$, where in the last step we use the simplification of (9) [2]. in both the directed and the undirected case, the distribution of the v-motifs, or of their directed extensions, is a poisson-binomial one, i.e. a binomial distribution in which each event has a different probability. in the present case, due to the sparsity of the analysed networks, we can safely approximate the poisson-binomial distribution with a poisson one [14]. in order to state the statistical significance of the observed value, we calculate the related p-values according to the respective null-models. once we have a p-value for every detected v-motif, the related statistical significance can be established through the false discovery rate (fdr) procedure [3]. with respect to other multiple hypothesis tests, fdr controls the number of false positives. in our case, all rejected hypotheses identify the v-motifs that cannot be explained only by the ingredients of the null model and thus carry non trivial information regarding the system. in this sense, the validated projected network includes a link for every rejected hypothesis, connecting the nodes involved in the related motifs. in the main text, we solved the problem of assigning the orientation to all relevant users in the validated retweet network via a label propagation. the approach is similar to, but different from, the one proposed in [6], the differences being in the starting labels, in the label propagation algorithm and in the network used. in this section we will review the method employed in the present article, compare it to the one in [6] and evaluate the deviations from other approaches. the first step of our methodology is to extract the polarisation of verified users from the bipartite network, as described in section 5.1 of the main text, in order to use it as seed labels in the label propagation. in reference [6], a measure of the "adherence" of the unverified users towards the various communities of verified users was used in order to infer their orientation, following the approach in [2], in turn based on the polarisation index defined in [4]. this approach performed extremely well when practically all unverified users interacted at least once with verified ones, as in [2].
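a minimal sketch of this validation pipeline, under the sparse-network (chung-lu) approximation that the text itself declares safe; the variable names, the toy biadjacency matrix and the 5% fdr level are our own choices, not the authors' code.

```python
import numpy as np
from scipy.stats import poisson
from statsmodels.stats.multitest import multipletests

# toy biadjacency matrix (rows: verified users, columns: unverified accounts);
# real networks are far sparser, so the approximated probabilities stay << 1.
m = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]])
k_row, k_col, links = m.sum(axis=1), m.sum(axis=0), m.sum()

p = np.outer(k_row, k_col) / links      # chung-lu approximation of the bicm, p_ia ~ k_i k_a / m

observed = m @ m.T                      # observed v-motifs V_ij
expected = p @ p.T                      # expected v-motifs under the null model

iu = np.triu_indices_from(observed, k=1)
pvals = poisson.sf(observed[iu] - 1, expected[iu])   # poisson approximation: P(V >= observed)

rejected = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
validated_pairs = [(i, j) for (i, j), r in zip(zip(*iu), rejected) if r]
print(validated_pairs)                  # links of the validated projection on the user layer
```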
while still having good performances in a different dataset as the one studied in [6] , we observed isolated deviations: it was the case of users with frequent interactions with other unverified accounts of the same (political) orientation, randomly retweeting a different discursive community verified user. in this case, focusing just on the interaction with verified accounts, those nodes were assigned a wrong orientation. the labels for the polarisation of the unverified users defined [6] were subsequently used as seed labels in the label propagation. due to the possibility described above of assigning wrongly labels to unverified accounts, in the present paper, we consider only the tags of verified users, since they pass a strict validation procedure and are more stable. in order to compare the results obtained with the various approaches, we calculated the variation of information (vi, [17] ). v i considers exactly the different in information contents captured by two different partition, as consider by the shannon entropy. results are reported in the matrix in figure 2 for the 23th of february (results are similar for other days). even when using the weighted retweet network as "exact" result, the partition found by the label propagation of our approach has a little loss of information, comparable with the one of using an unweighted approach. indeed, the results found by the various community detection algorithms show little agreement with the label propagation ones. nevertheless, we still prefer the label propagation procedure, since the validated projection on the layer of verified users is theoretically sound and has a non trivial interpretation. the main result of this work quantifies the level of diffusion on twitter of news published by sources considered scarcely reputable. academy, governments, and news agencies are working hard to classify information sources according to criteria of credibility and transparency of published news. this is the case, for example, of newsguard, which we used for the tagging of the most frequent domains in the direct validated network obtained according to the methodology presented in the previous sections. as introduced in subsection 3.2 of the main text, the newsguard browser extension and mobile app [19] offers a reliability result for the most popular newspapers in the world, summarizing with a numerical score the level of credibility and journalistic transparency of the newspaper. with the same philosophy, but oriented towards us politics, the fact-checking site politifact.com reports with a 'truth meter' the degree of truthfulness of original claims made by politicians, candidates, their staffs, and, more, in general, protagonists of us politics. one of the eldest fact-checking websites dates back to 1994: snopes.com, in addition to political figures, is a fact-checker for hoaxes and urban legends. generally speaking, a fact-checking site has behind it a multitude of editors and journalists who, with a great deal of energy, manually check the reliability of a news, or of the publisher of that news, by evaluating criteria such as, e.g., the tendency to correct errors, the nature of the newspaper's finances, and if there is a clear differentiation between opinions and facts. thus, it is worth noting that recent attempts tried to automatically find articles worthy of being fact-checked. 
for example, work in [1] uses a supervised classifier, based on an ensemble of neural networks and support vector machines, to figure out which politicians' claims need to be debunked, and which have already been debunked. despite the tremendous effort of stakeholders to keep the fact-checking sites up to date and functioning, disinformation resists debunking due to a combination of factors. there are psychological aspects, like the quest for belonging to a community and getting reassuring answers, the adherence to one's viewpoint, an innate reluctance to change opinion [28, 29], and the formation of echo chambers [10], where people polarize their opinions as they are insulated from contrary perspectives: these are key factors for people to contribute to the success of disinformation spreading [7, 9]. moreover, researchers have demonstrated how the spreading of false news is strategically supported by the massive and organized use of trolls and bots [25]. despite the need to educate users in a conscious consumption of online information through means other than purely technological solutions, there is a series of promising works that exploit classifiers based on machine learning or on deep learning to tag news items as credible or not. one interesting approach is based on the analysis of spreading patterns on social platforms. monti et al. recently provided a deep learning framework for the detection of fake news cascades [18]. a ground truth is acquired by following the example of vosoughi et al. [27], collecting twitter cascades of verified false and true rumors. employing a novel deep learning paradigm for graph-based structures, cascades are classified based on user profile, user activity, network and spreading, and content. [19] https://www.newsguardtech.com/. the main result of the work is that 'a few hours of propagation are sufficient to distinguish false news from true news with high accuracy'. this result has been confirmed by other studies too. work in [30], by zhao et al., examines diffusion cascades on weibo and twitter: focusing on topological properties, such as the number of hops from the source and the heterogeneity of the network, the authors demonstrate that networks in which fake news is diffused feature characteristics really different from those diffusing genuine information. the investigation of diffusion networks appears to be a definitive path to follow for fake news detection. this is also confirmed by pierri et al. [21]: also here, the goal is to classify 'news articles pertaining to bad and genuine information by solely inspecting their diffusion mechanisms on twitter'. even in this case, results are impressive: a simple logistic regression model is able to correctly classify news articles with a high accuracy (auroc up to 94%).
- the political blogosphere and the 2004 u.s. election: divided they blog
- coronavirus: 'deadly masks' claims debunked (2020)
- coronavirus: bill gates 'microchip' conspiracy theory and other vaccine claims fact-checked
- extracting significant signal of news consumption from social networks: the case of twitter in italian political elections
- fast unfolding of communities in large networks
- influence of fake news in twitter during the 2016 us presidential election
- how does junk news spread so quickly across social media? algorithms, advertising and exposure in public life
- the role of bot squads in the political propaganda on twitter
- tracking social media discourse about the covid-19 pandemic: development of a public coronavirus twitter data set
- the statistical physics of real-world networks
- political polarization on twitter
- predicting the political alignment of twitter users
- partisan asymmetries in online political activity
- echo chambers: emotional contagion and group polarization on facebook
- mapping social dynamics on facebook: the brexit debate
- tackling covid-19 disinformation - getting the facts right (2020)
- speech of vice president věra jourová on countering disinformation amid covid-19 - from pandemic to infodemic
- filter bubbles, echo chambers, and online news consumption
- community detection in graphs
- finding users we trust: scaling up verified twitter users using their communication patterns
- opinion dynamics on interacting networks: media competition and social influence
- near linear time algorithm to detect community structures in large-scale networks
- randomizing bipartite networks: the case of the world trade web
- inferring monopartite projections of bipartite networks: an entropy-based approach
- maximum-entropy networks. pattern detection, network reconstruction and graph combinatorics
- journalists on twitter: self-branding, audiences, and involvement of bots
- emotional dynamics in the age of misinformation
- debunking in a world of tribes
- coronavirus, a milano la fuga dalla "zona rossa": folla alla stazione di porta garibaldi
- coronavirus, l'illusione della grande fuga da milano. ecco i veri numeri degli spostamenti verso sud
- coronavirus: italian army called in as crematorium struggles to cope with deaths
- coronavirus: italy extends emergency measures nationwide
- italy sees first fall of active coronavirus cases: live updates
- coronavirus in italia, verso primo ok spostamenti dal 4/5, non tra regioni
- italy's health care system groans under coronavirus - a warning to the world
- negli ospedali siamo come in guerra. a tutti dico: state a casa
- coronavirus: ordini degli infermieri, 4 mila i contagiati
- automatic fact-checking using context and discourse information
- extracting significant signal of news consumption from social networks: the case of twitter in italian political elections
- controlling the false discovery rate: a practical and powerful approach to multiple testing
- users polarization on facebook and youtube
- fast unfolding of communities in large networks
- the role of bot squads in the political propaganda on twitter
- the psychology behind fake news
- the statistical physics of real-world networks
- fake news: incorrect, but hard to correct. the role of cognitive ability on the impact of false information on social impressions
- echo chambers: emotional contagion and group polarization on facebook
- graph theory (graduate texts in mathematics)
- resolution limit in community detection
- maximum likelihood: extracting unbiased information from complex networks. phys rev e - stat nonlinear
- on computing the distribution function for the poisson binomial distribution
- reconstructing mesoscale network structures
- the contagion of ideas: inferring the political orientations of twitter accounts from their connections
- comparing clusterings by the variation of information
- fake news detection on social media using geometric deep learning
- at the epicenter of the covid-19 pandemic and humanitarian crises in italy: changing perspectives on preparation and mitigation
- near linear time algorithm to detect community structures in large-scale networks
- randomizing bipartite networks: the case of the world trade web
- inferring monopartite projections of bipartite networks: an entropy-based approach
- the spread of low-credibility content by social bots
- analytical maximum-likelihood method to detect patterns in real networks
- a question of belonging: race, social fit, and achievement
- cognitive and social consequences of the need for cognitive closure
- fake news propagate differently from real news even at early stages of spreading
- analysis of online misinformation during the peak of the covid-19 pandemics in italy
supplementary material - guido caldarelli, rocco de nicola, marinella petrocchi, manuel pratelli and fabio saracco. there is another difference in the label propagation used here with respect to the one in [6]: in the present paper we used the label propagation of [22], while the one in [6] was quite home-made. as in reference [22], the seed labels of [6] are fixed, i.e. they are not allowed to change [17]. the main difference is that, in case of a draw among the labels of the first neighbours, in [22] the tie is broken randomly, while in the algorithm of [6] the label is not assigned and goes into a new run, with the newly assigned labels. moreover, the update of labels in [22] is asynchronous, while it is synchronous in [6]. we opted for the one in [22] because it is effectively a standard among label propagation algorithms, being stable, more studied, and faster [18]. finally, differently from the procedure in [6], we applied the label propagation not to the entire (undirected version of the) retweet network, but to the (undirected version of the) validated one. (the intent of choosing the undirected version is that, in both cases in which a generic account is significantly retweeting or significantly retweeted by another one, the two accounts probably share some vision of the phenomena under analysis, so we are not interested in the direction of the links in this situation.) the rationale for using the validated network is to reduce the calculation time (due to the dimensions of the dataset), while obtaining an accurate result. while the previous differences from the procedure of [6] are dictated by conservativeness (the choice of the seed labels) or by the adherence to a standard (the choice of [22]), this last one may be debatable: why should choosing the validated network return "better" results than those calculated on the entire retweet network? we considered the case of a single day (in order to reduce the calculation time) and studied 6 different approaches: 1) a louvain community detection [5] on the undirected version of the validated network of retweets; 2) a louvain community detection on the undirected version of the unweighted retweet network; 3) a louvain community detection on the undirected version of the weighted retweet network, in which the weights are the number of retweets from user to user; 4) a label propagation a la raghavan et al. [22] on the directed validated network of retweets; 5) a label propagation a la raghavan et al. on the (unweighted) retweet network; 6) a label propagation a la raghavan et al. on the weighted retweet network, the weights being the number of retweets from user to user. actually, due to the order dependence of louvain [12], we ran the louvain algorithm several times after reshuffling the order of the nodes, taking the partition into communities that maximises the modularity.
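a minimal sketch of this repeated-run strategy, together with the raghavan-style label propagation mentioned next, using the community routines of networkx (louvain_communities requires networkx >= 2.8); the toy graph and the number of runs are our own choices.

```python
import networkx as nx
from networkx.algorithms import community

g = nx.karate_club_graph()   # stand-in for the (undirected) validated retweet network

# louvain is order-dependent: run it several times with different seeds and keep
# the partition with the highest modularity, as described above.
best_partition, best_q = None, float("-inf")
for seed in range(20):
    partition = community.louvain_communities(g, seed=seed)
    q = community.modularity(g, partition)
    if q > best_q:
        best_partition, best_q = partition, q
print(round(best_q, 3), len(best_partition))

# raghavan et al. label propagation: asynchronous updates, random tie-breaking
lpa_partition = list(community.asyn_lpa_communities(g, seed=42))
print(len(lpa_partition))
```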
similarly, the label propagation of [22] has a certain level of randomness: we run it several times and choose the most frequent label assignment for every node. key: cord-027304-a0vva8kb authors: achermann, guillem; de luca, gabriele; simoni, michele title: an information-theoretic and dissipative systems approach to the study of knowledge diffusion and emerging complexity in innovation systems date: 2020-05-23 journal: computational science iccs 2020 doi: 10.1007/978-3-030-50423-6_19 sha: doc_id: 27304 cord_uid: a0vva8kb the paper applies information theory and the theory of dissipative systems to discuss the emergence of complexity in an innovation system, as a result of its adaptation to an uneven distribution of the cognitive distance between its members. by modelling, on one hand, cognitive distance as noise, and, on the other hand, the inefficiencies linked to a bad flow of information as costs, we propose a model of the dynamics by which a horizontal network evolves into a hierarchical network, with some members emerging as intermediaries in the transfer of knowledge between seekers and problem-solvers. our theoretical model contributes to the understanding of the evolution of an innovation system by explaining how the increased complexity of the system can be thermodynamically justified by purely internal factors. complementing previous studies, we demonstrate mathematically that the complexity of an innovation system can increase not only to address the complexity of the problems that the system has to solve, but also to improve the performance of the system in transferring the knowledge needed to find a solution. formalization and organization of a network becomes strategic to accelerate the flow of information and knowledge and the emergence of innovation [2] . several forms of reticular organization (hierarchical, heterarchical, according to the centrality of elements, according to the transitivity of element, etc.) can be conceptualized within that context. evolutionary economics and technology studies highlight (neo-schumpeterian) models to understand the plurality of evolution cases, depending on the initial forms of organization, but also on the ability of a system to adapt to systemic crises. in this work we study, from an information-theoretical perspective, the relationship between the structure of an innovation network, the noise in its communication channels and the energy costs associated with the network's maintenance. an innovation network is here considered to encompass a variety of organisations who, through their interactions and the resulting relationships, build a system conducive to the emergence of innovation. this system is identified by the literature [3] with different terms, such as innovation ecosystem, [4] problem-solving network, [5] or innovation environment [6] . in this system, the information channels transfer a multitude of information and knowledge which, depending on the structural holes, [7, 8] but also on the absence of predetermined receivers, are subject to information "noise" [9] . the more the information is distorted in the network, the more energy is needed to transfer accurate information, in order to keep performance of the innovation network high. the idea we propose is that the structure of an innovation system evolves to address the heterogeneity in the quality of communication that takes place between its members. 
in particular, we argue that the noise in a network increases the complexity of the network structure required for the accurate transfer of information and knowledge, and thus leads to the emergence of hierarchical structures. these structures, thanks to their fractal configuration, make it possible to combine high levels of efficiency in the transmission of information, with low network maintenance costs. this idea complements previous studies that have analysed the relationship between the structure of an innovation network, on one hand, and the complexity of the problem to be solved and the resulting innovation process, on the other, [10] by focusing on communication noise and cost of network structure maintenance. to the existing understanding of this phenomenon we contribute by identifying a thermodynamically efficient process which the network follows as it decreases in entropy while simultaneously cutting down its costs. this model is based on the analysis of a network composed of two classes or categories of organisations, which operate within the same innovation system [11] . these classes are represented by a central organisation called seeker, which poses a research question to a group of other organisations, called problem-solvers, and from which in turn receives a solution. it has been suggested [12] that one of the problems that the innovation system has to solve, and for which it self-organises, is the problem of effective diffusion of knowledge between problem-solvers and solution-seekers, as this can be considered as a problem sui generis. the theory on the diffusion of knowledge in an innovation system suggests that this problem is solved through the evolution of modular structures in the innovation network, which implies the emergence of organisations that act as intermediary conduits of knowledge between hyperspecialised organisations in the same innovation environment [13] . a modular structure is, in network theory, connected to the idea of a hierarchical or fractal structure of the network, [14] and is also characterised by scale-invariance; [15] the latter is a particularly important property, because if innovation systems have it as an emergent property of their behaviour, this allows them to be considered as complex adaptive systems [16] . it has been suggested that scale-invariance property of an innovation system might emerge as the result of horizontal cooperation between its elements, [17] which try to reach the level of complexity required to solve a complex problem; but it is not yet clear how does a complex structure emerge when the complexity of the problem does not vary, which is a phenomenon observed empirically [18, 19] . in this paper we show how complexity can also vary as a result of a non-uniform distribution of the cognitive distance between organisations of the network, and of the adaptation required to solve the problem of knowledge diffusion among them. our contribution to the theoretical understanding on the self-organising properties of innovation systems is that, by framing the problem of heterogeneous cognitive distance between organisations under the theory of dissipative systems, we can explain in thermodynamically efficient terms the reduction in entropy of an innovation system, as an emergent adaptation aimed at reducing costs of maintenance of the system's structure. the theoretical framework which we use for this paper is comprised by four parts. 
first, we will frame the innovation system as a thermodynamically-open system, which is a property that derives from the fact that social systems also are [20] . second, we will see under what conditions a system can undertake self-organisation and evolution. this will allow us to consider an innovation system as a complex adaptive system, should it be found that there are emergent properties of its behaviour which lead to an increase in complexity. third, we will frame the innovation system as a dissipative system, which is a property also shared by social systems [21] . dissipative systems are characterised by the fact that a variation in the level of their entropy tends to happen as a consequence of their changed ability to process inputs, and we will see how this applies for innovation systems. lastly, we will study cognitive distance as it applies to a network of innovation, in order to show how a spontaneous reduction in it leads to an increase in complexity of the network. systems. an open thermodynamic system is defined as a system which exchanges matter and energy with its surrounding environment, [22] and among them are found all social systems, which are open systems due to their exchanging of energy with the surrounding environment [23] . social systems are also dynamical systems, because their structure changes over time through a process of dynamical evolution [24] . innovation systems are some special classes of social systems, [25] which can thus also be considered as open systems [26] . in addition to this, like all social systems, innovation systems are also capable of selforganisation, [27] which is a property that they inherit from social systems [28] . there is however a property which distinguishes innovation systems from the generic social system: that is, the capacity of the former to act as problem-solving environments [11] . an innovation system possesses the peculiar function of developing knowledge, [29] which is not necessarily possessed by the general social system [30] . it has been theorised that developing and distributing knowledge [31] is the method by which the innovation system implements the function of solving problems, [32, 33] and we will be working within this theoretical assumption. the innovation system, for this paper, is therefore framed as a thermodynamically-open social system which solves problems through the development and diffusion of knowledge. evolution and self-organisation. like all other social systems, [34] an innovation system undertakes evolution [35] and changes in complexity over time [36] . the change in complexity of a system, in absence of any central planning or authority, is called in the literature self-organisation [37] . self-organisation in a system implies that the system's components do not have access to global information, but only to information which is available in their immediate neighbourhood, and that upon that information they then act [28] . innovation systems evolve, with a process that may concern either their members, [38] their relationships and interactions, [39] the technological channels of communication, [40] the policies pursued in them, [41] or all of these factors simultaneously [42] . for the purpose of this work we will limit ourselves to consider as evolution of an innovation system the modification of the existing relationships between its members, and the functions which they perform in their system. 
this process of evolution of the innovation system is characterised by self-organisation, [43] and it occurs along the lines of both information [44] and knowledge flows within the system [45] . the selforganisation of an innovation system is also the result of evolutionary pressures, [46] and we will here argue that one form of such pressures is cognitive distance between organisations within a network of innovation, whose attempt at reduction may lead to modifications in the relationships within the system and to the emergence of complex structures. while it has also been suggested that variations in the complexity of an innovation system might be the consequence of intrinsic complexity of the problems to be solved, [47] it has also been suggested that problems related to the transfer of knowledge within the elements of the system can, by themselves, generate the emergence of complex network structures, through a process which is thermodynamically advantageous. dissipative innovation systems. as the innovation system acquires a more complex structure, its entropy decreases. if one assumes that the decrease in entropy follows the expenditure of some kind of energy by the system, without which its evolution towards a lower-entropy state is not possible, then it follows that the innovation system can be framed as a dissipative system. this is a consequence of the theory which, in more general terms, suggests that all social systems can be considered as dissipative systems; [48] and, among them, innovation systems can thus also be considered as dissipative systems [49] . the application of the theory of dissipative structures [50] to the study of social systems has already been done in the past, [51, 52] and it has also been applied to the study of innovation systems specifically, [53, 54] to understand the process by which new structures evolve in old organisational networks [55] . by framing the problem in this manner the emergence of a hierarchical structure in a dissipative innovation system can be considered as a process through which the innovation system reaches a different level of entropy in its structure, [56] by means of a series of steps which imply sequential minimal variations in the level of entropy of the system, and lead to the emergence of complexity [57] . cognitive distance as noise. the process of transferring knowledge between organisations presumes the existence of knowledge assets that are transferred [58] . companies embedded in an innovation system are therefore characterised by an intellectual or knowledge capital, [59] which is the sum of the knowledge assets which they possess, [60] and which in turn are the result of the individual organisation's path of development, [61] and of the knowledge possessed by the human and technological components of the organisation [62] . any two organisations do not generally share the same intellectual capital, and therefore there are differences in the knowledge assets which they possess, and in the understanding and representation which they create about the world. this difference is called "cognitive distance" in the literature on knowledge management, and it refers to the difficulty in transferring knowledge between any two organisations [63] . 
the theory suggests that an innovation network has to perform a trade-off between increasing cognitive distance between organisations, which means higher novelty value, and increasing mutual understanding between them, which gives a higher transfer of knowledge at the expense of novelty [64]. it has been argued that if alliances (that is, network connections) are successfully formed between organisations with high cognitive distance between their members, this leads to a higher production of innovation by that alliance, [65] as a consequence of the relationship between cognitive distance and novelty, as described above. it has also been argued that the measure of centrality of an organisation in an innovation network is a consequence of the organisation's impact on the whole knowledge governance process, with organisations contributing more to it located more centrally in the network [66]. we propose that this known mechanism might play a role in the dynamic evolution of an innovation system, in a manner analogous to that of noise in an information system. the idea is that an organisation generally possessing a lower cognitive distance to multiple components of a network might spontaneously become a preferential intermediary for the transfer of knowledge within the innovation system, and as a consequence of this a hierarchical network structure emerges out of a lower-ordered structure. 2 the structure of the network and its evolution. the modelling of the process of evolution of a network of innovation is conducted as follows. first, we imagine that there are two different structures of the ego-network of an innovation seeker company that are the subject of our analysis. the first is a horizontal network, in which a seeker organisation is positioned in a network of solvers, which are all directly connected with the seeker organisation in question. the second is a hierarchical or fractal network, in which a structure exists that represents the presence of intermediaries in the transfer of knowledge between the seeker organisation and the solving organisations in the same network. all nodes besides the seeker organisation being studied in the first scenario, and all nodes at the periphery of the hierarchical structure in the second scenario, are from here on called solvers. there are n nodes in the ego-network of an innovation seeker company. the n nodes in the horizontal network are all solver nodes, while the n nodes in the hierarchical network are divided into two classes of nodes: the intermediaries, comprised of m nodes, and the solvers, comprised of m² nodes (fig. 1). in order to make the two network structures comparable we impose the additional condition that the total number of nodes in the two networks is the same, which is satisfied for n = m² + m. we also impose the additional condition that each of the n solver nodes in the periphery of the horizontal network has at least m link neighbours belonging to n, as this allows us to describe a dynamical process which leads from the horizontal network to the hierarchical network without the creation of new links. the hierarchical network always possesses a lower entropy than the horizontal network comprised of the same number of nodes.
this can be demonstrated by using as a measure of entropy shannon's definition, [67] which calculates it as the amount of information required to describe the current status of a system, according to the formula h(x) = -∑_i p(x_i) log p(x_i). this measure of entropy can be applied to a social network by assigning the random variable x to the flattened adjacency matrix of the edges of the network, as done by others [68]. the adjacency matrices of the two classes of networks, in relation to the size n + 1 of the same network, are indicated in the tables below for the specific value m = 2, and hence n = 6 (table 1). in general, for any value m ≥ 2, if a horizontal network has n solver nodes, one seeker node is connected to all other n nodes, and all solver nodes are additionally connected to m solver nodes each, where m + m² = n. in a hierarchical network with m intermediary nodes and m² solver nodes, the seeker node is connected to the m intermediary nodes, and each of the intermediary nodes is connected to m solver nodes. the general formulation of the adjacency matrix is indicated below, in relation to the value of m (table 2). the adjacency matrices can be flattened by either chaining all rows or all columns together, in order to obtain a vector x which univocally corresponds to a given matrix. this vector has a dimensionality of (n + 1)², having been derived from an (n + 1) by (n + 1) matrix. the vector x which derives from flattening can then be treated as the probability distribution over a random binary variable, and shannon's measure of entropy can be computed on it. for the horizontal network, the vector x_horizontal has the value 1 twice for each of the peripheral nodes, because of their connection to the centre, and then again twice for each of the peripheral nodes, because of their links to other peripheral nodes. this means that the vector x_horizontal corresponds to the probability distribution (2). for the hierarchical network, the vector x_hierarchical has the value 1 twice for each of the m intermediary nodes, and then twice for each of the m² solver nodes. the probability distribution associated with the vector x_hierarchical is therefore (3). the hierarchical network systematically possesses a lower level of entropy than a horizontal network with the same number of nodes, as shown in the graph below (fig. 2). since we consider the network as a dissipative system, the lower level of entropy implies an expected higher energetic cost of maintenance for the lower-entropy structure. it follows from this theoretical premise that the hierarchical network should either allow the system to receive a higher input, or emit a lower output, or both simultaneously, lest its structure decay to a higher-entropy form, the horizontal one. an innovation system which starts evolving from a horizontal structure would tend to develop a hierarchical structure as a solution to the problem of the transfer of knowledge in a network where cognitive distance is not uniformly distributed, as we will see in this paragraph. this can be shown by considering the hierarchical network as an attractor for the dynamical evolution of a horizontal network, under the condition that the cognitive distance between pairs of nodes is distributed non-uniformly.
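the entropy comparison just described can be reproduced numerically. the sketch below is my own reading of the construction, not code from the paper: the ring layout for the solver-to-solver links (one way to give every solver at least m solver neighbours) and the binary-distribution reading of the flattened vector are assumptions.

```python
# a rough numerical sketch of the entropy comparison (my interpretation, not code from
# the paper): build the two (n + 1) x (n + 1) adjacency matrices for a given m, flatten
# them, treat the share of ones in the flattened 0/1 vector as the parameter of a binary
# distribution, and compute its shannon entropy.
import numpy as np

def shannon_entropy_binary(flat):
    p1 = flat.mean()                          # probability of drawing a 1 from the flattened vector
    p = np.array([p1, 1.0 - p1])
    p = p[p > 0]                              # avoid log(0)
    return float(-(p * np.log2(p)).sum())

def horizontal(m):
    n = m**2 + m
    a = np.zeros((n + 1, n + 1), dtype=int)
    a[0, 1:] = 1                              # seeker (node 0) linked to all n solvers
    a[1:, 0] = 1
    for i in range(1, n + 1):                 # each solver linked to at least m other solvers (ring)
        for k in range(1, m // 2 + m % 2 + 1):
            j = (i - 1 + k) % n + 1
            a[i, j] = a[j, i] = 1
    return a

def hierarchical(m):
    n = m**2 + m
    a = np.zeros((n + 1, n + 1), dtype=int)
    a[0, 1:m + 1] = 1                         # seeker linked to the m intermediaries
    a[1:m + 1, 0] = 1
    for h in range(m):                        # intermediary h linked to its own m solvers
        for s in range(m):
            j = m + 1 + h * m + s
            a[h + 1, j] = a[j, h + 1] = 1
    return a

for m in (2, 3, 4):
    print(f"m={m}: horizontal entropy={shannon_entropy_binary(horizontal(m).ravel()):.4f}, "
          f"hierarchical entropy={shannon_entropy_binary(hierarchical(m).ravel()):.4f}")
```

for these values of m the hierarchical matrix contains fewer ones, so its binary distribution is more skewed and its entropy lower, consistent with the claim in the text.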
stationary states. for the context of this paper, we model a finite-state network which operates in discrete time, using state and output functions that describe the dynamics of a dissipative system evolving over time [69]. these functions have the form x(k + 1) = f(x(k), u(k)) and y(k) = g(x(k), u(k)), with x(k) being the state of the system at time k, u(k) being the input to the system at k, and y(k) being the output of the system. if the system does not undertake change in its internal structure, having already reached a stationary state, then x(k + 1) = x(k). as we want to study whether the system spontaneously evolves from a horizontal to a hierarchical structure, we can assume that x(k + 1) = f_hierarchical(x(k), u(k)) = x(k), which can only be true if either the input u(k) is 0, which is not the case if the system is active, or if u(k + 1) = u(k). for the innovation system this condition is valid if minor variations in the structure of the network associated with it do not lead to a significant variation of the input to the system, which means that no advantage in the receipt by the seeker of solutions found by the solvers should arise. if this is true, and if the hierarchical structure is an attractor for the corresponding horizontal network, then we expect the input of the horizontal network to increase as it acquires a modular structure and develops into a hierarchical network. input of the system. the input function of the system depends on the receipt by the seeker organisation of a solution to a problem found by one of the peripheral solver organisations, as described above. let us imagine that at each timestep the solver organisations do indeed find a solution, and that the input u(k) thus depends on the number of solver nodes and, for each of them, on the probability of correct transmission of knowledge from them to the seeker organisation, which increases as the cognitive distance between two communicating nodes decreases. if this is true, then the input to the horizontal network is a function of the form u_horizontal(n_k, p_k), where n is the number of solver nodes and p is the cognitive distance in the knowledge transmission channel. similarly, the input to the hierarchical network is a function of the form u_hierarchical(m_k², q_k), which depends on the m² solver nodes in the hierarchical network and on the parameter q, which describes the cognitive distance. n and m are such that, as they increase, so do u_horizontal and u_hierarchical respectively; while p and q are such that, as they decrease, u_horizontal and u_hierarchical respectively increase. it can then be argued that if p < q then u_horizontal > u_hierarchical, which means that the system would not evolve into a hierarchical network. it can also be noted that, if n and m² are sufficiently large, then lim_{n,m→+∞} (n / m²) = 1, and therefore any difference between the number of solvers in the two network structures would not play a role in the input to the innovation system. from this it follows that u_hierarchical > u_horizontal if q < p; that is, the input to the innovation system with a hierarchical structure is higher than the input to the innovation system with a horizontal structure if the cognitive distance between the members of the former is lower than the cognitive distance between the members of the latter. output of the system. as for the output of the system, we can imagine that there is a cost to be paid for the maintenance of the communication channels through which the seeker receives solutions from the solvers. if the system is in a stationary state, the condition y(k + 1) = y(k) must hold, as it follows from the consideration that u(k + 1) = u(k).
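the monotonicity and limit claims above can be made concrete with a toy calculation. the functional form used below (input = number of solver nodes × (1 − cognitive distance)) is not from the paper; it is a hypothetical choice that merely respects the stated assumptions, and the values of p and q are likewise invented.

```python
# illustrative only: the paper does not specify functional forms for the inputs. here i
# assume, purely as a hypothetical example, that the expected input equals
# (number of solver nodes) * (1 - cognitive distance), which respects the stated
# monotonicity: input rises with the number of solvers and falls with cognitive distance.
def u_horizontal(m, p):
    n = m**2 + m                    # total solver nodes in the horizontal network
    return n * (1.0 - p)

def u_hierarchical(m, q):
    return m**2 * (1.0 - q)         # only the m**2 peripheral solvers contribute

p, q = 0.40, 0.30                   # hypothetical cognitive distances, with q < p
for m in (2, 5, 20, 100):
    n = m**2 + m
    print(f"m={m:4d}  n/m^2={n / m**2:.3f}  "
          f"u_horizontal={u_horizontal(m, p):10.1f}  "
          f"u_hierarchical={u_hierarchical(m, q):10.1f}")
# as m grows, n/m^2 -> 1, so the solver-count difference washes out and the comparison
# is driven by the cognitive distances: q < p gives the hierarchical network the larger
# input for large enough m.
```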
if the system is not in a stationary state, then as the input to the system increases, so should the output, under the hypothesis of a dissipative system described above. a graphical representation of the evolution of the system from a higher- to a lower-entropy state is thus presented below (fig. 3: evolution of a branch of the innovation network from a higher- to a lower-entropy structure, from left to right; the letters p and q denote respectively a high and a low cognitive distance between peers). the seeker organisation would at each step receive a solution transferred by one of its link neighbours, with the indication of the full path through which the communication has reached it. the seeker would then pay a certain cost, an output in the terminology of dissipative systems, for the maintenance of the channel through which the solution has been transferred to it successfully. such channels increase in intensity or weight, and are more likely to be used in subsequent iterations. on the contrary, channels through which a solution has not been received in a given iteration are decreased in intensity or weight, and are less likely to be used in the future. a process such as the one described would eventually, if enough iterations are performed, lead to the withering of links between nodes with a higher cognitive distance, and to the preservation of links between nodes with a lower cognitive distance. new connections are not formed, because cognitive distance is considered to be an exogenous parameter in this model, which does not vary once the innovation system starts evolving. real-world phenomena are not characterised by this restriction, which should be considered when analysing real-world systems under this model. the originality of this paper consists in the framing of an innovation system under different theoretical approaches, such as those of thermodynamically-open systems, self-organisation and evolution, dissipative systems, and cognitive distance, which, when combined, highlight another way of understanding the overall operation and the evolution of innovation systems. from this perspective, the process which we here describe accounts for an emergent complexity of the innovation system, which can occur without central planning and on the basis of information locally available to its members. this seems to confirm the theory according to which innovation systems can self-organise to solve, among others, the problem of the transfer of knowledge among their members. this seems also to suggest that, if the only form of proximity which matters is cognitive, and not geographical, organisational, or other, it might be possible to infer the cognitive distance between the members of an innovation system on the basis of the way in which their relationships change over time. the theoretical prediction which this model allows us to make is that, should a connection between members of an innovation system be preserved while others are dropped, this means that the cognitive distance between the pairs of nodes with surviving connections is lower than that of other nodes in their ego-networks.
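the reinforcement-and-withering dynamics described in this passage can be illustrated with a toy simulation. the update factors, success probabilities, pruning threshold and channel names below are all invented for the sketch (the paper specifies no numerical scheme); the only property carried over from the text is that delivery succeeds more often, and the channel is therefore reinforced more often, where cognitive distance is lower.

```python
# a toy simulation (my own illustration, not the authors' model) of the reinforcement
# dynamics described above: channels that deliver a solution gain weight, the others
# lose weight, and links with higher cognitive distance tend to wither away.
import random

random.seed(1)

cognitive_distance = {"s1": 0.1, "s2": 0.2, "s3": 0.7, "s4": 0.8}   # hypothetical channels to the seeker
weight = {s: 1.0 for s in cognitive_distance}                        # initial channel intensities

for _ in range(200):
    for solver, d in cognitive_distance.items():
        if solver not in weight:
            continue                                                 # channel already withered away
        if random.random() < (1.0 - d):                              # solution delivered this iteration
            weight[solver] *= 1.05                                   # seeker pays to maintain the channel
        else:
            weight[solver] *= 0.90                                   # unused channel loses intensity
    weight = {s: w for s, w in weight.items() if w > 1e-3}           # prune channels that have withered

print("surviving channels:", sorted(weight))                         # typically the low-distance ones
```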
the modelling of the evolution of an innovation system that we propose also shows that, if an innovation system starts its evolution with a centrally, highly-connected organisation in a largely horizontal network of solver, where the cognitive distance between each pair of nodes is not uniformly distributed, then the system would evolve towards a lower-entropy hierarchical structure, in order to solve the problem of transfer of knowledge from the organisations at the periphery of the innovation system to the central organisation. our finding is consistent with the theory on modularity as an emergent property of complex adaptive innovation systems. subsequent research might apply the mathematical model described in this paper to a longitudinal study of the evolution of real-world innovation networks, in order to test whether the theory related to the spontaneous emergence of a hierarchical structure of innovation networks can be empirically supported. on the theoretical plane, further research could expand the understanding of the evolution of an innovation network by adding considerations related to the role which geographical and organisational proximity have in the development of the network, and add these factors to the model proposed. issues related to perturbation of the network, limit cycle of its evolution, and self-organised criticality in connection to our model may also be explored in subsequent works. sociologie et épistémologie evolution and structure of technological systems -an innovation output network leveraging complexity for ecosystemic innovation what is an innovation ecosystem networks for innovation and problem solving and their use for improving education: a comparative overview value creation from the innovation environment: partnership strategies in university spin-outs structural holes: the social structure of competition knowledge management, intellectual capital, structural holes, economic complexity and national prosperity innovation as a nonlinear process, the scientometric perspective, and the specification of an 'innovation opportunities explorer the role of the organization structure in the diffusion of innovations innovation contests, open innovation, and multiagent problem solving network structure and the diffusion of knowledge the limits to specialization: problem solving and coordination in "modular networks hierarchical organization in complex networks scale-free and hierarchical structures in complex networks what is a complex innovation system? 
cooperation, scale-invariance and complex innovation systems: a generalization growing silicon valley on a landscape: an agent-based approach to high-tech industrial clusters developing the art and science of innovation systems enquiry: alternative tools and methods, and applications to sub-saharan african agriculture prigogine's model for self-organization in nonequilibrium systems the evolution of dissipative social systems the meaning of open systems social systems knowledge, complexity and innovation systems institutional complementarity and diversity of social systems of innovation and production thermodynamic properties in the evolution of firms and innovation systems networks, national innovation systems and self-organisation self-organisation and evolution of biological and social systems functions of innovation systems: a new approach for analysing technological change on the sociology of intellectual stagnation: the late twentieth century in perspective disciplinary knowledge production and diffusion in science a knowledge-based theory of the organisation-the problemsolving perspective thinking: a guide to systems engineering problem-solving social complexity: patterns, processes, and evolution the evolution of economic and innovation systems a complexity-theoretic perspective on innovation policy emergence versus self-organisation: different concepts but promising when combined the evolution of innovation systems understanding evolving universityindustry relationships innovation as co-evolution of scientific and technological networks: exploring tissue engineering from technopoles to regional innovation systems: the evolution of localised technology development policy perspectives on cluster evolution: critical review and future research issues innovation, diversity and diffusion: a selforganisation model social information and self-organisation self-organization, knowledge and responsibility the dynamics of innovation: from national systems and "mode 2" to a triple helix of university-industry-government relations the value and costs of modularity: a problem-solving perspective a dissipative network model with neighboring activation the analysis of dissipative structure in the technological innovation system of enterprises modern thermodynamics: from heat engines to dissipative structures lessons from the nonlinear paradigm: applications of the theory of dissipative structures in the social sciences self-organization and dissipative structures: applications in the physical and social sciences understanding organizational transformation using a dissipative structure model technological paradigms, innovative behavior and the formation of dissipative enterprises a dissipative structure model of organization transformation entropy model of dissipative structure on corporate social responsibility revisiting complexity theory to achieve strategic intelligence defining knowledge management: toward an applied compendium enterprise knowledge capital intellectual capital-defining key performance indicators for organizational knowledge assets the dynamics of knowledge assets and their link with firm performance management mechanisms, technological knowledge assets and firm market performance problems and solutions in knowledge transfer empirical tests of optimal cognitive distance industry cognitive distance in alliances and firm innovation performance the impact of focal firm's centrality and knowledge governance on innovation performance the mathematical theory of communication the 
physics of spreading processes in multilayer networks dissipative control for linear discrete-time systems key: cord-280648-1dpsggwx authors: gillen, david; morrison, william g. title: regulation, competition and network evolution in aviation date: 2005-05-31 journal: journal of air transport management doi: 10.1016/j.jairtraman.2005.03.002 sha: doc_id: 280648 cord_uid: 1dpsggwx abstract our focus is the evolution of business strategies and network structure decisions in the commercial passenger aviation industry. the paper reviews the growth of hub-and-spoke networks as the dominant business model following deregulation in the latter part of the 20th century, followed by the emergence of value-based airlines as a global phenomenon at the end of the century. the paper highlights the link between airline business strategies and network structures, and examines the resulting competition between divergent network structure business models. in this context we discuss issues of market structure stability and the role played by competition policy. taking a snapshot of the north american commercial passenger aviation industry in the spring of 2003, the signals on firm survivability and industry equilibrium are mixed; some firms are under severe stress while others are succeeding in spite of the current environment. 1 in the us, we find united airlines in chapter 11 and us airways emerging from chapter 11 bankruptcy protection. we find american airlines having just reported the largest financial loss in us airline history, while delta and northwest airlines along with smaller carriers like alaska, america west and several regional carriers are restructuring and employing cost reduction strategies. we also find continental airlines surviving after having been in and out of chapter 11 in recent years, while southwest airlines continues to be profitable. in canada, we find air canada in companies creditors arrangement act (cca) bankruptcy protection (the canadian version of chapter 11), after reporting losses of over $500 million for the year 2002 and in march 2003. meanwhile westjet, like southwest continues to show profitability, while two new carriers, jetsgo and canjet (reborn), have entered the market. looking at europe, the picture is much the same, with large full-service airlines (fsas hereafter) such as british airways and lufthansa sustaining losses and suffering financial difficulties, while value-based airlines (vbas) like ryanair and easyjet continue to grow and prosper. until recently, asian air travel markets were performing somewhat better than in north america, however the severe acute respiratory syndrome (sars) epidemic had a severe negative effect on many asian airlines. 2 clearly, the current environment is linked to several independent negative demand shocks that have hit the industry hard. 3 slowdown was already underway in 2001, prior to the 9-11 tragedy, which gave rise to the 'war on terrorism' followed by the recent military action in iraq. finally, the sars virus has not only severely diminished the demand for travel to areas where sars has broken out and led to fatalities, but it has also helped to create yet another reason for travellers to avoid visiting airports or travelling on aircraft, based on a perceived risk of infection. all of these factors have created an environment where limited demand and price competition has favoured the survival of airlines with a low-cost, lowprice focus. 
in this paper we examine the evolution of air transport networks after economic deregulation, and the connection between networks and business strategies, in an environment where regulatory changes continue to change the rules of the game. the deregulation of the us domestic airline industry in 1978 was the precursor of similar moves by most other developed economies in europe (beginning 1992-1997) , canada (beginning in 1984) , australia (1990) and new zealand (1986) . 4 the argument was that the industry was mature and capable of surviving under open market conditions subject to the forces of competition rather than under economic regulation. 5 prior to deregulation in the us, some airlines had already organized themselves into hub-and-spoke net-works. delta airlines, for example, had organized its network into a hub at atlanta with multiple spokes. other carriers had evolved more linear networks with generally full connectivity and were reluctant to shift to hub-and-spoke for two reasons. first, regulations required permission to exit markets and such exit requests would likely lead to another carrier entering to serve 'public need'. secondly, under regulation it was not easy to achieve the demand side benefits associated with networks because of regulatory barriers to entry. in the era of economic regulation the choice of frequency and ancillary service competition were a direct result of being constrained in fare and market entry competition. with deregulation, airlines gained the freedom to adapt their strategies to meet market demand and to reorganize themselves spatially. consequently, huband-spoke became the dominant choice of network structure. the hub-and-spoke network structure was perceived to add value on both the demand and cost side. on the demand side, passengers gained access to broad geographic and service coverage, with the potential for frequent flights to a large number of destinations. 6 large carriers provided lower search and transactions costs for passengers and reduced through lower time costs of connections. they also created travel products with high convenience and service levels-reduced likelihood of lost luggage, in-flight meals and bar service for example. the fsa business model thus favoured high service levels which helped to build the market the market at a time when air travel was an unusual or infrequent activity for many individuals. building the market not only meant encouraging more air travel but also expanding the size of the network which increased connectivity and improved aircraft utilization. on the cost side the industry was shown to have few if any economies of scale, but there were significant economies of density. feeding spokes from smaller centres into a hub airport enabled full service carriers to operate large aircraft between major centres with passenger volumes that lowered costs per available seat. an early exception to the hub-and-spoke network model was southwest airlines. in the us, southwest airlines was the original 'vba' representing a strategy designed to build the market for consumers whose main loyalty is to low-price travel. this proved to be a sustainable business model and southwest's success was to create a blueprint for the creation of other vbas around the world. the evolution has also been assisted by the disappearance of charter airlines with deregulation as fsas served a larger scope of the demand function through their yield management system. 
(footnote continued) of economies from manufacturing to service economies and service industries are more aviation intensive than manufacturing. developed economies as in europe and north america as well as australia and new zealand, have an increasing proportion of gdp provided by service industries particularly tourism. one sector that is highly aviation intensive is the high technology sector. it is footloose and therefore can locate just about anywhere; the primary input is human capital. it can locate assembly in low-cost countries and this was enhanced under new trade liberalization with the wto. 4 canada's deregulation was not formalised under the national transportation act until 1987. australia and new zealand signed an open skies agreement in 2000, which created a single australia-new zealand air market, including the right of cabotage. canada and the us signed an open skies agreement well in 1996 but not nearly so liberal as the australian-new zealand one. 5 in contrast to deregulation within domestic borders, international aviation has been slower to introduce unilateral liberalization. consequently the degree of regulation varies across routes, fares, capacity, entry points (airports) and other aspects of airline operations depending upon the countries involved. the us-uk, german, netherlands and korea bilaterals are quite liberal, for example. in some cases, however, most notably in australasia and europe, there have been regional air trade pacts, which have deregulated markets between and within countries. the open skies agreement between canada and the us is similar to these regional agreements. meanwhile, benefits of operating a large hub-andspoke network in a growing market led to merger waves in the us (mid-1980s) and in canada (late-1980s) and consolidation in other countries of the world. large firms had advantages from the demand side, since they were favoured by many passengers and most importantly by high yield business passengers. they also had advantages from the supply side due to economies of density and economies of stage length. 7 in most countries other than the us there tended to be high industry concentration with one or at most two major carriers. it was also true that in most every country except the us there was a national (or most favoured) carrier that was privatized at the time of deregulation or soon thereafter. in canada in 1995 the open skies agreement with the us was brought in. 8 around this time we a new generation of vbas emerged. in europe, ryanair and easyjet experienced rapid and dramatic growth following deregulation within the eu. some fsas responded by creating their own vbas: british airways created go, klm created buzz and british midland created bmibaby for example. westjet airlines started service in western canada in 1996 serving three destinations and has grown continuously since that time. canadian airlines, faced with increased competition in the west from westjet as well as aggressive competition from air canada on longer haul routes, was in a severe financial by the late 1990s. a bidding war for a merged air canada and canadian was initiated and in 2000, air canada emerged the winner with a 'winners curse', having assumed substantial debt and constraining service and labour agreements. canada now had one fsa and three or four smaller airlines, two of which were vbas. 
in the new millennium, some consolidation has begun to occur amongst vbas in europe with the merger of, easyjet and go in 2002, and the acquisition of buzz by ryanair in 2003. more importantly perhaps, the vba model has emerged as a global phenomenon with vba carriers such as virgin blue in australia, gol in brazil, germania and hapag-lloyd in germany and air asia in malaysia. looking at aviation markets since the turn of the century, casual observation would suggest that a combination of market circumstances created an opportunity for the propagation of the vba business model-with a proven blueprint provided by southwest airlines. however a question remains as to whether something else more fundamental has been going on in the industry to cause the large airlines and potentially larger alliances to falter and fade. if the causal impetus of the current crisis was limited to cyclical macro factors combined with independent demand shocks, then one would expect the institutions that were previously dominant to re-emerge once demand rebounds. if this seems unlikely it is because the underlying market environment has evolved into a new market structure, one in which old business models and practices are no longer viable or desirable. the evolution of business strategies and markets, like biological evolution is subject to the forces of selection. airlines who cannot or do not adapt their business model to long-lasting changes in the environment will disappear, to be replaced by those companies whose strategies better fit the evolved market structure. but to understand the emerging strategic interactions and outcomes of airlines one must appreciate that in this industry, business strategies are necessarily tied to network choices. the organization of production spatially in air transportation networks confers both demand and supply side network economies and the choice of network structure by a carrier necessarily reflects aspects of its business model and will exhibit different revenue and cost drivers. in this section we outline important characteristics of the business strategy and network structures of two competing business models: the full service strategy (utilizing a hub-and-spoke network) and the low cost strategy model which operates under a partial point-to-point network structure. the full service business model is predicated on broad service in product and in geography bringing customers to an array of destinations with flexibility and available capacity to accommodate different routings, no-shows and flight changes. the broad array of destinations and multiple spokes requires a variety of aircraft with differing capacities and performance characteristics. the variety increases capital, labour and operating costs. this business model labours under cost penalties and lower productivity of hub-and-spoke operations including long aircraft turns, connection slack, congestion, and personnel and baggage online connections. these features take time, resources and labour, all of which are expensive and are not easily avoided. the hub-and-spoke system is also conditional on airport and airway infrastructure, information provision through computer reservation and highly sophisticated yield management systems. the network effects that favoured hub and spoke over linear connected networks lie in the compatibility of article in press 7 unit costs decrease as stage length increases but at a diminishing rate. 
8 there was a phase in period for select airports in canada as well as different initial rules for us and canadian carriers. flights and the internalization of pricing externalities between links in the network. a carrier offering flights from city a to city b through city h (a hub) is able to collect traffic from many origins and place them on a large aircraft flying from h to b, thereby achieving density economies. in contrast a carrier flying directly from a to b can achieve some direct density economies but more importantly gains aircraft utilization economies. in the period following deregulation, density economies were larger than aircraft utilization economies on many routes, owing to the limited size of many origin and destination markets. on the demand side, fsas could maximize the revenue of the entire network by internalizing the externalities created by complementarities between links in the network. in our simple example, of a flight from a to c via hub h the carrier has to consider how pricing of the ah link might affect the demand for service on the hb link. if the service were offered by separate companies, the company serving ah will take no consideration of how the fare it charged would influence the demand on the hb link since it has no right to the revenue on that link. the fsa business model thus creates complexity as the network grows, making the system work effectively requires additional features most notably, yield management and product distribution. in the period following deregulation, technological progress provided the means to manage this complexity, with large information systems and in particular computer reservation systems. computer reservation systems make possible sophisticated flight revenue management, the development of loyalty programs, effective product distribution, revenue accounting and load dispatch. they also drive aircraft capacity, frequency and scheduling decisions. as a consequence, the fsa business model places relative importance on managing complex schedules and pricing systems with a focus on profitability of the network as a whole rather than individual links. the fsa business model favours a high level of service and the creation of a large service bundle (inflight entertainment, meals, drinks, large numbers of ticketing counters at the hub, etc.) which serves to maximize the revenue yields from business and longhaul travel. an important part of the business service bundle is the convenience that is created through fully flexible tickets and high flight frequencies. high frequencies can be developed on spoke routes using smaller feed aircraft, and the use of a hub with feed traffic from spokes allows more flights for a given traffic density and cost level. more flights reduce total trip time, with increased flexibility. thus, the hub-and-spoke system leads to the development of feed arrangements along spokes. indeed these domestic feeds contributed to the development of international alliances in which one airline would feed another utilizing the capacity of both to increase service and pricing. like the fsa model, the vba business plan creates a network structure that can promote connectivity but in contrast trades off lower levels of service, measured both in capacity and frequency, against lower fares. in all cases the structure of the network is a key factor in the success of vbas even in the current economic and demand downturn. 
vbas tend to exhibit common product and process design characteristics that enable them to operate at a much lower cost per unit of output. 9 on the demand side, vbas have created a unique value proposition through product and process design that enables them to eliminate, or ''unbundle'' certain service features in exchange for a lower fare. these service feature trade-offs are typically: less frequency, no meals, no free, or any, alcoholic beverages, more passengers per flight attendant, no lounge, no interlining or code-sharing, electronic tickets, no pre-assigned seating, and less leg room. most importantly the vba does not attempt to connect its network although their may be connecting nodes. it also has people use their own time to access or feed the airport. 10 there are several key areas in process design (the way in which the product is delivered to the consumer) for a vba that result in significant savings over a full service carrier. one of the primary forms of process design savings is in the planning of point-to-point city pair flights, focusing on the local origin and destination market rather than developing hub systems. in practice, this means that flights are scheduled without connections and stops in other cities. this could also be considered product design, as the passenger notices the benefit of travelling directly to their desired destination rather than through a hub. rather than having a bank of flights arrive at airports at the same time, low-cost carriers spread out the staffing, ground handling, maintenance, food services, bridge and gate requirements at each airport to achieve savings. another less obvious, but important cost saving can be found in the organization design and culture of the company. it is worth noting at this point that the innovator of product, process, and organizational redesign is generally accepted to be southwest airlines. many low-cost start-ups have attempted to replicate that model as closely as possible; however, the hardest area to replicate has proved to be the organization design and culture. 11 extending the ''look and feel'' to the aircraft, there is a noticeable strategy for low-cost airlines. successful vbas focus on a homogeneous fleet type (mostly the boeing 737 but this is changing; e.g. jet blue with a320 fleet). the advantages of a 'common fleet' are numerous. purchasing power is one-with the obvious exception of the aircraft itself, heavy maintenance, parts, supplies; even safety cards are purchased in one model for the entire fleet. training costs are reduced-with only one type of fleet, not only do employees focus on one aircraft and become specialists, but economies of density can be achieved in training. the choice of airports is typically another source of savings. low-cost carriers tend to focus on secondary airports that have excess capacity and are willing to forego some airside revenues in exchange for non-airside revenues that are developed as a result of the traffic stimulated from low-cost airlines. in simpler terms, secondary airports charge less for landing and terminal fees and make up the difference with commercial activity created by the additional passengers. further, secondary airports are less congested, allowing for faster turn times and more efficient use of staff and the aircraft. 
the average taxi times shown in table 1 (below) are evidence of this with respect to southwest in the us and one only has to consider the significant taxi times at pearson airport in toronto to see why hamilton is such an advantage for westjet. essentially, vbas have attempted to reduce the complexity and resulting cost of the product by unbundling those services that are not absolutely necessary. this unbundling extends to airport facilities as well, as vbas struggle to avoid the costs of expensive primary airport facilities that were designed with full service carriers in mind. while the savings in product design are the most obvious to the passenger, it is the process changes that have produced greater savings for the airline. the design of low-cost carriers facilitates some revenue advantages in addition to the many cost advantages, but it is the cost advantages that far outweigh any revenue benefits achieved. these revenue advantages included simplified fare structures with 3-4 fare levels, a simple 'yield' management system, and the ability to have one-way tickets. the simple fare structure also facilitates internet booking. however, what is clearly evident is the choice of network is not independent of the firm strategy. the linear point-topoint network of vbas allows it to achieve both cost and revenue advantages. table 1 below, compares key elements of operations for us airlines 737 fleets. one can readily see a dramatic cost advantage for southwest airlines compared to fsas. in particular, southwest is a market leader in aircraft utilization and average taxi times. if one looks at the differences in the us between vbas like southwest and fsas, there is a 2:1 cost difference. this difference is similar to what is found in canada between westjet and air canada as well as in europe. these carriers buy the fuel and capital in the same market, and although there may be some difference between carriers due to hedging for example, these are not structural or permanent changes. the vast majority of the cost difference relates to product and process complexity. this complexity is directly tied to the design of their network structure. table 2 compares cost drivers for fsas and vbas in europe. the table shows the key underlying cost drivers and where a vba like ryanair has an advantage over fsas in crew and cabin personnel costs, airport charges and distribution costs. the first two are directly linked to network design. a hub-and-spoke network is service it should also be noted that the vba model is not generic. different low cost carriers do different things and like all businesses we see continual redefinition of the model. intensive and high cost. even distribution cost-savings are related indirectly to network design because vbas have simple products and use passengers' time as an input to reduce airline connect costs. in europe, ryanair has been a leader in the use of the internet for direct sales and 'e-tickets'. in the us southwest airlines was an innovator in ''e-ticketing'', and was also one of the first to initiate bookings on the internet. vbas avoid travel agency commissions and ticket production costs: in canada, westjet has stated that internet booking account for approximately 40% of their sales, while in europe, ryanair claimed an internet sales percentage of 91% in march 2002. 12 while most vbas have adopted direct selling via the internet, the strategy has been hard for fsas to respond to with any speed given their complex pricing systems. 
recent moves by full service carriers in the us and canada to eliminate base commissions should prove to be interesting developments in the distribution chains of all airlines. to some degree, vbas have positioned themselves as market builders by creating point-to-point service in markets where it could not be warranted previously due to lower traffic volumes at higher fsa fares. vbas not only stimulate traffic in the direct market of an airport, but studies have shown that vbas have a much larger potential passenger catchment area than fsas. the catchment area is defined as the geographic region surrounding an airport from which passengers are derived. while an fsa relies on a hub-and-spoke network to create catchment, low-cost carriers create the incentive for each customer to create their own spoke to the point of departure. table 3 provides a summary of the alternative airline strategies pursued in canada, and elsewhere in the world. the trend worldwide thus far indicates two quite divergent business strategies. the entrenched fsa carriers' focuses on developing hub and spoke networks while new entrants seem intent on creating low-cost, point-to-point structures. the hub and spoke system places a very high value on the feed traffic brought to the hub by the spokes, especially the business traffic therein, thereby creating a complex, marketing intense business where revenue is the key and where production costs are high. inventory (of seats) is also kept high in order to meet the service demands of business travellers. the fsa strategy is a high cost strategy because the hub-and-spoke network structure means both reduced productivity for capital (aircraft) and labour (pilots, cabin crew, airport personnel) and increased costs due to self-induced congestion from closely spaced banks of aircraft. 13 the fsa business strategy is sustainable as long as no subgroup of passengers can defect from the coalition of all passenger groups, and recognizing this, competition between fsas included loyalty programs designed to protect each airline's coalition of passenger groupsfrequent travellers in particular. the resulting market structure of competition between fsas was thus a cozy oligopoly in which airlines competed on prices for some economy fares, but practiced complex price discrimination that allowed high yields on business travel. however, the vulnerability of the fsa business model was eventually revealed through the vba strategy which (a) picked and chose only those origin-destination links that were profitable and (b) targeted price sensitive consumers. 14 the potential therefore was not for business travellers to defect from fsas (loyalty programs helped to maintain this segment of demand) but for leisure travellers and other infrequent flyers to be lured away by lower fares (fig. 1) . figs. 2 and 3 present a schemata that help to summarize the contributory factors that propagated the fsa hub-and-spoke system and made it dominant, followed by the growth of the vba strategy along with the events and factors that now threaten the fsa model. in this section we set out a simple framework to explain the evolution of network equilibrium and show westjet estimated that a typical ticket booked through their call centre costs roughly $12, while the same booking through the internet costs around 50 cents. 13 airlines were able to reduce their costs to some degree by purchasing ground services from third parties. unfortunately they could not do this with other processes of the business. 
14 vbas will also not hesitate to exit a market if it is not profitable (e.g. westjet's recent decision to leave sault ste. marie and sudbury) while fsas are reluctant to exit for fear of missing feed traffic and beyond revenue. how it is tied to the business model. the linkage will depend on how the business models differ with respect to the integration of demand conditions, fixed and variable costs and network organization. let three nodes y1, y2, y3, with coordinates (0,0), (0,1) and (1,0), form the corners of an isosceles right triangle. the nodes and the sides of the triangle may thus represent a simple linear travel network, subject to congestion or other factors affecting passenger throughput at airports. this simple network structure allows us to compare three possible structures for the supply of travel services: a complete (fully connected) point-to-point network (all travel constitutes a direct link between two nodes); a hub-and-spoke network (travel between y1 and y3 requires a connection through the hub y2); and a limited (or partial) point-to-point network (selective direct links between nodes). these are illustrated in fig. 3 below. in the network structures featuring point-to-point travel, the utility of consumers who travel depends only on a single measure of the time duration of travel and a single measure of convenience. however, in the hub-and-spoke network, travel between y1 and y3 requires a connection at y2; consequently the time duration of travel depends upon the summed distance d13 = d12 + d23 = 1 + √2. furthermore, in a hub-and-spoke network, there is interdependence between the levels of convenience experienced by travellers: if there are frequent flights between y1 and y2 but infrequent flights between y2 and y3, then travellers will experience delays at y2. there has been an evolving literature on the economics of networks or, more properly, the economics of network configuration. hendricks et al. (1995) show that economies of density can explain the hub-and-spoke system as the optimal system in airline networks. the key to the explanation lies in the level of density economies. however, when comparing it with a point-to-point network, they find the hub-and-spoke network is preferred when marginal costs are high and demand is low, but given some fixed costs and intermediate values of variable costs a point-to-point network may be preferred. shy (2001) shows that profit levels on a fully connected (fc) network are higher than on a hub-and-spoke network when variable flight costs are relatively low and passenger disutility with connections at hubs is high. what had not been explained well until pels et al. (2000) is the relative value of market size in achieving lower costs per available seat mile (asm) versus economies of density. pels et al. (2000) explore the optimality of airline networks using linear marginal cost functions and linear, symmetric demand functions, mc = 1 - bq and p = a - q/2, where b is a returns-to-density parameter and a is a measure of market size. the pels model demonstrates the importance of fixed costs in determining the dominance of one network structure over another in terms of optimal profitability. in particular, the robustness of the hub-and-spoke network configuration claimed by earlier authors (hendricks et al., 1995) comes into question.
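as a point of reference for how these functional forms behave, the following is a minimal worked example of a single-link monopoly problem under the stated assumptions mc = 1 - bq, inverse demand p = a - q/2 and a fixed cost f per link; it is an illustration only, not the pels et al. (2000) derivation itself, which aggregates direct and transfer demand across the whole network.

\pi(q) = \left(a - \tfrac{q}{2}\right) q - \left(q - \tfrac{b}{2} q^{2}\right) - f

\frac{d\pi}{dq} = (a - 1) - (1 - b)q = 0 \;\Rightarrow\; q^{*} = \frac{a-1}{1-b}, \qquad \pi^{*} = \frac{(a-1)^{2}}{2(1-b)} - f

for a > 1 and b < 1, per-link profit rises with market size a and with the density parameter b, and falls one-for-one with the fixed cost f per link; because a fully connected network adds links (and hence f terms) much faster than a hub-and-spoke network as nodes are added, this is the channel through which fixed costs per link drive the comparison developed in the next paragraphs.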
in our three-node network, the pels model generates two direct markets and one transfer market in the hub-and-spoke network, compared with three direct markets in the fully connected network. defining aggregate demand as q = q_d + q_t, optimal profits can be derived for the hub-and-spoke network and for the fc network and, more generally, for a network of size n under each structure. under what conditions would an airline be indifferent between network structures? the market size at which profit-maximizing prices and quantities equate the profits in each network structure can be solved for explicitly; the two possible values of a* implied by (5) represent upper and lower boundaries on the market size for which the hub-and-spoke network and the fully connected network generate the same level of optimal profits. these boundary values are of course conditional on given values of the density economies parameter (b), fixed costs (f), and the size of the network (n). these parameters can provide a partial explanation for the transition from fc to hub-and-spoke network structures after deregulation. with relatively low returns to density and low fixed costs per link, even in a growing market, the hub-and-spoke structure generates inferior profits compared with the fc network, except when the market size (a) is extremely high. however, with high fixed costs per network link, the hub-and-spoke structure begins to dominate at a relatively small market size, and this advantage is amplified as the size of the network grows. importantly, in this model dominance does not mean that the inferior network structure is unprofitable. in (a, b) space, the feasible area (defining profitability) of the fc structure encompasses that of the hub-and-spoke structure. this accommodates the observation that not all airlines adopted the hub-and-spoke network model following deregulation. where the model runs into difficulties is in explaining the emergence of limited point-to-point networks and the vba model. it is the symmetric structure of the model that renders it unable to capture some important elements of the environment in which vbas have been able to thrive. in particular, three elements of asymmetry are missing. first, the model does not allow for asymmetric demand growth between nodes in the network. with market growth, returns to density can increase on a subset of links that would have been feeder spokes in the hub-and-spoke system when the market was less developed. these links may still be infeasible for fsas but become feasible and profitable as independent point-to-point operations, providing an airline has low enough costs. second, the model does not distinguish between market demand segments and therefore cannot capture the gradual commoditization of air travel, as more consumers become frequent flyers. to many consumers today, air travel is no longer an exotic product with an air of mystery and an association with wealth and luxury. there has been an evolution of preferences that reflects the perception that air travel is just another means of getting from a to b. as the perceived nature of the product becomes more commodity-like, consumers become more price sensitive and are willing to trade off elements of service for lower prices. 15 vbas use their low fares to grow the market by competing with other activities. their low cost structure permits such a strategy. fsas cannot do this to any degree because of their choice of bundled product and higher costs.
third, the model does not capture important asymmetries in the costs of fsas and vbas, such that vbas have significantly lower marginal and fixed costs. notice that the dominance of the hub-and-spoke structure over the fc network relies in part on the cost disadvantage of a fixed cost per link, which becomes prohibitive in the fc network as the number of nodes (n) gets large. vbas do not suffer from this disadvantage because they can pick and choose only those nodes that are profitable. furthermore, fsas' variable costs are higher because of the higher fixed costs associated with their choice of hub-and-spoke network. it would seem that with each new economic cycle, the evolution of the airline industry brings about an industry reconfiguration. several researchers have suggested that this is consistent with an industry structure with an 'empty core', meaning non-existence of a natural market equilibrium. button (2003) makes the argument as follows. we know that a structural shift in the composition (i.e., more low-cost airlines) of the industry is occurring and travel substitutes are pushing down fares and traffic. we also observe that heightened security has increased the time and transacting costs of trips and these are driving away business, particularly short haul business trips. as legacy airlines shrink and die away, new airlines emerge and take up the employment and market slack. the notion of the 'empty core' problem in economics is essentially a characterization of markets where too few competitors generate supra-normal profits for incumbents, which then attracts entry. however, entry creates frenzied competition in a war-of-attrition game environment: the additional competition induced by entry results in market and revenue shares that produce losses for all the market participants. consequently, entry and competition lead to exit and a solidification of market shares by the remaining competitors, who then earn supra-normal profits that once again will attract entry. while there is some intuitive appeal to explaining the dynamic nature of the industry resulting from an innate absence of stability in the market structure, there are theoretical problems with this perspective. 16 the fundamental problem with the empty core concept is that its roots lie in models of exogenous market structure that impose (via assumptions) the conditions of the empty core rather than deriving it as the result of decisions made by potential or incumbent market participants. in particular, for the empty core to perpetuate itself, entrants must be either ill-advised or have some unspecified reason for optimism. in contrast, modern to model such a demand system we need a consumer utility function of the form u = u(y, t, v) = gv(y - pt), where y represents dollar income per period and t ∈ [0, 1] represents travel trips per period. v is an index of travel convenience, related to flight frequency, and p is the delivered price of travel. this reduces each consumer's choice problem to consumption of a composite commodity priced at $1, and the possibility of taking at most one trip per period. utility is increasing in v and decreasing in p; thus travellers are willing to trade off convenience for a lower delivered price. diversity in the willingness to trade off convenience for price would be represented by distributions for y, g and v over some range of parameter values.
thus the growth of value-based demand for air travel would be represented by an increase in the density of consumers with relatively low value of these parameters. 16 the empty core theory is often applied to industries that exhibit significant economies of scale, airlines are thought generally to have limited if any scale economies but they do exhibit significant density economies. these density economies are viewed as providing conditions for an empty core. the proponents however only argue on the basis of fsas business model. industrial organization theory in economics is concerned with understanding endogenously determined market structures. in such models, the number of firms and their market conduct emerge as the result of a decisions to enter or exit the market and decisions concerning capacity, quantity and price. part of the general problem of modeling an evolving market structure is to understand that incumbents and potential entrants to the market construct expectations with respect to their respective market shares in any post-entry market. a potential entrant might be attracted by the known or perceived level of profits being earned by the incumbents, but must consider how many new consumers they can attract to their product in addition to the market share that can appropriated from the incumbent firms. this will depend in part upon natural (technological) and strategic barriers to entry, and on the response that can be expected if entry occurs. thus entry only occurs if the expected profits exceed the sunk costs of entry. while natural variation in demand conditions may induce firms to make errors in their predictions, resulting in entry and exit decisions, this is not the same thing as an 'empty core '. 17 in the air travel industry, incumbent firms (especially fsas) spend considerable resources to protect their market shares from internal and external competition. the use of frequent flier points along with marketing and branding serve this purpose. these actions raise the barriers to entry for airlines operating similar business models. what about the threat of entry or the expansion of operations by vbas? could this lead to exit by fsas? there may be legitimate concern from fsas concerning the sustainability of the full-service business model when faced with low-cost competition. in particular, the use of frequency as an attribute of service quality by fsas generates revenues from high-value business travellers, but these revenues only translate into profits when there are enough economy travellers to satisfy load factors. so, to the extent that vbas steal away market share from fsas they put pressure on the viability of this aspect of the fsa business model. the greatest threat to the fsa from a vba is that a lower the fare structure offered to a subset of passengers may induce the fsa to expand the proportion of seats offered to lower fares within the yield management system. this will occur with those vbas like southwest, virgin blue in australia and easyjet that do attempt to attract the business traveller from small and medium size firms. however, carriers like ryanair and westjet have a lower impact on overall fare structure since their frequencies are lower and the fsa can target the vbas flights. 18 while fsas may find themselves engaged in price and/or quality competition, the economics of price competition with differentiated products suggests that such markets can sustain oligopoly structures in which firms earn positive profits. 
this occurs because the prices of competing firms become strategic complements. that is, when one firm increases its price, the profit maximizing response of competitors is to raise price also and there are many dimensions on which airlines can product differentiate within the fsa business model. 19 there is no question fsas have higher seat mile costs than vbas. the problem comes about when fsas view their costs as being predominately fixed and hence marginal costs as being very low. this 'myopic' view ignores the need to cover the long run cost of capital. this in conjunction with the argument that network revenue contribution justifies most all routes, leads to excessive network size and severe price discounting. 20 however, when economies are buoyant, high yield traffic provides sufficient revenues to cover costs and provide substantial profit. in their assessment of the us airline industry, morrison and winston (1995) argue that the vast majority of losses incurred by fsas up to that point were due to their own fare, and fare war, strategies. it must be remembered that fsas co-exist with southwest in large numbers of markets in the us. what response would we expect from an fsa to limited competition from a vba on selected links of its hub-and-spoke network? given the fsa focus on maximization of aggregate network revenues and a cognisance that successful vba entry could steal away their base of economy fare consumers (used to generate the frequencies that provide high yield revenues), one might expect aggressive price competition to either prevent entry or to hasten the exit of a vba rival. this creates a problem for competition bureaus around the world as vbas file an increasing number of predatory pricing charges against fsas. similarly, the ability of this has led some to lobby for renewed government intervention in markets or anti-trust immunity for small numbers of firms. however, if natural variability is a key factor in explaining industry dynamics, there is nothing to suggest that governments have superior information or ability to manipulate the market structure to the public benefit. 18 there are some routes in which westjet does have high frequencies and has significantly impacted mainline carriers. (e.g. calgary-abbotsford) 19 a standard result in the industrial organization literature is that competing firms engaged in price competition will earn positive economic profits when their products are differentiated. 20 the beyond or network revenue argument is used by many fsas to justify not abandoning markets or charging very low prices on some routes. the argument is that if we did not have all the service from a to b we would never receive the revenue from passengers who are travelling from b to c. in reality this is rarely true. when fsas add up the value of each route including its beyond revenue the aggregate far exceeds the total revenue of the company. the result is a failure to abandon uneconomic routes. the three current most profitable airlines among the fsas, qantas, lufthansa and ba, do not use beyond revenue in assessing route profitability. fsas to compete as hub-and-spoke carriers against a competitive threat from vbas is constrained by the rules of the game as defined by competition policy. in canada, air canada faces a charge of predatory pricing for its competition against canjet and westjet in eastern canada. 
in the us, american airlines won its case in a predatory pricing charge brought by three vbas: vanguard airlines, sun jet and western pacific airlines. in germany, both lufthansa and deutsche ba have been charged with predatory pricing. in australia, qantas also faces predatory pricing charges. gillen and morrison (2003) points out three important dimensions of predatory pricing in air travel markets. first, demand complementarities in hub-andspoke networks lead fsas to focus on 'beyond revenues'-the revenue generated by a series of flights in an itinerary rather than the revenues generated by any one leg of the trip. fsas therefore justify aggressive price competition with a vba as a means of using the fare on that link (from an origin node to the hub node for example) as a way of maximizing the beyond revenues created when passengers purchase travel on additional links (from the hub to other nodes in the network). the problem with this argument is that promotional pricing is implicitly a bundling argument, where the airline bundles links in the network to maximize revenue. however when fsas compete fiercely on price against vbas, the price on that link is not limited to those customers who demand beyond travel. therefore, whether or not there is an intent to engage in predatory pricing, the effect is predatory as it deprives the vba of customers who do not demand beyond travel. a second dimension of predatory pricing is vertical product differentiation. fsas competition authorities to support the view that they the right to match prices of a rival vba. however, the bundle of services offered by fsas constitutes a more valuable package. in particular, the provision of frequent flyer programs creates a situation where matching the price of a vba is 'de facto' price undercutting, adjusting for product differentiation. a recent case between the vba germania and lufthansa resulted in the bundeskartellamt (the german competition authority) imposing a price premium restriction on lufthansa that prevented the fsa from matching the vbas prices. a third important dimension of predatory pricing in air travel markets is the ability which fsas have to shift capacity around a hub-and-spoke network, which necessarily requires a mixed fleet with variable seating capacities. in standard limit output models of entry deterrence, an investment in capacity is not a credible threat to of price competition if the entrant conjectures that the incumbent will not use that capacity once entry occurs. such models utilize the notion that a capacity investment is an irreversible commitment and that valuable reputation effects cannot be generated by the incumbent engaging in 'irrational' price competition. however in a hub-and-spoke network, an fsa can make a credible threat to transfer capacity to a particular link in the network in support of aggressive price competition, with the knowledge that the capacity can be redeployed elsewhere in the network when the competitive threat is over. this creates a positive barrier to entry with reputation effects occurring in those instances where entry occurs. such was the case when canjet and westjet met with aggressive price competition from air canada on flights from monkton nb to toronto (air canada and canjet) and hamilton (westjet). the fsa defense against such charges is that aircraft do not constitute an avoidable cost and should not be included in any price-cost test of predation. 
yet while aircraft are not avoidable with respect to the network, they are avoidable to the extent they can be redeployed around the network. if aircraft costs become included in measures of predation under competition laws, this will limit the success of price competition as a competitive response by an fsas responding to vba entry. in the current environment, competition policy rules are not well specified and the uncertainty does nothing to protect competition or to enhance the viability of air travel markets. however there has been increased academic interest in the issue and it seems likely that given the number of cases, some policy changes will be made (e.g., ross and stanbury, 2001) . once again, the way in which fsas have responded to competition from vbas reflects their network model, and competition policy decisions that prevent capacity shifting, price matching and inclusion of 'beyond revenues' will severely constrain the set of strategies an fsa can employ without causing some fundamental changes in the business model and corresponding network structure. 6. so where are we headed? in evolution, the notion of selection dynamics lead us to expect that unsuccessful strategies will be abandoned and successful strategies will be copied or imitated. we have already observed fsas attempts to replicate the vba business model through the creation of fighting brands. air canada created tango, zip, jazz, and jetz. few other carriers worldwide have followed such an extensive re-branding. in europe, british airways created go and klm created buzz, both of which have since been sold and swallowed up by other vbas. qantas has created a low cost long haul carrier-australian airlines. meanwhile, air new zealand, lufthansa, delta and united are moving in the direction of a low-price-low-cost brand. we are also seeing attempts by fsas to simplify their fare structures and exploit the cost savings from direct sales over the internet. thus there do seem to be evolutionary forces that are moving airlines away from the hub-and-spoke network in the direction of providing connections as distinct from true hubbing. american airlines is using a 'rolling hub' concept, which does exactly as its name implies. the purpose is to reduce costs through both fewer factors such as aircraft and labour and to increase productivity. the first step is to 'de-peak' the hub, which means not having banks as tightly integrated. this reduces the amount of own congestion created at hubs by the hubbing carrier and reduces aircraft needed. it also reduces service quality but it has become clear that the traditionally high yield business passenger who valued such time-savings is no longer willing to pay the very high costs that are incurred in producing them. however, as an example, american airlines has reduced daily flights at chicago so with the new schedules it has increased the total elapsed time of flights by an average of 10 min. elapsed time is a competitive issue for airlines as they vie for high-yield passengers who, as a group, have abandoned the airlines and caused revenues to slump. but that 10-min average lengthening of elapsed time appears to be a negative american is willing to accept in exchange for the benefits. at chicago, where the new spread-out schedule was introduced in april, american has been able to operate 330 daily flights with five fewer aircraft and four fewer gates and a manpower reduction of 4-5%. 
21 the change has cleared the way for a smoother flow of aircraft departures and has saved taxi time. 22 it is likely that american will try to keep to the schedule and be disinclined to hold aircraft to accommodate late arriving connection passengers. while this may appear to be a service reduction it in fact may not, since on-time performance has improved. 23 the evolution of networks in today's environment will be based on the choice of business model that airlines make. this is tied to evolving demand conditions, the developing technologies of aircraft and infrastructure and the strategic choices of airlines. as we have seen, the hub-and-spoke system is an endogenous choice for fsa while the linear fc network provides the same scope for vbas. the threat to the hub-and-spoke network is the threat to bundled product of fsas. the hub-and-spoke network will only disappear if the fsa cannot implement a lower cost structure business model and at the same time provide the service and coverage that higher yield passengers demand. the higher yield passengers have not disappeared the market has only become somewhat smaller and certainly more fare sensitive, on average. fsas have responded to vbas by trying to copy elements of their business strategy including reduced inflight service, low cost [fighting] brands, and more pointto-point service. however, the ability of fsa to co-exist with vba and hence hub-and-spoke networks with linear networks is to redesign their products and provide incentives for passengers to allow a reduction in product, process and organizational complexity. this is a difficult challenge since they face complex demands, resulting in the design of a complex product and delivered in a complex network, which is a characteristic of the product. for example, no-shows are a large cost for fsa and they have to design their systems in such a way as to accommodate the no-shows. this includes over-booking and the introduction of demand variability. this uncertain demand arises because airlines have induced it with service to their high-yield passengers. putting in place a set of incentives to reduce noshows would lower costs because the complexity would be reduced or eliminated. one should have complexity only when it adds value. another costly feature of serving business travel is to maintain sufficient inventory of seats in markets to meet the time sensitive demands of business travellers. the hub-and-spoke structure is complex, the business processes are complex and these create costs. a huband-spoke network lowers productivity and increases variable and fixed costs, but these are not characteristics inherent in the hub-and-spoke design. they are inherent in the way fsa use the hub-and-spoke network to deliver and add value to their product. this is because the processes are complex even though the complexity is needed for a smaller, more demanding, higher yield set of customers. the redesigning of business processes moves the fsa between cost functions and not simply down their existing cost function but they will not duplicate the cost advantage of vbas. the network structure drives pricing, fleet and service strategies and the network structure is ultimately conditional on the size and preferences in the market. what of the future and what factors will affect the evolution of network design and scope? airline markets american has also reduced its turn around at spoke cities from 2.5 h previously to approximately 42 min. 
22 as a result of smoother traffic flows, american has been operating at dallas/fort worth international airport with nine fewer mainline aircraft and two fewer regional aircraft. at chicago, the improved efficiency has allowed american to take five aircraft off the schedule, three large jets and two american eagle aircraft. american estimates savings of $100 million a year from reduced costs for fuel, facilities and personnel, part of the $2 billion in permanent costs it has trimmed from its expense sheet. the new flight schedule has brought unexpected cost relief at the hubs but also at the many ''spoke'' cities served from these major airports. aviation week and space technology, september 2, 2002 and february 18, 2003. 23 interestingly, from an airport perspective the passenger may not spend more total elapsed time but simply more time in the terminal and less time in the airplane. this may provide opportunities for non-aviation revenue strategies. airline markets and their networks are continuously evolving. what took place in the us 10 years ago is now occurring in europe. a 'modern' feature of networks is the strategic alliance. alliances between airlines allow them to extend their network and improve their product and service choice, but at a cost. alliances are a feature associated with fsas, not vbas. it may be that as fsas reposition themselves they will make greater use of alliances. vbas, on the other hand, will rely more on interlining to extend their market reach. interlining is made more cost effective with modern technologies, but also with airports having an incentive to offer such services rather than have the airlines provide them. airports as modern businesses will have a more active role in shaping airline networks in the future. the works cited above are button (2003), empty cores in airline markets; gillen and morrison (2003), bundling, integration and the delivered price of air travel: are low-cost carriers full-service competitors?; hendricks et al. (1995), the economics of hubs: the case of monopoly; morrison and winston (1995), the evolution of the airline industry; pels et al. (2000), a note on the optimality of airline networks; ross and stanbury (2001), dealing with predatory conduct in the canadian airline industry: a proposal; and shy (2001), the economics of network industries. the authors gratefully acknowledge financial support for travel to this conference, provided by funds from wilfrid laurier university and the sshrc institutional grant awarded to the university.

key: cord-000196-lkoyrv3s title: dynamics and control of diseases in networks with community structure date: 2010-04-08 journal: plos comput biol doi: 10.1371/journal.pcbi.1000736 sha: doc_id: 196 cord_uid: lkoyrv3s the dynamics of infectious diseases spread via direct person-to-person transmission (such as influenza, smallpox, hiv/aids, etc.) depends on the underlying host contact network. human contact networks exhibit strong community structure. understanding how such community structure affects epidemics may provide insights for preventing the spread of disease between communities by changing the structure of the contact network through pharmaceutical or non-pharmaceutical interventions. we use empirical and simulated networks to investigate the spread of disease in networks with community structure. we find that community structure has a major impact on disease dynamics, and we show that in networks with strong community structure, immunization interventions targeted at individuals bridging communities are more effective than those simply targeting highly connected individuals.
because the structure of relevant contact networks is generally not known, and vaccine supply is often limited, there is great need for efficient vaccination algorithms that do not require full knowledge of the network. we developed an algorithm that acts only on locally available network information and is able to quickly identify targets for successful immunization intervention. the algorithm generally outperforms existing algorithms when vaccine supply is limited, particularly in networks with strong community structure. understanding the spread of infectious diseases and designing optimal control strategies is a major goal of public health. social networks show marked patterns of community structure, and our results, based on empirical and simulated data, demonstrate that community structure strongly affects disease dynamics. these results have implications for the design of control strategies. mitigating or preventing the spread of infectious diseases is the ultimate goal of infectious disease epidemiology, and understanding the dynamics of epidemics is an important tool to achieve this goal. a rich body of research [1, 2, 3] has provided major insights into the processes that drive epidemics, and has been instrumental in developing strategies for control and eradication. the structure of contact networks is crucial in explaining epidemiological patterns seen in the spread of directly transmissible diseases such as hiv/aids [1, 4, 5] , sars [6, 7] , influenza [8, 9, 10, 11] etc. for example, the basic reproductive number r 0 , a quantity central to developing intervention measures or immunization programs, depends crucially on the variance of the distribution of contacts [1, 12, 13] , known as the network degree distribution. contact networks with fat-tailed degree distributions, for example, where a few individuals have an extraordinarily large number of contacts, result in a higher r 0 than one would expect from contact networks with a uniform degree distribution, and the existence of highly connected individuals makes them an ideal target for control measures [7, 14] . while degree distributions have been studied extensively to understand their effect on epidemic dynamics, the community structure of networks has generally been ignored. despite the demonstration that social networks show significant community structure [15, 16, 17, 18] , and that social processes such as homophily and transitivity result in highly clustered and modular networks [19] , the effect of such microstructures on epidemic dynamics has only recently started to be investigated. most initial work has focused on the effect of small cycles, predominantly in the context of clustering coefficients (i.e. the fraction of closed triplets in a contact network) [20, 21, 22, 23, 24] . in this article, we aim to understand how community structure affects epidemic dynamics and control of infectious disease. community structure exists when connections between members of a group of nodes are more dense than connections between members of different groups of nodes [15] . the terminology is relatively new in network analysis and recent algorithm development has greatly expanded our ability to detect sub-structuring in networks. 
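to make the notions of clustering and community detection used here concrete, the following is a small, hedged sketch using the networkx library; the specific detection algorithm (greedy modularity maximization) and the toy graph are illustrative choices and are not the methods or data used in this study.

import networkx as nx
from networkx.algorithms import community

# a toy contact network with visible sub-structure (illustrative only)
g = nx.karate_club_graph()

# clustering coefficient: fraction of closed triplets in the network
print("transitivity (fraction of closed triplets):", round(nx.transitivity(g), 3))

# detect sub-structure by greedy modularity maximization
parts = community.greedy_modularity_communities(g)
print("number of detected communities:", len(parts))
print("modularity q of the detected partition:", round(community.modularity(g, parts), 3))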
while there has been a recent explosion in interest and methodological development, the concept is an old one in the study of social networks where it is typically referred to as a ''cohesive subgroups,'' groups of vertices in a graph that share connections with each other at a higher rate than with vertices outside the group [18] . empirical data on social structure suggests that community structuring is extensive in epidemiological contacts [25, 26, 27] relevant for infectious diseases transmitted by the respiratory or close-contact route (e.g. influenza-like illnesses), and in social groups more generally [16, 17, 28, 29, 30] . similarly, the results of epidemic models of directly transmitted infections such as influenza are most consistent with the existence of such structure [8, 9, 11, 31, 32, 33] . using both simulated and empirical social networks, we show how community structure affects the spread of diseases in networks, and specifically that these effects cannot be accounted for by the degree distribution alone. the main goal of this study is to demonstrate how community structure affects epidemic dynamics, and what strategies are best applied to control epidemics in networks with community structure. we generate networks computationally with community structure by creating small subnetworks of locally dense communities, which are then randomly connected to one another. a particular feature of such networks is that the variance of their degree distribution is relatively low, and thus the spread of a disease is only marginally affected by it [34] . running standard susceptible-infected-resistant (sir) epidemic simulations (see methods) on these networks, we find that the average epidemic size, epidemic duration and the peak prevalence of the epidemic are strongly affected by a change in community structure connectivity that is independent of the overall degree distribution of the full network ( figure 1 ). note that the value range of q shown in figure 1 is in agreement with the value range of q found in the empirical networks used further below, and that lower values of q do not affect the results qualitatively (see suppl. mat. figure s1 ). epidemics in populations with community structure show a distinct dynamical pattern depending on the extent of community structure. in networks with strong community structure, an infected individual is more likely to infect members of the same community than members outside of the community. thus, in a network with strong community structure, local outbreaks may die out before spreading to other communities, or they may spread through various communities in an almost serial fashion, and large epidemics in populations with strong community structure may therefore last for a long time. correspondingly, the incidence rate can be very low, and the number of generations of infection transmission can be very high, compared to the explosive epidemics in populations with less community structure (figures 2a and 2b ). on average, epidemics in networks with strong community structure exhibit greater variance in final size (figures 2c and 2d) , a greater number of small, local outbreaks that do not develop into a full epidemic, and a higher variance in the duration of an epidemic. in order to halt or mitigate an epidemic, targeted immunization interventions or social distancing interventions aim to change the structure of the network of susceptible individuals in such a way as to make it harder for a pathogen to spread [35] . 
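before turning to interventions, a hedged sketch of the kind of network construction described above (dense, locally clustered sub-networks joined by a few random inter-community edges), again using networkx; the community sizes, the within-community edge probability and the number of bridging edges are placeholder values, since the exact generator settings are given in the methods rather than restated here.

import random
import networkx as nx

def modular_network(n_comm=20, comm_size=25, p_in=0.3, n_bridges=60, seed=1):
    rng = random.Random(seed)
    g = nx.Graph()
    comms = []
    for c in range(n_comm):
        nodes = list(range(c * comm_size, (c + 1) * comm_size))
        g.add_nodes_from(nodes)
        comms.append(set(nodes))
        for i, u in enumerate(nodes):           # dense wiring within each community
            for v in nodes[i + 1:]:
                if rng.random() < p_in:
                    g.add_edge(u, v)
    for _ in range(n_bridges):                  # sparse random bridges between communities
        c1, c2 = rng.sample(range(n_comm), 2)
        g.add_edge(rng.choice(list(comms[c1])), rng.choice(list(comms[c2])))
    return g, comms

g, comms = modular_network()
q = nx.algorithms.community.modularity(g, comms)
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges, q =", round(q, 2))

lowering n_bridges (or raising p_in) pushes q up and makes it harder for an outbreak to jump between communities, which is the axis along which the simulations described above vary community structure.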
in practice, the number of people to be removed from the susceptible class is often constrained for a number of reasons (e.g., due to limited vaccine supply or ethical concerns of social distancing measures). from a network perspective, targeted immunization methods translate into identifying which nodes should be removed from a network, a problem that has caught considerable attention (see for example [36] and references therein). targeting highly connected individuals for immunization has been shown to be an effective strategy for epidemic control [7, 14]. however, in networks with strong community structure, this strategy may not be the most effective: some individuals connect to multiple communities (so-called community bridges [37]) and may thus be more important in spreading the disease than individuals with fewer inter-community connections, but this importance is not necessarily reflected in the degree. understanding the spread of infectious diseases in populations is key to controlling them. computational simulations of epidemics provide a valuable tool for the study of the dynamics of epidemics. in such simulations, populations are represented by networks, where hosts and their interactions among each other are represented by nodes and edges. in the past few years, it has become clear that many human social networks have a very remarkable property: they all exhibit strong community structure. a network with strong community structure consists of smaller sub-networks (the communities) that have many connections within them, but only few between them. here we use both data from social networking websites and computer generated networks to study the effect of community structure on epidemic spread. we find that community structure not only affects the dynamics of epidemics in networks, but that it also has implications for how networks can be protected from large-scale epidemics. identification of community bridges can be achieved using
to test the efficiency of targeted immunization strategies on real networks, we used interaction data of individuals at five different universities in the us taken from a social network website [41] , and obtained the contact network relevant for directly transmissible diseases (see methods). we find again that the overall most successful targeted immunization strategy is the one that identifies the targets based on random walk centrality. limited immunization based on random walk centrality significantly outperforms immunization based on degree especially when vaccination coverage is low (figure 5a ). in practice, identifying immunization targets may be impossible using such algorithms, because the structure of the contact network relevant for the spread of a directly transmissible disease is generally not known. thus, algorithms that are agnostic about the full network structure are necessary to identify target individuals. the only algorithm we are aware of that is completely agnostic about the network structure network structure identifies target nodes by picking a random contact of a randomly chosen individual [42] . once such an acquaintance has been picked n times, it is immunized. the acquaintance method has been shown to be able to identify some of the highly connected individuals, and thus approximates an immunization strategy that targets highly connected individuals. we propose an alternative algorithm (the so-called community bridge finder (cbf) algorithm, described in detail in the methods) that aims to identify community bridges connecting two groups of clustered nodes. briefly, starting from a random node, the algorithm follows a random path on the contact network, until it arrives at a node that does not connect back to more than one of the previously visited nodes on the random walk. the basic goal of the cbf algorithm is to find nodes that connect to multiple communities -it does so based on the notion that the first node that does not connect back to previously visited nodes of the current random walk is likely to be part of a different community. on all empirical and computationally generated networks tested, this algorithm performed mostly better, often equally well, and rarely worse than the alternative algorithm. it is important to note a crucial difference between algorithms such as cbf (henceforth called stochastic algorithms) and algorithms such as those that calculate, for example, the betweenness centrality of nodes (henceforth called deterministic algorithms). a deterministic algorithm always needs the complete information about each node (i.e. either the number or the identity of all connected nodes for each node in the network). a comparison between algorithms is therefore of limited use if they are not of the same type as they have to work with different inputs. clearly, a deterministic algorithm with information on the full network structure as input should outperform a stochastic algorithm that is agnostic about the full network structure. thus, we will restrict our comparison of cbf to the acquaintance method since this is the only stochastic algorithm we are aware of the takes as input the same limited amount of local information. in the computationally generated networks, cbf outperformed the acquaintance method in large areas of the parameter space ( figure 4d ). 
it may seem unintuitive at first that the acquaintance method outperforms cbf at very high values of modularity, but one should keep in mind that epidemic sizes are very small in those extremely modular networks (see figure 1a) because local outbreaks only rarely jump the community borders. if outbreaks are mostly restricted to single communities, then cbf is not the optimal strategy because immunizing community bridges is useless; the acquaintance method may at least find some well connected nodes in each community and will thus perform slightly better in this extreme parameter space.

figure 4. assessing the efficacy of targeted immunization strategies based on deterministic and stochastic algorithms in the computationally generated networks. color code denotes the difference in the average final size s_m of disease outbreaks in networks that were immunized before the outbreak using method m. the top panel (a) shows the difference between the degree method and the betweenness centrality method, i.e. s_degree - s_betweenness. a positive difference (colored red to light gray) indicates that the betweenness centrality method resulted in smaller final sizes than the degree method. a negative difference (colored blue to black) indicates that the betweenness centrality method resulted in bigger final sizes than the degree method. if the difference is not bigger than 0.1% of the total population size, then no color is shown (white). panel (a) shows that the betweenness centrality method is more effective than the degree based method in networks with strong community structure (q is high). (b) and (c): like (a), but showing s_degree - s_randomwalk (in (b)) and s_betweenness - s_randomwalk (in (c)). panels (b) and (c) show that the random walk method is the most effective method overall. panel (d) shows that the community bridge finder (cbf) method generally outperforms the acquaintance method (with n = 1) except when community structure is very strong (see main text). final epidemic sizes were obtained by running 2000 sir simulations per network, vaccination coverage and immunization method. doi:10.1371/journal.pcbi.1000736.g004

in empirical networks, cbf did particularly well on the network with the strongest community structure (oklahoma), especially in comparison to the similarly effective acquaintance method with n = 2 (figure 5c). as immunization strategies should be deployed as fast as possible, the speed at which a certain fraction of the network can be immunized is an additional important aspect. we measured the speed of the algorithm as the number of nodes that the algorithm had to visit in order to achieve a certain vaccination coverage, and find that the cbf algorithm is faster than the similarly effective acquaintance method with n = 2 at vaccination coverages <30% (see figure 6).
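a minimal sketch of the community bridge finder walk as it is summarized above (follow a random path until reaching a node that connects back to no more than one of the previously visited nodes); the restart rule, the minimum walk length and the absence of the published algorithm's additional checks are assumptions here, so this should be read as an illustration of the idea rather than the exact procedure given in the methods.

import random

def cbf_target(g, rng, max_steps=10_000):
    # follow a random walk and return the first node that connects back
    # to at most one previously visited node (a candidate community bridge)
    nodes = list(g.nodes())
    walk = [rng.choice(nodes)]
    for _ in range(max_steps):
        frontier = [v for v in g.neighbors(walk[-1]) if v not in walk]
        if not frontier:
            walk = [rng.choice(nodes)]      # dead end: restart the walk
            continue
        candidate = rng.choice(frontier)
        back_links = sum(1 for v in g.neighbors(candidate) if v in walk)
        if len(walk) >= 2 and back_links <= 1:
            return candidate                # linked only via the edge just traversed
        walk.append(candidate)
    return walk[-1]                         # fallback if no bridge-like node was found

def cbf_targets(g, k, seed=1):
    rng = random.Random(seed)
    targets = set()
    while len(targets) < k:
        targets.add(cbf_target(g, rng))
    return list(targets)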
our results extend this finding and show that even in networks with the same degree distribution, fundamentally different epidemic dynamics are expected to be observed due to different levels of community structure. this finding is important for various reasons: first, community structure has been shown to be a crucial feature of social networks [15, 16, 17, 19] , and its effect on disease spread is thus relevant to infectious disease dynamics. furthermore, it corroborates earlier suggestions that community structure affects the spread of disease, and is the first to clearly isolate this effect from effects due to variance in the degree distribution [43] . second, and consequently, data on the degree distribution of contact networks will not be sufficient to predict epidemic dynamics. third, the design of control strategies benefits from taking community structure into account. an important caveat to mention is that community structure in the sense used throughout this paper (i.e. measured as modularity q ) does not take into account explicitly the extent to which communities overlap. such overlap is likely to play an important role in infectious disease dynamics, because people are members of multiple, potentially overlapping communities (households, schools, workplaces etc.). a strong overlap would likely be reflected in lower overall values for q; however, the exact effect of community overlap on infectious disease dynamics remains to be investigated. identifying important nodes to affect diffusion on networks is a key question in network theory that pertains to a wide range of fields and is not limited to infectious disease dynamics only. there are however two major issues associated with this problem: (i) the structure of networks is often not known, and (ii) many networks are too large to compute, for example, centrality measures efficiently. stochastic algorithms like the proposed cbf algorithm or the acquaintance method address both problems at once. to what extent targeted immunization strategies can be implemented in a infectious diseases/public health setting based on practical and ethical considerations remains an open question. this is true not only for the strategy based on the cbf algorithm, but for most strategies that are based on network properties. as mentioned above, the contact networks relevant for the spread of infectious diseases are generally not known. stochastic algorithms such as the cbf or the acquaintance method are at least in principle applicable when data on network structure is lacking. community structure in host networks is not limited to human networks: animal populations are often divided into subpopulations, connected by limited migration only [44, 45] . targeted immunization of individuals connecting subpopulations has been shown to be an effective low-coverage immunization strategy for the conservation of endangered species [46] . under the assumption of homogenous mixing, the elimination of a disease requires an immunization coverage of at least 1-1/r 0 [1] but such coverage is often difficult or even impossible to achieve due to limited vaccine supply, logistical challenges or ethical concerns. in the case of wildlife animals, high vaccination coverage is also problematic as vaccination interventions can be associated with substantial risks. 
little is known about the contact network structure in humans, let alone in wildlife, and progress should therefore be made on the development of immunization strategies that can deal with the absence of such data. stochastic algorithms such as the acquaintance method and the cbf method are first important steps in addressing the problem, but the large difference in efficacy between stochastic and deterministic algorithms demonstrates that there is still a long way to go.

figure 5. assessing the efficacy of targeted immunization strategies in empirical networks based on deterministic and stochastic algorithms. the bars show the difference in the average final size s_m of disease outbreaks (n cases) in networks that were immunized before the outbreak using method m. the left panels show the difference between the degree method and the random walk centrality method, i.e. s_degree - s_randomwalk. if the difference is positive (red bars), then the random walk centrality method resulted in smaller final sizes than the degree method. a negative value (black bars) means that the opposite is true. shaded bars show non-significant differences (assessed at the 5% level using the mann-whitney test). the middle and right panels are generated using the same methodology, but measuring the difference between the acquaintance method (with n = 1 in the middle column and n = 2 in the right column, see methods) and the community bridge finder (cbf) method, i.e. s_acquaintance1 - s_cbf and s_acquaintance2 - s_cbf. again, positive red bars mean that the cbf method results in smaller final sizes, i.e. prevents more cases, than the acquaintance methods, whereas negative black bars mean the opposite. final epidemic sizes were obtained by running 2000 sir simulations per network, vaccination coverage and immunization method. doi:10.1371/journal.pcbi.1000736.g005

to investigate the spread of an infectious disease on a contact network, we use the following methodology: individuals in a population are represented as nodes in a network, and the edges between the nodes represent the contacts along which an infection can spread. contact networks are abstracted by undirected, unweighted graphs (i.e. all contacts are reciprocal, and all contacts transmit an infection with the same probability). edges always link between two distinct nodes (i.e. no self loops), and there must be maximally one edge between any single pair of nodes (i.e. no parallel edges). each node can be in one of three possible states: (s)usceptible, (i)nfected, or (r)esistant/immune (as in standard sir models). initially, all nodes are susceptible. simulations with immunization strategies implement those strategies before the first infection occurs. targeted nodes are chosen according to a given immunization algorithm (see below) until a desired immunization coverage of the population is achieved, and then their state is set to resistant. after this initial set-up, a random susceptible node is chosen as patient zero, and its state is set to infected. then, during a number of time steps, the initial infection can spread through the network, and the simulation is halted once there are no further infected nodes. at each time step (the unit of time we use is one day, i.e. a time step is one day), a susceptible node can get infected with probability 1 - exp(-bi), where b is the transmission rate from an infected to a susceptible node, and i is the number of infected neighboring nodes. at each time step, infected nodes recover at rate c.
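a minimal sketch of the simulation loop just described (one-day time steps, infection of a susceptible node with probability 1 - exp(-b·i), recovery with probability c per day, halting when no infected nodes remain), with b calibrated from the mean degree as described below; the synchronous update order and the default parameter values are assumptions for illustration.

import math
import random

def calibrate_beta(g, r0=3.0, gamma=0.2):
    # choose b such that r0 ≈ (b/c) * d, where d is the mean degree
    d = sum(dict(g.degree()).values()) / g.number_of_nodes()
    return r0 * gamma / d

def sir_outbreak(g, beta, gamma=0.2, immunized=(), seed=1):
    # states: 'S' susceptible, 'I' infected, 'R' resistant/immune
    rng = random.Random(seed)
    immunized = set(immunized)
    state = {v: 'S' for v in g.nodes()}
    for v in immunized:                          # immunization before the outbreak
        state[v] = 'R'
    patient_zero = rng.choice([v for v in g if state[v] == 'S'])
    state[patient_zero] = 'I'
    while any(s == 'I' for s in state.values()):
        new_state = dict(state)
        for v in g:
            if state[v] == 'S':
                i = sum(1 for u in g.neighbors(v) if state[u] == 'I')
                if i and rng.random() < 1.0 - math.exp(-beta * i):
                    new_state[v] = 'I'           # infection by infected neighbours
            elif state[v] == 'I' and rng.random() < gamma:
                new_state[v] = 'R'               # recovery with probability c per day
        state = new_state
    # final epidemic size, excluding the initially immunized nodes
    return sum(1 for s in state.values() if s == 'R') - len(immunized)

with the target lists from the sketches above, comparing strategies then amounts to averaging sir_outbreak over many runs per network, vaccination coverage and immunization method, as in the 2000-run averages reported for the figures.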
the probability of recovery of an infected node per time step is thus c (unless noted otherwise, we use c = 0.2). if recovery occurs, the state of the recovered node is toggled from infected to resistant. unless mentioned otherwise, the transmission rate b is chosen such that r0 ≈ (b/c) * d ≈ 3, where d is the mean network degree, i.e. the average number of contacts per node. for the networks used here, this approximation is in line with the result from static network theory [47], r0 = t(<k^2>/<k> - 1), where <k> and <k^2> are the mean degree and mean square degree, respectively, and where t is the average probability of disease transmission from a node to a neighboring node, i.e. t