AP05_6.vp 1 Introduction Failure recovery is a fundamental task of the dependable systems needed to achieve fault-tolerant communications, smooth operation of system components and a comfortable user interface. Increasingly popular distributed applications providing data sharing, content distribution or stream data delivery services include many different computers, often at distant geographical locations. To communicate between their nodes, these applications build tree-topology overlay structures to connect the nodes and distribute information. The failure recovery schemes for overlay trees use the underlying network to build a completely new tree or to re- store the tree keeping its original structure. While delay-prone creation of a new tree from scratch is usually possible using the same technique as for creation of the original tree, local tree restoration keeping the rest of the tree intact is a rela- tively unexplored area of research. Moreover, the large-scale and dynamic nature of tree-based structures in the rapidly evolving area of overlay communications requires the recov- ery mechanisms to exhibit several key properties such as scalability, independence of location and number of message sources, optional level of fault-tolerance and support for application-specific tree optimization requirements. It is be- coming increasingly apparent that a generic failure recovery platform providing a fragment-location and reconnection framework with these properties would be beneficial for many emerging applications. The problem of the failure recovery of overlay trees con- siders graph S � (�, �) of arbitrary topology, where � is a fi- nite set of vertices representing nodes and � is a finite set of edges representing links between the nodes. Graph S acts as an underlying network for a tree-topology overlay network modeled as graph T � (��, ��), where �� �� represents tree nodes, �� is a finite set of core tree edges representing overlay communication links connecting individual nodes ��. The goal of failure recovery is to protect a given tree net- work T against failure of the faulty cluster �� ��� of one or more adjacent nodes in the tree (see Fig. 1). Its task is to locate the tree fragments caused by the failure to restore the distrib- uted knowledge of the topology and reconnect the tree, omit- ting the failed nodes, to enable communication in the tree to continue. © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 27 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 6/2005 Generic Platform for Failure Recovery in Survivable Trees V. Dynda Failure recovery is a fundamental task of the dependable systems needed to achieve fault-tolerant communications, smooth operation of sys- tem components and a comfortable user interface. Tree topologies are fragile, yet they are quite popular structures in computer systems. The term survivable tree denotes the capability of the tree network to deliver messages even in the presence of failures. In this paper, we analyze the characteristics of large-scale overlay survivable trees and identify the requirements for general-purpose failure recovery mechanisms in such an environment. We outline a generic failure recovery platform for preplanned tree restoration which meets those requirements, and we focus primarily on its completeness and correctness properties. The platform is based on bypass rings and it uses a bypass routing algorithm to ensure completeness, and specialized leader election to guarantee correctness. The platform supports multiple, on-line and on-the-fly recov- ery, provides an optional level of fault-tolerance, protection selectivity and optimization capability. It is independent of the the protected tree type (regarding traffic direction, number of sources, etc.) and forms a basis for application-specific fragment reconnection. Keywords: fault tolerance, failure recovery, tree restoration, distributed algorithms. �� ORIGINAL TREE T FAULTY CLUSTER ��= {n 8 , n 9 , n 10 } TREE FRAGMENTS T i T 1 T 2 T 3 T 4 T 5 T 6 T 7 Tree node Tree edge Faulty cluster �� Failed tree node Tree fragment T i T Lost connection n 8 n9 n 10 n 1 n 2 n 3 n 4 n 5 n 6 n 7 n 8 n 9 n 10 n 1 n 2 n 3 n 4 n 5 n 6 n 7 n 1 n 2 n 3 n 4 n 5 n 6 n 7 Fig. 1: Failure in the tree network and its partitioning into fragments In this paper, we analyze the environment of general over- lay tree networks and identify the requirements for failure recovery in survivable trees ([1]). We outline a failure recovery platform for preplanned tree restoration based on bypass rings ([2], [3]) – cyclical redundant structures to be used in the event of failure to locate and reconnect the tree fragments. The rest of this paper is structured as follows. In the next section, the related work is summarized. Sections 3 and 4 summarize the requirements for generic failure recovery and relate the qualities that the corresponding recovery scheme is expected to possess. Section 5 briefly describes the proposed platform, and sections 6 and 7 deal with two main issues of recovery – completeness and correctness. Section 8 outlines the elementary possibilities of application-specific fragment reconnection based on the platform. Section 9 discusses the achieved results, and section 10 concludes and sets some future directions. 2 Related work Reconstruction of tree-topology graphs without starting from scratch is a relatively new area of research in the field of failure recovery. So far, on-demand restoration schemes building at least the affected subtree anew have been used in a number of applications. Several preplanned special-purpose protocols based on pre-computed backup paths have also been proposed, aiming at some of the mentioned properties while neglecting others. Although there are many possible applications, nearly all of the previous solutions are designed for specific network-layer or overlay multicast schemes. The important property of local recovery is mostly achieved only in single-source multicast trees, and the scalability of many solutions is limited by dependence of the control or memory overhead on the group size kN. There are several basic straightforward preplanned meth- ods for multicast tree recovery (based on the group leave operation in [4]). In the Grandfather method, each node main- tains a backup link to its grandparent in the rooted tree. When a node (except the root) fails, its child nodes contact the grandparent, which either accepts the connection or redirects them down the tree. Subtrees of the affected nodes remain unchanged. In the Root method, the children of the failed node try to recover the tree by connecting directly to the root node, which uses the same strategy as in the Grandfather approach above. In the Grandfather-All and Root-All methods, all descendants of the failed node try to recover by contacting the grandparent or root, respectively, in order to build the whole affected subtree of the failed node anew using the requested optimizations. All the nodes maintain respective knowledge of their ascendants in the multicast tree – the grandparent node in Grandfather, the root node in Root and Root-All, and all ascendants from grandparent of n to root node in the Grand- father-All method. Except for the Grandfather, the methods do not perform local failure recovery, as the affected nodes contact ascending nodes far up the tree. The scalability of these methods is limited either – in the Grandfather-All method, each node maintains a link to the number of ascen- dants proportional to the size of the group. The Root and Root-All methods also load the root node with extensive com- munication proportional to the group size. These two meth- ods also rely on a single root node whose failure breaks down the whole scheme. The Grandfather method is scalable and keeps locality, but it does not cope with multiple adjacent fail- ures in the tree. Moreover, all these methods are designed for single rooted directed multicast trees only. However, these simple methods represent four classes of a number of other approaches based on the same principles. For example, Proactive Reconstruction [5] belongs to the Grandfather-All category, EFTMRP [6] for recovery in net- work-layer CBT multicast uses a principle similar to the Grandfather method, LFR core recovery [7] resembles the Root method. A different approach is chosen in specialized multicast protocols that use administrative control topologies for group management in addition to the data delivery tree. Narada [8] is a protocol designed for small multicast groups, where each node keeps a periodically refreshed state about all other group members and uses this information to locate and re- connect the fragments. Due to the state exchange inducing the control overhead ( )kN 2 , the Narada protocol is effective only when the group is small. However, this is an explicit design choice where the high overhead is traded off for greater robustness in recovery from node failures. Yoid [9] and HMTP [10] are examples of protocols using a dedicated node called the rendezvous point to arbitrate the failures and locate the fragments. In addition, cached links to several periodi- cally discovered member nodes are used. HMTP nodes also maintain an ancestor list similarly to the Grandfather-All method. These methods are capable of recovery from large failed clusters in their trees; however, the rendezvous point may become a bottleneck and a single point of failure. Another solution is used in overlay protocols for media streaming Nemo [11], Nice [12], FatNemo [13] and ZigZag [14]. They all first construct an administrative highly connected hierarchy among the nodes, and the data delivery tree is then built using this structure. The hierarchy is organized into lay- ers divided into clusters, where each cluster has a leader node which then also belongs to a cluster in a higher level. When a node fails, the leader of its highest layer or the leader of the cluster of its children (depending on the protocol) is re- sponsible for finding another node to take over the traffic and reconnect the disconnected subtrees. After the recovery, the administrative hierarchy is reorganized to adapt to the topology changes. The failure recovery of these protocols is efficient in heavy traffic multicasting, where the costs of the highly connected hierarchy are amortized by the huge amount of data. On the other hand, for less loaded trees, the memory and control traffic overhead may be significant. The data delivery trees are source-specific, the control overhead of a node is (log )k Nk , where k is a constant proportional to the size of the administrative clusters. A Dual-tree network-layer protection scheme [15] con- structs a node-disjoint secondary tree connecting tree leaves in addition to the primary delivery tree. After failure, the scheme identifies disconnected subtrees and reconnects them to the rest of the tree using the secondary tree. The construc- tion of redundant trees increasing network-layer multicast reliability is also studied in [16], and it is employed in overlay multicast as well in CoopNet [17]. Link-protection [18] and path-protection [19] for network-layer multicasting in ATM 28 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 6/2005 Czech Technical University in Prague networks propose an individual backup path for each link in the tree and for each source-destination pair, respectively. 3 Main issues of overlay failure recovery From the point of view of the design of failure recovery mechanisms for survivable networks, the characteristics of the environment as well as the properties of the protected tree networks and the group model are essential. Recent trends in a networked computing environment point towards large- -scale unbounded network infrastructures with theoretically unlimited number and size of groups interconnected with overlay trees. To design a practical and efficient failure recovery scheme for these systems, the following characteristics of the comput- ing environment must be taken into account: � Characteristics of a distributed system – asynchronous commu- nication, no global clock and autonomous behavior of the nodes. � No central authority controlling proper functionality of the system; all nodes are peers. � No global knowledge of the number of nodes, their identities and the network topology. � Unrestricted failure pattern; nodes fail arbitrarily at any time. � Unlimited size of the system and the underlying network S. The most significant attributes of the overlay tree struc- tures involve: � Traffic direction. In single-source trees, the message traffic flows in a single direction from the root (source) node to other member nodes. However, in many emergent ap- plications, the message source may change in the runtime or even several nodes become a message source simulta- neously disseminating the traffic to the tree. � Tree adaptation. As individual nodes may arbitrarily join or leave the group, the respective tree is either expanded or shrunk. Moreover, the tree may adapt its topology in order to satisfy potential extern optimization requirements. � No global knowledge and unlimited size of the tree. The size of the group is not limited, and the number and identity of the tree member nodes is not fully known. Each tree mem- ber keeps only the information about its neighbors in the tree for routing and possibly for tree adaptation purposes. � Real-time operation. The tree operates in real time, such that the traffic in the tree cannot be suddenly turned off or switched to an off-line or stalled mode. � Self-containment. Due to the scale of the system, the possible number of groups and the overlay nature of the trees, the information pertaining to a particular tree is required to be kept solely at the tree member nodes, and no other node is capable to hold even auxiliary information concerning this tree. It shapes up that a generic platform for failure recovery available for different applications in this environment would be profitable perhaps as a part of the middleware architec- ture to increase reliability and cut down the costs. The listed attributes represent the restrictive characteristics of the envi- ronment and the protected tree. Of course, not all applica- tions employing tree communication structures employ this kind of trees, and the characteristics are somehow relaxed. However, in many other applications, particularly large-scale data sharing or data storing peer-to-peer systems, the tree networks exhibit all these attributes, which then must be reflected by the respective properties of the platform. 4 Survivable trees A survivable tree is a general tree-topology communica- tion network capable to deliver information to all its correct member nodes even in the presence of failures. Consider T � (��, ��) to be an overlay tree network, �� ��� and �� T to be a connected vertex-induced subgraph of T. Failure of faulty cluster �� causes T to be partitioned into fragments Ti, i � 1, 2, …, N � card ( AT(��)), where card ( AT(��)) is cardinality of a neighbor set of �� in T. If card (��) � 1 then �� is called a single failure; if card (��) >1 then �� is a multiple failure of adjacent nodes in T. A survivable tree T is required to deliver messages even in the event of several single or multiple failures. A failure re- covery is a process of reconnecting fragments Ti in a single restored network T’ � (��\��, ��’), allowing the traffic to continue. We focus on two principal properties that each re- covery mechanism employed in a survivable tree must have – correctness and completeness. Correctness is based on the essen- tial requirement to keep the tree topology of the network even after failure, since the correctness of most applications depends precisely on the acyclic property of the graph. Com- pleteness requires all the fragments of the original tree to be connected in a single restored tree, allowing all correct nodes to participate in T’. The following extra qualities of the recovery scheme are needed to address the characteristics of the large-scale un- bounded environments of the targeted applications and properties of the possible protected trees, and should form the design subject of failure recovery in survivable trees. � Scalability with the size of the protected tree and the under- lying network. � Multiple failure recovery. Capability to recover T from multi- ple failures. � Locality. The recovery affects only the tree nodes in the clos- est neighborhood of the failed nodes, keeping the rest of the tree in its original structure. � On-line recovery. Ability to recovery several simultaneous failures in a single tree. � Computational symmetry. There is no arbiter node, all �� nodes are peers. � On-the-fly recovery. The failure recovery is performed while the traffic in the tree goes on, even through the nodes performing the recovery. � Optional level of fault-tolerance provided by the scheme for the survivable tree, allowing an optimal trade-off between survivability and costs to be found. � Protection selectivity allows the fault-tolerance level to be cho- sen individually for each node. The survivable tree may then provide stronger protection against failure of less reli- able or functionally more important nodes in the tree. © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 29 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 6/2005 � Traffic direction independence. The recovery success depends neither on the traffic direction in the tree nor on the link orientation in multi-source multi-rooted trees. � Optimization capability. The scheme takes into account the application-specific requirements regarding the restored tree. There are only three methods capable of failure recovery in a single multi-source tree – Yoid [9], HMTP [10] and network-layer link-protection [18]. However, link-protection recovers the network from link failures only, and Yoid and HMTP are not fully distributed, as they need the RP node for their operation. Other schemes are designed either for sin- gle-source trees or for single-rooted shared trees, and they usually do not provide optional failure recovery and protec- tion selectivity. Moreover, the overhead of several recovery methods is proportional to the group size, which degrades their scalability. Local recovery is performed only in the Grandfather type of restoration and in implicit multicast schemes (e.g., Nemo [11], Nice [12], ZigZag [14]). 5 Bypass ring platform When designing a failure recovery scheme for a survivable tree, we face two main challenges to be solved while keeping in mind the requirements for failure recovery in survivable trees: � How to locate all the fragments and route messages among them � How to avoid creating cycles during fragment reconnection Our solution is based on bypass rings – virtual cyclic struc- tures appended to the tree and providing alternative paths to eliminate the failed nodes, locate the fragments and reroute the traffic in the tree ([3]). Each bypass ring is identified by its center node and diameter; its edges connect individual tree branches of the center node in a distance proportional to the diameter. Several concentric bypass rings of increasing diameter form a bypass framework. It is the responsibility of the bypass routing algorithm to lo- cate all N � AT(��) nodes and route among them cyclically in a uniform direction and order, regardless of the source and destination of the messages, using edges of bypass frame- works. This cyclical path bypassing cluster �� is referred to as bypass cycle BC(��). The bypass cycle connects all N fragments of the tree so that they can communicate and join together to restore the tree. However, it is not possible to sequentially join all the fragments on BC(��) since a cycle would occur in T’. Instead, a single bypass edge on BC(��) that does not partici- pate in the reconnection is to be identified in a distributed way. This is the task of the leader link election (LLE) process, which is based on comparing the hierarchical identifiers of the fragments ([2]). A hierarchical identifier is a unique com- pound value inferred from the structure of the failed cluster found out by the routing algorithm. The overall operation of the scheme involves several fun- damental steps – scheme initialization, failure detection, designated nodes discovery, leader link election, tree fragment reconnection and bypass ring reconfiguration. In the initialization phase, the by- pass frameworks are set up centered at selected tree nodes against whose failure the tree is to be protected, with the diameter depending on the desired protection level of the particular node. In the event of failure, the failure detecting nodes initiate the recovery immediately and use the bypass routing to discover designated nodes, DN AT( ) ( )�� ��� – the bypass cycle nodes with distinct properties related to the type of protected tree allowing them to coordinate the rest of the recovery process. Bypass cycle edges incident to the desig- nated nodes become candidates for the LLE process. As the leader election proceeds along the bypass cycle, the pairs of fragments are successively joined together, forming greater connected components until all the fragments are recon- nected into the restored tree T’. Various fragment reconnection methods respecting the results of LLE can be designed to consider application-spe- cific constraints and requirements regarding the tree proper- ties (e.g., degree constraints, weight functions, latency or bandwidth limitations, etc.). 6 Bypass routing Bypass routing is one of the key components of the pro- posed scheme, as it ensures its completeness. Clearly, the success of the routing depends on the availability of the respective bypass rings in the event of failure. To achieve uniform direction and fragment order of the routing, the by- pass rings are to be set up systematically in the tree. For this purpose, we introduce the partial order of the tree, which unambiguously specifies the sequence of neighbors SeqT(n) of each tree member node n (e.g., according to their identifiers). 30 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 6/2005 Czech Technical University in Prague BR T (n, 3) (a) R � (n, n k ) R�(n, n k �1) Seq T (n) n n k n k �1 n 0 n deg(n)�1 ... ... .. . .. . (b) BF T (n, 4) Seq T (n) ... ... .. . .. . n n k n k �1 BR T (n, 2) BR T (n, 3) BR T (n, 4) Fig. 2: Bypass ring BRT(n, 3) (a) and framework BFT(n, 4) (b) The partial order defines the arrangement of each bypass ring in the tree. Each bypass ring is referred to as BRT(n, d), where n is its center node and d is its diameter. BRT(n, d) consists of degT(n) bypass edges connecting each pair of tree branches BT(n, nk) and BT(n, nk�1) neighboring in SeqT(n). We define the positive and negative ordered rays, RT �(n, nk) and RT �(n, nk) of each branch BT(n, nk) according to the partial order as its leftmost and rightmost path, provided that T is drawn as a planar graph where SeqT(n) of each node follows a clockwise direc- tion. Each bypass edge of BRT(n, d) connecting BT(n, nk) and BT(n, nk�1) is then initiated on RT �(n, nk) at distance d 2 � �� � � from n and terminated on RT �(n, nk � 1) at distance d 2 � �� � � , as shown in Fig. 2a. The bypass framework is defined as the union of con- centric bypass rings (see Fig. 2b): BF n d BR n dT T d d ( , ) ( , )max max � �2 � . With this arrangement, all the bypass rings keep the same direction, allowing the bypass routing algorithm to route between branches of a given center node using its rings, and to employ rings of lower diameters centered in particular branches to route through those branches, while preserving the direction. Supposing that frameworks BF n dT ( , )max are set up around each node n �� in T, there are dmax bypass edges initiated at each node and terminating at nodes at increas- ing distances (up to dmax) on RT �(n, nk) of each of the node’s branches B n nT k( , ), n A nk T ( ). The routing itself is then based on the fact that each faulty cluster is an intersection of the respective tree branches of the nodes neighboring with the cluster (as illustrated in Fig. 3): �� �� T T i j n A B n n i T � ( , ) ( ) � , where n A nj T i �( ) �� At each node ni AT (��), the routing algorithm system- atically browses B n nT i j( , ) using the bypass edges initiated at ni to find another node neighboring with ��, node ni�1, which is the next on BC(��). The branch lookup is performed se- quentially by checking the nodes on RT �(ni, nj) at an increasing distance until the first non-faulty node, ni�1, is found. The sequence of nodes on RT �(ni, nj) is kept for further use by LLE. This process is shown in Fig. 4a. Fig. 4b demonstrates a practi- cal example of routing in the network from Fig. 1. It can be shown [2] that bypass routing is feasible provided that the length of the positive ordered ray between every two bypass cycle neighbors is less or equal to dmax. The lower bound of the maximum recoverable failure is thus dmax 2 � �� � � nodes in arbitrary clusters and dmax�1 nodes in internal clus- ters of the tree (i.e., the clusters not containing leaves of the protected tree). The memory needed to keep routing infor- mation at a node is equal to (deg ( ) )maxT n d� . © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 31 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 6/2005 �� B T (n i ,n j ) n i n j .. . n i�1 BC (��) B T (n i�1 ,n j�k ) B T (ni,nj) �� BC (� � ) n i n j n j+k n i�1 n i�2 (a) (b) R � (n i, n j) Fig. 3: Relation between nodes n AT ( )�� and faulty cluster �� �� n j n j �1 n j�k n i�1 n j �2 n i R � (ni, nj) BC(��) BR T (n j , 2) BR T (n j , 3) BR T (n, k�1) (a) (b) ��n8 n 9 n 10 n 1 n 2 n 3 n 4 n 5 n 6 n 7 (n 8 ,2) (n 8 ,3) (n 9 ,4 ) (n 10 ,2) (n 1 0 ,2 ) (n 1 0 ,3 ) (n 9 ,2 ) (n 9 ,2 ) (n 9 ,3) (n 8 ,2 ) (n 8 ,2) Fig. 4: Routing from n Ai T ( )�� to ni�1 (a) and routing around ��� � n n n8 9 10, , (b) 7 Leader link election The task of the leader link election algorithm is to identify a leader – a single edge on the bypass cycle that will not partic- ipate in fragment reconnection in order to ensure correctness of the restored tree while keeping it connected. Moreover, to support on-line and on-the-fly recovery, we look for a solution guaranteeing that once a link loses the election, it remains lost and that there is not a state of the algorithm in which there is no leader (i.e. all the fragments are connected). The completeness property is provided by the bypass routing algorithm; the bypass cycle forms a ring-topology communi- cation structure for the election. The election is based on identifiers of the bypass cycle nodes; the basic idea is similar to the Chang-Roberts leader election [20], where the maximum known ID is sent around the cycle by means of ELECTION messages and compared with the ID of each intermediate node. The LLE algorithm exploits the favorable properties of Chang-Roberts on or- dered cycles, where it needs only N � card(BC(��)) messages and only a single ELECTION message to decide whether a given node (the bypass cycle edge incident to it) loses the elec- tion or not (see [3] for further details). Except for single failures �� � { }n f where BC( )�� � � A BR nT T f( ( , )�� )= 2 , the bypass cycle nodes are not or- dered. For this reason, the hierarchical identifiers that uniquely identify each node relative to another node in the tree are used. The leaves of an arbitrary partially ordered rooted tree are ordered in parts according to the hierarchical identifiers based in their closest common parent node. Ap- plying the principle of Chang-Roberts for an ordered ring, more than one leader can be elected using N ELECTION mes- sages. These leaders (except one) are thereafter eliminated by a recursive sweep process considering hierarchical identifiers related to the common root node. Fig. 5 illustrates the princi- ple of eliminating multiple leaders (only the leaf nodes are members of BC( )�� ). The common root node for bypass cycle nodes is a faulty node nr �� with the minimum identifier determined to- gether with the respective hierarchical identifiers step by step by the routing algorithm as it browses the relevant ordered rays in the bypass cycle lookup. In this way, the algorithm utilizes a byproduct of the bypass routing – knowledge of the cluster structure – to achieve ( log )N Nb average message complexity of the election, where b is the average branching factor in ��. Moreover, the algorithm needs only N messages to elect a leader link in an arbitrary bypass cycle in hierarchi- cally ordered trees (e.g., binary search trees) and also in the bypass cycles of single failures in general partially ordered trees. 8 Fragment reconnection The platform constituted by bypass routing and leader election forms a basis for various fragment reconnection methods responsible for joining the fragments into a single connected tree T’ according to application-specific require- ments. Fragment reconnection can be performed together with the LLE process – as the LLE messages travel around the bypass cycle and determine individual nodes not to be lead- ers, the respective fragments can be joined to the rest of the tree so that the data traffic can be transmitted immediately. The most straightforward reconnection approach comes directly from the LLE process. The fragments on BC(��) are sequentially joined, except for the terminal fragments of the leader link. The drawback is the fact that the diameter of the failure recovery area AT T( )�� � is always equal to N � 1, which might be a limitation for some applications because without extern tree balancing, the tree would depreciate to a linear graph after a certain number of recoveries. This is called LR reconnection method. Two different reconnection methods, called TRM and HRM, were proposed in [3]. In the TRM method, all the frag- ments are joined directly to one of the designated nodes, while the HRM method allows the fragments to be joined in longer consecutive sequences. As a generalization of these two approaches, we propose the parameterized HR-x re- connection method, where value x affects the length of the successively joined fragments along BC( )�� and thus it can influence the degree of the nodes and the diameter of AT T( )�� �. The maximum number of new core edges inci- dent to the affected nodes is min , N x N �� �� � �� � � � � � � � � 1 2 1 . The diameter is proportional to x as well. HR-1 thus repre- sents the TRM, and HR-N is equivalent to the LR method. The possibility to influence the properties of the restored tree may help to balance or optimize it. The particular value of x can even be chosen autonomously by each bypass cycle node, so the local requirements may also be supported. We 32 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 6/2005 Czech Technical University in Prague n s ...... SWEEP ON n s SWEPT LEADER ELECTION (a) (b) LEADER n t Seq T (n t ) Seq T (n s ) PARTIAL ORDER Fig. 5: Leader election and the recursive sweep process note that the HR-x method is only an illustration of the reconnection process. More sophisticated systematical recon- nection methods can be applied, provided that they retain correctness and completeness of the recovery. 9 Discussion The characteristics of the computing environment in large-scale distributed systems as well as the general proper- ties of overlay trees in this environment place quite specific requirements on tree restoration. The proposed platform for generic failure recovery is designed to meet the mentioned requirements of survivable trees and to form a basis for appli- cation-specific fragment reconnection. The node memory complexity of the bypass ring recovery scheme is (deg ( ) )maxT n d� and the average message com- plexity of the recovery is ( log )N Nb , where degT(n) is node degree, N is number of tree fragments, b is average branching factor of the failed nodes, and dmax is an optional parame- ter proportional to the provided fault-tolerance. The lower bound of the maximum size of the recoverable failure is dmax 2 � �� � � nodes in arbitrary clusters and dmax�1 nodes in inter- nal clusters of the tree. The lower bound of the maximum diameter of internal faulty clusters is dmax�2. The scheme is scalable with the group size, as its overhead depends solely on N and dmax and performs local recovery since only the nodes closest to the failed cluster (on BC( )�� ) are involved in the recovery. The scheme also supports multi- ple, on-line and on-the-fly recovery – it is capable to recover the tree from several multiple failures with respect to the dmax parameter while communication in the tree continues, and the simultaneous recoveries do not interfere with each other because of the locality property. Protection selectivity and optional fault-tolerance is provided so that a trade-off be- tween survivability and costs can be easily chosen. The scheme is independent of the type of protected tree (regarding traffic direction, number of sources, etc.) and forms a basis for appli- cation-specific fragment reconnection. The simulations and measurements verifying the de- scribed behavior of the proposed bypass ring scheme were performed using the GFS file system ([21], [22]) as a test bed. GFS is a peer-to-peer large-scale file system providing a fault-tolerant and highly available file service. It extensively employs vast tree communication structures for replica-based management of its data, and it is a typical application to uti- lize the bypass ring scheme. GFS has been implemented ([23]) and simulated ([24]) in network simulator ns2. One of the important results is the fact that protection with dmax � 4 already provides ample fault-tolerance, and it is fully sufficient for the GFS application; the probability of employing rings in the recovery dramatically decreases with their diameter. The simulations also show the real scales of the recoverable failures. The average size of the recovered cluster in trees with average branching factor 4 is approx. 1.5 dmax, and the average diameter is approximately dmax�2 for 2 � dmax� 10. This result confirms the possibility to easily choose a trade-off between survivability and costs. 10 Conclusion In this paper, we summarized the general properties of large-scale environments and identified the requirements for generic failure recovery in survivable trees. The main contri- bution of the paper is the outline of a scalable platform for lo- cal failure recovery that meets all the required properties. The platform is based on bypass rings – the redundant cyclic struc- tures introduced in [3] and specified in detail in [2]. The re- covery provided by this platform is generic to the intent that it is independent of specific tree properties and communication pattern, and it enables application-specific tree reconnection to optimize the restored tree according to some extern con- straints and requirements. The performed simulations [24] in the GFS file system confirm the theoretical results. Future research in this area may include specification of particular mechanisms for (autonomous) management of fault-tolerance level and protection selectivity with respect to the state of individual nodes, or a proposal of more sophisti- cated reconnection methods tailored exactly to specific application requirements. Reference [1] Dynda, V.: “A Concept of Survivable Trees and its De- ployment in a Fault-Tolerant Multicast.” In: Workshop 2004, CTU, Prague, 2004, p. 234–235. [2] Dynda, V.: “A Bypass-Ring Scheme for a Fault-Tole- rant Multicast.” Acta Polytechnica, Vol. 43 (2003), No. 2, p. 18–24. [3] Dynda, V.: “A Simple Scheme for Local Failure Recovery of Multi-Directional Multicast Trees.” In: IFIP / IEEE ISCIS 2003, LNCS, Vol. 2869, Springer-Verlag, Ger- many, 2003, p. 66–74. [4] Deshpande, H., Bawa, M., Garcia-Molina, H.: “Stream- ing Live Media over a Peer-to-Peer Network.” Report No. CS-2001-31, CS Dept., Stanford University, 2001. [5] Yang, M., Fei, Z.: “A Proactive Approach to Reconstruct- ing Overlay Multicast Trees.” In: IEEE Infocom ’04, 2004. [6] Jia, W. et. al.: “An Efficient Fault-Tolerant Multicast Routing Protocol with Core-Based Tree Techniques.” IEEE Transactions on Parallel and Distributed Systems, Vol. 10 (1999), p. 984–999. [7] Manimaran, G., Chakrabarti, A.: “A Scalable Approach for Core Failure Recovery in Multicasting.” In: Advanced Computing and Communications, 2000, p. 191–196. [8] Chu, Y. H., Rao, S. G., Seshan, S., Zhang, H.: “Enabling Conferencing Applications on the Internet Using an Overlay Multicast Architecture.” In: ACM SIGCOMM ‘01, 2001. [9] Francis, P.: “Yoid: Extending the Multicast Internet Ar- chitecture.”, 2000 http://www.icir.org/yoid. [10] Zhang, B., Jamin, S., Zhang L.: “Host Multicast: A Framework for Delivering Multicast to End Users.” In: IEEE Infocom ’02, 2002. [11] Birrer, S., Bustamante, F. E.: Nemo: “Resilient Peer-to- -Peer Multicast without the Cost.” Report No. NWU- -CS-04-36, Northwestern University, 2004. © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 33 Czech Technical University in Prague Acta Polytechnica Vol. 45 No. 6/2005 [12] Banerjee, S., Bhattacharjee, B., Kommaredy, C.: “Scal- able Application Layer Multicast.” In: ACM SIGCOMM ’02, 2002. [13] Birrer, S. et. al.: FatNemo: “Building a Resilient Multi-Source Multicast Fat-Tree.” In: WCW ’04, LNCS, Vol. 3293, Springer-Verlag, Germany, 2004, p. 182–196. [14] Tran, D. A., Hua, K. A., Do, T.: “ZigZag: An Efficient Peer-to-Peer Scheme for Media Streaming.” In: IEEE Infocom ’03, Vol. 2, 2003, p. 1283–1292. [15] Fei, A. et. al.: “A Dual-Tree Scheme for Fault-Tolerant Multicast.” In: ICC ’01, Vol. 4, 2001. [16] Medard, M. et. al.: “Redundant Trees for Preplanned Recovery in Arbitrary Vertex-Redundant or Edge-Re- dundant Graphs.” IEEE/ACM Transactions on Networking, Vol. 7 (1999), No. 5, p. 641–652. [17] Padmanabhan, V. N., Wang, H. J., Chou, P. A.: “Resil- ient Peer-to-Peer Streaming.” In: IEEE ICNP ’03, 2003, p. 16–27. [18] Wu, C. et. al.: “A New Preplanned Self-Healing Scheme for Multicast ATM Network.” In: IEEE ICCT ’96, Vol. 2, 1996, p. 888–891. [19] Wu, C., Lee, W., Hou, Y.: “Back-up VP Preplanning Strategies for Survivable Multicast ATM Networks.” In: IEEE ICC ’97, Vol. 1, 1997, p. 267–271. [20] Chang, E. G., Roberts, R.: “An Improved Algorithm for Decentralized Extrema-Finding in Circular Configura- tion of Processors.” Communication of the ACM, Vol. 22 (1979), No. 5, p. 281–283. [21] Dynda, V., Rydlo, P.: ”Large-Scale Distributed File Sys- tem Design and Architecture.” Acta Polytechnica, Vol. 42 (2002), No. 1, p. 6–11. [22] Dynda, V., Rydlo, P.: “Fault-Tolerant Data Management in a Large-Scale File System.” In: IEEE ISADS 2002, Mexico, 2002, p. 219–235. [23] Zradička, L.: “A Distributed File System Model.” Master Thesis, Department of Computer Science and Engineer- ing, CTU Prague, 2003. [24] Řehák, P.: “Fault-Tolerance in a Distributed File Sys- tem.” Master Thesis, Department of Computer Science and Engineering, CTU Prague, 2005. Ing. Vladimír Dynda phone: +420 224 357 616 fax: +420 224 923 325 e-mail: xdynda@sun.felk.cvut.cz Department of Computer Science and Engineering Czech Technical University in Prague Faculty of Electrical Engineering Karlovo náměstí 13 121 35 Praha 2, Czech Republic 34 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 45 No. 6/2005 Czech Technical University in Prague