AP05_6.vp


1 Introduction
Failure recovery is a fundamental task of the dependable

systems needed to achieve fault-tolerant communications,
smooth operation of system components and a comfortable
user interface. Increasingly popular distributed applications
providing data sharing, content distribution or stream data
delivery services include many different computers, often
at distant geographical locations. To communicate between
their nodes, these applications build tree-topology overlay
structures to connect the nodes and distribute information.

The failure recovery schemes for overlay trees use the
underlying network to build a completely new tree or to re-
store the tree keeping its original structure. While delay-prone
creation of a new tree from scratch is usually possible using
the same technique as for creation of the original tree, local
tree restoration keeping the rest of the tree intact is a rela-
tively unexplored area of research. Moreover, the large-scale
and dynamic nature of tree-based structures in the rapidly
evolving area of overlay communications requires the recov-
ery mechanisms to exhibit several key properties such as
scalability, independence of location and number of message

sources, optional level of fault-tolerance and support for
application-specific tree optimization requirements. It is be-
coming increasingly apparent that a generic failure recovery
platform providing a fragment-location and reconnection
framework with these properties would be beneficial for many
emerging applications.

The problem of the failure recovery of overlay trees con-
siders graph S � (�, �) of arbitrary topology, where � is a fi-
nite set of vertices representing nodes and � is a finite set of
edges representing links between the nodes. Graph S acts as
an underlying network for a tree-topology overlay network
modeled as graph T � (��, ��), where �� �� represents
tree nodes, �� is a finite set of core tree edges representing
overlay communication links connecting individual nodes
��. The goal of failure recovery is to protect a given tree net-
work T against failure of the faulty cluster �� ��� of one or
more adjacent nodes in the tree (see Fig. 1). Its task is to locate
the tree fragments caused by the failure to restore the distrib-
uted knowledge of the topology and reconnect the tree, omit-
ting the failed nodes, to enable communication in the tree to
continue.

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 27

Czech Technical University in Prague Acta Polytechnica Vol. 45  No. 6/2005

Generic Platform for Failure Recovery
in Survivable Trees

V. Dynda

Failure recovery is a fundamental task of the dependable systems needed to achieve fault-tolerant communications, smooth operation of sys-
tem components and a comfortable user interface. Tree topologies are fragile, yet they are quite popular structures in computer systems. The
term survivable tree denotes the capability of the tree network to deliver messages even in the presence of failures. In this paper, we analyze
the characteristics of large-scale overlay survivable trees and identify the requirements for general-purpose failure recovery mechanisms in
such an environment. We outline a generic failure recovery platform for preplanned tree restoration which meets those requirements, and we
focus primarily on its completeness and correctness properties. The platform is based on bypass rings and it uses a bypass routing algorithm to
ensure completeness, and specialized leader election to guarantee correctness. The platform supports multiple, on-line and on-the-fly recov-
ery, provides an optional level of fault-tolerance, protection selectivity and optimization capability. It is independent of the the protected tree
type (regarding traffic direction, number of sources, etc.) and forms a basis for application-specific fragment reconnection.

Keywords: fault tolerance, failure recovery, tree restoration, distributed algorithms.

��

ORIGINAL TREE T FAULTY CLUSTER ��= {n
8
, n

9
, n

10
} TREE FRAGMENTS T

i

T
1

T
2

T
3

T
4

T
5

T
6

T
7

Tree node

Tree edge

Faulty cluster ��

Failed tree node

Tree fragment T
i

T

Lost connection

n
8 n9

n
10

n
1

n
2

n
3

n
4

n
5

n
6

n
7

n
8

n
9

n
10

n
1

n
2

n
3

n
4

n
5

n
6

n
7

n
1

n
2

n
3

n
4

n
5

n
6

n
7

Fig. 1: Failure in the tree network and its partitioning into fragments


In this paper, we analyze the environment of general over-
lay tree networks and identify the requirements for failure
recovery in survivable trees ([1]). We outline a failure recovery
platform for preplanned tree restoration based on bypass
rings ([2], [3]) – cyclical redundant structures to be used in the
event of failure to locate and reconnect the tree fragments.
The rest of this paper is structured as follows. In the next
section, the related work is summarized. Sections 3 and 4
summarize the requirements for generic failure recovery and
relate the qualities that the corresponding recovery scheme is
expected to possess. Section 5 briefly describes the proposed
platform, and sections 6 and 7 deal with two main issues of
recovery – completeness and correctness. Section 8 outlines
the elementary possibilities of application-specific fragment
reconnection based on the platform. Section 9 discusses the
achieved results, and section 10 concludes and sets some
future directions.

2 Related work
Reconstruction of tree-topology graphs without starting

from scratch is a relatively new area of research in the field
of failure recovery. So far, on-demand restoration schemes
building at least the affected subtree anew have been used in a
number of applications. Several preplanned special-purpose
protocols based on pre-computed backup paths have also
been proposed, aiming at some of the mentioned properties
while neglecting others. Although there are many possible
applications, nearly all of the previous solutions are designed
for specific network-layer or overlay multicast schemes. The
important property of local recovery is mostly achieved only
in single-source multicast trees, and the scalability of many
solutions is limited by dependence of the control or memory
overhead on the group size kN.

There are several basic straightforward preplanned meth-
ods for multicast tree recovery (based on the group leave
operation in [4]). In the Grandfather method, each node main-
tains a backup link to its grandparent in the rooted tree.
When a node (except the root) fails, its child nodes contact the
grandparent, which either accepts the connection or redirects
them down the tree. Subtrees of the affected nodes remain
unchanged. In the Root method, the children of the failed
node try to recover the tree by connecting directly to the root
node, which uses the same strategy as in the Grandfather
approach above. In the Grandfather-All and Root-All methods,
all descendants of the failed node try to recover by contacting
the grandparent or root, respectively, in order to build the
whole affected subtree of the failed node anew using the
requested optimizations.

All the nodes maintain respective knowledge of their
ascendants in the multicast tree – the grandparent node in
Grandfather, the root node in Root and Root-All, and all
ascendants from grandparent of n to root node in the Grand-
father-All method. Except for the Grandfather, the methods
do not perform local failure recovery, as the affected nodes
contact ascending nodes far up the tree. The scalability of
these methods is limited either – in the Grandfather-All
method, each node maintains a link to the number of ascen-
dants proportional to the size of the group. The Root and
Root-All methods also load the root node with extensive com-
munication proportional to the group size. These two meth-

ods also rely on a single root node whose failure breaks down
the whole scheme. The Grandfather method is scalable and
keeps locality, but it does not cope with multiple adjacent fail-
ures in the tree. Moreover, all these methods are designed for
single rooted directed multicast trees only. However, these
simple methods represent four classes of a number of other
approaches based on the same principles.

For example, Proactive Reconstruction [5] belongs to the
Grandfather-All category, EFTMRP [6] for recovery in net-
work-layer CBT multicast uses a principle similar to the
Grandfather method, LFR core recovery [7] resembles the Root
method.

A different approach is chosen in specialized multicast
protocols that use administrative control topologies for group
management in addition to the data delivery tree. Narada [8]
is a protocol designed for small multicast groups, where each
node keeps a periodically refreshed state about all other
group members and uses this information to locate and re-
connect the fragments. Due to the state exchange inducing
the control overhead 	( )kN

2 , the Narada protocol is effective
only when the group is small. However, this is an explicit
design choice where the high overhead is traded off for
greater robustness in recovery from node failures. Yoid [9] and
HMTP [10] are examples of protocols using a dedicated node
called the rendezvous point to arbitrate the failures and locate
the fragments. In addition, cached links to several periodi-
cally discovered member nodes are used. HMTP nodes also
maintain an ancestor list similarly to the Grandfather-All
method. These methods are capable of recovery from large
failed clusters in their trees; however, the rendezvous point
may become a bottleneck and a single point of failure.

Another solution is used in overlay protocols for media
streaming Nemo [11], Nice [12], FatNemo [13] and ZigZag [14].
They all first construct an administrative highly connected
hierarchy among the nodes, and the data delivery tree is then
built using this structure. The hierarchy is organized into lay-
ers divided into clusters, where each cluster has a leader node
which then also belongs to a cluster in a higher level. When a
node fails, the leader of its highest layer or the leader of
the cluster of its children (depending on the protocol) is re-
sponsible for finding another node to take over the traffic
and reconnect the disconnected subtrees. After the recovery,
the administrative hierarchy is reorganized to adapt to the
topology changes. The failure recovery of these protocols
is efficient in heavy traffic multicasting, where the costs of
the highly connected hierarchy are amortized by the huge
amount of data. On the other hand, for less loaded trees, the
memory and control traffic overhead may be significant. The
data delivery trees are source-specific, the control overhead of
a node is 	(log )k Nk , where k is a constant proportional to the
size of the administrative clusters.

A Dual-tree network-layer protection scheme [15] con-
structs a node-disjoint secondary tree connecting tree leaves
in addition to the primary delivery tree. After failure, the
scheme identifies disconnected subtrees and reconnects them
to the rest of the tree using the secondary tree. The construc-
tion of redundant trees increasing network-layer multicast
reliability is also studied in [16], and it is employed in overlay
multicast as well in CoopNet [17]. Link-protection [18] and
path-protection [19] for network-layer multicasting in ATM

28 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 45  No. 6/2005 Czech Technical University in Prague


networks propose an individual backup path for each link in
the tree and for each source-destination pair, respectively.

3 Main issues of overlay failure
recovery
From the point of view of the design of failure recovery

mechanisms for survivable networks, the characteristics of the
environment as well as the properties of the protected tree
networks and the group model are essential. Recent trends in
a networked computing environment point towards large-
-scale unbounded network infrastructures with theoretically
unlimited number and size of groups interconnected with
overlay trees.

To design a practical and efficient failure recovery scheme
for these systems, the following characteristics of the comput-
ing environment must be taken into account:
� Characteristics of a distributed system – asynchronous commu-

nication, no global clock and autonomous behavior of the
nodes.

� No central authority controlling proper functionality of the
system; all nodes are peers.

� No global knowledge of the number of nodes, their identities
and the network topology.

� Unrestricted failure pattern; nodes fail arbitrarily at any time.
� Unlimited size of the system and the underlying network S.

The most significant attributes of the overlay tree struc-
tures involve:
� Traffic direction. In single-source trees, the message traffic

flows in a single direction from the root (source) node
to other member nodes. However, in many emergent ap-
plications, the message source may change in the runtime
or even several nodes become a message source simulta-
neously disseminating the traffic to the tree.

� Tree adaptation. As individual nodes may arbitrarily join or
leave the group, the respective tree is either expanded or
shrunk. Moreover, the tree may adapt its topology in order
to satisfy potential extern optimization requirements.

� No global knowledge and unlimited size of the tree. The size of
the group is not limited, and the number and identity of
the tree member nodes is not fully known. Each tree mem-
ber keeps only the information about its neighbors in the
tree for routing and possibly for tree adaptation purposes.

� Real-time operation. The tree operates in real time, such that
the traffic in the tree cannot be suddenly turned off or
switched to an off-line or stalled mode.

� Self-containment. Due to the scale of the system, the possible
number of groups and the overlay nature of the trees, the
information pertaining to a particular tree is required to be
kept solely at the tree member nodes, and no other node
is capable to hold even auxiliary information concerning
this tree.

It shapes up that a generic platform for failure recovery
available for different applications in this environment would
be profitable perhaps as a part of the middleware architec-
ture to increase reliability and cut down the costs. The listed
attributes represent the restrictive characteristics of the envi-
ronment and the protected tree. Of course, not all applica-

tions employing tree communication structures employ this
kind of trees, and the characteristics are somehow relaxed.
However, in many other applications, particularly large-scale
data sharing or data storing peer-to-peer systems, the tree
networks exhibit all these attributes, which then must be
reflected by the respective properties of the platform.

4 Survivable trees
A survivable tree is a general tree-topology communica-

tion network capable to deliver information to all its correct
member nodes even in the presence of failures. Consider
T � (��, ��) to be an overlay tree network, �� ��� and
�� T to be a connected vertex-induced subgraph of T. Failure

of faulty cluster �� causes T to be partitioned into fragments

Ti, i � 1, 2, …, N � card ( AT(��)),

where card ( AT(��)) is cardinality of a neighbor set of �� in T.
If card (��) � 1 then �� is called a single failure; if card (��) >1
then �� is a multiple failure of adjacent nodes in T.

A survivable tree T is required to deliver messages even in
the event of several single or multiple failures. A failure re-
covery is a process of reconnecting fragments Ti in a single
restored network T’ � (��\��, ��’), allowing the traffic to
continue. We focus on two principal properties that each re-
covery mechanism employed in a survivable tree must have –
correctness and completeness. Correctness is based on the essen-
tial requirement to keep the tree topology of the network
even after failure, since the correctness of most applications
depends precisely on the acyclic property of the graph. Com-
pleteness requires all the fragments of the original tree to be
connected in a single restored tree, allowing all correct nodes
to participate in T’.

The following extra qualities of the recovery scheme are
needed to address the characteristics of the large-scale un-
bounded environments of the targeted applications and
properties of the possible protected trees, and should form
the design subject of failure recovery in survivable trees.
� Scalability with the size of the protected tree and the under-

lying network.
� Multiple failure recovery. Capability to recover T from multi-

ple failures.
� Locality. The recovery affects only the tree nodes in the clos-

est neighborhood of the failed nodes, keeping the rest of
the tree in its original structure.

� On-line recovery. Ability to recovery several simultaneous
failures in a single tree.

� Computational symmetry. There is no arbiter node, all ��
nodes are peers.

� On-the-fly recovery. The failure recovery is performed while
the traffic in the tree goes on, even through the nodes
performing the recovery.

� Optional level of fault-tolerance provided by the scheme for
the survivable tree, allowing an optimal trade-off between
survivability and costs to be found.

� Protection selectivity allows the fault-tolerance level to be cho-
sen individually for each node. The survivable tree may
then provide stronger protection against failure of less reli-
able or functionally more important nodes in the tree.

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 29

Czech Technical University in Prague Acta Polytechnica Vol. 45  No. 6/2005


� Traffic direction independence. The recovery success depends
neither on the traffic direction in the tree nor on the link
orientation in multi-source multi-rooted trees.

� Optimization capability. The scheme takes into account the
application-specific requirements regarding the restored
tree.

There are only three methods capable of failure recovery
in a single multi-source tree – Yoid [9], HMTP [10] and
network-layer link-protection [18]. However, link-protection
recovers the network from link failures only, and Yoid and
HMTP are not fully distributed, as they need the RP node for
their operation. Other schemes are designed either for sin-
gle-source trees or for single-rooted shared trees, and they
usually do not provide optional failure recovery and protec-
tion selectivity. Moreover, the overhead of several recovery
methods is proportional to the group size, which degrades
their scalability. Local recovery is performed only in the
Grandfather type of restoration and in implicit multicast
schemes (e.g., Nemo [11], Nice [12], ZigZag [14]).

5 Bypass ring platform
When designing a failure recovery scheme for a survivable

tree, we face two main challenges to be solved while keeping
in mind the requirements for failure recovery in survivable
trees:
� How to locate all the fragments and route messages among

them
� How to avoid creating cycles during fragment reconnection

Our solution is based on bypass rings – virtual cyclic struc-
tures appended to the tree and providing alternative paths to
eliminate the failed nodes, locate the fragments and reroute
the traffic in the tree ([3]). Each bypass ring is identified by its
center node and diameter; its edges connect individual tree
branches of the center node in a distance proportional to
the diameter. Several concentric bypass rings of increasing
diameter form a bypass framework.

It is the responsibility of the bypass routing algorithm to lo-
cate all N � AT(��) nodes and route among them cyclically in
a uniform direction and order, regardless of the source and
destination of the messages, using edges of bypass frame-
works. This cyclical path bypassing cluster �� is referred to as
bypass cycle BC(��). The bypass cycle connects all N fragments

of the tree so that they can communicate and join together to
restore the tree. However, it is not possible to sequentially join
all the fragments on BC(��) since a cycle would occur in T’.
Instead, a single bypass edge on BC(��) that does not partici-
pate in the reconnection is to be identified in a distributed
way. This is the task of the leader link election (LLE) process,
which is based on comparing the hierarchical identifiers of the
fragments ([2]). A hierarchical identifier is a unique com-
pound value inferred from the structure of the failed cluster
found out by the routing algorithm.

The overall operation of the scheme involves several fun-
damental steps – scheme initialization, failure detection, designated
nodes discovery, leader link election, tree fragment reconnection and
bypass ring reconfiguration. In the initialization phase, the by-
pass frameworks are set up centered at selected tree nodes
against whose failure the tree is to be protected, with the
diameter depending on the desired protection level of the
particular node. In the event of failure, the failure detecting
nodes initiate the recovery immediately and use the bypass
routing to discover designated nodes, DN AT( ) ( )�� ��� –
the bypass cycle nodes with distinct properties related to the
type of protected tree allowing them to coordinate the rest of
the recovery process. Bypass cycle edges incident to the desig-
nated nodes become candidates for the LLE process. As the
leader election proceeds along the bypass cycle, the pairs of
fragments are successively joined together, forming greater
connected components until all the fragments are recon-
nected into the restored tree T’.

Various fragment reconnection methods respecting the
results of LLE can be designed to consider application-spe-
cific constraints and requirements regarding the tree proper-
ties (e.g., degree constraints, weight functions, latency or
bandwidth limitations, etc.).

6 Bypass routing
Bypass routing is one of the key components of the pro-

posed scheme, as it ensures its completeness. Clearly, the
success of the routing depends on the availability of the
respective bypass rings in the event of failure. To achieve
uniform direction and fragment order of the routing, the by-
pass rings are to be set up systematically in the tree. For this
purpose, we introduce the partial order of the tree, which
unambiguously specifies the sequence of neighbors SeqT(n) of
each tree member node n (e.g., according to their identifiers).

30 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 45  No. 6/2005 Czech Technical University in Prague

BR
T (n, 3)

(a)

R
�
(n, n

k
)

R�(n, n
k �1)

Seq
T
(n)

n

n
k

n
k �1

n
0

n
deg(n)�1

...

...

..
.

..
.

(b)

BF
T

(n, 4)

Seq
T
(n)

...

...

..
.

..
.

n

n
k

n
k �1

BR
T

(n, 2)

BR
T

(n, 3)

BR
T

(n, 4)

Fig. 2: Bypass ring BRT(n, 3) (a) and framework BFT(n, 4) (b)


The partial order defines the arrangement of each bypass
ring in the tree.

Each bypass ring is referred to as BRT(n, d), where n is
its center node and d is its diameter. BRT(n, d) consists of degT(n)
bypass edges connecting each pair of tree branches BT(n, nk)
and BT(n, nk�1) neighboring in SeqT(n). We define the positive
and negative ordered rays, RT

�(n, nk) and RT
�(n, nk) of each

branch BT(n, nk) according to the partial order as its leftmost
and rightmost path, provided that T is drawn as a planar
graph where SeqT(n) of each node follows a clockwise direc-
tion. Each bypass edge of BRT(n, d) connecting BT(n, nk) and

BT(n, nk�1) is then initiated on RT
�(n, nk) at distance

d
2

�
��

�
	�
from n

and terminated on RT
�(n, nk � 1) at distance

d
2

�
��

�
	�
, as shown in

Fig. 2a. The bypass framework is defined as the union of con-
centric bypass rings (see Fig. 2b):

BF n d BR n dT T
d

d

( , ) ( , )max
max

�
�2
� .

With this arrangement, all the bypass rings keep the same
direction, allowing the bypass routing algorithm to route
between branches of a given center node using its rings, and
to employ rings of lower diameters centered in particular
branches to route through those branches, while preserving
the direction.

Supposing that frameworks BF n dT ( , )max are set up
around each node n 
 �� in T, there are dmax bypass edges

initiated at each node and terminating at nodes at increas-
ing distances (up to dmax) on RT

�(n, nk) of each of the node’s
branches B n nT k( , ), n A nk T
 ( ). The routing itself is then
based on the fact that each faulty cluster is an intersection of
the respective tree branches of the nodes neighboring with
the cluster (as illustrated in Fig. 3):

��

��

T T i j
n A

B n n

i T

�


( , )
( )

� , where n A nj T i
 �( ) ��

At each node ni AT
 (��), the routing algorithm system-
atically browses B n nT i j( , ) using the bypass edges initiated at
ni to find another node neighboring with ��, node ni�1, which
is the next on BC(��). The branch lookup is performed se-
quentially by checking the nodes on RT

�(ni, nj) at an increasing
distance until the first non-faulty node, ni�1, is found. The
sequence of nodes on RT

�(ni, nj) is kept for further use by LLE.
This process is shown in Fig. 4a. Fig. 4b demonstrates a practi-
cal example of routing in the network from Fig. 1.

It can be shown [2] that bypass routing is feasible provided
that the length of the positive ordered ray between every two
bypass cycle neighbors is less or equal to dmax. The lower

bound of the maximum recoverable failure is thus
dmax

2
�
��

�
	�

nodes in arbitrary clusters and dmax�1 nodes in internal clus-
ters of the tree (i.e., the clusters not containing leaves of the
protected tree). The memory needed to keep routing infor-
mation at a node is equal to

	(deg ( ) )maxT n d� .

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 31

Czech Technical University in Prague Acta Polytechnica Vol. 45  No. 6/2005

��

B
T (n

i ,n
j )

n
i

n
j

..
.

n
i�1

BC (��)

B
T (n

i�1 ,n
j�k )

B
T (ni,nj)

��

BC
(�

�
)

n
i

n
j

n
j+k

n
i�1

n
i�2

(a) (b)

R
� (n i,

n j)

Fig. 3: Relation between nodes n AT
 ( )�� and faulty cluster ��

��

n
j

n
j �1

n
j�k

n
i�1

n
j �2

n
i

R
� (ni,

nj)

BC(��)

BR
T (n

j , 2)

BR
T (n

j , 3)

BR
T (n, k�1)

(a) (b)

��n8

n
9

n
10

n
1

n
2

n
3

n
4

n
5

n
6

n
7

(n
8 ,2)

(n
8
,3)

(n 9
,4

)
(n

10 ,2)

(n 1
0
,2

)

(n
1

0
,3

)

(n 9
,2

)

(n
9 ,2

)

(n
9 ,3)

(n
8 ,2

)

(n 8
,2)

Fig. 4: Routing from n Ai T
 ( )�� to ni�1 (a) and routing around  ��� � n n n8 9 10, , (b)


7 Leader link election
The task of the leader link election algorithm is to identify

a leader – a single edge on the bypass cycle that will not partic-
ipate in fragment reconnection in order to ensure correctness
of the restored tree while keeping it connected. Moreover, to
support on-line and on-the-fly recovery, we look for a solution
guaranteeing that once a link loses the election, it remains
lost and that there is not a state of the algorithm in which
there is no leader (i.e. all the fragments are connected). The
completeness property is provided by the bypass routing
algorithm; the bypass cycle forms a ring-topology communi-
cation structure for the election.

The election is based on identifiers of the bypass cycle
nodes; the basic idea is similar to the Chang-Roberts leader
election [20], where the maximum known ID is sent around
the cycle by means of ELECTION messages and compared
with the ID of each intermediate node. The LLE algorithm
exploits the favorable properties of Chang-Roberts on or-
dered cycles, where it needs only N � card(BC(��)) messages
and only a single ELECTION message to decide whether a
given node (the bypass cycle edge incident to it) loses the elec-
tion or not (see [3] for further details).

Except for single failures �� � { }n f where BC( )�� �
� A BR nT T f( ( , )�� )= 2 , the bypass cycle nodes are not or-
dered. For this reason, the hierarchical identifiers that
uniquely identify each node relative to another node in the
tree are used. The leaves of an arbitrary partially ordered
rooted tree are ordered in parts according to the hierarchical
identifiers based in their closest common parent node. Ap-
plying the principle of Chang-Roberts for an ordered ring,
more than one leader can be elected using N ELECTION mes-
sages. These leaders (except one) are thereafter eliminated by
a recursive sweep process considering hierarchical identifiers
related to the common root node. Fig. 5 illustrates the princi-
ple of eliminating multiple leaders (only the leaf nodes are
members of BC( )�� ).

The common root node for bypass cycle nodes is a faulty
node nr 
 �� with the minimum identifier determined to-
gether with the respective hierarchical identifiers step by step
by the routing algorithm as it browses the relevant ordered
rays in the bypass cycle lookup. In this way, the algorithm
utilizes a byproduct of the bypass routing – knowledge of the
cluster structure – to achieve 	( log )N Nb average message
complexity of the election, where b is the average branching

factor in ��. Moreover, the algorithm needs only N messages
to elect a leader link in an arbitrary bypass cycle in hierarchi-
cally ordered trees (e.g., binary search trees) and also in the
bypass cycles of single failures in general partially ordered
trees.

8 Fragment reconnection
The platform constituted by bypass routing and leader

election forms a basis for various fragment reconnection
methods responsible for joining the fragments into a single
connected tree T’ according to application-specific require-
ments. Fragment reconnection can be performed together
with the LLE process – as the LLE messages travel around the
bypass cycle and determine individual nodes not to be lead-
ers, the respective fragments can be joined to the rest of the
tree so that the data traffic can be transmitted immediately.

The most straightforward reconnection approach comes
directly from the LLE process. The fragments on BC(��) are
sequentially joined, except for the terminal fragments of the
leader link. The drawback is the fact that the diameter of the
failure recovery area AT T( )�� � is always equal to N � 1, which
might be a limitation for some applications because without
extern tree balancing, the tree would depreciate to a linear
graph after a certain number of recoveries. This is called LR
reconnection method.

Two different reconnection methods, called TRM and
HRM, were proposed in [3]. In the TRM method, all the frag-
ments are joined directly to one of the designated nodes,
while the HRM method allows the fragments to be joined in
longer consecutive sequences. As a generalization of these
two approaches, we propose the parameterized HR-x re-
connection method, where value x affects the length of the
successively joined fragments along BC( )�� and thus it
can influence the degree of the nodes and the diameter of
AT T( )�� �. The maximum number of new core edges inci-

dent to the affected nodes is

min ,
N

x
N

��
��

�
��

� �
�

�
�

�

�
�

1
2 1 .

The diameter is proportional to x as well. HR-1 thus repre-
sents the TRM, and HR-N is equivalent to the LR method.

The possibility to influence the properties of the restored
tree may help to balance or optimize it. The particular value
of x can even be chosen autonomously by each bypass cycle
node, so the local requirements may also be supported. We

32 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 45  No. 6/2005 Czech Technical University in Prague

n
s

......

SWEEP ON n
s

SWEPT LEADER

ELECTION

(a) (b)

LEADER

n
t

Seq
T
(n

t
)

Seq
T
(n

s
)

PARTIAL

ORDER

Fig. 5: Leader election and the recursive sweep process


note that the HR-x method is only an illustration of the
reconnection process. More sophisticated systematical recon-
nection methods can be applied, provided that they retain
correctness and completeness of the recovery.

9 Discussion
The characteristics of the computing environment in

large-scale distributed systems as well as the general proper-
ties of overlay trees in this environment place quite specific
requirements on tree restoration. The proposed platform for
generic failure recovery is designed to meet the mentioned
requirements of survivable trees and to form a basis for appli-
cation-specific fragment reconnection.

The node memory complexity of the bypass ring recovery
scheme is 	(deg ( ) )maxT n d� and the average message com-
plexity of the recovery is 	( log )N Nb , where degT(n) is node
degree, N is number of tree fragments, b is average branching
factor of the failed nodes, and dmax is an optional parame-
ter proportional to the provided fault-tolerance. The lower
bound of the maximum size of the recoverable failure is
dmax

2
�
��

�
	�

nodes in arbitrary clusters and dmax�1 nodes in inter-

nal clusters of the tree. The lower bound of the maximum
diameter of internal faulty clusters is dmax�2.

The scheme is scalable with the group size, as its overhead
depends solely on N and dmax and performs local recovery
since only the nodes closest to the failed cluster (on BC( )�� )
are involved in the recovery. The scheme also supports multi-
ple, on-line and on-the-fly recovery – it is capable to recover
the tree from several multiple failures with respect to the dmax
parameter while communication in the tree continues, and
the simultaneous recoveries do not interfere with each other
because of the locality property. Protection selectivity and
optional fault-tolerance is provided so that a trade-off be-
tween survivability and costs can be easily chosen. The scheme
is independent of the type of protected tree (regarding traffic
direction, number of sources, etc.) and forms a basis for appli-
cation-specific fragment reconnection.

The simulations and measurements verifying the de-
scribed behavior of the proposed bypass ring scheme were
performed using the GFS file system ([21], [22]) as a test bed.
GFS is a peer-to-peer large-scale file system providing a
fault-tolerant and highly available file service. It extensively
employs vast tree communication structures for replica-based
management of its data, and it is a typical application to uti-
lize the bypass ring scheme. GFS has been implemented
([23]) and simulated ([24]) in network simulator ns2.

One of the important results is the fact that protection
with dmax � 4 already provides ample fault-tolerance, and it
is fully sufficient for the GFS application; the probability of
employing rings in the recovery dramatically decreases with
their diameter. The simulations also show the real scales of
the recoverable failures. The average size of the recovered
cluster in trees with average branching factor 4 is approx.
1.5 dmax, and the average diameter is approximately dmax�2
for 2 � dmax� 10. This result confirms the possibility to easily
choose a trade-off between survivability and costs.

10 Conclusion
In this paper, we summarized the general properties of

large-scale environments and identified the requirements for
generic failure recovery in survivable trees. The main contri-
bution of the paper is the outline of a scalable platform for lo-
cal failure recovery that meets all the required properties. The
platform is based on bypass rings – the redundant cyclic struc-
tures introduced in [3] and specified in detail in [2]. The re-
covery provided by this platform is generic to the intent that it
is independent of specific tree properties and communication
pattern, and it enables application-specific tree reconnection
to optimize the restored tree according to some extern con-
straints and requirements. The performed simulations [24] in
the GFS file system confirm the theoretical results.

Future research in this area may include specification of
particular mechanisms for (autonomous) management of
fault-tolerance level and protection selectivity with respect to
the state of individual nodes, or a proposal of more sophisti-
cated reconnection methods tailored exactly to specific
application requirements.

Reference
[1] Dynda, V.: “A Concept of Survivable Trees and its De-

ployment in a Fault-Tolerant Multicast.” In: Workshop
2004, CTU, Prague, 2004, p. 234–235.

[2] Dynda, V.: “A Bypass-Ring Scheme for a Fault-Tole-
rant Multicast.” Acta Polytechnica, Vol. 43 (2003), No. 2,
p. 18–24.

[3] Dynda, V.: “A Simple Scheme for Local Failure Recovery
of Multi-Directional Multicast Trees.” In: IFIP / IEEE
ISCIS 2003, LNCS, Vol. 2869, Springer-Verlag, Ger-
many, 2003, p. 66–74.

[4] Deshpande, H., Bawa, M., Garcia-Molina, H.: “Stream-
ing Live Media over a Peer-to-Peer Network.” Report
No. CS-2001-31, CS Dept., Stanford University, 2001.

[5] Yang, M., Fei, Z.: “A Proactive Approach to Reconstruct-
ing Overlay Multicast Trees.” In: IEEE Infocom ’04,
2004.

[6] Jia, W. et. al.: “An Efficient Fault-Tolerant Multicast
Routing Protocol with Core-Based Tree Techniques.”
IEEE Transactions on Parallel and Distributed Systems,
Vol. 10 (1999), p. 984–999.

[7] Manimaran, G., Chakrabarti, A.: “A Scalable Approach
for Core Failure Recovery in Multicasting.” In: Advanced
Computing and Communications, 2000, p. 191–196.

[8] Chu, Y. H., Rao, S. G., Seshan, S., Zhang, H.: “Enabling
Conferencing Applications on the Internet Using an
Overlay Multicast Architecture.” In: ACM SIGCOMM
‘01, 2001.

[9] Francis, P.: “Yoid: Extending the Multicast Internet Ar-
chitecture.”, 2000 http://www.icir.org/yoid.

[10] Zhang, B., Jamin, S., Zhang L.: “Host Multicast: A
Framework for Delivering Multicast to End Users.” In:
IEEE Infocom ’02, 2002.

[11] Birrer, S., Bustamante, F. E.: Nemo: “Resilient Peer-to-
-Peer Multicast without the Cost.” Report No. NWU-
-CS-04-36, Northwestern University, 2004.

©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 33

Czech Technical University in Prague Acta Polytechnica Vol. 45  No. 6/2005


[12] Banerjee, S., Bhattacharjee, B., Kommaredy, C.: “Scal-
able Application Layer Multicast.” In: ACM SIGCOMM
’02, 2002.

[13] Birrer, S. et. al.: FatNemo: “Building a Resilient
Multi-Source Multicast Fat-Tree.” In: WCW ’04,
LNCS, Vol. 3293, Springer-Verlag, Germany, 2004,
p. 182–196.

[14] Tran, D. A., Hua, K. A., Do, T.: “ZigZag: An Efficient
Peer-to-Peer Scheme for Media Streaming.” In: IEEE
Infocom ’03, Vol. 2, 2003, p. 1283–1292.

[15] Fei, A. et. al.: “A Dual-Tree Scheme for Fault-Tolerant
Multicast.” In: ICC ’01, Vol. 4, 2001.

[16] Medard, M. et. al.: “Redundant Trees for Preplanned
Recovery in Arbitrary Vertex-Redundant or Edge-Re-
dundant Graphs.” IEEE/ACM Transactions on Networking,
Vol. 7 (1999), No. 5, p. 641–652.

[17] Padmanabhan, V. N., Wang, H. J., Chou, P. A.: “Resil-
ient Peer-to-Peer Streaming.” In: IEEE ICNP ’03, 2003,
p. 16–27.

[18] Wu, C. et. al.: “A New Preplanned Self-Healing Scheme
for Multicast ATM Network.” In: IEEE ICCT ’96, Vol. 2,
1996, p. 888–891.

[19] Wu, C., Lee, W., Hou, Y.: “Back-up VP Preplanning
Strategies for Survivable Multicast ATM Networks.” In:
IEEE ICC ’97, Vol. 1, 1997, p. 267–271.

[20] Chang, E. G., Roberts, R.: “An Improved Algorithm for
Decentralized Extrema-Finding in Circular Configura-

tion of Processors.” Communication of the ACM, Vol. 22
(1979), No. 5, p. 281–283.

[21] Dynda, V., Rydlo, P.: ”Large-Scale Distributed File Sys-
tem Design and Architecture.” Acta Polytechnica, Vol. 42
(2002), No. 1, p. 6–11.

[22] Dynda, V., Rydlo, P.: “Fault-Tolerant Data Management
in a Large-Scale File System.” In: IEEE ISADS 2002,
Mexico, 2002, p. 219–235.

[23] Zradička, L.: “A Distributed File System Model.” Master
Thesis, Department of Computer Science and Engineer-
ing, CTU Prague, 2003.

[24] Řehák, P.: “Fault-Tolerance in a Distributed File Sys-
tem.” Master Thesis, Department of Computer Science
and Engineering, CTU Prague, 2005.

Ing. Vladimír Dynda
phone: +420 224 357 616
fax: +420 224 923 325
e-mail: xdynda@sun.felk.cvut.cz

Department of Computer Science and Engineering

Czech Technical University in Prague
Faculty of Electrical Engineering
Karlovo náměstí 13
121 35 Praha 2, Czech Republic

34 ©  Czech Technical University Publishing House http://ctn.cvut.cz/ap/

Acta Polytechnica Vol. 45  No. 6/2005 Czech Technical University in Prague