Large-scale File System Design and Architecture
V. Dynda, P. Rydlo

This paper deals with the design issues of a global file system, aiming to provide transparent data availability, security against loss and disclosure, and support for mobile and disconnected clients. First, the paper surveys general challenges and requirements for large-scale file systems, and then the design of the elementary parts of the proposed file system is presented. This includes the design of the raw system architecture, of dynamic file replication with appropriate data consistency, of file location and of data security. Our proposed system is called Gaston, and will be referred to further in the text under this name or its abbreviation GFS (Gaston File System).

Keywords: file system, replication schema, data protection, consistency, data locating.

1 Introduction

The amount of data stored in digital storage has been increasing at a very rapid rate. Since much of this data is of critical importance to its owners, it is necessary to ensure transparent data availability and to secure the data against loss or disclosure. Until now, data backup has in many cases been performed manually, using various types of backup storage; in extreme cases the data has even been transformed into printed form. Unfortunately, these methods of backup provide neither satisfactory protection against loss nor greater availability. The probability of simultaneous hard drive failure and backup tape loss or damage remains significant for many applications. Moreover, such backup is far from transparent for users, and its cost is also much too high.

The Gaston File System is an experimental storage system that provides users with the standard services of ordinary file systems. Its major advantages are high data availability, support for mobile and disconnected users, and the ability to protect data from loss, damage or disclosure. One of the primary purposes of the system is to perform, under many circumstances, similarly to existing LAN-based networked storage systems. These properties are to be achieved by involving thousands of computers spread across a large geographical area (continents or even the whole world) and connected by a computer network (the Internet). High data availability and protection are achieved by massive replication, and data protection in the generally insecure environment of the global network is achieved by suitable cryptographic mechanisms.

GFS is still under development and is divided into several mutually connected areas: the system architecture, data locating, shared data consistency, replication, caching, data security, authentication and authorization, request distribution, and naming schema and structure. It is necessary to bind these research fields together into a single complex and consistent system specification useful for subsequent implementation and testing.

The essential architecture of the system is based on clients, who deposit their data into the system and manipulate it, replica managers, who manage the stored (and replicated) data, and data servers, which store the data. Each replica manager controls one or possibly more data servers for better performance and fault tolerance. Clients access the data transparently using a specified interface. Since the shared storage space is based on mutual reciprocity, a client, a replica manager and a data server can be installed on one machine, thus providing the system with shared resources in return for using its services.

An important means of protecting data in GFS is massive replication. Data may be replicated anytime and anywhere (even a copy in a client cache is considered to be a replica). To provide maximum performance, data locality is of great importance. This is supported by separating data from its physical location (such data is called nomadic data) and by replicas that can flow freely. The number and location of the replicas can vary in time, based both on client specification and on actual usage patterns and network characteristics. These properties are important for mobile client support, for dealing with temporary network partitioning and especially for minimizing communication costs.
One of the most important tasks in the field of replication is to ensure the consistency of shared files. As the main aim of GFS is to achieve high availability, it is necessary to implement an optimistic replica control protocol that provides data availability even in a partitioned network. The protocol used in GFS is based on version vectors, transaction semantics and automatic version conflict resolution. Conflict resolution is performed taking into account the data type, or using a supplied application-specific merge procedure. In order to ensure consistency it is also necessary to order update requests, which, unfortunately, cannot be done globally over such a high number of replicas due to unacceptable time consumption. Hence, ordering is specified by means of smaller, variable groups of selected primary replica managers, which eventually make the information available to the other replica managers. Because the primary replica manager groups are variable, neither the scalability nor the fault tolerance of the system is negatively affected.

To improve GFS performance and decrease network usage costs, the system caches data. In GFS whole files are cached to support mobile clients, who can be temporarily disconnected and can therefore use the data even in this mode of operation. Since caching data in a distributed environment causes data consistency problems, cached data is treated as ordinary replicas, and the replica management described above is therefore also applied to cached data.

Another requirement on GFS is data security. As the system is open to many different clients, and as it uses the Internet as its communication subsystem, it is exposed to many security threats. The system must protect the data not only when it is being exchanged, but also when it is stored at data servers. For these purposes well-known and widely accepted cryptographic algorithms are used. For change detection of sensitive or non-encrypted data, digital signatures are used. Like almost every distributed system, GFS performs authentication of clients to check the user's identity, which is also based on cryptography. However, data is not protected by cryptographic means alone. Every client who wants to perform any operation on the data must be authorized to perform that operation. For this purpose access control lists are used, as in many other modern operating systems.

Since GFS is intended to provide users with a transparent file system with sharing capability, a global name space is used. This means that every user accesses a given file, including another user's file, under the same name.
Moreover, the names of files are independent of the number of replicas and of their location, so files are presented to the user transparently. Another important benefit is that there is no need to remember file locations, because all replicas of one particular file are equivalent.

The design of the above-mentioned properties of GFS is described below in several chapters dealing with the most important design issues.

2 Requirements for a DFS

The proper design of a distributed file system should reflect several basic characteristics of the system, which arise from the essential features of a distributed environment and from the typical data access pattern in DFSs. These include the following:

• Read/write access. A distributed file system usually stores data accessible in read and write mode (unlike the WWW and other similar distributed services storing just read-only data objects). Furthermore, these data can be shared by more than one user. The system must efficiently support strong consistency models for shared objects that are often updated, which causes higher communication costs. The replica management should minimize these costs.

• Dynamic access pattern. The client usage patterns may change quickly and frequently. The replica management must maintain system stability while responding reasonably to these changes.

• Scalability. The environment of a large-scale distributed file system, as wide as the Internet, contains many individual machines. The protocols and algorithms used must scale well with the number of replicas for each object and with the total number of machines in the system. Any centralized authority or information source would quickly become a performance bottleneck.

• Federal structure. The replica managers of a large-scale distributed file system may reside in disjoint administrative domains, so the system must allow replica managers to make autonomous decisions. No replica manager can force others to perform any activity that is against their interest, and the protocols that are used must take into account the possibility that any party refuses to cooperate.

• Unreliable environment. It is not possible to eliminate all failures of the individual components (communication lines, replica managers) of large-scale systems, so at any moment some replicas may not be available and the network may even become temporarily partitioned. Even in these cases the data must be as available as possible.

• Insecure environment. The replica managers, clients and network infrastructure cannot be trusted. To prevent the replica management system from becoming a security risk, all decisions must consider only authenticated information. The system must also prevent a small number of compromised replica managers from gaining control of the system.

3 System architecture

The Gaston file system consists of three main architectural components: replica managers, clients and data servers.

• Replica managers manage the stored (and replicated) data and run the data consistency and update protocols in order to be able to provide correct data.

• Clients are the basic elements of the system. They deposit and manipulate the data in the system. If a client caches data, it is at the same time a replica manager, since cached data is considered an ordinary replica.

• Data servers physically store the data.

The elementary architecture is presented in Fig. 1.

Fig. 1: Elementary system architecture
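The division of work between these components can be illustrated by the following toy sketch in Python. All class and method names are hypothetical and only show how a request travels from a client through a replica manager to the data servers; they are not part of the GFS specification.

# A toy sketch of the three GFS roles and how a request flows between them.
# All class and method names here are illustrative, not part of the GFS spec.

class DataServer:
    """Physically stores file data."""
    def __init__(self):
        self._store = {}                      # file_id -> bytes

    def put(self, file_id, data):
        self._store[file_id] = data

    def get(self, file_id):
        return self._store[file_id]


class ReplicaManager:
    """Manages (replicated) data held on one or more data servers."""
    def __init__(self, data_servers):
        self.data_servers = data_servers      # several servers improve fault tolerance

    def write(self, file_id, data):
        for ds in self.data_servers:          # in GFS the consistency/update protocol runs here
            ds.put(file_id, data)

    def read(self, file_id):
        return self.data_servers[0].get(file_id)


class Client:
    """Deposits and manipulates data through a specified interface."""
    def __init__(self, replica_manager):
        self.rm = replica_manager

    def store(self, file_id, data):
        self.rm.write(file_id, data)

    def load(self, file_id):
        return self.rm.read(file_id)


# A caching client would additionally keep local copies and act as a replica
# manager for them, since cached data is treated as an ordinary replica.
client = Client(ReplicaManager([DataServer(), DataServer()]))
client.store("report.txt", b"draft")
assert client.load("report.txt") == b"draft"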
The system supports three types of users: ordinary users, dedicated servers and diskless terminals.

• The ordinary users of the system are caching clients (and thus also replica managers), who use the system and in turn provide it with their resources. The data of other users can be replicated at these clients.

• The dedicated servers are machines intended solely to store data. They act as relatively stable replica managers that are supposed to operate continuously.

• The diskless terminals are non-caching clients that manipulate the data and communicate with replica managers. This is the case of small, possibly mobile and temporarily disconnected devices (cellular phones, small handheld computers, etc.).

4 Data consistency

The Gaston file system works under a transaction model of computation, using transactions for data manipulation. The system infers transactions by mapping file operation call sequences to transaction types. The goal of the consistency and update protocols is to ensure one-copy serializability, i.e. data correctness. Since the means for achieving high data availability is massive data replication, it is not possible to assume a permanently connected network of replica managers. In such a large system, network partitioning will surely occur. Thus, an optimistic strategy for partitioned replica control is used. The protocol is based on precedence graphs [1].

A precedence graph models the necessary ordering between transactions and is used to check serializability across partitions. Each partition maintains a log file, from which the read-sets, write-sets and serialization order can be deduced. The sequence of transactions in partition K_i is denoted T_1^i, T_2^i, ..., T_n^i.

The nodes of the precedence graph represent transactions; the edges represent interactions between transactions. The first step in the construction of the graph is to model interactions within each individual partition. Two types of edges are introduced:

1. Data dependency edges: T_j^i → T_k^i, if writeset(T_j^i) ∩ readset(T_k^i) ≠ ∅, j < k.

2. Precedence edges: T_j^i → T_k^i, if readset(T_j^i) ∩ writeset(T_k^i) ≠ ∅, j < k.

Both types of edges indicate that the transaction processing order has influenced the result of the computation. The graph constructed in this way must be acyclic, since the orientation of an edge is always consistent with the serialization order, and within each partition serializability is always provided.

To complete the precedence graph (at reconnection), conflicts between partitions must be represented. The following type of edge is defined for this purpose:

Interference edges: T_j^i → T_k^l, i ≠ l, if readset(T_j^i) ∩ writeset(T_k^l) ≠ ∅.

An interference edge represents the dependency between a transaction T_j^i that read the data in one partition and a transaction T_k^l that wrote the same data in another partition. The edge between T_j^i and T_k^l indicates that T_j^i logically precedes T_k^l, since the value read by T_j^i could not have been influenced by any update transaction in another partition. Thus, an interference edge signals a read/write conflict; a pair of such edges indicates a write/write conflict.

If the resulting precedence graph is acyclic, then the transactions from all partitions can be represented by a single serial history, which can be obtained by topologically sorting the graph.
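The construction of the precedence graph and the serializability check can be sketched as follows. This is a minimal illustration assuming a simplified log format (per partition, an ordered list of transactions with their read- and write-sets); the function names and the use of Kahn's algorithm for the acyclicity test are our own choices for the sketch, not part of the protocol in [1].

# Sketch of the precedence-graph check. Assumed log format:
# partitions = {partition_id: [(txn_id, readset, writeset), ...]} in serialization order.
from itertools import combinations

def build_precedence_graph(partitions):
    edges = set()
    # Data dependency and precedence edges inside each partition (a precedes b).
    for txns in partitions.values():
        for (a, ra, wa), (b, rb, wb) in combinations(txns, 2):
            if wa & rb:                 # data dependency edge
                edges.add((a, b))
            if ra & wb:                 # precedence edge
                edges.add((a, b))
    # Interference edges between different partitions (read in one, written in another).
    for pi, ti in partitions.items():
        for pl, tl in partitions.items():
            if pi == pl:
                continue
            for a, ra, wa in ti:
                for b, rb, wb in tl:
                    if ra & wb:
                        edges.add((a, b))
    return edges

def is_serializable(txn_ids, edges):
    """One-copy serializable iff the precedence graph is acyclic (Kahn's algorithm)."""
    succ = {t: set() for t in txn_ids}
    indeg = {t: 0 for t in txn_ids}
    for u, v in edges:
        if v not in succ[u]:
            succ[u].add(v)
            indeg[v] += 1
    queue = [t for t in txn_ids if indeg[t] == 0]
    order = []
    while queue:
        u = queue.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return len(order) == len(txn_ids), order    # order is a single serial history if acyclic

# Two partitions that both read and updated object "x": the graph has a cycle,
# so one of the conflicting transactions has to be rolled back.
parts = {1: [("T1_1", {"x"}, {"x"})], 2: [("T1_2", {"x"}, {"x"})]}
ids = [t for txns in parts.values() for t, _, _ in txns]
ok, history = is_serializable(ids, build_precedence_graph(parts))
print(ok)       # False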
If the graph contains cycles, the computation is not serializable, and the detected inconsistencies are resolved by rolling back (undoing) transactions and their dependent transactions (those connected by dependency edges) in the reverse order of execution, until the resulting subgraph is acyclic. To merge the partitions, the final value of each data object is then forwarded to the other partitions. If a transaction cannot be rolled back, compensating actions need to be performed to nullify its effect.

In order to cope with the very high number of replicas, a kind of epidemic algorithm is used for update propagation among the replicas, which make tentative commits of a transaction until the global stable commit is announced.

5 Replication

Data replication is an important means of achieving high data availability, fault tolerance and thus better data protection. To provide maximum performance, data locality is of great importance. This is supported by separating data from its physical location (this kind of data is called nomadic data) and by replicas that can flow freely. The number and location of replicas can vary in time, based both on client specification and on actual usage patterns and network characteristics. An enhanced version of the ADR algorithm [2] is used for dynamic replication scheme management.

The ADR algorithm maintains a replication scheme that forms a tree subnetwork. The size and location of this tree change dynamically depending on the locality and frequency of requests. All requests for an object from clients are routed to the closest replica manager in the replication scheme (RS) along the shortest path. The following types of replica managers are defined:

• RS-neighbor: a replica manager that belongs to the replication scheme and has a neighbor that does not belong to RS,

• RS-fringe: a leaf replica manager of the subgraph induced by RS.

The replication scheme is modified through the following tests, executed at the end of each predefined time period t (a code sketch of the three tests is given below):

• The expansion test. The test is executed by all RS-neighbor replica managers. Each RS-neighbor replica manager RM_i compares, for each of its neighbors RM_j ∉ RS, the values of rcnt_r^j(x_i) and h_1 = Σ_{RM_k ∈ A_i ∪ {RM_i}, k ≠ j} rcnt_w^k(x_i), which is the total number of write requests received in the last time period t from RM_i itself or from a neighbor other than RM_j (A_i denotes the set of neighbors of replica manager RM_i). If rcnt_r^j(x_i) > h_1, then replica manager RM_i sends RM_j a copy of the file with an indication to save it. Thus RM_j joins RS and the replication scheme expands. The expansion test is successful when the condition is satisfied for at least one neighbor of RM_i.

• The contraction test. The test is executed by all RS-fringe replica managers. Each RS-fringe replica manager RM_i compares, for its only neighbor RM_j ∈ RS, the values of rcnt_w^j(x_i) and h_2 = Σ_{RM_k ∈ A_i ∪ {RM_i}, k ≠ j} rcnt_r^k(x_i), the total number of read requests received in the last time period t from RM_i itself or from a neighbor other than RM_j. If rcnt_w^j(x_i) > h_2, then RM_i requests permission from RM_j to drop its copy of the file. Thus the replication scheme RS shrinks.

• The switch test. The test is executed by replica manager RM_i if RS = {RM_i} and the expansion test has failed. For each neighbor RM_j it compares the values of rcnt^j(x_i) (the total number of read and write requests from RM_j) and h_3 = Σ_{RM_k ∈ A_i ∪ {RM_i}, k ≠ j} rcnt^k(x_i), the total number of all requests received in the last time period t from all replica managers apart from RM_j. If rcnt^j(x_i) > h_3, then RM_i sends a copy of the file to RM_j with an indication that RM_j becomes the new singleton replica manager in RS, and RM_i discards its copy. Thus the replica migrates towards the center of the request activity.

At the end of each time period t, all RS-neighbor replica managers execute the expansion test and all RS-fringe replica managers execute the contraction test. A replica manager that is both RS-neighbor and RS-fringe first executes the expansion test and, if it fails, executes the contraction test. A replica manager that does not have a neighbor in RS first executes the expansion test and, if it fails, executes the switch test.
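A compact sketch of the three tests for a single replica manager RM_i follows. The request counters are assumed to have been collected during the period t and to be available as dictionaries keyed by the requesting node; all names and the exact interface are illustrative, and the permission handshake of the contraction test is omitted.

# Sketch of the three ADR tests executed by replica manager rm_i at the end of
# a time period t. 'reads' and 'writes' map a requesting node to the number of
# read/write requests for the object that rm_i received from it during t
# (rm_i itself may appear as a key). Names and data layout are illustrative.

def expansion_test(rm_i, neighbours, rs, reads, writes):
    """Offer a copy to every non-RS neighbour j whose reads exceed h_1,
    the writes coming from rm_i itself or from any neighbour other than j."""
    expanded = False
    for j in neighbours:
        if j in rs:
            continue
        h1 = writes.get(rm_i, 0) + sum(writes.get(k, 0) for k in neighbours if k != j)
        if reads.get(j, 0) > h1:
            rs.add(j)                          # j saves a copy; RS expands
            expanded = True
    return expanded

def contraction_test(rm_i, rs_neighbour, other_neighbours, rs, reads, writes):
    """rm_i is an RS fringe; drop its copy (with permission of its only RS
    neighbour) when that neighbour's writes exceed h_2, the local reads."""
    h2 = reads.get(rm_i, 0) + sum(reads.get(k, 0) for k in other_neighbours)
    if writes.get(rs_neighbour, 0) > h2:
        rs.discard(rm_i)                       # RS shrinks
        return True
    return False

def switch_test(rm_i, neighbours, rs, reads, writes):
    """RS == {rm_i} and the expansion test failed: migrate the single replica
    towards a neighbour generating more traffic than everyone else combined."""
    def total(k):
        return reads.get(k, 0) + writes.get(k, 0)
    for j in neighbours:
        h3 = total(rm_i) + sum(total(k) for k in neighbours if k != j)
        if total(j) > h3:
            rs.discard(rm_i)
            rs.add(j)                          # j becomes the new singleton replica
            return True
    return False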
An example of the dynamic replication scheme is presented in Fig. 2. The initial replication scheme is RS(t_0) = {RM_4}. The requests issued in each time period t are: RM_1 to RM_6 and RM_8: 4 read requests and 2 write requests each; RM_7: 20 read requests and 12 write requests.

Fig. 2: Example of a dynamic replication scheme

t_1: RM_4 (as an RS-neighbor of RM_3 and RM_5) executes the expansion test. Since the number of reads requested by RM_3 is 12 and the number of writes requested by RM_4 and RM_5 is 20, replica manager RM_3 does not enter the replication scheme. The number of reads requested by RM_5 is 32 and the number of writes requested by RM_4 and RM_3 is 8. Thus, RS(t_1) = {RM_4, RM_5}.

t_2: First, RM_4 performs the expansion test, which fails, and then it executes the contraction test. It is successful, since RM_4 receives 18 write requests from RM_5 and 16 read requests from RM_4 and RM_3. At the same time RM_5 performs the expansion test and successfully includes RM_7 in the replication scheme (the number of reads from RM_7 is 20 and the number of write requests from the other replica managers is 14). The resulting RS(t_2) = {RM_5, RM_7}.

t_3: RS stabilizes at RS(t_3) = {RM_5, RM_7} and will not change further.

Extension of the ADR algorithm to an arbitrary network topology is relatively straightforward using the network spanning tree. The modification that will be used in the Gaston system implementation concerns the connectivity of the replication scheme, and is able to create pseudoreplicas to address the problem of extremely distant clients.

6 File addressing

To locate data in the large system we use two distinct algorithms. The first is a probabilistic algorithm that uses modified Bloom filters. It can be characterized as a fast algorithm that adheres to the principle of locality. Every node in the network keeps a filter (a union) of the objects that are accessible at this node, and every edge also keeps filters of the objects that are accessible over this edge. These edge filters describe the objects accessible at different distances (a filter for distance one, for distance two, etc.), thus forming a kind of routing table.

An example of the algorithm is shown in Fig. 3. Here, a document characterized as (1,0,1,1,0) stored at node n_4 is being accessed from node n_1. The filters (the filter of the document and the filters at nodes and edges) are compared for a match; at node n_1 this shows that the document is not stored locally, but can be reached at a node at distance 2. At node n_2 a comparison of the filters shows that the document is not stored at the node, but can be reached at distance 1 in the direction of node n_4. At node n_4 we have a perfect match, so the document can be retrieved. However, this result must be confirmed by a complete scan of the documents stored at node n_4, because the filters are unions and may therefore yield false positives.

Fig. 3: Example of probabilistic file addressing
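The routing decision at a single node can be sketched as follows. Plain bitwise Bloom filters are used here for simplicity (GFS uses a modified variant), and the hash construction, filter size and data layout are illustrative only.

# Toy sketch of locating a file with node and edge filters.
import hashlib

FILTER_BITS = 64

def bloom(names):
    """Union filter over a set of object names (three illustrative hash functions)."""
    bits = 0
    for name in names:
        for seed in (b"a", b"b", b"c"):
            h = hashlib.sha256(seed + name.encode()).digest()
            bits |= 1 << (int.from_bytes(h[:4], "big") % FILTER_BITS)
    return bits

def may_contain(filter_bits, name):
    """False means definitely absent; True may be a false positive, so a final
    match must be confirmed by scanning the documents actually stored at the node."""
    b = bloom([name])
    return b & filter_bits == b

def route(name, node_filter, edge_filters):
    """node_filter covers locally stored objects; edge_filters maps
    (edge, distance) to the filter of objects reachable over that edge."""
    if may_contain(node_filter, name):
        return ("local", 0)
    candidates = [(dist, edge) for (edge, dist), f in edge_filters.items()
                  if may_contain(f, name)]
    if not candidates:
        return None                       # fall back to the second (deterministic) algorithm
    dist, edge = min(candidates)
    return (edge, dist)

# n1 does not store "report.pdf" itself; the distance-2 filter on the edge towards
# n2 matches, so (barring rare false positives) the request is forwarded that way.
node_filter = bloom(["notes.txt"])
edge_filters = {("to_n2", 1): bloom(["a.txt"]), ("to_n2", 2): bloom(["report.pdf"])}
print(route("report.pdf", node_filter, edge_filters))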
The second algorithm is used when the first one fails. It is a modified version of the algorithm originally developed by Plaxton et al. [3]. The algorithm uses embedded trees rooted in every node, which are assigned numbers corresponding to the object IDs. Routing to a node means routing to the information on where the physical data is stored. This mechanism is extended with information about objects that might be found during routing to the root, thus increasing fault tolerance.

An example of this algorithm is shown in Fig. 4. The mechanism is similar to routing on an n-dimensional hypercube. Simply said, given the node from which we access the data, we compare particular parts of its ID with the target ID and climb the tree in the direction corresponding to the matching parts. In the example, we access the node identified as 3268 from node 4291. We compare particular elements of the ID from right to left, thus passing through nodes 0328, 0768, 2268 and finally reaching node 3268. As can be seen, the IDs of the nodes we pass through become more similar to the target ID as we come closer to it.

Fig. 4: Example of a deterministic file addressing mechanism

7 Authorization and data protection

Distributed systems nowadays require far more protection of data than ever before. This is due to the simple fact that more people can easily access the distributed world than in the early days of the Internet. Protection using only well-guarded servers is not sufficient; indeed, it is not possible in the architecture that we present here. Data can be stored at potentially dangerous servers, which may not be guarded as necessary. Using them merely as storage is, however, sufficient from the viewpoint of resistance against "physical" attacks.

We propose to use cryptography to protect data content from restricted users (those who are not allowed to know the content, not only those who are not allowed to access the system). Data is therefore encrypted at the client nodes, which ensures that only the client knows the data and controls who can read the content. This approach not only provides good security, but also gives the system good scalability. The system is designed to adopt any appropriate cipher algorithm for this purpose. Of course, the system is open to several other improvements on the server and/or communication side, where the underlying layers can use their own protection mechanisms (IPSEC in the communication layer, hardware encryption of data at the servers, etc.).

Protection of new data is straightforward: the data is encrypted by the user and then sent to the replica manager. Protection of changed data is a more complex problem, which is solved by a modified version of the mechanism presented in [4]. A changed block of data is encrypted and attached to the end of the file, but the old block is also preserved in order to decrypt all data that follows the changed block (we propose to use the CBC mode of a cipher algorithm). Consequently, the whole file does not have to be re-encrypted, so performance is not severely affected. As this solution preserves the old blocks of data, old versions of the file are also retained. The update situation is shown in Fig. 5.

Fig. 5: Update of a file
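The append-only update can be sketched as follows. The sketch chains AES blocks manually (AES-ECB applied to individual blocks) so that the CBC dependency on the preceding ciphertext block is visible; the actual cipher, block layout and version bookkeeping used by GFS may differ, and the class below is purely illustrative.

# Minimal sketch of the append-only update of an encrypted file: the changed
# block is re-encrypted and appended, while the old ciphertext block stays in
# place so that the blocks chained after it remain decryptable.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

BS = 16                                     # AES block size; plaintext blocks must be BS bytes

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class EncryptedFile:
    def __init__(self, key, plaintext_blocks):
        self.key = key
        self.iv = os.urandom(BS)
        self.blocks = []        # every ciphertext block ever written (old ones preserved)
        self.prev = []          # index of the CBC predecessor of each ciphertext block
        self.current = []       # logical block number -> index into self.blocks
        for i, p in enumerate(plaintext_blocks):
            self._append(p, i - 1 if i else None)
            self.current.append(len(self.blocks) - 1)

    def _ecb(self):
        return Cipher(algorithms.AES(self.key), modes.ECB())

    def _append(self, plain, prev_idx):
        prev_ct = self.iv if prev_idx is None else self.blocks[prev_idx]
        enc = self._ecb().encryptor()
        self.blocks.append(enc.update(_xor(plain, prev_ct)) + enc.finalize())
        self.prev.append(prev_idx)

    def update(self, logical_i, new_plain):
        """Re-encrypt only the changed block and append it; the old ciphertext
        stays in place so the following blocks remain decryptable."""
        prev_idx = self.current[logical_i - 1] if logical_i else None
        self._append(new_plain, prev_idx)
        self.current[logical_i] = len(self.blocks) - 1

    def read(self, logical_i):
        idx = self.current[logical_i]
        prev_ct = self.iv if self.prev[idx] is None else self.blocks[self.prev[idx]]
        dec = self._ecb().decryptor()
        return _xor(dec.update(self.blocks[idx]) + dec.finalize(), prev_ct)

f = EncryptedFile(os.urandom(32), [b"A" * BS, b"B" * BS, b"C" * BS])
f.update(1, b"X" * BS)                       # block 1 changes; block 2 is untouched
assert f.read(1) == b"X" * BS and f.read(2) == b"C" * BS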
A distributed file system is accessed by many users, so users' data must be protected against access by other users. On the other hand, some users may want to publish their data to be read by others, so some form of authorization has to be introduced. In Gaston we use the following general principle [5]. The file identifier is extended with an encrypted random number and the effective rights to the file for a given user (Fig. 6). An unencrypted copy of the rights is attached so that the user knows which rights he has currently been assigned. As he does not know the cryptographic key needed to decrypt the encrypted part, he is unable to change the rights. The encryption key is known only to the issuer of the identifier and may be stored in a special part of the file system. The random number is used to prevent guessing of file identifiers. Changing the unencrypted rights attached to the identifier therefore has almost no effect on the resulting access. All user rights are stored in access control lists (ACLs) that are associated with every file.

Fig. 6: Unique file identifier

8 Conclusion

This paper has introduced the general architecture and the key design issues of a large-scale file system. Our proposal, the Gaston file system, is designed to provide high data availability, security and user mobility. These goals are achieved in particular by deploying file replication with reasonable data consistency and fast file location, and also by proper cipher and authorization techniques that allow unencrypted data to be completely hidden from the replica managers. Our design is based on the following basic characteristics, reflecting the essential requirements for DFSs:

• File replication providing high data availability and support for mobile and disconnected users.

• Transactional processing resolving multiple read/write accesses and achieving data consistency corresponding to the requirements of DFSs.

• A dynamic replication scheme reacting to frequently changing access patterns.

• Cryptography for securing user data, with support for any appropriate cipher algorithm.

• Advanced authorization that helps in dealing with access rights.

Future work on the Gaston file system project will include advanced techniques for remote data modification at distrusted replica managers, and also improved replica management in the field of load-dependent request distribution.

Symbols

A_i              set of neighbor nodes of node n_i
ACL              Access Control List
ADR              Adaptive Data Replication
CBC              Cipher Block Chaining
DS               Data Server
DFS              Distributed File System
GFS              Gaston File System
K_i              graph partition i
n_i              node i
RM_i             replica manager i
RS(t_i)          replication scheme at time t_i
R̄S(t_i)          graph complement of the replication scheme RS at time t_i
rcnt_r^j(x_i)    read request count for x_i initiated by node n_j
rcnt_w^j(x_i)    write request count for x_i initiated by node n_j
readset(T_j^i)   set of data read by transaction T_j^i
t_i              time i
T_j^i            transaction with sequence number j performed in partition K_i
writeset(T_j^i)  set of data written by transaction T_j^i
x_i              data item stored at node n_i

References

[1] Davidson, S. B.: An Optimistic Protocol for Partitioned Distributed Database Systems. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, Princeton University, Princeton, NJ, 1982.

[2] Wolfson, O., Jajodia, S., Huang, Y.: An adaptive data replication algorithm. ACM Transactions on Database Systems, 1997, Vol. 22, No. 2, pp. 255–314.

[3] Plaxton, C., Rajaraman, R., Richa, A.: Accessing nearby copies of replicated objects in a distributed environment. In:
Proc. of the 9th ACM Symposium on Parallel Algorithms and Architectures, 1997, pp. 311–320.

[4] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., Zhao, B.: OceanStore: An architecture for global-scale persistent storage. In: Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000.

[5] Coulouris, G., Dollimore, J., Kindberg, T.: Distributed Systems: Concepts and Design. Addison-Wesley Pub. Co., 1994, ISBN 0-20162-433-8.

Ing. Vladimír Dynda
e-mail: xdynda@fel.cvut.cz

Ing. Pavel Rydlo
e-mail: xrydlo@fel.cvut.cz

phone: +420 2 2492 3325
fax: +420 2 2492 3325

Department of Computer Science and Engineering
Czech Technical University in Prague
Faculty of Electrical Engineering
Karlovo náměstí 13
121 35 Praha 2, Czech Republic