Schema Management for Data Integration: A Short Survey

A. Almarimi, J. Pokorný

Schema management is a basic problem in many database application domains such as data integration systems. Users need to access and manipulate data from several databases. In this context, in order to integrate data from distributed heterogeneous database sources, data integration systems demand the resolution of several issues that arise in managing schemas. In this paper, we present a brief survey of the problem of schema matching, which is used for solving problems in schema integration processing. Moreover, we propose a technique for integrating and querying distributed heterogeneous XML schemas.

Keywords: schema matching, schema integration, data integration.

1 Introduction

Heterogeneous data sets contain data that may be represented using different data models and different structuring primitives. They may use different definition and manipulation facilities, and run under different operating systems and on different hardware [3]. Schemas have long been used in information systems for such data sets. They provide a structural representation of data or information. A schema is a model of data sets which can be used for both understanding and querying data. As diverse data representation environments and application programs are developed, it is becoming increasingly difficult to share data across different platforms, primarily because the schemas developed for these purposes are developed independently and suffer from problems like data redundancy and incompatibility.

When we consider different systems interacting with each other, it is very important to be able to transfer data from one system to another. This has led to research on heterogeneous database systems. (Multidatabase systems make up a subclass of heterogeneous database systems.) Heterogeneity in databases also leads to problems like schema matching and integration. The problem of schema matching is becoming an even more important issue in view of the new technologies for the Semantic Web [4].

The operation which produces a match of schemas in order to perform some sort of integration between them is known in the literature as a matching operation. Matching is intended to determine which attribute in one schema corresponds to which attribute in another. Performing a matching operation among schemas is useful for many particular applications such as mediation, schema integration, electronic commerce, ontology integration, data warehousing, and schema evolution. Such an operation takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other [29]. Until recently, schema matching operations have typically been performed manually, sometimes with some support from graphical tools, and they are therefore time-consuming and error-prone. Moreover, as systems become able to handle more complex databases and applications, their schemas become larger. This increases the number of matches to be performed.

The main goal of this paper is to survey briefly the different issues that arise in managing schemas and to show how they are tackled from different perspectives. The remainder of the paper is structured as follows. Section 2 describes schema heterogeneity. Section 3 presents schema matching approaches. Section 4 introduces schema integration methodologies. Section 5 describes data integration. In Section 6 we present our proposal for a data integration system in the context of heterogeneous XML data sources. Section 7 concludes the paper.

2 Schema heterogeneity

Schemas developed for different applications are heterogeneous in nature, i.e. although the data is semantically similar, the structure and syntax of its representation are different. Data heterogeneity is classified according to the level of abstraction at which it is detected and handled (data instance, schema or data model). Schema heterogeneity arises from the different alternatives provided by a single data model for developing schemas of the same part of the real world. For example, a data element modelled as an attribute in one relational schema may be modelled as a relation in another relational schema for the same application domain; a minimal sketch of such a situation follows below.
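To make this concrete, here is a minimal sketch (in Python, with schema and attribute names invented purely for illustration) of the same real-world concept, a customer's address, modelled once as an attribute and once as a separate relation:

```python
# Two hypothetical relational schemas for the same application domain,
# represented simply as {relation_name: [attribute, ...]} dictionaries.

# Schema A models the address as a plain attribute of Customer.
schema_a = {
    "Customer": ["cust_id", "name", "address"],
}

# Schema B models the address as a relation of its own, linked by a
# foreign key. The information content is comparable, but the
# structures differ: schema heterogeneity within one data model.
schema_b = {
    "Customer": ["cust_id", "name"],
    "Address": ["addr_id", "cust_id", "street", "city", "zip"],
}

# A naive attribute-level comparison already misses the correspondence:
# "address" in schema A has no identically named counterpart in schema B.
attrs_a = {a for attrs in schema_a.values() for a in attrs}
attrs_b = {a for attrs in schema_b.values() for a in attrs}
print(attrs_a - attrs_b)  # {'address'} -- unmatched despite equivalent content
```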
The heterogeneity of schemas can be classified into three broad categories:

• Platform and system heterogeneity [22] – differences in operating systems, hardware, and DBMS systems.
• Syntactic and structural heterogeneity – this encompasses differences between data models, schema isomorphism [35], domain and entity definition incompatibility [14], and data value incompatibility [10].
• Semantic heterogeneity – this includes naming conflicts (synonyms and homonyms) and abstraction level conflicts [23] due to generalization and aggregation.

3 Schema matching

To integrate or reconcile schemas we must understand how they correspond. If the schemas are to be integrated, the corresponding information should be reconciled and modelled in a single consistent way. Methods for automating the discovery of correspondences use linguistic reasoning on schema labels and the syntactic structure of the schema. Such methods have come to be referred to as schema matching. Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing.

To motivate the importance of schema matching, we should understand the relation between a symbol and its meaning. We can consider a word to be a symbol that evokes a concept which refers to a thing. The meaning lies in the application that deals with the symbol, and in general in the mind of the designer, not in the symbol itself. Hence, it is difficult to discover the meaning of a symbol. The problem gets more complicated as soon as we move to a more realistic situation in which, for example, an attribute in one schema is meant to be mapped to two more specialized attributes in another schema. In general we can say that the difficulty of schema matching is related to the lack of any formal way to expose the intended semantics of a schema.

To define a match operation, a particular structure for its input schemas and output mapping must be chosen. It can be represented by an entity-relationship model, an object-oriented model, XML, or directed graphs. In each sort of representation, there is a correspondence among the sets of elements of the schemas: for example, entities and attributes in an entity-relationship model; objects in an object-oriented model; elements in XML; and nodes and edges in graphs. A minimal matcher over such element sets is sketched below.
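As a concrete illustration, here is a minimal sketch of a match operation: it treats each schema as a flat list of element names and proposes correspondences by string similarity. The threshold and the use of Python's difflib are illustrative choices of ours, not part of any system surveyed here.

```python
from difflib import SequenceMatcher

def match(schema_a, schema_b, threshold=0.6):
    """Propose a mapping between two schemas given as lists of element
    names. Returns a set of (element_a, element_b, score) triples --
    one simple realization of a mapping as a set of mapping elements."""
    mapping = set()
    for a in schema_a:
        for b in schema_b:
            # Linguistic matching reduced to its crudest form:
            # normalized string similarity of the element labels.
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                mapping.add((a, b, round(score, 2)))
    return mapping

# Hypothetical input schemas:
print(match(["CustName", "Addr", "Phone"],
            ["customer_name", "address", "telephone"]))
```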
A mapping is defined to be a set of mapping elements, each of which indicates how the elements in the schemas are related.

There are several classification criteria that must be considered for the realization of individual matchers. Matching techniques may consider instance-level data, as in [17, 38], or schema-level information [12, 15]. Such techniques can match one or more elements of one schema to one or more elements of the other. The various approaches developed over the years can be grouped into classes, according to the kind of information and the actual idea used:

• Manual approaches. The mechanisms used in these approaches rely on an expert to perform the matching, for example by drag and drop.
• Schema based approaches. These are based on knowledge of the internal structure of a schema and its relation with other schemas.
• Data driven approaches. Here, the similarities are more likely to be observed in the data than in the schema.

4 Schema integration

Schema integration is the process of combining database schemas into a coherent global view. Schema integration is necessary in order to reduce data redundancy in heterogeneous database systems. It is often hard to combine different database schemas because of the different data models or structural differences in how the data is represented and stored. Thus, there are many factors that may cause schema diversity [6]:

• different user or view perspectives,
• equivalence among constructs of the model,
• incompatible design specifications,
• common concepts represented by different representations.

There are several features of schema integration that make it difficult. The key issue is the resolution of conflicts among the schemas. A schema integration method can be viewed as a set of steps to identify and resolve conflicts. Schema conflicts represent differences in the semantics that different schema designers associate with syntactic representations in the data definition language. Even when two schemas are in the same data model, naming and structural conflicts may arise.

Naming conflicts occur when the same data is stored in multiple databases but is referred to by different names. They arise when names are homonyms and when names are synonyms. The homonym naming problem occurs when the same name is used for two different concepts; the synonym naming problem occurs when the same concept is described using two or more different names.

Structural conflicts arise when data is organized using different model constructs or integrity constraints. Some common structural conflicts are:

• type conflicts – using different model constructs to represent the same data,
• dependency conflicts – a group of concepts related differently in different schemas (e.g. 1-to-1 participation versus 1-to-N participation),
• key conflicts – a different key for the same entity,
• interschema properties – schema properties that only arise when two or more schemas are combined.

The schema integration process involves three major steps (a toy sketch of this pipeline follows the list):

1. Pre-integration, a step in which input schemas are re-arranged in various ways to make them more homogeneous (both syntactically and semantically).
2. Correspondence identification, a step devoted to the identification of related items in the input schemas and the precise description of these inter-schema relationships.
3. The final step, which actually unifies the corresponding items into an integrated schema and produces the associated mappings.
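To fix ideas, the three steps can be read as a simple pipeline. The following sketch is hypothetical and only shows the shape of the process: each step is reduced to a trivial placeholder (name normalization for pre-integration, exact-name equality standing in for a real matcher, and merging of matched elements for unification).

```python
def pre_integrate(schema):
    # Step 1: homogenize the input, here just by normalizing element names.
    return [name.lower().replace("_", "") for name in schema]

def identify_correspondences(schema_a, schema_b):
    # Step 2: describe inter-schema relationships; here, exact-name
    # equality stands in for a real matcher.
    return {(a, b) for a in schema_a for b in schema_b if a == b}

def unify(schema_a, schema_b, correspondences):
    # Step 3: produce the integrated schema plus the associated mappings.
    matched_b = {b for (_, b) in correspondences}
    integrated = list(schema_a) + [b for b in schema_b if b not in matched_b]
    return integrated, correspondences

a = pre_integrate(["Cust_ID", "Name", "Address"])
b = pre_integrate(["custid", "name", "phone"])
merged, mapping = unify(a, b, identify_correspondences(a, b))
print(merged)   # ['custid', 'name', 'address', 'phone']
print(mapping)  # {('custid', 'custid'), ('name', 'name')}
```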
A robust integration methodology must be able to handle both naming and structural conflicts. There have been various attempts from different perspectives. The work in [25] broadly classifies these attempts into two categories:

• Structural approaches – also called the common data model approach. Here, the participating databases are mapped to a common data model. The problem with such systems is the amount of human participation required: human intervention is needed to qualify the mappings between the individual databases and the common model.
• Semantic approaches – these use a higher order language that can express information ranging over individual databases. Ontology based integration approaches belong to this category. Many research projects (SHOE [21], ONTOBroker [7], OBSERVER [28]) and others use ontologies to create a global schema [20, 30].

In the past several years, many systems have been developed in various research projects on data integration using the techniques mentioned above. Here are some of the more prominent representative systems:

• Pegasus [1] takes advantage of object-oriented data modelling and programming capabilities. It allows the user to access and manipulate multiple autonomous heterogeneous distributed object-oriented, relational, and other information systems through a uniform interface.
• Mermaid [36] uses a relational common data model and allows only relational schema integration.
• Clio [34] was developed by IBM around 2000. It involves transforming legacy data into a new target schema. Clio introduces an interactive schema mapping paradigm based on value correspondences.
• Garlic [11, 18] uses an ODMG-93 based object-oriented model. It extends ODMG to allow modelling of data items in the case of a relational schema with weak entities.
• TSIMMIS [13, 37] and MedMaker [31] were developed at Stanford around 1995. They use the Object Exchange Model (OEM) [32] as a common data model. OEM allows irregularity in data. The main focus is to generate mediators and wrappers based on an application specification.
• MIX [8, 3], a successor of TSIMMIS, uses XML to provide the user with an integrated view of the underlying database systems. It provides a query/browsing interface called Blended Browsing and Querying.

These are the prominent techniques in the structural approach. There are many other techniques which use an ontology as a common data model or use ontologies to translate queries over component databases. Below we present some of these techniques:

• Information Manifold [24] employs a local-as-view approach. It has an explicit notion of a global schema/ontology.
• The OBSERVER [28] system uses a different strategy for information integration. Instead of creating a global ontology to support all the underlying source schemas, it allows individual ontologies and defines terminological relationships between them.

5 Data integration

Data integration is the process of combining data at the entity level. After schema integration has been completed, a uniform global view has been constructed. However, it may be difficult to combine all the data instances in the combined schemas in a meaningful way; combining the data instances is the focus of data integration. Data integration is difficult because similar data entities in different databases may not have the same key, and determining which instances in two databases are the same is then a complicated task. Entity identification [27] is the process of determining the correspondence between object instances from more than one database; a small illustration follows below.
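As a purely illustrative sketch (the attribute names and the normalization rule are our assumptions, not taken from [27]), entity identification can fall back on comparing non-key attributes when the keys of two databases are unrelated:

```python
def normalize(record):
    # Compare on non-key attributes: a crude normalized (name, city) pair.
    return (record["name"].strip().lower(), record["city"].strip().lower())

def identify_entities(db1, db2):
    """Pair up records from two databases that appear to denote the
    same real-world entity, despite having unrelated keys."""
    index = {normalize(r): r for r in db2}
    return [(r, index[normalize(r)]) for r in db1 if normalize(r) in index]

db1 = [{"cust_id": 17, "name": "Ann Novak", "city": "Prague"}]
db2 = [{"cid": "A-9", "name": "ann novak ", "city": "PRAGUE"}]
print(identify_entities(db1, db2))  # one matched pair, keys 17 and "A-9"
```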
Data integration is further complicated because attribute values in different databases may disagree or may be range values. Simply put, data integration is the process which:

• takes as input a set of databases (schemas), and
• produces as output a single unified description of the input schemas (the integrated schema) and the associated mapping information supporting integrated access to existing data through the integrated schema.

Parent and Spaccapietra [33] present a general data integration process in their survey on database integration. First, they convert heterogeneous schemas to a homogeneous representation, using transformation rules that explain how to transform constructs of the source data models to the corresponding ones in the target common data model. The transformation specification produced by this step specifies how to transform instance data from a source schema to the corresponding target schema. Then, correspondences are investigated, using the semantic descriptions of the data to produce correspondence assertions. Finally, the correspondence assertions and integration rules are used to produce the unified schema.

In general, data integration systems can be classified into data-warehouse and mediator-wrapper systems. A data warehouse [9] is a decision support database that is extracted from a set of data sources. The extraction process requires data to be transformed from the source format into the data warehouse format. A mediator-wrapper approach [39] is used to integrate data from different databases and other data sources by introducing a middleware virtual database, called a mediator, between the data sources and the applications using them. Wrappers are interfaces to data sources that translate data into a common data model used by the mediator.

Based on the direction of the mappings between the source schemas and a global or common schema, mediator-wrapper systems can be classified into so-called global-as-view and local-as-view systems [19, 26]. In global-as-view (GAV) approaches [16], each item in the global schema/ontology is defined in terms of the source schemas/ontologies. In local-as-view (LAV) approaches, each item in each source schema/ontology is defined in terms of the global schema/ontology. Methods for query rewriting and query answering using views, including the most important LAV techniques in the literature, are presented in [11].

6 Integration and querying XML via mediation

In this section, we propose a general framework for a system for XML data Integration and Querying XML via Mediation (IQXM) [2]. The architecture of IQXM is shown in Fig. 1.

[Fig. 1: System architecture]

IQXM mainly addresses the problem of integrating heterogeneous XML data sources. It can be used for resolving structural and semantic conflicts for distributed heterogeneous XML data. A global XML schema is specified by the designer to provide a homogeneous view over the heterogeneous XML data. A mediation layer is proposed for describing mappings between the global and local schemas. This XML mediation layer manages: (1) establishing appropriate mappings between the global schema and the schemas of the sources; (2) querying the XML data sources in terms of the global schema. The XML data sources are described by the XML Schema language.
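Before detailing the two tasks, here is a hypothetical sketch of the kind of path-level mapping the mediation layer maintains: a metadata table grouping equivalent global and local paths under shared indices, and a lookup that rewrites a global path into per-source local paths. All element and source names are invented for illustration; the actual IQXM process is described next.

```python
# Hypothetical XML Metadata Document content: groups of equivalent paths,
# keyed by a shared index, pairing a global path with per-source local paths.
metadata = {
    1: {"global": "/library/book/title",
        "src1": "/catalog/item/name",
        "src2": "/books/book/title"},
    2: {"global": "/library/book/author",
        "src1": "/catalog/item/writer",
        "src2": "/books/book/by"},
}

def translate(global_path):
    """Rewrite a path expressed against the global schema into the
    corresponding local paths, one per XML data source."""
    for group in metadata.values():
        if group["global"] == global_path:
            return {src: p for src, p in group.items() if src != "global"}
    raise KeyError(f"no mapping for {global_path}")

# A global query touching /library/book/author becomes two local subqueries:
print(translate("/library/book/author"))
# {'src1': '/catalog/item/writer', 'src2': '/books/book/by'}
```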
The first of these tasks is performed through a semi-automatic process that generates local and global paths. A tree structure for each XML schema is constructed and represented in a simple form. This is in turn used for manually assigning indices that match local paths to the corresponding global paths. By gathering all paths with the same indices, the equivalent local and global paths are grouped automatically, and an XML Metadata Document is constructed. For the second task, the Query Translator decomposes global queries into sets of subqueries: a global query from an end-user is translated into local queries for the XML data sources by looking up the corresponding paths in the XML Metadata Document.

7 Conclusion

In this paper, we have presented some problems behind schema management, such as schema matching and schema integration. Schema matching is a basic problem in many database application domains. We have introduced some of the past and current approaches employed to solve these problems. Finally, we have described a framework for an XML data integration and querying system.

Acknowledgements

This work was supported in part by the National programme of research (Information society project 1ET100300419).

References

[1] Ahmed, R. et al.: "The Pegasus Heterogeneous Multidatabase System." IEEE Computer, Vol. 24, 1991, p. 19–27.
[2] Almarimi, A., Pokorný, J.: "Querying Heterogeneous Distributed XML Data." In: Databases and Information Systems, Int. Baltic Conf. on DB&IS 2004, Riga, Latvia, Acta Universitatis Latviensis, Latvias Universitate, 2004, p. 177–191.
[3] Attaluri, G. et al.: "The CORDS Multidatabase Project." IBM Systems Journal, Vol. 34, 1995, No. 1, p. 39–62.
[4] Berners-Lee, T., Hendler, J., Lassila, O.: "The Semantic Web: A New Form of Web Content that is Meaningful to Computers will Unleash a Revolution of New Possibilities." Scientific American, Vol. 284, 2001, p. 34–43.
[5] Baru, C. et al.: "XML-Based Information Mediation with MIX." In: Proc. of the ACM SIGMOD International Conference on Management of Data, 1999, p. 597–599.
[6] Batini, C., Lenzerini, M., Navathe, S.: "A Comparative Analysis of Methodologies for Database Schema Integration." ACM Computing Surveys, Vol. 18, 1986, No. 4, p. 323–364.
[7] Benjamins, R., Fensel, D.: "The Ontological Engineering Initiative KA2." In: Proc. of the 1st Int. Conf. on Formal Ontologies in Information Systems, FOIS'98 (Ed. N. Guarino), Trento, Italy, IOS Press, 1998, p. 287–301.
[8] Baru, C. et al.: "XML-Based Information Mediation with MIX." In: Proc. of SIGMOD'99, 1999, p. 597–599.
[9] Bernstein, P. A., Rahm, E.: "Data Warehouse Scenarios for Model Management." In: Proc. 19th Int. Conf. on Entity-Relationship Modeling, Lecture Notes in Computer Science, Vol. 1920, Springer, Berlin Heidelberg New York, 2000, p. 1–15.
[10] Breitbart, Y. et al.: "Database Integration in a Distributed Heterogeneous Database System." In: Proc. of 2nd Int. IEEE Conf. on Data Engineering, Los Angeles, CA, 1986.
[11] Calvanese, D., Lembo, D., Lenzerini, M.: "Survey on Methods for Query Rewriting and Query Answering Using Views." Technical report, University of Roma, Italy, April 2001.
[12] Castano, S. et al.: "Global Viewing of Heterogeneous Data Sources." IEEE Trans. on Knowledge and Data Engineering, Vol. 13, 2001, No. 2, p. 277–297.
[13] Chawathe, S. et al.: "The TSIMMIS Project: Integration of Heterogeneous Information Sources." In: Proc. of the Information Processing Society of Japan Conference, Tokyo, Japan, 1995, p. 7–18.
[14] Czejdo, B., Rusinkiewicz, M., Embley, D.: "An Approach to Schema Integration and Query Formulation in Federated Database Systems." In: Proc. of ICDE, 1987, p. 477–484.
[15] Doan, A. H., Domingos, P., Levy, A.: "Learning Source Descriptions for Data Integration." In: Proc. WebDB Workshop, 2000, p. 81–92.
[16] Friedman, M., Levy, A., Millstein, T.: "Navigational Plans for Data Integration." In: Proc. of the 16th National Conf. on AAAI '99, Orlando, Florida, 1999, p. 67–73.
[17] Goldman, R., Widom, J.: "DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases." In: Proc. of 23rd Int. Conf. on VLDB, Athens, Greece, 1997, p. 436–445.
[18] Haas, L. et al.: "Optimizing Queries across Diverse Data Sources." In: Proc. of the 23rd Int. Conf. on VLDB, Athens, Greece, 1997, p. 276–285.
[19] Halevy, A. Y.: "Answering Queries Using Views: A Survey." VLDB Journal, Vol. 10, 2001, No. 4, p. 270–294.
[20] Hakimpour, F., Geppert, A.: "Resolving Semantic Heterogeneity in Schema Integration: An Ontology Based Approach." In: Proc. of Int. Conf. on Formal Ontologies in Information Systems FOIS'01 (Eds. Ch. Welty and B. Smith), New York, ACM Press, October 2001, p. 297–308.
[21] Heflin, J., Hendler, J.: "Semantic Interoperability on the Web." In: Proc. of Extreme Markup Languages 2000, Graphic Communications Association, 2000, p. 111–120.
[22] Hull, R.: "Managing Semantic Heterogeneity in Databases: A Theoretical Perspective." In: Proc. of Principles of Database Systems (PODS'97), Tucson, Arizona, USA, 1997, p. 51–61.
[23] Kashyap, V., Sheth, A.: "Semantic and Schematic Similarities between Database Objects: A Context-Based Approach." VLDB Journal, Vol. 5, 1996, No. 4, p. 276–304.
[24] Kirk, T. et al.: "The Information Manifold." In: Proc. of AAAI Spring Symposium on Information Gathering, AAAI, Stanford, CA, March 1995, p. 85–91.
[25] Lakshmanan, L., Sadri, F., Subramanian, I.: "On the Logical Foundations of Schema Integration and Evolution in Heterogeneous Database Systems." In: Proc. of DOOD'93, Phoenix, AZ, 1993, p. 81–100.
[26] Lenzerini, M.: "Data Integration: A Theoretical Perspective." In: Proc. of the ACM Symposium on Principles of Database Systems, Madison, Wisconsin, USA, June 2002, p. 233–246.
[27] Lim, E. et al.: "Entity Identification in Database Integration." In: Proc. of Int. Conf. on Data Engineering, Los Alamitos, CA, USA, IEEE Computer Society Press, 1993, p. 294–301.
[28] Mena, E. et al.: "Domain Specific Ontologies for Semantic Information Brokering on the Global Information Infrastructure." In: Proc. of International Conference on Formal Ontologies in Information Systems, FOIS'98, Trento, Italy, IOS Press, June 1998, p. 269–283.
[29] Milo, T., Zohar, S.: "Using Schema Matching to Simplify Heterogeneous Data Translation." In: Proc. of 24th Int. Conf. on VLDB, 1998, p. 122–133.
[30] Visser, P. R. S. et al.: "Resolving Ontological Heterogeneity in the KRAFT Project." In: Proc. of 10th Int. Conf. on Database and Expert Systems Applications DEXA'99, University of Florence, Italy, August 1999, p. 668–677.
[31] Papakonstantinou, Y., Garcia-Molina, H., Ullman, J.: "MedMaker: A Mediation System Based on Declarative Specifications." In: Proc. of ICDE Conference, New Orleans, February 1996, p. 132–141.
[32] Papakonstantinou, Y., Garcia-Molina, H., Widom, J.: "Object Exchange across Heterogeneous Information Sources." In: Proc. of 11th Int. Conf. on Data Engineering, Taipei, Taiwan, March 1995, p. 251–260.
[33] Parent, C., Spaccapietra, S.: "Issues and Approaches of Database Integration." CACM, Vol. 41, 1998, No. 5, p. 166–178.
[34] Miller, R. J. et al.: "Schema Mapping as Query Discovery." In: Proc. of 26th Int. Conf. on VLDB, Cairo, Egypt, September 2000, p. 77–87.
[35] Sheth, A., Kashyap, V.: "So Far (Schematically) yet So Near (Semantically)." In: Proc. of the IFIP DS-5 Conference on Semantics of Interoperable Database Systems, Lorne, Australia, November 1992, p. 283–312.
[36] Templeton, M. et al.: "Mermaid – A Front-End to Distributed Heterogeneous Databases." Proceedings of the IEEE, Vol. 75, 1987, No. 5, p. 695–708.
[37] Ullman, J.: "Information Integration Using Logical Views." In: Proc. of the Int. Conf. on Database Theory, 1997, p. 19–40.
[38] Wang, Q., Wong, K.: "Approximate Graph Schema Extraction for Semi-Structured Data." In: Proc. Extending Database Technology, Lecture Notes in Computer Science, Vol. 1777, Springer, Berlin Heidelberg New York, 2000, p. 302–316.
[39] Wiederhold, G.: "Mediators in the Architecture of Future Information Systems." IEEE Computer, Vol. 25, 1992, No. 3, p. 38–49.

Abdelsalam Almarimi, MSc.
e-mail: belgasem_2000@yahoo.com

Department of Computers
Czech Technical University
Faculty of Electrical Engineering
Karlovo nám. 13
121 35 Praha 2, Czech Republic

Prof. RNDr. Jaroslav Pokorný, CSc.
e-mail: pokorny@ksi.ms.mff.cuni.cz

Department of Software Engineering
Charles University
Faculty of Mathematics and Physics
Malostranské nám. 25
118 00 Praha 1, Czech Republic