Schema Integration of Web Tables (SIWeT)
Nayyer Masood1, Amna Bibi2, Muhammad Arshad Islam1
1 Department of Computer Science, Capital University of Science & Technology, Islamabad, Pakistan
2 Department of Computer Science, Virtual University, Rawalpindi, Pakistan
Corresponding Author: nayyer@cust.edu.pk
Abstract:
Schema integration has mainly been applied in database environments, whether the schemas to be integrated belong to a single organization or to multiple ones. Schema extraction is a relatively new area in which a schema is extracted from a web table; such an extracted schema is not as concretely defined as in a typical database environment. The work in this paper brings the two areas together: schemas extracted from multiple web tables are integrated to form a global schema. The data are also extracted from the web tables and placed into a global table. This creates a large repository of data of the same domain, extracted dynamically from websites, which is then available for different types of ad-hoc queries. This work also poses challenges for schema integration to be studied in the context of schema extraction, and the other way round.
Keywords: Schema Extraction, Schema Integration, Semantic Heterogeneities, Web Tables.
1. Introduction
The web is a tremendous source of a huge volume of data in structured, semi-structured and unstructured formats. Structured data include lists and tables. A table is a collection of rows containing data in one or more columns, where each column represents an attribute. Web tables are a simple, meaningful, effective and popular way of representing data on the web. Data from these sources can be accessed either by using search engines or by navigating through the web pages. However, the structures (or schemas) of these web tables are not available, so they cannot be queried efficiently; hence their data mostly remain inaccessible to users.
Extracting data from these tables and storing it in a database could be very useful for many value-added services, such as in the business and social web domains. In the business domain, for example, data extraction techniques help reduce time and manpower and increase business efficiency. These data can help analysts and managers revise or change business strategies and plans. Context-aware advertising, customer care, comparative shopping, meta-querying, opinion
mining and database building are the major
applications of web data extraction techniques
[1].
Schema extraction is the process of extracting a schema from structured data (e.g. tables, spreadsheets) on the web. The process is followed by also fetching the data from these web tables. The extracted schema is used to create tables in a database, which are populated with the extracted data. The database tables can then be used for better and more efficient querying.
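As an illustration, the following minimal Python sketch shows how an extracted schema and its data could be placed into a database table. The table name, column names and sample values are hypothetical, and the uniform TEXT type is a simplifying assumption; a real implementation would infer types.

import sqlite3

def store_extracted_table(db_path, table_name, columns, rows):
    # Create a database table from an extracted web-table schema
    # and populate it with the extracted data rows.
    con = sqlite3.connect(db_path)
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)  # TEXT by default; type inference could refine this
    con.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    con.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    con.commit()
    con.close()

# Hypothetical example: schema and data extracted from a faculty web table
store_extracted_table("faculty.db", "faculty_cust",
                      ["Name", "Designation", "Research Interests"],
                      [("Dr. A. Khan", "Professor", "Databases")])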
Many sites on the web belong to the same domain, for example, sites from the banking, education or entertainment domains. We can apply schema extraction to multiple sites of the same domain and store the tables extracted from these sites at a single place for efficient querying. However, even in this case a query will have to be applied to individual tables if data are to be accessed from multiple sites. It would be more beneficial if schema integration could be applied to the schemas extracted from the multiple sites of the same domain. Schema integration is the process of merging/combining the same or similar data items to build a canonical schema. For example,
every university website shows its faculty information, generally including each faculty member's name, designation, highest degree and research interests. Many university websites show these data in the form of web tables due to the ease of creating, visualizing and understanding tables. If schema extraction is applied to multiple universities' websites to fetch the schema and data and store them in database tables, schema integration can then be applied to these database tables to get a canonical schema. This gives us a single schema/table containing data of faculty members belonging to different universities. We can then apply advanced/efficient queries to extract the required information about faculty members of different universities.
The structure of this paper is as follows: Section 2 reviews the related literature, encompassing both schema extraction and schema integration. Section 3 presents the proposed approach, comprising three phases. Finally, Section 4 concludes the paper.
2. Literature Review
The review of related literature falls into two major categories: schema extraction and schema integration. Both are discussed in the following.
2.1. Schema Extraction
Several approaches are available for schema extraction; they can broadly be categorized as manual, wrapper induction and automatic extraction. In wrapper induction [2], [3], the pages/data records are first labeled manually and a set of extraction rules is deduced. These rules are then used to extract data from similar pages, so the technique still involves manual effort. Automatic methods [4], [6] find patterns or grammars in similar pages containing similar data records and use these patterns to extract data from new pages. The pages from which to extract the patterns are provided manually or by some other system.
The approach of Zhai & Liu [6] provides an automated system to extract data from the web and put it in a database. Web pages are constructed using HTML tags: the <table> tag represents a table on a web page, the <tr> tag inserts the rows and the <td> tag inserts the data into a particular cell of a row. The proposed approach first identifies the data records. For this purpose, visual information is used to construct a tag tree, which is built by following the nested structure of the HTML code. The second task is to align and extract data from the identified data records using a partial tree alignment technique. Tree edit distance is used to identify the data records, and trees are matched with each other node by node. Trees are aligned by gradually growing a seed tree Ts; the tree with the maximum records is chosen as the starting seed tree. Then, for each node ni in a tree Ti, a matching node ns is sought in Ts. When a matching node is found in Ts, a link from ni to ns is created. If no match is found for ni, the seed tree is expanded by inserting this node into it. Data item nodes are not used during the matching. The data items in the matched nodes' children are inserted into one column of the database table.
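The following Python sketch gives a rough illustration of the seed-tree growing idea; it matches nodes by tag name only, which is a simplification of the partial tree alignment of [6], not the authors' implementation.

class Node:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []

def align(seed, tree):
    # Align `tree` against the growing seed tree: children that match by tag
    # are linked (returned as pairs); unmatched children grow the seed tree.
    links = []
    matched = set()
    for child in tree.children:
        # look for a not-yet-matched seed child with the same tag
        target = next((s for s in seed.children
                       if s.tag == child.tag and id(s) not in matched), None)
        if target is not None:
            matched.add(id(target))
            links.append((child, target))
            links.extend(align(target, child))   # recurse into matched subtrees
        else:
            seed.children.append(child)          # expand the seed tree
    return links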
Adelfio & Samet [4] proposed a conditional random field (CRF) based classification technique combined with logarithmic binning. Each web table contains different types of rows, like caption, heading, empty and data rows. Each row is classified based on row features, which include the formatting, layout, style and values of its cells; these features are combined using binning to construct the row features. In the next step, a logarithmic binning method is applied in which individual cell attributes are used collectively to encode row features. For each possible row feature a bin is formed, and each bin is assigned a value which represents its feature. After row feature extraction, row labels are assigned to each row by a CRF. The CRF is trained with human-classified rows; after training, it is used to label huge volumes of data. The output of the CRF is a sequence of row labels like “TNNHGDDDAGDDDABN”. This output helps in extracting the schema of the relational
table. Column names are decided based upon the header row(s), the data type is determined by the type frequency within the data rows of each column, additional attributes can be determined from the group header rows, and data rows are determined by the data records.
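The row-label sequence can then be split into header, group header and data rows, as in the short Python sketch below. It assumes H, G and D stand for those three row types, following the row types named above; the full label set of [4] may differ.

def split_rows(labels, rows):
    # Partition physical rows according to their CRF-assigned labels.
    header_rows = [r for lab, r in zip(labels, rows) if lab == "H"]
    group_rows  = [r for lab, r in zip(labels, rows) if lab == "G"]
    data_rows   = [r for lab, r in zip(labels, rows) if lab == "D"]
    return header_rows, group_rows, data_rows

labels = "TNNHGDDDAGDDDABN"
header, groups, data = split_rows(labels, list(range(len(labels))))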
Nagy, Embley and Seth [7] introduced a technique to convert web tables into relational tables. This is the first end-to-end approach that produces a canonical table ready for querying. The HTML table is converted into an Excel table, and from the Excel table its CSV file is generated. The table is segmented based upon the indexing property rather than on appearance features. To segment the table, the minimum indexing point (MIP) and four critical cells CC1, CC2, CC3 and CC4 are calculated. CC1 and CC2 determine the stub headers; CC3 and CC4 indicate the data regions. The MIP (CC2) is determined by searching from cell A1 for unique column and row header rows. The categories can be extracted by comparing the number of unique elements in the cross-product of a pair of header rows with the length of the header. From the category extraction output, a canonical table is generated, which can be used to query the data.
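The cross-product test described above can be illustrated with a short Python sketch; the row values in the comment are invented examples.

def is_category_pair(header_row_a, header_row_b):
    # A pair of header rows indexes the columns uniquely if the number of
    # distinct (value, value) combinations equals the header length.
    combos = list(zip(header_row_a, header_row_b))
    return len(set(combos)) == len(combos)

# e.g. ("2016", "2016", "2017", "2017") paired with ("Male", "Female", "Male", "Female")
# yields 4 unique combinations for 4 columns, so the pair forms a category index.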
The technique proposed by Purnamasari, Wicaksana, Harmanto and Banowosari in [8] first finds the area of the table and then extracts data from it. First of all, the table is detected, and then the property (title) portion of the table is detected before extracting data from it. The technique is divided into three steps, and an algorithm for each step is formulated. In the first step, the number of rows and columns is calculated by counting the <tr>…</tr> tags in the <table> tag and the <td>…</td> tags in each <tr> tag. The algorithm also checks the colspan attribute in the <td> tag and adds the value of colspan to the column count. In the second algorithm, the property of the table is detected. Generally, the first row of the table contains the headings of the columns. The algorithm checks the rowspan attribute in each <td> tag of the first <tr> tag of the table to calculate the length of the property of the table. The third algorithm actually extracts the data from the table. It takes the rowspan value returned by the second algorithm to extract the headings of the columns. While reading the data in a <td> tag of a <tr> tag, it checks the value of colspan; if its value is greater than 1, it concatenates the content of this cell with the columns below it. After reading the header rows, it reads the cells row by row.
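The first of these steps can be illustrated with a minimal Python sketch that counts <tr> rows and <td>/<th> cells, adding colspan values to the column count. The paper gives its own algorithms, so this is only an approximation using Python's standard html.parser.

from html.parser import HTMLParser

class TableCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cols_per_row = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.rows += 1
            self.cols_per_row.append(0)
        elif tag in ("td", "th") and self.cols_per_row:
            # a cell with colspan n occupies n column positions
            self.cols_per_row[-1] += int(attrs.get("colspan", 1))

counter = TableCounter()
counter.feed("<table><tr><th colspan='2'>Name</th><th>Rank</th></tr>"
             "<tr><td>A</td><td>B</td><td>C</td></tr></table>")
print(counter.rows, max(counter.cols_per_row))   # 2 rows, 3 columns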
2.2. Schema Integration
It is the process that takes two or more
schemas as input (called source or member
schemas) and merges them into a
single/canonical one. Other terms used for
schema integration are database integration,
data integration, database interoperability, etc.
The most critical step in schema integration is schema matching, in which two schemas are compared with each other to identify the elements modeling the same or similar concepts. Once identified, the similar elements are merged into a common one in the schema merging phase, which is a relatively straightforward task. The main problem in schema matching is the identification and resolution of semantic heterogeneities. A semantic heterogeneity reflects a situation in which the same or a similar concept is modeled differently in two schemas. Such differences arise mainly due to differences in organizational context, the popularity of acronyms in defining schemas, idiosyncratic abbreviations, and different models of the same domain
[11]. Schema integration approaches can be broadly categorized into two groups: schema-based and instance-based.
Schema-based integration approaches exploit the information in the schema, like names, data types and different types of constraints, to identify semantically similar schema elements. These techniques have been further classified as element-level and structure-level in [13]. Element-level schema matching approaches compare the entity types from different schemas in isolation, without considering their links/relationships with other entity types. These approaches mainly include string-based techniques [14], [15] that use matchers like prefix, suffix, edit distance
and N-gram; NLP-based techniques [16], [17] that apply natural language processing techniques, like tokenization, lemmatization and stop-word elimination, to the names of the entity types and then apply some matching technique; and constraint-based techniques [18], where constraints from the schema, like data types and integrity constraints, are used to compare the schema elements. Structure-level approaches mainly cover graph-based [19], taxonomy-based [20] and model-based [21] techniques.
A hybrid approach has been adopted in COMA++ [22], where a combination of matchers is applied to the input schemas and the results are combined to establish the final similarity between elements.
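Two of the element-level string matchers named above, N-gram similarity and a normalized edit-distance score, can be sketched in Python as follows; these are generic textbook formulations, not the exact matchers of [14], [15].

def ngrams(s, n=2):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_sim(a, b, n=2):
    # Jaccard overlap of the two strings' N-gram sets (here bigrams)
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / max(len(ga | gb), 1)

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_sim(a, b):
    # normalize the distance to a similarity score in [0, 1]
    return 1 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b), 1)

print(ngram_sim("Designation", "Desig."), edit_sim("Designation", "Desig."))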
Table I. Comparison of Different SE Approaches

S# | Paper Reference | Schema Extraction | Fully Automated | File Format              | Techniques  | Data Domain       | Multiple Sources
1  | [4]             | Yes               | Yes             | HTML table, spreadsheet  | Supervised  | Different domains | No
2  | [6]             | No                | Yes             | HTML table               | Tree based  | Shopping data     | No
3  | [7]             | No                | Yes             | HTML table, spreadsheet  | Index based | Statistical data  | No
4  | [8]             | No                | Yes             | HTML tables              | Programming | Not mentioned     | No
5  | [9]             | No                | Yes             | HTML tables              | Tree based  | Different domains | No
6  | [10]            | No                | Yes             | HTML tables              | Tree based  | Different domains | No
The literature review of schema merging approaches reveals that most approaches strive to maximize the automated part of the process, as performing SI completely manually is a very time-consuming and laborious task. Moreover, the most critical part of the SI process is handling the semantic heterogeneities that exist across multiple schemas, due to the fact that these schemas are built independently in contexts that are entirely different from each other, even when they belong to the same domain.
3. Proposed Approach
This article presents the novel idea of applying the schema integration (SI) process to schemas that have been extracted, through a schema extraction (SE) process, from the web tables of multiple websites belonging to the same domain. To prove the concept, it is planned to test the proposed approach on the domain of faculty members of computer science departments of different universities; however, the idea can be applied to any domain. The basic idea behind this approach is to access those websites where the data of faculty members have been placed in the form of tables, as shown in Fig. 1 below. Then, using the SE approach presented in [8], the schema and data are extracted from the different websites. After that, different schema matching
approaches will be applied to identify semantically similar elements among the elements extracted from the different universities' websites. The semantically similar elements are merged with each other, and the data are finally stored in a single table for further queries. The proposed approach comprises three major phases: preprocessing, schema extraction and schema integration. These three phases are explained in the following.
3.1. Preprocessing
The basic objective of this step is to provide a neat and clean web table source to the SE step so that an accurate schema can be extracted from it. A neat and clean web page source means removing all unnecessary or irrelevant code or tags from the web page source, that is, any source or tags other than those containing the web table, its headings and its data.
Figure 1: An Example Web Table of Faculty Data
This is a critical and difficult task, as there is a great deal of difference in the way web tables are defined on different websites.
In the first step of the preprocessing phase, university websites that store faculty data in web table form are found manually, and their URLs are stored in a database table along with other basic information about the university and department. This is an ongoing process, and the database of university pages will keep on growing. Once we have these data, the table is handed over to a crawler, which picks the URLs of the websites one by one, downloads the source code of the web pages and stores it in a text file. In the next step, clipping is performed: additional or irrelevant code/tags are removed, and only the part contained within the <table> and </table> tags, which holds the web table, is kept. This is a bit tricky, as a web page generally contains many tables (for example, for formatting purposes), and out of those tables the one that presumably contains the required data must be picked. One possible strategy is to pick the table that contains multiple rows with multiple columns within each row; this is also a requirement of the SE approach [8] selected in the proposed approach. As an example, parts of the HTML code of two web tables (after clipping) are shown in Fig. 2 below. Both of these pages present the faculty data in the form of a table, but the variation in the coding can still be seen: the code in the right column
contains a lot of formatting instructions, whereas the one on the left simply contains the data inside HTML tags. The SE approach that we have selected [8] assumes the web table to be in a specific format (Fig. 1). However, it is possible that a website contains the <table>, </table> tags but is still not in the required format. So one objective of the preprocessing phase is to identify such pages and set them aside rather than passing them to the next phase, because the adopted approach will not be able to successfully extract a schema from such pages.
<table>
<tr>
<td>Staff Name</td>
<td>Staff Designation</td>
<td>Staff Image</td>
</tr>
<tr>
<td>Dr. Muhammad Anwar-ul-Rehman Pasha</td>
<td>Professor</td>
<td>…</td>
</tr>
<tr>
<td>Mr. Abid Rafique</td>
<td>Assistant Professor</td>
<td>…</td>
</tr>
<tr>
<td>Dr. Muhammad Din Choudhry</td>
<td>Assistant Professor</td>
<td>…</td>
</tr>
…
</table>
Figure 2: Sample Code for Two Different Web Tables
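A minimal Python sketch of the clipping heuristic described above is given below. It uses the standard urllib module and a simple regular-expression scan, which would need refinement for nested tables; the thresholds are illustrative assumptions.

import re
from urllib.request import urlopen

def clip_candidate_tables(url, min_rows=3, min_cols=2):
    # Download the page source and keep only tables that look like data
    # tables: several rows, each with several cells.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    tables = re.findall(r"<table\b.*?</table>", html, re.IGNORECASE | re.DOTALL)
    candidates = []
    for t in tables:
        rows = re.findall(r"<tr\b.*?</tr>", t, re.IGNORECASE | re.DOTALL)
        cols = [len(re.findall(r"<t[dh]\b", r, re.IGNORECASE)) for r in rows]
        if len(rows) >= min_rows and cols and min(cols) >= min_cols:
            candidates.append(t)
    return candidates  # pages with no candidate table are set aside, as described above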
3.2. Schema Extraction
Many approaches for SE have been proposed in the literature, but we have selected the one proposed in [8] because it is quite recent and simple to understand and implement; moreover, it performs comparatively well on web pages that are in the specific format assumed by the approach. Many extensions of this approach are possible, but our main objective is to apply SE for integration purposes, so we use the approach of [8] as such rather than suggesting any enhancement. The selected SE approach assumes that the web table is contained inside <table>, </table> tags and that there are multiple rows between those tags. The first row contains the headings of the columns and the remaining rows contain the data. One special
feature of the approach is that it can manage situations where the header spans multiple rows. In this paper, we do not discuss the SE approach of [8] in detail; interested readers can refer to the actual paper. The preprocessing phase has already placed the clipped web table code in a text file. In this phase, SE is applied to all the web tables stored in text files; from each file, the first row is separated as the header row and the remaining rows as data rows. The header row contains the different column names, which are stored separately in an array: the n column names in the header row of the web table are stored in the first n elements of the array. The remaining rows of the web table are assumed to contain data, so the data values from each row are stored in the next n elements of the array. This process continues till all the rows of the web table have been processed. At the end of this process, we have an array in which the first n elements contain the names of the columns and each subsequent set of n array components contains the attribute values of one record.
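The array layout described above can be sketched in Python as follows; the function name and sample values are illustrative only.

def table_to_flat_array(header_cells, data_rows):
    # First n slots hold the column names; each subsequent block of n
    # slots holds the attribute values of one data row.
    n = len(header_cells)
    flat = list(header_cells)
    for row in data_rows:
        flat.extend((row + [""] * n)[:n])   # pad or trim ragged rows to n cells
    return flat

flat = table_to_flat_array(["Name", "Designation"],
                           [["Dr. A. Khan", "Professor"],
                            ["Mr. B. Ali", "Lecturer"]])
# flat[0:2] -> column names, flat[2:4] -> first record, flat[4:6] -> second record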
3.3. Schema Integration
Schema Integration (SI) is the third phase of our approach (SIWeT), in which the schemas extracted in the previous phase are merged to form a global schema. As mentioned earlier, semantic heterogeneity is the major problem faced in SI. This problem is further amplified when SI is applied to extracted schemas, where we do not have concretely defined schemas but rather schemas extracted from web pages by a semi-automated tool. Such schemas may contain errors that would not exist in properly defined database schemas. For example, an extracted schema may have an inappropriate or invalid data type assigned to an attribute, because data types are assigned on the basis of an evaluation of the data: a web table may contain a date in number format, like ‘021002’ representing October 02, 2002, but the SE approach may assign it a numeric type, seeing all-numeric data. There can be other such issues that make SI in the extracted schema environment more complex than in a traditional database environment.
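The data type problem can be illustrated with a tiny sketch of value-based type inference; the inference rule shown is a deliberately naive example, not the SE approach's actual rule.

def infer_type(values):
    # A column whose values are all digit strings looks numeric,
    # even if the strings actually encode dates.
    if all(v.strip().lstrip("-").isdigit() for v in values):
        return "NUMERIC"          # '021002' lands here despite being a date
    return "TEXT"

print(infer_type(["021002", "150399"]))   # NUMERIC, although these may be dates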
The SI approach in SIWeT comprises applying multiple schema matchers to the extracted schemas and then combining the output of these matchers. First of all, we define a global table for faculty members. This table is defined with the maximum possible attributes that can be relevant to a faculty member. The SI task then becomes finding the most similar attribute in this global table for each attribute in every extracted schema. In order to find the similarity between attributes, SIWeT builds a taxonomy of similar terms using existing resources, like WordNet. In addition to this taxonomy, N-gram and edit-distance matchers are also applied. These matchers return similarity scores between different terms; the scores are averaged, and the pair having the maximum similarity is considered a match. When the corresponding attributes of an extracted schema have been found within the global table, the data from the web table will be inserted into the global table under the attributes found similar to the attributes of the extracted table, along with two additional attributes holding the university ID and department name. This process will be applied to all the web tables extracted from different universities, and we will have a global table containing data from many universities at a single place.
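The matcher-combination step can be sketched as follows. Here taxonomy_sim is a hypothetical placeholder for the WordNet-based taxonomy lookup, the synonym pairs and global attributes are invented examples, and difflib's ratio stands in for a normalized edit-distance score.

from difflib import SequenceMatcher

def ngram_sim(a, b, n=2):
    # Jaccard overlap of bigram sets
    g = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = g(a.lower()), g(b.lower())
    return len(ga & gb) / max(len(ga | gb), 1)

def edit_sim(a, b):
    # stand-in for a normalized edit-distance similarity
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def taxonomy_sim(a, b):
    # placeholder: 1.0 if the taxonomy lists the terms as synonyms, else 0.0
    synonyms = {frozenset({"designation", "rank"}), frozenset({"name", "staff name"})}
    return 1.0 if frozenset({a.lower(), b.lower()}) in synonyms else 0.0

def best_global_attribute(extracted_attr, global_attrs):
    # Average the matcher scores and return the most similar global attribute.
    def combined(g):
        return (taxonomy_sim(extracted_attr, g)
                + ngram_sim(extracted_attr, g)
                + edit_sim(extracted_attr, g)) / 3
    return max(global_attrs, key=combined)

global_schema = ["Name", "Designation", "Highest Degree", "Research Interests"]
print(best_global_attribute("Staff Designation", global_schema))  # -> "Designation"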
We plan to build a web application that lets users query this table using simple keywords or SQL queries. This will be a great source of information for researchers and other interested users looking for relevant data.
4. Conclusion
In this paper, we have proposed the application of schema integration approaches to schemas extracted from web tables; this work is thus basically a merger of two research areas, namely SE and SI. The merger extends both areas: SE has mostly been applied to a single site in the literature, whereas we are applying it to multiple sites. Our approach also extends SI research, as that process has mainly been applied in database environments
where properly defined database schemas, defined through DBMSs, are available, whereas we are applying it to extracted schemas. This will help to establish new dimensions in SI.
In future, we plan to implement our approach in a real environment by extracting data from a large number of web tables and merging them into a single table. The SE approach that we have adopted works on web tables in a specific format; there are many other web table formats to which this approach cannot be applied, and it can be extended to handle them. Furthermore, many web tables store multiple attributes in a single column; we need to evaluate the data extracted from such a column to identify the relevant attributes.
REFERENCES
[1] Ferrara, E., De Meo, P., Fiumara, G.,
Baumgartner, R. (2014). Web data
extraction, applications and techniques:
A survey. Knowledge-based systems, 70,
301-323.
[2] Cohen, W. W., Hurst, M., Jensen, L. S.
(2002). A flexible learning system for
wrapping tables and lists in HTML
documents. In Proceedings of the 11th
international conference on World Wide
Web (pp. 232-241).
[3] Kushmerick, N. (2000). Wrapper
induction: Efficiency and
expressiveness. Artificial
Intelligence, 118(1-2), 15-68.
[4] Adelfio, M. D., Samet, H. (2013).
Schema extraction for tabular data on the
web. In Proceedings of the VLDB
Endowment, 6(6), (pp 421-432).
[5] Zeeshanuddin, S. (2011) A library of
schema matching algorithms for
dataspace management systems.
[6] Zhai, Y., Liu, B. (2005). Web data
extraction based on partial tree
alignment. In Proceedings of the 14th
international conference on World Wide
Web (pp. 76-85).
[7] Nagy, G., Embley, D. W., Seth, S.
(2014). End-to-end conversion of HTML
tables for populating a relational
database. In 11th IEEE International
Workshop on Document Analysis
Systems (DAS) (pp. 222-226).
[8] Purnamasari, D., Wicaksana, I. W. S.,
Harmanto, S., Banowosari, L. Y. (2015).
HTML table wrapper based on table
components. International Journal of
Computer Applications in
Technology, 52(4), 237-243.
[9] Lerman, K., Knoblock, C., Minton, S.
(2001). Automatic data extraction from
lists and tables in web sources. In IJCAI-
2001 Workshop on Adaptive Text
Extraction and Mining (Vol. 98).
[10] Gultom, R. A., Sari, R. F., Budiardjo, B.
(2011). Proposing the new Algorithm
and Technique Development for
Integrating Web Table Extraction and
Building a Mashup. Journal of
Computer Science, 7(2), 129-136.
[11] Lukyanenko, R., Evermann, J. (2011). A
Survey of Cognitive Theories to Support
Data Integration. In Proceedings of the
Seventeenth Americas Conference on
Information Systems, All Submissions.
Paper 30.
[12] Evermann, J. (2009) Theories of
meaning in schema matching: An
exploratory study. Information Systems
34(1), 28-44.
[13] Shvaiko, P., Euzenat, J. (2005). A survey of schema-based matching approaches. Journal on Data Semantics IV (pp. 146-171). Berlin, Heidelberg: Springer-Verlag.
[14] Cohen, W., Ravikumar, P., Fienberg, S. (2003). A comparison of string metrics for matching names and records. In Proceedings of the KDD Workshop on Data Cleaning and Object Consolidation (Vol. 3, pp. 73-78).
[15] Do, H. H., Rahm, E. (2001). COMA - a system for flexible combination of schema matching approaches. In Proceedings of the Very Large Data Bases Conference (VLDB) (pp. 610-621).
[16] Giunchiglia, F., Shvaiko, P., Yatskevich, M. (2004). S-Match: an algorithm and an implementation of semantic matching. In Proceedings of the European Semantic Web Symposium (ESWS) (pp. 61-75).
[17] Madhavan, J., Bernstein, P., Rahm, E. (2001). Generic schema matching with Cupid. In Proceedings of the Very Large Data Bases Conference (VLDB) (pp. 49-58).
[18] Valtchev, P., Euzenat, J. (1997). Dissimilarity measure for collections of objects and values. Lecture Notes in Computer Science, 1280, 259-272.
[19] Shasha, D., Wang, J. T. L., Giugno, R. (2002). Algorithmics and applications of tree and graph searching. In Proceedings of the Symposium on Principles of Database Systems (PODS) (pp. 39-52).
[20] Noy, N., Musen, M. (2001). Anchor-PROMPT: using non-local context for semantic matching. In Proceedings of the IJCAI Workshop on Ontologies and Information Sharing (pp. 63-70).
[21] Bouquet, P., Serafini, L., Zanobini, S. (2003). Semantic coordination: A new approach and an application. In Proceedings of the International Semantic Web Conference (ISWC) (pp. 130-145).
[22] Aumüller, D., Do, H. H., Massmann, S., Rahm, E. (2005). Schema and ontology matching with COMA++. In Proceedings of the International Conference on Management of Data (SIGMOD), Software Demonstration.