293

Acta Polytechnica CTU Proceedings 2(1): 293–296, 2015

293

doi: 10.14311/APP.2015.02.0293

Non Relational Models for the Management of Large Amount of
Astronomical Data

B. L. Martino1, M. Federici 2

1Associated INAF IAPS
2Istituto di Astrofisica e Planetologia Spaziali, INAF IAPS Via fosso del Cavaliere 100, 00133 Roma, Italy

Corresponding author: brunolmartino@iasi.cnr.it

Abstract

The objective of this paper is the comparison between two different database typologies: the relational and the non-

relational architecture, in the context of the applications related to the use and distribution of astronomical data. The

specific context is focused to problems quite different from those related to administrative and managerial environments

within which were developed the leading technologies on which are based the modern systems of massive storage of

data. The data provided by astronomical instrumentation are usually filtered out by the front-end system (trigger,

anticoincidence, DSP etc.), so they do not require special controls of congruence. Moreover, the related storage systems

must be able to ensure an easy growth, minimizing human systemistic interventions and automating the related actions.

The use of a non-relational architecture (NoSQL), offers great advantages during the insertion of informations within a

data base, while the response speed of the queries is mainly tied to their type and complexity.

Keywords: DBMS - MongoDB - NoSQL - MySQL - GSC catalog.

1 Using DBMS

A careful planning of the use of a file system allows to
store informations in a rational way but, whatever the
criterion used to organize an archive based only on files,
it is not possible to build a search system that can guar-
antee sufficient flexibility. The user of a database, con-
ceived in this manner, is required to know in detail its
structure and it’s organization. Almost any application
focused on the analysis of astronomical data may read
data in FITS format (Wells et al., 1981) FITS format
allows to add to a collection of data a set of additional
information used to allow their better characterization.
A DBMS (Data Base Management System) is a soft-
ware infrastructure designed to operate on large data
sets with the goal of optimizing the data:

• storage

• access

• sharing

• protection
A DBMS allows, through the use of its command lan-
guage, the imposition of constraints of consistency, the
creation of indexes to improve performance and re-
trieval of data independently from their physical rep-
resentation. Well-known examples of the use of ad-
vanced databases in high-energy physics are represented

by SPIRES (Bourne et al., 2003), OPERA (Agafonova
et al., 2009) and !CHAOS (Bisegni, 2012) and is con-
solidated practice their adoption as a storage medium
in the ground segment.

1.1 Relational models

The relational databases today are the most popular;
the model they are based it was proposed by Codd
(1970). The relational model allows access to data at
the logical level by providing a complete independence
from their physical organization. The logical represen-
tation of the data in the relational model is based on the
concept of relationship, in algebraic terms; it is common
to use the term ”table” in place of ”algebraic relation-
ship” and the term ”relationship” to indicate an as-
sociation between the data. The language SQL (Date
et al., 1997) is the standard language ”de facto” for
defining, manipulating and interrogation of relational
databases; this is a declarative language and not pro-
cedural. In the relational model a logical unit of work
is defined transaction and is constituted by a sequence
of operations of reading and writing, which must meet
some properties, known as ACID property (from Atom-
icity, Consistency, Isolation, Durability). Atomicity: a
transaction is an atomic unit of operations, validated or
canceled depending on whether or not they reach a suc-
cessful conclusion (rollback / commit); Consistency: at

293

http://dx.doi.org/10.14311/APP.2015.02.0293


B. L. Martino, M. Federici

the end of the transaction if the initial state is correct,
even the final state must be; Isolation: the action of a
transaction should not interfere with each other; Du-
ration: the effects of a transaction must be persistent;
The most popular relational databases are ORACLE
(Kunh, 2010) and MySQL (Schwartz, 2012)).

1.2 Non relational models

Recently, have been developed a series of new DBMS
systems, to provide an high horizontal scalability, in or-
der to achieve high performance in the read / write op-
erations of database distributed across multiple servers
geographically delocalized (cloud). Many of these new
systems are called NoSQL data stores. The definition of
NoSQL, which stands for ”Not Only SQL” or ”Not Re-
lational”, was used for the first time in 1998 for an open
source relational database. The non-relational systems,
do not attempt to provide the classic ACID guarantees,
typical of relational databases, but embrace the model
B.A.S.E. (Basic Available, Soft-state, Eventual consis-
tency) where some constraints are relaxed: Basic Avail-
able: the database can operate even if a part is no longer
available; Soft-state: the system status may change
over time even in absence of input data; Eventual con-
sistency: the data may not be updated immediately,
but will be consistent throughout the system within a
finite time As stated by the CAP theorem (Lynch et
al., 2002) can not be achieved Consistency, Availability
and Partition tolerance at the same time but only two
of these features at a time. The systems based on non-
relational architecture follow the BASE model and are
those that allow to overcome the major limitation of the
RDBMS (Relational Data Base Management System):
the scalability. In many areas, consistency and/or avail-
ability offered by relational databases are not essential,
e.g. in astronomy and astrophysics applications. In a
highly available and tolerant partitioning system, alter-
ation of data base will reach all nodes not instantly but
within a finite time, if a reading is done on a node is
not synchronized with the last write, it returns the last
valid data (stale data).

2 DBMS NoSQL

A property of NoSQL database is to be free of pattern
(schemaless), with consequent advantages and disad-
vantages like ease of deployment, but sometimes, more
difficulties to construct complex queries. The non-
relational logical models not have the same expressive
power of the relational model and can be classified into
four main families: key-value, column-family, document
store and graph. The most common are the MongoDB
NoSQL architecture (Chodorow, 2011), HBase (George,
2011) and Cassandra (Lakshman et al., 2010); their
characteristics are summarized in Figure 1. MongoDB

is a document-oriented database, which is based on ag-
gregates, which may have a structure with multiple hi-
erarchical levels and groups that can be variously in-
dexed. MongoDB uses the JavaScript language, which
is inherently single-threaded while Cassandra uses the
language CQL (Cassandra Query Language, simplified
variant of SQL), which allows the management of dis-
tributed databases. Both rely on the file system of the
machine on which they were installed without introduc-
ing data abstractions. HBase is a distributed database
(based on the project Hadoop (Sammer, 2012) writ-
ten in Java, that uses storage devices located on dif-
ferent hosts also geographically distributed, intercon-
nected via networks. HBase uses HDFS, a portable and
scalable distributed file system initially developed for
the framework Hadoop.

Figure 1: Key features of NoSQL architectures.

To carry out complex operations is used the MapReduce
paradigm (Dean et al., 2004), which allows to divide the
computation in many elaborations of lower complexity
to achieve the processing of large datasets in parallel on
multiple cores/CPU/computer.

3 MySQL vs MongoDB

In order to evaluate the possible use with advantage
of non-relational architectures in astronomy and astro-
physics area we realized the MDBirs system (showed
in Figure 2), composed by 10 PC, equipped with In-
tel i7, 16Gbytes of RAM, 2 HD 1Tbytes connected via
ethernet LAN at 1Gbit of speed

Figure 2: The MDBirs System.

294


Non Relational Models for the Management of Large Amount of Astronomical Data

In order to assess objectively the behavior of the sys-
tem tests were carried out so as not to make use of the
advanced characteristics of the query languages used.
For this reason it was chosen as the working set of data
the GSC catalog (Stars General Catalog). GSC is com-
posed of a single table and is used primarily to provide
support for the planning and guiding stars of the HST
observations (Dalcanton, 2009), the JWST (Gardner at
al., 2006), Gaia (Busso et al., 2012) and some ground-
based telescopes of new generation. Furthermore, the
services provided by the machines on which they are
installed databases are not affected by processes (user
or system) not strictly essential to their functioning
and the software installed on the machines which host
the test has been aligned to the same versions. The
databases under test are the two most widely used open
source architectures in the world, mySQL for its versa-
tility, speed and diffusion and MongoDB for its attitude
to the horizontal growth (scalability) and its robustness.
Starting from the data available, have been generated
some of the series of samples of increasing amplitude,
so as to highlight the behavior of the two systems both
with regard to populate the database, than to retrieve
the information of interest. The test was performed us-
ing queries significant from the point of view of the com-
munity of users, which allow to highlight the behavior
of databases in real operating conditions. In particular:

• query1: selection on HEALPix (mapping system
applied to the celestial sphere)

• query 2: selection of objects in a spatial region
which satisfy some conditions in magnitude and
color

• query 3: selection of non-stellar objects which
meet a condition in color

• query 4: computing of the average magnitudo,
resolution of one degree square

4 Test Results

The database engine installed by default in MySQL
since version 5.0 (InnoDB) is transactional, ACID com-
pliant and uses the row-level locking strategy (con-
straint of exclusive use for the time necessary to perform
the required actions). The data entry test was made
from data provided in CSV (text data whose fields are
marked by the separator character comma.) As shown
by the graph in Figure 3, the behavior of MongoDB
is faster than mySQL, thanks to the absence of checks
carried out on the integrity of the data; information are
stored in files as key-value pairs.

Figure 3: Insert performances vs data set size.

Regarding the execution time of the test queries, the
results are shown in the graph of Figure 4 and are sub-
stantially comparable. In the case of the GSC catalog ,
the occupation of the relevant MySQL table is about
116 Gbytes of data collection while the correspond-
ing MongoDB occupies about 657 Gbytes (the size is
5 times bigger). The tables were not associated with
indexes, so searches are performed on the whole set of
data with a time dependent to amplitude of the set it-
self.

Figure 4: Queries performances vs data set size.

5 Future Activities

MongoDB is a good choice to obtain good safety and
performances using local data sharding. Our goal is to
achieve a full geographic data delocalization (CLOUD)
in order to obtain: continuity of access to data (total
disaster recovery) to the user community optimized av-
erage access time We evaluated two possible scenarios:
HBase: Hadoop Based NoSQL Data Base Cassandra:
The Apache Cassandra database HBase is a master-
slave system including two types of machines: HMas-
ter: access control HRegionserver: local data replica-
tion By using the Stargate plugin can be achieved local
caches able to speed up response times by minimizing
the network traffic to the master. Every region server
keeps a copy of the data so the system ensures very

295


B. L. Martino, M. Federici

high reliability. It is a NoSQL solution initially devel-
oped by Facebook which has in the P2P architecture its
focal point.

6 Conclusions

MongoDB has proven much more efficient and fast in
data entry and is particularly suited to the manage-
ment of flows of data to be stored without downtime.
In the configuration we used the rate of loading is about
15000 rows/sec (the average rows length is 256 bytes).
By using more powerful hardware such as high speed
network interfaces (10 Gbit/sec or higher) it’s possible
to greatly increase the speed of data acquisition. Re-
garding the search of data in the database the results
show a behavior strongly dependent on the number of
records and the presence or absence of critical elements
such as complex calculations. The command language
of MongoDB does not contain any advanced mathemat-
ical primitives as in the case of MySQL, which can rely
on a large library of mathematical functions. The use
of MongoDB on structures designed to be handled by
a RDBMS is inappropriate because it does not exploits
the potential of its schemaless organization (nosql).

Acknowledgement

We thanks Paola Marrese by ASDC-ASI for providing
us the test data, , the Director of IAPS-INAF Prof
Pietro Ubertini, and a special thanks to Franco Gio-
vannelli and his staff for allowing us to participate to
this workshop

References

[1] Wells, D. C., Greisen, E. W., and Harten, R. H.:
1981, A Flexible Image Transport System.
Astronomy & Astrophysics Supplement Series,
44, 363-370.

[2] Bourne, C. P., Hahn, T. B.: 2003, A history of
online information services, 1963-1976, MIT press.

[3] Agafonova, N., et al.: 2009, The detection of
neutrino interactions in the emulsion/lead target
of the OPERA experiment, Journal of
Instrumentation.
doi:10.1088/1748-0221/4/06/P06020

[4] Bisegni, C.: 2012, Thesis in Computer Science,
University Tor Vergata, Rome.

[5] Codd, E.F.: 1970, A relational model of data for
large shared data banks, Communications of the
ACM 13 (377-387). doi:10.1145/362384.362685

[6] C. J. Date, C. J., Darwen, H.: 1997, A Guide to
the SQL standard: a users guide to the standard
database language SQL, Addison Wesley.

[7] Kuhn, D.: 2010, Pro Oracle Database 11g
Administration, Apress.
doi:10.1007/978-1-4302-2971-1

[8] Schwartz, B.: 2012, High Performance MySQL:
Optimization, Backups, and Replication, O’Reilly.

[9] Lynch, N., Gilbert, S.: 2002, Brewer’s conjecture
and the feasibility of consistent, available,
partition-tolerant web services, ACM SIGACT
News, Volume 33 Issue 2, pg. 51-59.

[10] Chodorow, K.: 2011, Scaling MongoDB, O’Reilly.

[11] George, L: 2011, HBase: The Definitive Guide,
O’Reilly.

[12] Lakshman, A., Malik, P.: 2010, Cassandra: a
decentralized structured storage system, ACM
SIGOPS Operating Systems Review, vol. 44,
no.2. doi:10.1145/1773912.1773922

[13] Sammer, E.: 2012, Hadoop Operations, O’Reilly.

[14] Dean, J., Ghemawat, S.: 2004, MapReduce:
Simplified Data Processing on Large Clusters, in
OSDI’04: Sixth Symposium on Operating System
Design and Implementation, San Francisco, CA,
December.

[15] Dalcanton, J.J.: 2009, 18 years of science with the
Hubble Space Telescope, Nature Jan
1;457(7225):41-50.

[16] Gardner, J. P., at al.: 2006, The James Webb
Space Telescope, Space Science Reviews.
doi:10.1007/s11214-006-8315-7

[17] Busso, G., De Angeli, F., Montegriffo, P.: 2012,
The GAIA photometric data processing, in Proc.
SPIE 8442, Space Telescopes and Instrumentation

296

http://dx.doi.org/10.1088/1748-0221/4/06/P06020
http://dx.doi.org/10.1145/362384.362685
http://dx.doi.org/10.1007/978-1-4302-2971-1
http://dx.doi.org/10.1145/1773912.1773922
http://dx.doi.org/10.1007/s11214-006-8315-7

	Using DBMS
	Relational models
	Non relational models

	DBMS NoSQL
	MySQL vs MongoDB
	Test Results
	Future Activities
	Conclusions