Designing a New Raster Sub-System for GRASS-7 Martin Hruby IT4Innovations Centre of Excellence Brno University of Technology Božetěchova 2, BRNO, 61266, Czech republic hrubym fit.vutbr.cz Keywords: GRASS, GIS, raster sub-system, geographical data, 3D rasters Abstract The paper deals with a design of a new raster sub-system intended for modern GIS systems open for client and server operation, database connection and strong application interface (API). Motivation for such a design comes from the current state of API working in GRASS 6. If found attractive, the here presented design and its implementation (referred as RG7) may be integrated to the future new generation of the GRASS Geographical Information System version 7-8. The paper describes in details the concept of raster tiling, computer storage of rasters and basic raster access procedures. Finally, the paper gives a simple benchmarking experiment of random read access to raster files imported from the Spearfish dataset. The experiment compares the early implementation of RG7 with the current implementation of rasters in GRASS 6. As the result, the experiment shows the RG7 to be significantly faster than GRASS in random read access to large raster files. Introduction and motivation GRASS [1] is a well-known open-source GIS system with a long history, wide range of included analytical tools and rather large community of users. Most of the users cooperate on its development, they comment its quality, do report its bugs and some of them implement their own system or analytical tools in form of GRASS independent modules. The main GRASS’es strength is traditionally supposed to be in the raster analysis and so, rasters should be something very well working in GRASS. Unfortunately, the current raster Application Interface (API) of GRASS-6 does not offer functions which would comfortably support a development of complicated raster analytical algorithms. Moreover, the current raster core probably will not process large raster files efficiently. This paper is giving an alternative which, if found reasonable, might replace the current raster core with this new one – referred as RG7 (Rasters for GRASS-7) in this paper. There are several open source initiatives and traditional software packages oriented to pro- cessing of raster files. Let us mention at least the famous GDAL [5] (eventually GDAL/OGR), Postgis Raster [6] and Raster3D [7]. The GDAL library is mostly a collection of storage for- mats and a very large software complex. Postgis gives mostly a storage functionality with a certain support for raster data manipulation and analysis – however, via SQL statements, which is an opposite approach to RG7. The Raster3D library is the only one software in- tended strictly to GRASS and, currently, is still in development. Comparing to them, the RG7 library has got a different motivation which comes mainly from the user’s perspective. The main idea is to provide a very abstract view to the raster data, moreover, to provide it Geoinformatics FCE CTU 2011 11 Hruby M.: Designing a New Raster Sub-System for GRASS-7 in highly efficient way. And, at second, to keep the library code as simple as possible. The paper is going to discuss the Application Interface giving the user a set of functions operating the user data (raster map layers), the format of storing raster data and the whole architecture of the raster sub-system. Let us begin from the current state of the GRASS, it means, let us summarize the current abilities of raster API in GRASS-6 [2] (and consequently the architecture of the current raster core): • Set of functions for opening and closing an existing raster or creating a new raster file. • A function (G_get_map_row(...)) for reading a specified row from a raster layer. • A function (G_put_map_row(...)) for writing a specified row to a raster layer. • Other support routines implementing the metadata including an eventual reclass table. Clearly, any API must contain functions for opening, creating and closing the raster data. As a new feature, we may discuss a multi-user access to the data, various rights to access the data, remote access to the data (e.g. in form of a GRASS file server), transactions an so on. System of reading and writing a row of raster is the main point of criticism of the current state of API. The problem has two levels of discussion: at first, user’s approach to the data and at second, system implementation of such a call. In the current state, an user obtains a particular row of the raster no matter what part of that he needs. This kind of a sequential approach is sufficient for rather primitive tools (r.mapcalc, r.in/out, etc.), but let us imagine an analytical algorithm requiring a random access (or better say – an index-sequential access) to the whole raster layer. A developer implementing such a tool has to create his own library over the GRASS raster API caching the lines and serving as a necessary abstraction over the API. As I suppose, we would rather like let the developer concentrate himself on the core algorithm of his application. On the other hand, the current GRASS raster sub-system together with the set of raster tools is nowadays optimized for such an approach. It must me told in the very beginning: the here presented approach is not generally better than the current one, but it has a good chance to be a good advantage for future GRASS’es developmental grow. This paper presents some sort of an opposition to the current row access approach, even if some developers claim that this row-approach is natural for a large part of raster analytic algorithms. Some comparisons and contemplations are put in the section "Experimental results". Let us now give the features of the RG7 raster sub-system: • Support for extremely large raster files with a guarantee of constant time access to any part of the layer, i.e. 1) any part of the raster is accessible in the same time as another one, 2) size of the raster has no influence to the access time (or just marginal). • Native support for multiple data formats and storing the rasters in a SQL database (for example in PostgreSQL [3] or SQlite3 [4]). • Real multi-user access and GRASS ready to work as a file server. • Integrated 3D-raster concept. Geoinformatics FCE CTU 2011 12 Hruby M.: Designing a New Raster Sub-System for GRASS-7 • Comfortable API ready to give any sort of view to the data. Development of user raster application on a very high abstraction level. • Raw raster data and other information packages (mainly the set of various metadata) operated in a unique manner and stored in an unique place, where SQL databases are preferred1. Processing the extremely large rasters (size of gigabytes) together with any user-specified sliding window over the raster needs a special design of a store format and quick access to sub-region within the file. This is why the RG7 is based on raster tiles. Every raster file available through RG7 will be physically decomposed to a grid of so called physical tiles. This is the main difference to the current GRASS raster concept – not the row, but a tile is the raster atomic element of data. A tile is a block of fixed size no matter what size has got the raster file, i.e. raster files with long rows do not influence the complexity of their processing. The advantages and disadvantages are discussed in the section "Conclusion of the experiments". It will be shown, that with the RG7 design, the technical approach of storing the data is absolutely independent to its logical tile representation. Moreover, many storing principles can be used, e.g. storing in a classical file managed by an operating system (referred as a OS-file), in a SQL database or virtualized via network. The main interest of the RG7 design is focused into the set of functions making an interface between an user and the GRASS kernel. Simplifying the style of raster programming might encourage new users to bring their algorithms to the GRASS code. Technical and geographical definition of the tiles Definition of a tile A tile is a geographically defined sub-region within a raster layer. A raster file itself is defined by its boundary coordinates and number of cells in north-south and east-west directions. Similarly, a tile is defined by its boundary coordinates and number of cells in both directions. By decomposing the raster to a set of tiles, we get an organized system of small data elements (tiles) which are easy to operate. A tile becomes as an almost regular grid decomposition of a raster file, as it is shown in Figure 1. This system of recursive decomposition making a spatial index is called the Quad-Tree (let us assume this term to be a common knowledge in the GIS community). As we may see bellow, there are several ways of defining a tile regarding its size. There are generally two approaches to raster file decomposition: • Regular - all tiles are of the same size and the size is equal in both directions (the tile has a square shape). Constructing a spatial index in form of a Quad-Tree is not necessary to navigate through the raster. An optional implementation of raster pyramids is logical and easy. 1Motto: no files, no troubles. Geoinformatics FCE CTU 2011 13 Hruby M.: Designing a New Raster Sub-System for GRASS-7 – Advantages - mapping a geographical coordinate to a particular tile has a constant time access, it is easy to implement and operate. – Disadvantages - not very practical as the input raster must have a square shape and its number of raster elements in both directions (cells) must be 2n, where n ∈ N. • Irregular - the file is decomposed using a Quad-Tree procedure with a minimum tile attribute (see bellow). It may be applied to all existing raster files as the limitation of regularity is not required here. An optional implementation of raster pyramids is still possible. – Advantages - no limits for rasters in the meaning of their size. – Disadvantages (time issue) - each tile is of different size. Mapping a geographical coordinate to a particular tile must be done through the Quad-Tree spatial index, i. e. in logarithmic time complexity. This computational overhead can be successfully minimized in the internal algorithms of RG7, for example by using an explicit topological links among the neighbor tiles. – Disadvantages (space issue) - number of tree nodes (i.e. virtual tiles) grows expo- nentially with height of the tree. The height of tree depends on size of the raster and size of the minimum tile. Clearly, the requirement of having all files of size in 2n is not very practical. Therefor, the RG7 concept strongly assumes the files of any size and implements the irregular tiling. Figure 1: A demonstration of the irregular Quad-Tree Decomposition of a raster to a tree of virtual and physical tiles In RG7, size of a tile is not fixed, i.e. is not equal for all tiles, and it is influenced by so called minimum region. Four tiles is the minimum number of tile decomposition, i.e. having a raster (or generally any region) with countX × countY cells and minimum region of size mregX × mregY cells, the raster is split into a quadruple of virtual sub-tiles only if (1) Geoinformatics FCE CTU 2011 14 Hruby M.: Designing a New Raster Sub-System for GRASS-7 holds. If the condition (1) is false, the region countX × countY is not split and remains as a physical tile (see Figure 1 where the red rectangle defines the minimum region for further decomposition). The regions which allow their decomposition are then called the virtual tiles. countX ≥ 2 ·mregX ∧ countY ≥ 2 ·mregY (1) The raster spatial index is based on Quad-Trees, where the tree nodes are are referred as virtual tiles and leafs (non divisible regions) are the physical tiles. See the Algorithm 1. The condition (1) also causes every physical tile to be generally larger then the specified minimum region, thus the minimum region does not denote a wanted size of a tile, it just define the stop condition of the decomposition. Each tile (virtual or physical) is addressed by a prefix code as described in Figures 2. In Figure 2, the top-left tile is labelled 0 and when this one tile is decomposed further (see Figure 2 on right), a prefix 0− is added. Algorithm 1 Decomposition of a region to a tree of sub-regions class Ras_vtile { Ras_vtile(Region reg) : region(reg) { ... } ... void decompose(ICoord mreg) { phtile = 0; if (region.countX() >= 2*mreg.x && region.countY() >= 2*mreg.y) { int mx = region.countX()/2; int my = region.countY()/2; subtiles[0] = new Ras_vtile(region.sub_region(0, my, mx, region.countY())); subtiles[1] = new Ras_vtile(region.sub_region(mx, my, region.countX(), region.countY())); subtiles[2] = new Ras_vtile(region.sub_region(0, 0, mx, my)); subtiles[3] = new Ras_vtile(region.sub_region(mx, 0, region.countX(), my)); for (int a=0; a<4; a++) subtiles[a]->decompose(mreg); } else phtile = new Ras_phtile(region); } Region region; Ras_vtile *subtiles[4]; Ras_phtile *phtile; } In such a convention, each physical tile has got its label and using that label, one can find the tile in a quad-tree hierarchy. The label is also an index key to identify the tile in a database storage. Geoinformatics FCE CTU 2011 15 Hruby M.: Designing a New Raster Sub-System for GRASS-7 0- 1- 2- 3- 0-0 0-1 1-0 1-1 0-2 0-3 1-2 1-3 2-0 2-1 3-0 3-1 2-2 2-3 3-2 3-3 Figure 2: Addressing tiles in a decomposition of a raster (depth 1 and 2) Architecture of the raster sub-system Having defined the concept of Quad-Tree for RG7, we may proceed to description of the RG7 itself. The RG7 sub-systems consists of the following modules, each responsible for certain functionality: • Physical module (PH) – PH implements a direct low level access to raster files. It creates a spatial index tree, reads and writes particular tiles. PH also implements a memory cache over tiles. This module has the major influence to practical run-time efficiency of RG7. • Connection module (CN) – CN establishes a connection between a particular raster application and a physical module. CN implements a kernel which controls the user access to the files. CN also manage a possible multi-user (or multi-application) access to the files. • Raster window module (RW) – RW implements an user specified view to a raster file. RW contains algorithms of raster resampling and presentation. RW specifies the user API in various formats, i.e. at least the full RG7 C++ API and GRASS standard-like C API. • Raster application modules (AM) – AM is a set of user implemented raster analytical modules using RG7 API at a CN or RW level. The RG7 API shall connect users through the CN and RW modules. Direct access to PH module is not assumed for a standard user. The direct access is allowed only for debugging and benchmarking purposes. The whole design is prepared for object-oriented (OO) implementation in C++. Object oriented C++ is preferred for its high level of abstraction and support of semi-standard C++ libraries (Boost, STL). To keep backward compatibility with GRASS-6 raster API, a certain part of RG7 API will contain headers in classical C language. There is absolutely no technical problem in combining a C and C++ code in one software package. Moreover, in my personal opinion, the dogma of keeping GRASS in pure C might be a serious limitation in GRASS’es future development. Geoinformatics FCE CTU 2011 16 Hruby M.: Designing a New Raster Sub-System for GRASS-7 The Physical module of RG7 Definitions of the basic geometrical objects Let XCoord denotes an abstract class implementing a 2D point in a plane2. Let X and Y are the two components of the coordinate – X representing the eastings and Y representing the northings. Then, let Coord is defined in floating point numbers and ICoord in integers. Coord expresses a geographical coordinate (in any coordinate system) and ICoord just a relative coordinate, mostly an index to the raster grid. Region will denote a geographical rectangular space given by its left-bottom corner (LB:Coord) and right-top corner (RT:Coord). Surely, speaking about LB and RT is equal to giving four numbers expressing north, south, east and west boundaries of the region. Rasterization of the region is given by number of cells in east-west dimension – countX() and south-north dimension – countY(). The Region implements a method inside(Coord i) returning true, resp. false, if a point i is geographically inside the region, resp. if it does not. Basic definitions - A physical tile A physical tile (Ras_phtile in the code) is geographically defined by its region. The raster data contents is stored in a 2D array, in a matrix object, of dimension region.countX() × region.countY() cells. Other internal attributes are not interesting at the moment. Let us see the main I/O methods of the physical tile. The functionality of Ras_phtile is mainly in these methods: • wphys(ICoord i, dtype v) – writes a value v to the matrix at i position. • rphys(ICoord i) – returns a value at i position. • write_to_compressed_ba – the function outputs a serialised stream of matrix contents to a compressed byte array (compressed using run-length method). The compression methods are subjects for further work and thus kept simple in the current design. • read_from_compressed_ba – loads and decompresses the serialized stream to the matrix object. • allocate() and deallocate() the matrix object. The existence of data in physical tiles is only virtual and the tile’s contents is loaded just in the moment of its demand. By allocating the matrix(Ras_phtile), we mean allocating a computer memory for matrix object. By deallocating we mean giving the memory back on heap. Virtuality of the raster data contents The attribute matrix of Ras_phtile consumes a non-trivial part of computer memory. When a raster file is opened, RG7 automatically creates its tree representation made of virtual tiles and physical tiles. But no data is loaded from the database storage yet and so, no physical tile allocates a memory for the matrix buffers. 2It might be extended with the 3rd dimension very soon. Geoinformatics FCE CTU 2011 17 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Ras_phtile object representing a particular tile stores the raster contents only virtually until its allocate() method is invoked. The method allocates a memory for matrix and the tile’s contents may be loaded (read_from_compressed_ba). When the tiles is not needed, its deallocate() method frees the matrix memory. Clearly, before deallocating the buffer with some write changes, the tile’s contents must be stored in the database. This memory management approach is going to be described bellow in further details. The Physical module in C++ classes Let us now introduce the main C++ classes making the Physical module of RG7 (See Figure 3): • Ras_phys_file – manages a spatial index of a file and physical tiles. This class is abstract in the meaning that it just manages the tiles without any direct link to their physical storage (this is done through the following Ras_interface class). The Ras_phys_file class manages an amount of memory used by tiles via calls allocate() and deallocate() (see section "Memory management"). • Ras_interface – holds metadata (an instance of Ras_metadata) for a file and imple- ments its particular data format. As it will be described bellow, the RG7 functionality may be extendible right through these interfaces. RG7 may implement interfaces for storing the data in files, SQL databases or even in network services. • Ras_metadata – implements a storage of metadata for a raster file. The metadata contains a lot of raster’s attributes, but at least two: region specifying the raster’s geometry and minimum_region determining the raster’s decomposition to tiles. The Ras_phys_file instance represents a single opened raster file. When opening a raster file, the RG7 system instances a Ras_interface relevant for the raster file (depending on its particular data format). An instance of Ras_interface (simply the interface or iface) is re- sponsible for the input and output of raster’s metadata (at least region and minimum_region) and for all I/O operations over tiles. Then, an instance of Ras_phys_file is created (and given the interface). Having region and minimum_region, the Ras_phys_file can establish the quad-tree representing the spatial-index (Algorithm 1). By opening a raster, the RG7 only constructs relevant data structures in computer memory (objects Ras_phys_file, Ras_interface, Ras_metadataand the spatial index). Loading the raster contents (the map itself) depends then on user’s requirements. In multi-user (multi- application) access, the Ras_phys_file is instanced only once in the RG7 kernel. Memory management over the physical tiles As it has been already mentioned, a given raster is split to a grid of physical tiles which are supposed to contain the relevant data. Certainly, the whole raster will probably not fit the computer memory, therefor some sort of memory management must be designed. This memory management is going to be very similar to a well-known system of virtual memory managed by todays operating systems (virtual and physical memory pages). Geoinformatics FCE CTU 2011 18 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Figure 3: UML overview on the Physical module classes A physical tile may be in one of two possible states (relevant for the tile’s memory consump- tion/reservation) – allocated or deallocated, i.e. consuming computer memory or not. When the user requests a raster attribute at some geographical coordinate, this request is translated to its logical coordinates which identify a particular tile p and local index lc. Then: 1. If p is deallocated at the moment, Ras_phys_file allocates a memory for p, and asks its interface (iface) to load the tile from its disk storage. Then, p is allocated. 2. If p is allocated, p.rphys(lc) is returned. Certainly, not all tiles can be allocated at the same moment due some operation memory limi- tations. For this purpose, Ras_phys_file can be set to keep as maximum max_available_tiles physical tiles in allocated state (as default, max_available_tiles is set -1 and then there are no limits). As it has been mentioned, the memory management is inspired by the virtual memory management – all physical tiles contain data only virtually and the Ras_phys_file object assigns the memory resources on demand, it means, in a situation when there are already max_available_tiles physical tiles allocated and another tile is required to load its data, one of the current allocated tiles must be deallocated. The algorithm doing such a deci- sion is another problem to discuss. At the current state of RG7, the Ras_phys_file object keeps certain access statistics on tiles and selects the latest accessed tile to be deallocated. Geoinformatics FCE CTU 2011 19 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Virtual storage interfaces One of the most valuable features of the RG7 design is in allowing developers to imple- ment various formats of storing the rasters. Ras_phys_file operates a raster file in an abstract manner invoking its interface for certain basic operations. Every class derived from Ras_interface implementing the following methods may define its own data format: • open() – opens an existing raster file and loads its metadata. • close() – closes all files needed by the raster layer. • create() – creates a new raster layer using the given metadata. • writemetadata() – stores the raster file’s metadata. • loadmetadata() – loads the raster file’s metadata. • swaptile(p)– if the contents of p.matrix was modified, p.matrix is serialized and flushed to disk storage (or any persistent device). Invoked usually when p is selected to become deallocated or when closing the whole file. • swapped_tile_available(p)– returns a boolean saying whether a raster contents of p is present on storage or not. If p is filled with NULL values, there is no need to store that fact on disk. • load_swapped_tile(p)– a tile p loads its data contents from disk storage (p must be allocated before). Having such an interface, one may implement any data format or a way of storing the raster data. For example, these interface definitions are going to be included in RG7 early design: • Ras_interface_sqlite – the main assumed interface for RG7 via SQLite3 [4]. Tile’s matrix contents is stored in an SQL database in BLOB records. • Ras_interface_postgres – similar to Ras_interface_sqlite, but implemented for PostgreSQL. PostgreSQL works as an independent OS process, thus this storage proces- sor might be faster for heavier traffic loads than SQLite3 (especially with multiple-core CPUs). This is an issue for further testing and experience. • Ras_interface_grass6 – an interface providing a compatibility to current GRASS-6 raster storing format. The interface has to read multiple rows to complete a 2D tile, thus it can be efficient only with classical raster row-oriented applications. • Ras_interface_WMS – an interface providing an abstraction over WMS services. WMS raster files are selected as read-only then. • Ras_interface_sfile – all physical tiles are stored in an unique disk file and the interface keeps an offset table in a separate disk file (similar to ESRI Shapefile .shx). Fast and rather easy to implement. Efficient implementation of this physical storage level is essential for the RG7’s read/write per- formance. For this reason, the Ras_interface_sqlite interface defines an unique database index (RasterID ×TileID) for fast searching in the DB storage. Geoinformatics FCE CTU 2011 20 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Raster Metadata in RG7 Metadata are operated through the interface object using its writemetadata() and loadmetadata() methods. The loaded contents is then kept in Ras_metadata objects having the following parts: • Geographical description of the raster – boundary region of the raster (referred as basic region) and its size (number of cells in both directions). • User comments – various text fields inspired mostly in GRASS raster metadata. • Tiling – minimum region (see 2). • Values description – particular null value and the cell’s data type (char, two-byte integer and four-byte float). • Reclassification – reclassification table and reference to an original raster file (may be generally in different format). • Colour palette – reference to a standard implemented palette or an user-specified palette (just for graphical presentation). RG7 does not assume an extra raster storage to keep explicitly the null data, thus there must be one value reserved to specify the NULL contents of a cell. By default, the null value is set to numerical zero. Let us remind that a tile containing only null values in all cells is not stored, i.e. swapped_tile_available(p) interface method should return false. If such a tile is modificated by writing some non-null data, the tile is then swapped when deallocated. The connection module of RG7 The Connection module provides the interconnection between the user application (analytical module) and the internal raster kernel managing the set of active Ras_phys_file. The Connection module consists of two classes: • Ras_kernel – the class is instanced as a singleton RG7. The RG7 object serves the users to open, create and close the requested raster files. For a required file, Ras_kernel returns a particular Raster object making a handle to a particular open raster. • Raster – objects of this class provides a handle to open rasters, i.e. realize required read and write operations to the rasters. The object also gives the complete metadata information. The Ras_kernel and Raster objects are supposed to be an only channel to a particular source of raster data from the user’s perspective (to a local GIS kernel, to a remote GIS file server, etc.). There are some details in multi-user or multi-process accesses in the design to finish. At the moment, let us assume just one application processing the data through just one Ras_kernel channel. The Ras_kernel object registers all open/created raster files and assigns a Ras_phys_file ob- ject for each one. Multiple open request to a unique file always leads to a single Ras_phys_file object. Geoinformatics FCE CTU 2011 21 Hruby M.: Designing a New Raster Sub-System for GRASS-7 As the Raster objects provide the complete information access in both read/write directions (together with the Window module), it in fact makes the basic necessary interface between an user and the raster GIS kernel. The whole raster implementation is thus encapsulated inside this abstraction and so, the RG7 concept may serve as a new abstract API for applications even without the new tiling approach, just as an abstract layer over the classical GRASS raster engine. The window layer The RWindow class provides a final element of RG7 C++ API. An application may specify a window which can slide within the region of Raster object. The important fact is that an user does not invoke explicitly the read/write operations – he just points the window as a text cursor and shifts that as he needs. When a window is placed (or shifted/moved), its RWindow object requests the Raster object to obtain the relevant data. All data edits are made through the window as well, so if RWindow objects is asked to move at some other coordinate (or is being destroyed), RWindow object then performs the write operation automatically. RWindow object offers a basic resampling of a raster based on its internal attribute region. If the region is left default, i.e. taken from the file’s metadata (its basic region), then the win- dow reads the raster in its original resolution and in fact, it accesses the physical raster data. Furthermore, an user may specify his own region and then the region translates the relative co- ordinates of the window to the geographical coordinates and identifies the source cells. By de- fault, the resample method is set to the nearest neighbor. Specifying an own region is frequent in analytical tools respecting the standard GRASS monitor. The GRASS monitor settings (a global region accessible via g.region) is available by invoking Ras_kernel::monitor(). The RWindow operations are these: • read(ICoord c) – returns a value of a cell at c local position within the window. • write(ICoord c, dtype val) – similarly, writes a value val. • topleft()/topright()/bottomleft()/bottomright() – move the window on such a position. • shift_right()/... – shifts the window one cell on right/left/down/up. • moveAt(ICoord c)/moveAt(Coord c) – points the window at a given position. There are various sorts of raster windows in the RG7 design: • RWindow – basic single-layer window with user defined size and region of resampling. • RWindowPicture – a variant of RWindow, exportable to a bitmap picture. • RWindowRow – a variant of RWindow where the size is automatically set to (1,columns) where columns follows the geometry of the given raster file. • RWindowMulti – a multi-layer window. It allows opening multiple raster files with an unique window, each file either for write or read access. Geoinformatics FCE CTU 2011 22 Hruby M.: Designing a New Raster Sub-System for GRASS-7 • RWindow3D – a window suitable for 3D raster requests. In fact, that is a multi-level RWindow defined on a single file. Let us see the following demonstrative example of various access windows. RWindow r1(handle, ICoord(3,3)); RWindow r2(handle, ICoord(3,3), RG7.monitor()); RWindow r3(handle, ICoord(1, RG7.monitor().countX()), RG7.monitor()); RWindowRow r4(handle, RG7.monitor()); RWindowMulti r5(ICoord(3,3), RG7.monitor()); r5.add(handle); r5.add(RG7.open("elevation.dem"), READONLY); RWindowPicture r6(handle, RG7.monitor()); The r1 window of size 3 × 3 is open for previously open raster using its natural (physical) resolution and for its natural region. Similarly, r2 is created with the same window size, but defined on the global GRASS region specified by the g.region, i.e. including its resolution. The windows r3 a r4 are equivalent. The r5 has got two layers with rasters "geology" (previously open) and "elevation.dem". Sliding r5 will load/save raster cells of both layers simultaneously. The window r6 is going to have its size specified by the current monitor settings, i.e. by its boundaries and resolution. In fact, it gives an whole picture of the raster as it might be seen in the d.mon GUI window and then printed by r6.plot("out.bmp") as a raster picture. Application interface to RG7 RG7 is implemented in GNU C++, it means that its code is made to be easily understable for developers, ready for enhancements and using various STL libraries wherever it is possible. Over its core, various APIs might be specified: • API compatible with the already existing C API standard such that absolutely no modifications to the existing raster tools will be needed. • C++ API giving the users all the new enhancements of RG7. • API connecting other programming languages. It is absolutely sure that implementing RG7 may not hurt the overall functionality of GRASS, i.e. of GRASS analytical and system modules. It is a matter of time and future experience if some modules will be re-implemented to gain a higher performance coming with a new raster storing and processing. An 3D raster support The RG7 concept is open for 3D-raster extension with only minor modifications to its basic algorithms, I/O interface and program code. The suggestion is the following: • The 3rd dimension is discretized in the regular way with a constant step. It makes a set of 2D unique raster files, where identification of a particular tile is done in two steps: Geoinformatics FCE CTU 2011 23 Hruby M.: Designing a New Raster Sub-System for GRASS-7 identification in 2D and determination of the layer in 3rd dimension. • Labeling the tile is extended with a number of a required layer (I/O operations provided by the interfaces). • Window may slide on an single or on all layers (3D window). Experimental results This section is going to present few benchmark experiments demonstrating access times to rasters in different manners of their use. We assume the following datasets (Table 1) imported from Spearfish60 demonstration dataset [8]. The rasters were generated with appropriate g.region res= resolution and then exported by r.out.ascii. The data were then imported to RG7 software prototype. All benchmarks have been done based on the SQLite3 interface (see the section "Virtual storage interfaces"), i.e. with rasters stored in SQLite3 database. The minimum region was set 32 × 32 cells for almost all layers (the "geolarge" file with 64x64 tiling). The benchmark was performed on a 4xCPU Xeon 3GHz PC with 8 GB RAM running on Linux OS. The benchmark measures the whole time of the application’s time run using the time unix utility, i.e. the time measured also includes some application overhead with initialization etc. It should be mentioned at the very beginning of this case study, that the experiment (called the benchmark here) does not have a quality of a proper laboratory measurement and its main purpose was not estimate the exact algorithmic complexity of RG7 access times on rasters. The purpose was rather to show the general difference between access times on GRASS and RG7 which will be evident and so, not very precise measurement of run times is generally acceptable. Spearfish Test-name Original size Test size Num. of tiles original name (rows, columns) (imported to RG7) soils soils 750 x 950 750 x 950 256 geology geol 140 x 190 140 x 190 16 geology geolbig 140 x 190 14,000 x 19,000 65,536 geology geol05 140 x 190 28,000 x 38,000 262,144 geology geolarge 140 x 190 140,000 x 190,000 4,194,304 (tile ≥ 64x64) Table 1: Testing raster files imported from Spearfish60 dataset. The Table 1 consists of raster files taken from the Spearfish dataset (Spearfish original name) with their original stored resolution (Original size in rows and columns). The files has been resampled and exported in GRASS and then given a case-study identifier (Test-name) and case-study resolution (Test size). When imported to RG7, each file has been automatically decomposed to the raster tiles (Number of tiles). The "geolarge" file has been decomposed with 64x64 minimum region tiling, the others with the default 32x32 minimum region tiling. Random physical read access In this experiment, a given raster is open and sets its "max available tiles" attribute denoting a maximum number of tiles in the cache. Let us remind the term "maximum available tiles" (see 2) denoting a cache size of tiles in memory. The experiment proceeds a given number of iterations, where in each iteration: Geoinformatics FCE CTU 2011 24 Hruby M.: Designing a New Raster Sub-System for GRASS-7 1. a random coordinate c withing the raster’s region is generated, 2. a cell value at the position c is requested. If the required physical tile is not present in the cache, the tile must be loaded from its disk storage (respecting the limitation of "max available tiles"). As the access is fully random, only two experiments have a sense – an experiment with a single tile cache ("max available tiles" is one) and an experiment with a non-limited cache ("max available tiles" is "full", i.e. unlimited). We should keep in mind that, with this random access, the randomly chosen cell very surely points on a different tile than in the previous iteration. Test-name max. available n. iterations time [s] avg. access time tiles per tile [ms] geol 1 105 4.4 0.044 soils 1 105 5 0.05 geolbig 1 105 8 0.08 geol05 1 105 10.3 0.10 geolbig 1 106 96 0.096 geol full 106 0.162 – soils full 106 0.253 – geolbig full 106 8.6 0.09 geol05 full 106 30.4 0.03 geolarge full 106 6 0.09 Table 2: Benchmark "random access" results measured on RG7 When having only one "max available tiles", almost at every iteration, the requested tile must be loaded through the given interface, i.e. from the SQLite3 database engine. The results show that obtaining any physical tile takes in average about 0.1 ms no matter what the whole size the raster file has (or, at least, the access times are in the same digit place). In other words, at the current implementation of RG7, this prototype can proceed approximately 10, 000 tile loads per second with absolutely random order of tiles. Let us note, that this number will be double or triple when using burst readings (database transactions, SQL selects of multiple tiles, etc.). If the cache supports an unlimited number of physical tiles (referred as "full"), almost all tiles of the input raster must be loaded during the iterations, but just once. The complete time for "geol" and "soils" is trivial, just "geolbig" in compare to "soils" shows about 6 seconds more to load all 65536 tiles (it makes aprox. 0.09 ms/tile), which is not too bad for this early implementation of RG7. See Table 2 to get the experiment’s results. The experiment has been extended with a more detail sampling of the runtime of the bench- marking program for the "geol", "geolbig" and "geol05" raster files. The Figure 4 shows the results when using a single-tile cache and the unlimite cache. The left graph is not very surprising – the access to a tile takes some constant time, thus the resulting function for all raster files is linear. On the right side we may see, that (see the "geolbig" file) after certain number of iterations, the cache gets filled with all tiles and further iterations does not touch the disk and the request are completed within the cache itself. Geoinformatics FCE CTU 2011 25 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Figure 4: Benchmark "random access" results measured on RG7 with geol, geolbig and geol05 datasets. Measured runtime with single tile cache (on left) and unlimited cache (on right). The Figure 5 shows the average access time to a tile. Let us mention mainly the effect of removing the application overhead (loading the program, establishing the quad-tree, etc.) causing the convergence of the computed average time to a certain true value. Figure 5: Benchmark "random access" results measured on RG7 with geol, geolbig and geol05 datasets. Measured average access time to a tile with single tile cache (on left) and unlimited cache (on right). Let us proceed a similar experiment done in GRASS. The experiment was implemented using a demonstration tool called r.example (loop with random Rast_get_row(infd, inrast, random()%nrows, data_type)). The region had to be manually set regarding the current experiment, e.g. g.region rast=geology. The results at the Tables 2 and 3 clearly show the following important observations: • GRASS is significantly faster than RG7 with 1-tile cache in small files. That’s probably because the GRASS kernel loads the whole file in one shot at the first read access and the further readings are then done using its internal cache. The file "geol" is also so small (around 8KB) that it takes only 2-3 pages of virtual memory and thus kept whole Geoinformatics FCE CTU 2011 26 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Test-name n. iterations time [s] avg. access time per row [ms] geol 105 0.5 0.005 soils 105 1.7 0.017 geolbig 105 34 0.34 geol05 105 112 1.12 geolarge 105 915 9.15 geol 106 4.5 0.0045 soils 106 17 0.017 geolbig 106 314 0.314 geol05 106 963 0.963 geolarge 106 4,176 4.176 Table 3: Benchmark "random access" results measured on GRASS in OS disk buffers. • GRASS is significantly slower than RG7 in small files when RG7 has got an unlimited cache (soils, 0.253s versus 17s). • GRASS is significantly slower in large files (see the "geolbig" results) compared mainly in case of a single tile cache – 8s versus 32s (96s versus 314s). To conclude the first experiment, it must be told that the measurement is not very precise due to the influence of OS disk buffers which are not under the tester’s control, but, anyway: • In the category of small files, we compare GRASS results with RG7 unlimited-cache results, and, RG7 wins. We assume that the raster is surely cached in GRASS, thus this comparison is correct. • In the category of large file, we compare GRASS results with RG7 single-tile results, and, RG7 wins. We assume the GRASS not keeping such a large file in buffers, thus, again – this comparison is correct. Correlation between number of tiles and the access time One might think about the time complexity of index-sequential accessing to single raster tiles. Is there any correlation between their size and expected time necessary to fetch a required tile? Certainly, there is some connection, however very small as it is shown in this sub-experiment. I generated nine square raster files similar to the previous ones. These rasters are not gen- erated from any existing raster file (like before), they are fully artificial but having all the same contents. See the Table 4 for the raster file definitions and for the measured results. The experiments are basically identical to the previous bundle of runs. The benchmarking program was executed with a given raster file and set to use just single tile cache. Number of iterations was set equally 1 million of iterations. The table displays the overall time of each program’s execution. As we may see, through all rasters files, the million of iterations (and loading a tile) took about the same time no matter what the size of raster actually is. Geoinformatics FCE CTU 2011 27 Hruby M.: Designing a New Raster Sub-System for GRASS-7 File ID Number of rows Number of Runtime of (columns) stored tiles the experiment [s] g1000 1,000 256 66 g2000 2,000 1,024 67 g5000 5,000 16,384 42 g10000 10,000 65,536 46 g20000 20,000 262,144 54 g50000 50,000 1,048,576 74 g70000 70,000 4,194,304 72 g200000 200,000 16,777,216 54 g300000 300,000 67,108,864 81 Table 4: Generated raster files and runtime measurements for the experiment in Chapter 2 Random window read access This experiment consists of random read accesses with a square window (3x3). The experi- ment is done with either 4 tiles cache (the window may intersect up to 4 tiles) and unlimited cache. Surely, the practical raster analyzes do not "jump" from a random point to another, the experiment just shows a possible access time when sliding a 3x3 raster window. The results (see Table 5) are not compared to GRASS as the measured times would be just a triple of the previously measured ones. Test name max. ava-tiles readings time [s] avg. access time per window [µs] geol 4 105 6 60 soils 4 105 6 60 geolbig 4 105 8.5 85 geol full 105 0.15 1.5 soils full 105 0.2 2 geolbig full 105 4.7 47 geolbig full 106 9 90 Table 5: Benchmark "window access" results Conclusion of the experiments The RG7 implementation seems to be more efficient in the here presented benchmarks than the classical GRASS raster sub-system. This paper is not a comparative study of GRASS and RG7 in raster processing performance. That’s another topic for another paper which might be worked out when RG7 gets more advanced and tuned. Anyway, there is a big performance reserve and hope in processing various RWindow requests as the burst selects on SQLite3 database are done in shorter time than a sum of tile’s individual selects. That’s the point where RG7 might compete GRASS in classical row sequential analyzes as well. There is only only one point of performance difference between these two approaches where GRASS seems to be still faster: d.mon. The d.mon utility displays a rather small number of rows and so, it reads a small number of rows from disk. Comparing to that, RG7 has to read a sequence of tiles which then complete the requested single row. The same experience Geoinformatics FCE CTU 2011 28 Hruby M.: Designing a New Raster Sub-System for GRASS-7 is probable in processing rasters in overview mode, i.e. not in their original resolution. Doing such a fast overview on the data is possible via raster pyramids. Conclusions and future work The RG7 design has been presented in this paper. The design is certainly in a very early developmental stage with, at the moment, no proper connection to its target infrastructure, and as it has been mentioned, GRASS-7 (or 8) might be the target. Moreover, the RG7 implementation is currently very theoretical and needs certain optimizations to ensure high performance of the resulting raster kernel. Using the virtual interfaces (see the section "Virtual storage interfaces"), the RG7 library makes an abstraction over unlimited number of data storage formats and methods. The GIS system based on RG7 can operate raster data sources like GRASS raster format, RG7 Sqlite3 interface, WMS, GeoTIFF, etc. – all in the same manner. The benchmark experiments demonstrated that a random access to a raster via RG7 is faster than via classical GRASS raster sub-system. This is pretty sure at the moment. One may say, that this success is perhaps just marginal as a large number of practical raster analyses reads the rasters sequentially and that’s perfectly working in the current GRASS. However, as it was mentioned in 2, the optimized RWindowRow interface will probably defeat this argument. There might be some criticism regarding the searching time overhead in Quad-tree spatial index. In fact, there are two sorts of searching trees: searching through Quad-tree spatial index and searching in database index file. Both are variants of tree data structures. Let us mention that the practical Quad-tree height is rather small, e.g. 12 levels for raster 200, 000× 200, 000 cells. Time spent in searching within such a tree is really marginal. Similarly, in the case of the SQL database index file. I wanted to show that tiling the raster, i.e. splitting the raster to a grid of small elements can guarantee some sort of independence on raster’s size. Saying constant access time would be too strong as there are several aspects in computers influencing the result, however the experiments presented here give some support to the concept of RG7. The concept of RG7 has just one weakness at the moment, and that is the time spent in generating the Quad-tree when opening the raster. This is going to be fixed in the next RG7 development step. Moreover, the computing performance is not the biggest issue. RG7 is attempting to show a different manner of accessing the raster data, and, I would say, a better manner than the current API provides. The idea of "windows" (or cursors?) of variable size is certainly not new. In the Computer science terminology, the rasters storage and raster API make together so called Abstract Data Type (ADT). The computer scientists know for many years that a good ADT ensures clever, efficient and flexible algorithms. On the other hand, a bad ADT kills surely the applications. Let us give a good raster ADT to the future GRASS program generation. The RG7 design certainly counts on the standard GRASS-6 raster API ensuring a full back- ward compatibility with the existing raster analytical tools. Geoinformatics FCE CTU 2011 29 Hruby M.: Designing a New Raster Sub-System for GRASS-7 Acknowledgement: This work has been supported by the Grant Agency of Brno University of Technology No. FIT-S-11-1 Advanced secured, reliable and adaptive IT and the Czech Ministry of Education under the Research Plan No. MSM0021630528 "Security-Oriented Research in Information Technology". This work was also supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project(CZ.1.05/1.1.00/02.0070). Author would like to thank to three anonymous reviewers for their useful comments and remarks. References 1. GRASS Homepage: http://grass.fbk.eu/ 2. GRASS-6 Raster API Manual: http://grass.osgeo.org/programming6/gisrasterlib.html 3. PostgreSQL Homepage: http://www.postgresql.org/ 4. SQLite3 Homepage: http://www.sqlite.org/ 5. GDAL Homepage: http://www.gdal.org/ 6. PostGIS Homepage: http://postgis.refractions.net/ 7. Raster3D Manual Page: http://grass.osgeo.org/manuals/html70_user/raster3D. html 8. Spearfish Data set: http://grass.fbk.eu/download/data6.php Geoinformatics FCE CTU 2011 30 http://grass.fbk.eu/ http://grass.osgeo.org/programming6/gisrasterlib.html http://www.postgresql.org/ http://www.sqlite.org/ http://www.gdal.org/ http://postgis.refractions.net/ http://grass.osgeo.org/manuals/html70_user/raster3D.html http://grass.osgeo.org/manuals/html70_user/raster3D.html http://grass.fbk.eu/download/data6.php