What's the relationship between Relational Database Management Systems (RDBMS) and scientific file formats like netCDF and HDF? Sometimes I think of netCDF-3 as a persistence format for Fortran 77. The only features it really adds are arbitrary metadata (in the form of attributes) and shared dimensions. Both features are simple but powerful, yet somehow missed by at least some other formats for scientific data: neither GRIB nor BUFR allows arbitrary metadata (everything has to be in a controlled vocabulary), and neither HDF nor OPeNDAP has the full generality of shared dimensions, instead using somewhat less general coordinate variables (a.k.a. dimension scales in HDF and Grid maps in OPeNDAP). NetCDF-4/HDF5 is a lot more sophisticated than netCDF-3, but the main additional features are space/time tradeoffs through chunking and compression, along with parallel I/O for high performance.
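To make those two features concrete, here is a minimal sketch using the netCDF4-python library (the file, dimension, and variable names are made up for illustration):

```python
# A minimal sketch with the netCDF4-python package; file, dimension, and
# variable names here are made up for illustration.
from netCDF4 import Dataset

ds = Dataset("example.nc", "w", format="NETCDF3_CLASSIC")

# Shared dimensions: both variables below refer to the same "time" and
# "station" dimensions, so their shapes stay consistent by construction.
ds.createDimension("time", 24)
ds.createDimension("station", 10)

temp = ds.createVariable("temperature", "f4", ("time", "station"))
rh = ds.createVariable("relative_humidity", "f4", ("time", "station"))

# Arbitrary metadata: attributes are free-form name/value pairs on variables
# or on the file itself, not limited to a controlled vocabulary.
temp.units = "K"
temp.long_name = "air temperature at 2 m"
rh.units = "percent"
ds.history = "written as an illustrative example"

ds.close()
```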
A really useful outsider's perspective on scientific file formats like netCDF and HDF comes from Jim Gray, a Turing Award winner and one of the most respected database researchers of the past 25 years. Jim was a research scientist at IBM, Tandem, DEC, and then Microsoft, until he was tragically lost at sea in 2007 while sailing solo off the coast of California. He and his boat were never found, and a search for him was organized not only physically but virtually, with volunteers examining satellite images supplied by Digital Globe in hopes of finding a clue as to where his boat went down. He was clearly loved as well as admired. Jim and his group at Microsoft developed TerraServer and SkyServer, among the few successful applications of RDBMS software to very large scientific data holdings.
Among his many contributions to database theory and practice is a 2005 paper, "Scientific Data Management in the Coming Decade" [1], which gives a database-centric view of scientific file formats in the context of a vision for how to manage scientific data in the future. It is required reading for anyone thinking about these issues; here I will summarize some of the salient points.
Gray and his collaborators call scientific file formats like HDF and netCDF "nascent database systems": they provide simple schema languages and simple indexing strategies, with data manipulation primitives focused on getting subsetted array data into application memory, where it can be manipulated by user-written software. The problem is that these primitives use a single CPU and operate on a single file at a time, which does not scale to ever-larger datasets.
Drawing on the experience of database research and development, they assert that the following features are necessary for a scalable Scientific Data Management System (SDMS), and that all are substantially lacking from current systems:
Data independence: the physical data organization can change without breaking existing applications.
Schema language: powerful data definition tools allow one to specify the abstract data formats and how the data is organized.
Intelligent indexes: for efficiently subsetting and filtering the data.
Non-procedural query: allows automatic search strategies to be generated that take advantage of indexes as well as CPU and I/O parallelism.
The payoff for this architecture is that an SDMS can automatically parallelize data access.
There's no question that parallel access is the key to scaling. It's accepted wisdom that, at least with today's tools and programming languages, parallel software development is too difficult for application programmers. I personally am convinced by these arguments, and I believe that deep changes are needed at the level of the application programmer's interface to take advantage of automatically generated parallelization.
The key to allowing automatic parallelism is that the application must be able to specify the entire set of data to be operated on, so that the SDMS can parallelize access to it. This is what Gray calls set-at-a-time processing. It is in contrast to the way data is accessed through the netCDF API: iterating over subsets of array data, processing one slice at a time in memory, then discarding it. If the data is spread out over multiple files, there is another iteration over those files. To a data server or data manager module, each of these requests is independent of the previous one, and only speculative optimization is possible, that is, guessing what I/O is needed next. Of course, this freedom for the application program to choose which data to process on the fly makes the program powerful and easy to write. But if the program could describe the set of data that it will use, and even the order in which it will use it, a powerful enough system could optimize the I/O. Can you say "factor of 100"?
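To illustrate the slice-at-a-time pattern, here is a sketch, again using the netCDF4-python library, with hypothetical directory, file, and variable names:

```python
# A sketch of slice-at-a-time access through the netCDF API (netCDF4-python);
# the directory, file, and variable names are hypothetical.
import glob
from netCDF4 import Dataset

total = 0.0
count = 0
for path in glob.glob("archive/*.nc"):            # outer loop over files
    with Dataset(path) as ds:
        temp = ds.variables["temperature"]        # e.g. shape (time, lat, lon)
        for t in range(temp.shape[0]):            # inner loop over time slices
            slab = temp[t, :, :]                  # read one slice into memory
            total += float(slab.sum())
            count += slab.size
            # The slice is then discarded. The library (or a remote server)
            # sees only one small, independent request at a time, so it can
            # only guess what I/O will be needed next.

print("mean temperature:", total / count)
```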
If the application could also specify the computation to run on the data, then the computation can be "sent to the data" for further parallelization and optimization. This is what SQL, the query language for relational databases, does: it specifies both the set of data to operate on and the computation to be performed. An RDBMS has sophisticated cost-based algorithms built in to figure out the lowest-cost way to satisfy the request, choosing among multiple strategies and potentially taking advantage of parallel CPUs and file systems.
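For contrast with the loop above, here is a minimal sketch of the declarative, set-at-a-time style using Python's built-in sqlite3 module; the table and column names are invented, and a real SDMS would of course be far more capable than SQLite:

```python
# A sketch of set-at-a-time, declarative access using Python's built-in
# sqlite3 module; the table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station TEXT, time TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [("KDEN", "2005-01-01T00:00", 263.1),
     ("KDEN", "2005-01-01T01:00", 262.4),
     ("KBOS", "2005-01-01T00:00", 270.9)],
)

# The query names the whole set of data and the computation over it in one
# statement; the engine decides how to satisfy it (indexes, ordering, and in
# a larger system, parallel execution).
for station, mean_temp in conn.execute(
    "SELECT station, AVG(temperature) FROM obs GROUP BY station"
):
    print(station, mean_temp)

conn.close()
```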
The relational data model was chosen to make all this possible. Scientific data is much more heterogeneous, but these principles are well worth understanding to see how they might apply. Gray speculates that HDF and netCDF would become object types inside the next generation of object-relational databases. That at least gets the data into the DBMS framework so that experimentation can happen.
To date, there have not been many success stories in putting heterogeneous scientific data into relational databases (with the notable exception of SkyServer and its relatives), unless the data fits closely into the relational data model. Michael Stonebraker, another leading DBMS researcher (responsible for Ingres, Postgres, StreamBase, and many other research and commercial databases), argues in an influential paper [2] that we've done all we can do with the relational data model, and that other paradigms are now needed. A serial database entrepreneur, he has started the SciDB project, which extends the relational model with arrays. Given his record of success and his explicit intention "to satisfy the demands of data-intensive scientific problems", this effort is worth tracking and perhaps joining.
[1] Jim Gray, David T. Liu, Maria A. Nieto-Santisteban, Alexander S. Szalay, Gerd Heber, and David DeWitt, "Scientific Data Management in the Coming Decade", January 2005. http://research.microsoft.com/apps/pubs/default.aspx?id=64537
[2] Michael Stonebraker and Uğur Çetintemel, "'One Size Fits All': An Idea Whose Time Has Come and Gone", 21st International Conference on Data Engineering, IEEE Computer Society Press, Tokyo, Japan, April 2005, pp. 2-11. http://www.cs.brown.edu/~ugur/fits_all.pdf