Gray's 3rd requirement for Scientific Data Management Systems:
3. Intelligent indexes: for efficiently subsetting and filtering the data.
The word index is overloaded here: so far I've been using it to describe accessing array subsets by array indices. A search index, by contrast, is a way to efficiently find subsets of data based on the data's values.
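To make the distinction concrete, here is a minimal sketch in Python using the netCDF4 library (the file name and variable names are hypothetical): array indexing answers "give me the values at these positions," while a value-based search answers "give me the positions where the values satisfy some predicate."

```python
# A minimal sketch of the two senses of "index"; the file and variable
# names below are hypothetical.
import numpy as np
from netCDF4 import Dataset

ds = Dataset("temperature.nc")

# Array indexing: you already know where the values live.
subset = ds.variables["temp"][0:10, 20, 30]   # first 10 times at one grid point

# Value-based search: you know what you want, not where it is.
# Without a search index, this is a linear scan over all the data.
temp = ds.variables["temp"][:]
hot = np.argwhere(temp > 300.0)               # (time, lat, lon) positions above 300 K
ds.close()
```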
Scientific file formats like netCDF/HDF lack search indexes, but databases invariably have some functionality for this purpose, typically implemented with B-trees and (to a lesser extent) hash tables. The DBA (database administrator) can add and delete search indexes to optimize selected queries without breaking existing applications. That's only approximately true, since some applications require fast access or they are in fact broken (if your ATM took an hour to decide whether you could withdraw money, you wouldn't use it). Nonetheless, the separation of search indexes from the logical view of datasets is a powerful abstraction, and it is partially responsible for the success of relational databases.
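As a minimal sketch of that separation, here is the idea in Python's built-in sqlite3 module (the table and data are invented for illustration): the application's query text never changes, but adding an index changes how the engine answers it.

```python
# A minimal sketch, using Python's built-in sqlite3, of adding a search
# index without touching the query. The table and data are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (station TEXT, time REAL, temp REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                [("KDEN", t, 280.0 + t % 15) for t in range(100000)])

query = "SELECT time, temp FROM obs WHERE station = 'KDEN' AND temp > 290"

rows_before = con.execute(query).fetchall()   # answered by a full table scan

# The DBA adds an index; existing applications keep issuing the same SQL,
# but the planner can now walk a B-tree instead of scanning every row.
con.execute("CREATE INDEX idx_obs_station_temp ON obs (station, temp)")
rows_after = con.execute(query).fetchall()    # same query, indexed access path
```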
When would it be useful to build search indexes on top of netCDF files? A simple answer: when the cost of building the index can be amortized over enough queries to justify the extra work and space (a minimal sketch of this trade-off follows the list of questions below). If the index is only used once, it's not worth building. So it depends on what the expected queries are, and under what circumstances they can use the same index.
- Are there indexes that would be used by many researchers?
- Are there indexes that would be repeatedly used by a small group of researchers?
- How much data do they want once they find what they are looking for?
- Do they need all the data (e.g., for visualization), or do they really want to run some computation over it?
- Could the computation be sent to the data?
- Is it a simple computation that can be expressed in some elegant algebra?
- Is there a high-level query language, or are we trying to satisfy queries with file-at-a-time processing using array indexes?
- How is the data subset organized? Is there data locality on disk?
- What does the query actually return?
- How would a DBA recognize that an index would be justified?
And on and on. The database community has done lots of research to answer these kinds of questions, but, as far as I know, very little has been done for scientific data access.
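To make the amortization argument concrete, here is a minimal sketch in Python (the archive layout, the "station" variable, and the station identifier are all hypothetical): answering each query by brute force costs a full pass over the collection every time, while building a simple in-memory index costs one pass up front and makes each later query a lookup.

```python
# A minimal sketch of amortizing the cost of an index over many queries.
# The archive layout and the "station" variable are hypothetical.
import glob
from collections import defaultdict
from netCDF4 import Dataset

files = glob.glob("archive/*.nc")

def scan(station_id):
    """Answer one query by brute force: open and inspect every file."""
    hits = []
    for path in files:
        with Dataset(path) as ds:
            if station_id in ds.variables["station"][:]:
                hits.append(path)
    return hits

def build_index():
    """One-time cost: remember which files mention which stations."""
    index = defaultdict(list)
    for path in files:
        with Dataset(path) as ds:
            for sid in set(ds.variables["station"][:]):
                index[sid].append(path)
    return index

# hits = scan("KDEN")     # repeats the full pass for every query
index = build_index()     # paid once, reused by every later query
hits = index["KDEN"]      # each query is now a dictionary lookup
```

Whether the one-time pass pays for itself depends entirely on how many queries reuse the index, which is the point of the questions above.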
In ignorance, all of us are experts, and so undaunted, I offer the following ways to proceed.
First, I'm interested in the case where these files are big: hundreds of megabytes at least, and more typically collections of many files of a few gigabytes each, i.e. a few terabytes of data in all. I'd like to build servers that can eventually handle petabytes of data. Some numbers I gathered for a talk at XLDB in October of 2010:
- NCAR Mass Store has 8 PB, adding 2.5 PB / year.
- NCAR Research Archive holds 55 TB in 600 datasets.
- NASA ESDIS has 4.2 PB / 4000 datasets / 175M files.
- NOAA CLASS has 30 PB now, 100 PB by 2015.
- IPCC AR4 has 35 TB / 75K files.
- IPCC AR5 will be ~2.5 PB.
The data in these archives is not going to be stuck into a relational database. Some hybrid file/database system may come along that dominates the problem, but there's nothing proven on the immediate horizon. So principle 1: canonical datasets will be kept in scientific file formats like netCDF/HDF. By canonical I mean that no matter what else archival centers do, the data stored in scientific file formats will be considered correct, and will be used to validate other access methods, like web services, query systems, cloud computation, etc.
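In practice, "canonical" just means that every other access path gets checked against the file. A minimal sketch (the file name, variable name, and fetch_from_service placeholder are all hypothetical):

```python
# A minimal sketch of using the netCDF file as the reference that other
# access methods are validated against. The file, the variable, and the
# fetch_from_service placeholder are all hypothetical.
import numpy as np
from netCDF4 import Dataset

def fetch_from_service(variable, time_index):
    """Placeholder for an alternate access path (web service, query system,
    cloud copy); replace with a real client. Here it just re-reads the
    canonical file so the sketch runs end to end."""
    with Dataset("canonical/sst_2010.nc") as ds:
        return ds.variables[variable][time_index, :, :]

with Dataset("canonical/sst_2010.nc") as ds:
    reference = ds.variables["sst"][0, :, :]

served = fetch_from_service("sst", time_index=0)
assert np.allclose(reference, served), "access method disagrees with the canonical file"
```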
Second, I'm interested in Earth science data, meaning data that is usually associated with times and spatial locations (aka georeferenced data). The problem of accessing general scientific data will have to wait; for now, let's build a system specialized to our community. How should one, in general, store such data, not knowing in advance how it might be used? So principle 2: partition the data by time. This means dividing your data into separate files by some period of time, perhaps days, months, or years. Make the files fairly large, to minimize the overhead of managing very large numbers of files. For current technology, I'd say something like 100 MB to 1 GB per file.
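A minimal sketch of what this looks like in Python (the naming scheme and the monthly period are assumptions for illustration, not a convention this post prescribes):

```python
# A minimal sketch of time partitioning: one file per month, named so a
# timestamp maps directly to the file that should hold it. The naming
# scheme here is an assumption for illustration.
import datetime

def path_for(timestamp: datetime.datetime, root: str = "archive") -> str:
    """Map a time to the monthly partition file that contains it."""
    return f"{root}/temp_{timestamp:%Y-%m}.nc"

print(path_for(datetime.datetime(2010, 10, 14)))   # archive/temp_2010-10.nc

# The period (day, month, year) is the tuning knob: choose it so each file
# lands roughly in the 100 MB to 1 GB range suggested above.
```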
Third, abstractions for storing scientific data are needed at a higher level than multidimensional arrays. If you want to find things in a large collection of stuff, you need a way to identify a thing and to say what it is once you have it. In computer science, we have converged on the Object as the way to do this. In the abstract data modeling going on in the scientific community, we are calling these things Features. Features provide the higher-level abstraction needed for intelligent indexing, since now it's clear what a query should return: feature objects. So principle 3: the data stored in scientific data files are collections of features. Multidimensional arrays might be how a feature is delivered to your program, but your program needs to be rewritten to understand features if you want it to scale to very large datasets.
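A minimal sketch of the idea (the Feature fields and the query helper are illustrative, not part of any existing convention): queries return whole feature objects rather than bare arrays.

```python
# A minimal sketch of principle 3: a query over a collection returns
# feature objects, not bare arrays. The fields shown are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class Feature:
    feature_id: str
    lat: float              # where the feature is
    lon: float
    time: np.ndarray        # coordinate values
    data: np.ndarray        # the values themselves

def query(collection, predicate):
    """A query returns whole features, so the caller knows what it has."""
    return [f for f in collection if predicate(f)]

# e.g. all features inside a bounding box; a real collection would be
# built from the underlying netCDF files.
in_box = query(collection=[],
               predicate=lambda f: 30 <= f.lat <= 50 and -110 <= f.lon <= -90)
```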
Amazingly, storing data in scientific file formats, partitioning it by time, and using metadata conventions to describe the semantics of features is (mostly) what data providers are already doing. Wow, is that a coincidence or what? So we will just have to build our most excellent next-generation Earth Science Data Management System (ESDMS) on top of these best practices, in this best of all possible worlds.
I loved this blog post right up to the point where you reach principle 2: "partition the data by time". This is fine sometimes but completely wrong at other times, depending on the end use of the data. I have a blog post describing the problem, titled Data producers vs. data consumers:
http://mazamascience.com/WorkingWithData/?p=84
Right now, for instance, I am in the middle of reformatting 2 TB of HYSPLIT model output FROM a format that partitions by time TO a collection of NetCDF files organized by location and then Julian Day. (The goal is to interactively extract location-specific climatologies.) Access times have gone from hours to under a second.
As you pointed out, the best way to store data depends on the questions we will ask of that data. Often, I find myself advocating duplicating the data so that we have both synoptic and location-specific organization.
Extra disk space is way cheaper than the human costs associated with excessive data access times.
Posted by Jonathan Callahan on June 14, 2011 at 06:33 AM MDT