Rich Signell wrote:
NetCDF-Java folk,
I'm trying to figure out how best to store the Global and US "Surface
summary of day data" at:
http://www.ncdc.noaa.gov/oa/climate/climatedata.html#daily
in NetCDF format with the CDM Point Feature type conventions:
http://www.unidata.ucar.edu/software/netcdf-java/CDM/CFpoints.html
This is daily-averaged surface data (temp, air pressure, etc.) that
starts in 1929 with just a few stations and now covers thousands of
global stations. It's stored on an FTP site with one directory per
year, each containing gzip-compressed text files, one per station.
The files in the 2010 directory are replaced every few days with
updated files.
In their present form the compressed text files take up 2.9 GB, but a
single NetCDF file with fixed dimensions (22 vars x 81 years x 365
days x 10,000 stations, about 6.5e9 values) would be roughly 26 GB
uncompressed at 4 bytes per value, most of it fill values since few
stations go back to 1929.
So looking at the Point Data specs, it seems we could take several approaches:
1. Write with fixed (time, station) dimensions, fill missing values
with NaN, and use NetCDF-4 deflation.
2. Use 5.8.2.2 Ragged array (contiguous) representation
3. Use 5.8.2.3 Ragged array (non-contiguous) representation
Since the records in the files are updated regularly, option 2 is
probably out, so I'm leaning toward option 3, in which each data
variable has a single sample dimension, all the station data is
written along it, and a separate index variable records which station
each value corresponds to; something like the sketch below.
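To check my reading of 5.8.2.3, here's a minimal sketch of that
layout using the netCDF4-python library (the file and variable names
are my own inventions, and I've spelled the index variable the way
CF's indexed ragged array representation does, with an
instance_dimension attribute; the CDM draft may use a different
attribute name):

    import netCDF4

    ds = netCDF4.Dataset("gsod_ragged.nc", "w", format="NETCDF4")
    ds.featureType = "timeSeries"

    ds.createDimension("station", 10000)   # instance dimension
    ds.createDimension("obs", None)        # one shared, unlimited sample dimension

    name = ds.createVariable("station_name", str, ("station",))
    name.cf_role = "timeseries_id"
    lat = ds.createVariable("lat", "f4", ("station",))
    lon = ds.createVariable("lon", "f4", ("station",))

    # the index variable: for each sample, which station it belongs to
    index = ds.createVariable("stationIndex", "i4", ("obs",))
    index.instance_dimension = "station"

    time = ds.createVariable("time", "f8", ("obs",))
    time.units = "days since 1929-01-01"
    temp = ds.createVariable("temp", "f4", ("obs",), zlib=True)
    temp.units = "degC"

    # updating just appends to the end of the unlimited dimension
    n = len(ds.dimensions["obs"])
    index[n], time[n], temp[n] = 42, 29615.0, 21.5   # one daily value for station 42
    ds.close()

The appeal is that an update is a pure append, no matter which
stations reported that day.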
Does this sound right?
Thanks,
Rich
Hi Rich and all:
This is an interesting challenge: getting good read response on a dataset this large.
First, you have to decide what kinds of queries you want to support and what
kind of response time is needed. I have generally used the assumption that the
common queries that you want to optimize are:
1) get data over a time range for all stations in a lat/lon box.
2) get data for a single station over a time range, or for all time.
Usually I would break the data into multiple files based on time range, aiming for a file size of 50-500 MB. I also use a different format for current vs. archived data, so that the current dataset can be appended to dynamically, while the archived data is rewritten (once) for speed of retrieval.
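For example (file names purely illustrative), that might look like one file per year for the archive plus one growing file for the current year:

    gsod_1929.nc ... gsod_2009.nc   archive: rewritten once, organized for fast reads
    gsod_2010_current.nc            unlimited record dimension, appended to every few
                                    days, folded into the archive format at year end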
Again, it all depends on what queries you want to optimize, so I'll
wait for your thoughts on that.
Another question is which clients need to access this data. Are you
writing your own web service, do you just want remote access from the
IDV, or something else?
I would think that if we're careful, we can get netcdf-4 sizes that are similar
to the compressed text, but we'll have to experiment. The data appears to be
integer or float with a fixed dynamic range, which makes it amenable to storage
as integers with scale/offset packing. Integer data compresses much better than
floating point, because of the noise in the low bits of the mantissa. So one
task you should get started on is to examine each field and decide its data
type; if floating point, decide on its range and the number of significant bits.
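For illustration, here's a quick sketch of that scale/offset packing with the
netCDF4-python library (the field name, range, and precision are made up;
you'd substitute whatever you find when you examine each field, and the
scale_factor/add_offset attributes are plain CF metadata, nothing specific to
this library):

    import numpy as np
    import netCDF4

    ds = netCDF4.Dataset("packed.nc", "w", format="NETCDF4")
    ds.createDimension("obs", None)

    # hypothetical field: sea-level pressure, roughly 850-1100 hPa at 0.1 hPa
    # precision, i.e. (1100 - 850) / 0.1 = 2500 distinct values, which fits
    # easily in a 16-bit short
    slp = ds.createVariable("slp", "i2", ("obs",), zlib=True, complevel=4,
                            shuffle=True, fill_value=np.int16(-32767))
    slp.scale_factor = np.float32(0.1)
    slp.add_offset = np.float32(900.0)
    slp.units = "hPa"

    # the library packs on write: stored short = round((1013.2 - 900.0) / 0.1) = 1132
    slp[0] = 1013.2
    ds.close()

The shuffle filter groups the corresponding bytes of the shorts together
before deflating, which often improves the compression ratio further on this
kind of data.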