
Re: Dealing with large archives



Hi Tennessee,

Tennessee Leeuwenburg wrote:

> Secondly:
>
> I am trying to work out how to structure my data by date. I will have a number of data sets (NWP models) which will get updated daily, or even multiple times per day. Quite quickly I will reach the point where I have hundreds of data sets published. Even a week's worth of data at 2 per day across 3 sources is 42 data sets.
>
> I have two tasks - one would be to automate the updating of the configuration files so that new data sets get incorporated as they become available, and the other would be to structure the data pages in a sensible way for users to access.

The THREDDS catalog generation tool can automate the generation of catalogs, but it does not generate aggregation server config files. More precisely, it can generate the parts that aren't aggregations, i.e., the plain THREDDS catalog parts of the config file. I've always wanted to extend it to handle the aggregation part of the aggServer config but have never gotten around to it.
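
To make the distinction concrete, here is a rough sketch of the two kinds of pieces (the dataset names and paths are made up, and the aggregation element is illustrative only, not the actual aggServer syntax):

    <!-- A plain catalog entry: the catalog generator can produce these. -->
    <dataset name="GFS run 2004-06-01 00Z" urlPath="nwp/gfs/gfs_20040601_00.nc">
      <serviceName>odap</serviceName>
    </dataset>

    <!-- An aggregation of all the runs along the time dimension:
         this part still has to be written by hand. -->
    <aggregation name="GFS, all runs" dimName="time" type="joinExisting">
      <scan location="/data/nwp/gfs/" suffix=".nc"/>
    </aggregation>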

We're currently working on the next release of the THREDDS server. The OPeNDAP netCDF server side of that should be quite a bit easier to configure (e.g., give it a directory and it serves all the files in that directory that match a certain pattern). The configuration for the aggregation part of the server is still up in the air, but it will very likely be different from the current configuration syntax. This should get ironed out in the next 3-6 months. In the meantime, you might take a look at the catalog generator (http://www.unidata.ucar.edu/projects/THREDDS/tech/cataloggen/index.html) and see if that helps any.
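
To give a rough idea of the direction we're leaning (purely illustrative; as I said, the final syntax may well differ), the per-directory configuration might boil down to something like:

    <!-- Serve every file under /data/nwp/gfs/ whose name matches *.nc. -->
    <datasetScan name="GFS runs" path="nwp/gfs" location="/data/nwp/gfs/">
      <filter>
        <include wildcard="*.nc"/>
      </filter>
    </datasetScan>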


> I was wondering what practices people might have adopted or found successful in the past with regard to handling large amounts of data? Have people typically arranged archive data as aggregations, or linked to archive catalogs from the top-level catalog? What have people found best?


For some of our large and/or rapidly changing data collections, we have set up a data collection subsetting capability. Basically, we have a document that defines the set of allowed subsetting queries for that collection, and then a service that responds to those queries, generally with a THREDDS catalog of the requested subset. This is pretty alpha stuff and we haven't really advertised it much, but we find it useful. Some rough documentation on this is available at http://www.unidata.ucar.edu/projects/THREDDS/tech/dqc/DqcStatus.html.
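
For a concrete flavor, a subsetting query against such a service might look like this (the server name, query parameters, and datasets below are all made up):

    http://yourserver/thredds/dqc/nwp?model=gfs&date=2004-06-01

and the service would answer with an ordinary THREDDS catalog listing just the matching datasets:

    <catalog name="GFS subset for 2004-06-01"
             xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
      <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
      <dataset name="GFS 2004-06-01 00Z" urlPath="nwp/gfs/gfs_20040601_00.nc">
        <serviceName>odap</serviceName>
      </dataset>
      <dataset name="GFS 2004-06-01 12Z" urlPath="nwp/gfs/gfs_20040601_12.nc">
        <serviceName>odap</serviceName>
      </dataset>
    </catalog>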

Ethan

--
Ethan R. Davis                                Telephone: (303) 497-8155
Software Engineer                             Fax:       (303) 497-8690
UCAR Unidata Program Center                   E-mail:    address@hidden
P.O. Box 3000
Boulder, CO  80307-3000                       http://www.unidata.ucar.edu/
---------------------------------------------------------------------------