[THREDDS #ETD-820941]: aggregation and datasetscan

Hi Valentijn,

A few problems here.

First off, the scan location in an NcML aggregation only supports local 
directories; it does not support remote URLs. It does not use the 
CrawlableDataset framework. At some point we hope to use CrawlableDatasets to 
implement the scan functionality but that is still a ways off.

NcML aggregation can aggregate any kind of dataset the netCDF-java library can 
read (this includes netCDF, GRIB, and a few other data formats). If each 
dataset is explicitly listed (in a netcdf element location attribute), they can 
be local files, OPeNDAP datasets, or HTTP served netCDF. However, the location 
specified in a scan element must be a local file directory.
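
As a minimal sketch of the scan case, assuming the files have been copied to a 
local directory (the path /data/pathfinder/1985/ is hypothetical) and that the 
variable to aggregate is named sst in the files, as in your catalog, the 
aggregation might look something like this:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinNew">
    <!-- variableAgg names the variable as it appears in the files (assumed: sst) -->
    <variableAgg name="sst" />
    <!-- location must be a local directory; this path is hypothetical -->
    <scan location="/data/pathfinder/1985/" suffix=".hdf"
          dateFormatMark="#yyyyMM" />
  </aggregation>
</netcdf>
```

The dateFormatMark is the one from your own scan element; it assumes each file 
name begins with the year and month, as in 198501.s04m1pfv50-sst-16b.hdf.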

Second, netcdf/aggregation elements and datasetScan elements cannot be nested. 
They are handled by separate pieces of code (NcML for aggregation and TDS for 
datasetScan) and so don't know how to work together.
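
A sketch of an arrangement that avoids the nesting, assuming the netcdf element 
is placed inside a plain dataset element as in the MERSEA catalog you quoted. 
The name, ID, urlPath, and service name here are placeholders, not values from 
a working catalog:

```xml
<dataset name="Pathfinder SST aggregation" ID="pathfinderAgg"
         urlPath="pathfinder/sstAgg">
  <!-- must refer to a service defined elsewhere in the catalog (hypothetical name) -->
  <serviceName>myOpendapService</serviceName>
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <aggregation dimName="time" type="joinNew">
      <variableAgg name="sst" />
      <!-- explicit netcdf elements, or a scan of a local directory, go here -->
    </aggregation>
  </netcdf>
</dataset>
```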

I would suggest you try aggregating a small number of the remote datasets you 
are working with and see how that goes. You will have to list them individually 
in the aggregation like:

<netcdf location="http://data.nodc.noaa.gov/cgi-bin/nph-dods/pathfinder/Version5.0/Monthly/1985/198501.s04m1pfv50-sst-16b.hdf" />

This is not a great long-term solution for a collection this size. Of course, we are 
still trying to figure out how aggregation will scale on large collections (and 
we're looking at scaling on local datasets). So, as things stand you probably 
wouldn't want to aggregate all the data in these collections anyway.
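
To make the explicit listing concrete, here is a minimal sketch of a joinNew 
aggregation over two remote files. It assumes the time variable definition from 
your catalog, that the variable in the files is named sst, and that a February 
file exists with the same naming pattern (the second URL is a guess, not 
checked against the server). The coordValue numbers are seconds since 
1970-01-01 for midnight UTC on the first of each month:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- new coordinate variable for the aggregation dimension -->
  <variable name="time" type="int" shape="time">
    <attribute name="units" value="secs since 1970-01-01 00:00:00" />
    <attribute name="_CoordinateAxisType" value="Time" />
  </variable>
  <aggregation dimName="time" type="joinNew">
    <variableAgg name="sst" />
    <!-- 1985-01-01 00:00:00 UTC -->
    <netcdf location="http://data.nodc.noaa.gov/cgi-bin/nph-dods/pathfinder/Version5.0/Monthly/1985/198501.s04m1pfv50-sst-16b.hdf"
            coordValue="473385600" />
    <!-- 1985-02-01 00:00:00 UTC; file name assumed from the pattern above -->
    <netcdf location="http://data.nodc.noaa.gov/cgi-bin/nph-dods/pathfinder/Version5.0/Monthly/1985/198502.s04m1pfv50-sst-16b.hdf"
            coordValue="476064000" />
  </aggregation>
</netcdf>
```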

Ethan

PS I'm going to be out of the office for the next two weeks or so (my wife and 
I are expecting a baby on Friday). So, John will probably be answering 
questions for a while.

> Dear Ethan,
> 
> I have read the documentation for aggregation:
> http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Aggregation.html
> but cannot get it to work. Bas and I have spent considerable time over
> the last couple of months getting the new remote dataserver
> implementation in THREDDS 3.4 to be more intuitive, but to no avail. I
> hope you can help us further.
> 
> Below is the datasetScan that causes problems. In fact, nothing in
> the netcdf section has any effect.
> Our intention with the netcdf section is the following (1-4 can be done
> with a local NcML wrapper which you sent us some time ago):
> 1. Rename variable sst to sea surface temperature, and its unit from
> temp to degC
> 2. Rename variable lat to latitude
> 3. Rename variable lon to longitude
> 4. Rename attribute add_off to add_offset
> 5. Aggregate over time. Time is not an existing variable in the dataset,
> so we should make a new one. The value of the new time variable is
> extracted from the filename
> 
> <datasetScan name="Pathfinder" path="pathfinder"
> location="http://data.nodc.noaa.gov/cgi-bin/nph-dods/pathfinder"
> ID="pathfinderTest" addDatasetSize="true" addLatest="true">
> <filter>
> <include wildcard="*-sst*.hdf"/>
> <include wildcard="*qual*.hdf"/>
> </filter>
> <crawlableDatasetImpl
> className="thredds.crawlabledataset.CrawlableDatasetDods" />
> <metadata inherited="true">
> <serviceName>remoteopendap3</serviceName>
> </metadata>
> <netcdf
> xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
> enhance="true">
> <variable name="sst">
> <attribute name="long_name" type="String" value="sea surface temperature" />
> <attribute name="units" type="String" value="degC" />
> <attribute name="add_offset" orgName="add_off" />
> <!--attribute name="missing_value" type="short" value="0" /-->
> </variable>
> <variable name="lat">
> <attribute name="long_name" type="String" value="latitude" />
> <attribute name="units" type="String" value="degrees_north" />
> </variable>
> <variable name="lon">
> <attribute name="long_name" type="String" value="longitude" />
> <attribute name="units" type="String" value="degrees_east" />
> </variable>
> <dimension name="time" length="0" />
> <variable name="time" type="int" shape="time">
> <attribute name="units" value="secs since 1970-01-01 00:00:00" />
> <attribute name="_CoordinateAxisType" value="time" />
> </variable>
> <aggregation dimName="time" type="JoinNew">
> <variableAgg name="sea surface temperature"/>
> <scan location="http://data.nodc.noaa.gov/cgi-bin/nph-dods/pathfinder/Version5.0/Monthly/1985/"
> suffix=".hdf" dateFormatMark="#yyyyMM" />
> </aggregation>
> </netcdf>
> </datasetScan>
> 
> 
> Q1. Do datasetScan and crawlableDataset allow for aggregating
> over time, and updating/enhancing existing variables? If not, can you
> make this work in the foreseeable future (and provide us with an example
> catalog)?
> 
> I also found a support request that covers aggregation in some
> detail, but the example I copied below doesn't work either. I also
> observe that CrawlableDatasetDods is not used in this example, yet
> dirLocations are used to store URLs:
> http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg03368.html
> 
> From this I quote:
> 
> 
> I have enclosed a full xml-file, but here is how it looks now:
> ...
> <dataset name="MERSEA CLASS 1 Aggregated files">
> <service name="this" serviceType="OpenDAP" base="" />
> <service name="TOPAZ" serviceType="OpenDAP"
> base="/thredds/dodsC/">
> <datasetRoot path="topaz"
> dirLocation="http://nerscweb.bccs.uib.no/nersc/nph-dods/mersea-ip/nat/mersea-class1/" /> </service>
> <metadata inherited="true">
> <serviceName>this</serviceName>
> <dataType>Grid</dataType>
> </metadata>
> <dataset name="Best estimate - Atlantic"
> ID="mersea-ip-topaz-class1-nat-be"
> urlPath="topaz/mersea-ip-topaz-class1-nat-be">
> <serviceName>TOPAZ</serviceName>
> <netcdf
> xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
> <dimension name="time" length="0" />
> <variable name="time" type="int" shape="time">
> <attribute name="units" value="secs since 1970-01-01 00:00:00" />
> <attribute name="_CoordinateAxisType" value="time" />
> </variable>
> <aggregation dimName="time" type="JoinNew">
> <netcdf
> location="http://nerscweb.bccs.uib.no/nersc/nph-dods/mersea-ip/nat/mersea-class1//topaz_V2_mersea_nat_grid1to8_da_class1_b20050706_f200506299999.nc"
> coordValue="1120003200" />
> ...
> 
> In the above example, the referenced remote server serves netCDF
> datafiles that do not have the coordinate dimension "Time", but the
> aggregation adds this dimension (probably based on the filename).
> 
> 
> Q2. How would you add "Extracting date coordinates from the
> filename (joinNew)" to this catalog config?
> 
> 
> Cheers, valentijn


Ticket Details
==================
Ticket ID: ETD-820941
Department: Support THREDDS
Priority: Normal
Status: Open

