Hi Ted, Simon, Pauline et al:
Thanks for restarting this conversation. I'll add a few comments for
now, and like others, promise to think harder about some of the other
issues and post more later.
The original premise of THREDDS was metadata services. After producing
the THREDDS catalog spec, we turned our resources to writing data
services (OPeNDAP, OGC, etc.), using the CDM (netcdf-java) software stack.
We've always considered granularity to be the most critical issue to get
right. Our driving use case is large collections of (mostly) homogeneous
data files. The number of these collections is 2-6 orders of magnitude
smaller than the number of files. At Unidata we have at most a few hundred
such collections. Annotating them, even by hand, is not completely out
of the question.
So a large part of our development effort has gone into "aggregation" of
files, so that they can logically be seen as a single dataset.
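(For context, here is a minimal NcML sketch of the kind of aggregation I
mean; the directory and suffix are just placeholders. It joins the files
in a directory along their existing time dimension so that clients see
one logical dataset.)

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- join all matching files along their existing "time" dimension -->
  <aggregation dimName="time" type="joinExisting">
    <scan location="/data/model/run1/" suffix=".nc" />
  </aggregation>
</netcdf>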
Ok, a few inline comments below:
Ted Habermann wrote:
Simon et al.,
We would be very interested in working with you to explore adding this
harvesting approach to GeoNetwork. I expect that there will be a
plethora of challenges. I am using this e-mail to collect and expose my,
admittedly long-winded and certainly primordial, thoughts on a couple,
and possibly to initiate discussion and evolution...
Content - My experience (could easily be incorrect) is that the THREDDS
community has really focused on “use” metadata which tends to be
relatively sparse (most importantly) and generally more customized. This
reflects the emergence of THREDDS from the scientific community which
traditionally shares that focus. As a result, I expect that the
threddsmetadata elements exist only in a small minority of catalogs.
Yes, generally true, although it would be nice to get a crawler to
quantify this a bit more.
"use" metadata is needed to provide OGC-type data services, ie
subsetting in coordinate space. this is in contrast to "discovery"
metadata, which is what the thredds metadata elements are intended for,
which we hoped would be an enabling technology for metadata services.
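To make that distinction concrete, here is a rough sketch of discovery
metadata carried in the thredds metadata elements of a catalog (the
names, IDs, and values are hypothetical, and I haven't validated this
fragment against the 1.0 schema):

<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Example catalog">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/" />
  <dataset name="Example SST analysis" ID="example/sst" urlPath="example/sst.nc">
    <metadata inherited="true">
      <serviceName>odap</serviceName>
      <documentation type="summary">Hypothetical daily SST analysis.</documentation>
      <keyword>sea surface temperature</keyword>
      <creator>
        <name>Example Data Center</name>
        <contact url="http://example.org" email="support@example.org" />
      </creator>
      <timeCoverage>
        <start>2005-01-01T00:00:00Z</start>
        <end>2005-12-31T00:00:00Z</end>
      </timeCoverage>
    </metadata>
  </dataset>
</catalog>

None of this is needed to serve the data; it is there purely to enable
discovery and metadata services.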
This situation is exacerbated by the evolution of THREDDS towards
auto-generation of catalogs from file systems. I’m fairly sure that this
process does not involve opening the files (for performance reasons) so
metadata that might be in those files is generally not incorporated in
the catalog.
Autogenerating catalogs from the file system is our attempt to lower the
"barrier to entry" for data providers. We can extract times from the
filenames (assuming the info is there), but we don't open the files on the
fly, with the exception of FMRCs (forecast model run collections), which
open one "prototype" dataset and autogenerate some metadata.
We'd like to add background extraction of metadata, and other ways of
connecting to metadata sources. (more on that below)
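As a rough sketch of what the filename-based extraction looks like in a
TDS configuration catalog (the path, location, and filename pattern are
made up for illustration):

<datasetScan name="Example model output" path="example/model" location="/data/model/">
  <filter>
    <include wildcard="*.nc" />
  </filter>
  <!-- derive time coverage from filenames like run_20050101.nc,
       without opening the files -->
  <addTimeCoverage datasetNameMatchPattern="run_(\d{4})(\d{2})(\d{2})\.nc$"
                   startTimeSubstitutionPattern="$1-$2-$3T00:00:00"
                   duration="24 hours" />
</datasetScan>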
I suspect that hand-hewn catalogs with lots of metadata are
rare. BTW - I suspect that the same obvious (over-)generalization
applies to the files that underlie most of these catalogs (again I have
no real quantitative evidence for this). There are a few groups out
there creating netCDF files with really high-quality metadata content
and that number may be growing, but it is still small. This reflects the
fact that most creators and users of these files understand them pretty
well and can generally use them successfully with information gleaned
from conversations or scientific papers and presentations. The focus on
high-quality standard metadata generally comes more from archives and
groups interested in the preservation of understanding. This is a
different group.
Well said.
Identifiers/Collisions - The idea of unique identifiers exists in a
couple of the metadata standards we are generally thinking about (DIF,
FGDC-RSE, ISO) but it could easily be stretched in this situation. For
example, if we agree that 1) there is a complete collection of metadata
objects associated with a dataset and identified by an identifier and 2)
that the content of some subset of those objects is included in a
THREDDS Catalog, how is the identifier for that subset related to the
identifier of the complete set? How might collisions between these
identifiers be handled during the harvest process?
The collision question also comes up if we consider where the metadata
are coming from. Slide 6 in the presentation I sent out yesterday shows
metadata emerging from a Collection Metadata Repository and being
written into the granules. In our architecture, this is also the source
of metadata that might be written into a THREDDS Catalog and harvested
to GeoNetwork, Geospatial One-Stop, GEOSS, GCMD, ... . It is also the
authoritative source for complete up-to-date information pointed to by
the URLs in the granules. Harvesters need to identify and manage these
potential collisions. This seems potentially very messy to me.
We did add the notion of a "naming authority", which gets prepended to
the dataset identifier, with the idea that providers could solve these
problems themselves and be sure that (with a unique naming authority)
their IDs would be globally unique. (We could probably refocus that to
correspond better with subsequent developments in URI naming standards.)
If some group solved this correctly, it's likely other groups would be
very happy to adopt it.
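In catalog terms it looks something like this (the authority string and
ID are just made up); the authority gets prepended to the ID to form the
globally unique identifier:

<dataset name="Example SST analysis"
         authority="edu.example.datacenter"
         ID="sst/analysis/v1"
         urlPath="example/sst.nc" />

The authority can also be declared once in an inherited metadata element
and picked up by all the datasets below it.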
Mappings - This problem really boils down to the flatness inherent in
the netCDF attribute model which has a parameter-value rather than an
XML-DTD or schema history. This model only includes one level of
containers so the only way to associate multiple attributes is to hold
them together with a variable (I’m pretty sure on this but am not as
experienced with netCDF as many others on this list).
As I said yesterday, we created the NCML from FGDC using an XSLT, so we
know the mapping we used. It is fairly easy to reflect that mapping in
the names of the elements from FGDC because the structure of FGDC is
simple and fairly flat. This is not true for ISO, as you know. There are
many embedded objects with potentially repeating objects in them. Of
course, you could name them with xpaths but this seems like a difficult
path to go down, particularly when you have a link to a complete, valid,
and up-to-date record available.
Consider the relatively clean and important problem of OnlineResources.
In THREDDS land these are strings (like fgdc_metadata_link) whose
functions are known by convention: this one is a link to a complete fgdc
record. Most of our datasets have multiple OnlineResources with complete
ISO descriptions (linkage, name, description, ...). Writing those into a
netCDF file without losing critical information does not seem
straightforward to me. This is a really simple case. I don’t even want
to think about writing ISO Citations into netCDF or THREDDS!
There are two possible containers for metadata: 1) the THREDDS
catalog/dataset element, and 2) the dataset itself, as viewed through the
data service protocol (i.e. OPeNDAP, WCS, WMS, etc.). Ideally the
information content is the same in both, although the representation is
different.
We've mostly taken the view that the discovery metadata needs to be in
the THREDDS element, and the use metadata needs to be in the dataset.
However, we could and should allow the discovery metadata to get
automatically injected into the dataset itself.
I think that allowing arbitrary XML in the catalog should accommodate any
complex scheme such as ISO. Injecting that into the dataset, however,
will require some more complex conventions, as Ted points out with his
example of a netCDF-3 file.
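Just to illustrate the flatness problem (this is only a sketch of one
possible convention, not something we have agreed on; the attribute names
and URLs are made up), injecting discovery metadata into a netCDF-3 style
dataset more or less has to look like flat global attributes, e.g. via
NcML:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="example/sst.nc">
  <!-- discovery metadata flattened into global attributes; nested or
       repeating ISO objects do not map cleanly onto this parameter-value model -->
  <attribute name="title" value="Example SST analysis" />
  <attribute name="keywords" value="sea surface temperature" />
  <attribute name="fgdc_metadata_link" value="http://example.org/metadata/fgdc/sst.xml" />
  <attribute name="iso_metadata_link" value="http://example.org/metadata/iso/sst.xml" />
</netcdf>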
Anyway, long-winded as I promised. All things considered I think the
approach I suggested provides maximum bang for the buck! We will work on
adding the ISO links...
Ted
Back to the topic of metadata sources:
On the one hand, a THREDDS catalog is "just" an XML encoding of metadata
about online resources. Other groups (typically large curators such as
NGDC and NCAR) generate these catalogs as needed from an RDBMS that
stores the metadata.
On the other hand, THREDDS catalogs are used to configure the data
services of a TDS, and must be correctly integrated if you want to
actually provide access to the data itself.
I'd like to understand how this problem is currently being addressed, and
what we could add to the TDS to do it better.