Hi Ted, Simon, Pauline et al:
Thanks for restarting this conversation. I'll add a few comments for
now, and like others, promise to think harder about some of the other
issues and post more later.
The original premise of THREDDS was metadata services. After producing
the THREDDS catalog spec, we turned our resources to writing data
services (OPeNDAP, OGC, etc.), using the CDM (netcdf-java) software stack.
We've always considered granularity to be the most critical issue to get
right. Our driving use case is large collections of (mostly) homogeneous
data files. The number of these collections is 2-6 orders of magnitude
smaller than the number of files. At Unidata we have at most a few hundred
such collections. Annotating them, even by hand, is not completely out
of the question.
So a large part of our development effort has gone into "aggregation" of
files, so that they can logically be seen as a single dataset.
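(For context, here is a minimal NcML sketch of the kind of aggregation I
mean; the directory and suffix are just placeholders. It joins the files
in a directory along their existing time dimension so that clients see
one logical dataset.)

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- join all matching files along their existing "time" dimension -->
  <aggregation dimName="time" type="joinExisting">
    <scan location="/data/model/run1/" suffix=".nc" />
  </aggregation>
</netcdf>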
Ok, a few inline comments below:
Ted Habermann wrote:
Simon et al.,
We would be very interested in working with you to explore adding this
harvesting approach to GeoNetwork. I expect that there will be a
plethora of challenges. I am using this e-mail to collect and expose my,
admittedly long-winded and certainly primordial, thoughts on a couple,
and possibly to initiate discussion and evolution...
Content - My experience (could easily be incorrect) is that the THREDDS
community has really focused on “use” metadata which tends to be
relatively sparse (most importantly) and generally more customized. This
reflects the emergence of THREDDS from the scientific community which
traditionally shares that focus. As a result, I expect that the
threddsmetadata elements exist only in a small minority of catalogs.
Yes, generally true, although it would be nice to get a crawler to
quantify this a bit more.
"use" metadata is needed to provide OGC-type data services, ie
subsetting in coordinate space. this is in contrast to "discovery"
metadata, which is what the thredds metadata elements are intended for,
which we hoped would be an enabling technology for metadata services.
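To make that distinction concrete, here is a rough sketch of discovery
metadata carried in the thredds metadata elements of a catalog (the
names, IDs, and values are hypothetical, and I haven't validated this
fragment against the 1.0 schema):

<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Example catalog">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/" />
  <dataset name="Example SST analysis" ID="example/sst" urlPath="example/sst.nc">
    <metadata inherited="true">
      <serviceName>odap</serviceName>
      <documentation type="summary">Hypothetical daily SST analysis.</documentation>
      <keyword>sea surface temperature</keyword>
      <creator>
        <name>Example Data Center</name>
        <contact url="http://example.org" email="support@example.org" />
      </creator>
      <timeCoverage>
        <start>2005-01-01T00:00:00Z</start>
        <end>2005-12-31T00:00:00Z</end>
      </timeCoverage>
    </metadata>
  </dataset>
</catalog>

None of this is needed to serve the data; it is there purely to enable
discovery and metadata services.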
This situation is exacerbated by the evolution of THREDDS towards
auto-generation of catalogs from file systems. I’m fairly sure that this
process does not involve opening the files (for performance reasons) so
metadata that might be in those files is generally not incorporated in
the catalog.
Autogenerating catalogs from the file system is our attempt to lower the
"barrier to entry" for data providers. We can extract times from the
filenames (assuming the info is there), but we don't open the files on the
fly, with the exception of FMRCs (forecast model run collections), which
open one "prototype" dataset and autogenerate some metadata.
We'd like to add background extraction of metadata, and other ways of
connecting to metadata sources. (more on that below)
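As a rough sketch of what the filename-based extraction looks like in a
TDS configuration catalog (the path, location, and filename pattern are
made up for illustration):

<datasetScan name="Example model output" path="example/model" location="/data/model/">
  <filter>
    <include wildcard="*.nc" />
  </filter>
  <!-- derive time coverage from filenames like run_20050101.nc,
       without opening the files -->
  <addTimeCoverage datasetNameMatchPattern="run_(\d{4})(\d{2})(\d{2})\.nc$"
                   startTimeSubstitutionPattern="$1-$2-$3T00:00:00"
                   duration="24 hours" />
</datasetScan>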
I suspect that hand-hewn catalogs with lots of metadata are
rare. BTW - I suspect that the same obvious (over-)generalization
applies to the files that underlie most of these catalogs (again I have
no real quantitative evidence for this). There are a few groups out
there creating netCDF files with really high-quality metadata content
and that number may be growing, but it is still small. This reflects the
fact that most creators and users of these files understand them pretty
well and can generally use them successfully with information gleaned
from conversations or scientific papers and presentations. The focus on
high-quality standard metadata generally comes more from archives and
groups interested in the preservation of understanding. This is a
different group.
Well said.
Identifiers/Collisions - The idea of unique identifiers exists in a
couple of the metadata standards we are generally thinking about (DIF,
FGDC-RSE, ISO) but it could easily be stretched in this situation. For
example, if we agree that 1) there is a complete collection of metadata
objects associated with a dataset and identified by an identifier and 2)
that the content of some subset of those objects is included in a
THREDDS Catalog, how is the identifier for that subset related to the
identifier of the complete set? How might collisions between these
identifiers be handled during the harvest process?
The collision question also comes up if we consider where the metadata
are coming from. Slide 6 in the presentation I sent out yesterday shows
metadata emerging from a Collection Metadata Repository and being
written into the granules. In our architecture, this is also the source
of metadata that might be written into a THREDDS Catalog and harvested
to GeoNetwork, Geospatial One-Stop, GEOSS, GCMD, ... . It is also the
authoritative source for complete up-to-date information pointed to by
the URLs in the granules. Harvesters need to identify and manage these
potential collisions. This seems potentially very messy to me.
We did add the notion of a "naming authority", which gets prepended to
the dataset identifier, with the idea that providers could solve these
problems themselves and be sure that (with a unique naming authority)
their IDs would be globally unique. (We could probably refocus that to
correspond better with subsequent developments in URI naming standards.)
If some group solved this correctly, it's likely other groups would be
very happy to adopt it.
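In catalog terms it looks something like this (the authority string and
ID are just made up); the authority gets prepended to the ID to form the
globally unique identifier:

<dataset name="Example SST analysis"
         authority="edu.example.datacenter"
         ID="sst/analysis/v1"
         urlPath="example/sst.nc" />

The authority can also be declared once in an inherited metadata element
and picked up by all the datasets below it.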
Mappings - This problem really boils down to the flatness inherent in
the netCDF attribute model which has a parameter-value rather than an
XML-DTD or schema history. This model only includes one level of
containers so the only way to associate multiple attributes is to hold
them together with a variable (I’m pretty sure on this but am not as
experienced with netCDF as many others on this list).
As I said yesterday, we created the NCML from FGDC using an XSLT, so we
know the mapping we used. It is fairly easy to reflect that mapping in
the names of the elements from FGDC because the structure of FGDC is
simple and fairly flat. This is not true for ISO, as you know. There are
many embedded objects with potentially repeating objects in them. Of
course, you could name them with xpaths but this seems like a difficult
path to go down, particularly when you have a link to a complete, valid,
and up-to-date record available.
Consider the relatively clean and important problem of OnlineResources.
In THREDDS land these are strings (like fgdc_metadata_link) whose
functions are known by convention: this one is a link to a complete fgdc
record. Most of our datasets have multiple OnlineResources with complete
ISO descriptions (linkage, name, description, ...). Writing those into a
netCDF file without losing critical information does not seem
straightforward to me. This is a really simple case. I don’t even want
to think about writing ISO Citations into netCDF or THREDDS!
There are two possible containers for metadata: 1) the THREDDS
catalog/dataset element, and 2) the dataset itself, as viewed through the
data service protocol (i.e. OPeNDAP, WCS, WMS, etc.). Ideally the
information content is the same in both, although the representation is
different.
We've mostly taken the view that the discovery metadata needs to be in
the THREDDS element, and the use metadata needs to be in the dataset.
However, we could and should allow the discovery metadata to get
automatically injected into the dataset itself.
I think that allowing arbitrary XML in the catalog should accommodate any
complex scheme such as ISO. Injecting that into the dataset, however,
will require some more complex conventions, as Ted points out with his
example of a netCDF-3 file.
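Just to illustrate the flatness problem (this is only a sketch of one
possible convention, not something we have agreed on; the attribute names
and URLs are made up), injecting discovery metadata into a netCDF-3 style
dataset more or less has to look like flat global attributes, e.g. via
NcML:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="example/sst.nc">
  <!-- discovery metadata flattened into global attributes; nested or
       repeating ISO objects do not map cleanly onto this parameter-value model -->
  <attribute name="title" value="Example SST analysis" />
  <attribute name="keywords" value="sea surface temperature" />
  <attribute name="fgdc_metadata_link" value="http://example.org/metadata/fgdc/sst.xml" />
  <attribute name="iso_metadata_link" value="http://example.org/metadata/iso/sst.xml" />
</netcdf>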
Anyway, long-winded as I promised. All things considered I think the
approach I suggested provides maximum bang for the buck! We will work on
adding the ISO links...
Ted
Back to the topic of metadata sources:
On the one hand, a THREDDS catalog is "just" an XML encoding of metadata
about online resources. Other groups (typically large curators such as
NGDC and NCAR) generate these catalogs as needed from an RDBMS that
stores the metadata.
On the other hand, THREDDS catalogs are used to configure the data
services of a TDS, and must be correctly integrated if you want to
actually provide access to the data itself.
I'd like to understand how this problem is currently being addressed, and
what we could add to the TDS to do it better.