Hi Ted,
Sorry, I'm going to ask some pretty dumb questions here, because I'm
still rather fuzzy on THREDDS metadata, ISO and metadata stuff in
general...
I'm also loosely involved in a project that is hoping to crawl datasets
through OPeNDAP and then recreate THREDDS catalogs with metadata
information (hence my initial question about running THREDDS over
OPeNDAP servers). The end goal is to allow us to express our entire
digital library fully in ISO 19115. There's already been some
interaction between Simon and Jason :)
Content - My experience (could easily be incorrect) is that the THREDDS
community has really focused on “use” metadata which tends to be
relatively sparse (most importantly) and generally more customized. This
reflects the emergence of THREDDS from the scientific community which
traditionally shares that focus. As a result, I expect that the
threddsmetadata elements exist only in a small minority of catalogs.
Yeap, I hear you :) I am in the process of moving across to THREDDS for
our main OPeNDAP servers, so hopefully I can inject some more metadata
into those datasets!
On the topic of generating catalogs from the filesystem... Simon - I'm
not entirely sure how an external application like GeoNetwork is going
to create configuration catalogs for THREDDS, since the "location"
attribute has to point to the actual files themselves, and so depends
on how those files are organised in the underlying filesystem...
Although, it is only a *single* attribute...
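For what it's worth, a datasetScan entry in a config catalog looks
something like this (names and paths made up):

  <datasetScan name="Argo floats" ID="argo" path="argo"
               location="/data/argo/">
    <metadata inherited="true">
      <serviceName>odap</serviceName>
    </metadata>
  </datasetScan>

The "location" is a directory on the server's own filesystem, which is
exactly the bit an external application has no way of knowing.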
This situation is exacerbated by the evolution of THREDDS towards
auto-generation of catalogs from file systems. I’m fairly sure that this
process does not involve opening the files (for performance reasons) so
metadata that might be in those files is generally not incorporated in
the catalog. I suspect that hand-hewn catalogs with lots of metadata are
rare. BTW - I suspect that the same obvious (over-)generalization
applies to the files that underlie most of these catalogs (again I have
no real quantitative evidence for this). There are a few groups out
there creating netCDF files with really high-quality metadata content
and that number may be growing, but it is still small. This reflects the
fact that most creators and users of these files understand them pretty
well and can generally use them successfully with information gleaned
from conversations or scientific papers and presentations. The focus on
high-quality standard metadata generally comes more from archives and
groups interested in the preservation of understanding. This is a
different group.
Indeed. From experience, most data providers are keen to get the data
out there, and making datasets compliant with any metadata convention is
seen as a blocker rather than an enabler (unfortunately), so a lot of
information is left out of the initial package.
Yes, again you've hit the nail on the head about this.
Identifiers/Collisions - The idea of unique identifiers exists in a
couple of the metadata standards we are generally thinking about (DIF,
FGDC-RSE, ISO) but it could easily be stretched in this situation. For
example, if we agree that 1) there is a complete collection of metadata
objects associated with a dataset and identified by an identifier and 2)
that the content of some subset of those objects is included in a
THREDDS Catalog, how is the identifier for that subset related to the
identifier of the complete set? How might collisions between these
identifiers be handled during the harvest process?
If I understand correctly - the problem is that these identifiers are
not always generated by some central repository, but rather by the
data providers/repository providers. Just looking at the OAI spec again
- it says that "Individual communities may develop community-specific
URI schemes for coordinated use across repositories". So does anyone
know if such a thing exists?
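(For reference, the oai-identifier scheme in the OAI-PMH guidelines
takes the form oai:<repository-identifier>:<local-identifier>, so ours
might look like, say:

  oai:tpac.org.au:argo/weekly

with both parts made up here - but that only scopes the identifier to a
repository, which I don't think answers Ted's subset question.)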
The collision question also comes up if we consider where the metadata
are coming from. Slide 6 in the presentation I sent out yesterday shows
metadata emerging from a Collection Metadata Repository and being
written into the granules. In our architecture, this is also the source
of metadata that might be written into a THREDDS Catalog and harvested
to GeoNetwork, Geospatial One-Stop, GEOSS, GCMD, ... . It is also the
authoritative source for complete up-to-date information pointed to by
the URLs in the granules. Harvesters need to identify and manage these
potential collisions. This seems potentially very messy to me.
I'm looking at slide 6 at the moment and have a question... How does it
deal with datasets that are continually updated? For example, we update
the Argo dataset on a weekly basis through rsync. The NcML files will
need to be updated too, which will introduce a lag between the content
of the file and the NcML file. I'm more in favour of generating the
data-dependent values from the file itself... (bad for performance, but
at least the metadata will always be relevant.)
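To make the lag concrete: if the NcML hard-codes data-dependent values,
e.g. (attribute name from the ACDD convention, value made up):

  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <attribute name="time_coverage_end" value="2009-08-05T00:00:00Z" />
  </netcdf>

then every weekly rsync makes that value stale until someone regenerates
the NcML.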
Mappings - This problem really boils down to the flatness inherent in
the netCDF attribute model which has a parameter-value rather than an
XML-DTD or schema history. This model only includes one level of
containers so the only way to associate multiple attributes is to hold
them together with a variable (I’m pretty sure on this but am not as
experienced with netCDF as many others on this list).
I've had trouble getting my head around this problem too! The THREDDS
catalog (rather than the metadata in NetCDF files) is a bit richer,
since metadata can be inherited by nested datasets. So it should solve
*some* of the flatness issues.
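For example, something like this in a catalog (names made up) lets both
child datasets pick up the creator without repeating it:

  <dataset name="Argo collection">
    <metadata inherited="true">
      <creator>
        <name>TPAC</name>
        <contact url="http://www.tpac.org.au/" email="..." />
      </creator>
    </metadata>
    <dataset name="argo_week1.nc" urlPath="argo/argo_week1.nc" />
    <dataset name="argo_week2.nc" urlPath="argo/argo_week2.nc" />
  </dataset>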
As I said yesterday, we created the NCML from FGDC using an XSLT, so we
know the mapping we used. It is fairly easy to reflect that mapping in
the names of the elements from FGDC because the structure of FGDC is
simple and fairly flat. This is not true for ISO, as you know. There are
many embedded objects with potentially repeating objects in them. Of
course, you could name them with xpaths but this seems like a difficult
path to go down, particularly when you have a link to a complete, valid,
and up-to-date record available.
Consider the relatively clean and important problem of OnlineResources.
In THREDDS land these are strings (like fgdc_metadata_link) whose
functions are known by convention: this one is a link to a complete fgdc
record. Most of our datasets have multiple OnlineResources with complete
ISO descriptions (linkage, name, description, ...). Writing those into a
netCDF file without losing critical information does not seem
straightforward to me. This is a really simple case. I don’t even want
to think about writing ISO Citations into netCDF or THREDDS!
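(Just to make this concrete for myself - a single ISO 19139
OnlineResource carries structure like this, with URL and text made up:

  <gmd:CI_OnlineResource>
    <gmd:linkage>
      <gmd:URL>http://example.org/iso/argo.xml</gmd:URL>
    </gmd:linkage>
    <gmd:name>
      <gco:CharacterString>Argo ISO record</gco:CharacterString>
    </gmd:name>
    <gmd:description>
      <gco:CharacterString>Full ISO 19115 record for the Argo
        collection</gco:CharacterString>
    </gmd:description>
  </gmd:CI_OnlineResource>

whereas a flat attribute like fgdc_metadata_link can only carry the
linkage part.)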
Anyway, long-winded as I promised. All things considered I think the
approach I suggested provides maximum bang for the buck! We will work on
adding the ISO links...
Sorry for being a bit slow here... So are you creating the ISO documents
separately, and then referencing them using something like
iso_metadata_link in a NetCDF file's attributes? How are the ISO
documents generated in the first place (is this the NetCDF writer?) and
where will they be hosted? I suppose those files could also be hosted by
the THREDDS server if we just configure the HTTP service for them? I do
like how clean this approach is - simply adding a new attribute to the
metadata. It sounds very achievable!
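i.e. (guessing at the details here - attribute name as above, URL made
up) the NcML would just gain:

  <attribute name="iso_metadata_link"
             value="http://our.server/thredds/fileServer/iso/argo.xml" />

with the ISO documents themselves sitting under a datasetScan served via
the HTTPServer service.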
Cheers,
-Pauline.
--
Pauline Mak
ARCS Data Services
Ph: (03) 6226 7518
Email: pauline.mak@xxxxxxxxxxx
Jabber: pauline.mak@xxxxxxxxxxx
http://www.arcs.org.au/
TPAC
Email: pauline.mak@xxxxxxxxxxx
http://www.tpac.org.au/