NOTE: The galeon mailing list is no longer active. The list archives are made available for historical reasons.
Random distracting comments are below:

John Graybeal wrote:
On Aug 20, 2009, at 9:54 AM, Tom Whittaker wrote:

One of the single biggest mistakes that the meteorological community made in defining a distribution format for realtime, streaming data was BUFR -- because the "tables" needed to interpret the contents of the files are somewhere else... and sometimes, end users cannot find them!

Perhaps this is a problem with the way the tables are made available, and not simply the fact they are separate from the data stream? After all, many image files (for example) are not described internally at all, but no one seems to have trouble working with those images... (I know that's oversimplifying the difference, but it's instructive nonetheless.)
Part of the problem is indeed the "way the tables are made available": no registry of canonical versions, mistakes in the "official WMO table" (!), and a non-machine-readable official WMO table (!!). But the biggest problem is deeply part of the BUFR design: one needs the tables to parse the BUFR message at the syntactic level. Other external tables (e.g. GRIB) are at the semantic level, so you can still extract the numbers even if you don't know what they mean. But BUFR requires the tables simply to parse the message, which makes the above issues with the tables fatal, or worse: if your tables are wrong, your reader can silently return erroneous values. So "self-describing" on both the syntactic and semantic level == good.
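To make the syntactic vs. semantic distinction concrete, here is a minimal Python sketch. It is a toy, not the real BUFR or GRIB encoding: the descriptor ids ("t", "p"), the bit widths, the scale factors, the parameter code 11, and all function names are made up for illustration. The only thing it borrows from the real formats is the idea: in the BUFR-like case the bit widths live in the external table, while in the GRIB-like case the packing information travels in the message and the table only supplies the name.

```python
# Toy illustration only: NOT the real BUFR or GRIB encodings.
# Every descriptor id, bit width, scale and parameter code here is made up.

def pack(fields):
    """Pack (raw_value, bit_width) pairs into one integer, first field in the high bits."""
    bits, total = 0, 0
    for raw, width in fields:
        assert 0 <= raw < (1 << width)
        bits = (bits << width) | raw
        total += width
    return bits, total

def unpack(bits, total, widths):
    """Split a packed integer of `total` bits into fields of the given widths."""
    out, pos = [], total
    for width in widths:
        pos -= width
        out.append((bits >> pos) & ((1 << width) - 1))
    return out

# --- BUFR-like: the external table supplies the bit widths (syntactic level) ---
# descriptor -> (name, bit_width, decimal_scale)
GOOD_TABLE = {"t": ("temperature_K", 12, 1), "p": ("pressure_hPa", 14, 1)}
STALE_TABLE = {"t": ("temperature_K", 10, 1), "p": ("pressure_hPa", 14, 1)}  # wrong width

def decode_bufr_like(descriptors, bits, total, table):
    widths = [table[d][1] for d in descriptors]      # cannot even split the bits without the table
    raws = unpack(bits, total, widths)
    return {table[d][0]: raw / 10 ** table[d][2] for d, raw in zip(descriptors, raws)}

descriptors = ["t", "p"]
bits, total = pack([(2885, 12), (10132, 14)])        # 288.5 K, 1013.2 hPa
good = decode_bufr_like(descriptors, bits, total, GOOD_TABLE)
bad = decode_bufr_like(descriptors, bits, total, STALE_TABLE)   # no error raised, values are wrong

# --- GRIB-like: packing info is in the message; the table only names the parameter ---
def decode_grib_like(msg, table=None):
    raws = unpack(msg["bits"], msg["total"], [msg["width"]] * msg["count"])
    values = [raw / 10 ** msg["scale"] for raw in raws]
    name = (table or {}).get(msg["param"], "unknown parameter %d" % msg["param"])
    return name, values

gbits, gtotal = pack([(2885, 12), (2901, 12)])
msg = {"param": 11, "width": 12, "scale": 1, "count": 2, "bits": gbits, "total": gtotal}
name, values = decode_grib_like(msg)                 # numbers come out fine with no table at all

print(good)
print(bad)
print(name, values)
```

With the stale table the BUFR-like reader raises no error and still returns plausible-looking numbers, which is the "silently return erroneous values" case. The GRIB-like reader without a table returns the right numbers and merely cannot name them.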
NetCDF and ncML maintain the essential metadata within the files: types, units, coordinates -- and I strongly urge you (or whomever) not to make the "BUFR mistake" again -- put the metadata into the files!

Maybe you think all the essential metadata is within the netCDF file, but in my opinion it isn't. I often find the essential metadata, particularly of the semantic variety, to be absent. And I know of communities that have had significant difficulty with the provenance (for example) within CF/netCDF files.

The generalization (point) of this observation is that different people require different metadata, sometimes arbitrarily complex or peripheral metadata. And I don't think you want ALL that metadata in the same file as the data -- especially when the data may be coming not in a file, but in a stream of records.
Yes, in contrast to my claim above for self-describing files, you are stating the "metadata incompleteness theorem", which says "it is impossible to put all essential metadata in a file". Proof by induction: for any given set of metadata, I can find some user who needs one more piece of information not in that set. QED ;^{ (That's why the TDS allows arbitrary metadata annotations to be added to a dataset without having to rewrite it. It doesn't refute the theorem, but it does allow for solving the problem for your friends ;^)
Do not require the end user to have an internet connection simply to "read" the data... many people download the files and then "take them along" when traveling, for example.

Ah, in the era of linked data, or LinkedData [2] -- which will be our era 5 years from now, if not already -- this problem will be solved, because all will insist on having the internet connection when they are traveling. Witness the trajectory of internet availability at scientific conferences.
True, but still, creating a self-contained dataset is good, for reasons of Keeping Things Simple.