Hi again Jonathan,
As promised, I am sending along a more detailed review of your proposed
netCDF conventions. (For those reading along, this proposal is
available at http://www-pcmdi.llnl.gov/drach/netCDF.html) This is
prefaced by a few general comments, along with some musings about how
netCDF might best satisfy the needs of users and applications
developers.
General Comments
----------------
1. Conventions ought to be as simple and undemanding as possible,
to make the use of netCDF as easy as possible. This may sound
like a platitude, but one reason for netCDF's popularity to date
has been the ease with which users can get started. And we all
know how critical it is to the viability of a software product
that it first gain wide acceptance.
2. We (the netCDF community, or at least the ocean-atmosphere subset
of us) might want to consider defining a "core" set of basic
conventions, with extensions to suit specific groups. These core
conventions should be broadly applicable and include basic
principles for storing and documenting certain types of
information. Groups specializing in climate, NWP, oceanography,
observational datasets, etc. could define additional conventions to
facilitate data exchange amongst themselves, so long as the files
are consistent with broader conventions.
3. As I mentioned in my previous mail, I am, in general, opposed to
the use of external tables. While they can be handy for centers
which exchange much the same data all the time, they are
problematic for researchers, who can sometimes get quite
"imaginative" and end up with things that aren't in the "official"
table. Complicating matters, there often tends to be more than one
"official" table. However, the fact that you are not replacing a
description with a code from a table reduces the problem
tremendously, so I'll go along on this one, given its evident
utility.
4. There seems to be a preference in your proposal for associating
additional qualities with the axes/coordinate-variables themselves
(eg, contraction, bounds, subcell, ancillary coordinate
variables). While this might be a clever way to associate the
added info with a multitude of data variables, it may also lead to
an expansion in the number of dimensions, since all this
additional information may not be applicable to every data
variable employing that dimension. In that case, a new dimension
which is essentially identical to an existing dimension will have
to be created. The alternative, historically, has been to use
referential attributes attached to the data variables to
specifically associate alternate sources of information. (See
http://www.unidata.ucar.edu/packages/netcdf/NUWG/draft.html, and
the short CDL sketch at the end of these general comments.) These
are also more general, as they are not limited to 1-D.
5. Your proposal does not rule out the use of referential attributes,
but neither does it endorse or exploit them. Any particular
reason? More generally, it would certainly be helpful (and brave)
if you would let us all know your thoughts along the lines of the
recent discussion concerning multidimensional coordinates.
6. Please, give us some (many!) CDL examples!
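To start the ball rolling, here is a minimal sketch of the
referential-attribute approach mentioned in point 4 (the variable
and attribute names are, of course, hypothetical). The attribute
named for the dimension points to an alternate coordinate variable,
so a duplicate dimension need never be created:
dimensions:
        lon = 20 ;
        lat = 10 ;
variables:
        float temp(lat,lon) ;
                temp:lat = "lat_alt" ;   // temp's "lat" coordinates live
                                         // in the variable "lat_alt"
        float lat_alt(lat) ;             // alternate latitudes for temp only
        float lat(lat) ;                 // default latitudes for everyone else
        float lon(lon) ;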
Specific comments:
------------------
- Section 3: Data types
I am not really a Comp-Sci person, so it's possible I'm missing
something critical about "byte" vs. "char" here. I've already
learned to cope with the signedness differences between our SGI's
and our Cray T90, and it didn't seem that difficult. But the
proposed change does mean existing applications will have to be
modified, never an exciting prospect. Also, I'm not sure how to
handle "char" in any mathematical expressions (eg, decompression).
What is it that "may become ambiguous in future versions of netCDF"
that is driving this?
- Sections 8 and 9: Axes and dimensionality of a data variable;
  Coordinate variables and topology
In the spirit of simplicity, I don't think I would make storage of
coordinate variables mandatory if they are simply 1,2,3,... (though
*we* always do store them). The generally accepted default of
assigning coordinates of "1, 2, 3, etc." seems reasonable, and most
software already seems to handle this.
I suppose the ability to define 0-dimensional variables could come
in handy, though such a quantity is probably more appropriately
stored as a global attribute. At least one plotting package here
(Ferret) cannot handle this, however. BTW, 0-D variables are an
extension to COARDS - you do not call attention to this.
I can see that there might be some use for variables with
more than 4 dimensions, but this is likely to frustrate some
existing utilities.
I very much like singleton dimensions. People and utilities are too
quick to toss a dimension when "contracting" (eg, averaging,
extracting), when there is still usable placement information.
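For example, after extracting a single level from a 3-D field, a
singleton vertical dimension can preserve the placement (a sketch;
names and sizes are hypothetical):
dimensions:
        lon = 20 ;
        lat = 10 ;
        lev = 1 ;                    // singleton: one level extracted
variables:
        float u500(lev,lat,lon) ;
        float lev(lev) ;             // holds 500.0 - the placement info
                lev:units = "hPa" ;
        float lat(lat) ;
        float lon(lon) ;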
- Section 11: Units
Exploiting the units database of UDUNITS is fine, but I am less
comfortable relying on the syntax of any particular package. What
does this gain us, especially if this is not an approved method of
"compression" (although it does serve as such)?
- Section 12: Physical quantity of a variable
- "units": I would like to see "none" added as a legitimate
characterization, as it would serve as a definite affirmation
that the variable really does have no units.
- "quantity": any time it is proposed that something be made
*mandatory*, I have to consider it long and hard. In this case,
it seems that the existing COARDS approach is already adequate
for deducing the sense of lat/lon/vertical/time dimensions. Why
is "quantity" so necessary? There also seems to be a potential
failure mode here in that someone could encode incompatible
"units" and "quantity". Nevertheless, I must concede that use
of a "quantity" out of some "official" table would make the
precise definition less prone to misinterpretation.
- "modulo": simple as it is, this is the clearest explanation I've
seen.
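(For the record, I take it a modulo longitude axis would look
something like the following sketch - I'm guessing at the exact
form of the value:)
dimensions:
        lon = 20 ;
variables:
        float lon(lon) ;
                lon:units = "degrees_east" ;
                lon:modulo = 360.f ;   // axis wraps every 360 degrees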
- Section 16: Vertical (height or depth) dimension
First, it seems as though you are proposing that the utility of the
COARDS "positive" attribute be replaced by the sense conferred on
the axis by its "quantity". If so, I don't agree. The presence of
"positive" in the file is preferable to a look-up in an extra
table.
Second, the direction of "positive" is not merely a display issue.
It is a critical piece of information that defines the "handedness"
of the coordinate system being defined.
Third, I'm not sure that the vertical dimension is necessarily
recognizable from the order of the dimensions. What about a "mean
zonal wind" that is Y-Z?
Fourth, you rightly note that requiring units for a vertical
coordinate which has no units "means defining dimensionless units
for any dimensionless quantity one might wish to use for that
axis". However, rather than be concerned with some inconsistency
of treatment relative to data variables, this brings up a larger
issue, namely: How does one recognize that an axis is "vertical"
if it is not a standard "quantity" and does not employ units that
look "vertical"? *Furthermore*, how does one recognize what
direction *any* axis points in if the "quantity" is not
georeferenced and the units are nondescript? For example, a
channel model here uses Cartesian horizontal coordinates
(units=meters) and a "zeta" hybrid vertical coordinate. Our local
solution to this dilemma is to attach an attribute "cartesian_axis"
to each coordinate variable that indicates to downstream
applications which (if any) cartesian axis each dimension is
associated with (values are "X|Y|Z|T|N"). Without this
information, we'd have to simply assume that the axes are in X-Y-Z
order (ie, we can't tell that the "mean zonal wind" is oriented Y-Z).
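In CDL, our local solution looks like this (dimension sizes are
just for illustration):
dimensions:
        x = 64 ;
        y = 64 ;
        zeta = 20 ;
variables:
        float x(x) ;
                x:units = "meters" ;
                x:cartesian_axis = "X" ;
        float y(y) ;
                y:units = "meters" ;
                y:cartesian_axis = "Y" ;
        float zeta(zeta) ;
                zeta:cartesian_axis = "Z" ;   // hybrid vertical coordinate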
- Section 17: Ancillary coordinate variables
You might want to emphasize that this is a lead-in to sections
18-21, which are different kinds of "ancillary coordinate
variables". One possible problem with the proposed definition is
that it is limited to 1-D. Thus, even if I calculate and store the
altitude of all the points in my sigma model, I can't associate it
with the arrays of temperature in sigma coordinates, etc.
Another possible problem is that this ancillary information might
not be applicable to all data variables employing that dimension.
(See general comments above.)
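For example, I would like to be able to write something like the
following (a sketch using a referential attribute, since the
proposed mechanism, being limited to 1-D, won't allow it):
dimensions:
        lon = 20 ;
        lat = 10 ;
        sigma = 9 ;
variables:
        float temp(sigma,lat,lon) ;
                temp:sigma = "height" ;    // multidimensional ancillary info
        float height(sigma,lat,lon) ;      // altitude of every model point
        float sigma(sigma) ;
        float lat(lat) ;
        float lon(lon) ;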
- Section 19: Associated coordinate variables
There is already a mechanism for "associating" additional
coordinate information without requiring yet another defined
attribute name: Referential attributes. I have typically seen them
attached to data variables, but I see no reason why they could not
be attached to main coordinate variables, too.
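eg, something like this (a sketch; the attribute and variable
names are hypothetical):
dimensions:
        lat = 10 ;
variables:
        float lat(lat) ;
                lat:geographic = "geolat" ;   // referential attribute on a
                                              // main coordinate variable
        float geolat(lat) ;                   // associated coordinate info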
- Section 21: Boundary coordinate variables
Is there any particular reason why you made the additional
dimension the slowest varying dimension? The information is all
there, of course, but my intuition would like to see the min's and
max's interleaved.
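That is, instead of the proposed ordering, I would interleave the
pairs (a sketch; names are hypothetical):
dimensions:
        lat = 10 ;
        bnds = 2 ;
variables:
        float lat_bnds(bnds,lat) ;   // proposed: bounds dim varies slowest
     // float lat_bnds(lat,bnds) ;   // my preference: min/max interleaved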
- Section 23: Contracted dimensions
I definitely like the idea of a "contraction" attribute to
document, in a general way, the operation that was performed.
Although I haven't tried either this approach or that from the NCAR
CSM conventions, I think this will be more general. We should,
though, get together and agree on a set of valid strings (eg, "min"
vs. "minimum").
However, there might be a problem with the assertion that the
contracted dimension is of size 1. How would I store, and
document, say, a time-series of monthly means?
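For instance, I would want to be able to write something like
this, where the contracted dimension has size 12, not 1 (a sketch;
the placement and value of "contraction" are just my guesses):
dimensions:
        time = 12 ;                  // 12 monthly means
variables:
        float tavg(time) ;
        double time(time) ;
                time:units = "days since 1979-01-01" ;
                time:contraction = "mean" ;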
I'm still not sure I understand simultaneous contractions. A CDL
example would help here.
I am still trying to figure out how I would use these bits of
metadata to store some data that I, personally, have had to deal
with. We take 3-D winds and calculate kinetic energy, then take a
vertical average (mean) over pressure from some pressure level
(say, 50mb) down to the earth's surface. Now, since the surface
pressure varies two-dimensionally, it seems that a dimension, being
1-D, will not be adequate to store the information about the
integration bounds. Any idea how I would do this?
- Section 24: Time axes
Suppose I have an idealized model that is not associated with any
calendar - it just ticks off hour after hour. How would I be
allowed to specify time in this case?
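What I would like is something as simple as this (sketch):
dimensions:
        time = UNLIMITED ;
variables:
        double time(time) ;
                time:units = "hours" ;   // no base date, no calendar implied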
- Section 26: Non-Gregorian Calendars
You picked the string "noleap" to indicate a calendar with 365 days
per year, but UDUNITS already has "common_year" defined for this.
Any particular reason for not using that?
Also, I would like to lobby to add "perpetual" (some of our
experiments use, eg, perpetual-January conditions), "none" (see
above), and "model". A "model" calendar would take the place of
your "360" and would allow for some number other than 360. Of
course, you'd need an additional auxiliary attribute to specify
just what that number is, maybe in terms of "days per month". For
a "perpetual" calendar, you'll also need an additional attribute to
indicate the Julian day of the year for which the "perpetual"
conditions are held constant.
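Something along these lines, perhaps (all of these attribute names
and values are hypothetical, of course):
dimensions:
        time = UNLIMITED ;
variables:
        double time(time) ;
                time:units = "days" ;
                time:calendar = "perpetual" ;
                time:perpetual_day = 15 ;    // Julian day of year held fixed
     // or, for a generic model calendar:
     //         time:calendar = "model" ;
     //         time:days_per_month = 30 ;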
- Section 27: Unimonth calendar
I see the attraction of being able to convert between any calendar
and "unitime". This is the same thing many of us do when we store
date/time as YYYYMMDDHHNNSS. I'm not opposed to this, but I
wouldn't want to use it in place of a UDUNITS-type representation.
Hopefully, UDUNITS will someday handle additional calendars. (Hint,
hint!)
One difficulty with this sort of an axis is that an otherwise
continuous coordinate now becomes non-continuous; ie, there are lots
of what appear to be "breaks" in the timeline. Utilities can be
made to realize that some processing is needed, but this will
require more work.
A (much) more significant difficulty w.r.t. calendars is comparing
variables which use two different calendars; eg, a climate model
run using 360-day years vs. observed data using the Gregorian
calendar. As far as I know, there really hasn't been any definitive
method laid out for doing this. My intuition tells me that there
really is no universal way to do it, since these two "planets"
don't really move the same way - each quantity is going to have
problems with one cycle or another. You might still have some luck
defining a method for certain types of quantities by converting to,
say, "fractional year". Annual and seasonal (ie, 1/4-year) averages
might be OK. But how would you define and compare June averages
from planet "A" with planet "B"? Suppose you wanted to calculate
4xdaily difference maps over the course of a year? If you do it
based on the definition of a day, one planet gets back to winter
slightly ahead of the other. If you do it in some sort of
fractional year, the diurnal cycles get out of synch. Any ideas?
- Section 28: Intervals of Time
Here we might have a conflict with my desire to be able to store
time simply as "days" or "hours" in the case of an idealized model
which has no calendar-sense. (See section 24 comments.) With your
proposal, such units would be interpreted as an *interval* of
time. Of course, that's sort of what it is if you take "time" as
being relative to the start of an integration, but I don't think
that's what you want, since one might still wish to calculate a
time "interval" for idealized cases, too.
I'm not sure how else one might handle this, though...
Also, unless I'm missing something, storing "monthly" data by
defining a "unitime" axis with units of "days" doesn't necessarily
buy us more than a "time" axis with units of "months". Both are
"non-standard" units that can be interpreted only after determining
the calendar type.
- Section 29: Multiple time axes and climatological time
It took me some time to grasp (I think) what you are doing here.
This seems a clever way to specify the time of data points by using
a set of "benchmark" points and another set of points that measure
time "wrt" those benchmark points. But some CDL examples
demonstrating its usefulness are critical here. It would seem that
the same information could be stored using other constructs in your
proposal, without the (significant) complications introduced by a
two-dimensional time coordinate. What would a time series of June
means look like with and without "wrt"?
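For what it's worth, without "wrt" I imagine a time series of June
means would look something like this (a sketch; each time marks the
middle of a June):
dimensions:
        time = 3 ;
variables:
        float tjune(time) ;
        double time(time) ;
                time:units = "days since 1979-01-01" ;
data:
        time = 166., 532., 897. ;   // approx. mid-June 1979, 1980, 1981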
- Section 30: Special surfaces
Here, again, I am uncomfortable with the use of an external table,
at least in the context of any "core" conventions. If it were
necessary to document a variable which is located at, eg, the
earth's "surface", one could use a referential attribute to record
the vertical location of each point:
dimensions:
        lon = 20 ;
        lat = 10 ;
        zsfc = 1 ;
variables:
        float tstar(zsfc,lat,lon) ;
                tstar:zsfc = "zstar" ;
        float zstar(lat,lon) ;
        float lat(lat) ;
        float lon(lon) ;
- Section 31: Invalid values in a data variable
As I mentioned before, I heartily agree with your distinction
between "invalid" and "missing" values. In practice, both are (or
should be) outside the "valid_range", in which case (most)
utilities/packages know to ignore them. But this is real, valuable
information content that has not been exploited in any conventions
to date.
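ie, something like this (a sketch):
dimensions:
        lon = 20 ;
        lat = 10 ;
variables:
        float temp(lat,lon) ;
                temp:valid_range = 180.f, 330.f ;
                temp:_FillValue = -999.f ;       // "invalid": never written
                temp:missing_value = -1.e36f ;   // "missing": deliberately so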
I'm not sure I agree with the inferences you are requiring on the
part of generic applications when it comes to using _FillValue to
define a default "valid_range". I guess if I were writing such an
application, this could make some sense. Still, the definition of
_FillValue as an invalid value should be enough to simply tell me
to ignore that point, nothing more.
- Section 32: Missing values in a data variable
I think that the data should be checked against the "missing_value"
*before* unpacking. First, I think there is already a pretty
strong convention that "missing_value" be of the same type as the
data. Second, some packages simply display the packed values, and
they wouldn't be able to detect missing values. Third, I've been
burned and confused often enough by varying machine precision to be
quite shy of comparing computed values.
However, handling missing values when unpacking packed data does
present a real problem! Imagine a subroutine which unpacks, say,
SHORT values into a FLOAT array. This routine will be able to
reliably detect missing values, but what value is it to put in the
FLOAT array? We solve this by storing a global FLOAT attribute
which specifies this number. If a file has no such attribute, we
stuff a default value in it. In any case, we inform the user of
what was used.
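In CDL, our scheme looks something like this (the global attribute
name is our own invention, not anything standard):
dimensions:
        lon = 20 ;
        lat = 10 ;
variables:
        short temp(lat,lon) ;
                temp:scale_factor = 0.01f ;
                temp:add_offset = 273.15f ;
                temp:missing_value = -32767s ;   // same type as the data

// global attributes:
                :unpacked_missing_value = -1.e36f ;   // value to put in the
                                                      // unpacked FLOAT array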
- Section 33: Compression by gathering
This seems like a compact way to store variables with large areas
of no interest, but it is kind of complicated. Something like the
following might be more intuitive:
dimensions:
        lon = 20 ;
        lat = 10 ;
        landpoint = 96 ;
        zsfc = 1 ;
variables:
        float soilt(zsfc,landpoint) ;
                soilt:zsfc = "zstar" ;
                soilt:landpoint = "mask" ;   // keyword = "mask" means to
                soilt:mask = "landmask" ;    // look for attrib = "mask"
        float zstar(lat,lon) ;
        byte landmask(lat,lon) ;             // contains 96 1's, the rest 0's
        float lat(lat) ;
        float lon(lon) ;
------------------------------------------------------------------------
That should do it! I hope I don't sound too negative. Quite the
opposite, in fact: I sincerely hope that your work prompts an update
to the existing conventions. Much of what you propose is new, and
quite necessary.
Cheers-
John P. Sheldon
(jps@xxxxxxxx)
Geophysical Fluid Dynamics Laboratory/NOAA
Princeton University/Forrestal Campus/Rte. 1
P.O. Box 308
Princeton, NJ, USA 08542
(609) 987-5053 office
(609) 987-5063 fax
---
No good deed goes unpunished.
---