Hi again Jonathan,

As promised, I am sending along a more detailed review of your proposed netCDF conventions. (For those reading along, this proposal is available at http://www-pcmdi.llnl.gov/drach/netCDF.html) This is prefaced by a few general comments, along with some musings about how netCDF might best satisfy the needs of users and applications developers.

General Comments
----------------

1. Conventions ought to be as simple and undemanding as possible, to make the use of netCDF as easy as possible. This may sound like a platitude, but one reason for netCDF's popularity to date has been the ease with which users can get started. And we all know how critical it is for the viability of a software product that it first be widely popular.

2. We (the netCDF community, or at least the ocean-atmosphere subset of us) might want to consider defining a "core" set of basic conventions, with extensions to suit specific groups. These core conventions should be broadly applicable and include basic principles for storing and documenting certain types of information. Groups specializing in climate, NWP, oceanography, observational datasets, etc., could then define additional conventions to facilitate data exchange amongst themselves, so long as their files remain consistent with the broader conventions.

3. As I mentioned in my previous mail, I am, in general, opposed to the use of external tables. While they can be handy for centers which exchange much the same data all the time, they are problematic for researchers, who can sometimes get quite "imaginative" and end up with things that aren't in the "official" table. Complicating matters is the fact that there often tends to be more than one "official" table. However, since you are not replacing a description with a code from a table, the problem is reduced tremendously, so I'll go along on this one given its evident utility.

4. There seems to be a preference in your proposal for associating additional qualities with the axes/coordinate-variables themselves (eg, contraction, bounds, subcell, ancillary coordinate variables). While this might be a clever way to associate the added info with a multitude of data variables, it may also lead to an expansion in the number of dimensions, since all this additional information may not be applicable to every data variable employing that dimension. In that case, a new dimension which is essentially identical to an existing dimension will have to be created. The alternative, historically, has been to use referential attributes attached to the data variables to specifically associate alternate sources of information. (See http://www.unidata.ucar.edu/software/netcdf/NUWG/draft.html) These are also more general, as they are not limited to 1-D.

5. Your proposal does not rule out the use of referential attributes, but neither does it endorse or exploit them. Any particular reason? More generally, it would certainly be helpful (and brave) if you would let us all know your thoughts along the lines of the recent discussion concerning multidimensional coordinates.

6. Please, give us some (many!) CDL examples!

Specific comments:
------------------

- Section 3: Data types

I am not really a Comp-Sci person, so it's possible I'm missing something critical about "byte" vs. "char" here. I've already learned to cope with the signedness differences between our SGI's and our Cray T90, and it didn't seem that difficult. But the proposed change does mean existing applications will have to be modified, never an exciting prospect. Also, I'm not sure how to handle "char" in any mathematical expressions (eg, decompression). What is it that "may become ambiguous in future versions of netCDF" that is driving this?
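For instance, here is a minimal sketch of the sort of packed variable I have in mind (using the usual scale_factor/add_offset packing attributes; the values are made up). With "byte" the unpacking arithmetic is clear enough; I don't know what it would mean if the variable were declared as "char":

        dimensions:
                lon = 20 ;
                lat = 10 ;
        variables:
                byte tpack(lat,lon) ;
                        tpack:scale_factor = 0.5f ;     // unpacked = packed*scale_factor + add_offset
                        tpack:add_offset = 250.0f ;
                        tpack:units = "K" ;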
- Section 8: Axes and dimensionality of a data variable
  Section 9: Coordinate variables and topology

In the spirit of simplicity, I don't think I would make storage of coordinate variables mandatory if they are simply 1,2,3,... (though *we* always do store them). The generally accepted default of assigning coordinates of "1, 2, 3, etc." seems reasonable, and most software already seems to handle this.

I suppose the ability to define 0-dimensional variables could come in handy, though such a quantity is probably more appropriately stored as a global attribute. At least one plotting package here (Ferret) cannot handle this, however. BTW, 0-D variables are an extension to COARDS - you do not call attention to this.

I can see that there might be some use for variables with more than 4 dimensions, but this is likely to frustrate some existing utilities.

I very much like singleton dimensions. People and utilities are too quick to toss a dimension when "contracting" (eg, averaging, extracting), when there is still usable placement information.

- Section 11: Units

Exploiting the units database of UDUNITS is fine, but I am less comfortable relying on the syntax of any particular package. What does this gain us, especially if this is not an approved method of "compression" (although it does serve as such)?

- Section 12: Physical quantity of a variable

- "units": I would like to see "none" added as a legitimate characterization, as it would serve as a definite affirmation that the variable really does have no units.

- "quantity": any time it is proposed that something be made *mandatory*, I have to consider it long and hard. In this case, it seems that the existing COARDS approach is already adequate for deducing the sense of lat/lon/vertical/time dimensions. Why is "quantity" so necessary? There also seems to be a potential failure mode here, in that someone could encode incompatible "units" and "quantity". Nevertheless, I must concede that use of a "quantity" out of some "official" table would make the precise definition less prone to misinterpretation.

- "modulo": simple as it is, this is the clearest explanation I've seen.

- Section 16: Vertical (height or depth) dimension

First, it seems as though you are proposing that the utility of the COARDS "positive" attribute be replaced by the sense conferred on the axis by its "quantity". If so, I don't agree. The presence of "positive" in the file is preferable to a look-up in an extra table.

Second, the direction of "positive" is not merely a display issue. It is a critical piece of information that defines the "handedness" of the coordinate system being defined.

Third, I'm not sure that the vertical dimension is necessarily recognizable from the order of the dimensions. What about a "mean zonal wind" that is Y-Z?

Fourth, you rightly note that requiring units for a vertical coordinate which has no units "means defining dimensionless units for any dimensionless quantity one might wish to use for that axis". However, rather than be concerned with some inconsistency of treatment relative to data variables, this brings up a larger issue, namely: How does one recognize that an axis is "vertical" if it is not a standard "quantity" and does not employ units that look "vertical"?

*Furthermore*, how does one recognize what direction *any* axis points in if the "quantity" is not georeferenced and the units are nondescript? For example, a channel model here uses Cartesian horizontal coordinates (units=meters) and a "zeta" hybrid vertical coordinate. Our local solution to this dilemma is to attach an attribute "cartesian_axis" to each coordinate variable that indicates to downstream applications which (if any) cartesian axis each dimension is associated with (values are "X|Y|Z|T|N"). Without this information, we'd have to simply assume that the axes are in X-Y-Z order (ie, we can't tell that "zonal mean wind" is oriented Y-Z).
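To illustrate, here is a stripped-down sketch of our local practice (the names "xi" and "zeta" are just examples in the spirit of the channel model mentioned above):

        dimensions:
                xi = 100 ;
                zeta = 30 ;
        variables:
                float xi(xi) ;
                        xi:units = "m" ;
                        xi:cartesian_axis = "X" ;       // horizontal, despite the non-geographic units
                float zeta(zeta) ;
                        zeta:units = "none" ;           // dimensionless hybrid coordinate
                        zeta:cartesian_axis = "Z" ;     // this is the vertical axis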
- Section 17: Ancillary coordinate variables

You might want to emphasize that this is a lead-in to sections 18-21, which describe different kinds of "ancillary coordinate variables".

One possible problem with the proposed definition is that it is limited to 1-D. Thus, even if I calculate and store the altitude of all the points in my sigma model, I can't associate it with the arrays of temperature in sigma coordinates, etc. Another possible problem is that this ancillary information might not be applicable to all data variables employing that dimension. (See general comments above.)

- Section 19: Associated coordinate variables

There is already a mechanism for "associating" additional coordinate information without requiring yet another defined attribute name: referential attributes. I have typically seen them attached to data variables, but I see no reason why they could not be attached to main coordinate variables, too.

- Section 21: Boundary coordinate variables

Is there any particular reason why you made the additional dimension the slowest-varying dimension? The information is all there, of course, but my intuition would like to see the min's and max's interleaved.

- Section 23: Contracted dimensions

I definitely like the idea of a "contraction" attribute to document, in a general way, the operation that was performed. Although I haven't tried either this approach or that from the NCAR CSM conventions, I think this one will be more general. We should, though, get together and agree on a set of valid strings (eg, "min" vs. "minimum").

However, there might be a problem with the assertion that the contracted dimension is of size 1. How would I store, and document, say, a time-series of monthly means?

I'm still not sure I understand simultaneous contractions. A CDL example would help here.

I am still trying to figure out how I would use these bits of metadata to store some data that I, personally, have had to deal with. We take 3-D winds and calculate kinetic energy, then take a vertical average (mean) over pressure from some pressure level (say, 50mb) down to the earth's surface. Now, since the surface pressure varies two-dimensionally, it seems that a dimension, being 1-D, will not be adequate to store the information about the integration bounds. Any idea how I would do this?

- Section 24: Time axes

Suppose I have an idealized model that is not associated with any calendar - it just ticks off hour after hour. How would I be allowed to specify time in this case?

- Section 26: Non-Gregorian Calendars

You picked the string "noleap" to indicate a calendar with 365 days per year, but UDUNITS already has "common_year" defined for this. Any particular reason for not using that?

Also, I would like to lobby to add "perpetual" (some of our experiments use, eg, perpetual-January conditions), "none" (see above), and "model". A "model" calendar would take the place of your "360" and would allow for some number other than 360. Of course, you'd need an additional auxiliary attribute to specify just what that number is, maybe in terms of "days per month". For a "perpetual" calendar, you'll also need an additional attribute to indicate the Julian day of the year for which the "perpetual" conditions are held constant.
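Something along these lines, perhaps (a sketch only - I'm assuming your calendar strings live in a "calendar" attribute, and "days_per_month" and "perpetual_day" are attribute names I am making up here for illustration):

        dimensions:
                time = UNLIMITED ;
                ptime = 120 ;
        variables:
                double time(time) ;
                        time:units = "days since 1-1-1 00:00:00" ;
                        time:calendar = "model" ;
                        time:days_per_month = 30 ;      // hypothetical auxiliary attribute
                double ptime(ptime) ;
                        ptime:units = "hours" ;
                        ptime:calendar = "perpetual" ;
                        ptime:perpetual_day = 15 ;      // hypothetical: Julian day of year held constant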
A "model" calendar would take the place of your "360" and would allow for some number other than 360. Of course, you'd need an additional auxiliary attribute to specify just what that number is, maybe in terms of "days per month". For a "perpetual" calendar, you'll also need an additional attribute to indicate the Julian day of the year for which the "perpetual" conditions are held constant. - Section 27: Unimonth calendar I see the attraction of being able to convert between any calendar and "unitime". This is the same thing many of us do when we store date/time as YYYYMMDDHHNNSS. I'm not opposed to this, but I wouldn't want to use it in place of a UDUNITS-type representation. Hopefully, UDUINTS will someday handle additional calendars. (Hint, hint!) One difficulty with this sort of an axis is that an otherwise continuous coordinate now become non-continuous; ie, there are lots of what appear to be "breaks" in the timeline. Utilities can be made to realize that some processing is needed, but this will require more work. A (much) more significant difficulty w.r.t. calendars is comparing variables which use two different calendars; eg, a climate model run using 360-day years vs. observed data using the Gregorian calendar. As far as I know, there really hasn't been any definitive method laid out for doing this. My intuition tells me that there really is no universal way to do it, since these two "planets" don't really move the same way - each quantity is going to have problems with one cycle or another. You might still have some luck defining a method for certain types of quantities by converting to, say, "fractional year". Annual and seasonal (ie, 1/4-year) averages might be OK. But how would you define and compare June averages from planet "A" with planet "B"? Suppose you wanted to calculate 4xdaily difference maps over the course of a year? If you do it based on the definition of a day, one planet gets back to winter slightly ahead of the other. If you do it in some sort of fractional year, the diurnal cycles get out of synch. Any ideas? - Section 28: Intervals of Time Here we might have a conflict with my desire to be able to store time simply as "days" or "hours" in the case of an idealized model which has no calendar-sense. (See section 24 comments.) With your proposal, such units would be interpreted as an *interval* of time. Of course, that's sort of what it is if you take "time" as being relative to the start of an integration, but I don't think that's what you want, since one might still wish to calculate a time "interval" for idealized cases, too. I'm not sure how else one might handle this, though... Also, unless I'm missing something, storing "monthly" data by defining a "unitime" axis with units of "days" doesn't necessarily buy us more than a "time" axis with units of "months". Both are "non-standard" units that can be interpreted only after determining the calendar type. - Section 29: Multiple time axes and climatological time It took me some time to grasp (I think) what you are doing here. This seems a clever way to specify the time of data points by using a set of "benchmark" points and another set of points that measure time "wrt" those benchmark points. But some CDL examples demonstrating its usefulness are critical here. It would seem that the same information could be stored using other constructs in your proposal, without the (significant) complications introduced by a two-dimensional time coordinate. 
- Section 30: Special surfaces

Here, again, I am uncomfortable with the use of an external table, at least in the context of any "core" conventions. If it were necessary to document a variable which is located at, eg, the earth's "surface", one could use a referential attribute to record the vertical location of each point:

        dimensions:
                lon = 20 ;
                lat = 10 ;
                zsfc = 1 ;
        variables:
                float tstar(zsfc,lat,lon) ;
                        tstar:zsfc = "zstar" ;
                float zstar(lat,lon) ;
                float lat(lat) ;
                float lon(lon) ;

- Section 31: Invalid values in a data variable

As I mentioned before, I heartily agree with your distinction between "invalid" and "missing" values. In practice, both are (or should be) outside the "valid_range", in which case (most) utilities/packages know to ignore them. But this is real, valuable information content that has not been exploited in any conventions to date.

I'm not sure I agree with the inferences you are requiring on the part of generic applications when it comes to using _FillValue to define a default "valid_range". I guess if I were writing such an application, this could make some sense. Still, the definition of _FillValue as an invalid value should be enough to simply tell me to ignore that point, nothing more.

- Section 32: Missing values in a data variable

I think that the data should be checked against the "missing_value" *before* unpacking. First, I think there is already a pretty strong convention that "missing_value" be of the same type as the data. Second, some packages simply display the packed values, and they wouldn't be able to detect missing values. Third, I've been burned and confused often enough by varying machine precision to be quite shy of comparing computed values.

However, handling missing values when unpacking packed data does present a real problem! Imagine a subroutine which unpacks, say, SHORT values into a FLOAT array. This routine will be able to reliably detect missing values, but what value is it to put in the FLOAT array? We solve this by storing a global FLOAT attribute which specifies this number. If a file has no such attribute, we stuff a default value in it. In any case, we inform the user of what was used.

- Section 33: Compression by gathering

This seems like a compact way to store variables with large areas of no interest, but it is kind of complicated. Something like the following might be more intuitive:

        dimensions:
                lon = 20 ;
                lat = 10 ;
                landpoint = 96 ;
                zsfc = 1 ;
        variables:
                float soilt(zsfc,landpoint) ;
                        soilt:zsfc = "zstar" ;
                        soilt:landpoint = "mask" ;      // keyword "mask" means to
                        soilt:mask = "landmask" ;       // look for attrib "mask"
                float zstar(lat,lon) ;
                byte landmask(lat,lon) ;                // contains 96 1's, the rest 0's
                float lat(lat) ;
                float lon(lon) ;

------------------------------------------------------------------------

That should do it! I hope I don't sound too negative. Quite the opposite, in fact: I sincerely hope that your work prompts an update to the existing conventions. Much of what you propose is new, and quite necessary.

Cheers-

John P. Sheldon  (jps@gfdl.gov)
Geophysical Fluid Dynamics Laboratory/NOAA
Princeton University/Forrestal Campus/Rte. 1
P.O. Box 308
Princeton, NJ, USA  08542
(609) 987-5053 office
(609) 987-5063 fax
---
No good deed goes unpunished.
---