It has been pleasing to read all the recent postings on proposed conventions. A lot of good work has gone into these. It is obvious that there is a real felt need to define and refine netCDF conventions. I hope these comments of mine will facilitate this process.

ABBREVIATIONS USED

I will refer to Gregory, Drach and Tett's "Proposed netCDF conventions for climate data" by the authors' initials, 'GDT'. I will abbreviate the "netCDF User's Guide for C" to 'NUGC'.

WHAT KIND OF CONVENTIONS ARE DESIRABLE?

The relevance of many of the issues raised by GDT and others is not restricted to climate data. I would like to see some (but not too many) additional generic conventions adopted with the same status as NUGC Sections 2.3.1 (Coordinate Variables) and 8.1 (Attribute Conventions). I suggest there should be a separate chapter in NUGC for conventions, including those now in Sections 2.3.1 and 8.1. There is no reason why the only standard names should be those of attributes; dimensions and variables could also have standard names.

I develop generic software, and the existence of non-generic conventions worries me because my software does not take these into account. Some conventions (e.g. using 'lat' as a standard name for latitude) are no problem if they exist only to assist humans. But my software certainly does not treat 'lat' as special, and I suspect it is unreasonable to expect it to do so.

It is important that software documentation specify the conventions used. So, for example, a particular geographically oriented package might have conventions such as:
- the maximum rank is 4
- dimensions must be named 'time', 'height', 'latitude', 'longitude'
- dimensions must appear in this order (but some may be omitted)

I consider such conventions too restrictive for a field as broad as 'climate data'. It would be useful to have lists of the conventions adopted by various packages. It might then be possible to find a reasonable set of common conventions.

I fully agree with John Sheldon (ncdigest 405) that "Conventions ought to be as simple and undemanding as possible". They should be as few, as general and as orthogonal as possible.

I have been disappointed at the poor level of conformance to the current conventions in NUGC. Section 8.1 is quite short and standardises only 15 attribute names (of which several are specified as ignorable by generic applications). Yet I have encountered far too many netCDF files which contravene Section 8.1 in some way. For example, we are currently processing the NCEP data set from NCAR; an extract follows. It is obvious that a great deal of effort has gone into preparing this data, with lots of metadata and (standard and non-standard) attributes, etc. But it is also obvious that there cannot be any valid data, because the valid minimum (87000) is greater than the maximum short (32767)! Moreover, Section 8.1 states that the type of valid_range should match that of the parent variable, i.e. it should be short, not float. Obviously the values given are unscaled external data values rather than internal scaled values.

    short slp(time, lat, lon) ;
        slp:long_name = "4xDaily Sea Level Pressure" ;
        slp:valid_range = 87000.f, 115000.f ;
        slp:actual_range = 92860.f, 111360.f ;
        slp:units = "Pascals" ;
        slp:add_offset = 119765.f ;
        slp:scale_factor = 1.f ;
        slp:missing_value = 32766s ;
        slp:precision = 0s ;

It would be useful to have a utility which checked netCDF files for conformance to these conventions. It could also provide other data for checking validity, such as counts of the valid and invalid data elements.
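For what it is worth, here is a sketch of how I believe the slp declaration above could be made to conform to Section 8.1, assuming the same packing is intended; the internal valid_range values are simply the external ones minus the add_offset:

    short slp(time, lat, lon) ;
        slp:long_name = "4xDaily Sea Level Pressure" ;
        slp:units = "Pascals" ;
        slp:add_offset = 119765.f ;
        slp:scale_factor = 1.f ;
        // same type as slp: 87000 - 119765 = -32765, 115000 - 119765 = -4765
        slp:valid_range = -32765s, -4765s ;
        slp:missing_value = 32766s ;  // outside the valid range, as required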
I guess I have to take some of the blame. I was one of the authors of NUGC, and I was largely responsible for rewriting Section 8.1 last year while I was working at Unidata. I tried to make it clearer and simpler. In particular, I tried to simplify the relationship between valid_range, valid_min, valid_max, _FillValue and missing_value. But it seems that we have failed to make the current conventions sufficiently clear and simple. And we need to be careful not to make things even harder for the writers of netCDF files by defining so many conventions that they need to be 'netCDF lawyers'.

NAMING CONVENTIONS AND RECOMMENDATIONS

I like Russ Rew's suggestion of some use for global attributes whose names match those of variables or dimensions. But I am not sure what the best use might be! I also like Russ Rew's suggestion of allowing periods ('.') in names.

I think there should be a recommendation that names consist of whole words unless there is some strong reason to do otherwise. So 'latitude' would be preferred to 'lat'. Note that such full-word variable names often obviate the need for a 'long_name' attribute.

GDT suggest avoiding case sensitivity. I do not think this is the kind of thing which should be standardised in a convention. Instead it should be recommended as good practice. But it is sometimes natural to use names such as "x" and "X".

I suggest recommending that dimension names be singular rather than plural. Thus 'point' is better than 'points'.

COORDINATE VARIABLES

There is a need to allow string values (e.g. station names) for coordinate variables. But I disagree with Russ Rew on allowing 2D NUMERIC coordinate variables for such things as dates. Station names are essentially nominal values with no inherent ordering, whereas dates are ordered and are more simply represented by a single number than by some kind of multi-base number. I am yet to be convinced that any of the multi-dimensional coordinate variable ideas is basic enough to deserve adoption.

GDT suggest several new attributes for coordinate variables. In particular, my impression is that they propose representing longitude somewhat as follows:

    float lon(lon);
        lon:long_name = "longitude";
        lon:quantity = "longitude";
        lon:topology = "circular";
        lon:modulo = 360.0f;
        lon:units = "degrees_east";

There is a lot of redundancy here, especially if 'lon' is the standard name for longitude. I would prefer to replace the above by:

    float longitude(longitude);
        longitude:modulo = 360.0f;
        longitude:units = "degrees_east";

Here 'longitude' is the standard name for longitude, but this is relevant only to users, not software. The special properties which software needs to know about are given by the attributes 'modulo' and 'units'. The other proposed attributes, 'quantity' and 'topology', do not appear to provide any useful additional information.

But I do like the idea of 'modulo' for cyclic variables such as longitude. I suggest the monotonicity requirement should be relaxed if modulo is specified. Instead there should be a uniqueness requirement, disallowing any value included in a previous interval. So the longitudes could be (0 90 180 -90) but not (0 90 180 270 360), since 360 is equivalent to 0. It follows that the total range must be less than 360. The following would also be illegal: (-180 -90 0 315), since 315 is equivalent to -45, which is covered by the first interval from -180 to -90.
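To make the relaxed requirement concrete, here is a minimal sketch (using the legal values above) of a cyclic coordinate variable which is unique modulo 360 but not monotonic:

    dimensions:
        longitude = 4;
    variables:
        float longitude(longitude);
            longitude:modulo = 360.0f;
            longitude:units = "degrees_east";
    data:
        longitude = 0, 90, 180, -90;  // unique modulo 360, though not monotonic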
Some cyclic variables (e.g. month of year) have a non-zero origin. So we could also have an attribute called, say, 'modulo_origin' (default 0), as in:

    short moy(moy);
        moy:long_name = "month of year";
        moy:modulo = 12;
        moy:modulo_origin = 1;

but I doubt whether this is really worthwhile.

I wish to propose allowing missing (invalid) values in coordinate variables. All corresponding data in the main variable would also have to be missing. In particular, this would simplify the problem of calendar dimensions which GDT discuss. You could simply allocate 31 days to every month and set the data for illegal dates (e.g. 30 Feb) to a missing value. Note that the extra space required is only 1.8% (31 * 12 = 372 days per year versus about 365.25).

I disagree with GDT's suggestion that every dimension have a coordinate variable. This would triple the space required for the following time series:

    dimensions:
        time = 1000000;
    variables:
        float time(time);
        short temperature(time);
            temperature:add_offset = 300.0f;
            temperature:scale_factor = 0.1f;

Note that it is not possible in this case to use a short for time, since 1000000 different values are needed.

It would be nice (especially for such time series) to have an efficient way of specifying a coordinate variable with constant steps, i.e. an arithmetic progression (AP). I propose doing this by replacing the rule that the shape of a coordinate variable consist of just the dimension with the same name. The new rule would allow any single dimension of any size (including 0). (There is of course also the issue of whether multiple dimensions should be allowed.) Then any trailing undefined elements of a coordinate variable would be defined as an AP as follows:

    If there is a coordinate variable with size > 1, then
        AP = var(0), var(1), ..., var(size-1), a+d, a+2d, a+3d, ...
        where a = var(size-1) and d = var(size-1) - var(size-2).
    If size = 1, then d defaults to 1, so AP = var(0), var(0)+1, var(0)+2, var(0)+3, ...
    If size = 0, then a defaults to 0 and d defaults to 1, so AP = 0, 1, 2, 3, ...
    If there is no coordinate variable, then again AP = 0, 1, 2, 3, ...

So if the time vector is the AP (100, 100.5, 101, ...) days, then the above example could be written as either:

    dimensions:
        time = 1000000;
        zero = 0;
    variables:
        int time(zero);  // datatype is irrelevant
            time:add_offset = 100.0;
            time:scale_factor = 0.5;
            time:units = "days";
        short temperature(time);
            temperature:add_offset = 300.0f;
            temperature:scale_factor = 0.1f;

or:

    dimensions:
        time = 1000000;
        two = 2;
    variables:
        double time(two);
            time:units = "days";
        short temperature(time);
            temperature:add_offset = 300.0f;
            temperature:scale_factor = 0.1f;
    data:
        time = 100.0, 100.5;

UNITS

I often see netCDF files with units unknown to udunits. I just want to underline GDT's specification that the only legal units are those in the current standard udunits.dat file. Any other 'units' should be specified in some other manner, e.g. by some non-standard attribute.

I suggest recommending plural rather than singular units (if the unit is an English word). Thus 'days' rather than 'day'. But do not attempt to pluralise something which is not a word, like 'degC'!

John Sheldon in ncdigest 405 suggested allowing units = "none". I prefer units = " ", which already works with the current version of udunits.

STEVE EMMERSON'S POSTING TO NCDIGEST 408

I found Steve's distinction between 'manifold' and 'base' useful. I agree that the netCDF coordinate variable convention has caused confusion by using the same name for both. The convention works fine for the traditional (John Caron's "classic #1") case, but does not generalise naturally.
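For reference, here is a minimal sketch of that classic case, in which the one name 'time' serves as both the manifold (the dimension) and the base (the coordinate values):

    dimensions:
        time = 100;
    variables:
        double time(time);        // coordinate variable: same name as its dimension
        float temperature(time);  // data variable defined on this manifold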
I look forward to the next installment, when Steve finally reveals what in the world the third element 'WORLD' is!! :-)

JOHN CARON'S MOTIVATING EXAMPLES

I like the idea of this list. John's examples 2 and 9 are both examples of non-gridded (irregular or scattered) data. The same single index is used in separate vectors to get coordinates and data values. Example 3 simply generalises this to multiple subscripts. A satellite data example is:

    float radiance(line, pixel);
    float latitude(line, pixel);
    float longitude(line, pixel);

which is essentially the same as:

    float radiance(point);
    float latitude(point);
    float longitude(point);

It seems to me that examples 4 and 10 are just mixtures of these with the classic example 1.

Example 5 needs more detail. I assume var has other dimensions. I seem to remember Russ Rew suggesting a 2D coordinate variable for this case, along the lines of:

    dimensions:
        latitude = 56;
        longitude = 64;
        level = 10;
        range = 2;
    variables:
        float level(level, range);  // level(k,0) = bottom, level(k,1) = top
        float var(latitude, longitude, level);

This has some appeal, but it does not seem basic enough to justify generalising coordinate variables to 2D.

Example 6 has too little information for me to understand. If you simply want a non-georeferencing example, then why not use Steve Emmerson's (now already famous) spiral wire example? But this is essentially the same as 2 and 9.

I found example 8 unnecessarily complex. I assume corr_var(lat1, lon1, lat2, lon2) gives the correlation between the point (lat1, lon1) and the point (lat2, lon2). A simpler case is the following, involving annual precipitation measured at 100 stations:

    dimensions:
        year = UNLIMITED;
        station = 100;
    variables:
        float precipitation(year, station);
        float precipitation.correlation(station, station);

where precipitation.correlation(i,j) is the correlation between precipitation at station i and precipitation at station j. This could also be used in place of example 7. The precipitation for each year is a vector of 100 elements. The calculation of a correlation matrix requires that these 100 values all be in the same array, rather than in 100 separate variables.

An example I would like to add is the following, which is a variation of one someone (I forget who) posted recently. Note that this is gridded data, unlike the above examples. Let's call it the 'Sparse Gridded' example. In this case it obviates the need to store missing values for ocean points in the main array (at the lesser cost of storing missing values in a pointer array):

    dimensions:
        time = UNLIMITED;
        latitude = 56;
        longitude = 64;
        land_point = 1100;  // about 30% of 56*64
    variables:
        float latitude(latitude);
        float longitude(longitude);
        short land_index(latitude, longitude);  // value of 'land_point' index
            land_index:valid_min = 0s;
            land_index:missing_value = -1s;  // -1 = ocean point
        float soil.temperature(time, land_point);

COMMENTS ON SPECIFIED SECTIONS OF GDT

SECTION 5: Global Attributes

I have found the history attribute especially useful for variables calculated from some other variable (in the same or a different file). This provides the information which GDT suggest putting into an attribute called 'contraction' (see Section 23 and Appendix B). This raises the possibility of allowing a history attribute for each variable as well as a global history attribute. The wording (mine, I must confess!) in Section 8.1 of NUGC needs changing to make it clear that a line should be appended whenever a variable is created or modified.
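As a sketch of what such a per-variable 'history' might look like (the variable name, timestamp and file name here are invented for illustration):

    float mean.sst(latitude, longitude);
        mean.sst:long_name = "arithmetic mean over time of sea surface temperature";
        mean.sst:history = "1997-03-07 09:15 ncrob: mean over time of sst from sst.nc";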
I'm afraid I don't like any of the other proposed new attributes. I prefer the name 'source' (as in the CSM Conventions) to 'institution' and 'production'. The 'conventions' attribute should include any version numbers, etc., rather than having additional attributes such as 'appendices'. I am not convinced of the need for a 'calendar' global attribute; calendars are discussed further below.

SECTION 8: Axes and dimensionality of a data variable

I like the distinction between 'axes' and 'dimensions'. But it may be too late to change our terminology. In fact, the very first sentence in this section uses 'dimensions' when 'axes' seems to be implied! I would state that each axis is associated with a dimension, and that it is possible for several axes to share the same dimension. I prefer the normal term 'rank' to the rather antiquated and clumsy term 'dimensionality'.

The only apparent reason to limit the rank or specify the order of axes would be compatibility with specific packages. One can always use programs such as my ncrob to reduce the rank or transpose axes to any desired order.

Dimension sizes of 0 (as well as 1) should be allowed. The most common case for 0 is the unlimited dimension, but others are occasionally useful (e.g. my proposal above defining a coordinate variable as an AP). A size of 0 is really just one possible cause of there being no valid data - a situation which software should be able to handle.

SECTION 12: Physical quantity of a variable

I am unhappy with this proposed 'quantity' attribute, e.g.

    float z;
        z:quantity = "height";

Why not simply standardise the name of the variable? E.g.

    float height;

In cases where there is a need to distinguish between similar variables, there could be a convention specifying part of the name. E.g. temperatures could be indicated by the suffix '.temperature', as in:

    float surface.air.temperature;
    float soil.temperature;

using periods as suggested in Russ Rew's posting to ncdigest 405. And if there were two different latitude variables, these could be named, say, latitude.1 and latitude.2.

But I do agree that there is a need for something more than just the 'units' attribute to give information about the nature of a variable. In particular, a statistical procedure may want to calculate measures of location, dispersion, etc. appropriate to the level of measurement (nominal, ordinal, interval or ratio). For example:

    If level = ratio    then calculate geometric-mean and coefficient-of-variation.
    If level = interval then calculate arithmetic-mean and standard-deviation.
    If level = ordinal  then calculate median and semi-inter-quartile-range.
    If level = nominal  then calculate mode.

So I propose an attribute 'measurement_level' with the value "ratio", "interval", "ordinal" or "nominal". The default should be "interval", since:
- this includes "ratio" and thus covers most physical measurements
- the "interval" property is adequate for most purposes (as shown by the ubiquity of the arithmetic mean and standard deviation).

I have never been happy with having both FORTRAN_format and C_format giving essentially the same information (although it is usually possible to derive one from the other). It might be better to replace

    z:FORTRAN_format = "F8.2";
    z:C_format = "%8.2f";

by some language-independent attributes such as

    z:format_type = "F";
    z:format_width = 8;
    z:format_decimal_places = 2;

And if the variable is scaled (using scale_factor and add_offset), these should apply to the external value rather than to the internal value (which is what C_format applies to).
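To make this concrete, here is a sketch of a packed variable using the language-independent format attributes just proposed (these attribute names are my suggestion, not an existing convention); note that the format describes the external value:

    short z(time);
        z:scale_factor = 0.01f;
        z:add_offset = 0.0f;
        z:format_type = "F";          // fixed-point
        z:format_width = 8;
        z:format_decimal_places = 2;  // applies to the external value, so an
                                      // internal 12345 prints as "  123.45"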
SECTION 16. Vertical (height or depth) dimension

Is there any reason why one could not simply adopt the single standard variable name 'height' and handle depths as negative heights? E.g.

    float height;
        height:long_name = "depth";
        height:units = "-1 * metres";  // udunits can handle this

I suspect this also obviates the need for the "positive" attribute.

SECTION 20. Bundles

I agree that there is a need for string-valued coordinate variables. These are an example of the nominal measurement level. The issue of whether a variable is continuous or discrete is related to measurement level to some extent (a nominal variable cannot be continuous). But even many ratio variables (e.g. counts) are discrete.

SECTION 22. Point values versus average values

I agree that this distinction is important. The rainfall example suggests a third alternative - a value integrated (accumulated) over intervals along one or more axes. I don't like the name 'subcell' - it does not have the desired connotations to me. Maybe something like

    <var>:point_value = 0;  // 0 for false, 1 for true

would be clearer.

SECTION 23. Contracted dimensions

Processes (e.g. sum) which reduce the rank are often called 'reductions'. The proposal here is to contract dimensions (axes?) to size 1 rather than eliminating them, so I suppose 'contraction' may be a reasonable term. This does document how such a variable was calculated (especially if boundaries are given via a 2D coordinate variable). But surely the 'history' attribute should provide this information. However, as I mentioned before, I can see that a file with many variables could have a very long global history attribute, and it might be better to also allow each variable to have its own 'history' attribute. (I prefer to limit the number of main variables in a file to a very small number, usually one.)

The 'history' attribute should provide a complete audit trail of every task (e.g. input, copy, reduction, other arithmetic) which created and modified the variable. But there are real problems when the whole process involves a series of tasks creating temporary values in memory, etc. I suspect the best solution is to write a full log to the global attribute 'history'.

I can see benefit in standardising the names of reductions (contractions), say:

    sum
    mean or am: arithmetic mean
    gm: geometric mean
    median
    mode
    prod or product
    count: number of non-missing values
    var: variance
    sd: standard deviation
    min: minimum
    max: maximum

But I would suggest using these as part of a standard variable naming convention, as shown by the following examples:

    min.soil.temperature
    weighted.mean.sea.surface.temperature

The obvious problem here is that names may become inconveniently long. Perhaps we could use standard abbreviations for the variable name itself, but use full words for the 'long_name'. E.g.

    float sst(latitude, longitude);
        sst:long_name = "sea surface temperature";
    float wam.sst;
        wam.sst:long_name = "weighted arithmetic mean of sea surface temperature";

based on a naming convention for reductions where the prefix 'w' means 'weighted', so 'wsum' means 'weighted sum' and 'wam' means 'weighted arithmetic mean'.

SECTION 24. Time Axes

The CSIRO Division of Atmospheric Research (DAR) routinely uses two time axes, 'year' and 'month'. Nevertheless, I agree with GDT that there should be only one time axis, although there must obviously be convenient ways of calculating reductions for particular months, etc. Again I would prefer to standardise the variable name rather than introduce yet another attribute (i.e. 'quantity').
I suggest both 'time' and '<something>.time' should be allowed.

Note that climatologists often use the units 'year' and 'month' in a deliberately imprecise manner. The exact definitions of month and year are irrelevant; all that matters is that there are 12 months in a year. Climate data should normally use a time axis with a unit of a day, month or year (or some multiple of these). If the unit is a day, then there should be a fixed number of days in each month (31 for 'normal' calendars such as the Gregorian). The time coordinate variable should have a missing value for each day which does not exist in the calendar used. I think this obviates the need for the 'calendar' global attribute and allows for most kinds of calendars without having to hard-code them into a standard.

I agree that date/time should be represented by a single number. I suggest the form YYYYMMDD.d, where d is a fraction of a day. So 19970310.5 represents noon on March 10, 1997. Similarly, year/month is represented by YYYYMM. But such values are not acceptable to udunits and therefore cannot be used for time coordinate variables.

    int time(time);
        time:units = "days since 1995-01-01 00:00";
        time:valid_min = 0;
        time:missing_value = -1;  // for a day which does not exist in the calendar
    int YYYYMMDD(time);
        YYYYMMDD:long_name = "date in form YYYYMMDD";
        YYYYMMDD:valid_min = 10101;  // i.e. 0001-01-01 (leading zeros would be octal)
        YYYYMMDD:missing_value = 0;  // for a day which does not exist in the calendar
    data:
        time = 0, 1, .., 27, 28, 29, 30,
               31, 32, .., 58, -1, -1, -1,
               59, 60, ..,
        YYYYMMDD = 19950101, 19950102, .., 19950128, 19950129, 19950130, 19950131,
                   19950201, 19950202, .., 19950228, 0, 0, 0,
                   19950301, 19950302, ..

Thus a package might provide a (binary?) search function 'index' which could be used as follows to calculate the arithmetic mean of the sst values for JJA (June, July, August) 1995:

    mean = am(sst(index(YYYYMMDD, 19950601) : index(YYYYMMDD, 19950831)))

SECTION 25. Gregorian calendar

I don't like the mixed Gregorian/Julian calendar (with a fixed conversion date of 1582-10-15) apparently used by udunits. I would prefer it to assume Gregorian unless explicitly specified otherwise, e.g.:

    units = "days since 1995-01-01 00:00 Julian"

I suspect that the most likely use of the Julian calendar would be for places such as Russia, which I believe used it up until the Revolution early this century! But what about other calendars? There are calendars in China and India which are still very widely used. A Chinese oceanographer colleague informs me that the Chinese calendar is still used in oceanography, in particular for tide work. I am not suggesting this is a high-priority item, but udunits should allow for the incorporation of such calendars in the future.

SECTION 27. Unimonth calendar

I prefer the "fixed-length month with missing days" proposal above.

SECTION 31. Invalid values in a data variable

There is a need to clarify terminology. I use 'missing' and 'invalid' interchangeably. But I do appreciate that it might be better English to consider a missing value as a special kind of valid value. However, the NUGC 8.1 conventions state that:
- all values outside the valid range should be considered missing values
- any specified missing_value attribute should be outside the valid range.

I assume any value inside the valid range is valid, and any value outside it is invalid. (I must confess that I may be partly to blame for this confusion in terminology in 8.1.) So I suspect we need another term for 'bad' data due to some error. Terms which come to mind include 'error values' and 'bad data'.
Australians might call it 'crook data'! :-(

The process of validating data should test for such bad data. But it is unreasonable to expect generic applications to do more than test whether values are within the valid range. And it is preferable to specify only one of valid_min and valid_max, so that the application has to do only one test rather than two.

As mentioned above, I strongly believe that coordinate variables should be allowed to have missing values. Interpolation should act as if the missing slabs were deleted.

Note that all the following should be of the same data type as the parent variable: _FillValue, missing_value, valid_range, valid_min, valid_max.

The final paragraph is similar to the rule in NUGC 8.1 for defining a default valid range in the absence of any of valid_range, valid_min or valid_max. It ensures that _FillValue is invalid.

SECTION 32. Missing values in a data variable

Note that NUGC 8.1 states that missing_value should be ignored by software. Its purpose is merely to inform humans that this special (invalid) value is used to represent missing data. The fact that the value is invalid implies that it will be treated as missing when read. So I disagree with the last two sentences of this section, which refer to software using the missing_value.

Applications will typically write the value of _FillValue to represent undefined or missing data. But there is nothing to stop them writing any other invalid value, including any of the values in the missing_value attribute (which can be a vector).
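To pull these rules together, here is a sketch of a packed variable (the names and values are hypothetical) which satisfies both NUGC 8.1 and the suggestions above:

    short temperature(time);
        temperature:scale_factor = 0.1f;
        temperature:add_offset = 300.0f;
        // only one test needed: internal values below valid_min are invalid
        temperature:valid_min = -20000s;
        // same type as the parent variable, and outside the valid range
        temperature:_FillValue = -32767s;
        // informational only (ignored by software); may be a vector
        temperature:missing_value = -32767s, -30000s;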