Re: standard handling of scale/offset and missing data

Hi Harvey, thanks for your thoughts, my comments are below.

One clarification: VariableStandardized is read-only. I will (eventually) add a writable version, in which case I would certainly follow your conventions. My task right now is to define behavior which does a reasonable job on (important) existing datasets. So I am more motivated to bend the rules than if I were trying to define them.

Davies, Harvey wrote:


As the main author of the "Attribute Conventions" section in the User's
Guide, I must take responsibility for what is not clear. The following
comments are intended to make these conventions clearer (but these are
just my personal opinions).


For example, in practice, valid_range seems to be in unpacked units rather than packed. The manual is not that clear (to me) and I could imagine it being used both ways.


I find all the terms 'packed/unpacked', 'scaled/unscaled' and
'raw/converted' confusing. (I used the terms 'packed/unpacked' in the
User's Guide, but I now regret this.) We need terms which suggest
'actual value written on disk' and 'logical data value in memory'. How
about 'external' and 'internal'? Any other suggestions?

OK.


I will use 'internal' and 'external' in the following.  For example, the
internal type is often float with an
external type of short.

Re 'valid_range'.  This should be an external type and value, as should
valid_max, valid_min, _FillValue and
missing_value.  But add_offset and scale_factor should be an internal
type/value.


- ---------------------------
public class VariableStandardized extends Variable

A "standardized" read-only Variable which implements:
  1) packed data using scale_factor and add_offset
  2) invalid data using valid_min, valid_max, valid_range, missing_data or _FillValue


I assume you mean 'missing_value', which I believe should be ignored on
input (see below).

Yes, I meant missing_value.



if those "standard attributes" are present. If they are not present, it acts just like the original Variable.

Implementation rules for scale/offset:
1) If scale_factor and/or add_offset variable attributes are present, then this is a "packed" Variable.
2) The Variable element type is converted to double, unless the scale_factor and add_offset attributes are both of type float, in which case it is converted to float.
3) Packed data is converted to unpacked data transparently during the read() call.


I am happy with these three rules.
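For concreteness, the three rules could be sketched roughly like this (a hypothetical standalone sketch, not the actual VariableStandardized code; the class and method names here are mine):

```java
// Hypothetical sketch of the scale/offset rules above -- not the actual
// VariableStandardized implementation.
public class UnpackDemo {
    // Rule 3: external (packed) value -> internal (unpacked) value.
    // Rule 2: computed in double unless both attributes are of type float.
    public static double unpack(short external, double scaleFactor, double addOffset) {
        return external * scaleFactor + addOffset;
    }

    public static void main(String[] args) {
        // e.g. data packed into shorts with scale_factor 0.5 and add_offset 10.0
        System.out.println(unpack((short) 1250, 0.5, 10.0)); // prints 635.0
    }
}
```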


Implementation rules for missing data:
1) if valid_range is present, the valid_min and valid_max attributes are ignored. Otherwise, the valid_min and/or valid_max attributes are used to construct a valid range.


The User's Guide states it is illegal to have valid_range if either
valid_min or valid_max is defined.  If
such a file exists in practice, I consider it better to force the user to
delete attributes to avoid such
ambiguity.

I guess the problem is that there's no library enforcement of such conventions, and so I am inclined to relax the rules if it doesn't cause confusion.
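The precedence in rule 1 could be sketched like this (my own illustrative code; attribute values are passed in as nullables rather than read from a real netCDF file):

```java
// Hypothetical sketch of rule 1: valid_range wins; otherwise valid_min
// and/or valid_max build the range, with missing ends left unbounded.
public class ValidRangeDemo {
    public static double[] validRange(double[] validRangeAttr, Double validMin, Double validMax) {
        if (validRangeAttr != null) return validRangeAttr; // valid_range takes precedence
        double lo = (validMin != null) ? validMin : Double.NEGATIVE_INFINITY;
        double hi = (validMax != null) ? validMax : Double.POSITIVE_INFINITY;
        return new double[] { lo, hi };
    }

    public static void main(String[] args) {
        double[] r = validRange(null, 5.0, null);
        System.out.println(r[0] + " .. " + r[1]); // prints 5.0 .. Infinity
    }
}
```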



2) a missing_value attribute may also specify a scalar or vector of missing values.


Yes, but note that this attribute is merely a hint for output & should be
ignored on input.

I don't understand why you ignore it on input. What if there is no valid_range specified? What if the missing_value is inside the valid_range?



3) if there is no missing_value attribute, the _FillValue attribute can be used to specify a scalar missing value.


For what purpose?  This could be reasonable on input if you are defining an
internal missing value, but
my understanding of your proposal is that you are simply defining an array
of data.

I'm not sure if I understand. Through the hasMissing() and isMissing() methods I am providing a service of knowing when the data is missing/invalid.


Before writing the section, I thought long and hard about the relation
between valid range, missing_value and _FillValue. We finally agreed to
essentially deprecate missing_value for simplicity. On input, if there
is a valid_range then any value outside this is considered missing. If
there is no valid_range then _FillValue defines a valid max if it is
positive, otherwise it defines a valid min. On output, missing data may
be written as any value outside the valid range. However, a particular
application may choose to use the missing_value (or an element of it if
it is a vector) as the value to write for missing data. So it would
make sense for generic applications to use the 1st element of
missing_value for output (provided this was outside the valid range).
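The _FillValue input rule above could be sketched like this (my own sketch; treating the bound as inclusive of the fill value itself is my assumption):

```java
// Hedged sketch of the rule described above: with no valid_range, a
// positive _FillValue acts as a valid max and a negative one as a
// valid min. Values at or beyond the fill value are considered missing.
public class FillValueDemo {
    public static boolean isMissing(double val, double fillValue) {
        return (fillValue > 0) ? (val >= fillValue) : (val <= fillValue);
    }

    public static void main(String[] args) {
        System.out.println(isMissing(-999.0, -999.0)); // prints true
        System.out.println(isMissing(0.0, -999.0));    // prints false
    }
}
```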

OK, I understand _FillValue better, thanks. Two things though: 1) it seems reasonable to pre-fill an array with valid values, since perhaps only a few data points need to be written that way; the above rules would seem to preclude this. 2) Is the default fill value supposed to operate the same way? If not, it seems funny that they might have radically different meanings.



Implementation rules for missing data with scale/offset:
   1) valid_range is always in the units of the converted (unpacked) data.


NO!!! See above.

The problem is that many important datasets use the internal units. I think there's a good argument that it is more natural, since those would be the units a human would think in. Is there anything in the current manual that specifies this? I just reread it and I don't see it.



2) _FillValue and missing_value are always in the units of the raw (packed) data.


I agree.


If hasMissingData() is true, then isMissingData(double val) is called to determine if the data is missing. Note that the data is converted and compared as a double.
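Putting the pieces together, the hasMissingData()/isMissingData() service might look roughly like this (a hypothetical sketch, not the real class; plain fields stand in for the variable attributes, and all comparisons are done as doubles, per the above):

```java
// Hypothetical sketch of the missing-data checks; fields stand in for
// the netCDF attributes discussed in this thread.
public class MissingDataCheck {
    public Double validMin, validMax;   // from valid_range or valid_min/valid_max
    public double[] missingValues = {}; // from missing_value (scalar or vector)
    public Double fillValue;            // from _FillValue

    public boolean hasMissingData() {
        return validMin != null || validMax != null
            || missingValues.length > 0 || fillValue != null;
    }

    public boolean isMissingData(double val) {
        if (validMin != null && val < validMin) return true;
        if (validMax != null && val > validMax) return true;
        for (double mv : missingValues)
            if (val == mv) return true;
        return fillValue != null && val == fillValue;
    }

    public static void main(String[] args) {
        MissingDataCheck m = new MissingDataCheck();
        m.validMin = 0.0;
        m.validMax = 100.0;
        System.out.println(m.isMissingData(-1.0)); // prints true
        System.out.println(m.isMissingData(50.0)); // prints false
    }
}
```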


Harvey Davies, CSIRO Atmospheric Research,
Private Bag No. 1, Aspendale 3195
E-mail: harvey.davies@xxxxxxxxxxxx
Phone: +61 3 9239 4556
  Fax: +61 3 9239 4444

