Re: standard handling of scale/offset and missing data

Hi Harvey, thanks for your thoughts, my comments are below.

One clarification: VariableStandardized is read-only. I will (eventually) add a writable version, in which case I would certainly follow your conventions. My task right now is to define behavior which does a reasonable job on (important) existing datasets. So I am more motivated to bend the rules than if I were trying to define them.

Davies, Harvey wrote:


As the main author of the "Attribute Conventions" section in the User's
Guide, I must take responsibility for what is not clear. The following
comments are intended to make these conventions clearer (but these are
just my personal opinions).


For example, in practice, valid_range seems to be in unpacked units rather than packed. The manual is not that clear (to me) and I could imagine it being used both ways.


I find all the terms 'packed/unpacked', 'scaled/unscaled' and
'raw/converted' confusing. (I used the terms 'packed/unpacked' in the
User's Guide, but I now regret this.) We need terms which suggest
'actual value written on disk' and 'logical data value in memory'. How
about 'external' and 'internal'? Any other suggestions?

OK.


I will use 'internal' and 'external' in the following.  For example, the
internal type is often float with an
external type of short.

Re 'valid_range'.  This should be an external type and value, as should
valid_max, valid_min, _FillValue and
missing_value.  But add_offset and scale_factor should be an internal
type/value.


- ---------------------------
public class VariableStandardized extends Variable

A "standardized" read-only Variable which implements:
  1) packed data using scale_factor and add_offset
  2) invalid data using valid_min, valid_max, valid_range, missing_data or _FillValue


I assume you mean 'missing_value', which I believe should be ignored on
input (see below).

Yes, I meant missing_value.



if those "standard attributes" are present. If they are not present, it acts just like the original Variable.

Implementation rules for scale/offset:
1) If scale_factor and/or add_offset variable attributes are present, then this is a "packed" Variable.
2) The Variable element type is converted to double, unless the scale_factor and add_offset attributes are both of type float, in which case it is converted to float.
3) Packed data is converted to unpacked data transparently during the read() call.


I am happy with these three rules.
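For concreteness, the three rules could be sketched roughly like this (a hypothetical standalone sketch, not the actual VariableStandardized code; the class and method names here are mine):

```java
// Hypothetical sketch of the scale/offset rules above -- not the actual
// VariableStandardized implementation.
public class UnpackDemo {
    // Rule 3: external (packed) value -> internal (unpacked) value.
    // Rule 2: computed in double unless both attributes are of type float.
    public static double unpack(short external, double scaleFactor, double addOffset) {
        return external * scaleFactor + addOffset;
    }

    public static void main(String[] args) {
        // e.g. data packed into shorts with scale_factor 0.5 and add_offset 10.0
        System.out.println(unpack((short) 1250, 0.5, 10.0)); // prints 635.0
    }
}
```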


Implementation rules for missing data:
1) if valid_range is present, the valid_min and valid_max attributes are ignored. Otherwise, the valid_min and/or valid_max attributes are used to construct a valid range.


The User's Guide states it is illegal to have valid_range if either
valid_min or valid_max is defined.  If
such a file exists in practice, I consider it better to force the user to
delete attributes to avoid such
ambiguity.

I guess the problem is that there's no library enforcement of such conventions, and so I am inclined to relax the rules if it doesn't cause confusion.
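The precedence in rule 1 could be sketched like this (my own illustrative code; attribute values are passed in as nullables rather than read from a real netCDF file):

```java
// Hypothetical sketch of rule 1: valid_range wins; otherwise valid_min
// and/or valid_max build the range, with missing ends left unbounded.
public class ValidRangeDemo {
    public static double[] validRange(double[] validRangeAttr, Double validMin, Double validMax) {
        if (validRangeAttr != null) return validRangeAttr; // valid_range takes precedence
        double lo = (validMin != null) ? validMin : Double.NEGATIVE_INFINITY;
        double hi = (validMax != null) ? validMax : Double.POSITIVE_INFINITY;
        return new double[] { lo, hi };
    }

    public static void main(String[] args) {
        double[] r = validRange(null, 5.0, null);
        System.out.println(r[0] + " .. " + r[1]); // prints 5.0 .. Infinity
    }
}
```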



2) a missing_value attribute may also specify a scalar or vector of missing values.


Yes, but note that this attribute is merely a hint for output & should be
ignored on input.

I don't understand why you ignore it on input. What if there is no valid_range specified? What if the missing_value is inside the valid_range?



3) if there is no missing_value attribute, the _FillValue attribute can be used to specify a scalar missing value.


For what purpose?  This could be reasonable on input if you are defining an
internal missing value, but
my understanding of your proposal is that you are simply defining an array
of data.

I'm not sure if I understand. Through the hasMissing() and isMissing() methods I am providing a service of knowing when the data is missing/invalid.


Before writing the section, I thought long and hard about the relation
between valid range, missing_value and _FillValue. We finally agreed to
essentially deprecate missing_value for simplicity. On input, if there
is a valid_range then any value outside this is considered missing. If
there is no valid_range then _FillValue defines a valid max if it is
positive, otherwise it defines a valid min. On output, missing data may
be written as any value outside the valid range. However, a particular
application may choose to use the missing_value (or an element of it if
it is a vector) as the value to write for missing data. So it would
make sense for generic applications to use the 1st element of
missing_value for output (provided this was outside the valid range).
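The _FillValue input rule above could be sketched like this (my own sketch; treating the bound as inclusive of the fill value itself is my assumption):

```java
// Hedged sketch of the rule described above: with no valid_range, a
// positive _FillValue acts as a valid max and a negative one as a
// valid min. Values at or beyond the fill value are considered missing.
public class FillValueDemo {
    public static boolean isMissing(double val, double fillValue) {
        return (fillValue > 0) ? (val >= fillValue) : (val <= fillValue);
    }

    public static void main(String[] args) {
        System.out.println(isMissing(-999.0, -999.0)); // prints true
        System.out.println(isMissing(0.0, -999.0));    // prints false
    }
}
```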

OK, I understand _FillValue better, thanks. Two things though: 1) it seems reasonable to pre-fill an array with valid values, since perhaps only a few data points need to be written that way; the above rules would seem to preclude this. 2) Is the default fill value supposed to operate the same way? If not, it seems funny that they might have radically different meanings.



Implementation rules for missing data with scale/offset:
   1) valid_range is always in the units of the converted (unpacked) data.


NO!!! See above.

The problem is that many important datasets use the internal units. I think there's a good argument that it is more natural, since those would be the units a human would think in. Is there anything in the current manual that specifies this? I just reread it and I don't see it.



2) _FillValue and missing_value are always in the units of the raw (packed) data.


I agree.


If hasMissingData() is true, then isMissingData(double val) is called to determine if the data is missing. Note that the data is converted and compared as a double.
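Putting the pieces together, the hasMissingData()/isMissingData() service might look roughly like this (a hypothetical sketch, not the real class; plain fields stand in for the variable attributes, and all comparisons are done as doubles, per the above):

```java
// Hypothetical sketch of the missing-data checks; fields stand in for
// the netCDF attributes discussed in this thread.
public class MissingDataCheck {
    public Double validMin, validMax;   // from valid_range or valid_min/valid_max
    public double[] missingValues = {}; // from missing_value (scalar or vector)
    public Double fillValue;            // from _FillValue

    public boolean hasMissingData() {
        return validMin != null || validMax != null
            || missingValues.length > 0 || fillValue != null;
    }

    public boolean isMissingData(double val) {
        if (validMin != null && val < validMin) return true;
        if (validMax != null && val > validMax) return true;
        for (double mv : missingValues)
            if (val == mv) return true;
        return fillValue != null && val == fillValue;
    }

    public static void main(String[] args) {
        MissingDataCheck m = new MissingDataCheck();
        m.validMin = 0.0;
        m.validMax = 100.0;
        System.out.println(m.isMissingData(-1.0)); // prints true
        System.out.println(m.isMissingData(50.0)); // prints false
    }
}
```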


Harvey Davies, CSIRO Atmospheric Research,
Private Bag No. 1, Aspendale 3195
E-mail: harvey.davies@xxxxxxxxxxxx
Phone: +61 3 9239 4556
  Fax: +61 3 9239 4444

