NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
Quincey Koziol wrote:
Hi Russ,

John Caron wrote:

the scale/offset can be calculated easily from the data itself. often, people want to apply different scale/offset to different parts of the same array, eg vertical levels.

and you replied:

Hmm, how would you parameterize this? Would a user select various parts of the dataset's dataspace and specify scale/offset information for them?

When Harvey Davies was here from Australia for a visit about 8 years ago, we worked out two kinds of scaling for varying packing parameters along one or more dimensions of a variable: predefined scaling and adaptive scaling.

With predefined scaling, the scale and offset values associated with a packed variable were stored in auxiliary arrays, varying along just the subset of dimensions used by these arrays. For example, to store a packed array of temperatures, one might use

    dimensions:
        time = ...
        lat = ...
        lon = ...
        level = ...
    variables:
        byte temperature(time, level, lon, lat);
        double temperature_scale_factor(level);
        double temperature_add_offset(level);

which would use a possibly different (scale_factor, add_offset) pair for packing temperatures on each atmospheric level. This would allow for greater precision using the same number of bits (or fewer bits for the same precision) than using one packing parameter pair for all the data, because this variable tends to have values that depend on level. It wouldn't work so well with other variables that don't have a level-dependence.

With adaptive scaling, the optimum scale and offset values were to be computed by the library for each slab of the variable as it was written, and stored in automatically-generated associated variables (or multidimensional attributes).

Although we defined interfaces for these types of scaling, they were never implemented. Implementing adaptive scaling seemed pretty ambitious, and even the predefined scaling would have required adoption of new conventions for naming associated variables, etc. And the proposals actually foundered on inability to agree on all the gory details, such as determining whether to permit the types of the scaling parameters to be user-specifiable in adaptive scaling, etc.

Ok, I see. Hmm... I think that the adaptive scaling would actually be somewhat easier than the predefined scaling you describe in HDF5. With the adaptive scaling, each chunk in the dataset could be scanned to compute the optimum scale and offset values, which would be stored with the chunk. Handling predefined scaling that varied according to a position within the dataspace seems like it would require accessing some information that was stored outside each chunk, and that might be a little unusual in the current implementation. Predefined scaling that didn't vary across the dataspace would be easier than either of those methods, of course. Although it gets a little weird to define any sort of scaling on non-numeric datatypes, we've got a mechanism for disallowing that now.

Quincey
If we have the ability to store variable length "compressed formats", then it seems like we can define some simple format for this, say a sequence

    n (int)
    nbits (byte)
    scale (float/double)
    offset (float/double)
    n nbit integers

so that the compressed data is self contained, with no need for auxiliary variables or info stored outside the chunk.
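One way to picture the self-contained chunk layout proposed above is the following minimal Python sketch. The field widths and byte order are assumptions made only for illustration (a 32-bit n, an 8-bit nbits, 64-bit scale and offset, a little-endian header, and values packed most-significant-bit first with zero padding to a whole byte); the names encode_chunk and decode_chunk are hypothetical and do not correspond to any netCDF or HDF5 API.

```python
import struct

# Hypothetical header layout: n (int32), nbits (uint8), scale (float64), offset (float64).
HEADER = "<iBdd"

def encode_chunk(packed, nbits, scale, offset):
    """Serialize a header followed by len(packed) integers of nbits bits each."""
    header = struct.pack(HEADER, len(packed), nbits, scale, offset)
    acc = 0
    for v in packed:
        if not 0 <= v < (1 << nbits):
            raise ValueError("value does not fit in nbits")
        acc = (acc << nbits) | v          # append the value, MSB first
    total_bits = len(packed) * nbits
    pad = (-total_bits) % 8               # zero padding up to a whole byte
    body = (acc << pad).to_bytes((total_bits + pad) // 8, "big")
    return header + body

def decode_chunk(buf):
    """Inverse of encode_chunk; needs nothing stored outside the chunk."""
    n, nbits, scale, offset = struct.unpack_from(HEADER, buf, 0)
    start = struct.calcsize(HEADER)
    total_bits = n * nbits
    pad = (-total_bits) % 8
    nbytes = (total_bits + pad) // 8
    acc = int.from_bytes(buf[start:start + nbytes], "big") >> pad
    mask = (1 << nbits) - 1
    packed = [(acc >> (nbits * (n - 1 - i))) & mask for i in range(n)]
    return packed, nbits, scale, offset
```

Because the header travels with the data, a reader can decode any chunk in isolation, which is the property the proposal is after.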
couldn't both the "predefined" and "adaptive" scaling be stored in this way? if so, then the difference between the two would be in how the user specifies it, ie the API.
Seems like we could start simple, just allow adaptive scaling on either the whole array, or varying along a single dimension. (i think we get enough functionality to do that out of the ability to compress each chunk independently.) output type is restricted to float or double. so all the user has to specify is nbits and optionally a dimension.
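As a rough illustration of adaptive scaling along a single dimension, the sketch below computes one (scale, offset) pair per slab of a nested list and compares the resulting step size with a single pair for the whole array. Everything here is an assumption for illustration: the function names are hypothetical, the data is a plain list of lists with the varying dimension first, and the scale is taken as the step size used when unpacking (value = packed * scale + offset), with the highest code left unused.

```python
def scale_params(values, nbits):
    """Return (scale, offset) so that min..max maps onto codes 0..2^nbits - 2."""
    lo, hi = min(values), max(values)
    codes = (1 << nbits) - 2              # highest code is left unused here
    scale = (hi - lo) / codes if hi > lo else 1.0
    return scale, lo

def pack_along_first_dim(data, nbits):
    """Pack each slab of data (first dimension) with its own scale and offset."""
    result = []
    for slab in data:
        scale, offset = scale_params(slab, nbits)
        packed = [round((v - offset) / scale) for v in slab]
        result.append((scale, offset, packed))
    return result

# Made-up temperatures on two levels; values cluster differently per level.
data = [[280.0, 290.0, 300.0], [210.0, 215.0, 220.0]]
whole_scale, _ = scale_params([v for slab in data for v in slab], 8)
per_level = pack_along_first_dim(data, 8)
```

With these made-up numbers the per-level step sizes are smaller than the whole-array step size, which is the precision gain that motivates letting the parameters vary along a dimension.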
there's one other issue that arises in practice, which is to deal with missing values. you want to 1) map missing values to a special bit pattern, and 2) exclude them from the calculation of dataMax or dataMin. so the user should also optionally specify a missing value, and it needs to be stored in the compression format.
to be explicit:

  * To compute the scale and offset:
      o add_offset = dataMin
      o scale_factor = (dataMax - dataMin) / (2^n - 2), where n is the number of bits of the packed (integer) data type.
  * The precision of the data will be scale_factor.
  * unpacked_data_value = (packed_data_value == 2^n-1) ? missing value : packed_data_value * scale_factor + add_offset
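The packing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unpacking convention is taken to be packed * scale_factor + add_offset, so scale_factor is the step size (dataMax - dataMin) / (2^n - 2); the missing value is a sentinel compared with ==; and the function names pack and unpack are hypothetical.

```python
def pack(values, nbits, missing):
    """Return (scale_factor, add_offset, packed), reserving code 2^nbits - 1 for missing."""
    fill = (1 << nbits) - 1               # special bit pattern for missing values
    valid = [v for v in values if v != missing]
    if not valid:                         # nothing but missing values
        return 1.0, 0.0, [fill] * len(values)
    data_min, data_max = min(valid), max(valid)
    add_offset = data_min
    # Valid data uses codes 0 .. 2^nbits - 2; fall back to 1.0 for constant data.
    scale_factor = (data_max - data_min) / (fill - 1) or 1.0
    packed = [fill if v == missing else round((v - add_offset) / scale_factor)
              for v in values]
    return scale_factor, add_offset, packed

def unpack(packed, nbits, scale_factor, add_offset, missing):
    """Map the reserved code back to missing and rescale every other code."""
    fill = (1 << nbits) - 1
    return [missing if p == fill else p * scale_factor + add_offset
            for p in packed]
```

Excluding the missing value from the min/max calculation matters: a sentinel such as -9999 would otherwise stretch the range and waste almost all of the available codes.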