Hello,
I am new to this mailing list and a scientist using netcdf rather than
a programmer who could really contribute to the netCDF development. Since
I am working with global atmospheric chemistry models which produce their
output in netcdf format, I am all too aware of the missing compression
capabilities of the netcdf library, unfortunately. Browsing the mailing
list archive, I found out that someone had actually implemented some type
of compression in 1998 and apparently this had reached a stage where it
was about to be included in the "official" netcdf program package.
Now I would like to know whether there are any active efforts to introduce
packing into netcdf (and if so when to expect this). I would be happy to
serve as a beta tester, using files between 10 kB and 1 GB in
uncompressed size.
In case no such activity is currently under way, I would like to
contribute a few thoughts on this issue regarding backward compatibility
as well as efficiency and usage. Please find these below.
I hope that this mail is not inappropriate for the netcdf group.
With kindest regards,
Martin Schultz
------------------------
compression ideas:
(1) from what I heard about the new bzip2 (bz2) compression, this should
probably be the algorithm of choice, especially since it is patent-free and
freely licensed (BSD-style in current releases, as far as I know).
(2) a primitive packing version which would even maintain compatibility with
older netcdf versions could act on individual variables and use variable
attributes to indicate whether a variable is compressed. A variable
should only be compressed if it exceeds a certain size (e.g. 100 kB).
The method would then replace the existing dimension description of the
variable with a single dimension indicating the number of bytes the
compressed variable takes up. The variable type would be changed to BYTE.
New variable attributes would be:
_compressed_ = 1 ! logical to indicate that this variable is compressed
_old_dims_ = integer array(10) ! holds original dimension indices
_old_type_ ! holds original variable type
Advantages:
* every old program can still parse these files without modifications.
* since the compressed variable is stored as byte vector in one piece, it
can be saved 1:1 in a file and the data can be retrieved with the bzip2
command. One would get a raw binary file which is easily read with almost
any software.
* since relatively large data blocks will be compressed, compression should
be effective in many real-world applications.
Disadvantages:
* storing or retrieving parts of the variable requires decompression of the
complete variable data. Adding extra data along the unlimited dimension
requires decompressing the old variable, appending the new part, and
recompressing, along with a dimension change.
* (thus) extremely inefficient for data sets with a few huge variables or
in multi-threaded environments (because output can only be done on one
thread).
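
To make idea (2) concrete, here is a minimal sketch in Python. It does not
use the netCDF library at all: a "variable" is just a hypothetical dict with
"type", "dims", "data" and "attrs" entries, the type is assumed to be a
Python array typecode such as "f", and the first dimension is assumed to be
the unlimited one. Only the attribute names (_compressed_, _old_dims_,
_old_type_) and the 100 kB threshold come from the description above.

```python
import bz2
from array import array

THRESHOLD_BYTES = 100 * 1024  # assumed cutoff, taken from the 100 kB example

def pack_variable(var):
    """Return a bzip2-packed copy of var if it exceeds the threshold."""
    raw = var["data"].tobytes()
    if len(raw) <= THRESHOLD_BYTES:
        return var  # small variables stay uncompressed
    packed = bz2.compress(raw)
    return {
        "type": "BYTE",
        "dims": (len(packed),),  # single dimension: number of compressed bytes
        "data": packed,
        "attrs": {**var["attrs"],
                  "_compressed_": 1,
                  "_old_dims_": var["dims"],
                  "_old_type_": var["type"]},
    }

def unpack_variable(var):
    """Restore the original variable from a packed one."""
    if var["attrs"].get("_compressed_") != 1:
        return var
    extra = ("_compressed_", "_old_dims_", "_old_type_")
    data = array(var["attrs"]["_old_type_"])
    data.frombytes(bz2.decompress(var["data"]))  # always the whole variable
    return {
        "type": var["attrs"]["_old_type_"],
        "dims": var["attrs"]["_old_dims_"],
        "data": data,
        "attrs": {k: v for k, v in var["attrs"].items() if k not in extra},
    }

def append_record(var, record):
    """Append one record along the first (assumed unlimited) dimension."""
    full = unpack_variable(var)  # full decompression is unavoidable here
    n, *rest = full["dims"]
    grown = {**full,
             "dims": (n + 1, *rest),
             "data": full["data"] + array(full["type"], record)}
    return pack_variable(grown)  # full recompression, new byte dimension

# Example: 4 time steps on a 100 x 100 grid of 4-byte floats (160000 bytes).
var = {"type": "f", "dims": (4, 100, 100), "attrs": {"units": "ppbv"},
       "data": array("f", [float(i % 50) for i in range(40000)])}
packed = pack_variable(var)
restored = unpack_variable(packed)
bigger = append_record(packed, [1.0] * 10000)
```

The packed "data" is an ordinary bzip2 stream, so writing it to a file and
running the bzip2 command on it should give back the raw binary values.
append_record shows the main drawback: every append decompresses and
recompresses the entire variable.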
(3) True support of compressed data would require changes that render
older netcdf versions incompatible. However, one should at least try to
preserve the header format, so that existing programs can at least analyze
what is in the file (I'm saying this naively without having looked at the
netcdf source code). Perhaps one could introduce an extra layer that
would manage the compression details. Compression could still be done
on a per-variable basis and only for variables that exceed a threshold
size. Then existing software could perhaps still read the uncompressed
parts of the file. Variable compression should be broken down along the
unlimited dimension, so that each "record" (time step etc.) would be
compressed individually. Information about the individual block sizes
would have to be stored in the extra layer.
Advantages:
* may maintain at least some backward compatibility
* adding extra records is easy and relatively fast
* retrieving individual records along the unlimited dimension is easy
and fast
Disadvantages:
* writing or reading subranges of the fixed dimensions is cumbersome
* because the entry point for compression is fixed (large variables along
the unlimited dimension), one probably does not achieve the maximum
possible compression ratio.
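
The per-record scheme in (3) could look roughly like the following Python
sketch. The RecordStore class is purely hypothetical and stands in for the
"extra layer": it keeps one bzip2 block per record plus a list of block
sizes. Nothing here reflects the actual netCDF file layout, and values are
again assumed to be identified by a Python array typecode.

```python
import bz2
from array import array

class RecordStore:
    """Hypothetical extra layer: one compressed block per record."""

    def __init__(self, typecode):
        self.typecode = typecode
        self.blob = bytearray()  # concatenated compressed blocks
        self.sizes = []          # compressed size of each record, in bytes

    def append(self, record):
        """Compress one record and add it; earlier blocks are untouched."""
        block = bz2.compress(array(self.typecode, record).tobytes())
        self.blob += block
        self.sizes.append(len(block))

    def read(self, index):
        """Decompress only the block that holds record number index."""
        start = sum(self.sizes[:index])  # offset derived from the size index
        block = bytes(self.blob[start:start + self.sizes[index]])
        out = array(self.typecode)
        out.frombytes(bz2.decompress(block))
        return out

# Example: three time steps of 1000 values each.
store = RecordStore("f")
for t in range(3):
    store.append([float(t)] * 1000)
second = store.read(1)
subrange = store.read(2)[10:20]  # still decompresses the whole record
```

Appending and reading whole records touch a single block each, which is
the advantage claimed above. The last line shows the other side: a
subrange of the fixed dimensions still costs a full-record decompression.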