resurrect old topic: compressed netcdf

Hello,

   I am new to this mailing list, and I am a scientist using netCDF rather
than a programmer who could contribute directly to netCDF development. Since
I work with global atmospheric chemistry models that produce their output in
netCDF format, I am unfortunately all too aware of the library's missing
compression capabilities. Browsing the mailing list archive, I found that
someone had actually implemented some type of compression in 1998, and that
this had apparently reached a stage where it was about to be included in the
"official" netCDF package.

   Now I would like to know whether there are any active efforts to introduce
packing into netCDF (and, if so, when to expect it). I would be happy to
serve as a beta tester, using files between 10 kB and 1 GB in uncompressed
size.

   In case no such activity is currently under way, I would like to
contribute a few thoughts on this issue regarding backward compatibility
as well as efficiency and usage. Please find them attached below.

   I hope that this mail is not inappropriate for the netcdf group.

With kindest regards,
Martin Schultz


------------------------
compression ideas:

(1) From what I have heard about the new bzip2 compression, it should
probably be the algorithm of choice, especially since it is patent-free
and available under an open-source license.

(2) A primitive packing scheme that would even maintain compatibility with
older netCDF versions could act on individual variables and use variable
attributes to indicate whether a given variable is compressed. A variable
should only be compressed if it exceeds a certain size (e.g. 100 kB).
The method would then replace the variable's existing dimension description
with a single dimension giving the number of bytes the compressed variable
occupies. The variable type would be changed to BYTE.
New variable attributes would be:
_compressed_ = 1                   ! flag indicating that this variable is compressed
_old_dims_   = integer array(10)   ! holds the original dimension IDs
_old_type_                         ! holds the original variable type
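
To make this concrete, here is a minimal sketch of the writer side of this
convention, assuming the standard netCDF C interface and libbz2. The helper
def_compressed_var and its "<name>_nbytes" dimension-naming convention are
my own invention for illustration; only the nc_* and BZ2_* calls are real
library functions.

/* Idea (2), writer side: compress a variable's data with bzip2 and store
 * it as a 1-D BYTE variable plus bookkeeping attributes. */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>
#include <bzlib.h>

int def_compressed_var(int ncid, const char *name,
                       const void *data, size_t nbytes,
                       nc_type old_type, int ndims, const int *old_dimids)
{
    /* bzip2 worst case: input size + 1% + 600 bytes (per the bzlib docs) */
    unsigned int destlen = (unsigned int)(nbytes + nbytes / 100 + 600);
    char *dest = malloc(destlen);
    char dimname[NC_MAX_NAME + 1];
    int dimid, varid, status, one = 1, otype = (int)old_type;

    if (dest == NULL)
        return NC_ENOMEM;
    if (BZ2_bzBuffToBuffCompress(dest, &destlen, (char *)data,
                                 (unsigned int)nbytes, 9, 0, 0) != BZ_OK) {
        free(dest);
        return -1;                      /* generic failure; illustrative */
    }

    /* a single dimension: the number of compressed bytes */
    snprintf(dimname, sizeof dimname, "%s_nbytes", name);
    if ((status = nc_def_dim(ncid, dimname, destlen, &dimid)) == NC_NOERR &&
        (status = nc_def_var(ncid, name, NC_BYTE, 1, &dimid,
                             &varid)) == NC_NOERR) {
        /* the attributes proposed above */
        nc_put_att_int(ncid, varid, "_compressed_", NC_INT, 1, &one);
        nc_put_att_int(ncid, varid, "_old_dims_", NC_INT,
                       (size_t)ndims, old_dimids);
        nc_put_att_int(ncid, varid, "_old_type_", NC_INT, 1, &otype);
        if ((status = nc_enddef(ncid)) == NC_NOERR)
            status = nc_put_var_schar(ncid, varid,
                                      (const signed char *)dest);
    }
    free(dest);
    return status;
}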

Advantages:
* every old program can still parse these files without modification.
* since the compressed variable is stored as a byte vector in one piece, it
  can be saved 1:1 to a file and the data can be retrieved with the bzip2
  command. One would get a raw binary file which is easily read with almost
  any software.
* since relatively large data blocks will be compressed, compression should
  be effective in many real-world applications.

Disadvantages:
* storing or retrieving parts of the variable requires decompressing the
  complete variable data; adding extra data along the unlimited dimension
  requires decompressing the old variable, appending the new part, and
  recompressing, along with a dimension change.
* (thus) extremely inefficient for data sets with few huge variables or
  in multi-threaded environments (because output can only be done by one
  thread).
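
The first disadvantage shows up directly in a read-side counterpart to the
sketch above (get_compressed_var is again a hypothetical helper): even a
single element can only be obtained by inflating the entire variable, and
the caller must reconstruct the uncompressed size from _old_dims_ and
_old_type_ beforehand.

/* Idea (2), reader side: nbytes_orig is the uncompressed size, derived by
 * the caller from _old_dims_ and _old_type_. */
#include <stdlib.h>
#include <netcdf.h>
#include <bzlib.h>

int get_compressed_var(int ncid, int varid, void *out, size_t nbytes_orig)
{
    unsigned int destlen = (unsigned int)nbytes_orig;
    int is_compressed = 0, dimid, status;
    size_t nbytes_comp;
    signed char *buf;

    if (nc_get_att_int(ncid, varid, "_compressed_", &is_compressed) != NC_NOERR
        || !is_compressed)
        return nc_get_var(ncid, varid, out);      /* ordinary variable */

    /* the variable's single dimension holds the compressed byte count */
    if ((status = nc_inq_vardimid(ncid, varid, &dimid)) != NC_NOERR ||
        (status = nc_inq_dimlen(ncid, dimid, &nbytes_comp)) != NC_NOERR)
        return status;

    if ((buf = malloc(nbytes_comp)) == NULL)
        return NC_ENOMEM;
    if ((status = nc_get_var_schar(ncid, varid, buf)) == NC_NOERR &&
        BZ2_bzBuffToBuffDecompress((char *)out, &destlen, (char *)buf,
                                   (unsigned int)nbytes_comp, 0, 0) != BZ_OK)
        status = -1;                    /* generic failure; illustrative */
    free(buf);
    return status;
}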


(3) True support for compressed data would require changes that make the
files unreadable by older netCDF versions. However, one should at least try
to preserve the header format, so that existing programs can still analyze
what is in the file (I am saying this naively, without having looked at the
netCDF source code). Perhaps one could introduce an extra layer that
manages the compression details. Compression could still be done
on a per-variable basis and only for variables that exceed a threshold
size. Existing software could then perhaps still read the uncompressed
parts of the file. Variable compression should be broken down along the
unlimited dimension, so that each "record" (time step etc.) would be
compressed individually. Information about the individual block sizes
would have to be stored in the extra layer.
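
To illustrate what that extra layer would have to keep track of, here is a
small data-structure sketch. Everything in it is hypothetical; nothing of
the kind exists in the netCDF library.

/* Idea (3): one compressed block per record along the unlimited dimension,
 * with an index of block offsets and sizes kept in the extra layer. */
#include <stdio.h>
#include <stdlib.h>
#include <bzlib.h>

typedef struct {
    size_t offset;     /* byte offset of this record's compressed block */
    size_t comp_size;  /* compressed size of the block                  */
    size_t raw_size;   /* uncompressed record size                      */
} block_entry;

typedef struct {
    size_t       nrecords;  /* current length of the unlimited dimension */
    block_entry *index;     /* one entry per record                      */
} record_index;

/* Append one record: compress just that record and add one index entry.
 * No existing block is touched, which is why adding records stays cheap.
 * Assumes blockfile is positioned at its end. */
int append_record(record_index *ri, const void *rec, size_t raw_size,
                  FILE *blockfile)
{
    unsigned int destlen = (unsigned int)(raw_size + raw_size / 100 + 600);
    char *dest = malloc(destlen);
    block_entry *tmp;

    if (dest == NULL)
        return -1;
    if (BZ2_bzBuffToBuffCompress(dest, &destlen, (char *)rec,
                                 (unsigned int)raw_size, 9, 0, 0) != BZ_OK) {
        free(dest);
        return -1;
    }
    if ((tmp = realloc(ri->index, (ri->nrecords + 1) * sizeof *tmp)) == NULL) {
        free(dest);
        return -1;
    }
    ri->index = tmp;
    ri->index[ri->nrecords].offset    = (size_t)ftell(blockfile);
    ri->index[ri->nrecords].comp_size = destlen;
    ri->index[ri->nrecords].raw_size  = raw_size;
    ri->nrecords++;

    fwrite(dest, 1, destlen, blockfile);
    free(dest);
    return 0;
}

Reading record i then costs one seek to index[i].offset plus decompression
of a single block, which is why record-wise access stays fast; a subrange
across the fixed dimensions, however, still pays for a whole record.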

Advantages:
* may maintain at least some backward compatibility
* adding extra records is easy and relatively fast
* retrieving individual records along the unlimited dimension is easy
  and fast

Disadvantages:
* writing or reading subranges of the fixed dimensions is cumbersome
* because the entry point for compression is fixed (large variables along
  the unlimited dimension), one probably does not achieve the maximum
  possible compression ratio.


