Packing vs. Compression (was: resurrect old topic: compressed netcdf)


I've been following this discussion and feel I need to say something.
Packing is NOT compression, and compression is NOT packing. They are two
different subjects and should not be treated as interchangeable.

Compression is generally a lossless operation on the data, meaning NO
data is lost. Packing, on the other hand, generally involves removing
unneeded portions of the data, i.e. low-order bits. Packing is not
appropriate when all of the data is important, e.g. images: if you remove
low-order bits from an image, you will not end up with the same picture
you started with. For these reasons compression and packing should be
treated as separate subjects. It is also worth noting that packed data
can still be compressed.

I'd be happy to send anyone who wants it some C code for packing
floating-point numbers into bytes. It works fine as long as missing
values are NOT present.
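
The idea is plain linear scale/offset quantization; here is a minimal
sketch (illustrative only, with names of my choosing, not the exact
code on offer):

    #include <stddef.h>

    /* Pack floats into unsigned bytes with a linear scale/offset.
       Unpack later as: value = offset + scale * byte.
       NOTE: no handling of missing values (see caveat above). */
    void pack_floats(const float *in, size_t n,
                     unsigned char *out, float *offset, float *scale)
    {
        size_t i;
        float lo = in[0], hi = in[0];

        for (i = 1; i < n; i++) {           /* find the data range */
            if (in[i] < lo) lo = in[i];
            if (in[i] > hi) hi = in[i];
        }

        *offset = lo;
        *scale  = (hi > lo) ? (hi - lo) / 255.0f : 1.0f;

        for (i = 0; i < n; i++)             /* quantize to 0..255 */
            out[i] = (unsigned char)((in[i] - lo) / *scale + 0.5f);
    }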

-ethan

> 
> Hello,
> 
>    I am new to this mailing list, and a scientist using netCDF rather
> than a programmer who could really contribute to netCDF development.
> Since I am working with global atmospheric chemistry models which
> produce their output in netCDF format, I am unfortunately all too aware
> of the library's missing compression capabilities. Browsing the mailing
> list archive, I found that someone had actually implemented some type
> of compression in 1998, and apparently this had reached a stage where it
> was about to be included in the "official" netCDF package.
> 
>    Now I would like to know whether there are any active efforts to
> introduce packing into netCDF (and, if so, when to expect them). I would
> be happy to serve as a beta tester, using files between 10 kB and 1 GB
> in uncompressed size.
> 
>    In case no such activity is currently under way, I would like to
> contribute a few thoughts on this issue regarding backward compatibility
> as well as efficiency and usage. Please find these attached below.
> 
>    I hope that this mail is not inappropriate for the netcdf group.
> 
> With kindest regards,
> Martin Schultz
> 
> 
> ------------------------
> compression ideas:
> 
> (1) From what I have heard about the new bz2 compression, this should
> probably be the algorithm of choice, especially since it is patent-free
> and licensed under the GPL.
> 
> (2) A primitive packing version, which would even maintain compatibility
> with older netCDF versions, could act on individual variables and use
> variable attributes to indicate whether a variable is compressed. A
> variable should only be compressed if it exceeds a certain size (e.g.
> 100 kB). The method would then replace the variable's existing dimension
> description with a single dimension giving the number of bytes the
> compressed variable takes up. The variable type would be changed to BYTE.
> New variable attributes would be:
> _compressed_ = 1    ! logical to indicate that this variable is compressed
> _old_dims_   = integer array(10)   ! holds original dimension indices
> _old_type_          ! holds original variable type
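> 
> As a rough illustration of how a writer might store such a variable
> (hypothetical code, not an existing netCDF feature; error checking
> omitted), using the netCDF-3 C interface together with libbzip2:
> 
>     #include <stdlib.h>
>     #include <netcdf.h>
>     #include <bzlib.h>
> 
>     void put_compressed(int ncid, const char *name,
>                         const float *data, size_t nelems)
>     {
>         unsigned int srclen  = (unsigned int)(nelems * sizeof(float));
>         unsigned int destlen = srclen + srclen / 100 + 600; /* worst case */
>         char *buf = malloc(destlen);
>         int dimid, varid, one = 1, old_type = NC_FLOAT;
> 
>         /* destlen comes back as the actual compressed size */
>         BZ2_bzBuffToBuffCompress(buf, &destlen, (char *)data, srclen,
>                                  9, 0, 0);
> 
>         nc_redef(ncid);
>         nc_def_dim(ncid, "nbytes_of_var", destlen, &dimid); /* unique name */
>         nc_def_var(ncid, name, NC_BYTE, 1, &dimid, &varid);
>         nc_put_att_int(ncid, varid, "_compressed_", NC_INT, 1, &one);
>         nc_put_att_int(ncid, varid, "_old_type_", NC_INT, 1, &old_type);
>         /* _old_dims_ would be stored likewise as an int-array attribute */
>         nc_enddef(ncid);
> 
>         nc_put_var_schar(ncid, varid, (signed char *)buf);
>         free(buf);
>     }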
> 
> Advantages:
> * every old program can still parse these files without modifications. 
> * since the compressed variable is stored as a byte vector in one piece,
>   it can be saved 1:1 to a file, and the data can be retrieved with the
>   bzip2 command. One would get a raw binary file which is easily read by
>   almost any software.
> * since relatively large data blocks will be compressed, compression should
>   be effective in many real-world applications.
> 
> Disadvantages:
> * storing or retrieving parts of the variable requires decompressing the
>   complete variable data; adding extra data along the unlimited dimension
>   requires decompressing the old variable, appending the new part, and
>   recompressing, along with a dimension change.
> * thus, extremely inefficient for data sets with few huge variables or
>   in multi-threaded environments (because output can only be done on one
>   thread).
> 
> 
> (3) True support for compressed data would require changes that break
> compatibility with older netCDF versions. However, one should at least
> try to preserve the header format, so that existing programs can still
> analyze what is in the file (I'm saying this naively, without having
> looked at the netCDF source code). Perhaps one could introduce an extra
> layer that would manage the compression details. Compression could still
> be done on a per-variable basis, and only for variables that exceed a
> threshold size. Then existing software could perhaps still read the
> uncompressed parts of the file. Variable compression should be broken
> down along the unlimited dimension, so that each "record" (time step
> etc.) would be compressed individually. Information about the individual
> block sizes would have to be stored in the extra layer (see the sketch
> after the lists below).
> 
> Advantages:
> * may maintain at least some backward compatibility
> * adding extra records is easy and relatively fast
> * retrieving individual records along the unlimited dimension is easy
>   and fast
> 
> Disadvantages:
> * writing or reading subranges of the fixed dimensions is cumbersome
> * because the entry point for compression is fixed (large variables along
>   the unlimited dimension), one probably does not achieve the maximum
>   possible compression ratio.
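> 
> To make the extra layer concrete, the per-record bookkeeping could be
> as simple as the following (purely illustrative, not a worked-out
> design):
> 
>     #include <sys/types.h>   /* off_t */
>     #include <stddef.h>      /* size_t */
> 
>     /* One entry per record along the unlimited dimension. */
>     struct record_index {
>         off_t  offset;  /* file offset of the record's compressed block */
>         size_t csize;   /* compressed size in bytes */
>         size_t usize;   /* uncompressed size, for allocation on read */
>     };
> 
> Reading record k would then be a seek to offset, a read of csize bytes,
> and one BZ2_bzBuffToBuffDecompress() call into a usize-byte buffer.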
> 
> 
> 

