
[netCDF #LFV-605873]: netcdf "64-bit offset" with compression



Hi Brian,

> I am running the CESM model, and it is generating quite a lot of output
> data. So I used nccopy to compress the output files (deflate/compression
> level 1), which worked flawlessly and reduced the size to ~50%. However,
> the original format was "64-bit offset", and after compression the only
> formats available are "netCDF4 classic" or "netCDF4".
> 
> I then realised that all my plotting routines slowed way down (by a
> factor of 10+), and after some testing I found that it wasn't the
> compression (the slowdown occurs with the uncompressed netCDF4 format
> as well), but the change from "64-bit offset".
> 
> So my questions are:
> 
> 1: Are there any reasons that "64-bit offset" cannot be compressed?

Yes, compression requires chunking (also known as "multidimensional
tiling"), which is supported in the two netCDF-4 formats.  The
netCDF-4 formats use the HDF5 library and storage to implement
chunking and compression.

Without chunking, it would be necessary to uncompress an entire file
even to access only a small subset of the data.  Each chunk is a
separate unit of compression and uncompression, so only the chunks
containing data to be accessed need to be uncompressed.
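
For example, a command along these lines (the file names are just
placeholders) compresses a classic or 64-bit offset file; because
compression requires chunking, nccopy has to write the result in one of
the netCDF-4 formats:

  nccopy -d 1 cesm_output.nc cesm_output_deflated.nc

Here "-d 1" requests deflate level 1, the same level you used.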

> 2: Is there an inherent reason that the "netCDF4" format is slower than
> "64-bit offset", or could it be something specific to my servers?

Accessing data from the HDF5-based netCDF-4 format can be either
slower or faster than the netCDF-3 classic or 64-bit offset formats,
depending on the pattern of access and details of how the data is
stored in the HDF5 format.

Specifically, if you use an unlimited (record) dimension or 
compression, the data must be chunked in the HDF5 file.  Specifying 
the shapes and sizes of chunks to match common patterns of how the 
data will be accessed is important for tuning the efficiency of 
access in HDF5 files.  It's also important to provide enough chunk 
cache that chunks you are accessing repeatedly stay in memory, 
avoiding an expensive disk read of a whole chunk if you need only 
a few values from the chunk.
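
As an illustrative sketch only: suppose a variable is dimensioned
(time, lat, lon) and your plotting routines read one timestep at a
time.  Rechunking so that each chunk holds a single full timestep might
look like this (the dimension names and sizes here are assumptions for
illustration, not taken from your files):

  nccopy -c time/1,lat/192,lon/288 -d 1 in.nc rechunked.nc

The idea is to make each chunk match the shape of a typical read, so
that a single read touches as few chunks as possible.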

If you aren't using the unlimited dimension in your output file, then
each variable's data will be stored contiguously, which avoids any
overhead related to chunking.  If you're seeing slow access
performance with contiguous data in netCDF-4 files, that's puzzling,
and we'd like to know more details.  You can get information about
storage (contiguous or chunked) and about chunk shapes by looking at
the output of "ncdump -h -s", to see the "Special Attributes" with
names beginning with "_".  Looking at that information might make it
clear whether the problem you're seeing is caused by poor chunk shapes
or sizes, which can result from unsuitable default chunking.
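
For example, for a hypothetical variable T, the relevant part of the
"ncdump -h -s" output looks roughly like this (the names and values are
illustrative only):

  ncdump -h -s cesm_output.nc
  ...
        float T(time, lev, lat, lon) ;
                ...
                T:_Storage = "chunked" ;
                T:_ChunkSizes = 1, 30, 192, 288 ;
                T:_DeflateLevel = 1 ;

A "_Storage" value of "contiguous" means chunking isn't involved at
all; if it says "chunked", compare "_ChunkSizes" with the way your
plotting routines slice the data.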

I've written two blog posts about chunking and its performance
implications that might make these performance issues clearer:

  
  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

The first also has a reference to an HDF5 white paper providing some
guidance on chunking and performance.  The nccopy utility can be used
to "rechunk" data or even "dechunk" it to convert unlimited dimensions 
to fixed size and make the storage of every variable contiguous, but 
then you can't compress the data (compressed data must use chunking).
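
For example (file names again placeholders), either of the following
should work; the second converts all the way back to your original
format, which also implies uncompressed, contiguous storage:

  nccopy -u -d 0 compressed.nc fixed_contiguous.nc
  nccopy -k 2 compressed.nc back_to_64bit_offset.nc

Here "-u" converts unlimited dimensions to fixed size, "-d 0" turns off
deflation, and "-k 2" selects the 64-bit offset format.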

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: LFV-605873
Department: Support netCDF
Priority: Normal
Status: Closed