Re: [netcdfgroup] some advice on setting chunk sizes for netCDF-4 data...

Howdy Ed!
Thank you for the new documentation! As it happens, I am refining the user interface and writing docs for GrADS on the very same subject ... I sure would have benefitted from joining that fireside chat.

If you are open to suggestions for default chunk size settings, I would like to lobby for chunks a bit smaller than your proposal, something more along the lines of GRIB records, which are 2-dimensional, varying only in longitude and latitude. In general terms, chunks should have size > 1 only for the fastest and second-fastest varying dimensions. I can't speak for all the software out there, but in GrADS, I/O requests vary only in 1 or 2 dimensions, so being forced to read three-dimensional chunks will be really costly (in memory terms) and will also slow performance to the point of being unusable as grid resolution increases.
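
To put rough numbers on the cost of reading a 2-D slice out of 3-D chunks, here is a minimal sketch in plain C (no netCDF calls). It assumes a three-dimensional variable ordered level, lat, lon, and it assumes that whole uncompressed chunks are the unit of I/O, so every chunk a request overlaps is counted in full. The names chunks_spanned and bytes_touched are hypothetical helpers for illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

#define NDIMS 3 /* assumed dimension order: level, lat, lon (lon varies fastest) */

/* Number of chunks along one dimension that overlap the index range
 * [start, start + count). */
size_t chunks_spanned(size_t start, size_t count, size_t chunk_len)
{
    assert(count > 0 && chunk_len > 0);
    size_t first = start / chunk_len;
    size_t last = (start + count - 1) / chunk_len;
    return last - first + 1;
}

/* Uncompressed bytes of all chunks overlapped by a hyperslab request.
 * Assumption: each overlapped chunk is read whole, even if only a
 * small part of it was requested. */
size_t bytes_touched(const size_t start[NDIMS], const size_t count[NDIMS],
                     const size_t chunk[NDIMS], size_t elem_size)
{
    size_t nchunks = 1;
    size_t chunk_elems = 1;
    for (int d = 0; d < NDIMS; d++) {
        nchunks *= chunks_spanned(start[d], count[d], chunk[d]);
        chunk_elems *= chunk[d];
    }
    return nchunks * chunk_elems * elem_size;
}
```

For an illustrative 20 x 180 x 360 grid of 4-byte floats, a request for one full lat/lon level touches 259200 bytes when chunks are 1 x 180 x 360, and 5184000 bytes (20 times as much) when a single chunk covers all 20 levels.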

On a related note, I would also like to lobby to reduce these parameters
     #define NC_LEN_TOO_BIG 65536
     #define NC_LEN_WAY_TOO_BIG 1048576
by a couple orders of magnitude. A chunk that is 65536 on a side would never fit into the 32 MB default cache. The cache must be big enough to hold at least ~50-100 chunks. And the cache is allocated on a per-variable basis, so if you are forced to set the cache size large because chunk size is large, then you're in danger of running out of memory (unless memory is an unlimited resource on your system, which is not the usual case). Chunks that are too small do a lot less harm than chunks that are too big.
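
The arithmetic behind that claim can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: a two-dimensional chunk of 4-byte floats, "32 MB" taken to mean 32 * 1024 * 1024 bytes, and uncompressed chunk sizes. The function names chunk_bytes and chunks_in_cache are hypothetical, not netCDF API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Uncompressed size in bytes of one chunk. Uses uint64_t so the
 * product does not overflow where size_t is only 32 bits wide. */
uint64_t chunk_bytes(const size_t *chunk_lens, int ndims, size_t elem_size)
{
    assert(ndims > 0);
    uint64_t bytes = elem_size;
    for (int d = 0; d < ndims; d++)
        bytes *= (uint64_t)chunk_lens[d];
    return bytes;
}

/* Number of whole chunks that fit in a cache of cache_bytes. */
uint64_t chunks_in_cache(uint64_t cache_bytes, uint64_t one_chunk_bytes)
{
    assert(one_chunk_bytes > 0);
    return cache_bytes / one_chunk_bytes;
}
```

With these assumptions, a 65536 x 65536 chunk of floats is 17179869184 bytes (16 GiB), so zero such chunks fit in the cache. A 256 x 256 chunk is 262144 bytes, and 128 of those fit, which is in line with the ~50-100 chunk target.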

Respectfully submitted,
Jennifer

--
Jennifer M. Adams
IGES/COLA
4041 Powder Mill Road, Suite 302
Calverton, MD 20705
jma@xxxxxxxxxxxxx



On Dec 8, 2009, at 3:54 PM, Ed Hartnett wrote:


Howdy all!

Here in (normally) sunny Boulder, Colorado, we have been having some
very cold weather. As we huddled around the iron stove in the rough-hewn
log cabin that houses the netCDF programming team (wishing we had more
coal for our fire) we fell to talking about how to set chunk sizes for
netCDF-4/HDF5 data.

The setting of good chunk sizes depends on how the data will be
read, but it must be decided when the data are written.

For those out there who are also interested in increasing performance
with good chunk sizes in netCDF-4/HDF5 files, I can offer some
information.

New Documentation:
------------------

I have added a section on chunking to the NetCDF Users Guide. The latest
version can be found here:
http://www.unidata.ucar.edu/software/netcdf/docs_snapshot/netcdf.html#Chunking

Use the Chunk Cache:
--------------------

The chunk cache is important for chunking. It is (by default) 1 MB for
netCDF-4.0, and the default was increased to 32 MB for
netCDF-4.0.1. (The chunk cache can also be set at run-time with the
nc_set_chunk_cache function; the default can be set at configure time.)

You must set the chunk cache to be larger than one chunk, obviously. How
much larger depends on your access pattern. Note that this is the one
aspect of chunking that can be controlled by the data reader.

Test Performance with the bm_file Program:
------------------------------------------

There is a program called "bm_file" which comes with the netCDF
distribution (you must configure with --enable-benchmarks), and can be
used to test different chunk/deflation/shuffle settings (with or without
parallel I/O) to guide your selections. It is described in the new
section of the manual.

Default NetCDF-4 Chunking:
--------------------------

The default chunking of netCDF is to assign a chunk size of 1 for
unlimited dimensions, and chunk size matching the full dimension length
for fixed dimensions, unless those fixed dimensions are very large. This
works well for small data sets, or data sets which will be read in one
"record" at a time.

A complete discussion of the default chunking is in the Users Guide. I
am certainly very open to suggestions as to better default chunk size
choices.
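
To make the default rule described above concrete, here is a minimal
sketch in plain C. It is not the library's code. In particular, how the
library treats "very large" fixed dimensions is not spelled out in this
message, so the sketch simply caps them at a caller-supplied max_len,
which is purely an assumption for illustration. The function name
default_chunk_lens is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill chunk_lens[] following the rule described above:
 *   - unlimited dimension -> chunk length 1
 *   - fixed dimension     -> full dimension length
 * Assumption (not taken from the library): a fixed dimension longer
 * than max_len is capped at max_len. */
void default_chunk_lens(const size_t *dim_lens, const int *is_unlimited,
                        int ndims, size_t max_len, size_t *chunk_lens)
{
    assert(ndims > 0 && max_len > 0);
    for (int d = 0; d < ndims; d++) {
        if (is_unlimited[d])
            chunk_lens[d] = 1;
        else if (dim_lens[d] > max_len)
            chunk_lens[d] = max_len;
        else
            chunk_lens[d] = dim_lens[d];
    }
}
```

For a variable with an unlimited time dimension plus fixed lat = 180 and
lon = 360, this yields chunk lengths 1 x 180 x 360, that is, one
"record" per chunk.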

Thanks!

Ed

--
Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx





