Hi Chris,
I wrote a couple of blog posts on netCDF-4 chunking that might
provide additional guidance:
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes
You're right that no single chunking strategy fits all access
patterns, which makes it important not to rely on the default
chunking if you know how the data will be accessed.
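For concreteness, here's a minimal C sketch of setting explicit
chunk sizes at variable-definition time; the file name, the use of
an unlimited dimension, and the 16384 chunk size are illustrative
placeholders, not recommendations:

    #include <netcdf.h>

    /* Sketch (error checks omitted): create a netCDF-4 file and set
       explicit chunking on a 1-D "time" variable.  The file name and
       the 16384 chunk size are illustrative, not recommendations. */
    int main(void) {
        int ncid, time_dimid, time_varid;
        size_t chunksizes[1] = {16384};

        nc_create("example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dimid);
        nc_def_var(ncid, "time", NC_FLOAT, 1, &time_dimid, &time_varid);

        /* Override the default chunking; this must happen in define
           mode, before any data is written to the variable. */
        nc_def_var_chunking(ncid, time_varid, NC_CHUNKED, chunksizes);

        nc_close(ncid);
        return 0;
    }

The main constraint is that nc_def_var_chunking can only be called
before data has been written; chunking can't be changed afterward.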
> float time(time) ; time:_Storage = "chunked" ; time:_ChunkSizes = 16384 ;
>
> This is the 1-d array -- similar to our case. How did you come up with the
> 16384 (2^14)? Is there a benefit to base-2 numbers here -- I tend to do
> that, too, but I'm not sure why.
Disk blocks are typically a power of 2 bytes in size, for example
4096 or 1048576 bytes. For netCDF-4 and HDF5, a chunk is the atomic
unit of I/O access, so a chunk size that is a multiple of the disk
block size, or slightly under, makes sense when you access only a
few chunks at a time. It matters less when a large number of chunks
is typically accessed at once. Providing a chunk cache large enough
to avoid re-reading the same chunk repeatedly is also important.
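Continuing the sketch above, here's one way to enlarge the chunk
cache for a single variable with the netCDF-4 C API; the 64 MiB
size, 1009 slots, and 0.75 preemption are illustrative values to
tune for your chunk sizes and access pattern, not recommendations:

    #include <netcdf.h>

    /* Sketch (error checks omitted): enlarge the chunk cache for one
       variable before reading it, so repeatedly accessed chunks stay
       in memory instead of being re-read and re-decompressed. */
    int main(void) {
        int ncid, time_varid;
        size_t cache_bytes = 64 * 1024 * 1024; /* total cache size in bytes */
        size_t cache_slots = 1009;             /* hash-table slots; HDF5 docs
                                                  suggest a prime number */
        float  preemption  = 0.75f;            /* 0.0-1.0: how readily fully
                                                  read chunks are evicted */

        nc_open("example.nc", NC_NOWRITE, &ncid);
        nc_inq_varid(ncid, "time", &time_varid);
        nc_set_var_chunk_cache(ncid, time_varid, cache_bytes,
                               cache_slots, preemption);

        /* ... reads of "time" now benefit from the larger cache ... */

        nc_close(ncid);
        return 0;
    }

As a rough rule of thumb from the HDF5 documentation, make the cache
large enough to hold all the chunks a typical access touches, and
use a prime number of slots well above the number of cached chunks.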
> But my minimum tests have indicated that performance isn't all that
> sensitive to chunk sizes within a wide range.
In the blog posts linked above, I present some cases where chunk
sizes and shapes make a significant difference in performance. An
HDF5 white paper also demonstrates this point:
http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/
Thanks for offering some new ideas for a chunking strategy. We hereby
resolve to try to improve chunking in the New Year!
--Russ