On Tue, Dec 31, 2013 at 2:01 PM, Russ Rew <russ@xxxxxxxxxxxxxxxx> wrote:
> I wrote a couple of blogs on netCDF-4 chunking that might provide
> additional guidance:
>
> http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
>
> http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes
>
Thanks Russ -- helpful stuff.
> You're right that there's no single chunking strategy that fits all
> access patterns, which makes it important to not rely on default
> chunking if you know how the data will be accessed.
But if you don't, it would be nice if the default chunking were not-too-bad.
> Disk blocks are typically powers of 2 bytes, for example 4096 or 1048576
> bytes. For netCDF-4 and HDF5, a chunk is the atomic unit of I/O access.
> Thus a chunk size that is a multiple of the disk block size or slightly
> under makes sense when you access only a few chunks at a time.
Good to know -- it also brings up a complication: I had a hard time
finding any mention in the docs of whether chunk sizes are specified in
bytes or in number of elements -- some experimentation leads me to think
it's the latter. But block sizes are, of course, in bytes, so one might
want to do some math to make sure you're getting the chunk size you want.
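For instance, here's the sort of math I mean, as a sketch with the
netCDF4 Python bindings (the file and variable names are made up):

import numpy as np
import netCDF4

nc = netCDF4.Dataset("example.nc", "w")    # hypothetical file
nc.createDimension("time", None)           # unlimited 1-d dimension

# chunksizes is specified in *elements*, not bytes:
var = nc.createVariable("temp", np.float64, ("time",),
                        chunksizes=(512,))

# so hitting a disk-block-sized chunk takes the element size too:
chunk_bytes = 512 * np.dtype(np.float64).itemsize   # 512 * 8 = 4096
print(chunk_bytes)    # exactly one typical 4096-byte disk block
nc.close()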
> But my minimal tests have indicated that performance isn't all that
> sensitive to chunk sizes within a wide range.
>
> In the blogs above, I present some cases where chunk sizes and shapes
> can make a significant difference in performance.
Well, I was testing with a 1-d variable, and in that case you needed more
than an order of magnitude of change to see any difference -- a chunk size
of 1 was really, really horrible; 16 was better, but still really bad; 128
was OK, and not much different from 512. And once you got that big, the
difference between 1024 and 1MB was not all that big.
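(For the record, my test was roughly along these lines -- a sketch of
the idea in the netCDF4 Python bindings, not the exact script, and the
file names and sizes are made up:)

import time
import numpy as np
import netCDF4

def time_1d_reads(chunk_len, n=2**16):
    """Write a 1-d chunked variable, then read it back one element
    at a time -- the worst-case access pattern."""
    fname = "chunk_%i.nc" % chunk_len       # hypothetical file name
    nc = netCDF4.Dataset(fname, "w")
    nc.createDimension("x", None)           # unlimited, so always chunked
    var = nc.createVariable("v", np.float64, ("x",),
                            chunksizes=(chunk_len,))
    var[0:n] = np.arange(n, dtype=np.float64)
    nc.close()

    nc = netCDF4.Dataset(fname)
    var = nc.variables["v"]
    start = time.time()
    for i in range(n):
        var[i]                              # one tiny read per element
    elapsed = time.time() - start
    nc.close()
    return elapsed

for chunk_len in (1, 16, 128, 512, 1024):
    print(chunk_len, time_1d_reads(chunk_len))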
I expect it would be similar for higher-dimensional variables -- i.e. a
(1, 512, 512) chunk is going to have vastly different performance depending
on how you access it, and it will be very different from a (512, 512, 1)
chunk. But not that different from, say, a (1, 128, 128) chunk.
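To make that concrete (netCDF4 Python again, with made-up names):

import numpy as np
import netCDF4

nc = netCDF4.Dataset("shapes.nc", "w")    # hypothetical file
nc.createDimension("time", None)
nc.createDimension("y", 512)
nc.createDimension("x", 512)

# (1, 512, 512): each chunk is one full 2-d field, so reading a map at
# one time step touches a single chunk, but pulling a time series at a
# single (y, x) point has to read every chunk in the file.
map_friendly = nc.createVariable("map_var", np.float64,
                                 ("time", "y", "x"),
                                 chunksizes=(1, 512, 512))

# (512, 512, 1): the opposite trade-off -- a time series at one point
# reads only a few chunks, but a single 2-d field reads 512 of them.
series_friendly = nc.createVariable("series_var", np.float64,
                                    ("time", "y", "x"),
                                    chunksizes=(512, 512, 1))

nc.close()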
> Thanks for offering some new ideas for a chunking strategy. We hereby
> resolve to try to improve chunking in the New Year!
>
Cool -- I think the missing piece is that all the discussions I've seen,
and the defaults, were developed with 3-d or higher variables in mind, and
1-d fell through the cracks -- but the defaults should at least do OK
with 1-d.
An extension to my earlier proposal. This page:
http://www.unidata.ucar.edu/software/netcdf/docs/default_chunking_4_0_1.html
indicates that there are:
#define NC_LEN_TOO_BIG 65536
#define NC_LEN_WAY_TOO_BIG 1048576
There should probably be a MINIMUM_CHUNK_SIZE as well -- essentially, you
never want an overall chunk size of 1, or any other really small number. I
suspect 1k or so would be good (unless it's a variable with only a few
values in it -- in which case it's hopefully not on an unlimited dimension,
and the whole variable can be one chunk).
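Something along these lines, as Python-flavored pseudocode (the name and
the 1k floor are just my suggestion, not anything in the library):

MINIMUM_CHUNK_SIZE = 1024   # proposed floor, in total elements -- a guess

def default_1d_chunk_len(dim_len):
    """Sketch of a saner default chunk length for a 1-d variable:
    never tiny, unless the whole variable is tiny."""
    if dim_len and dim_len < MINIMUM_CHUNK_SIZE:
        # a small fixed-size variable: one chunk holds the whole thing
        return dim_len
    # otherwise (including unlimited dimensions, where dim_len is 0),
    # never drop below the floor
    return MINIMUM_CHUNK_SIZE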
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@xxxxxxxx