It's all about finding a good set of default chunk sizes for netCDF-4.1.
Tests seem to indicate that, for the 3D data, a chunk size
of 32 or 64 along the unlimited dimension provides a good trade-off
between time series and time step read performance, without inflating the
file size too much.
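For context, here is a minimal sketch of how explicit chunk sizes are specified through the netCDF-4 C API. The file, dimension, and variable names (and the grid sizes) are hypothetical, not the actual test setup; only the chunk sizes correspond to the values being tested.

#include <netcdf.h>
#include <stdio.h>

#define ERR(e) do { if (e) { printf("Error: %s\n", nc_strerror(e)); return 1; } } while (0)

int main(void)
{
    int ncid, dimids[3], varid, ret;
    /* Candidate chunk sizes: 32 or 64 along the unlimited (time)
       dimension, larger chunks along the horizontal dimensions. */
    size_t chunks[3] = {64, 64, 128};

    ret = nc_create("chunk_test.nc", NC_NETCDF4 | NC_CLOBBER, &ncid); ERR(ret);
    ret = nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]); ERR(ret);
    ret = nc_def_dim(ncid, "lat", 1024, &dimids[1]); ERR(ret);
    ret = nc_def_dim(ncid, "lon", 2048, &dimids[2]); ERR(ret);
    ret = nc_def_var(ncid, "data", NC_FLOAT, 3, dimids, &varid); ERR(ret);

    /* Override the library's default chunk sizes with an explicit layout. */
    ret = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks); ERR(ret);

    ret = nc_close(ncid); ERR(ret);
    return 0;
}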
The file size effect makes intuitive sense as well. Larger chunk sizes mean that any
leftover chunks (i.e. chunks that are only partially filled with data)
take up more space on disk and make the file bigger.
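To put a rough number on that, here is a back-of-the-envelope sketch (the dimension lengths are made up for illustration) of the padding from partially filled chunks. With uncompressed chunked storage each chunk is allocated in full, so any dimension length that isn't a multiple of the chunk size wastes space; with deflate on, the fill values mostly compress away, but the tests below have deflate off.

#include <stdio.h>

int main(void)
{
    /* Hypothetical 3D variable: 1000 time steps on a 1024 x 2048 grid. */
    size_t dims[3]   = {1000, 1024, 2048};
    size_t chunks[3] = {64, 64, 128};   /* candidate chunk sizes */
    size_t stored = 1, actual = 1;

    for (int i = 0; i < 3; i++) {
        size_t nchunks = (dims[i] + chunks[i] - 1) / chunks[i]; /* round up */
        stored *= nchunks * chunks[i];  /* space allocated, padding included */
        actual *= dims[i];              /* data actually written */
    }
    /* Prints 2.4% for these sizes: 16 chunks of 64 cover 1000 time steps. */
    printf("overhead from leftover chunks: %.1f%%\n",
           100.0 * (double)(stored - actual) / (double)actual);
    return 0;
}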
Here are some numbers from the latest tests. The top row is the netCDF classic format case. These are the time step reads:
cs[0]  cs[1]  cs[2]  cache(MB)  deflate  shuffle  1st_read_hor(us)  avg_read_hor(us)
    0      0      0        0.0        0        0             35974              3125
   32     64    128        1.0        0        0            261893              2931
   32     64    256        1.0        0        0            132380              3563
   32    128    128        1.0        0        0            151692              3657
   32    128    256        1.0        0        0              8063              2219
   64     64    128        1.0        0        0            133339              4264
   64     64    256        1.0        0        0             28208              3359
   64    128    128        1.0        0        0             27536              3051
   64    128    256        1.0        0        0            110620              2043
Here are the time series reads:
cs[0]  cs[1]  cs[2]  cache(MB)  deflate  shuffle  1st_read_ser(us)  avg_read_ser(us)
    0      0      0        0.0        0        0           3257952              8795
   32     64    128        1.0        0        0           1427863             15069
   32     64    256        1.0        0        0           2219838              4394
   32    128    128        1.0        0        0           2054724              4668
   32    128    256        1.0        0        0           3335330              4347
   64     64    128        1.0        0        0           1041324              3581
   64     64    256        1.0        0        0           1893643              2995
   64    128    128        1.0        0        0           1942810              3024
   64    128    256        1.0        0        0           3210923              3975
For the time series reads, we see that smaller chunk sizes for the horizontal dimensions work better, and larger chunk sizes for the time dimension work better.
For the time step (horizontal) reads, we see that larger chunk sizes for the horizontal dimensions work better, and smaller chunk sizes along the time dimension.
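To make the two access patterns concrete, here is a rough sketch of what the timed reads look like through the C API (the dimension lengths and grid point are hypothetical, and error checking is omitted for brevity): a time step read pulls one complete horizontal slab, while a time series read walks the whole time dimension at a single grid point.

#include <netcdf.h>
#include <stdlib.h>

void read_patterns(int ncid, int varid, size_t ntime, size_t nlat, size_t nlon)
{
    /* Time step (horizontal) read: one time index, the full lat x lon slab. */
    size_t start_hor[3] = {0, 0, 0};
    size_t count_hor[3] = {1, nlat, nlon};
    float *slab = malloc(nlat * nlon * sizeof(float));
    nc_get_vara_float(ncid, varid, start_hor, count_hor, slab);

    /* Time series read: every time index at one fixed grid point. */
    size_t start_ser[3] = {0, 100, 200};
    size_t count_ser[3] = {ntime, 1, 1};
    float *series = malloc(ntime * sizeof(float));
    nc_get_vara_float(ncid, varid, start_ser, count_ser, series);

    free(slab);
    free(series);
}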
Maybe the answer *is* to go with the current default scheme, but just make the sizes of the chunks that it writes much bigger.
I would really like 64 x 64 x 128 for the data above, except for the (possibly spurious) high value for the first horizontal read in that case.
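For completeness, the 1.0 MB figure in the cache(MB) column maps onto a per-variable chunk cache setting like the sketch below. The ncid and varid are assumed to come from an open file, and the slot count and preemption value are illustrative choices, not the test harness's actual parameters.

#include <netcdf.h>

/* Request a 1 MB chunk cache for one variable; returns a netCDF error code. */
int set_one_mb_cache(int ncid, int varid)
{
    size_t size = 1048576;    /* cache size in bytes (1.0 MB) */
    size_t nelems = 1009;     /* number of hash slots for cached chunks */
    float preemption = 0.75f; /* how readily fully-read chunks are evicted */
    return nc_set_var_chunk_cache(ncid, varid, size, nelems, preemption);
}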