Faster time series for the people!
What HDF5 chunk cache sizes are good for reading time series data
in netCDF-4? I'm sure you have wondered - I know I have. Now we know:
0.5 to 4 MB. Bigger caches just slow this down. Now that came as a
surprise!
The first three numbers are the chunk sizes of the three dimensions of the
main data variable. The next two columns show the deflate level (0 = none)
and whether the shuffle filter was applied (0 = no). These are the same for
every run, because the same input file is used for all these runs - only
the chunk cache size is changed when (re-)opening the file. The Unix file
cache is cleared between each run.
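Here's a minimal sketch of how each run can set the cache before reopening the file, using the nc_set_chunk_cache() call that is new in netCDF-4.1 (it sets the default HDF5 chunk cache for files opened after the call). The slot count and preemption values below are just examples:

#include <stddef.h>
#include <netcdf.h>

/* Sketch: reopen a file with a different default chunk cache.
   nc_set_chunk_cache() applies to files opened after the call;
   1009 slots and 0.75 preemption are example values only. */
int
open_with_cache(const char *path, size_t cache_bytes, int *ncidp)
{
   int ret;

   /* The Unix file cache was cleared before each run; on Linux that
      can be done (as root) with:
         sync; echo 3 > /proc/sys/vm/drop_caches */
   if ((ret = nc_set_chunk_cache(cache_bytes, 1009, 0.75f)))
      return ret;
   return nc_open(path, NC_NOWRITE, ncidp);
}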
The two times shown are the number of microseconds to read the first
time series of the data, and the average time to read a time series
once all time series have been read.
*** Benchmarking pr_A1 file pr_A1_256_128_128.nc with various HDF5 chunk caches...

cs[0]  cs[1]  cs[2]  cache(MB)  deflate  shuffle  1st_read_ser(us)  avg_read_ser(us)
  256    128    128        0.5        0        0           1279615              2589
  256    128    128        1.0        0        0           1279613              2641
  256    128    128        4.0        0        0           1298543              2789
  256    128    128       16.0        0        0           1470297             34603
  256    128    128       32.0        0        0           1470360             34541
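For reference, each timed read looks something like this sketch. The variable name "pr" and the (time, lat, lon) dimension order are my assumptions about the file; gettimeofday() provides the microsecond timings:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <netcdf.h>

/* Sketch: time one complete time-series read at a single (lat, lon)
   point. Assumes a variable "pr" with dimensions (time, lat, lon). */
long long
time_series_read(int ncid, size_t lat, size_t lon)
{
   int varid, dimids[3];
   size_t time_len, start[3], count[3];
   float *series;
   struct timeval t0, t1;

   if (nc_inq_varid(ncid, "pr", &varid)) exit(1);
   if (nc_inq_vardimid(ncid, varid, dimids)) exit(1);
   if (nc_inq_dimlen(ncid, dimids[0], &time_len)) exit(1);
   if (!(series = malloc(time_len * sizeof(float)))) exit(1);

   /* Read all time steps at one horizontal point. */
   start[0] = 0;   count[0] = time_len;
   start[1] = lat; count[1] = 1;
   start[2] = lon; count[2] = 1;

   gettimeofday(&t0, NULL);
   if (nc_get_vara_float(ncid, varid, start, count, series)) exit(1);
   gettimeofday(&t1, NULL);

   free(series);
   return (t1.tv_sec - t0.tv_sec) * 1000000LL + (t1.tv_usec - t0.tv_usec);
}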
Note that for cache sizes of 4 MB and below, the first time series read took
1.2 - 1.3 s, and the average time was .0025 - .0028 s. But when I
increased the chunk cache to 16 MB and 32 MB, the time for the first read
went to 1.5 s, and the average time for all reads went to .035 s - an order
of magnitude jump!
I have repeated these tests a number of times, always with the same result for chunk cache buffers of 16 MB and above.
I am planning on changing the netCDF-4.1 default to 1 MB, which is the
HDF5 default. (I guess we should have listened to the HDF5 team in the
first place.)
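Of course, these numbers are for time-series reads; if 1 MB turns out to be too small for some other access pattern, netCDF-4.1 also lets you raise the cache for a single variable with nc_set_var_chunk_cache(). A sketch, with illustrative file name, variable name, and cache settings:

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

int
main(void)
{
   int ncid, varid;

   if (nc_open("pr_A1_256_128_128.nc", NC_NOWRITE, &ncid)) exit(1);
   if (nc_inq_varid(ncid, "pr", &varid)) exit(1);

   /* Give just this variable a 4 MB cache with 1009 slots and 0.75
      preemption - example values only. */
   if (nc_set_var_chunk_cache(ncid, varid, 4 * 1024 * 1024, 1009, 0.75f))
      exit(1);

   /* ... reads of this variable now use the larger cache ... */
   if (nc_close(ncid)) exit(1);
   return 0;
}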