Large-Enough Cache Very Important When Reading Compressed NetCDF-4/HDF5 Data

The HDF5 chunk cache must be large enough to hold an uncompressed chunk.

Here are some test runs showing that a large enough chunk cache is very important when reading compressed data. If the chunk cache is not big enough, the data have to be inflated again and again.

The first run below uses the default 1 MB chunk cache. The second uses a 16 MB cache. Note that the times to read the first time step are comparable, but the run with the large cache has a much lower average time, because each chunk is uncompressed only once.

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 pr_A1_z1_64_128_256.nc -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_hor(us)   avg_read_hor(us)
64    128   256   1.0       1       0       387147             211280

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 16000000 \
> pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)   avg_read_hor(us)
64   128   256   15.3       1       0       320176             4558
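
The -c option to tst_ar4 sets the HDF5 chunk cache size before the file is opened. Each 64 x 128 x 256 chunk of this variable (presumably stored as 4-byte floats) is about 8 MB uncompressed, so the default 1 MB cache cannot hold even one chunk, and every horizontal read has to inflate the same chunk again; the 15.3 MB cache holds a whole chunk, so the remaining time steps in that chunk come straight from memory. In your own code the cache can be enlarged with the netCDF-4 chunk cache call. Here's a minimal sketch; the 16 MB size, 1009 hash slots, and 0.75 preemption are illustrative values, not settings taken from this test.

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define ERR(e) do { fprintf(stderr, "Error: %s\n", nc_strerror(e)); exit(1); } while (0)

int main(void)
{
   int ncid, ret;

   /* Set the chunk cache used for files opened after this call: 16 MB,
      1009 hash slots, preemption 0.75 (the last two are illustrative). */
   if ((ret = nc_set_chunk_cache(16 * 1024 * 1024, 1009, 0.75f)))
      ERR(ret);

   if ((ret = nc_open("pr_A1_z1_64_128_256.nc", NC_NOWRITE, &ncid)))
      ERR(ret);

   /* ... read the compressed variable here ... */

   if ((ret = nc_close(ncid)))
      ERR(ret);
   return 0;
}

The cache can also be adjusted per variable after the file is open with nc_set_var_chunk_cache(), which is handy when only one large compressed variable needs the bigger cache.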

For comparison, here are the times for the netCDF-4/HDF5 file that is not compressed:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h pr_A1_64_128_256.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)  avg_read_hor(us)
64    128   256   1.0        0       0       459               1466

And here's the same run on the classic netCDF version of the file:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h \
> pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)  avg_read_hor(us)
0     0     0     0.0        0       0       2172              1538
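
As a rough guide to what these numbers measure, tst_ar4 times the read of one horizontal slice (a single time step) of the precipitation variable and then averages over many time steps. The sketch below shows the general shape of such a timing loop; it is not the actual tst_ar4 source, and the variable name "pr", the 128 x 256 horizontal grid, and the 1560 time steps are assumptions based on the file names above.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <netcdf.h>

#define ERR(e) do { fprintf(stderr, "Error: %s\n", nc_strerror(e)); exit(1); } while (0)

/* Microseconds elapsed from a to b. */
static long usec_diff(const struct timeval *a, const struct timeval *b)
{
   return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_usec - a->tv_usec);
}

int main(int argc, char **argv)
{
   int ncid, varid, ret;
   size_t start[3] = {0, 0, 0};
   size_t count[3] = {1, 128, 256};          /* one horizontal slice per read */
   int nsteps = 1560;                        /* assumed number of time steps */
   float *slice;
   struct timeval t0, t1;
   long first_us = 0, total_us = 0;

   if (argc < 2) { fprintf(stderr, "usage: %s file.nc\n", argv[0]); return 1; }
   if (!(slice = malloc(128 * 256 * sizeof(float)))) return 1;

   if ((ret = nc_open(argv[1], NC_NOWRITE, &ncid))) ERR(ret);
   if ((ret = nc_inq_varid(ncid, "pr", &varid))) ERR(ret);

   for (int t = 0; t < nsteps; t++)
   {
      start[0] = t;
      gettimeofday(&t0, NULL);
      if ((ret = nc_get_vara_float(ncid, varid, start, count, slice))) ERR(ret);
      gettimeofday(&t1, NULL);
      if (t == 0) first_us = usec_diff(&t0, &t1);
      total_us += usec_diff(&t0, &t1);
   }

   printf("1st_read_hor(us) %ld   avg_read_hor(us) %ld\n", first_us, total_us / nsteps);

   if ((ret = nc_close(ncid))) ERR(ret);
   free(slice);
   return 0;
}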

So the winner on performance is the uncompressed netCDF-4/HDF5 file, with the best read time for the first time step and the best average read time. Next comes the netCDF classic file, then the compressed netCDF-4/HDF5 file, which takes two orders of magnitude longer than the classic file for the first time step, but then catches up so that its average read time is only about three times slower than the classic file's.

The file sizes show that this read penalty is probably not worth it:

pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc    204523236
pr_A1_z1_64_128_256.nc                            185543248
pr_A1_64_128_256.nc                               209926962

So the compressed netCDF-4/HDF5 file saves only about 20 MB out of roughly 200 MB, or about 10%.

The uncompressed netCDF-4/HDF5 file is about 5 MB larger than the classic file, or roughly 2.5% larger.
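
For completeness, the chunking and compression settings shown in the tables above (64 x 128 x 256 chunks, deflate on, no shuffle) are applied when the variable is defined in the netCDF-4 file. Here's a hedged sketch of how such a file might be created; the dimension names, the deflate level of 1, and the unlimited time dimension are assumptions, not details taken from the actual conversion.

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

#define ERR(e) do { fprintf(stderr, "Error: %s\n", nc_strerror(e)); exit(1); } while (0)

int main(void)
{
   int ncid, varid, dimids[3], ret;
   size_t chunks[3] = {64, 128, 256};   /* chunk sizes from the tables above */

   if ((ret = nc_create("pr_A1_z1_64_128_256.nc", NC_NETCDF4 | NC_CLOBBER, &ncid)))
      ERR(ret);
   if ((ret = nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]))) ERR(ret);
   if ((ret = nc_def_dim(ncid, "lat", 128, &dimids[1]))) ERR(ret);
   if ((ret = nc_def_dim(ncid, "lon", 256, &dimids[2]))) ERR(ret);
   if ((ret = nc_def_var(ncid, "pr", NC_FLOAT, 3, dimids, &varid))) ERR(ret);

   /* Chunk the variable as 64 x 128 x 256, matching the runs above. */
   if ((ret = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks))) ERR(ret);

   /* shuffle = 0, deflate = 1, deflate_level = 1 (the level is an assumption). */
   if ((ret = nc_def_var_deflate(ncid, varid, 0, 1, 1))) ERR(ret);

   /* ... write the precipitation data, then close ... */
   if ((ret = nc_close(ncid))) ERR(ret);
   return 0;
}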
