We're getting there...
It seems clear that Quincey's original advice is good: use large, squarish chunks.
My former scheme of default chunk sizes was not too bad for the
innermost dimensions (it used the full length of each), but using a
chunk size of 1 for unlimited dimensions turned out to be bad for read
performance.
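For reference, here's roughly how chunk sizes can be set explicitly with nc_def_var_chunking, rather than relying on the defaults. This is just a sketch: the file name, variable name, and chunk values are made up for illustration, and ERR is my usual error-checking macro.

   /* Sketch: define a 3D float variable with explicit chunk sizes
      instead of the library defaults. The dimension lengths and
      chunk sizes here are illustrative only. */
   int ncid, varid, dimids[3];
   size_t chunks[3] = {64, 128, 256};

   if (nc_create("test.nc", NC_NETCDF4, &ncid)) ERR;
   if (nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0])) ERR;
   if (nc_def_dim(ncid, "lat", LAT_LEN, &dimids[1])) ERR;
   if (nc_def_dim(ncid, "lon", LON_LEN, &dimids[2])) ERR;
   if (nc_def_var(ncid, "data", NC_FLOAT, 3, dimids, &varid)) ERR;
   if (nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks)) ERR;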
Here are some read numbers for chunk sizes in what I believe is the correct range:
cs[0]  cs[1]  cs[2]  cache(MB)  deflate  shuffle  1st_read_hor(us)  avg_read_hor(us)
    0      0      0        0.0        0        0              7087              1670
   64    128    256        1.0        0        0               510              1549
  128    128    256        1.0        0        0               401              1688
  256    128    256        1.0        0        0               384              1679
   64    128    256        1.0        1        0            330548            211382
  128    128    256        1.0        1        0            618035            420617
Note that the last two rows are deflated versions of the data, and are roughly 1000 times slower to read as a result.
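For anyone trying to reproduce this, compression is turned on per-variable at define time with nc_def_var_deflate. A sketch of what that looks like for the deflated rows above (the deflate level shown is an assumption, not necessarily what was used for these files):

   /* Sketch: turn on deflate, without the shuffle filter, for the
      variable at define time. The deflate level of 1 is an assumption. */
   int shuffle = 0, deflate = 1, deflate_level = 1;
   if (nc_def_var_deflate(ncid, varid, shuffle, deflate, deflate_level)) ERR;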
The first row (chunk sizes of zero) is the netCDF classic file. The
non-deflated HDF5 files easily beat the read performance of the classic
file, probably because the HDF5 files are stored in native endianness,
while the netCDF classic data must be converted from big-endian to
little-endian for this platform.
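If you want to test that theory, netCDF-4 lets you pick the on-disk byte order per variable at define time. A quick sketch:

   /* Sketch: native byte order is the netCDF-4 default; forcing
      big-endian would mimic the classic file's byte order. */
   if (nc_def_var_endian(ncid, varid, NC_ENDIAN_NATIVE)) ERR;
   /* ...or: if (nc_def_var_endian(ncid, varid, NC_ENDIAN_BIG)) ERR; */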
What is odd is that the HDF5 files have a higher average read time than
their first read time. I don't get that. I expected the first read to be
the longest wait, with subsequent reads coming back faster, but that's
not what happens for these uncompressed HDF5 files. I am clearing the cache between reads.
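The cache(MB) column above is the per-variable HDF5 chunk cache. If you want to experiment with it yourself, it can be set before reading with nc_set_var_chunk_cache; a sketch (the element count and preemption values here are just assumptions):

   /* Sketch: give the variable a 1 MB chunk cache before reading.
      The nelems and preemption values are assumptions, not necessarily
      what this benchmark used. */
   size_t cache_size = 1024 * 1024, cache_nelems = 1000;
   float cache_preemption = 0.75;
   if (nc_set_var_chunk_cache(ncid, varid, cache_size, cache_nelems,
                              cache_preemption)) ERR;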
Here's my timing code:
   /* Read the data variable in horizontal slices. */
   start[0] = 0;
   start[1] = 0;
   start[2] = 0;
   count[0] = 1;
   count[1] = LAT_LEN;
   count[2] = LON_LEN;

   /* Read (and time) the first one. */
   if (gettimeofday(&start_time, NULL)) ERR;
   if (nc_get_vara_float(ncid, varid, start, count, hor_data)) ERR_RET;
   if (gettimeofday(&end_time, NULL)) ERR;
   if (timeval_subtract(&diff_time, &end_time, &start_time)) ERR;
   read_1_us = (int)diff_time.tv_sec * MILLION + (int)diff_time.tv_usec;

   /* Read (and time) all the rest. */
   if (gettimeofday(&start_time, NULL)) ERR;
   for (start[0] = 1; start[0] < TIME_LEN; start[0]++)
      if (nc_get_vara_float(ncid, varid, start, count, hor_data)) ERR_RET;
   if (gettimeofday(&end_time, NULL)) ERR;
   if (timeval_subtract(&diff_time, &end_time, &start_time)) ERR;
   avg_read_us = ((int)diff_time.tv_sec * MILLION + (int)diff_time.tv_usec +
                  read_1_us) / TIME_LEN;
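timeval_subtract isn't shown above; it's the usual microsecond-carrying subtraction of two struct timevals, along the lines of the version in the glibc manual. This sketch may differ in detail from what I actually used:

#include <sys/time.h>
#define MILLION 1000000

/* Subtract timevals: *result = *x - *y. Returns 1 if the difference
   is negative, 0 otherwise. */
static int
timeval_subtract(struct timeval *result, struct timeval *x, struct timeval *y)
{
   /* Perform the carry for the later subtraction by updating y. */
   if (x->tv_usec < y->tv_usec)
   {
      int nsec = (y->tv_usec - x->tv_usec) / MILLION + 1;
      y->tv_usec -= MILLION * nsec;
      y->tv_sec += nsec;
   }
   if (x->tv_usec - y->tv_usec > MILLION)
   {
      int nsec = (x->tv_usec - y->tv_usec) / MILLION;
      y->tv_usec += MILLION * nsec;
      y->tv_sec -= nsec;
   }

   /* Compute the difference; tv_usec is now certainly positive. */
   result->tv_sec = x->tv_sec - y->tv_sec;
   result->tv_usec = x->tv_usec - y->tv_usec;

   /* Return 1 if the result is negative. */
   return x->tv_sec < y->tv_sec;
}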