Proof That the New Default Chunk Cache in 4.1 Improves Performance

A last-minute change before the 4.1 release ensures that a common case gets good performance.

There is a terrible performance hit if your data are deflated and your chunk cache is too small to hold even one chunk.

Since the default HDF5 chunk cache size is only 1 MB, this is easy to do.

So I have added code so that, when a file is opened, if a variable's data are compressed and its chunk size is greater than the default chunk cache size for that variable, the chunk cache is increased to a multiple of the chunk size.

The code looks like this:

/* Is this a deflated variable with a chunksize greater than the
 * current cache size? */
if (!var->contiguous && var->deflate)
{
   chunk_size_bytes = 1;
   for (d = 0; d < var->ndims; d++)
     chunk_size_bytes *= var->chunksizes[d];
   if (var->type_info->size)
     chunk_size_bytes *= var->type_info->size;
   else
     chunk_size_bytes *= sizeof(char *);
#define NC_DEFAULT_NUM_CHUNKS_IN_CACHE 10
#define NC_DEFAULT_MAX_CHUNK_CACHE 67108864
   if (chunk_size_bytes > var->chunk_cache_size)
   {
     var->chunk_cache_size = chunk_size_bytes * NC_DEFAULT_NUM_CHUNKS_IN_CACHE;
     if (var->chunk_cache_size > NC_DEFAULT_MAX_CHUNK_CACHE)
        var->chunk_cache_size = NC_DEFAULT_MAX_CHUNK_CACHE;
     if ((retval = nc4_reopen_dataset(grp, var)))
        return retval;
   }
}

I am setting the chunk cache to 10 times the chunk size, up to 64 MB max. Reasonable? Comments are welcome.
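
If this heuristic does not suit you (or you are using a version of the library without the automatic adjustment), roughly the same thing can, I believe, be done from application code with the per-variable chunk cache call, nc_set_var_chunk_cache(). Here is a minimal sketch; error handling is abbreviated, and the slot count and preemption values are only illustrative:

#include <netcdf.h>

#define NUM_CHUNKS_IN_CACHE 10
#define MAX_CHUNK_CACHE 67108864   /* 64 MB */

/* Sketch: grow one variable's chunk cache to ten chunks, capped at
 * 64 MB, using only public inquiry and cache functions. */
static int
bump_var_chunk_cache(int ncid, int varid)
{
   int storage, ndims, d, retval;
   size_t chunksizes[NC_MAX_VAR_DIMS], type_size, cache_bytes;
   nc_type xtype;

   if ((retval = nc_inq_var_chunking(ncid, varid, &storage, chunksizes)))
      return retval;
   if (storage != NC_CHUNKED)
      return NC_NOERR;               /* contiguous variable: nothing to do */
   if ((retval = nc_inq_varndims(ncid, varid, &ndims)))
      return retval;
   if ((retval = nc_inq_vartype(ncid, varid, &xtype)))
      return retval;
   if ((retval = nc_inq_type(ncid, xtype, NULL, &type_size)))
      return retval;

   cache_bytes = type_size;
   for (d = 0; d < ndims; d++)
      cache_bytes *= chunksizes[d];
   cache_bytes *= NUM_CHUNKS_IN_CACHE;
   if (cache_bytes > MAX_CHUNK_CACHE)
      cache_bytes = MAX_CHUNK_CACHE;

   /* 1009 slots and 0.75 preemption are illustrative values. */
   return nc_set_var_chunk_cache(ncid, varid, cache_bytes, 1009, 0.75);
}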

The timing results show a clear difference. Below are two runs without any per-variable caching; the second run sets a 64 MB file-level chunk cache, which speeds things up considerably. (The last number in each row is the average read time for a horizontal layer, in microseconds.)

bash-3.2$ ./tst_ar4_3d  pr_A1_z1_256_128_256.nc 
256     128     256     1.0             1       0           836327       850607

bash-3.2$ ./tst_ar4_3d -c 68000000 pr_A1_z1_256_128_256.nc
256     128     256     64.8            1       0           833453       3562

Without the cache it is over 200 times slower.
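
The -c 68000000 option is what sets that file-level cache in the benchmark. From application code, the same thing can, I believe, be done with nc_set_chunk_cache() before the file is opened; a small sketch (the slot count and preemption values are illustrative):

#include <netcdf.h>

/* Sketch: set a 64 MB default chunk cache, then open the file; the
 * cache setting applies to files opened after the call. */
static int
open_with_big_cache(const char *path, int *ncidp)
{
   int retval;

   if ((retval = nc_set_chunk_cache(67108864, 1009, 0.75)))
      return retval;
   return nc_open(path, NC_NOWRITE, ncidp);
}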

Now I have turned on automatic variable caches when appropriate:

bash-3.2$ ./tst_ar4_3d  pr_A1_z1_256_128_256.nc 
256     128     256     1.0             1       0           831470       3568

In this run, although no file-level cache was turned on, I got the same response time. That's because when opening the file the netCDF library noticed that this deflated variable had a chunk size bigger than the default cache size, and opened a bigger cache.
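
To see what the library decided, the per-variable cache can, I believe, be queried right after the open with nc_get_var_chunk_cache(); something like this sketch:

#include <stdio.h>
#include <netcdf.h>

/* Sketch: report the chunk cache the library chose for a variable;
 * varid is assumed to identify the deflated variable of interest. */
static int
report_var_cache(int ncid, int varid)
{
   size_t size, nelems;
   float preemption;
   int retval;

   if ((retval = nc_get_var_chunk_cache(ncid, varid, &size, &nelems,
                                        &preemption)))
      return retval;
   printf("chunk cache: %zu bytes, %zu slots, preemption %.2f\n",
          size, nelems, preemption);
   return NC_NOERR;
}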

All of this work is in support of the general netCDF user writing very large files, and specifically in support of the AR-5 effort.

The only downside is that if you open a file with many such variables on a machine with very little memory, you may run out of memory.

Could Code Review Work at Unidata?

Recently I was reminded that I used to be a proponent of code review...

I am (still) a big fan of code review. But code review only works under certain circumstances, and I think those circumstances are hard to find at Unidata.

  • There must be 3 or more reviewers (in addition to the programmer who wrote the code). They must all be programmers in daily practice with the same language and environment as the reviewed code, and working on the same project.
  • Code is judged against written, reviewed requirements.
  • Code that satisfies the requirements with the fewest functions, variables, and lines of code (within reason) is best. (Occam's Razor for code.)
  • There must be full buy-in from the programmers and all their management. This is a very expensive process.
  • All released code (except test code) must be reviewed.
  • Major (user-affecting) and minor (bad coding) defects are identified but not fixed at review. There is no discussion of potential fixes - a problem is identified and the review moves on.
  • All major and minor problems must be fixed by the original programmer, and the code re-submitted for re-review. (Usually much quicker, but still necessary.)
  • Reviews should run one hour at a time, with one or two more hours of mandatory preparation by all reviewers.
  • Records are kept, and the project manager follows up.
  • A unanimous decision is required for all defects and to pass the review.
  • No supervisors or spectators, ever.

As you can see, to meet all these conditions is no small feat. In fact, I have hardly ever seen it done. When I have seen it done, it has worked very well. Ideally, it becomes a place where all the project programmers learn from each other. The best practices spread through the project code, and bad practices wither away.

How can this work at Unidata? We don't have the programmers!

If we got more programmers, I would still think that other, less expensive software process improvements should come first (like written requirements, and requirements review).

However, I would be willing to participate in any code review that anyone organizes at Unidata as long as two conditions are met:
  1. It seems to be part of a serious effort - not just a casual thing. There must be feedback from the review into the product. (That is, something must be done with the results of the review.) And there must be written requirements, so I know what the code is supposed to do.
  2. My project(s) must be compensated in some way with time. I need someone else to do the quarter-day per week of work I would give up to make time for review. For example, I would do plenty of good reviewing for anyone who answered 3 netCDF support questions a week!

Why I Don't Think the Number of Processors Affects the Wall Clock Time of My Tests...

Using the taskset command I demonstrate that my benchmarks run on one processor.

Recently Russ raised a question: are my wall clock times wrong because I have many processors on my machine? I believe the answer is no.

First, without special effort, Linux will not split a (single-threaded) process across more than one processor. When I run my benchmarking program, tst_ar4, I see from the top command that one processor goes to > 90% use, and all the others remain at 0%.

Second, I confirmed this with the taskset command, which I had never heard of before. It limits a process to one processor (or any number of processors you choose); in fact, it lets you pick which processors are used. Here are some timing results showing that I get about the same times with the taskset command as without it, on a compressed file read:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       304398    4568

bash-3.2$ sudo ./clear_cache.sh && taskset -c 4 ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       306810   4553

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       292707   4616

bash-3.2$ sudo ./clear_cache.sh && taskset -c 4 ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       293713   4567
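
For what it's worth, taskset is just a front end for the Linux processor-affinity interface, so a program can pin itself the same way with sched_setaffinity(). A Linux-specific sketch (not something tst_ar4 actually does):

#define _GNU_SOURCE
#include <sched.h>

/* Sketch: restrict the calling process to a single CPU, the same
 * effect as "taskset -c 4" from the shell when cpu is 4. */
static int
pin_to_cpu(int cpu)
{
   cpu_set_t mask;

   CPU_ZERO(&mask);
   CPU_SET(cpu, &mask);
   return sched_setaffinity(0, sizeof(mask), &mask);   /* 0 means this process */
}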

Large-Enough Cache Very Important When Reading Compressed NetCDF-4/HDF5 Data

The HDF5 chunk cache must be large enough to hold an uncompressed chunk.

Here are some test runs showing that a large enough cache is very important when reading compressed data. If the chunk cache is not big enough, the data have to be deflated again and again.
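
To make that concrete: assuming the pr variable is a 4-byte float, each 64 x 128 x 256 chunk in the file below is 64 * 128 * 256 * 4 = 8,388,608 bytes, about 8 MB uncompressed - far bigger than the 1 MB default cache, so chunks are evicted and re-inflated over and over as the horizontal layers are read.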

The first run below uses the default 1 MB chunk cache. The second uses a 16 MB cache. Note that the times to read the first time step are comparable, but the run with the large cache has a much lower average time, because each chunk is uncompressed only once.

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 pr_A1_z1_64_128_256.nc -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_hor(us)   avg_read_hor(us)
64    128   256   1.0       1       0       387147             211280

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 pr_A1_z1_64_128_256.nc -h \
bash-3.2$ -c 16000000 pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)   avg_read_hor(us)
64   128   256   15.3       1       0       320176             4558

For comparison, here are the times for the netCDF-4/HDF5 file that is not compressed:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h pr_A1_64_128_256.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)  avg_read_hor(us)
64    128   256   1.0        0       0       459               1466

And here's the same run on the classic netCDF version of the file:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h \
bash-3.2$ pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)  avg_read_hor(us)
0     0     0     0.0        0       0       2172              1538

So the performance winner is the uncompressed netCDF-4/HDF5 file, with the best read time for the first time step and the best average read time. Next comes the netCDF classic file, then the compressed netCDF-4/HDF5 file, which takes two orders of magnitude longer than the classic file for the first time step, but then catches up so that its average read time is only about three times slower than the classic file's (4558 vs. 1538 microseconds).

The file sizes show that this read penalty is probably not worth it:

pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc    204523236
pr_A1_z1_64_128_256.nc                            185543248
pr_A1_64_128_256.nc                               209926962

So the compressed netCDF-4/HDF5 file saves only about 20 MB out of roughly 200 MB, or about 10%.

The uncompressed NetCDF-4/HDF5 file is 5 MB larger than the classic file, or about 2.5% larger. 

Smaller Chunk Sizes For Unlimited Dimension

More tests, this time varying the chunk length along the unlimited dimension...

pr_A1_4_64_128.nc pr_A1_8_64_128.nc pr_A1_16_64_128.nc pr_A1_32_64_128.nc \
pr_A1_64_64_128.nc
cs[0]  cs[1]  cs[2]  cache(MB)  deflate  shuffle  1st_read_hor(us)  avg_read_hor(us)
0      0      0      0.0        0        0        2155              1603
4      64     128    1.0        0        0        7021              1567
8      64     128    1.0        0        0        14084             1538
16     64     128    1.0        0        0        82906             1570
32     64     128    1.0        0        0        145295            2138
64     64     128    1.0        0        0        21960             2825
cs[0]  cs[1]  cs[2]  cache(MB)  deflate  shuffle  1st_read_ser(us)  avg_read_ser(us)
0      0      0      0.0        0        0        2399157           9181
4      64     128    1.0        0        0        2434194           15954
8      64     128    1.0        0        0        2317802           13627
16     64     128    1.0        0        0        1531121           12686
32     64     128    1.0        0        0        1299189           12265
64     64     128    1.0        0        0        863365            2356
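
For reference, chunk sizes like those in the table are chosen when the variable is defined, via nc_def_var_chunking(). A minimal sketch for a float variable pr(time, lat, lon) with 4 x 64 x 128 chunks, like the first chunked file above; the dimension lengths here are illustrative, not read from these files:

#include <netcdf.h>

/* Sketch: define pr(time, lat, lon) with chunks of 4 x 64 x 128.
 * Dimension lengths are illustrative; error handling is abbreviated. */
static int
def_chunked_pr(int ncid, int *varidp)
{
   int dimids[3], retval;
   size_t chunks[3] = {4, 64, 128};

   if ((retval = nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0])))
      return retval;
   if ((retval = nc_def_dim(ncid, "lat", 128, &dimids[1])))
      return retval;
   if ((retval = nc_def_dim(ncid, "lon", 256, &dimids[2])))
      return retval;
   if ((retval = nc_def_var(ncid, "pr", NC_FLOAT, 3, dimids, varidp)))
      return retval;
   return nc_def_var_chunking(ncid, *varidp, NC_CHUNKED, chunks);
}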