NetCDF-4 Horizontal Data Read Performance with Cache Clearing
03 January 2010
Here are my numbers for horizontal reads with different chunk and cache sizes.
The times are the time to read each horizontal slice, reading all of them in
sequence. I realize that reading just one horizontal slice would give
different (much higher) times. The reason is that when I read the first
horizontal level, the various caches along the way start filling up with the
following levels, so when I read those I get very low times. Reading this way
allows the caching to work. Reading just one horizontal level and then
stopping the program (to clear the cache) would be the worst-case scenario
for the caching.
But what should I be optimizing for? Reading all horizontal levels, or just reading one?
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us)
0 0 0 0.0 0 0 1527
1 16 32 1.0 0 0 1577
1 16 128 1.0 0 0 1618
1 16 256 1.0 0 0 1515
1 64 32 1.0 0 0 1579
1 64 128 1.0 0 0 1586
1 64 256 1.0 0 0 1584
1 128 32 1.0 0 0 1593
1 128 128 1.0 0 0 1583
1 128 256 1.0 0 0 1571
10 16 32 1.0 0 0 2128
10 16 128 1.0 0 0 2520
10 16 256 1.0 0 0 4309
10 64 32 1.0 0 0 4083
10 64 128 1.0 0 0 1751
10 64 256 1.0 0 0 1713
10 128 32 1.0 0 0 1692
10 128 128 1.0 0 0 1862
10 128 256 1.0 0 0 1749
256 16 32 1.0 0 0 10594
256 16 128 1.0 0 0 3681
256 16 256 1.0 0 0 3074
256 64 32 1.0 0 0 3656
256 64 128 1.0 0 0 3042
256 64 256 1.0 0 0 2773
256 128 32 1.0 0 0 3828
256 128 128 1.0 0 0 2335
256 128 256 1.0 0 0 1581
1024 16 32 1.0 0 0 35622
1024 16 128 1.0 0 0 2759
1024 16 256 1.0 0 0 2912
1024 64 32 1.0 0 0 2875
1024 64 128 1.0 0 0 2868
1024 64 256 1.0 0 0 3816
1024 128 32 1.0 0 0 2780
1024 128 128 1.0 0 0 2558
1024 128 256 1.0 0 0 1628
1560 16 32 1.0 0 0 154450
1560 16 128 1.0 0 0 3063
1560 16 256 1.0 0 0 3700
NetCDF-4 Performance With Cache Clearing
03 January 2010
Now I have made some changes to my timing program, and I think I am getting better (i.e. more realistic) times.
First, I now clear the cache before each read.
Second, I don't read the horizontal sections and the time series in the
same program run - whichever one is done first loads the cache for the
other and gives unrealistically low times. Now I time these separately.
OK, so here are some time series read times. The first row is netCDF classic data:
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_time_ser(us)
1 16 32 1.0 0 0 2434393
1 16 128 1.0 0 0 2411127
1 16 256 1.0 0 0 2358892
1 64 32 1.0 0 0 2455963
1 64 128 1.0 0 0 2510818
1 64 256 1.0 0 0 2482509
1 128 32 1.0 0 0 2480481
1 128 128 1.0 0 0 2489436
1 128 256 1.0 0 0 2504924
10 16 32 1.0 0 0 1146593
10 16 128 1.0 0 0 1156650
10 16 256 1.0 0 0 1259026
10 64 32 1.0 0 0 1150427
10 64 128 1.0 0 0 2384334
10 64 256 1.0 0 0 2438587
10 128 32 1.0 0 0 1258380
10 128 128 1.0 0 0 2521213
10 128 256 1.0 0 0 2528927
256 16 32 1.0 0 0 174062
256 16 128 1.0 0 0 358613
256 16 256 1.0 0 0 404662
256 64 32 1.0 0 0 400489
256 64 128 1.0 0 0 688528
256 64 256 1.0 0 0 1267521
256 128 32 1.0 0 0 404422
256 128 128 1.0 0 0 1374661
256 128 256 1.0 0 0 2445647
1024 16 32 1.0 0 0 78718
1024 16 128 1.0 0 0 346506
1024 16 256 1.0 0 0 378813
1024 64 32 1.0 0 0 340703
1024 64 128 1.0 0 0 665649
1024 64 256 1.0 0 0 1269936
1024 128 32 1.0 0 0 380796
1024 128 128 1.0 0 0 1269627
1024 128 256 1.0 0 0 2513330
1560 16 32 1.0 0 0 58124
1560 16 128 1.0 0 0 332641
1560 16 256 1.0 0 0 372587
1560 64 32 1.0 0 0 323445
1560 64 128 1.0 0 0 635165
1560 64 256 1.0 0 0 1263225
1560 128 32 1.0 0 0 372226
1560 128 128 1.0 0 0 1265999
1560 128 256 1.0 0 0 2712887
These numbers make more sense. It takes about 2.3 seconds to read the time series from the classic file.
Ed
Demonstrating Caching and Its Effect on Timing
02 January 2010
The cache can really mess up benchmarking!
For example:
bash-3.2$ sudo bash clear_cache.sh && ./tst_ar4_3d -h -c
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 66 2102
bash-3.2$ sudo bash clear_cache.sh && ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 1859 2324282
In the first run of tst_ar4_3d, with the -c option, the sample data file
is first created and then read. The read time for the time series read
is really low, because the file (having just been created) is still
loaded in a disk cache somewhere in the OS or in the disk hardware.
When I clear the cache and rerun without the -c option, the sample data
file is not created, it is assumed to already exist. Since the cache has
been cleared, the time series read has to read the data from disk, and
it takes 1000 times longer.
Well, that's why they invented disk caches.
This leads me to believe that my horizontal read times are fake too,
because I first do a time series read, thus loading some or all of the
file into the cache. I need to break that out into a separate test, I
see, or perhaps make the order of the two tests controllable from the
command line.
Oy, this benchmarking stuff is tricky business! I thought I had found
some really good performance for netCDF-4, but now I am not sure. I need
to look again more carefully and make sure that I am not being faked
out by the caches.
Ed
Effects of Clearing the Cache on Benchmarks
02 January 2010
How to win friends and influence benchmarks...
I note that I have a shell script in my nc_test4 directory,
clear_cache.sh. I have to run it with sudo, but when I do, it has a
dramatic effect on the time that the time series read takes.
The following uses the new (not yet checked in) test program
tst_ar4_3d.c, which sets up a simpler proxy data file for the
AR-4 tests. I want to show that a simpler file (but with a data variable
of the same size) has performance similar to the slightly more dressed-up
pr_A1 file from AR-4 that I got from Gary, because my simpler
file is easier to create in a test program.
bash-3.2$ ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 1420 2281847
bash-3.2$ ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 81 3159
bash-3.2$ ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 76 2983
bash-3.2$ sudo bash clear_cache.sh
bash-3.2$ ./tst_ar4_3d -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
64 256 128 4.0 0 0 1410 2504315
Wow, what a difference a cleared cache makes!
Here's the clear_cache.sh script:
#!/bin/bash -x
# Clear the disk caches.
sync
echo 3 > /proc/sys/vm/drop_caches
More Cache Size Benchmarks
31 December 2009
Why does increasing cache size slow down time series access so much?
bash-3.2$ ./tst_ar4 -h pr_A1_256_128_128.nc
cs[0] cs[1] cs[2] cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
256 128 128 0.5 0 0 217 2773
256 128 128 1.0 0 0 214 1935
256 128 128 4.0 0 0 214 1929
256 128 128 32.0 0 0 160 84440
256 128 128 128.0 0 0 129 82407