nco & netcdf aficionados.
In preparing to optimize nco via MPI, I've been doing some profiling of the
code to see where it's bogging down.
Here are the benchmark-like runs done with oprofile, a free, very cool,
easy-to-install, easy-to-use, stochastic profiler available for a variety of
platforms (oprofile.sf.net)
The first few have been done 2x to show the kind of reproducibility between
runs (they're not identical b/c oprofile time-samples instead of using
hardware counters or synthetic CPUS (like valgrind) - the upside is that
there's no need to specially instrument the code (besides adding the -g CFLAG
for compilation) and (unlike valgrind), the apps run at close to normal
speed; the downside is that the process will miss a few counts).
I've truncated most of the outputs, except the second below to show the entire
range of counts. Some comments <---- are inserted. This approach has
pointed out some targets for us to optimize, but it also points out some
problems in general with aspects of netcdf files (some of which has probably
been gone over in depth before). Most troubling for nco is the byteswapping
bottleneck (swapn4b at netcdf-3.6.0/src/libsrc/ncx.c:119), but there are a
number of similar byeswap fn()s. Some of the nco's spend >95% of the whole
execution time byteswapping and many common operations spend >20%
byteswapping.
Charlie may have mentioned this in the past - I'd like to bring it up again.
Would it be possible to allow pass-thru of those bytes that are not being
changed (it seems silly to byteswap twice for no reason).
Or, since a file read once and written on one architecture is most likely to
be read and written again on the same architecture, can the byte order be
written native for that arch with the order noted in the attributes? Does
netcdf have any mechanisms for allowing this approach currently?
One reason for revisiting this issue is that Charlie was buttonholed by a guy
who was in charge of postprocessing IPCC files and the was pretty perturbed
by the amount of time it took to do it. His requirements may not have been
directly related, but even 30% speedup is significant for petabyte data
sizes.
I'm not sure yet how much time opendap spends doing this, but it wouldn't
surprise me if it's similar.
Methology:
After starting the oprofile daemon:
'sudo opcontrol --no-vmlinux --start'
to start the profiler, I ran:
'sudo opcontrol --reset'
between each run to clear the profile counters to prevent previous runs from
contaminating successive runs. Because of the way it works, it can
accurately sample very long-running processes and your results will be
confounded (as mine were) if different apps call the same functions or even
(possibly) different functions in the same library. So make sure to reset
oprofile before running another test.
In each stanza below, the commandline shown was run after resetting oprofile
and the resulting report was truncated in most instances at the 1-2% level.
/usr/local/bin/ncap -h -O -s "nu_var1
[time,lat,lon,lev]=d4_01*d4_02*(d4_03**2)-(d4_05/d4_06)" -s "nu_var2
[lat,time,lev,lon]=(d4_13/d4_02)*((d4_03**2)-(d4_05/d4_06))" -s "nu_var3
[time,lat,lon]=(d3_08*3d_01)-(3d_05**3)-(3d_11*3d_16)" -s "nu_var4
[lon,lat,time]=(d3_08+3d_01)-(3d_05*3)-3d_11-17.33)" -p /data/hjm/nco_bm
ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc
(2 runs)
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
44010 96.8466 swapn4b <- 96% of time spent byteswapping!!
688 1.5140 px_get
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
44973 96.6995 swapn4b
699 1.5030 px_get
211 0.4537 ncio_px_get
115 0.2473 px_pgin
90 0.1935 getNCvx_float_float
90 0.1935 putNCvx_float_float
84 0.1806 ncx_howmany
54 0.1161 ncx_putn_float_float
47 0.1011 px_pgout
32 0.0688 ncx_getn_float_float
25 0.0538 ncio_px_rel
14 0.0301 anonymous symbol from section .plt
11 0.0237 px_rel
7 0.0151 NC_findvar
6 0.0129 nc_put_vara_float
6 0.0129 nco_put_vara
5 0.0108 NC_varoffset
4 0.0086 NCcoordck
4 0.0086 nc_get_vara_float
3 0.0065 NC_check_id
3 0.0065 getNCv_float
2 0.0043 NCvnrecs
2 0.0043 nco_var_val_cpy
2 0.0043 putNCv_float
2 0.0043 yyparse
1 0.0022 CreateNode
1 0.0022 NC_check_vlen
1 0.0022 NC_endef
1 0.0022 NC_lookupvar
1 0.0022 NCedgeck
1 0.0022 nc__create_mp
1 0.0022 nc_inq_att
1 0.0022 nco_free
1 0.0022 nco_get_vara
1 0.0022 nco_inq_dimid_flg
1 0.0022 nco_mss_val_get
1 0.0022 nco_typ_lng
1 0.0022 nco_var_fll
1 0.0022 new_NC
1 0.0022 odo1
1 0.0022 set_upper
1 0.0022 var_dfl_set
#####################################
/usr/local/bin/ncrcat -h -O -n 4,2,1 -p /data/hjm/nco_bm
skn_lgs_00.nc /data/hjm/nco_bm/foo.nc
(2 runs for 4 files each)
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
162908 30.7513 px_get <--- looks like I/O bound..?
43256 8.1652 main
26857 5.0697 getNCvx_float_float
26640 5.0287 nco_var_get
18739 3.5373 nc_get_var1_float
18640 3.5186 ncio_px_get
16911 3.1922 nc_put_var1_float
16301 3.0771 NC_findattr
16017 3.0235 putNCvx_float_float
15633 2.9510 NCcoordck
13847 2.6138 NC_lookupattr
13791 2.6033 nco_pck_dsk_inq
13701 2.5863 getNCv_float
12746 2.4060 nco_get_var1
12226 2.3078 nco_inq_att_flg
9992 1.8861 NC_check_id
7214 1.3618 swapn4b
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
156526 30.6816 px_get
41796 8.1927 main
26904 5.2736 getNCvx_float_float
25713 5.0402 nco_var_get
18392 3.6051 nc_get_var1_float
18277 3.5826 ncio_px_get
16134 3.1625 nc_put_var1_float
15173 2.9742 NC_findattr
14905 2.9216 NCcoordck
14595 2.8609 putNCvx_float_float
13830 2.7109 getNCv_float
13303 2.6076 nco_pck_dsk_inq
12602 2.4702 nco_get_var1
12523 2.4547 NC_lookupattr
11285 2.2120 nco_inq_att_flg
#####################################
/usr/local/bin/ncrcat -h -O -n 30,2,1 -p /data/hjm/nco_bm
skn_lgs_00.nc /data/hjm/nco_bm/foo.nc
(2 runs for 30 files each to see if there was any change in distribution (not
really))
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
1188898 30.3816 px_get <--- I/O bound..?
329096 8.4098 main
194113 4.9604 getNCvx_float_float
193716 4.9503 nco_var_get
139195 3.5570 nc_get_var1_float
138361 3.5357 ncio_px_get
130332 3.3306 nc_put_var1_float
120096 3.0690 NC_findattr
117864 3.0119 putNCvx_float_float
115390 2.9487 NCcoordck
100581 2.5703 nco_pck_dsk_inq
100496 2.5681 NC_lookupattr
97406 2.4892 nco_get_var1
96558 2.4675 getNCv_float
85875 2.1945 nco_inq_att_flg
77161 1.9718 NC_check_id
52039 1.3298 swapn4b
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
1143662 29.1803 px_get
330483 8.4322 main
209255 5.3391 getNCvx_float_float
202318 5.1621 nco_var_get
141829 3.6187 nc_get_var1_float
134400 3.4292 ncio_px_get
133916 3.4168 nc_put_var1_float
120232 3.0677 NCcoordck
117733 3.0039 NC_findattr
114406 2.9190 putNCvx_float_float
106404 2.7149 getNCv_float
102715 2.6208 nco_pck_dsk_inq
102116 2.6055 nco_get_var1
99158 2.5300 NC_lookupattr
90043 2.2974 nco_inq_att_flg
79626 2.0316 NC_check_id
52848 1.3484 swapn4b
#####################################
/usr/local/bin/ncra -D2 -h -O -n 4,2,1 -p /data/hjm/nco_bm
ipcc_dly_T85_00.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
227839 65.9413 nco_var_add_tll_ncra <---- optimization / MPI target!
109567 31.7109 swapn4b <---- again, byteswapping
5253 1.5203 nco_var_nrm
1200 0.3473 px_get
#####################################
/usr/local/bin/ncpdq -D2 -h -O -a 'lon,time,lev,lat' -p /data/hjm/nco_bm
ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
1736463 96.9924 nco_var_dmn_rdr_val <---- optimization / MPI target!
49459 2.7626 swapn4b
2915 0.1628 anonymous symbol from section .plt
659 0.0368 px_get
/usr/local/bin/ncflint -D2 -h -O -w '0.5' -p /data/hjm/nco_bm
ipcc_dly_T85_00.nc ipcc_dly_T85_01.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
115001 29.4886 nco_var_mlt
113175 29.0204 nco_var_cnf_typ
82781 21.2268 nco_var_add_tll_ncflint
75683 19.4067 swapn4b <---- again, byteswapping
969 0.2485 px_get
#####################################
/usr/local/bin/ncwa -D2 -h -O -y sqrt -a
lat,lon /data/hjm/nco_bm/ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
1535141 95.8639 nco_var_avg <---- optimization / MPI target!
37378 2.3341 nco_var_avg_reduce_ttl
26714 1.6682 swapn4b
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
1539548 95.8061 nco_var_avg
38195 2.3769 nco_var_avg_reduce_ttl
26820 1.6690 swapn4b
[with a different, smaller file, relatively more time is spent byteswapping]
/usr/local/bin/ncwa -D2 -h -O -y sqrt -a lat,lon
~/data/nco_bm/sml_stl_28.nc ~/data/nco_bm/sml_stl_out.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
47241 89.9263 nco_var_avg
2607 4.9626 swapn4b
2450 4.6637 nco_var_avg_reduce_ttl
#####################################
#####################################
/usr/local/bin/ncbo -D2 -h -O --op_typ='-' -p /data/hjm/nco_bm
ipcc_dly_T85.nc ipcc_dly_T85_00.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
74162 69.2249 swapn4b <---- again, byteswapping
30484 28.4546 nco_var_sbt
1045 0.9754 px_get
#####################################
/usr/local/bin/ncea -h -O -D2 -n 32,2,1 -p /data/hjm/nco_bm
stl_5km_00.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
489123 65.1199 nco_var_add_tll_ncra <---- optimization / MPI target!
246314 32.7933 swapn4b <---- again, byteswapping
10816 1.4400 nco_var_nrm
2456 0.3270 px_get
#####################################
/usr/local/bin/ncecat -D2 -h -O -n 32,2,1 -p /data/hjm/nco_bm
skn_lgs_00.nc /data/hjm/nco_bm/foo.nc
CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples % symbol name
561993 54.5130 px_get <------ I/O bound..?
132219 12.8252 getNCvx_float_float
66122 6.4138 nc_get_vara_float
63292 6.1393 ncio_px_get
51907 5.0350 swapn4b
46320 4.4930 getNCv_float
27221 2.6404 NC_varoffset
23341 2.2641 ncx_getn_float_float
17055 1.6543 ncio_px_rel
13584 1.3176 ncx_howmany
13558 1.3151 px_rel
13390 1.2988 odo1
--
Cheers, Harry
Harry J Mangalam - 949 856 2847 (vox; email for fax) - hjm@xxxxxxxxx
<<plain text preferred>>