oprofiling nco & netcdf - some results.

To: dods@xxxxxxxxxxxxxxxx, netcdfgroup@xxxxxxxxxxxxxxxx
Subject: oprofiling nco & netcdf - some results.
From: Harry Mangalam <hjm@xxxxxxxxx>
Date: Tue, 16 Aug 2005 13:53:22 -0700
nco & netcdf aficionados.

In preparing to optimize nco via MPI, I've been doing some profiling of the 
code to see where it's bogging down.

Here are the benchmark-like runs done with oprofile, a free, very cool, 
easy-to-install, easy-to-use, stochastic profiler available for a variety of 
platforms (oprofile.sf.net)

The first few have been done 2x to show the kind of reproducibility between 
runs (they're not identical b/c oprofile time-samples instead of using 
hardware counters or synthetic CPUS (like valgrind) - the upside is that 
there's no need to specially instrument the code (besides adding the -g CFLAG 
for compilation) and (unlike valgrind), the apps run at close to normal 
speed; the downside is that the process will miss a few counts).

I've truncated most of the outputs, except the second below to show the entire 
range of counts.  Some comments <---- are inserted.  This approach has 
pointed out some targets for us to optimize, but it also points out some 
problems in general with aspects of netcdf files (some of which has probably 
been gone over in depth before).  Most troubling for nco is the byteswapping 
bottleneck (swapn4b at netcdf-3.6.0/src/libsrc/ncx.c:119), but there are a 
number of similar byeswap fn()s.  Some of the nco's spend >95% of the whole 
execution time byteswapping and many common operations spend >20% 
byteswapping.  

Charlie may have mentioned this in the past - I'd like to bring it up again.  
Would it be possible to allow pass-thru of those bytes that are not being 
changed (it seems silly to byteswap twice for no reason).
Or, since a file read once and written on one architecture is most likely to 
be read and written again on the same architecture, can the byte order be 
written native for that arch with the order noted in the attributes?  Does 
netcdf have any mechanisms for allowing this approach currently?

One reason for revisiting this issue is that Charlie was buttonholed by a guy 
who was in charge of postprocessing IPCC files and the was pretty perturbed 
by the amount of time it took to do it.  His requirements may not have been 
directly related, but even 30% speedup is significant for petabyte data 
sizes.

I'm not sure yet how much time opendap spends doing this, but it wouldn't 
surprise me if it's similar.


Methology:
After starting the oprofile daemon:
 'sudo opcontrol --no-vmlinux --start' 
to start the profiler, I ran:
 'sudo opcontrol --reset' 
between each run to clear the profile counters to prevent previous runs from 
contaminating successive runs.  Because of the way it works, it can 
accurately sample very long-running processes and your results will be 
confounded (as mine were) if different apps call the same functions or even 
(possibly) different functions in the same library.  So make sure to reset 
oprofile before running another test.

In each stanza below, the commandline shown was run after resetting oprofile 
and the resulting report was truncated in most instances at the 1-2% level.

/usr/local/bin/ncap -h -O -s  "nu_var1 
[time,lat,lon,lev]=d4_01*d4_02*(d4_03**2)-(d4_05/d4_06)"   -s "nu_var2 
[lat,time,lev,lon]=(d4_13/d4_02)*((d4_03**2)-(d4_05/d4_06))" -s "nu_var3 
[time,lat,lon]=(d3_08*3d_01)-(3d_05**3)-(3d_11*3d_16)" -s "nu_var4 
[lon,lat,time]=(d3_08+3d_01)-(3d_05*3)-3d_11-17.33)"  -p /data/hjm/nco_bm 
ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc

(2 runs)

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
44010    96.8466  swapn4b      <- 96% of time spent byteswapping!!
688       1.5140  px_get

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
44973    96.6995  swapn4b
699       1.5030  px_get
211       0.4537  ncio_px_get
115       0.2473  px_pgin
90        0.1935  getNCvx_float_float
90        0.1935  putNCvx_float_float
84        0.1806  ncx_howmany
54        0.1161  ncx_putn_float_float
47        0.1011  px_pgout
32        0.0688  ncx_getn_float_float
25        0.0538  ncio_px_rel
14        0.0301  anonymous symbol from section .plt
11        0.0237  px_rel
7         0.0151  NC_findvar
6         0.0129  nc_put_vara_float
6         0.0129  nco_put_vara
5         0.0108  NC_varoffset
4         0.0086  NCcoordck
4         0.0086  nc_get_vara_float
3         0.0065  NC_check_id
3         0.0065  getNCv_float
2         0.0043  NCvnrecs
2         0.0043  nco_var_val_cpy
2         0.0043  putNCv_float
2         0.0043  yyparse
1         0.0022  CreateNode
1         0.0022  NC_check_vlen
1         0.0022  NC_endef
1         0.0022  NC_lookupvar
1         0.0022  NCedgeck
1         0.0022  nc__create_mp
1         0.0022  nc_inq_att
1         0.0022  nco_free
1         0.0022  nco_get_vara
1         0.0022  nco_inq_dimid_flg
1         0.0022  nco_mss_val_get
1         0.0022  nco_typ_lng
1         0.0022  nco_var_fll
1         0.0022  new_NC
1         0.0022  odo1
1         0.0022  set_upper
1         0.0022  var_dfl_set


#####################################


/usr/local/bin/ncrcat -h -O  -n 4,2,1 -p /data/hjm/nco_bm 
skn_lgs_00.nc /data/hjm/nco_bm/foo.nc  

(2 runs for 4 files each)

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
162908   30.7513  px_get      <--- looks like I/O bound..?
43256     8.1652  main
26857     5.0697  getNCvx_float_float
26640     5.0287  nco_var_get
18739     3.5373  nc_get_var1_float
18640     3.5186  ncio_px_get
16911     3.1922  nc_put_var1_float
16301     3.0771  NC_findattr
16017     3.0235  putNCvx_float_float
15633     2.9510  NCcoordck
13847     2.6138  NC_lookupattr
13791     2.6033  nco_pck_dsk_inq
13701     2.5863  getNCv_float
12746     2.4060  nco_get_var1
12226     2.3078  nco_inq_att_flg
9992      1.8861  NC_check_id
7214      1.3618  swapn4b

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
156526   30.6816  px_get
41796     8.1927  main
26904     5.2736  getNCvx_float_float
25713     5.0402  nco_var_get
18392     3.6051  nc_get_var1_float
18277     3.5826  ncio_px_get
16134     3.1625  nc_put_var1_float
15173     2.9742  NC_findattr
14905     2.9216  NCcoordck
14595     2.8609  putNCvx_float_float
13830     2.7109  getNCv_float
13303     2.6076  nco_pck_dsk_inq
12602     2.4702  nco_get_var1
12523     2.4547  NC_lookupattr
11285     2.2120  nco_inq_att_flg


#####################################


/usr/local/bin/ncrcat -h -O  -n 30,2,1 -p /data/hjm/nco_bm 
skn_lgs_00.nc /data/hjm/nco_bm/foo.nc  

(2 runs for 30 files each to see if there was any change in distribution (not 
really))

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1188898  30.3816  px_get       <--- I/O bound..?
329096    8.4098  main
194113    4.9604  getNCvx_float_float
193716    4.9503  nco_var_get
139195    3.5570  nc_get_var1_float
138361    3.5357  ncio_px_get
130332    3.3306  nc_put_var1_float
120096    3.0690  NC_findattr
117864    3.0119  putNCvx_float_float
115390    2.9487  NCcoordck
100581    2.5703  nco_pck_dsk_inq
100496    2.5681  NC_lookupattr
97406     2.4892  nco_get_var1
96558     2.4675  getNCv_float
85875     2.1945  nco_inq_att_flg
77161     1.9718  NC_check_id
52039     1.3298  swapn4b

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1143662  29.1803  px_get
330483    8.4322  main
209255    5.3391  getNCvx_float_float
202318    5.1621  nco_var_get
141829    3.6187  nc_get_var1_float
134400    3.4292  ncio_px_get
133916    3.4168  nc_put_var1_float
120232    3.0677  NCcoordck
117733    3.0039  NC_findattr
114406    2.9190  putNCvx_float_float
106404    2.7149  getNCv_float
102715    2.6208  nco_pck_dsk_inq
102116    2.6055  nco_get_var1
99158     2.5300  NC_lookupattr
90043     2.2974  nco_inq_att_flg
79626     2.0316  NC_check_id
52848     1.3484  swapn4b


#####################################


/usr/local/bin/ncra -D2 -h -O  -n 4,2,1 -p /data/hjm/nco_bm 
ipcc_dly_T85_00.nc  /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
227839   65.9413  nco_var_add_tll_ncra <---- optimization / MPI target!
109567   31.7109  swapn4b              <---- again, byteswapping 
5253      1.5203  nco_var_nrm
1200      0.3473  px_get


#####################################


/usr/local/bin/ncpdq  -D2 -h -O  -a 'lon,time,lev,lat' -p /data/hjm/nco_bm  
ipcc_dly_T85.nc  /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1736463  96.9924  nco_var_dmn_rdr_val  <---- optimization / MPI target!
49459     2.7626  swapn4b
2915      0.1628  anonymous symbol from section .plt
659       0.0368  px_get



/usr/local/bin/ncflint  -D2 -h -O   -w '0.5' -p /data/hjm/nco_bm 
ipcc_dly_T85_00.nc  ipcc_dly_T85_01.nc  /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
115001   29.4886  nco_var_mlt
113175   29.0204  nco_var_cnf_typ
82781    21.2268  nco_var_add_tll_ncflint
75683    19.4067  swapn4b               <---- again, byteswapping 
969       0.2485  px_get


#####################################


/usr/local/bin/ncwa  -D2 -h -O  -y sqrt -a 
lat,lon /data/hjm/nco_bm/ipcc_dly_T85.nc /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1535141  95.8639  nco_var_avg            <---- optimization / MPI target!
37378     2.3341  nco_var_avg_reduce_ttl
26714     1.6682  swapn4b

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1539548  95.8061  nco_var_avg
38195     2.3769  nco_var_avg_reduce_ttl
26820     1.6690  swapn4b

[with a different, smaller file, relatively more time is spent byteswapping]
/usr/local/bin/ncwa -D2  -h -O -y sqrt -a lat,lon   
~/data/nco_bm/sml_stl_28.nc ~/data/nco_bm/sml_stl_out.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
47241    89.9263  nco_var_avg
2607      4.9626  swapn4b
2450      4.6637  nco_var_avg_reduce_ttl


#####################################


#####################################


/usr/local/bin/ncbo  -D2 -h -O  --op_typ='-' -p /data/hjm/nco_bm 
ipcc_dly_T85.nc  ipcc_dly_T85_00.nc /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
74162    69.2249  swapn4b                <---- again, byteswapping 
30484    28.4546  nco_var_sbt
1045      0.9754  px_get


#####################################


/usr/local/bin/ncea -h -O -D2 -n 32,2,1 -p /data/hjm/nco_bm 
stl_5km_00.nc /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
489123   65.1199  nco_var_add_tll_ncra    <---- optimization / MPI target!
246314   32.7933  swapn4b                 <---- again, byteswapping 
10816     1.4400  nco_var_nrm
2456      0.3270  px_get


#####################################



/usr/local/bin/ncecat -D2 -h -O  -n 32,2,1 -p /data/hjm/nco_bm 
skn_lgs_00.nc /data/hjm/nco_bm/foo.nc

CPU: AMD64 processors, speed 1804.13 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
561993   54.5130  px_get               <------ I/O bound..?
132219   12.8252  getNCvx_float_float
66122     6.4138  nc_get_vara_float
63292     6.1393  ncio_px_get
51907     5.0350  swapn4b
46320     4.4930  getNCv_float
27221     2.6404  NC_varoffset
23341     2.2641  ncx_getn_float_float
17055     1.6543  ncio_px_rel
13584     1.3176  ncx_howmany
13558     1.3151  px_rel
13390     1.2988  odo1

-- 
Cheers, Harry
Harry J Mangalam - 949 856 2847 (vox; email for fax) - hjm@xxxxxxxxx 
            <<plain text preferred>>
2005 messages navigation, sorted by:
1. Thread
2. Subject
3. Author
4. Date
5. ↑ Table Of Contents
Search the netcdfgroup archives: