[netcdfgroup] read performance slow compared to netCDF on other systems

Hello! We are installing netCDF 4.4.1 w/ HDF5 1.8.17 on our new Intel-based
cluster. We've noticed the read performance on this cluster using ncks is
extremely slow compared to a couple of other systems. For example, parsing a
file on our Lustre 2.1-based filesystem takes less than 8 seconds on our Cray
XK6-200m. Parsing the same file on the same filesystem on our new cluster
is taking 30+ seconds, with most of that time apparently spent reading in
the file.

Cray (hostname fish):
fish1:lforbes$ time ncks test.nc out.nc

real    0m4.804s
user    0m3.180s
sys    0m1.300s

Cluster (hostname chinook):
n0:loforbes$ time ncks mod.nc out.nc

real    0m32.435s
user    0m29.240s
sys    0m1.936s

As part of trying to figure out what's going on, I strace'ed the process on
both systems. One thing that jumps out at me is that the process running on
a compute node on our new cluster is executing a _lot_ more brk() calls to
allocate additional memory than on a login node of our Cray, at least 8
times as many in one test comparison (strace output files are available).
I'm not sure if this means anything, or how I can impact this behaviour.
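
For anyone who wants to repeat the comparison, the brk() counts can be pulled
out of saved strace logs with something like the sketch below. It assumes the
logs were written with `strace -o <file>`, one syscall per line. The helper
name count_syscall and the log file names in the comments are made up for
illustration.

```shell
# count_syscall NAME FILE
# Print how many lines of a strace log are calls to syscall NAME.
# Assumes one call per line, as written by `strace -o FILE`, optionally
# with a PID or timestamp prefix before the syscall name.
count_syscall() {
    # grep exits 1 when nothing matches; treat that as a valid count of 0,
    # but still fail on real errors such as a missing file.
    grep -c -E "(^|[[:space:]])$1\(" "$2" || [ $? -eq 1 ]
}

# Hypothetical usage, after capturing a log on each system with e.g.
#   strace -o fish.strace ncks test.nc out.nc
# count_syscall brk fish.strace
# count_syscall brk chinook.strace
```

Running `strace -c` on the same command gives a per-syscall summary directly,
which is a quicker cross-check when the full logs aren't needed.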

I've tried recompiling NetCDF on our new cluster a variety of ways,
stripping out features like szip and enabling others like MMAP, but none of
the changes have impacted the performance.

Based on what I've seen googling and reading through the mail list
archives, I've also tried using `ncks --fix_rec_dmn` to generate a new
version of the input file (which is just over 650 MB) with a limited time
dimension:

chinook01:loforbes$ ncdump -k test.nc
chinook01:loforbes$ ncdump -k mod.nc
chinook01:loforbes$ ncdump -s test.nc | head
netcdf test {
    time = UNLIMITED ; // (21 currently)
    nv = 2 ;
    x = 352 ;
    y = 608 ;
    nv4 = 4 ;
    double time(time) ;
        time:units = "seconds since 1-1-1" ;
chinook01:loforbes$ ncdump -s mod.nc | head
netcdf mod {
    time = 21 ;
    y = 608 ;
    x = 352 ;
    nv4 = 4 ;
    nv = 2 ;
    float basal_mass_balance_average(time, y, x) ;
        basal_mass_balance_average:units = "kg m-2 year-1" ;

This also didn't seem to make a difference.

Unfortunately, I am the cluster administrator, and my NetCDF knowledge is very
limited. The test file was provided by the researcher reporting this
problem. What he is experiencing is a significant application slow down due
to this issue occurring every time step when he reads/writes files. It more
than doubles the run time, making our new cluster unusable to him. I don't
think anything is necessarily "broken" with NetCDF, but I'm not sure what
further diagnostics to attempt or if there are other changes to the input
file the researcher and I should try. Any help would be appreciated. Thank
you.


