Re: [netcdfgroup] slow reads in 4.4.1.1 vs 4.1.3 for some files

  • From: Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx>
  • Date: Thu, 15 Dec 2016 18:03:38 -0700
On Thu, Dec 15, 2016 at 4:46 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:

> On Thu, Dec 15, 2016 at 1:00 PM, dmh@xxxxxxxx <dmh@xxxxxxxx> wrote:
>
>> 1. Adding this feature to ncdump also requires adding
>>    it to the netcdf-c library API. But providing some means
>>    for client programs to pass thru parameter settings to the hdf5 lib
>>    seems like a good idea.
>>
>
> absolutely! that would be very helpful.
>
> -CHB
>

This may be premature.  The netcdf API already has its own chunk cache with
at least two functions to adjust tuning parameters.  It seems to me that
the netcdf facility would probably handle the current ncdump and gdal cases
nicely, though I have not tested it.  Please see this relevant
documentation:

http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html

Simon, you might want to ask your gdal maintainer to give this a try.  If
it works, it should be simple and robust.  I would suggest increasing the
per-variable chunk cache size to hold at least 5 uncompressed chunks of
qualityFlags.nc, and probably more.  5 is the number of chunks that span a
single row for this particular file.  This advice presumes that your
typical read pattern is similar to that of ncdump, which I speculate reads
across single whole rows first, as I said earlier.

  columns = 4865 ;
  rows = 3682 ;
  uint quality_flags(rows, columns) ;
    quality_flags:_ChunkSizes = 891, 1177 ;

5 x 891 x 1177 x 4 bytes per uint uncompressed ~= 21 Mbytes
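
For other grids or chunk shapes, the same arithmetic can be sketched in a
few lines of Python.  The function name row_cache_bytes is made up for
illustration; the numbers are the ones from the CDL above.

```python
import math

def row_cache_bytes(n_columns, chunk_shape, itemsize):
    """Return (chunks per row, bytes needed to hold all of them uncompressed)."""
    chunk_rows, chunk_cols = chunk_shape
    # Number of chunks that one full row of the grid touches.
    chunks_per_row = math.ceil(n_columns / chunk_cols)
    # Uncompressed size of a single chunk.
    bytes_per_chunk = chunk_rows * chunk_cols * itemsize
    return chunks_per_row, chunks_per_row * bytes_per_chunk

# quality_flags: 4865 columns, chunks of 891 x 1177, 4 bytes per uint
chunks, nbytes = row_cache_bytes(4865, (891, 1177), 4)
print(chunks, nbytes)
```

This prints 5 and 20974140, i.e. the ~21 Mbytes quoted above.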

Note this is likely to be a little larger than the default cache size in
the current netcdf-C library, thus explaining some of the slow read
behavior.

You might also consider rechunking such data sets to a smaller chunk size.
Both nccopy and ncks can do that.  The best chunk shape depends on your
anticipated spatial read patterns, so give that a little thought.

You might also consider reading the entire grid in a single get_vara call
to the netcdf API.  That is what my fast Fortran test program did.  A naive
reader that, for example, loops over single rows may cause repeated cache
misses that could be avoided.
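
To illustrate why the row loop can hurt, here is a toy model that counts
chunk loads under a plain LRU cache.  This is only a simplified assumption
for illustration; the real HDF5 chunk cache uses a different replacement
scheme, and the function name chunk_loads is made up.

```python
from collections import OrderedDict

def chunk_loads(read_rows, n_columns, chunk_shape, cache_chunks):
    """Count chunk loads for full-width row-range reads under a simple LRU cache.

    read_rows is a list of (row_start, row_stop) requests, stop exclusive.
    cache_chunks is how many whole chunks the cache can hold.
    """
    chunk_rows, chunk_cols = chunk_shape
    n_chunk_cols = -(-n_columns // chunk_cols)  # ceiling division
    cache = OrderedDict()
    loads = 0
    for start, stop in read_rows:
        for cr in range(start // chunk_rows, (stop - 1) // chunk_rows + 1):
            for cc in range(n_chunk_cols):
                key = (cr, cc)
                if key in cache:
                    cache.move_to_end(key)         # hit: mark as recently used
                else:
                    loads += 1                     # miss: chunk must be loaded
                    cache[key] = True
                    if len(cache) > cache_chunks:
                        cache.popitem(last=False)  # evict least recently used
    return loads

ROWS, COLS, CHUNKS = 3682, 4865, (891, 1177)
row_by_row = [(r, r + 1) for r in range(ROWS)]
whole_grid = [(0, ROWS)]

print(chunk_loads(row_by_row, COLS, CHUNKS, cache_chunks=4))
print(chunk_loads(row_by_row, COLS, CHUNKS, cache_chunks=5))
print(chunk_loads(whole_grid, COLS, CHUNKS, cache_chunks=1))
```

In this model the row loop with room for only 4 chunks loads a chunk 18410
times, while either a 5-chunk cache or a single whole-grid read loads each
of the 25 chunks exactly once.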

--Dave