Re: [netcdfgroup] unlimited dimensions and chunking??

Chris,

FWIW, here is a chunk size recipe that works well for some rather large
gridded files that I work with.  This is optimized for writing sequentially
along the time dimension, and reading whole lat/lon grids sequentially or
randomly.  Selected metadata:

ncdump -hst uwnd.1979-2012.nc ...

    level = 37 ;
    lat = 256 ;
    lon = 512 ;
    time = UNLIMITED ; // (49676 currently)

    int level(level) ;
        level:_Storage = "contiguous" ;
    float lat(lat) ;
        lat:_Storage = "contiguous" ;
    float lon(lon) ;
        lon:_Storage = "contiguous" ;
    float time(time) ;
        time:_Storage = "chunked" ;
        time:_ChunkSizes = 16384 ;
    float uwnd(time, level, lat, lon) ;
        uwnd:_Storage = "chunked" ;
        uwnd:_ChunkSizes = 1, 1, 256, 512 ;

This scheme depends on good chunk caching, with adequate buffers for both
reading and writing. I think it is a good idea to design chunking on a
per-variable basis, not per-dimension. Think of chunks as small hyperslabs,
not dimension steps.

Note in particular the successful use of two very different chunk sizes
along the unlimited time dimension in two different variables.
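
In netCDF4-python terms, a layout like this can be declared at creation time
roughly as follows (a sketch only -- the file name is made up; the dimension
sizes and chunk shapes are the ones from the metadata above):

    from netCDF4 import Dataset

    nc = Dataset("uwnd.example.nc", "w")      # made-up file name
    nc.createDimension("level", 37)
    nc.createDimension("lat", 256)
    nc.createDimension("lon", 512)
    nc.createDimension("time", None)          # unlimited

    # Coordinate variables: the fixed dimensions are left contiguous;
    # time is chunked on its own along the unlimited dimension.
    nc.createVariable("level", "i4", ("level",))
    nc.createVariable("lat", "f4", ("lat",))
    nc.createVariable("lon", "f4", ("lon",))
    nc.createVariable("time", "f4", ("time",), chunksizes=(16384,))

    # Data variable: each chunk is one whole lat/lon grid at a single
    # time step and level -- a small hyperslab, not a dimension step.
    nc.createVariable("uwnd", "f4", ("time", "level", "lat", "lon"),
                      chunksizes=(1, 1, 256, 512))
    nc.close()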

I do not have answers for your specific questions right now; hopefully
someone else will respond.

--Dave

On Fri, Dec 27, 2013 at 2:15 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:

> Hi all,
>
> We're having some issues with unlimited dimensions and chunking. First, a
> few notes:
>
> I'm using the netCDF4 python wrappers, and I'm seeing different symptoms on
> Windows and Mac, so the issue could be in the py wrappers, the netcdf lib,
> the hdf lib, or how one of those is built...
>
> If I try to use an unlimited dimension and do NOT specify any chunking, I
> get odd results:
>
> On Windows:
>   It takes many times longer to run, and produces a file that is 6 times
> as big.
>
> On OS-X:
>   The Mac crashes if I try to use an unlimited dimension and do not specify
> chunking.
>
> This page:
>
>
> http://www.unidata.ucar.edu/software/netcdf/docs/default_chunking_4_0_1.html
>
> does indicate that the default is a chunksize of 1, which seems insanely
> small to me, but should at least work. Note: does setting a chunksize of 1
> mean that HDF will really use chunks that small? Perusing the HDF docs, it
> seems it needs to build up a tree structure to store where all the chunks
> are, and there are performance implications to a large tree -- so a
> chunksize of 1 guarantees a really big tree. Wouldn't a small, but far from
> 1, value make some sense? Like 1k or something?
>
> In my experiments with a simple 1-d array with an unlimited dimension,
> writing a MB at a time, dropping the chunksize below about 512MB started to
> affect write performance.
>
> Very small chunks really made it crawl.
>
> And explicitly setting size-1 chunks made it crash (on OS-X with a malloc
> error). So I think that explains my problem.
>
> With smaller data sets, it works, but runs really slowly -- with an 8MB
> dataset, going from a chunksize of 1 to a chunksize of 128 reduced write
> time from 10 seconds to 0.1 second.
>
> Increasing to 16k reduces it to about 0.03 seconds -- larger than that
> makes no noticeable difference.
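>
> Something along these lines is what I was timing -- a rough sketch, not the
> exact script; the file name, variable names, and exact sizes here are
> illustrative:
>
>     import time
>     import numpy as np
>     from netCDF4 import Dataset
>
>     chunk = 128                      # elements per chunk; the value I varied
>     nc = Dataset("chunk_test.nc", "w")
>     nc.createDimension("x", None)    # unlimited dimension
>     v = nc.createVariable("data", "f8", ("x",), chunksizes=(chunk,))
>
>     block = np.zeros(2 ** 17)        # 1 MB of float64 per write
>     start = time.time()
>     for i in range(8):               # 8 MB total
>         v[i * block.size:(i + 1) * block.size] = block
>     nc.close()
>     elapsed = time.time() - start
>     print("8 MB written in %.2f s, chunksize %d" % (elapsed, chunk))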
>
> So I think I know why I'm having problems with unspecified
> chunksizes, and a chunksize of 1 probably shouldn't be the default!
>
> However, if you specify a chunksize, HDF does seem to allocate at least
> one full chunk in the file -- which makes sense, so you wouldn't want to
> store a very small variable with a large chunk size, but I suspect:
>
> 1) if you are using an unlimited dimension, you are unlikely to be storing
> VERY small arrays.
>
> 2) netcdf4 seems to have about 8k of overhead anyway.
>
> So a default of 1k or so seems reasonable.
>
> One last note:
>
> From experimenting, it appears that you set chunksizes in numbers
> of elements rather than numbers of bytes. Is that the case? I haven't been
> able to find it documented anywhere.
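>
> For example (illustrative only -- the names here are made up), I'd read
>
>     from netCDF4 import Dataset
>
>     nc = Dataset("elements_vs_bytes.nc", "w")
>     nc.createDimension("t", None)
>     # If chunksizes counts elements, each chunk of this float32
>     # variable would be 1024 * 4 = 4096 bytes on disk.
>     v = nc.createVariable("v", "f4", ("t",), chunksizes=(1024,))
>     nc.close()
>
> as asking for 4 KB chunks, not 1 KB chunks.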
>
> Thanks,
>    -Chris
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker@xxxxxxxx
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit:
> http://www.unidata.ucar.edu/mailing_lists/
>