Hi all,
We're having some issues with unlimited dimensions and chunking. First, a
few notes:
I'm using the netCDF4 Python wrappers and seeing different symptoms on
Windows and Mac, so this could be an issue in the Python wrappers, the
netCDF library, the HDF library, or how one of those is built...
If I try to use an unlimited dimension and do NOT specify any chunking, I
get odd results:
On Windows:
It takes many times longer to run, and produces a file that is 6 times as
big.
On OS-X:
The mac crashes if I try to use an unlimited dimension and not specify
chunking.
This page:
http://www.unidata.ucar.edu/software/netcdf/docs/default_chunking_4_0_1.html
does indicate that the default is a chunksize of 1, which seems insanely
small to me, but should at least work. Note: does setting a chunksize of 1
mean that HDF will really use chunks that small? From perusing the
HDF docs, it seems it needs to build up a tree structure to
store where all the chunks are, and there are performance implications to a
large tree, so a chunksize of 1 guarantees a really big tree. Wouldn't a
small value that is still far from 1 make some sense? Like 1k or something?
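To put rough numbers on the tree-size concern, here is a minimal sketch that counts how many chunks would have to be indexed for a 1-d variable at a few chunk sizes. The 8-byte element size and the helper name n_chunks are my own assumptions for illustration, and this only counts chunks; it says nothing about how HDF actually lays out its index.

```python
def n_chunks(n_elements, chunk_elems):
    """Number of chunks needed to hold n_elements along one dimension."""
    # Integer ceiling division: a partly filled last chunk still counts.
    return (n_elements + chunk_elems - 1) // chunk_elems


# Assumption for illustration: 8 MB of 8-byte values in a 1-d variable.
n_elements = 8 * 1024 * 1024 // 8  # 1,048,576 values

for chunk_elems in (1, 128, 1024, 16 * 1024):
    count = n_chunks(n_elements, chunk_elems)
    print(f"chunksize {chunk_elems:>6}: {count:>8} chunks to index")
```

With a chunksize of 1 there is one chunk per value, so the index has to track about a million entries for this array, versus about a thousand at a chunksize of 1k.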
In my experiments with a simple 1-d array with an unlimited dimension,
writing a MB at a time, dropping the chunksize below about 512MB started to
affect write performance.
Very small chunks really made it crawl.
And explicitly setting size-1 chunks made it crash (on OS-X with a malloc
error). So I think that explains my problem.
With smaller data sets, it works, but runs really slowly: with an 8MB
dataset, going from a chunksize of 1 to a chunksize of 128 reduced write
time from 10 seconds to 0.1 second.
Increasing to 16k reduces it to about 0.03 seconds, and going larger than
that makes no noticeable difference.
So I think I know why I'm having problems with unspecified
chunksizes, and a chunksize of 1 probably shouldn't be the default!
However, if you specify a chunksize, HDF does seem to allocate at least one
full chunk in the file. That makes sense, so you wouldn't want to store a
very small variable with a large chunk size, but I suspect:
1) if you are using an unlimited dimension, you are unlikely to be storing
VERY small arrays.
2) netcdf4 seems to have about 8k of overhead anyway.
So a 1k or so sized default seems reasonable.
One last note:
From experimenting, it appears that you set chunksizes in numbers
of elements rather than numbers of bytes. Is that the case? I haven't been
able to find it documented anywhere.
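For reference, this is roughly how I am setting the chunk size, as a minimal sketch rather than my actual test script. It assumes the chunksizes argument really is in elements; the file path, the variable name "data", and the helpers chunk_nbytes and write_chunked are made up for illustration.

```python
def chunk_nbytes(chunk_elems, itemsize):
    """Bytes per chunk, ASSUMING chunk sizes are given in elements."""
    return chunk_elems * itemsize


def write_chunked(path, chunk_elems=1024, n_values=8192):
    """Write a 1-d float64 variable along an unlimited dimension."""
    # Imported here so chunk_nbytes() can be used without netCDF4 installed.
    import numpy as np
    from netCDF4 import Dataset

    with Dataset(path, "w", format="NETCDF4") as ds:
        ds.createDimension("time", None)  # None makes it unlimited
        var = ds.createVariable("data", "f8", ("time",),
                                chunksizes=(chunk_elems,))
        # Assigning to a slice grows the unlimited dimension as needed.
        var[0:n_values] = np.arange(n_values, dtype="f8")
        # Chunk shape as reported back by the library.
        return var.chunking()
```

Calling write_chunked("chunk_test.nc") and comparing the returned chunk shape with what was requested is one way to check the units; under the elements assumption, chunk_nbytes(1024, 8) gives 8192 bytes per chunk for float64 data.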
Thanks,
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@xxxxxxxx