Hi Leon,
> Thanks for mentioning chunk sizing; that's not something I had thought
> about. I've got one unlimited dimension, and it sounds like that means an
> inefficient default chunk size
> <http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Default-Chunking.html#Default-Chunking>.
> ("For unlimited dimensions, a chunk size of one is always used." What's the
> unit? One DEFAULT_CHUNK_SIZE? Maybe it'll become clear as I read more.)
>
It means that if you have a variable with an unlimited dimension, such
as

    float var(time, lon, lat)

where time is unlimited, then the default chunks will be of shape

    1 x clon x clat

values (not bytes), for integers clon and clat computed to be smaller
than but proportional to the sizes of the lon and lat dimensions,
resulting in a default chunk size close to but less than 4 MB (so in
this case each chunk holds about 1 million 4-byte values). These
default chunks are not necessarily good for some kinds of access. A
good chunk size and shape may depend on anticipated access patterns as
well as the disk block size of the file system on which the data is
stored.
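If you'd rather choose the chunking yourself than accept the default,
you can set it when the variable is created. Here's a minimal sketch
using the netCDF4-python package (assuming that interface is convenient
for you; the dimension lengths and chunk shape are just made-up numbers
for illustration):

    # Sketch: create a variable with explicit chunk sizes instead of
    # accepting the 1 x clon x clat default. Assumes netCDF4-python;
    # the dimension lengths and chunk shape are hypothetical.
    from netCDF4 import Dataset

    ds = Dataset("example.nc", "w", format="NETCDF4")
    ds.createDimension("time", None)   # unlimited dimension
    ds.createDimension("lon", 3600)
    ds.createDimension("lat", 1800)

    # chunksizes is given in values per dimension, not bytes. Each
    # chunk here holds 10 * 360 * 180 = 648,000 floats (about 2.5 MB)
    # and spans 10 time steps, so reading a long time series touches
    # far fewer chunks than the default shape with 1 along time.
    var = ds.createVariable("var", "f4", ("time", "lon", "lat"),
                            chunksizes=(10, 360, 180))
    ds.close()

A chunk shape like that trades some efficiency in purely spatial access
for much faster access along the time dimension; which trade is right
depends on how the data will mostly be read.
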
I've started a series of blog postings about chunk shapes and sizes,
but so far only posted the first part:
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
Eventually, with feedback on these, better guidance and software
defaults for chunking may result. I'll try to post the second
installment next week.
> I guess I've got some reading ahead of me. For resources, I see the
> PowerPoint presentation
> <http://hdfeos.org/workshops/ws13/presentations/day1/HDF5-EOSXIII-Advanced-Chunking.ppt>
> that's linked to and the HDF5 page on chunking
> <http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/>. Do you have
> any other recommendations?
I liked these papers, though they get a bit technical:
Efficient Organization of Large Multidimensional Arrays
http://cs.brown.edu/courses/cs227/archives/2008/Papers/FileSystems/sarawagi94efficient.pdf
Optimal Chunking of Large Multidimensional Arrays for Data Warehousing
http://www.escholarship.org/uc/item/35201092
--Russ
> Thanks.
> -Leon
>
> On Wed, Feb 20, 2013 at 4:31 PM, Russ Rew <russ@xxxxxxxxxxxxxxxx> wrote:
> >
> > Large chunk sizes might mean a lot of extra I/O, as well as extra CPU
> > for uncompressing the same data chunks repeatedly. You might see if
> > lowering your chunk size significantly improves network usage ...
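
P.S. If you want to experiment with different chunkings on an existing
file, the nccopy utility that comes with netCDF can rewrite a file with
new chunk sizes via its -c option, which takes dimension-name/chunk-length
pairs (the names and numbers below are just an example):

    nccopy -c time/100,lon/360,lat/180 original.nc rechunked.nc

Timing your typical reads against a few such copies is a quick way to
see whether a different chunk shape actually helps.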