[netcdfgroup] Fwd: [Hdf-forum] Chunk cache size proposal

Hi All:

I am forwarding this because from time to time there have been discussions on 
this list about how best to compress and chunk data, and about some of the 
tradeoffs.
-Roy

> Begin forwarded message:
> 
> From: Elena Pourmal <epourmal@xxxxxxxxxxxx>
> Subject: Re: [Hdf-forum] Chunk cache size proposal
> Date: February 14, 2016 at 4:55:07 PM PST
> To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
> Reply-To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
> 
> Hi David and Filipe,
> 
> Chunking and compression are powerful features that boost performance and 
> save space, but, as you rightly noted, they lead to performance problems 
> when used incorrectly.
> 
> We did discuss the solution you proposed and voted against it. While it is 
> reasonable to increase the current default chunk cache size from 1 MB to ???, 
> it would be unwise for the HDF5 library to set the chunk cache size equal to 
> a dataset's chunk size. We decided to leave it to applications to determine 
> the appropriate chunk cache size and strategy (for example, using 
> H5Pset_chunk_cache instead of H5Pset_cache, or disabling the chunk cache 
> completely).
> 
> 
> Here are several reasons:
> 
> 1. A chunk size can be quite large because it worked well when the data was 
> written, but it may not work well for reading applications. With a cache 
> sized to match, an HDF5 application will use a lot of memory when working 
> with such files, especially if many files and datasets are open. We see this 
> scenario very often when users work with collections of HDF5 files (for 
> example, NPP satellite data; the attached paper discusses one of those use 
> cases).
> 
> 2. Making the chunk cache size the same as the chunk size will only solve the 
> performance problem when the data being written or read belongs to one chunk. 
> This is not usually the case. Suppose you have a row that spans several 
> chunks. When an application reads one row at a time, it will not only use a 
> lot of memory because the chunk cache is now big, but it will also hit the 
> same performance problem you described in your email: the same chunk will be 
> read and discarded many times.
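
To make point 2 concrete, here is a minimal sketch in Python of a reader that
walks a 2-D dataset one row at a time through a small chunk cache. It is a toy
model under stated assumptions, not HDF5's actual cache: the function name and
parameters are hypothetical, and eviction is plain least-recently-used, whereas
the real library uses a hashed cache with its own preemption policy.

```python
from collections import OrderedDict


def count_chunk_loads(n_rows, n_cols, chunk_rows, chunk_cols, cache_chunks):
    """Count chunk loads when reading a 2-D dataset one full row at a time.

    Toy model: the cache is a simple LRU that holds at most `cache_chunks`
    chunks. Every miss stands for one read (and decompression) of a chunk.
    """
    cache = OrderedDict()  # chunk index -> True, ordered by recency of use
    loads = 0
    chunks_per_row = -(-n_cols // chunk_cols)  # ceiling division

    for row in range(n_rows):
        chunk_r = row // chunk_rows
        # A full row touches every chunk column in this chunk row.
        for chunk_c in range(chunks_per_row):
            key = (chunk_r, chunk_c)
            if key in cache:
                cache.move_to_end(key)  # hit: mark as most recently used
            else:
                loads += 1  # miss: the chunk must be read again
                cache[key] = True
                if len(cache) > cache_chunks:
                    cache.popitem(last=False)  # evict least recently used
    return loads


# 100 x 400 dataset stored as 50 x 100 chunks, i.e. 8 chunks in total.
print(count_chunk_loads(100, 400, 50, 100, cache_chunks=1))  # 400
print(count_chunk_loads(100, 400, 50, 100, cache_chunks=4))  # 8
```

With room for only one chunk, each of the 8 chunks is loaded 50 times; once
the cache holds the 4 chunks that a row spans, each chunk is loaded once. In
this model a cache of 3 chunks does no better than a cache of 1, because the
row sweep always evicts the chunk that is needed next.
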
> 
> The way to deal with the performance problem is to adjust the access pattern 
> or to have a chunk cache that holds as many of the chunks needed by the I/O 
> operation as possible. The HDF5 library doesn’t know this a priori, which is 
> why we left it to applications. At this point we don’t see how we can help 
> other than by educating our users.
> 
> I am attaching a white paper that will be posted on our Website; see section 
> 4. Comments are highly appreciated.
> 
> Thank you!
> 
> Elena
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Elena Pourmal  The HDF Group  http://hdfgroup.org   
> 1800 So. Oak St., Suite 203, Champaign IL 61820
> 217.531.6112
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Attachment: TechNote-HDF5-ImprovingIOPerformanceCompressedDatasets.pdf
Description: Adobe PDF document

> 
> On Feb 10, 2016, at 10:27 AM, David A. Schneider <davidsch@xxxxxxxxxxxxxxxxx> 
> wrote:
> 
>> I think something like that is a great idea!
>> 
>> David
>> 
>> On 02/10/16 07:17, Filipe Maia wrote:
>>> Hi,
>>> 
>>> I think the current default chunk cache size behaviour of HDF5 is 
>>> inadequate for the type of data it is typically used with.
>>> This is especially a problem when using compression, as most users/readers 
>>> will not set a chunk cache size, which causes endless decompression calls. 
>>> I think this hinders the use of compression and other kinds of filters.
>>> 
>>> I would like to propose that when a dataset is opened its chunk cache be 
>>> set to the larger of the file chunk cache (the one set with H5Pset_cache) 
>>> and the dataset chunk size.
>>> 
>>> I think this would be beneficial for the vast majority of workloads.
>>> 
>>> Cheers,
>>> Filipe
>>> 
>>> 
>>> _______________________________________________
>>> Hdf-forum is for HDF software users discussion.
>>> Hdf-forum@xxxxxxxxxxxxxxxxxx
>>> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
>>> Twitter: https://twitter.com/hdf5

**********************
"The contents of this message do not reflect any position of the U.S. 
Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
***Note new address and phone***
110 Shaffer Road
Santa Cruz, CA 95060
Phone: (831)-420-3666
Fax: (831) 420-3980
e-mail: Roy.Mendelssohn@xxxxxxxx www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected" 
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.