
[netCDFJava #BNA-191717]: chunking in Java



Jeff,

> Thanks for the additional info. I will be using release 4.3.21 (or later)
> regardless of what file format we ultimately end up using. You mentioned
> that 4.3.2 should improve the default chunking, but the results I sent
> were already generated with a release newer than that, so it sounds like
> I shouldn't expect any improvements in the NC4 file size at this point,
> correct?

The improvement that netCDF C version 4.3.2 made was to change the default 
chunk size for 1-dimensional record variables to DEFAULT_CHUNK_SIZE bytes,
where DEFAULT_CHUNK_SIZE is a configure-time constant with default value
4194304.  I'm surprised that using different chunk sizes made no difference
in the file size, so I may try to duplicate your results to understand how 
that happened.
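For a sense of scale, here's a back-of-the-envelope sketch in plain Java, using only the numbers quoted in this thread (the 4194304-byte default chunk size and Jeff's 8-byte records); it shows that a single default-sized chunk holds far more records than the entire 10000-record test file, and computes the observed file-size overhead:

```java
// Back-of-the-envelope arithmetic for the sizes discussed in this thread.
// DEFAULT_CHUNK_SIZE and the record size come from the messages above;
// everything else is simple division.
public class ChunkMath {
    static final long DEFAULT_CHUNK_SIZE = 4194304; // bytes (4 MiB)
    static final long RECORD_SIZE = 8;              // bytes per record

    // Records that fit in one default-sized chunk of a 1-D record variable.
    static long recordsPerChunk() {
        return DEFAULT_CHUNK_SIZE / RECORD_SIZE;
    }

    // Observed file-size overhead: bytes on disk divided by raw data bytes.
    static double overhead(long fileBytes, long rawBytes) {
        return (double) fileBytes / rawBytes;
    }

    public static void main(String[] args) {
        System.out.println("records per default chunk: " + recordsPerChunk());
        // Jeff's 10000-record test: 457852-byte file vs 80000 raw bytes.
        System.out.printf("overhead: %.1fx%n", overhead(457852, 80000));
    }
}
```

So with the defaults, one chunk could hold 524288 of these records, which is why a 10000-record file sees so much per-chunk overhead relative to its raw data.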

--Russ

> address@hidden> wrote:
> 
> > Hi Jeff,
> >
> > > From those articles, the purpose of chunking is to improve performance
> > > for large multi-dimensional data sets. It seems like it won't really
> > > provide any benefit in our situation since we only have one dimension.
> > > I know that NetCDF4 added chunking, but are all NetCDF4 files chunked,
> > > i.e., is there such a thing as a non-chunked NetCDF4 file? Or is that
> > > a contradiction in terms somehow?
> >
> > No, not all netCDF-4 files are chunked.  The simpler alternative,
> > contiguous layout, is better if you don't need the compression,
> > unlimited dimensions, or support for multiple patterns of access that
> > chunking makes possible in netCDF-4 files.
> >
> > A netCDF-4 variable can use contiguous layout if it doesn't use an
> > unlimited dimension or any sort of filter such as compression or
> > checksums.
> >
> > > Given that NetCDF4 readers are backwards-compatible with NetCDF3
> > > files, is there any reason not to use a NetCDF3 file from your
> > > perspective? My suspicion is that our requirement is just being driven
> > > by "use the latest version" rather than any technical reasons.
> >
> > I think I agree with you.  With only one unlimited dimension, and if
> > you don't need the transparent compression that netCDF-4 makes
> > possible, there's no reason not to just use the default contiguous
> > layout that a netCDF-3 format file provides.  However, you should still
> > use the netCDF-4 library; just don't specify the netCDF-4 format when
> > you create the file.  That's because the netCDF-4 software includes bug
> > fixes, performance enhancements, portability improvements, and remote
> > access capabilities not available in the old netCDF-3.6.3 version
> > software.
> >
> > The reason you were seeing a 7-fold increase in size is exactly as
> > Ethan pointed out: it's due to the way the HDF5 storage layer
> > implements unlimited dimensions, using chunking implemented with B-tree
> > data structures and indices, rather than the simpler contiguous storage
> > used in the classic netCDF format.  The recent netCDF 4.3.2 version
> > improves the default chunking for 1-dimensional variables with an
> > unlimited dimension, as in your case, so it may be sufficient to
> > provide both smaller files and the benefits of netCDF-4 chunking, but
> > without testing I can't predict how close it comes to the simpler
> > netCDF classic format in this case.  Maybe I can get time later today
> > to try it ...
> >
> > > I couldn't find anything on the NetCDF website regarding "choosing
> > > the right format for you". I was hoping there'd be something along
> > > those lines in the FAQ, but no luck.
> >
> > The FAQ section on "Formats, Data Models, and Software Releases"
> >
> >    http://www.unidata.ucar.edu/netcdf/docs/faq.html
> >
> > is intended to clarify the somewhat complex situation with multiple
> > versions of netCDF
> > data models, software, and formats, but evidently doesn't help much in
> > your case of
> > choosing whether to use the default classic netCDF format, the netCDF-4
> > classic model
> > format, or the netCDF-4 format.
> >
> > Thanks for pointing out the need for improving this section, and in
> > particular the answer
> > to the FAQ "Should I get netCDF-3 or netCDF-4?", which should really
> > address the question
> > "When should I use the netCDF classic format?".
> >
> > --Russ
> >
> > > address@hidden> wrote:
> > >
> > > > Hi Jeff,
> > > >
> > > > How chunking and compression affect file size and read/write
> > > > performance is a complex issue. I'm going to pass this along to our
> > > > chunking expert (Russ Rew) who, I believe, is back in the office on
> > > > Monday and should be able to provide you with some better advice
> > > > than I can give.
> > > >
> > > > In the meantime, here's an email he wrote in response to a
> > > > conversation on the effect of chunking on performance that might be
> > > > useful:
> > > >
> > > > http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2013/msg00498.html
> > > >
> > > > Sorry I don't have a better answer for you.
> > > >
> > > > Ethan
> > > >
> > > > Jeff Johnson wrote:
> > > > > Ethan-
> > > > >
> > > > > I made the changes you suggested with the following result:
> > > > >
> > > > > 10000 records, 8 bytes / record = 80000 bytes raw data
> > > > >
> > > > > original program (NetCDF4, no chunking): 537880 bytes (6.7x)
> > > > > file size with chunk size of 2000 = 457852 bytes (5.7x)
> > > > >
> > > > > So a little better, but still not good. I then tried different
> > > > > chunk sizes of 10000, 5000, 200, and even 1, which I would've
> > > > > thought would give me the original size, but all gave the same
> > > > > resulting file size of 457852.
> > > > >
> > > > > Finally, I tried writing more records to see if it's just a
> > > > > symptom of a small data set. With 1M records:
> > > > >
> > > > > 8MB raw data, chunk size = 2000
> > > > > 45.4MB file (5.7x)
> > > > >
> > > > > This is starting to seem like a lost cause given our small data
> > > > > records. I'm wondering if you have information I could use to go
> > > > > back to the archive group and try to convince them to use NetCDF3
> > > > > instead.
> > > > >
> > > > > jeff
> > > >
> > > >
> > > > Ticket Details
> > > > ===================
> > > > Ticket ID: BNA-191717
> > > > Department: Support netCDF
> > > > Priority: Normal
> > > > Status: Open
> > > >
> > > >
> > >
> > >
> > > --
> > > Jeff Johnson
> > > DSCOVR Ground System Development
> > > Space Weather Prediction Center
> > > address@hidden
> > > 303-497-6260
> > >
> > >
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> >
> 
> 
> --
> Jeff Johnson
> DSCOVR Ground System Development
> Space Weather Prediction Center
> address@hidden
> 303-497-6260
> 
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: BNA-191717
Department: Support netCDF
Priority: Normal
Status: Closed