
[netCDF #MAN-367636]: netCDF-4 file grows enormous when dimension set to unlimited



Meg,

> Thanks for investigating this. That makes sense that file chunks could be
> affecting the file size. Do you know if compression compresses across
> chunks? My understanding was that it didn't, but we were able to get decent
> results with compression. So I wonder if there is a combination of factors
> at work here. In any case, I'm interested to see what nccopy can do with
> resizing chunks.

Compression requires chunking, and both compression and decompression
are done one chunk at a time, so you can efficiently access a small
amount of data from a large compressed file: the library only
uncompresses the chunks that contain the desired data.
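
You can see the effect with the nccopy utility, which applies deflate
compression chunk by chunk. A minimal sketch (the file and variable
names here are placeholders, not taken from your file):

  # compress with deflate level 5; each chunk is compressed independently
  nccopy -d5 orig.nc compressed.nc

  # reading one variable decompresses only the chunks holding its data
  ncdump -v some_variable compressed.nc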

Your example file demonstrates a problem with the current defaults for
guessing good chunk shapes.  The defaults have been changed several
times to try to provide better chunk shapes for conflicting goals. The
early defaults always set the chunk size for unlimited dimensions to
1, with the idea that the most common form of access would be 1 record
at a time, where a record was defined as a cross-section of data for
one specified value of each unlimited dimension.

That default turned out to be horribly inefficient for time-series
data, where users really wanted to efficiently access data for all
times for one specified value of each fixed dimension.

As a compromise, the default chunksize for 1-dimensional variables
that use only an unlimited dimension was changed from 1 to 1048576,
wasting some space in some cases in order to improve performance for
uses such as time-series access.

That default is horribly inefficient in the case you have of a large
number of variables using an unlimited dimension that never gets very
big (just 1 in your example file).
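
One way to diagnose this in your own files is to look at the chunk
shapes the library chose, which ncdump shows as special virtual
attributes when given the -s option (the file name here is a
placeholder):

  # show the header plus hidden attributes, filtered to the chunk shapes
  ncdump -s -h file.nc | grep _ChunkSizes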

A convincing case can be made that no good default chunking strategy
works in all cases, and sometimes it's just necessary to explicitly
specify chunk sizes and shapes to reflect how the data is most
commonly accessed, or to balance multiple different ways to access the
data so that none is either optimally efficient or horribly
inefficient.
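
With nccopy, explicit chunk shapes can be specified per dimension via
its -c option. A sketch only, with hypothetical dimension names and
chunk lengths that you would replace to match your data and its most
common access patterns:

  # rechunk a netCDF-4 file with 100 values per dimension in each chunk
  nccopy -c time/100,lat/100,lon/100 in.nc out.nc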

Another way to do this is to have users specify a high-level chunking
strategy, from which good defaults can be derived automatically. That
is what Charlie Zender's NCO software does:
    http://nco.sourceforge.net/nco.html#Chunking

Charlie even named one of the chunking strategies after me (-:
because I wrote a blog post about this; just Google for "chunking
data: choosing shapes". The NCO documentation describes that chunking
strategy as:

   Chunksize Balances 1D and (N-1)-D Access to N-D Variable
   [default for netCDF4 input]

   Definition: Chunksizes are chosen so that 1-D and (N-1)-D
   hyperslabs of 3-D variables (e.g., point-timeseries or
   latitude/longitude surfaces of 3-D fields) both require
   approximately the same number of chunks. Hence their access time
   should be balanced. Russ Rew explains the motivation and
   derivation for this strategy here. cnk_map key values: ‘rew’,
   ‘cnk_rew’, ‘map_rew’.
   Mnemonic: Russ REW
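
If you have NCO installed, a command along these lines should apply
that strategy when rewriting a file (a sketch; the option value follows
the cnk_map keys quoted above, but check the documentation for your
NCO version):

  # rewrite as netCDF-4, letting NCO choose chunk shapes via the 'rew' map
  ncks -4 --cnk_map=rew in.nc out.nc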

I've appended a shell script that I hope demonstrates why the default
chunking results in a 222MB file, and a guess at what might be better
chunk shapes, assuming that record_number will only get to about
10. If you intend to store many more records in a file, you'll have to
change this appropriately.
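
The heart of that guess is an nccopy rechunking step along these lines
(a sketch, not the attached script itself, assuming record_number stays
near 10):

  # rechunk all variables along record_number into chunks of 10 records
  nccopy -c record_number/10 original.nc rechunked.nc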

> Thanks, too, for identifying the enhanced model "uint" type in the
> qualityFlags variable. We are actually trying to get the producer to use
> unsigned ints in some other cases, and they are refusing on the grounds
> that they want to use classic model. So it's curious that they made an
> exception for this variable.

Yup, sounds like a mistake on their part.

--Russ


> address@hidden> wrote:
> 
> > Hi Margaret,
> >
> > Thanks for sending the file. The netCDF-4 enhanced model feature you are
> > using is the primitive type "uint64" for one of the variables:
> >
> >   uint64 qualityFlags(report_number) ;
> >
> > The netCDF classic data model has only 8-, 16-, and 32-bit signed
> > integer types, in addition to char, float, and double types. The
> > enhanced model added signed and unsigned 64-bit types, as well as
> > other unsigned integer types.
> >
> > The other issue is chunking and small chunk sizes, for which the
> > netCDF4/HDF5 storage overhead can be large in extreme cases like
> > the one you have encountered. All of the 61 variables in your file
> > use the report_number dimension, as can be seen with
> >
> >   ncdump -h OR_EXIS-L1b-SFEU_G16_s20151772200010_e20151772200010_c20151772201050.nc | grep -v '[=:]'
> >
> > (the grep filter is to ignore all the attribute and dimension
> > declarations, to show just the variable declarations).
> >
> > An unlimited dimension must use chunking, and the default
> > chunksizes are not good in this case, resulting in lots of chunks
> > of size 1, 1x23, 1x35, and 1x4, as can be seen with
> >
> >   ncdump -s -h OR_EXIS-L1b-SFEU_G16_s20151772200010_e20151772200010_c20151772201050.nc | grep _ChunkSizes
> >
> > Each chunked variable in an HDF5 file has an associated B-tree
> > data structure used to store each individual chunk, and the B-tree
> > overhead is extreme for small chunks.
> >
> > The good news is that it's relatively easy to change the
> > chunksizes to something reasonable using nccopy. I'll send more
> > about that in a subsequent response.
> >
> > --Russ
> >
> > > Thanks for the reply, Ward. Here is the file. It is possible that some
> > > enhanced features are being used and I don't know about them. Anyway, I
> > > appreciate your taking a look.
> > >
> > > Meg
> > >
> > > address@hidden> wrote:
> > >
> > > > Hello Margaret,
> > > >
> > > > In regards to the first question, I wonder if the issue is that
> > > > the variables along this dimension have default fill values set;
> > > > if so, this may explain the explosion in file size. Would it be
> > > > possible to get a copy of the netCDF file to play around with?
> > > > Also, what version of netCDF are you working with?
> > > >
> > > > In regards to your second question: it sounds like there is
> > > > either a bug in nccopy that makes it think you are using
> > > > enhanced features, or perhaps the file is using some small part
> > > > of the enhanced model? That is just a guess; if you can provide
> > > > the original 192k file, I will figure out why nccopy is giving
> > > > this error.
> > > >
> > > > Thanks in advance, have a great day,
> > > >
> > > > -Ward
> > > >
> > > > > Hello,
> > > > >
> > > > > We have a 192K netCDF-4 file with no unlimited dimensions.
> > > > > One of its dimensions is report_number, which has dimension 1.
> > > > > When this dimension is changed to "UNLIMITED // (currently 1)"
> > > > > the file expands to 213M, so over 1000 times as large as the
> > > > > original. Do you know what could be causing this? Is there any
> > > > > way to avoid it aside from compressing the file?
> > > > >
> > > > > Second question: I contacted Unidata a while ago, and Russ
> > > > > told me that we likely had netCDF-4 files that didn't actually
> > > > > use any of the enhanced features. The data producer has
> > > > > confirmed that this is the case. I was interested in seeing if
> > > > > using the netCDF-4 classic model would reduce the file size,
> > > > > because other people here at NOAA have found netCDF-4 classic
> > > > > sizes to be on par with netCDF-3 (i.e., much less than the
> > > > > netCDF-4 enhanced model). However, when I used the nccopy
> > > > > utility Russ recommended, I got the following error:
> > > > >
> > > > > "Attempting netcdf-4 operation on strict nc3 netcdf-4 file"
> > > > >
> > > > > Is there some way to change the "strict nc3" flag, since this
> > > > > is really a netCDF-4 file?
> > > > >
> > > > > Thanks for any information.
> > > > >
> > > > > Meg Tilton
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > address@hidden> wrote:
> > > > >
> > > > > > Meg,
> > > > > >
> > > > > > > Strange! That means our files are the netCDF-4 enhanced
> > > > > > > version, so I'm surprised anyone could get ncdump -x to
> > > > > > > work on them. I guess it will remain an unsolved mystery.
> > > > > >
> > > > > > A possible explanation for the mystery is that your netCDF-4
> > > > > > files really don't use any features of the enhanced model,
> > > > > > but aren't marked as netCDF-4 classic model files. A netCDF-4
> > > > > > classic model file is just a netCDF-4 file with a special
> > > > > > scalar integer attribute named "_nc3_strict" in the root
> > > > > > group, which is tested to enforce never adding any features
> > > > > > of the enhanced netCDF-4 data model to the file, so that it
> > > > > > will always be readable using the netCDF-3 API.
> > > > > >
> > > > > > Back in version 4.1.1, I don't think ncdump tested the file
> > > > > > type; it just printed whatever it could see through the API
> > > > > > and displayed it as NcML when the "-x" flag was used. But the
> > > > > > ncdump code was never modified to present the NcML
> > > > > > representations for any of the netCDF-4 enhanced model
> > > > > > features, partly because those representations were still
> > > > > > under development when netCDF 4.1.1 was released.
> > > > > >
> > > > > > If your current netCDF files are really netCDF-4 files that
> > > > > > don't use any enhanced data model features, then you could
> > > > > > mark them as netCDF-4 classic model files using the "nccopy"
> > > > > > utility:
> > > > > >
> > > > > >   nccopy -k "netCDF-4 classic model" foo4.nc foo4c.nc
> > > > > >
> > > > > > to convert a netCDF-4 file to a netCDF-4 classic model file.
> > > > > > That would add the extra attribute (invisible through the
> > > > > > netCDF API). You could also do the same thing through the
> > > > > > HDF5 API, which would permit adding the attribute and
> > > > > > overwriting files, which nccopy doesn't permit.
> > > > > >
> > > > > > --Russ
> > > > > >
> > > > > > > address@hidden> wrote:
> > > > > > > >
> > > > > > > > Meg,
> > > > > > > >
> > > > > > > > > Thanks for your responses to my email.
> > > > > > > > >
> > > > > > > > > When I ran the ncdump -k on one of our netCDF4 files,
> > > > > > > > > the response was just "netCDF-4." Does this mean it's
> > > > > > > > > the enhanced model, and it would say "classic"
> > > > > > > > > otherwise? Or is it the other way around?
> > > > > > > >
> > > > > > > > It's the other way around. The output from ncdump -k is
> > > > > > > > one of these four strings:
> > > > > > > >
> > > > > > > >   classic
> > > > > > > >   64-bit offset
> > > > > > > >   netCDF-4
> > > > > > > >   netCDF-4 classic model
> > > > > > > >
> > > > > > > > The "-x" option works to specify NcML output for all but
> > > > > > > > the third of those format variants, "netCDF-4".
> > > > > > > >
> > > > > > > > > I will pass on the information about the java netCDF
> > > > > > > > > library to the NOAA people who run our CLASS archive.
> > > > > > > > > They were the ones who were originally asking about
> > > > > > > > > this, and they need to develop code to extract the NcML
> > > > > > > > > from netCDF4 files. So this may be very useful for them.
> > > > > > > >
> > > > > > > > --Russ
> > > > > > > >
> --
> Margaret Tilton
> Cooperative Institute for Research in Environmental Sciences (CIRES)
> at the University
> of Colorado and NOAA National Centers for Environmental Information (NCEI)
> 325 Broadway, E/GC2
> Boulder, Colorado 80305
> 303-497-6223
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: MAN-367636
Department: Support netCDF
Priority: Normal
Status: Closed

Attachment: or.sh
Description: Binary data