
[netCDF #RPZ-106941]: Bug in netcdf (including 4.1.2-beta2)



Hi Joerg,

Sorry it has taken so long to fix the problem you reported.  Two other users
also reported this bug, and I finally have a solution.  Here's what I reported
to the other users earlier today, but I forgot to CC you:

> Just to keep you apprised of progress, I've checked in a fix to our svn trunk,
> consisting of a 20-line addition to the libsrc/posixio.c code. The conditions 
> for the bug appear to be pretty rare, but are more likely with larger disk 
> block sizes. Examples of the bug with small disk block sizes require 
> relatively small files and involve:
>
> - writing data to a file in nofill mode
> - writing more than one disk-block beyond the end of the file, as might
> happen in writing the last slice of a multidimensional variable before
> writing other slices
> - crossing disk-block boundaries with the region to be written
> - having the in-memory buffer in a state in which the region to be written
> corresponds to the upper half of the buffer and recently written data in
> the lower half of the buffer hasn't been flushed to disk yet.
>
> The last condition makes it difficult to give users an easy way to determine
> whether they have been a victim of this problem. I'm still struggling with
> a better description of the conditions under which it might occur, and I still
> need to understand why we can duplicate the problem for 4K disk blocks if we
> use the double-underbar function nc__create(), but not if we use the more
> common nc_create().
>
> When I have that mystery solved, I should be able to send out a netcdfgroup
> posting, and maybe create an FAQ or blog entry about the bug with more
> information than people are likely to want to read in an email posting.
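
As a rough illustration of the two file-level conditions in the list above
(starting a write more than one disk block past the end of the file, and
crossing a disk-block boundary), here is a minimal sketch in C. The function
names and parameters are hypothetical and are not part of netCDF or of the
posixio.c fix. The other two conditions (nofill mode and the state of the
in-memory buffer) cannot be observed from file offsets alone, so a true result
only means a write has the right shape to be worth a closer look.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the netCDF API. */

/* True if the write starts more than one disk block past the current EOF. */
bool starts_beyond_eof_by_more_than_one_block(int64_t file_size,
                                              int64_t offset,
                                              int64_t blksz)
{
    return offset - file_size > blksz;
}

/* True if the byte range [offset, offset + length) spans two or more blocks. */
bool crosses_block_boundary(int64_t offset, int64_t length, int64_t blksz)
{
    if (length <= 0 || blksz <= 0)
        return false;
    return offset / blksz != (offset + length - 1) / blksz;
}

/* True if a write meets both file-level conditions from the list above.
 * It says nothing about nofill mode or the library's buffer state. */
bool write_matches_reported_pattern(int64_t file_size, int64_t offset,
                                    int64_t length, int64_t blksz)
{
    if (blksz <= 0)
        return false;
    return starts_beyond_eof_by_more_than_one_block(file_size, offset, blksz)
        && crosses_block_boundary(offset, length, blksz);
}
```

For example, with 4096-byte blocks, a 4096-byte write at offset 10000 into a
1000-byte file satisfies both checks, while the same write at offset 8192 into
an empty file starts far enough past the end but stays inside a single block.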

The fix will be in release 4.1.3-beta, 4.1.3, and all subsequent releases.

I could make the 20-line patch available as well, but there is another fairly
important backward-compatibility fix Ed has put in version 4.1.3, so we want
to encourage upgrades to 4.1.3 where practical.

We hope to have a beta release of 4.1.3 by tomorrow, with the 4.1.3 release
next week, depending on whether problems are discovered in the beta release.

In case you're interested in an earlier fix to test, I've attached the 20-line 
patch for libsrc/posixio.c for version 4.1.2.

--Russ

> > I wrote:
> > > In the meantime, understanding and finding a fix for this problem will
> > > be a priority, even though it has apparently been a bug in the library
> > > for years, at least since the release of netCDF-3.6.2. I also
> > > intend to determine when the bug first appeared in a netCDF release,
> > > as that may help in fixing the problem.
> >
> > Version 3.5.1 from Feb. 17, 2004 appears to be free of the bug, but it's in
> > the next release, version 3.6.0, from Dec. 18, 2004.  We were using cvs back
> > then rather than svn for version control, but I may be able to access the
> > history of checkins and narrow down which checkin first introduced the bug ...
> 
> Actually, as a result of more careful testing, that's wrong.  Version 3.5.1
> had the bug, and as far as I can tell now, no version of netCDF has ever
> handled large block sizes, such as Lustre uses, correctly.  It looks like
> there's no shortcut to fixing this; it will require understanding the
> optimized buffering code ...
> 
> --Russ
> 
> Russ Rew                                         UCAR Unidata Program
> address@hidden                      http://www.unidata.ucar.edu
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: RPZ-106941
Department: Support netCDF
Priority: Critical
Status: Closed

Attachment: diff
Description: Binary data