Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)

  • To: Rob Latham <robl@xxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)
  • From: Gerry Creager - NOAA Affiliate <gerry.creager@xxxxxxxx>
  • Date: Wed, 19 Aug 2015 20:58:31 +0000
Won't hurt to ask 'em. They can close it if it's fixed with very little
effort.

gerry

On Wed, Aug 19, 2015 at 8:57 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:

>
>
> On 08/19/2015 03:55 PM, Gerry Creager - NOAA Affiliate wrote:
>
>> I'll open a case to determine if Cray's MPI-IO library has this problem.
>>
>>
> OK.  Might not be any need to do so: David Knaak told me (via off-list
> correspondence) that it was fixed in Cray MPI-IO much the same way I fixed
> it in ROMIO.
>
> ==rob
>
> gerry
>>
>> On Wed, Aug 19, 2015 at 7:47 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:
>>
>>
>>
>>     On 08/18/2015 02:31 PM, Ward Fisher wrote:
>>
>>         Hello all,
>>
>>         I just wanted to jump in and comment that this issue, recently
>>         reported
>>         to us by David Knaak at Cray, is now handled in the netCDF-C
>>         development
>>         branch on GitHub. This fix will be in the upcoming release
>>         candidate and
>>         eventual final release of netCDF-C 4.4.0.
>>
>>         Regarding the question of short reads providing more warning:
>>         netCDF was already checking for short reads when ‘paging in’
>>         data from a file, but was assuming an error whenever one
>>         occurred (due to a non-zero |errno| value). The fix shouldn’t
>>         incur any performance penalty. The new thing I learned about
>>         “short reads” is that it is possible for one to occur /without/
>>         being the result of an error, but rather as the result of an
>>         interrupt.
>>
>>
>>     I found these short reads would happen in ROMIO when trying to read
>>     2 GiB of data in one shot.  Linux would give me back (2GiB-4k) worth
>>     of data.
>>
>>     Today, most MPI-IO libraries should detect and retry this case.
>>     Cray's MPI-IO library is closed source, so I don't know what they do.
>>
>>         In general, since they are technically allowed, I think
>>         developers are going to have to accommodate the possibility of
>>         short reads in their software, one way or another. Developers
>>         should already be checking the return value of |read()|, and
>>         when it is short, the fix is essentially:
>>
>>           1. Check to see if |errno| is |EINTR|
>>           2. If so, advance the buffer pointer and the remaining byte
>>              count past the data already read, then resume the read.
>>
>>
>>     While that's strictly correct, I worry about short reads that for
>>     whatever reason don't set errno to EINTR.  So I would check how much
>>     data was read.  If it is less than requested, continue the read to
>>     fetch the missing data.  If that continued read returns 0, then you
>>     are at EOF and you are done.
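
The retry strategy described above can be sketched in C roughly as follows. This is a minimal illustration assuming a POSIX read() on a plain file descriptor; read_fully is a hypothetical helper name, not a function from netCDF-C, ROMIO, or Cray MPI-IO, and it is not the actual fix applied in any of those libraries.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of netCDF-C or ROMIO).
 * Keeps calling read() until `count` bytes have arrived or EOF is hit.
 * Returns the total number of bytes read (less than `count` only at EOF),
 * or -1 on a genuine error, with errno left as set by read(). */
ssize_t read_fully(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n > 0) {
            total += (size_t)n;   /* full or short read: keep what we got */
        } else if (n == 0) {
            break;                /* continued read returned 0: EOF, done */
        } else if (errno == EINTR) {
            continue;             /* interrupted before any data: retry */
        } else {
            return -1;            /* real I/O error */
        }
    }
    return (ssize_t)total;
}
```

Because the loop is driven by the byte count rather than by errno, it covers both the interrupted case and a short read that sets no error at all, such as the (2GiB-4k) result mentioned earlier in the thread.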
>>
>>     ==rob
>>
>>     --
>>     Rob Latham
>>     Mathematics and Computer Science Division
>>     Argonne National Lab, IL USA
>>
>>
>>     _______________________________________________
>>     netcdfgroup mailing list
>>     netcdfgroup@xxxxxxxxxxxxxxxx <mailto:netcdfgroup@xxxxxxxxxxxxxxxx>
>>     For list information or to unsubscribe,  visit:
>>     http://www.unidata.ucar.edu/mailing_lists/
>>
>>
>>
>>
>> --
>> Gerry Creager
>> NSSL/CIMMS
>> 405.325.6371
>> ++++++++++++++++++++++
>> “Big whorls have little whorls,
>> That feed on their velocity;
>> And little whorls have lesser whorls,
>> And so on to viscosity.”
>> Lewis Fry Richardson (1881-1953)
>>
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>



-- 
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)