Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)

  • To: Rob Latham <robl@xxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)
  • From: Gerry Creager - NOAA Affiliate <gerry.creager@xxxxxxxx>
  • Date: Wed, 19 Aug 2015 20:55:02 +0000
I'll open a case to determine if Cray's MPI-IO library has this problem.

gerry

On Wed, Aug 19, 2015 at 7:47 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:

>
>
> On 08/18/2015 02:31 PM, Ward Fisher wrote:
>
>> Hello all,
>>
>> I just wanted to jump in and comment that this issue, recently reported
>> to us by David Knaak at Cray, is now handled in the netCDF-C development
>> branch on GitHub. This fix will be in the upcoming release candidate and
>> eventual final release of netCDF-C 4.4.0.
>>
>> Regarding the question of whether short reads should provide more
>> warning: netCDF was already checking for short reads when ‘paging in’
>> data from a file, but it assumed an error whenever one occurred (due to
>> a non-zero |errno| value). The fix shouldn’t incur any performance
>> penalty. The new thing I learned about “short reads” is that it is
>> possible for one to occur /without/ being the result of an error, but
>> rather the result of an interrupt.
>>
>
> I found these short reads would happen in ROMIO when trying to read 2 GiB
> of data in one shot.  Linux would give me back (2GiB-4k) worth of data.
>
> Today, most MPI-IO libraries should detect and retry this case.  Cray's
> MPI-IO library is closed source, so I don't know what they do.
>
>> In general, since they are technically allowed, I think developers are
>> going to have to accommodate the possibility of short reads in their
>> software, one way or another. Developers should already be checking the
>> return value of |read()|, and when short, the fix is essentially:
>>
>>  1. Check to see if errno is |EINTR|
>>  2. If so, perform some calculations and resume the read.
>>
>
> While that's strictly correct, I worry about short reads that for whatever
> reason don't set errno to EINTR.  So I would check how much data was read.
> If it is less than requested, continue the read to fetch the missing data.
> If that continued read returns 0, then you are at EOF and you are done.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit:
> http://www.unidata.ucar.edu/mailing_lists/
>
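
The approach Rob describes in the quoted message (check how much data came back, keep reading until the request is satisfied, and treat a return of 0 as EOF) can be sketched in C roughly as follows. This is only an illustration under assumptions: read_fully is a hypothetical helper name, not a function from netCDF-C, ROMIO, or any MPI-IO library, and it assumes a blocking POSIX file descriptor. It also retries on EINTR, which covers the two-step check from Ward's message.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of netCDF-C or ROMIO): keep calling
 * read() until `count` bytes have arrived, EOF is reached, or a real
 * error occurs.  Returns the number of bytes actually read, which is
 * less than `count` only at EOF, or -1 on error with errno set by read().
 */
ssize_t read_fully(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);

        if (n > 0) {
            /* Possibly a short read: record progress and ask again. */
            total += (size_t)n;
        } else if (n == 0) {
            /* A continued read that returns 0 means EOF. */
            break;
        } else if (errno == EINTR) {
            /* Interrupted before any data was transferred: retry. */
            continue;
        } else {
            /* Any other errno is treated as a genuine failure. */
            return -1;
        }
    }
    return (ssize_t)total;
}
```

Because the loop keys off the byte count rather than errno, it also handles short reads that arrive with no error indication at all, such as the (2GiB-4k) case mentioned above. A caller would compare the return value against the requested size to tell a complete read from one that ended at EOF.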



-- 
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)