Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)

To: Gerry Creager - NOAA Affiliate <gerry.creager@xxxxxxxx>
Subject: Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)
From: Rob Latham <robl@xxxxxxxxxxx>
Date: Wed, 19 Aug 2015 15:57:43 -0500



On 08/19/2015 03:55 PM, Gerry Creager - NOAA Affiliate wrote:

I'll open a case to determine if Cray's MPI-IO library has this problem.

OK. Might not be any need to do so: David Knaak told me (via off-list correspondence) that it was fixed in Cray MPI-IO much the same way I fixed it in ROMIO.


==rob

gerry

On Wed, Aug 19, 2015 at 7:47 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:



    On 08/18/2015 02:31 PM, Ward Fisher wrote:

        Hello all,

        I just wanted to jump in and comment that this issue, recently
        reported to us by David Knaak at Cray, is now handled in the
        netCDF-C development branch on GitHub. This fix will be in the
        upcoming release candidate and eventual final release of
        netCDF-C 4.4.0.

        Regarding the question of short reads providing more warning:
        netCDF specifically was already checking for short reads when
        ‘paging in’ data from a file, but was assuming an error whenever
        one occurred (due to a non-zero |errno| value). The fix
        shouldn’t incur any performance penalty. The new thing I learned
        about “short reads” is that one can occur /without/ being the
        result of an error, but rather as the result of an interrupt.


    I found these short reads would happen in ROMIO when trying to read
    2 GiB of data in one shot.  Linux would give me back (2GiB-4k) worth
    of data.
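
    The (2GiB-4k) figure is consistent with the documented Linux limit on a
    single read(): at most 0x7ffff000 bytes are transferred per call. A
    small sketch of that arithmetic follows. The helper name bytes_short is
    hypothetical, and the 4 KiB difference assumes the usual 4 KiB page
    size.

```c
#include <stdint.h>

/* Largest transfer a single Linux read()/write() performs, per read(2). */
#define LINUX_MAX_RW_COUNT UINT64_C(0x7ffff000)   /* 2,147,479,552 bytes */
#define GIB                (UINT64_C(1) << 30)

/* Hypothetical helper: how many bytes a single read() of `requested`
 * bytes would leave unread purely because of the per-call limit. */
uint64_t bytes_short(uint64_t requested)
{
    if (requested <= LINUX_MAX_RW_COUNT)
        return 0;
    return requested - LINUX_MAX_RW_COUNT;
}
```

    For a 2 GiB request, bytes_short(2 * GIB) is 4096, i.e. the missing 4k.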

    Today, most MPI-IO libraries should detect and retry this case.
    Cray's MPI-IO library is closed source, so I don't know what they do.

        In general, since they are technically allowed, I think
        developers are going to have to accommodate the possibility of
        short reads in their software, one way or another. Developers
        should already be checking the return value of |read()|, and
        when it is short, the fix is essentially:

          1. Check to see if errno is |EINTR|
          2. If so, perform some calculations and resume the read.


    While that's strictly correct, I worry about short reads that for
    whatever reason don't set errno to EINTR.  So I would check how much
    data was read.  If it is less than requested, continue the read to
    fetch the missing data.  If that continued read returns 0, then you
    are at EOF and you are done.
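
    The approach described above (keep reading until the request is
    satisfied, treat a return of 0 as EOF, and retry on EINTR) might be
    sketched as follows. This is a minimal illustration using POSIX
    read(), not the actual netCDF-C or ROMIO fix; the name read_full is
    hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `count` bytes from `fd` into `buf`,
 * continuing after short reads. Returns the number of bytes read, which
 * is less than `count` only if EOF was reached, or -1 on a real error
 * (errno is left as set by read()). */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data arrived: retry */
            return -1;      /* genuine I/O error */
        }
        if (n == 0)
            break;          /* EOF: nothing more to fetch */
        done += (size_t)n;  /* short or full read: advance and loop */
    }
    return (ssize_t)done;
}
```

    Because the loop keys off the byte count rather than errno, it also
    covers short reads that arrive with no error indication at all, such
    as the 2 GiB case mentioned earlier.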

    ==rob

    --
    Rob Latham
    Mathematics and Computer Science Division
    Argonne National Lab, IL USA






--
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)


--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
