Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)

Hi, Ted

cp did "rearrange" the data layout of the new file on Lustre. The striping 
configuration of the new file will inherit the settings of destination 
directory. You can use command "lfs getstripe directory/filename" to check the 
current setting of a directory or a file.
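For example (the directory and file names below are made up), comparing the 
stripe counts should show whether the copy ended up with a different layout 
than the original:

  # example paths only; substitute your own directories and files
  lfs getstripe -d /lustre/scratch/run_dir       # dir default, e.g. stripe_count: 16
  lfs getstripe /lustre/scratch/run_dir/out.nc   # file created there inherits 16
  lfs getstripe /lustre/scratch/copy_dir/out.nc  # copy in unstriped dir: default layout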

Command "mv" just changes the file's path metadata in its inode, no new data is 
created or moved physically, so the file striping configuration will not be 
changed.
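A quick way to see the difference (hypothetical paths, and assuming both 
directories live on the same Lustre file system, so "mv" is only a rename):

  lfs setstripe -c 16 striped_dir          # 16-way default layout on the directory
  cp plain_dir/big.nc striped_dir/big.nc   # cp writes new objects: copy gets 16 stripes
  mv striped_dir/big.nc plain_dir/         # same file system, so mv is only a rename
  lfs getstripe plain_dir/big.nc           # stripe_count is still 16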

One way to see whether the problem you are seeing is related to Lustre is to 
use the "nccmp" utility to compare the two files. If the contents of the 
original and the copy differ after a "cp", then it is most likely a Lustre 
problem.
http://nccmp.sourceforge.net/

PnetCDF also has a similar tool, called ncmpidiff.
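For example (the file names here are placeholders for your original and its 
copy):

  nccmp -d -m /lustre/run_dir/out.nc /lustre/copy_dir/out.nc

or, with the PnetCDF tool, which runs under MPI:

  mpiexec -n 4 ncmpidiff /lustre/run_dir/out.nc /lustre/copy_dir/out.nc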

Wei-keng

On Aug 2, 2015, at 10:46 AM, Ted Mansell wrote:

> Update: I am wondering if it is an issue with file striping on Lustre. The 
> files are written through direct pnetcdf calls with striping set to 16:
> 
> lfs setstripe -c 16 .
> 
> When I cp a CF-2 file to a higher directory (not striped), there are no 
> reading issues with the copy, even after mv'ing the copy back to the original 
> directory, so perhaps the copy rearranges the storage. (This is for reading 
> either with a serial application like ncview or with parallel reads via the 
> netcdf4 interface to pnetcdf.)
> 
> -- Ted
> 
>> Howdy,
>> 
>> I'm seeing some strange behavior in reading large files, and I'm wondering if 
>> anybody has similar experience. This is on a Lustre file system that was 
>> upgraded in June (I assume to the latest version, but I will check that). I 
>> had no previous problems. Anyway, the symptom is that I am reading 
>> reasonably large float arrays (400 x 400 x 130), and randomly some parts of 
>> the array will not be read in, leaving the initialized (zero) values. The 
>> arrays are ordered x,y,z (Fortran convention), and the failures usually show 
>> up as partial or complete x-y planes. A file might have 15 to 500 arrays of 
>> this size and one time level.
>> 
>> I am writing the files with a pnetcdf interface, and as far as I can tell, 
>> the writes are going through fine, although I can't say 100% for sure. The 
>> files are then read either through the netcdf4 interface for pnetcdf or as 
>> plain CF-2 files (e.g., with ncview). Ncview shows these failed reads as 
>> bands of zero values, for example, here is an X-Z cross-section:
>> 
>> <PastedGraphic-1.tiff>
>> 
>> where the blue section is just zeros that should not be zero. When I step to 
>> the next XZ plane and back, the values are usually filled in. I assume that 
>> ncview is re-reading the file. Since Ncview is linked to a normal serial 
>> library of netcdf4/hdf5 (4.3.2/1.8.9), so I believe that the pnetcdf libs 
>> probably do not play a role in the read problems. When the files are read 
>> back in by my cloud model, these read failures also happen. Those arrays are 
>> initialized to zero before reading, so the read failures show up as zero. 
>> The last time I ran these codes in April, everything worked fine. I'm just 
>> picking up this work again this month, and these problems appeared. The only 
>> thing I can think of that is different is the Lustre upgrade. For example, I 
>> copied a given file to my home directory, and ncview has no problems with it 
>> there.
>> 
>> So, has anybody seen I/O problems on a recent (latest, I think) version of 
>> Lustre, particularly with large arrays in 64-bit offset files? It might not 
>> be just netcdf files, but that is all I'm using.
>> 
>> Cheers,
>> - Ted Mansell
>> 
>> P.S. The files are written using Netcdf 4.3.1.1, hdf5 1.8.9, pnetcdf 1.5.0, 
>> netcdf-fortran 4.4.0.
>> 
>> __________________________________________________________
>> | Edward Mansell <ted.mansell@xxxxxxxx>
>> | National Severe Storms Laboratory
>> |--------------------------------------------------------------
>> | "The contents of this message are mine personally and
>> | do not reflect any position of the U.S. Government or NOAA."
>> |--------------------------------------------------------------
>> 
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit: 
> http://www.unidata.ucar.edu/mailing_lists/


