[netcdfgroup] Advice on parallel netcdf

I was looking for general advice on using parallel netcdf/hdf5. I'm working on the development of an atmospheric model at NASA. The model is of course distributed with MPI (each process essentially works on one section of the grid describing the world), but much of the file IO is serial: the arrays to be written are gathered on the root process, and the root process then does the reading/writing to the netcdf file. In an attempt to improve the overall IO performance I've been experimenting with parallel netcdf/hdf5, where the file is opened for parallel access on all processes and each process reads/writes the data for the piece of the world it is working on directly from/to the netcdf file. Here is an outline of what I am doing in the code, with a few actual code snippets:

set some mpi info ...

call MPI_Info_create(info,STATUS)
call MPI_Info_set(info,"romio_cb_read", "enable" ,STATUS)
call MPI_Info_set(info,"romio_cb_write", "enable" ,STATUS)
call ESMF_ConfigGetAttribute(CF, cb_buffer_size, Label='cb_buffer_size:', __RC__)
call MPI_Info_set(info,"cb_buffer_size", "16777216" ,STATUS)

status = nf90_create("file_parallel.nc4",IOR(IOR(NF90_CLOBBER,NF90_HDF5),NF90_MPIIO),ncid,comm=comm,info=info)

define dimensions ...
define vars ...
set access to collective for each variable ...

status = nf90_var_par_access(ncid,varid,NF90_COLLECTIVE)

determine start and cnt for process ...
read or write ...
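
To make the last two steps concrete, the write looks roughly like this (i_start, j_start, im_local, and jm_local are just placeholders for however the process's offset and extent in the global grid are stored):

use netcdf
integer :: ncid, varid, status
integer, dimension(3) :: start, cnt
real, allocatable :: local_field(:,:)   ! this process's patch of the global grid

! each process writes only the start/count window covering its own patch
start = (/ i_start,  j_start,  1 /)
cnt   = (/ im_local, jm_local, 1 /)

! collective write of just this process's piece of the variable
status = nf90_put_var(ncid, varid, local_field, start=start, count=cnt)

status = nf90_close(ncid)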


Here are a few general observations.
- In general the IO does not scale with the number of processors; I'm seeing about the same write time for 1 MPI task as for hundreds.

- Gathering to root and having root write (and the converse for reading) was generally almost as fast as using parallel IO, or at worst about 2x slower, regardless of the number of MPI tasks.

- Setting the access of each variable to collective was crucial to write performance. If the access was set to independent, the writing was horribly slow, taking 10 to 20 times longer than the gather-to-root/root-write method.

- In general, playing with the buffer size had no appreciable effect on performance.

Does anyone have any tricks I haven't thought of, or has anyone seen the same thing with parallel IO performance? There really aren't that many things one can play with other than setting MPI hints or changing the access type for variables (collective or independent). So far I have been using Intel 11 and Intel MPI 3 on a GPFS file system, but I plan to try this with newer Intel versions, different MPI stacks, and on Lustre instead of GPFS.
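
For reference, the only other knobs I'm aware of are additional MPI-IO hints set the same way as above. A few that could be worth experimenting with (the values below are just placeholder guesses, whether a given hint is honored depends on the MPI implementation and file system, and the striping hints only apply on Lustre at file-creation time):

call MPI_Info_set(info, "cb_nodes",        "8",       STATUS)  ! number of collective-buffering aggregators
call MPI_Info_set(info, "romio_ds_read",   "disable", STATUS)  ! ROMIO data sieving for reads
call MPI_Info_set(info, "romio_ds_write",  "disable", STATUS)  ! ROMIO data sieving for writes
call MPI_Info_set(info, "striping_factor", "16",      STATUS)  ! Lustre: number of OSTs to stripe over
call MPI_Info_set(info, "striping_unit",   "4194304", STATUS)  ! Lustre: stripe size in bytes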

--
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246


