I was looking for general advice on using parallel netcdf/hdf5. I'm
working on the development of an atmospheric model at NASA.
The model of course is distributed with mpi (each process essentially
works on one section of the grid describing the world), but much of the
file IO is serial. The arrays to be written are gathered on the root
process, and the root process then does the reading/writing to the netcdf
file. In an attempt to improve the overall IO performance I've been
experimenting with parallel netcdf/hdf5, where the file is opened for
parallel access on all processes and each process reads/writes the data
in the netcdf file for the piece of the world it is working on.
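For reference, the existing serial path boils down to something like this
(simplified to an equal-size contiguous slab per rank; local_arr,
global_arr, im, jm_local, myrank, ncid_serial and varid are placeholder
names, not the real code):
call MPI_Gather(local_arr, im*jm_local, MPI_REAL,  &
                global_arr, im*jm_local, MPI_REAL, &
                0, comm, ierr)
if (myrank == 0) then
   ! only root touches the (serially opened) netcdf file
   status = nf90_put_var(ncid_serial, varid, global_arr)
end if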
Here is an outline of what I am doing in the code for the parallel
version, with a few actual code snippets:
set some mpi info ...
call MPI_Info_create(info, STATUS)
call MPI_Info_set(info, "romio_cb_read",  "enable", STATUS)
call MPI_Info_set(info, "romio_cb_write", "enable", STATUS)
! cb_buffer_size is read from the run config (16777216 = 16 MB here)
call ESMF_ConfigGetAttribute(CF, cb_buffer_size, &
     Label='cb_buffer_size:', __RC__)
call MPI_Info_set(info, "cb_buffer_size", "16777216", STATUS)
status = nf90_create("file_parallel.nc4", &
         IOR(IOR(NF90_CLOBBER,NF90_HDF5),NF90_MPIIO), &
         ncid, comm=comm, info=info)
define dimensions ...
define vars ...
set access to collective for each variable ...
status = nf90_var_par_access(ncid,varid,NF90_COLLECTIVE)
determine start and cnt for process ...
read or write ...
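Putting those last few steps together, the per-variable write ends up
looking roughly like this (a condensed sketch for one 2-D variable; the
names im_world, jm_world, i0, j0, im_local, jm_local, local_arr and the
variable "ps" are just placeholders for the real grid and field):
status = nf90_def_dim(ncid, "lon", im_world, lon_dimid)
status = nf90_def_dim(ncid, "lat", jm_world, lat_dimid)
status = nf90_def_var(ncid, "ps", NF90_FLOAT, (/lon_dimid,lat_dimid/), varid)
status = nf90_enddef(ncid)
! collective access for the variable, as above
status = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)
! each rank writes only the block of the global field it owns
start = (/ i0, j0 /)
cnt   = (/ im_local, jm_local /)
status = nf90_put_var(ncid, varid, local_arr, start=start, count=cnt)
status = nf90_close(ncid)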
Here are a few general observations.
- In general the IO does not scale with the number of processors; I'm
seeing about the same write time whether I use 1 or hundreds of mpi
tasks.
- Gathering to root and having root write (and the converse for reading)
was generally almost as fast as using parallel IO, or at worst only
about 2x slower, regardless of the number of mpi tasks.
- Setting the access of each variable to collective was crucial to write
performance. If the access was set to independent, the writing was
horribly slow, taking 10 to 20 times longer than the gather-to-root/root
write method.
- In general, playing with the buffer size had no appreciable effect on
the performance.
Does anyone have any tricks I haven't thought of, or has anyone seen the
same thing with parallel IO performance? There really aren't that many
things one can play with other than setting the MPI hints or changing
the access type for variables (collective or independent). So far I have
been using intel 11 and intel mpi 3 on the gpfs file system, but I plan
to play with this on newer intel versions, different MPI stacks, and on
lustre instead of gpfs.
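For concreteness, by "MPI hints" I mean more settings along these lines
(standard ROMIO hint names; the values are just placeholders, and the
striping ones would only apply once I move to lustre):
call MPI_Info_set(info, "cb_nodes",        "8",       STATUS)  ! number of collective-buffering aggregators
call MPI_Info_set(info, "romio_ds_write",  "disable", STATUS)  ! data sieving on writes
call MPI_Info_set(info, "striping_factor", "16",      STATUS)  ! lustre: number of OSTs to stripe over
call MPI_Info_set(info, "striping_unit",   "4194304", STATUS)  ! lustre: stripe size in bytes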
--
Ben Auer, PhD SSAI, Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-286-9176 Fax: 301-614-6246