Re: [netcdfgroup] Advice needed: Optimal parallel I/O using NetCDF

Howdy Constantine!

What an interesting question! ;-)

Quick answer:
1 - Use netCDF-4 with nc_create_par to enable parallel I/O.
2 - Experiment with chunksizes as they have a huge impact on performance.
3 - Consider writing checkpoint data in a netcdf-4 compound data type for
dramatic performance improvement.
4 - Consider using the Parallel I/O library from NCAR, a parallel I/O
helper for netCDF.

In more detail...

Using parallel IO is a bit challenging, partly because parallelism is much
more available for computation. You can have 500 processors, but the
hardware really is only going to allow ~16 to 32 to write to the disk
drives in parallel. That is, you cannot get a linear speedup of performance
with more processors, past some pretty low number. (In my experience - I
happily accept correction from someone who's familiar with other cases.)

So you can use netCDF-4, with parallel I/O, which can be achieved by using
nc_create_par instead of nc_create. But your 500 processors, attempting to
write all at the same time, will not achieve a speedup of 500. :-( However,
it should achieve some good speedup, and that's worth having. The code is
almost identical to the code used for netCDF sequential access, so there's
not a lot of programming effort involved.)

For additional performance you may consider using the parallelIO library (
https://github.com/NCAR/ParallelIO), a free library from NCAR that sits on
top of netCDF and provides a very similar API. It allows you to designate a
subset of processors to handle IO. The API is very similar to netCDF, but
under the covers the computational processors will send their data to the
I/O processors, which will write it to disk. This is better that writing
your own MPI code to send data to a subset of processors for writing.
Asynchronous writes are available in v1 of PIO (a fortran based library),
and I am very happy to be working with the PIO team to add asynch writes to
the V2 C-based PIO library (which also has a fortran layer).

The end result is that some small number of processors do I/O in the
background, and the rest of the processors continue doing computation for
the next timestep.

Chunksizes should match your access patterns. That is, what one processor
is writing. You don't have to worry about the continuous-ness of the data
on the disk. Chunking hides all of that for you. Just have each processor
write its data, and HDF5 will sort out where it is on disk.

If you are writing checkpoint files, you should consider using the netCDF-4
enhanced model with compound data types. Generally with checkpoint files
you are writing 50-100 variables, and putting them all in a struct provides
significant performance advantage.

Hope this was helpful and not too much blathering. ;-)

Ed


On Thu, Apr 7, 2016 at 8:23 PM, Constantine Khroulev <c.khroulev@xxxxxxxxx>
wrote:

> Dear netCDF developers and users,
>
> I am writing to ask for advice regarding setting up efficient
> NetCDF-based parallel I/O in the model I'm working on (PISM, see [1]).
>
> This is not a question of *tuning* I/O in a program: I can replace *all*
> of PISM's I/O code if necessary [2].
>
> So, the question is this: how do I use NetCDF to write 2D and 3D fields
> described below efficiently and in parallel?
>
> Here is an example of a setup I need to be able to handle: a 2640 (X
> dimension) by 4560 (Y dimension) uniform Cartesian grid [3] with 401
> vertical (Z) levels. 2D fields take ~90Mb each and 3D fields -- ~36Gb
> each.
>
> A grid like this is typically distributed (uniformly) over 512 MPI
> processes, each process getting a ~70Mb portion of a 3D field and ~200Kb
> per 2D field.
>
> During a typical model run PISM writes the full model state (one 3D
> field and a handful of 2D fields, ~38Gb total) several times
> (checkpointing plus the final output). In addition to this, the user may
> choose to write a number [4] of fields at regular intervals throughout
> the run. It is not unusual to write about 1000 records of each of these
> fields, appending to the output file.
>
> Note that PISM's horizontal (2D) grid is split into rectangular patches,
> most likely 16 patches in one direction and 32 patches in the other in a
> 512-core run. (This means that what is contiguous in memory usually is
> not contiguous in a file even when storage orders match.)
>
> Notes:
>
> - The records we are writing are too big for the NetCDF-3 file format,
>   so we have to use NetCDF-4 or PNetCDF's CDF-5 format. I would prefer
>   to use NetCDF-4 to simplify post-processing. (Before NetCDF 4.4.0 I
>   would have said "PNetCDF is not an option because most post-processing
>   tools don't support it." I'm happy to see CDF-5 support added to
>   mainstream NetCDF-4. Most post-processing tools still don't support
>   it, though.)
>
> - If possible, output files should have one unlimited dimension (time).
>   NetCDF variables should have "time" as the first dimension so PISM's
>   output would fit into the NetCDF's "classic" data model. (We try to
>   follow Climate and Forecasting (CF) conventions.)
>
> - During post-processing and analysis variables are accessed
>   one-record-at-a-time, but each record can be stored contiguously. (I
>   have no idea how to pick chunking parameters, though.)
>
> - All systems I have access to use Lustre.
>
> Thanks in advance for any input you might have!
>
> [1]: PISM stands for "Parallel Ice Sheet Model". See www.pism-docs.org
> for details.
>
> [2]: I'm hoping that you don't need the details of our existing
> implementation to see what's going on. I'm happy to provide such details
> if necessary, though.
>
> [3]: This grid covers all of Greenland with the spatial resolution of
> 600m.
>
> [4]: About 50 in a typical run; these are usually 2D fields.
>
> --
> Constantine
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit:
> http://www.unidata.ucar.edu/mailing_lists/
>