Dear netCDF developers and users,
I am writing to ask for advice regarding setting up efficient
NetCDF-based parallel I/O in the model I'm working on (PISM, see [1]).
This is not a question of *tuning* I/O in a program: I can replace *all*
of PISM's I/O code if necessary [2].
So, the question is this: how do I use NetCDF to write 2D and 3D fields
described below efficiently and in parallel?
Here is an example of a setup I need to be able to handle: a 2640 (X
dimension) by 4560 (Y dimension) uniform Cartesian grid [3] with 401
vertical (Z) levels. 2D fields take ~90 MB each and 3D fields ~36 GB
each.
A grid like this is typically distributed (uniformly) over 512 MPI
processes, each process getting a ~70 MB portion of a 3D field and
~200 KB per 2D field.
During a typical model run PISM writes the full model state (one 3D
field and a handful of 2D fields, ~38 GB total) several times
(checkpointing plus the final output). In addition to this, the user may
choose to write a number [4] of fields at regular intervals throughout
the run. It is not unusual to write about 1000 records of each of these
fields, appending to the output file.
Note that PISM's horizontal (2D) grid is split into rectangular patches,
most likely 16 patches in one direction and 32 patches in the other in a
512-core run. (This means that what is contiguous in memory usually is
not contiguous in a file even when storage orders match.)
Notes:
- The records we are writing are too big for the NetCDF-3 file format,
so we have to use NetCDF-4 or PNetCDF's CDF-5 format. I would prefer
to use NetCDF-4 to simplify post-processing. (Before NetCDF 4.4.0 I
would have said "PNetCDF is not an option because most post-processing
tools don't support it." I'm happy to see CDF-5 support added to
mainstream NetCDF-4. Most post-processing tools still don't support
it, though.)
- If possible, output files should have one unlimited dimension (time).
NetCDF variables should have "time" as the first dimension so that
PISM's output fits into NetCDF's "classic" data model. (We try to
follow the Climate and Forecast (CF) conventions.)
- During post-processing and analysis variables are accessed
one-record-at-a-time, but each record can be stored contiguously. (I
have no idea how to pick chunking parameters, though.)
- All systems I have access to use Lustre.
Thanks in advance for any input you might have!
[1]: PISM stands for "Parallel Ice Sheet Model". See www.pism-docs.org
for details.
[2]: I'm hoping that you don't need the details of our existing
implementation to see what's going on. I'm happy to provide such details
if necessary, though.
[3]: This grid covers all of Greenland at a spatial resolution of
600 m.
[4]: About 50 in a typical run; these are usually 2D fields.
--
Constantine