On 04/17/2014 03:22 AM, Alexis Praga wrote:
> Hi,
> I have some questions about parallel netCDF4 (using HDF5, not PnetCDF).
> I think it's best to just ask them, so please excuse the long list:
> 1) What is its strategy for parallel I/O?
I'm not entirely sure what you're asking here. Most parallel I/O
libraries carry out I/O to different regions of the file simultaneously
(in parallel), and thereby extract more aggregate performance out of the
storage system.
For any application using any I/O library, the trickiest part is
deciding how to decompose your domain over N parallel processes and how
to describe that decomposition to the library.
> 2) How is it related to HDF5? Is it just a wrapper around it?
In one sense, yes. In order to adopt HDF5 as one possible backend,
though, the Unidata netCDF folks designed a dispatch system so one might
write via the classic netCDF interface, via the Argonne-Northwestern
Parallel-NetCDF interface, via HDF5, or via DAP.
> 3) When writing a netCDF4 file, is it really netCDF or is it HDF5?
> ncdump -k returns "netCDF4" but I am not sure.
The new file format is an HDF5 file that can be examined with the broad
ecosystem of HDF5 utilities. This HDF5 file, though, has a particular
schema or layout indicating that it is the netCDF-4 kind of HDF5 file.
> 4) Is there some documentation online? I only found that:
> http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-tutorial/Parallel.html
> which is very light.
> 5) Any references (paper or benchmarks) are welcomed. At the moment, I only
> found the paper by Li et al. (2003) about PnetCDF.
In strict performance terms -- which in the end is not really the be-all
and end-all -- Argonne-Northwestern Parallel-NetCDF will be hard to
beat, unless you are working with record variables. The classic netCDF
(CDF-1, CDF-2, and CDF-5) file formats are incredibly friendly to
parallel I/O, but that friendly layout comes at a cost: a file can have
only one UNLIMITED dimension, and the on-disk layout of record variables
is sub-optimal for I/O.
HDF5's file format allows for greater flexibility but that flexibility
comes at a metadata cost. Once you start operating on large enough
datasets and large enough levels of parallelism, the underlying file
system becomes the limit on performance.
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA