Wei Huang <huangwei@xxxxxxxx> writes:
> Hi, netcdfgroup,
>
> Currently, we are trying to use parallel-enabled NetCDF-4. We started by
> reading/writing a 5 GB file plus some computation, and got the following
> timings (wall-clock) on an IBM Power machine:
> Processors   Total(s)   Read(s)   Write(s)   Computation(s)
> seq            89.137     28.206    48.327          11.717
> 1             178.953     44.837   121.17           11.644
> 2             167.25      46.571   113.343           5.648
> 4             168.138     44.043   118.968           2.729
> 8             137.74      25.161   108.986           1.064
> 16            113.354     16.359    93.253           0.494
> 32            439.481    122.201   311.215           0.274
> 64            831.896    277.363   588.653           0.203
>
> The first thing we can see is that when the parallel-enabled code is run
> on one processor, the total wall-clock time doubles. We also did not see
> scaling as more processors were added.
>
> Does anyone want to share their experience?
>
> Thanks,
>
> Wei Huang
> huangwei@xxxxxxxx
> VETS/CISL
> National Center for Atmospheric Research
> P.O. Box 3000 (1850 Table Mesa Dr.)
> Boulder, CO 80307-3000 USA
> (303) 497-8924
>
>
Howdy Wei and all!
Are you using the 4.1.2 release? Did you configure with
--enable-parallel-tests, and did those tests pass?
I would suggest building netCDF with --enable-parallel-tests and then
running nc_test4/tst_nc4perf. This simple program, based on
user-contributed test code, performs parallel I/O with a wide variety of
options, and prints a table of results.
This will tell you whether parallel I/O is working on your platform, and
at least give some idea of reasonable settings.

Parallel I/O is a very complex topic. However, if everything is working
well, you should see I/O performance that scales reasonably linearly up
to about 8 processors (perhaps more, depending on your system, but not
much more). Beyond that point your parallel application is saturating
the I/O subsystem, and further gains in I/O performance are marginal.

In general, HDF5 I/O will not be faster than netCDF-4 I/O. The netCDF-4
layer is very thin in this area, and simply makes the HDF5 calls that
the user would otherwise make directly.
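
For reference, here is a rough sketch of what a minimal parallel create
looks like through the netCDF-4 C API. This is my own illustration, not
code from the library's tests; the file name, dimension sizes, and error
handling are placeholders, and the NC_MPIIO/NC_MPIPOSIX mode flags should
be checked against your release.

#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

/* Abort the whole job on any netCDF error. */
#define CHECK(e) do { int stat = (e); if (stat != NC_NOERR) { \
   fprintf(stderr, "netCDF error: %s\n", nc_strerror(stat));  \
   MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

int main(int argc, char **argv)
{
   int ncid, dimids[2], varid;

   MPI_Init(&argc, &argv);

   /* NC_MPIIO selects the MPI-IO driver; NC_MPIPOSIX selects POSIX I/O. */
   CHECK(nc_create_par("example_par.nc", NC_NETCDF4 | NC_MPIIO,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &ncid));

   CHECK(nc_def_dim(ncid, "x", 1024, &dimids[0]));
   CHECK(nc_def_dim(ncid, "y", 1024, &dimids[1]));
   CHECK(nc_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid));
   CHECK(nc_enddef(ncid));

   /* ... each rank writes its own hyperslab with nc_put_vara_float() ... */

   CHECK(nc_close(ncid));
   MPI_Finalize();
   return 0;
}
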
Key settings are:
* MPI-IO vs. POSIX I/O (which is faster varies from platform to
platform; see the tst_nc4perf results for your machine/compiler).
* Chunking and caching play a big role, as always. The chunk cache is
turned off by default for parallel I/O; otherwise the netCDF caches on
all of the processors would consume too much memory. You should set it
to at least the size of one chunk, and note that this cache is
allocated on every processor involved.
* Collective vs. independent access. It seems (to my naive view) like
independent access should usually be faster, but the opposite seems to
be the case. This is because the I/O subsystems are good at grouping
I/O requests into larger, more efficient units, and collective access
gives the I/O layer the maximum chance to exercise its magic. (A short
sketch of how these settings look in the C API follows this list.)
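
Continuing the sketch above (again my own illustration, with arbitrary
chunk and cache sizes), this is roughly how those settings are applied to
a single variable; the chunking call has to come before nc_enddef(),
while the cache and access-mode calls can come later:

/* Chunk layout for the 2-D variable; sizes here are only an example. */
size_t chunksizes[2] = {256, 256};
CHECK(nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksizes));

/* Per-variable chunk cache: bytes, number of slots, preemption.
   Remember this much memory is allocated on every participating rank. */
size_t cache_bytes = 256 * 256 * sizeof(float);   /* at least one chunk */
CHECK(nc_set_var_chunk_cache(ncid, varid, cache_bytes, 101, 0.75));

/* Collective vs. independent parallel access for this variable. */
CHECK(nc_var_par_access(ncid, varid, NC_COLLECTIVE));  /* or NC_INDEPENDENT */
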
The best thing to do is to get tst_nc4perf working on your platform,
and then modify it to write data files that match yours (i.e.,
variables of the same sizes). The program will then tell you the best
set of settings to use in your case.

If the program shows that parallel I/O is not working, take a look at
the netCDF test program h5_test/tst_h_par.c. This is an HDF5-only
program (no netCDF code at all) that does parallel I/O. If this program
does not show that parallel I/O is working, then your problem is not in
the netCDF layer, but somewhere in HDF5 or even lower in the stack.
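
If you want an even smaller check, a bare-bones HDF5-only parallel
create looks roughly like this (my sketch in the spirit of tst_h_par.c,
not the actual test code; the file name is made up):

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);

   /* File access property list that routes I/O through MPI-IO. */
   hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
   H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

   /* All ranks create the same file collectively. */
   hid_t file = H5Fcreate("check_par.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

   /* ... define a dataset and have each rank write its hyperslab, using
      H5Pset_dxpl_mpio() on a transfer property list to pick collective
      or independent transfers ... */

   H5Fclose(file);
   H5Pclose(fapl);
   MPI_Finalize();
   return 0;
}
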
Thanks!
Ed
--
Ed Hartnett -- ed@xxxxxxxxxxxxxxxx