Ed,

See my answers/comments to your email below.

Thanks,

Wei Huang
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924

On Sep 19, 2011, at 4:43 PM, Ed Hartnett wrote:

> Wei Huang <huangwei@xxxxxxxx> writes:
>
>> Hi, netcdfgroup,
>>
>> Currently, we are trying to use parallel-enabled NetCDF4. We started by
>> reading/writing a 5 GB file plus some computation, and got the following
>> wall-clock timings on an IBM Power machine:
>>
>> Processors    Total(s)    Read(s)    Write(s)    Computation(s)
>> seq            89.137      28.206     48.327       11.717
>> 1             178.953      44.837    121.17        11.644
>> 2             167.25       46.571    113.343        5.648
>> 4             168.138      44.043    118.968        2.729
>> 8             137.74       25.161    108.986        1.064
>> 16            113.354      16.359     93.253        0.494
>> 32            439.481     122.201    311.215        0.274
>> 64            831.896     277.363    588.653        0.203
>>
>> The first thing we see is that running the parallel-enabled code on one
>> processor doubles the total wall-clock time.
>> Beyond that, we do not see any scaling as more processors are added.
>>
>> Would anyone like to share their experience?
>>
>> Thanks,
>>
>> Wei Huang
>> huangwei@xxxxxxxx
>> VETS/CISL
>> National Center for Atmospheric Research
>> P.O. Box 3000 (1850 Table Mesa Dr.)
>> Boulder, CO 80307-3000 USA
>> (303) 497-8924
>
> Howdy Wei and all!
>
> Are you using the 4.1.2 release? Did you configure with
> --enable-parallel-tests, and did those tests pass?

I am using 4.1.3, configured with "--enable-parallel-tests", and those
tests passed.

> I would suggest building netCDF with --enable-parallel-tests and then
> running nc_test4/tst_nc4perf. This simple program, based on
> user-contributed test code, performs parallel I/O with a wide variety of
> options, and prints a table of results.

I have run nc_test4/tst_nc4perf for 1, 2, 4, and 8 processors; the results
are attached. To me, the performance decreases as processors are added.
Someone may have a better interpretation.

I also ran tst_parallel4, with these results:

num_proc    time(s)     write_rate(B/s)
1            9.2015      1.16692e+08
2           12.4557      8.62048e+07
4            6.30644     1.70261e+08
8            5.53761     1.939e+08
16           2.25639     4.75866e+08
32           2.28383     4.7015e+08
64           2.19041     4.90202e+08

> This will tell you whether parallel I/O is working on your platform, and
> at least give some idea of reasonable settings.
>
> Parallel I/O is a very complex topic. However, if everything is working
> well, you should see I/O improvement that scales reasonably linearly for
> fewer than about 8 processors (perhaps more, depending on your system,
> but not much more). Beyond that point, your parallel application is
> saturating your I/O subsystem, and further I/O performance gains are
> marginal.
>
> In general, HDF5 I/O will not be faster than netCDF-4 I/O. The netCDF-4
> layer is very light in this area, and simply makes the HDF5 calls that
> the user would make anyway.
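For concreteness, here is a minimal sketch of the kind of netCDF-4 parallel
write that tst_nc4perf and tst_parallel4 exercise. The file name, array
sizes, and the row decomposition below are invented for illustration, not
taken from either test program:

    #include <stdlib.h>
    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>   /* parallel API; header layout may vary by version */

    #define NX 1024           /* rows written by each rank (assumed) */
    #define NY 1024           /* columns (assumed) */

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimids[2], varid;
        size_t start[2], count[2], i;
        float *data;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Every rank creates the same file; NC_MPIIO selects the MPI-IO
           driver (NC_MPIPOSIX would select the POSIX driver instead). */
        if (nc_create_par("demo_par.nc", NC_NETCDF4 | NC_MPIIO,
                          MPI_COMM_WORLD, MPI_INFO_NULL, &ncid))
            return 1;

        nc_def_dim(ncid, "x", (size_t)NX * nprocs, &dimids[0]);
        nc_def_dim(ncid, "y", NY, &dimids[1]);
        nc_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid);
        nc_enddef(ncid);

        /* Each rank writes its own contiguous slab of rows. */
        start[0] = (size_t)rank * NX;  start[1] = 0;
        count[0] = NX;                 count[1] = NY;
        data = malloc(sizeof(float) * NX * NY);
        for (i = 0; i < (size_t)NX * NY; i++)
            data[i] = (float)rank;
        nc_put_vara_float(ncid, varid, start, count, data);

        nc_close(ncid);
        free(data);
        MPI_Finalize();
        return 0;
    }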
> Key settings are:
>
> * MPI_IO vs. POSIX_IO (which is faster varies from platform to platform;
> see the nc4perf results for your machine/compiler).

Tested both; POSIX is better.

> * Chunking and caching play a big role, as always. Caching is turned off
> by default, because otherwise the netCDF caches on all the processors
> would consume too much memory. But you should set the cache to at least
> the size of one chunk. Note that this cache will exist on every
> processor involved.

We use chunking, and can probably try caching.

> * Collective vs. independent access. It seems (to my naive view) like
> independent should usually be faster, but the opposite seems to be the
> case. This is because the I/O subsystems are good at grouping I/O
> requests into larger, more efficient units. Collective access gives the
> I/O layer the maximum chance to exercise its magic.

Tried both; no significant difference. (A sketch of the calls that control
these three settings follows at the end of this message.)

> Best thing to do is get tst_nc4perf working on your platform, and then
> modify it to write data files that match yours (i.e. same size
> variables). The program will then tell you the best set of settings to
> use in your case.

We can modify this program to mimic our data size, but we do not know
whether this will help us.

> If the program shows that parallel I/O is not working, take a look at
> the netCDF test program h5_test/tst_h_par.c. This is an HDF5-only
> program (no netCDF code at all) that does parallel I/O. If this program
> does not show that parallel I/O is working, then your problem is not
> with the netCDF layer, but somewhere in HDF5 or even lower in the stack.
>
> Thanks!
>
> Ed
>
> --
> Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx
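As referenced above, a minimal, hedged sketch of the calls that control the
three settings Ed lists (I/O driver, chunk cache, collective vs. independent
access). The cache size, slot count, and preemption values are placeholders,
not recommendations:

    #include <netcdf.h>
    #include <netcdf_par.h>   /* parallel declarations; location may vary by version */

    /* The MPI-IO vs. POSIX choice is made when the file is created or
       opened, e.g.:
         nc_create_par(path, NC_NETCDF4 | NC_MPIIO,    comm, info, &ncid);
         nc_create_par(path, NC_NETCDF4 | NC_MPIPOSIX, comm, info, &ncid);
    */

    /* Apply per-variable tuning after the variable has been defined. */
    static int tune_var(int ncid, int varid, size_t chunk_bytes)
    {
        int status;

        /* Per-variable chunk cache: make it at least one chunk in size,
           and remember that every MPI task allocates its own cache. */
        status = nc_set_var_chunk_cache(ncid, varid, chunk_bytes,
                                        1009,   /* number of cache slots */
                                        0.75f); /* preemption policy */
        if (status != NC_NOERR)
            return status;

        /* Switch this variable from the default independent access to
           collective access, so the MPI-IO layer can aggregate requests. */
        return nc_var_par_access(ncid, varid, NC_COLLECTIVE);
    }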
Attachment: mpi_io_bluefire.perf (binary data)