Jim,
I am using the GPFS filesystem but did not set any MPI-IO hints.
I did not do processor binding, but I guess binding could help when
fewer processors are used on a node.
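If we were to set hints, my understanding is that they are passed through the
MPI_Info argument of nc_open_par()/nc_create_par(); a minimal sketch (the hint
name "IBM_largeblock_io" below is only an example sometimes suggested for GPFS
on IBM systems, so check which hints your MPI actually honors):

    #include <mpi.h>

    MPI_Info info;
    MPI_Info_create(&info);
    /* Example hint only; valid names/values depend on the MPI implementation. */
    MPI_Info_set(info, "IBM_largeblock_io", "true");
    /* ... pass 'info' as the MPI_Info argument of nc_open_par()/nc_create_par() ... */
    MPI_Info_free(&info);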
I am actually using NC_MPIPOSIX rather than NC_MPIIO, as the latter gives
even worse timings.
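For reference, the only difference between the two on our side is the mode
flag passed to nc_open_par(); a minimal sketch (the file name is a placeholder):

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    int ncid;
    /* either: parallel access through MPI-IO (worse timings for us) */
    nc_open_par("history.nc", NC_NOWRITE | NC_MPIIO,
                MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
    /* or: parallel POSIX access through HDF5 (what we use now) */
    nc_open_par("history.nc", NC_NOWRITE | NC_MPIPOSIX,
                MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);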
The 5 GB file has 170 variables, some of which have dimensions
[ 1 <time | unlimited>, 27 <ilev>, 768 <lat>, 1152 <lon> ]
and a chunk size of (1, 1, 192, 288).
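For reference, that chunking is declared roughly like this with the netCDF-4
C API (the variable name "T" and the dimids array below are placeholders):

    #include <netcdf.h>

    /* ncid and the four dimension ids (time, ilev, lat, lon) assumed defined */
    int varid;
    size_t chunks[4] = {1, 1, 192, 288};
    nc_def_var(ncid, "T", NC_FLOAT, 4, dimids, &varid);
    nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);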
The last part sounds more like a job for the netCDF developers, though the
raw-binary baseline is something we could try; a rough sketch of that timing
test, with placeholder buffer size and file name, is below:
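    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank writes one contiguous 64 MB slice (placeholder size). */
        int n = 64 * 1024 * 1024;
        char *buf = malloc(n);            /* contents don't matter for timing */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "baseline.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        double t0 = MPI_Wtime();
        MPI_File_write_at_all(fh, (MPI_Offset)rank * n, buf, n, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);              /* include the close so buffers flush */
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("binary write: %.3f s\n", t1 - t0);

        free(buf);
        MPI_Finalize();
        return 0;
    }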
Thanks,
Wei
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924
On Sep 19, 2011, at 10:48 AM, Jim Edwards wrote:
> Hi Wei,
>
>
> Are you using the GPFS filesystem, and are you setting any MPI-IO hints for
> that filesystem?
>
> Are you using any processor binding technique? Have you experimented with
> other settings?
>
> You stated that the file is 5G, but what is the size of a single field and how
> is it distributed? In other words, is it already aggregated into a nice
> blocksize, or are you expecting netcdf/MPI-IO to handle that?
>
> I think that in order to really get a good idea of where the performance
> problem might be, you need to start by writing and timing a binary file of
> roughly equivalent size, then write an hdf5 file, then write a netcdf4 file.
> My guess is that you will find that the performance problem is lower on the
> tree...
>
> - Jim
>
> On Mon, Sep 19, 2011 at 10:28 AM, Wei Huang <huangwei@xxxxxxxx> wrote:
> Hi, netcdfgroup,
>
> Currently, we are trying to use parallel-enabled NetCDF4. We started by
> reading/writing a 5 GB file with some computation, and got the following
> wall-clock timings on an IBM POWER machine:
> Number of Processors   Total (s)    Read (s)    Write (s)   Computation (s)
> seq                     89.137       28.206      48.327      11.717
> 1                      178.953       44.837     121.17       11.644
> 2                      167.25        46.571     113.343       5.648
> 4                      168.138       44.043     118.968       2.729
> 8                      137.74        25.161     108.986       1.064
> 16                     113.354       16.359      93.253       0.494
> 32                     439.481      122.201     311.215       0.274
> 64                     831.896      277.363     588.653       0.203
>
> The first thing we can see is that running the parallel-enabled code on one
> processor doubles the total wall-clock time.
> Beyond that, we do not see scaling as more processors are added.
>
> Does anyone want to share their experience?
>
> Thanks,
>
> Wei Huang
> huangwei@xxxxxxxx
> VETS/CISL
> National Center for Atmospheric Research
> P.O. Box 3000 (1850 Table Mesa Dr.)
> Boulder, CO 80307-3000 USA
> (303) 497-8924
>
>
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe, visit:
> http://www.unidata.ucar.edu/mailing_lists/
>