On Mon, Sep 19, 2011 at 11:09:23AM -0600, Wei Huang wrote:
> Jim,
>
> I am using the gpfs filesystem, but did not set any MPI-IO hints.
> I did not do processor binding, but I guess binding could help if
> fewer processors are used on a node.
> I am actually using NC_MPIPOSIX rather than NC_MPIIO, as the latter gives
> even worse timing.
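
(For reference, a minimal, untested sketch of how the two modes are selected
at open time with the netCDF-4 C API; the function and file names are just
placeholders:)

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    /* NC_MPIPOSIX has HDF5 bypass the MPI-IO layer and use POSIX I/O;
     * NC_MPIIO routes the I/O through MPI-IO (ROMIO on most systems). */
    int open_parallel(const char *path, MPI_Comm comm, int use_posix, int *ncid)
    {
        int mode = NC_NETCDF4 | (use_posix ? NC_MPIPOSIX : NC_MPIIO);
        return nc_open_par(path, mode, comm, MPI_INFO_NULL, ncid);
    }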
>
> The 5G file has 170 variables, some of which have size:
> [ 1 <time | unlimited>, 27 <ilev>, 768 <lat>, 1152 <lon> ]
> and use chunk size (1, 1, 192, 288).
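
(For concreteness, an untested sketch of defining one such record variable with
that chunking; the variable name "T" is made up, and the sizes assume double
precision, so halve them for float. Each 1x1x192x288 chunk is ~432 KB, and one
time step of the 27x768x1152 field is ~182 MB, i.e. 432 chunks:)

    int dimids[4], varid;
    size_t chunks[4] = {1, 1, 192, 288};

    /* ncid comes from nc_create_par/nc_open_par */
    nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
    nc_def_dim(ncid, "ilev", 27,   &dimids[1]);
    nc_def_dim(ncid, "lat",  768,  &dimids[2]);
    nc_def_dim(ncid, "lon",  1152, &dimids[3]);
    nc_def_var(ncid, "T", NC_DOUBLE, 4, dimids, &varid);
    nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
    /* collective vs. independent access can make a large difference */
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);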
>
> The last part sounds more like a job for the netcdf developers.
Perhaps you can make the netcdf developers' job a bit easier by
providing a test case. If the dataset contains 170 variables, then it
must be part of some larger program and so might be hard to extract.
I'll be honest: I'm mostly curious how pnetcdf handles this workload
(my guess as a pnetcdf developer is "poorly" because of the record
variable i/o). Still, the test case will help the netcdf, hdf5, and
MPI-IO developers...
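
If you do end up trying pnetcdf for comparison, the write would look roughly
like this (untested sketch; variable names are made up, and start/count
describe each rank's piece of the field):

    #include <mpi.h>
    #include <pnetcdf.h>

    void write_field_pnetcdf(const char *path, const MPI_Offset start[4],
                             const MPI_Offset count[4], const double *buf)
    {
        int ncid, dimids[4], varid;
        ncmpi_create(MPI_COMM_WORLD, path, NC_64BIT_DATA, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        ncmpi_def_dim(ncid, "ilev", 27,   &dimids[1]);
        ncmpi_def_dim(ncid, "lat",  768,  &dimids[2]);
        ncmpi_def_dim(ncid, "lon",  1152, &dimids[3]);
        ncmpi_def_var(ncid, "T", NC_DOUBLE, 4, dimids, &varid);
        ncmpi_enddef(ncid);
        /* collective write of this rank's subarray; record variables are
         * interleaved on disk per record, which is where I expect trouble */
        ncmpi_put_vara_double_all(ncid, varid, start, count, buf);
        ncmpi_close(ncid);
    }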
==rob
> On Sep 19, 2011, at 10:48 AM, Jim Edwards wrote:
>
> > Hi Wei,
> >
> >
> > Are you using the gpfs filesystem and are you setting any MPI-IO hints for
> > that filesystem?
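
(To make the hints question concrete: hints are usually passed via an MPI_Info
object at create/open time. The sketch below is untested; "IBM_largeblock_io"
is a GPFS hint specific to IBM's MPI, "cb_buffer_size" is a standard ROMIO
hint, and the right settings really have to be measured on your machine:)

    int ncid;
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "IBM_largeblock_io", "true");    /* GPFS on IBM MPI */
    MPI_Info_set(info, "cb_buffer_size", "16777216");   /* ROMIO collective buffering */
    nc_create_par("out.nc", NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD, info, &ncid);
    MPI_Info_free(&info);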
> >
> > Are you using any processor binding technique? Have you experimented with
> > other settings?
> >
> > You stated that the file is 5G, but what is the size of a single field and
> > how is it distributed? In other words, is it already aggregated into a nice
> > block size, or are you expecting netcdf/MPI-IO to handle that?
> >
> > I think that in order to really get a good idea of where the performance
> > problem might be, you need to start by writing and timing a binary file of
> > roughly equivalent size, then write an hdf5 file, then write a netcdf4
> > file. My guess is that you will find that the performance problem is
> > lower on the tree...
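
(A rough, untested sketch of what the raw-binary MPI-IO baseline might look
like, assuming each rank owns one contiguous block of nelems doubles:)

    #include <mpi.h>

    void write_raw(const char *path, double *buf, MPI_Offset nelems)
    {
        int rank;
        MPI_File fh;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, (char *)path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* one collective call, each rank writing at its own offset */
        MPI_File_write_at_all(fh, rank * nelems * (MPI_Offset)sizeof(double),
                              buf, (int)nelems, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

Timing the same amount of data through this, then through HDF5, then through
netcdf-4, should show which layer the slowdown comes from.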
> >
> > - Jim
> >
> > On Mon, Sep 19, 2011 at 10:28 AM, Wei Huang <huangwei@xxxxxxxx> wrote:
> > Hi, netcdfgroup,
> >
> > Currently, we are trying to use parallel-enabled NetCDF4. We started by
> > reading/writing a 5G file plus some computation, and we got the following
> > timing (wall-clock seconds) on an IBM Power machine:
> > Number of Processors   Total(seconds)   Read(seconds)   Write(seconds)   Computation(seconds)
> > seq                            89.137          28.206           48.327                 11.717
> > 1                             178.953          44.837          121.17                  11.644
> > 2                             167.25           46.571          113.343                  5.648
> > 4                             168.138          44.043          118.968                  2.729
> > 8                             137.74           25.161          108.986                  1.064
> > 16                            113.354          16.359           93.253                  0.494
> > 32                            439.481         122.201          311.215                  0.274
> > 64                            831.896         277.363          588.653                  0.203
> >
> > The first thing we can see is that when running the parallel-enabled code on
> > one processor, the total wall-clock time doubled.
> > We also did not see any scaling as more processors were added.
> >
> > Does anyone want to share their experience?
> >
> > Thanks,
> >
> > Wei Huang
> > huangwei@xxxxxxxx
> > VETS/CISL
> > National Center for Atmospheric Research
> > P.O. Box 3000 (1850 Table Mesa Dr.)
> > Boulder, CO 80307-3000 USA
> > (303) 497-8924
> >
> >
> >
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe, visit:
> http://www.unidata.ucar.edu/mailing_lists/
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA