
[netCDF #PDL-125161]: Writing parallel files with zero-size chunks



Thomas,

> On 08/29/2012 12:18 AM, Unidata netCDF Support wrote:
> > I just succeeded in running a test case that used count[0] = 0 on an MPI
> > parallel file system using the netCDF-4 parallel I/O inherited from HDF5,
> > and it ran fine.
> >
> > The test I ran just inserted the following code in a loop after line 136 in
> > nc_test4/tst_parallel.c:
> >
> >        /* See if count dimension == 0 returns error */
> >        count_save = count[0];
> >        count[0] = 0;
> >        if (nc_put_vara_int(ncid, v1id, start, count, slab_data)) ERR;
> >        count[0] = count_save ;
> >
> > Discussion with CISL consultants suggests the problem may be
> > platform-specific.
> 
> Thanks for developing further test programs. Unfortunately I don't see how all
> processes writing with count = 0 answers my question about all processes but
> one writing with count=0. What I'd like to know is

Sorry, I see I was confusing your support ticket with another, similar question
that asked whether having count[i]==0 for any i in nc_put_var calls was
permitted on parallel platforms.

Now I've compiled and run the bug demonstration code you provided and have
reproduced the problem, with the program hanging at the same place you observed:

  $ mpirun -n 5 ./nc4partest
  mpi_name: spock.unidata.ucar.edu size: 5 rank: 0, isDataWriter=0
  mpi_name: spock.unidata.ucar.edu size: 5 rank: 1, isDataWriter=0
  mpi_name: spock.unidata.ucar.edu size: 5 rank: 2, isDataWriter=1
  mpi_name: spock.unidata.ucar.edu size: 5 rank: 3, isDataWriter=0
  mpi_name: spock.unidata.ucar.edu size: 5 rank: 4, isDataWriter=0
  mpi_rank=0 start[0]=0 start[1]=0 count[0]=0 count[1]=0
  mpi_rank=2 start[0]=0 start[1]=0 count[0]=24 count[1]=24
  mpi_rank=3 start[0]=0 start[1]=0 count[0]=0 count[1]=0
  mpi_rank=4 start[0]=0 start[1]=0 count[0]=0 count[1]=0
  mpi_rank=1 start[0]=0 start[1]=0 count[0]=0 count[1]=0
  mpi_rank=1 start[0]=0 start[1]=0 count[0]=0 count[1]=0
  C-c C-c
  Ctrl-C caught... cleaning up processes
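
The start/count values printed above follow a simple pattern: one rank selects
the whole 24 x 24 slab and every other rank selects nothing. A minimal sketch
of that selection logic in plain C, for illustration only: slab_for_rank and
slab_elements are hypothetical helper names, not part of netCDF or of the
submitted test program, and the 2-D shape is assumed from the output above.

```c
#include <stddef.h>

#define NDIMS 2

/* Hypothetical helper (not a netCDF function): fill start/count for one rank.
 * Only writer_rank selects the full nrows x ncols slab; every other rank
 * gets a zero-size slab, matching the printed start/count values. */
static void slab_for_rank(int rank, int writer_rank,
                          size_t nrows, size_t ncols,
                          size_t start[NDIMS], size_t count[NDIMS])
{
    start[0] = 0;
    start[1] = 0;
    if (rank == writer_rank) {
        count[0] = nrows;
        count[1] = ncols;
    } else {
        count[0] = 0;
        count[1] = 0;
    }
}

/* Number of elements selected by count[]; 0 means the rank writes nothing. */
static size_t slab_elements(const size_t count[NDIMS])
{
    size_t n = 1;
    for (int i = 0; i < NDIMS; i++)
        n *= count[i];
    return n;
}
```

Each rank would then pass its own start and count arrays to a call such as
nc_put_vara_int(ncid, varid, start, count, data); with 5 ranks and rank 2 as
the writer, this is the combination that hangs in the run shown above.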

> * Given the available API documentation is my program incorrect or triggering
> undocumented behaviour?

It seems to be correct according to the meager API documentation.  The developer
who implemented the netCDF-4 parallel I/O is no longer at Unidata, and we don't
have anyone here currently with the expertise to diagnose and fix this problem.
I have been trying to contract some help for this area, but have not yet
succeeded.

> * Since you mention the problem might be platform-specific and my program is
> using a fairly widely available platform (Debian GNU/Linux with only two
> self-compiled libraries used, in this case HDF5 1.8.9 and netcdf 4.2.1.1 both
> passing all tests invoked by make check) is there a bug on this platform I
> should be aware of? Is there another platform I should use instead? I'm all
> for stable testing platforms but I'm not aware of a binary download at
> http://www.unidata.ucar.edu/downloads/netcdf/netcdf-4_2_1_1/index.jsp

Please ignore my comment that the bug might be platform-specific, as that was
merely a repetition of what I heard from NCAR CISL consultants about the other,
related bug that they have been looking at.  I will enter this bug into our
Jira issue tracking system, but I don't know whether we will be able to resolve
it in the near future.  For now, I can only recommend that you contact the NCAR
CISL consulting office:

  https://www2.cisl.ucar.edu/uss/csg

Sorry we can't be of more help ...

--Russ


Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: PDL-125161
Department: Support netCDF
Priority: Normal
Status: Closed