
[netCDF #RQB-854711]: MPI/IO with unlimited dimensions



Hi Sebastian,

> I tried to compile with netCDF 4.3.1-rc2, but now my program crashes
> because of an MPI error:
> 
> *** An error occurred in MPI_Allreduce: the reduction operation MPI_MAX
> is not defined on the MPI_BYTE datatype
> *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
> *** MPI_ERR_OP: invalid reduce operation
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> 
> I'm using OpenMPI 1.4.3.

I'm assuming the program that crashes is the test.cpp you attached in
your original support question.  I tried to duplicate the problem using
OpenMPI 1.7.2_1 on an OS X platform, and got a different error:

  $ mpicxx test.cpp -o test -I${NCDIR}/include -I${H5DIR}/include -L${NCDIR}/lib -L${H5DIR}/lib -lnetcdf -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl
  $ ./test
  Start on rank 0: 0 0
  Count on rank 0: 1 0
  Assertion failed: (size), function H5MM_calloc, file ../../src/H5MM.c, line 95.
  [mort:71677] *** Process received signal ***
  [mort:71677] Signal: Abort trap: 6 (6)
  [mort:71677] Signal code:  (0)
  [mort:71677] [ 0] 2   libsystem_c.dylib                   0x00007fff939b994a _sigtramp + 26
  [mort:71677] [ 1] 3   ???                                 0x0000000000000000 0x0 + 0
  [mort:71677] [ 2] 4   libsystem_c.dylib                   0x00007fff93a11e2a __assert_rtn + 146
  [mort:71677] [ 3] 5   test                                0x0000000108eeea10 H5MM_calloc + 256
  [mort:71677] [ 4] 6   test                                0x0000000108d4ca3e H5D__chunk_io_init + 1534
  [mort:71677] [ 5] 7   test                                0x0000000108d8a45c H5D__write + 4028
  [mort:71677] [ 6] 8   test                                0x0000000108d87460 H5D__pre_write + 3552
  [mort:71677] [ 7] 9   test                                0x0000000108d8658c H5Dwrite + 732
  [mort:71677] [ 8] 10  test                                0x0000000108c8ac27 nc4_put_vara + 3991
  [mort:71677] [ 9] 11  test                                0x0000000108ca0564 nc4_put_vara_tc + 164
  [mort:71677] [10] 12  test                                0x0000000108ca04ab NC4_put_vara + 75
  [mort:71677] [11] 13  test                                0x0000000108c08240 NC_put_vara + 288
  [mort:71677] [12] 14  test                                0x0000000108c092d4 nc_put_vara_int + 100
  [mort:71677] [13] 15  test                                0x0000000108bf2e56 main + 630
  [mort:71677] [14] 16  libdyld.dylib                       0x00007fff886fd7e1 start + 0
  [mort:71677] [15] 17  ???                                 0x0000000000000001 0x0 + 1
  [mort:71677] *** End of error message ***
  Abort
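
For anyone reading this ticket without the attachment, the failing
pattern is roughly the following.  This is a minimal sketch, not
Sebastian's actual test.cpp; the file name, dimension names, and the
start/count choices are guesses (the output above suggests rank 0 ends
up writing zero elements along one dimension, a classic trigger for
collective-I/O problems, since writing a new record must grow the
unlimited dimension on all ranks).  Error checking is omitted for
brevity:

  #include <mpi.h>
  #include <netcdf.h>
  #include <netcdf_par.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, ncid, dimids[2], varid;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Create the file for parallel access through HDF5. */
      nc_create_par("test.nc", NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD,
                    MPI_INFO_NULL, &ncid);

      /* One unlimited (record) dimension plus one fixed dimension. */
      nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
      nc_def_dim(ncid, "x", nprocs, &dimids[1]);
      nc_def_var(ncid, "v", NC_INT, 2, dimids, &varid);
      nc_enddef(ncid);

      /* Collective access is where the hang/crash is observed. */
      nc_var_par_access(ncid, varid, NC_COLLECTIVE);

      /* Every rank must take part in the collective write, even a rank
         whose count is zero, because adding a record grows the
         unlimited dimension. */
      size_t start[2] = {0, (size_t)rank};
      size_t count[2] = {1, (size_t)(rank == 0 ? 0 : 1)};
      int value = rank;
      nc_put_vara_int(ncid, varid, start, count, &value);

      nc_close(ncid);
      MPI_Finalize();
      return 0;
  }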

> I think the bug was introduced by this pull request:
> https://github.com/Unidata/netcdf-c/pull/4

We're looking at the problem, thanks for reporting it.
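
For the record, the MPI error itself is informative: the MPI standard
defines MPI_MAX only for integer and floating-point types, so reducing
MPI_BYTE with MPI_MAX is invalid, and OpenMPI is within its rights to
abort.  Here is a minimal sketch of the failing pattern and a portable
rewrite (illustrative only, not the actual netCDF-C code):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* A per-rank flag whose maximum we want across all ranks. */
      unsigned char local_flag = 1;

      /* Invalid: MPI_MAX is not defined on MPI_BYTE, which is what the
         error message above is complaining about:

         MPI_Allreduce(&local_flag, &global_flag, 1, MPI_BYTE, MPI_MAX,
                       MPI_COMM_WORLD);
      */

      /* Portable: carry the flag in an integer type instead. */
      int local = local_flag, global = 0;
      MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }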

--Russ

> Best regards,
> Sebastian
> 
> On 22.08.2013 18:28, Unidata netCDF Support wrote:
> > Hi Sebastian,
> >
> >> my problem sounds similar to the bug, but it is different. My program
> >> also hangs when using collective MPI I/O.
> >>
> >> According to the bug report, only an issue with independent I/O was fixed.
> >
> > You're right, but we think we have a fix for the collective I/O hang now,
> > available in the netCDF-C 4.3.1-rc2 version (a release candidate):
> >
> >    https://github.com/Unidata/netcdf-c/releases/tag/v4.3.1-rc2
> >
> > At your convenience, please let us know if it fixes the problem.
> >
> > --Russ
> >
> >> On 06.08.2013 00:09, Unidata netCDF Support wrote:
> >>> Hi Sebastian,
> >>>
> >>> Could you tell us if this recently fixed bug sounds like what you
> >>> found?
> >>>
> >>>     https://bugtracking.unidata.ucar.edu/browse/NCF-250
> >>>
> >>> If so, the fix will be in netCDF release 4.3.1, a release candidate
> >>> for which will soon be announced.
> >>>
> >>> --Russ
> >>>
> >>>> Hi everybody,
> >>>>
> >>>> I just figured out that using collective MPI I/O on variables with
> >>>> unlimited dimensions can lead to deadlocks or wrong files.
> >>>>
> >>>> I have attached a small example program which can reproduce the
> >>>> deadlock (and wrong output files, depending on the variable "count").
> >>>>
> >>>> Did I do anything wrong or is this a known bug?
> >>>>
> >>>> My configuration:
> >>>> hdf5 1.8.11
> >>>> netcdf 4.3
> >>>> openmpi (default ubuntu installation)
> >>>>
> >>>> Compile command:
> >>>> mpicxx test.cpp -I/usr/local/include -L/usr/local/lib -lnetcdf -lhdf5_hl
> >>>> -lhdf5 -lz
> >>>> (netcdf and hdf5 are installed in /usr/local)
> >>>>
> >>>> Best regards,
> >>>> Sebastian
> >>>>
> >>>> --
> >>>> Sebastian Rettenberger, M.Sc.
> >>>> Technische Universität München
> >>>> Department of Informatics
> >>>> Chair of Scientific Computing
> >>>> Boltzmannstrasse 3, 85748 Garching, Germany
> >>>> http://www5.in.tum.de/
> >>>>
> >>>>
> >>> Russ Rew                                         UCAR Unidata Program
> >>> address@hidden                      http://www.unidata.ucar.edu
> >>>
> >>>
> >>>
> >>> Ticket Details
> >>> ===================
> >>> Ticket ID: RQB-854711
> >>> Department: Support netCDF
> >>> Priority: Normal
> >>> Status: Closed
> >>>
> >>
> >> --
> >> Sebastian Rettenberger, M.Sc.
> >> Technische Universität München
> >> Department of Informatics
> >> Chair of Scientific Computing
> >> Boltzmannstrasse 3, 85748 Garching, Germany
> >> http://www5.in.tum.de/
> >>
> >>
> >>
> > Russ Rew                                         UCAR Unidata Program
> > address@hidden                      http://www.unidata.ucar.edu
> >
> >
> >
> > Ticket Details
> > ===================
> > Ticket ID: RQB-854711
> > Department: Support netCDF
> > Priority: Normal
> > Status: Closed
> >
> 
> --
> Sebastian Rettenberger, M.Sc.
> Technische Universität München
> Department of Informatics
> Chair of Scientific Computing
> Boltzmannstrasse 3, 85748 Garching, Germany
> http://www5.in.tum.de/

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: RQB-854711
Department: Support netCDF
Priority: Normal
Status: Closed