Re: [netcdfgroup] Problem with parallel netcdf

There used to be a problem with netcdf4 and openmpi (1.4.x), where netcdf4 
assumed mpich's behavior in setting MPI_ERR_COMM (or some other error code to 
which mpich improperly assigned a fixed value but openmpi did not). That got 
fixed, but perhaps the problem has come back?  Did you try openmpi 1.4.x?

-- Ted

On Feb 8, 2012, at 6:03 PM, Orion Poplawski wrote:

> I'm trying to build parallel enabled netcdf 4.1.3 on Fedora 16 with hdf5 
> 1.8.7 and with both mpich2 1.4.1p1 and openmpi 1.5.4.  In running make check 
> with the openmpi build I get:
> 
> $ mpiexec -n 4 ./f90tst_parallel
> [orca.cora.nwra.com:32630] *** An error occurred in MPI_Comm_d
> [orca.cora.nwra.com:32630] *** on communicator MPI_COMM_WORLD
> [orca.cora.nwra.com:32630] *** MPI_ERR_COMM: invalid communicator
> [orca.cora.nwra.com:32630] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> HDF5: infinite loop closing library
> D,T,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FDFD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,D,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,F,FD,FD,FD,FD,FD,FD,FD,FD,FD
> 
> *** Testing netCDF-4 parallel I/O from Fortran 90.
> HDF5: infinite loop closing library
> D,T,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FDFD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,D,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,F,FD,FD,FD,FD,FD,FD,FD,FD,FD
> HDF5: infinite loop closing library
> D,T,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FDFD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,D,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,F,FD,FD,FD,FD,FD,FD,FD,FD,FD
> ------------------------------------------------------------------------
> mpiexec has exited due to process rank 2 with PID 32631 on
> node orca.cora.nwra.com exiting improperly. There are two reasons this could 
> occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init"
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here)
> ------------------------------------------------------------------------
> [orca.cora.nwra.com:32628] 3 more processes have sent help message 
> help-mpi-errors.txt / mpi_errors_are_fatal
> [orca.cora.nwra.com:32628] Set MCA parameter "orte_base_help_aggregate" to 0 
> to see all help / error messages
> 
> 
> It appears to work fine with mpich2.  Has anyone else come across this?
> 
> Thanks,
> 
>  Orion
> 
> -- 
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder Office                  FAX: 303-415-9702
> 3380 Mitchell Lane                  orion@xxxxxxxxxxxxx
> Boulder, CO 80301              http://www.cora.nwra.com
> 
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit: 
> http://www.unidata.ucar.edu/mailing_lists/ 


