
[netCDF #PDL-125161]: Writing parallel files with zero-size chunks



Hi Thomas,

> I have an MPI-parallel application with a decomposition such that each array
> is completely handled by one process for I/O purposes, and the arrays are
> distributed in a round-robin fashion, i.e. task 0 holds all of array A, task 1
> holds all of array B, and so forth.
> 
> My expectation was that I could write this with netcdf4 parallel I/O, so I
> compiled netcdf 4.2.1.1 for OpenMPI 1.4.2 and hdf5 1.8.9 on Debian GNU/Linux
> x86_64 and started testing.
> 
> Unfortunately, when I issue nc_create_par with NC_MPIPOSIX and
> nc_var_par_access with flag NC_INDEPENDENT, I only get invalid output; when I
> change the nc_create_par option to NC_MPIIO, the program hangs on nc_close.
> 
> I've reduced my use case to a small test mostly resembling one of the
> demonstration programs. I think the most relevant part is that the processes
> not having any elements from the array each use start and count values of 0
> for every dimension.
> 
> Please see the attached files for more information.
> 
> Running the attached program produces the following output:
> 
> $ mpirun -n 5 ./nc4partest
> mpi_name: taifun size: 5 rank: 0, isDataWriter=0
> mpi_name: taifun size: 5 rank: 1, isDataWriter=0
> mpi_name: taifun size: 5 rank: 2, isDataWriter=1
> mpi_name: taifun size: 5 rank: 4, isDataWriter=0
> mpi_name: taifun size: 5 rank: 3, isDataWriter=0
> mpi_rank=1 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=2 start[0]=0 start[1]=0 count[0]=24 count[1]=24
> mpi_rank=0 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=3 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=4 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> 
> From this point on, the program hangs.
> 
> I've tried to find a hint on how to use the nc_put_vara_int call for this
> case but found nothing.
> 
> Do I have to redistribute the data before writing? Are there other values for
> start/count I could use?

I just succeeded in running a test case that used count[0] = 0 on an MPI
parallel file system using the netCDF-4 parallel I/O inherited from HDF5, and
it ran fine.

The test I ran just inserted the following code in a loop after line 136 in
nc_test4/tst_parallel.c:

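       /* See if a count of 0 in the first dimension returns an error */
       count_save = count[0];   /* remember the real count */
       count[0] = 0;            /* request a zero-size write */
       if (nc_put_vara_int(ncid, v1id, start, count, slab_data)) ERR;
       count[0] = count_save;   /* restore the count for the real write */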
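
For reference, the start/count selection described in the question (one rank
owns the whole variable, every other rank passes zeros) can be sketched roughly
as follows. This is only an illustration under assumptions: select_slab and
owner_of are hypothetical helper names, not part of netCDF or of the attached
test program, and the sketch says nothing about why the hang occurs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a netCDF function): round-robin ownership,
 * where variable number var_index is handled by rank var_index % nprocs. */
int owner_of(int var_index, int nprocs)
{
    assert(nprocs > 0 && var_index >= 0);
    return var_index % nprocs;
}

/* Hypothetical helper (not a netCDF function): fill start[] and count[]
 * for one rank. The owning rank selects the whole variable; every other
 * rank gets start = 0 and count = 0 in each dimension. */
void select_slab(int rank, int owner_rank, int ndims,
                 const size_t *dimlen, size_t *start, size_t *count)
{
    assert(ndims > 0);
    for (int d = 0; d < ndims; d++) {
        start[d] = 0;
        count[d] = (rank == owner_rank) ? dimlen[d] : 0;
    }
}
```

Each rank would then pass the resulting start and count arrays to
nc_put_vara_int, in the same way as the test lines above do.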

Discussion with CISL consultants suggests the problem may be
platform-specific.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: PDL-125161
Department: Support netCDF
Priority: Normal
Status: Closed