Re: [netcdfgroup] bug report (nc_put_varm_double hangs during collective parallel I/O): a follow-up

  • To: Constantine Khroulev <ckhroulev@xxxxxxxxxx>
  • From: Rob Latham <robl@xxxxxxxxxxx>
  • Date: Fri, 27 Jan 2012 14:22:15 -0600
On Wed, Jan 25, 2012 at 10:06:59PM -0900, Constantine Khroulev wrote:
> Hello NetCDF developers,
> 
> My apologies to list subscribers not interested in these (very)
> technical details.

I'm interested! I hope you send more of these kinds of reports.

> When the collective parallel access mode is selected, all processors in
> a communicator have to call H5Dread() (or H5Dwrite()) the same number
> of times.
> 
> In nc_put_varm_*, NetCDF breaks data into contiguous segments that can
> be written one at a time (see NCDEFAULT_get_varm(...) in
> libdispatch/var.c, lines 479 and on). In some cases the number of
> these segments varies from one processor to the next.
> 
> As a result, as soon as one of the processors in a communicator is done
> writing its data, the program locks up, because now only a subset of
> processors in this communicator are calling H5Dwrite(). (Unless all
> processors have the same number of "data segments" to write, that is.)

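Oh, that's definitely a bug.  netcdf4 should call something like
MPI_Allreduce with MPI_MAX to figure out how many "rounds" of I/O will
be done (this is what we do inside ROMIO, for a slightly different
reason).  Processes that run out of data before the last round would
still make the collective call, just with nothing to write.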
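
A minimal sketch of that idea, not the actual netCDF or ROMIO code.
To keep it self-contained it does not use MPI: max_rounds() is a
stand-in for what MPI_Allreduce with MPI_MAX would hand back to every
process, and padded_calls() counts the collective calls one process
would make once it pads out to that number of rounds.  Both function
names and their parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for
 *   MPI_Allreduce(&mine, &rounds, 1, MPI_INT, MPI_MAX, comm);
 * returns the largest per-process segment count, i.e. the number of
 * I/O "rounds" every process must take part in. */
int max_rounds(const int *segments_per_rank, size_t nranks)
{
    assert(nranks > 0);
    int rounds = segments_per_rank[0];
    for (size_t r = 1; r < nranks; r++) {
        if (segments_per_rank[r] > rounds)
            rounds = segments_per_rank[r];
    }
    return rounds;
}

/* Hypothetical helper: how many collective calls one process makes.
 * Rounds beyond my_segments are counted in *empty_calls; in a real
 * fix these would still be collective H5Dwrite() calls, but with an
 * empty selection (for example one made with H5Sselect_none()). */
int padded_calls(int my_segments, int rounds, int *empty_calls)
{
    int calls = 0;
    *empty_calls = 0;
    for (int i = 0; i < rounds; i++) {
        if (i >= my_segments)
            (*empty_calls)++;   /* nothing left to write this round */
        calls++;                /* but the call is still made */
    }
    return calls;
}
```

With segment counts of 3, 1 and 2 on three processes, every process
ends up making 3 calls, so nobody is left waiting inside a collective
call that the others never enter.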

> But here's the thing: I'm not sure this is worth fixing. The only
> reason to use collective I/O I can think of is for better performance,
> and then avoiding sub-sampled and mapped reading and writing is a good
> idea anyway.

Well, if varm and vars are the natural way to access the data, then
the library should do what it can to do that efficiently.  The fix
appears to be straightforward.  Collective I/O has a lot of advantages
on some platforms: it can automatically select a subset of processors
to carry out the I/O, or automatically construct the file accesses
best suited to the underlying file system.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


