Re: [netcdfgroup] bug report (nc_put_varm_double hangs during collective parallel I/O): a follow-up

  • To: Dennis Heimbigner <dmh@xxxxxxxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] bug report (nc_put_varm_double hangs during collective parallel I/O): a follow-up
  • From: Constantine Khroulev <ckhroulev@xxxxxxxxxx>
  • Date: Tue, 26 Feb 2013 08:08:46 -0900
Dennis and others:

Let me know if I can help.

It was a long time ago, but I'm pretty sure I still understand what
was going on there.

-- 
Constantine

On Tue, Feb 26, 2013 at 7:08 AM, Rob Latham <robl@xxxxxxxxxxx> wrote:
> On Fri, Feb 22, 2013 at 01:45:44PM -0700, Dennis Heimbigner wrote:
>> I recently rewrote nc_get/put_vars to no longer
>> use varm, so it may be time to revisit this issue.
>> What confuses me is that the varm code
>> writes one instance of the variable on each pass
>> (e.g. v[0], v[1], v[2], ...), so I am not sure
>> how it could ever end up doing a different number
>> of writes on different processors.
>> Can the original person (Rob?) give me more details?
>
> I'm not the original person.  Constantine Khroulev provided a nice
> testcase last January (netcdf_parallel_2d.c).
>
> I just pulled netcdf4 from SVN (r2999) and built it with hdf5-1.8.10.
>
> Constantine Khroulev's test case hangs (though in a different place
> than a year ago...):
>
> Today, that test case hangs with one process in a testcase-level
> barrier (netcdf_parallel_2d.c:134) and one process stuck in
> nc4_enddef_netcdf4_file trying to flush data.
>
> This test case demonstrates the problem nicely.  Take a peek at it and
> double-check that the test case is correct, but you've got a nice driver to
> find and fix this bug.
>
> ==rob
>
>>
>> =Dennis Heimbigner
>>  Unidata
>>
>> Orion Poplawski wrote:
>> >On 01/27/2012 01:22 PM, Rob Latham wrote:
>> >>On Wed, Jan 25, 2012 at 10:06:59PM -0900, Constantine Khroulev wrote:
>> >>>Hello NetCDF developers,
>> >>>
>> >>>My apologies to list subscribers not interested in these (very)
>> >>>technical details.
>> >>
>> >>I'm interested!   I hope you send more of these kinds of reports.
>> >>
>> >>>When the collective parallel access mode is selected, all processors in
>> >>>a communicator have to call H5Dread() (or H5Dwrite()) the same number
>> >>>of times.
>> >>>
>> >>>In nc_put_varm_*, NetCDF breaks data into contiguous segments that can
>> >>>be written one at a time (see NCDEFAULT_get_varm(...) in
>> >>>libdispatch/var.c, lines 479 and on). In some cases the number of
>> >>>these segments varies from one processor to the next.
>> >>>
>> >>>As a result, as soon as one of the processors in a communicator is done
>> >>>writing its data, the program locks up, because now only a subset of the
>> >>>processors in this communicator are calling H5Dwrite(). (Unless all
>> >>>processors have the same number of "data segments" to write, that is.)
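>> >>>
>> >>>To make this concrete, here is a minimal sketch of the pattern (the
>> >>>segment counts are made up, and MPI_Barrier only stands in for the
>> >>>collective H5Dwrite() call):
>> >>>
>> >>>  #include <mpi.h>
>> >>>
>> >>>  int main(int argc, char **argv)
>> >>>  {
>> >>>      int rank, i, nsegments;
>> >>>
>> >>>      MPI_Init(&argc, &argv);
>> >>>      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >>>
>> >>>      /* Pretend rank 0 ends up with one more contiguous segment
>> >>>       * than everyone else, as nc_put_varm_* can produce. */
>> >>>      nsegments = (rank == 0) ? 3 : 2;
>> >>>
>> >>>      for (i = 0; i < nsegments; i++) {
>> >>>          /* Stand-in for a collective H5Dwrite(): every rank in
>> >>>           * the communicator has to reach this call. */
>> >>>          MPI_Barrier(MPI_COMM_WORLD);
>> >>>      }
>> >>>
>> >>>      /* Rank 0 is now blocked in its third "collective write",
>> >>>       * which the other ranks never make, so the program hangs. */
>> >>>      MPI_Finalize();
>> >>>      return 0;
>> >>>  }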
>> >>
>> >>Oh, that's definitely a bug.  netcdf4 should call something like
>> >>MPI_Allreduce with MPI_MAX to figure out how many "rounds" of I/O will
>> >>be done (this is what we do inside ROMIO, for a slightly different
>> >>reason).
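>> >>
>> >>Something like this sketch (made-up segment counts, with MPI_Barrier
>> >>standing in for the collective H5Dwrite(); a real fix would issue an
>> >>empty, zero-sized selection for the extra rounds):
>> >>
>> >>  #include <mpi.h>
>> >>  #include <stdio.h>
>> >>
>> >>  int main(int argc, char **argv)
>> >>  {
>> >>      int rank, i, my_segments, max_segments;
>> >>
>> >>      MPI_Init(&argc, &argv);
>> >>      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >>
>> >>      my_segments = (rank == 0) ? 3 : 2;
>> >>
>> >>      /* Agree up front on how many collective rounds there will be. */
>> >>      MPI_Allreduce(&my_segments, &max_segments, 1, MPI_INT,
>> >>                    MPI_MAX, MPI_COMM_WORLD);
>> >>
>> >>      for (i = 0; i < max_segments; i++) {
>> >>          /* Every rank participates in every round; a rank that has
>> >>           * run out of segments would pass an empty selection to
>> >>           * H5Dwrite() instead of skipping the call. */
>> >>          MPI_Barrier(MPI_COMM_WORLD);
>> >>          if (i < my_segments)
>> >>              printf("rank %d wrote segment %d\n", rank, i);
>> >>      }
>> >>
>> >>      MPI_Finalize();
>> >>      return 0;
>> >>  }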
>> >>
>> >>>But here's the thing: I'm not sure this is worth fixing. The only
>> >>>reason I can think of to use collective I/O is better performance,
>> >>>and in that case avoiding sub-sampled and mapped reads and writes is
>> >>>a good idea anyway.
>> >>
>> >>Well, if varm and vars are the natural way to access the data, then
>> >>the library should do what it can to support that efficiently.  The fix
>> >>appears to be straightforward.  Collective I/O has a lot of advantages
>> >>on some platforms: it will automatically select a subset of processors
>> >>to do the I/O, or automatically construct a file access pattern best
>> >>suited to the underlying file system.
>> >>
>> >>==rob
>> >>
>> >
>> >Was this ever fixed?
>> >
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit: 
> http://www.unidata.ucar.edu/mailing_lists/


