[netcdfgroup] bug report (nc_put_varm_double hangs during collective parallel I/O): a follow-up

Hello NetCDF developers,

My apologies to list subscribers not interested in these (very)
technical details.

So, here goes.

When the collective parallel access mode is selected, all processors in
a communicator have to call H5Dread() (or H5Dwrite()) the same number
of times.

In nc_put_varm_*, NetCDF breaks the data into contiguous segments that
can be written one at a time (see NCDEFAULT_get_varm(...) in
libdispatch/var.c, line 479 and onward). In some cases the number of
these segments varies from one processor to the next.

As a result, as soon as one of the processors in a communicator is done
writing its data, the program locks up, because now only a subset of
the processors in this communicator are calling H5Dwrite(). (Unless all
processors have the same number of "data segments" to write, that is.)
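To make the failure mode concrete, here is a minimal sketch of what the
per-segment loop boils down to under collective access (this is my own
illustration, not NetCDF's actual code; nsegments, dset, memspace,
filespace and buf are placeholders):

    /* Illustration only: a per-rank loop whose length differs between ranks. */
    hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);

    for (int i = 0; i < nsegments; i++) {   /* nsegments is rank-dependent */
        /* ... select the i-th contiguous segment in memspace/filespace ... */
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, buf);
    }
    /* A rank with fewer segments leaves the loop early and stops calling
       H5Dwrite(); the remaining ranks block inside the collective call. */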

As suggested above, this issue affects all of the nc_{get,put}_varm_*
functions, not only nc_put_varm_*. It also affects
nc_{get,put}_vars_* because they use the ...varm... calls internally.

I've found an example showing how to do collective parallel I/O in
HDF5 in the case when one of the processors has no data to read or
write; please see http://www.hdfgroup.org/hdf5-quest.html#par-nodata .
It looks like it might be useful.
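For reference, my reading of that FAQ entry is that a rank with nothing
to write still has to participate in the collective call, just with
empty selections. Roughly (my paraphrase, not the exact code from the
page; BUF_SIZE, nlocal, fstart, dset and buf are placeholders, and xfer
is the collective transfer property list from the sketch above):

    hsize_t bufdims[1] = { BUF_SIZE };              /* local buffer capacity */
    hid_t   memspace   = H5Screate_simple(1, bufdims, NULL);
    hid_t   filespace  = H5Dget_space(dset);

    if (nlocal == 0) {
        /* Nothing to write on this rank: empty selections, but the
           collective H5Dwrite() call is still made. */
        H5Sselect_none(memspace);
        H5Sselect_none(filespace);
    } else {
        hsize_t mstart[1] = { 0 }, mcount[1] = { nlocal };
        H5Sselect_hyperslab(memspace,  H5S_SELECT_SET, mstart, NULL, mcount, NULL);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, fstart, NULL, mcount, NULL);
    }
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, buf);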

But here's the thing: I'm not sure this is worth fixing. The only
reason I can think of to use collective I/O is better performance, and
in that case avoiding sub-sampled and mapped reads and writes is a good
idea anyway.

Maybe it's best to make the ...varm... and ...vars... functions always
use independent access and document this behavior?
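In the meantime, the workaround on the caller's side seems to be to
switch the variable to independent access before the mapped/strided
call, along these lines (error checking omitted; ncid, varid and the
start/count/stride/imap/data arguments are assumed to be set up already):

    #include <netcdf.h>
    #include <netcdf_par.h>

    /* Force independent access for this variable so the per-segment
       writes inside nc_put_varm_double() no longer need to be collective. */
    nc_var_par_access(ncid, varid, NC_INDEPENDENT);
    nc_put_varm_double(ncid, varid, start, count, stride, imap, data);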

I hope this helps. I'm going to stop playing with gdb now -- it's
time to move on to other issues!

-- 
Constantine

PS: I still think that the code snippet attached to my previous e-mail
shows the problem, even though that e-mail is a description of a
symptom and not the underlying problem.

cc help@xxxxxxxxxxxxx


