NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
First, I would like to thank the HDF5 developers for providing this excellent library and documentation.

We use parallel HDF5 to store the results of our simulation. Our simulation code is an MPI application and is routinely run with a few hundred processors. The simulation is time consuming and can take weeks to finish. Currently we save the results in one file (i.e., every processor writes to the same file), one data group, and one dataset with an unlimited time dimension. Before saving each record, the time dimension is extended (a sketch of this pattern follows the message).

Recently, we had a hardware problem on one of the compute nodes and the simulation crashed. As a result, our HDF5 file was corrupted and we lost all the results of that simulation. This led me to wonder what the best practice is for using parallel HDF5, and I hope the list can provide some guidance:

In the event of a system crash, how can I prevent file corruption, and how can I minimize the loss of data? Should I flush the buffer after each output, close the dataset after each output, save each record in a new data group, or save each record in a new file? How much data loss should I expect in the worst case (e.g., if the system crashes during disk I/O)?

Thanks,

--
Eh Tan
Staff Scientist
Computational Infrastructure for Geodynamics
2750 E. Washington Blvd. Suite 210
Pasadena, CA 91107
(626) 395-1693
http://www.geodynamics.org
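[A minimal sketch of the append-and-flush pattern described in the message above, assuming HDF5 1.8-style calls, a 2-D dataset laid out as (time, space) with the time dimension unlimited, and each MPI rank writing a contiguous slice of every record. The names append_record and nx_local are illustrative, not from the original post; calling H5Fflush after each record is one of the mitigation options being asked about, not a confirmed fix.]

#include <hdf5.h>
#include <mpi.h>

/* Append one record along the unlimited (time) dimension, then flush.
 * Assumes the dataset was created with dims {0, nprocs*nx_local} and an
 * unlimited maximum on the first dimension. Names are illustrative. */
static void append_record(hid_t file, hid_t dset, const double *local_buf,
                          hsize_t nx_local, hsize_t step)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Extend the dataset by one record; collective in parallel HDF5. */
    hsize_t newdims[2] = { step + 1, (hsize_t)nprocs * nx_local };
    H5Dset_extent(dset, newdims);

    /* Each rank selects its own slice of the new record. */
    hid_t fspace = H5Dget_space(dset);
    hsize_t start[2] = { step, (hsize_t)rank * nx_local };
    hsize_t count[2] = { 1, nx_local };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(1, &nx_local, NULL);

    /* Collective data transfer. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, local_buf);

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);

    /* Flush buffers and metadata after each record; whether this is enough
     * to bound data loss after a crash is exactly the question above. */
    H5Fflush(file, H5F_SCOPE_GLOBAL);
}

[The routine would be called once per time step by all ranks.]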