
[netCDF #VUY-537245]: Segmentation fault in getvar for large files



Vladimir,

> It seems that after 1 month of recompiling different HDF and netCDF versions
> and searching for errors in my code, I noticed that SOME random supercomputer
> nodes have a small stack size (ulimit -s),
> and the program crashes on them.
> So netCDF still works great and this is not your bug))
> But of course, it would be great to print an error message like "Insufficient
> memory" when, for example, the stack size is too small.
> It would help users debug their code in such cases.

Thanks for reporting the cause for the problem!

We'll have to consider (and maybe research) how to detect when the shell
provides insufficient stack space.  I'm currently not sure how the netCDF
library could determine when that will be a problem before a segmentation
violation indicates the stack space has been exceeded.
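
As a starting point, here is a minimal sketch, assuming a POSIX system, of how
a program could query the soft stack limit (the value "ulimit -s" reports, in
bytes rather than kilobytes) at startup and warn if it looks small.  The
function names and the threshold are hypothetical; nothing like this is part
of the netCDF API, and the library cannot know in advance how much stack a
given read will actually need, so the check only compares the limit against
a guess supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper, not part of netCDF.
 * Returns 1 if the soft stack limit is at least min_bytes (or unlimited),
 * 0 if it is smaller, and -1 if the limit could not be queried. */
int stack_limit_ok(rlim_t min_bytes)
{
    struct rlimit rl;

    assert(min_bytes > 0);              /* a zero threshold is meaningless */
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)   /* "ulimit -s unlimited" */
        return 1;
    return rl.rlim_cur >= min_bytes ? 1 : 0;
}

/* Hypothetical helper: prints a warning to stderr when the soft stack
 * limit is below min_bytes.  Returns the result of stack_limit_ok(). */
int warn_if_stack_small(rlim_t min_bytes)
{
    int ok = stack_limit_ok(min_bytes);

    if (ok == 0)
        fprintf(stderr,
                "warning: soft stack limit is below %llu bytes; "
                "large reads may crash on this node\n",
                (unsigned long long)min_bytes);
    else if (ok < 0)
        fprintf(stderr, "warning: could not query the stack limit\n");
    return ok;
}
```

In a parallel job, a call such as warn_if_stack_small(64UL * 1024 * 1024)
would have to run in every process, since the limit can differ from node to
node; the 64 MB figure is only an illustrative guess.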

--Russ

> On 5 ???????? 2013, 10:52 -06:00, "Unidata netCDF Support"
> <address@hidden> wrote:
> >Hi Vladimir,
> >
> >> I have used the netCDF library (parallel, Fortran version) for about
> >> 4 years and it works perfectly!
> >>
> >> But on the new supercomputer I started to get a very strange error.
> >> I searched for it in my program, and the investigation resulted in a small
> >> test program, which creates a parallel communicator and sequentially reads
> >> a rather big file (a 16 bil integer field in a 540 MB file).
> >>
> >> It crashes with a segmentation fault in NF90_GET_VAR at random.  For
> >> example, the first run is normal, and the second and third result in a
> >> seg fault.
> >>
> >> Interestingly, for small files (for example, 1 bil integers in a 200
> >> MB file) it works perfectly.
> >>
> >> The admin can't give me any helpful answers.
> >>
> >> Could you please suggest where the problem is?
> >
> >We would need to be able to duplicate the problem here, and I think that
> >would require access to the file you are reading or to a program that
> >creates it.  Even the output from "ncdump -h" might be sufficient for us
> >to generate an example that would demonstrate the problem.
> >
> >Since the Unidata developer who knew the most about parallel I/O took a
> >different job a couple of years ago, we've been lacking in parallel I/O
> >knowledge and experience, so sometimes we have to forward questions to
> >NCAR's consultants instead.
> >
> >There have been a few bug fixes since netCDF-4.1.3 that relate to parallel
> >I/O, including both the Fortran library (in netcdf-fortran-4.2) and in the
> >netCDF-C library it calls (netCDF-4.3.0 or netCDF-4.3.1-rc2 using
> >HDF5-1.8.11).
> >
> >> Tell me which commands I should run to provide you with information
> >> about the supercomputer and the installed netCDF.
> >
> >It would be useful to know what version of HDF5 you are using, and whether
> >you tried netCDF-fortran-4.2 with the netCDF C library version 4.3.0 or
> >later.
> >
> >It might be interesting to know what supercomputer and operating system
> >you are using (the output from "uname -a"), but we currently only have
> >easy access to Linux, Mac, Solaris, and Windows systems on which to test.
> >
> >--Russ
> >
> >Russ Rew                                         UCAR Unidata Program
> >address@hidden http://www.unidata.ucar.edu
> >
> >
> >
> >Ticket Details
> >===================
> >Ticket ID: VUY-537245
> >Department: Support netCDF
> >Priority: Normal
> >Status: Closed
> >
> 
> 
> --
> vova kalmykov
> 
Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: VUY-537245
Department: Support netCDF
Priority: Normal
Status: Closed