
[netCDF #GQG-203630]: Problem in saving netcdf files



Hi Wuhu,

> Recently I have had a problem running the NCAR CESM (cesm1_0_3) model. It
> seems that the problem only occurs when the model is saving the
> restart files.
> 
> The modules I am using on the UK HECToR supercomputer
> (http://www.hector.ac.uk/) are:
> 
> 1) modules/3.2.6.6
> 2) nodestat/2.2-1.0400.29866.4.3.gem
> 3) sdb/1.0-1.0400.30000.6.18.gem
> 4) MySQL/5.0.64-1.0000.4667.20.1
> 5) lustre-cray_gem_s/1.8.4_2.6.32.45_0.3.2_1.0400.6221.1.1-1.0400.30252.1.29
> 6) udreg/2.3.1-1.0400.3911.5.6.gem
> 7) ugni/2.3-1.0400.3912.4.29.gem
> 8) gni-headers/2.1-1.0400.3906.5.1.gem
> 9) dmapp/3.2.1-1.0400.3965.10.12.gem
> 10) xpmem/0.1-2.0400.29883.4.6.gem
> 11) hss-llm/6.0.0
> 12) Base-opts/1.0.2-1.0400.29823.8.1.gem
> 13) xtpe-network-gemini
> 14) pbs/10.4.0.101257
> 15) packages-phase2b
> 16) usertools/1.0
> 17) budgets/1.0
> 18) pgi/11.6.0
> 19) totalview-support/1.1.2
> 20) xt-totalview/8.9.1
> 21) xt-libsci/11.0.00
> 22) pmi/2.1.2-1.0000.8396.13.5.gem
> 23) xt-asyncpe/5.00
> 24) atp/1.2.1
> 25) PrgEnv-pgi/4.0.30
> 26) xt-mpich2/5.3.1
> 27) xtpe-mc12
> 28) svn/1.6.2
> 29) hdf5/1.8.5.0
> 30) netcdf/4.1.1.0
> 
> 
> The debug information is shown below:
> 
> "
> Thread 1.1 received a signal (Floating Point Exception)
> d1.<> dwhere
> 0 ncx_put_float_double PC=0x015e2ec8, FP=0x7fffffff1a90
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc/ncx.c#1386]
> 1 ncx_putn_float_double PC=0x015e6488, FP=0x7fffffff1ad0
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc/ncx.c#5731]
> 2 putNCvx_float_double PC=0x015f3d15, FP=0x7fffffff1b60
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc/putget.c#2047]
> 3 putNCv_double PC=0x015f500a, FP=0x7fffffff1ba0
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc/putget.c#2594]
> 4 nc3_put_vara_double PC=0x015fbda5, FP=0x7fffffff1c10
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc/putget.c#5825]
> 5 nc4_put_vara_tc PC=0x015cabb6, FP=0x7fffffff1c50
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc4/nc4var.c#1839]
> 6 nc_put_vara_double PC=0x015cb2eb, FP=0x7fffffff1c90
>     [/ptmp/ulib/netcdf/4.1.1.0/source/libsrc4/nc4var.c#2106]
> 7 nf_put_vara_double_ PC=0x01615b1f, FP=0x7fffffff5d10
>     [/ptmp/ulib/netcdf/4.1.1.0/source/fortran/fort-varaio.c#151]
> 8 netcdf`nf90_put_var_1d_eightbytereal PC=0x01631f48, FP=0x7fffffff5de0
>     [/ptmp/ulib/netcdf/4.1.1.0/source/f90/netcdf_expanded.f90#1189]
> 9 pionfwrite_mod`write_nfdarray_double PC=0x00d6d769, FP=0x7fffffff61c0
>     [/esfs1/n02/n02/elfengwh/CESM1.0/ftn/ftn_fsdwfepmc/pio/pionfwrite_mod.F90#580]
> 10 piodarray`write_darray_nf_double PC=0x00db0ef9, FP=0x7fffffff65d0
>     [/esfs1/n02/n02/elfengwh/CESM1.0/ftn/ftn_fsdwfepmc/pio/piodarray.F90#2305]
> 11 piodarray`write_darray_1d_double PC=0x00da30a9, FP=0x7fffffff6650
>     [/esfs1/n02/n02/elfengwh/CESM1.0/ftn/ftn_fsdwfepmc/pio/piodarray.F90#392]
> 12 piodarray`write_darray_3d_double PC=0x00da5482, FP=0x7fffffff67b0
>     [/esfs1/n02/n02/elfengwh/CESM1.0/ftn/ftn_fsdwfepmc/pio/piodarray.F90#865]
> 13 cam_history`dump_field PC=0x004a7d15, FP=0x7fffffff6a80
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/atm/cam/src/control/cam_history.F90#4310]
> 14 cam_history`wshist PC=0x004aa16c, FP=0x7fffffff73c0
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/atm/cam/src/control/cam_history.F90#4609]
> 15 cam_history`write_restart_history PC=0x00485d34, FP=0x7fffffff8170
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/atm/cam/src/control/cam_history.F90#866]
> 16 cam_restart`cam_write_restart PC=0x004b2efd, FP=0x7fffffff8640
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/atm/cam/src/control/cam_restart.F90#251]
> 17 cam_comp`cam_run4 PC=0x00471b96, FP=0x7fffffff8680
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/atm/cam/src/control/cam_comp.F90#390]
> 18 atm_comp_mct`atm_run_mct PC=0x0046afb5, FP=0x7fffffff8760
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/atm/cam/src/cpl_mct/atm_comp_mct.F90#523]
> 19 ccsm_comp_mod`ccsm_run PC=0x00412d84, FP=0x7fffffff9a60
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/drv/driver/ccsm_comp_mod.F90#2165]
> 20 ccsm_driver PC=0x004164e9, FP=0x7fffffff9a70
>     [/work/n02/n02/elfengwh/CESM1.0/cesm1_0_3/models/drv/driver/ccsm_driver.F90#47]
> 21 main PC=0x0040050b, FP=0x7fffffff9a90 [ccsm.exe]
> 22 __libc_start_main PC=0x01a256a0, FP=0x7fffffff9b50
>     [/usr/src/packages/BUILD/glibc-2.11.1/csu/libc-start.c#226]
> 23 _start PC=0x004003e0, FP=0x7fffffff9b60 [ccsm.exe]

It looks like a floating-point overflow is occurring when converting a
double-precision value to a single-precision floating-point value just
before converting it to the portable form (XDR) for writing to disk.

That might happen if you are writing out an array of doubles to a
netCDF variable that was declared as type NF_FLOAT (a 32-bit float), and
one of the double values is too large to fit in a float, that is,
larger in magnitude than about 3.4e+38.  For example, it might be a
missing-value sentinel well above the default netCDF fill value of
9.9692099683868690e+36, which itself still fits in a float.

I see you are using a Cray, so another possibility might be something
similar to this problem:

  https://www.myroms.org/projects/src/ticket/217

We don't have a Cray to test on, so we can't duplicate the problem
here.  I'd be curious if "make check" ran successfully for netCDF
4.1.1 for the Cray installation you're using, as it has some tests for
extreme floating-point values to make sure the netCDF library handles
them according to the documentation.  The current netCDF library is
version 4.1.3, but I believe there are no fixes in the current release
for floating-point bugs that would have any effect on the problem you
are seeing.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: GQG-203630
Department: Support netCDF
Priority: Normal
Status: Closed