All,
I was wondering if anyone out there has encountered issues with NetCDF
4.3.2 built with Intel 15.0.0.090 (just released). I seem to have hit
one in our application, where we throw an FPE while writing a file.
To wit, I work on the GEOS-5 GCM, and our Baselibs build things like
HDF4, HDF5, netCDF, etc. for use with our code. Our normal build of
netCDF in the Baselibs is usually just a simple one, configured as:
netcdf.config : netcdf/configure
	@echo "Configuring netcdf $*"
	@(cd netcdf; \
	  export PATH="$(prefix)/bin:$(PATH)" ;\
	  export CPPFLAGS="$(CPPFLAGS) $(INC_SUPP)";\
	  export LIBS="-L$(prefix)/lib -lmfhdf -ldf -lsz -ljpeg $(LINK_GPFS) $(LIB_CURL) -lm" ;\
	  ./configure --prefix=$(prefix) \
	              --includedir=$(prefix)/include/netcdf \
	              --enable-hdf4 \
	              --enable-dap \
	              $(NC_PAR_TESTS) \
	              --disable-shared \
	              --enable-netcdf-4 \
	              CC=$(NC_CC) FC=$(NC_FC) CXX=$(NC_CXX) F77=$(NC_F77) )
In this case, since we build for Parallel HDF5, that means our CC=mpicc,
FC=mpif90, etc. I built two versions, both with Intel 15.0.0.090: one
using MVAPICH2 2.0 and one using Intel MPI 5.0.1.035. Both show the issue.
I did a "make check" with my two netcdf builds and they both passed most
of the tests (some dap tests fail, I think, because I'm on a compute
node with no outside internet access), so it must not be a simple failure.
So, my first thought was, well, let's add '-g -O0' and rebuild the
library and get to the bottom of this, and, of course, the code runs
just fine now! So, my guess is that it has something to do with the
optimizer.
Then, I built the library explicitly with "-g -O" and I get the same FPE
as before, so it seems as if the optimizer has done...something.
Totalview shows that when we go to write an output NC4 file we get an
FPE and the stack trace leads to var_create_dataset[1]:
var_create_dataset, FP=7fff42867cb0
write_var, FP=7fff42867d70
nc4_rec_write_metadata, FP=7fff42867de0
nc4_enddef_netcdf4_file, FP=7fff42867e00
NC4__enddef, FP=7fff42867e20
nc_enddef, FP=7fff42867e40
ncendef, FP=7fff42867e50
ncendf_, FP=7fff42867e60
cfio_create_, FP=7fff4286a900
esmf_cfiosdffilecreate, FP=7fff4286b470
esmf_cfiofilecreate, FP=7fff4286b4c0
and points to lines 1453-1454 of libsrc4/nc4hdf.c:
1449          /* Unlimited dim always gets chunksize of 1. */
1450          if (dim->unlimited)
1451             chunksize[d] = 1;
1452          else
1453             chunksize[d] = pow((double)DEFAULT_CHUNK_SIZE/type_size,
1454                                1/(double)(var->ndims - unlimdim));
1455
In Totalview, "type_size" is reported as 0, which makes
DEFAULT_CHUNK_SIZE/type_size a division by zero and is the likely source
of the FPE. Since type_size is determined from things within var, I
can't yet tell whether a struct is being clobbered or something else is
going on.
Has anyone else seen this? I suppose for now I can just point to the
debug-netcdf build so I can continue developing/testing with Intel 15
though I don't know what the cost of running netCDF at -O0 is.
Thanks,
Matt
[1] Yes, that does indeed say ncendf because this code has been around a
while in our model and no one has wanted to translate all the ancient
netcdf calls to actual modern ones for fear of breaking something
crucial. But, in the end, it still ends up calling the right routine.
--
Matt Thompson SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246