[netcdfgroup] Intel 15 and netCDF 4.3.2: Issues with Optimization?

All,

I was wondering if anyone out there has encountered issues with netCDF 4.3.2 and Intel 15.0.0.090 (just released). I seem to have hit one in our application, where we throw an FPE while writing a file.

To wit, I work on the GEOS-5 GCM, and our Baselibs build things like HDF4, HDF5, netCDF, etc. for use with our code. Our normal build of netCDF in the Baselibs is usually just a simple one, configured as:

netcdf.config : netcdf/configure
   @echo "Configuring netcdf $*"
   @(cd netcdf; \
          export PATH="$(prefix)/bin:$(PATH)" ;\
          export CPPFLAGS="$(CPPFLAGS) $(INC_SUPP)";\
          export LIBS="-L$(prefix)/lib -lmfhdf -ldf -lsz -ljpeg $(LINK_GPFS) $(LIB_CURL) -lm" ;\
          ./configure --prefix=$(prefix) \
                      --includedir=$(prefix)/include/netcdf \
                      --enable-hdf4 \
                      --enable-dap \
                      $(NC_PAR_TESTS) \
                      --disable-shared \
                      --enable-netcdf-4 \
                      CC=$(NC_CC) FC=$(NC_FC) CXX=$(NC_CXX) F77=$(NC_F77) )

In this case, since we build against parallel HDF5, CC=mpicc, FC=mpif90, etc. I built two versions, both with Intel 15.0.0.090, one with MVAPICH2 2.0 and one with Intel MPI 5.0.1.035, and both show the issue.

I did a "make check" with my two netcdf builds and they both passed most of the tests (some dap tests fail, I think, because I'm on a compute node where no outside internet is seen) so it must not be a simple fail.

So my first thought was: let's add '-g -O0', rebuild the library, and get to the bottom of this. And, of course, the code now runs just fine! So my guess is that it has something to do with the optimizer.
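(For reference, the -O0 rebuild was nothing fancier than overriding the optimization flags at configure time. The exact plumbing through our Baselibs makefile differs a bit, but it amounts to roughly this, with the same environment exports as above:

   export CFLAGS="-g -O0" ; export FCFLAGS="-g -O0" ; export FFLAGS="-g -O0" ; export CXXFLAGS="-g -O0"
   ./configure --prefix=$(prefix) --includedir=$(prefix)/include/netcdf \
               --enable-hdf4 --enable-dap --disable-shared --enable-netcdf-4 \
               CC=$(NC_CC) FC=$(NC_FC) CXX=$(NC_CXX) F77=$(NC_F77)

configure picks CFLAGS/FCFLAGS/FFLAGS/CXXFLAGS up from the environment, so nothing else had to change.)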

Then I built the library explicitly with "-g -O" and I get the same FPE as before, so it seems the optimizer has done... something. Totalview shows that when we go to write an output NC4 file we get an FPE, and the stack trace leads to var_create_dataset [1]:

var_create_dataset, FP=7fff42867cb0
write_var,       FP=7fff42867d70
nc4_rec_write_metadata, FP=7fff42867de0
nc4_enddef_netcdf4_file, FP=7fff42867e00
NC4__enddef,     FP=7fff42867e20
nc_enddef,       FP=7fff42867e40
ncendef,         FP=7fff42867e50
ncendf_,         FP=7fff42867e60
cfio_create_,    FP=7fff4286a900
esmf_cfiosdffilecreate, FP=7fff4286b470
esmf_cfiofilecreate, FP=7fff4286b4c0

and points to lines 1453-1454 of libsrc4/nc4hdf.c:

1449                         /* Unlimited dim always gets chunksize of 1. */
1450                         if (dim->unlimited)
1451                            chunksize[d] = 1;
1452                         else
1453                            chunksize[d] = pow((double)DEFAULT_CHUNK_SIZE/type_size,
1454                                               1/(double)(var->ndims - unlimdim));
1455

In Totalview I see that "type_size" is 0, which of course will do bad things and is probably what triggers the FPE. Since type_size is determined from fields within var, who knows whether a struct got clobbered or what.
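To make the failure mode concrete, here's a tiny standalone sketch of that expression (this is not netCDF or GEOS-5 code; the DEFAULT_CHUNK_SIZE value, the dimension counts, and the glibc feenableexcept() trap are just stand-ins for what our build and runs do): with type_size == 0 the division yields inf and raises FE_DIVBYZERO, which turns into a fatal SIGFPE once FP traps are enabled, as they are in our runs.

    /* Stand-alone sketch of the expression at nc4hdf.c:1453-1454.
     * All numbers here are stand-ins; the point is only that a
     * type_size of 0 makes the division blow up under FP trapping. */
    #define _GNU_SOURCE
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        (void)argv;
        const double default_chunk_size = 4194304.0; /* stand-in for DEFAULT_CHUNK_SIZE */
        size_t type_size = (size_t)(argc - 1);       /* 0 when run with no arguments,
                                                        i.e. what Totalview reports */
        int ndims = 3, unlimdim = 1;                 /* made-up dimension counts */

        /* Our runs trap FP exceptions; without this the pow() would
         * just quietly return inf. */
        feenableexcept(FE_DIVBYZERO | FE_INVALID);

        double chunksize = pow(default_chunk_size / (double)type_size,
                               1.0 / (double)(ndims - unlimdim));  /* SIGFPE here */

        printf("chunksize = %f\n", chunksize);
        return 0;
    }

Built with "cc -O0 sketch.c -lm" and run with no arguments, this dies at the division. With a sane type_size (run it with one argument so type_size is 1, or picture 4 for a float) the same formula is perfectly happy, so the zero really is the culprit, not the formula itself.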

Has anyone else seen this? I suppose for now I can just point to the debug netCDF build so I can keep developing/testing with Intel 15, though I don't know what the cost of running netCDF at -O0 is.

Thanks,
Matt

[1] Yes, that does indeed say ncendf: this code has been around a while in our model, and no one has wanted to translate all the ancient netCDF calls to modern ones for fear of breaking something crucial. In the end, though, it still winds up in the call it needs.


--
Matt Thompson          SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712              Fax: 301-614-6246


