Everything looks good in the ncdump -hs output. The ncvalidator error is
expected, because ncvalidator only understands the classic netcdf-3 family
of formats, and this file is netCDF-4.
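For example, ncdump -k reports just the format variant without trying to
validate the file as classic netCDF:

    ncdump -k ndb.BS_COMPRESS0.005000_Q1
    netCDF-4
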
I am puzzled. This looks like the HDF5 layer lost a whole lot of file
space, but I don't see how. One straightforward thing to try is upgrading
to more recent versions of the netcdf and HDF5 libraries than the 4.4.1.1
and 1.8.18 shown in your header.
If that doesn't help, then to get more information, try replicating the
file with nccopy, h5copy, or h5repack.
https://portal.hdfgroup.org/display/HDF5/HDF5+Command-line+Tools
Use contiguous or chunked storage, but for testing purposes do not enable
any compression. The idea is that the writers in those tools should be
correctly optimized to rewrite those large char arrays without wasted
space, in case your own writer did something strange.
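For example (copy file names are illustrative; check h5repack --help for
the exact layout syntax in your version):

    # rewrite through the netcdf library, no compression
    nccopy ndb.BS_COMPRESS0.005000_Q1 ndb.nccopy.nc

    # rewrite through the hdf5 tools, forcing contiguous storage
    h5repack -l CONTI ndb.BS_COMPRESS0.005000_Q1 ndb.h5repack.nc

If either copy comes out near the expected ~4.5 GB, the extra space was
dead space in the original file rather than a necessary cost of the data.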
I suppose there could be a storage bug in the HDF5 or netcdf support
libraries. Your char arrays are uncommonly large, so they might have
triggered some sort of edge case.
I am refraining from suggesting low-level debugging because I do not want
to inflict pain. Otherwise, see if other readers have some ideas, or post
the question to the HDF5 users forum.
On Fri, May 1, 2020 at 5:40 PM Davide Sangalli <davide.sangalli@xxxxxx>
wrote:
> I also add
>
> ncvalidator ndb.BS_COMPRESS0.005000_Q1
> Error: Unknow file signature
> Expecting "CDF1", "CDF2", or "CDF5", but got "�HDF"
> File "ndb.BS_COMPRESS0.005000_Q1" fails to conform with CDF file format
> specifications
>
> Best,
> D.
>
> On 02/05/20 01:26, Davide Sangalli wrote:
>
> Output of ncdump -hs
>
> D.
>
> ncdump -hs BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_COMPRESS0.005000_Q1
>
> netcdf ndb.BS_COMPRESS0 {
> dimensions:
>         BS_K_linearized1 = 2025000000 ;
>         BS_K_linearized2 = 781887360 ;
>         complex = 2 ;
>         BS_K_compressed1 = 24776792 ;
> variables:
>         char BSE_RESONANT_COMPRESSED1_DONE(BS_K_linearized1) ;
>                 BSE_RESONANT_COMPRESSED1_DONE:_Storage = "contiguous" ;
>         char BSE_RESONANT_COMPRESSED2_DONE(BS_K_linearized1) ;
>                 BSE_RESONANT_COMPRESSED2_DONE:_Storage = "contiguous" ;
>         char BSE_RESONANT_COMPRESSED3_DONE(BS_K_linearized2) ;
>                 BSE_RESONANT_COMPRESSED3_DONE:_Storage = "contiguous" ;
>         float BSE_RESONANT_COMPRESSED1(BS_K_compressed1, complex) ;
>                 BSE_RESONANT_COMPRESSED1:_Storage = "contiguous" ;
>                 BSE_RESONANT_COMPRESSED1:_Endianness = "little" ;
>
> // global attributes:
>                 :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
>                 :_SuperblockVersion = 0 ;
>                 :_IsNetcdf4 = 1 ;
>                 :_Format = "netCDF-4" ;
> }
>
>
>
> On Sat, May 2, 2020 at 12:24 AM +0200, "Dave Allured - NOAA Affiliate" <
> dave.allured@xxxxxxxx> wrote:
>
>> I agree that you should expect the file size to be about 1 byte per stored
>> character. IMO the most likely explanation is that you have a netcdf-4
>> file with inappropriately small chunk size. Another possibility is a
>> 64-bit offset file with crazy huge padding between file sections. This is
>> very unlikely, but I do not know what is inside your writer code.
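>>
>> For illustration, here is a minimal netcdf-fortran sketch of where a
>> writer chooses the layout at definition time; the file and variable
>> names are made up, and this is not your code. If a writer requests
>> chunked storage with tiny chunks, per-chunk bookkeeping in the HDF5
>> layer can inflate the file:
>>
>>     program layout_sketch
>>       use netcdf
>>       implicit none
>>       integer :: ncid, dimid, varid
>>
>>       ! hypothetical file and variable names, for illustration only
>>       call check( nf90_create("layout_test.nc", &
>>                   ior(NF90_NETCDF4, NF90_CLOBBER), ncid) )
>>       call check( nf90_def_dim(ncid, "BS_K_linearized1", &
>>                   2025000000, dimid) )
>>       call check( nf90_def_var(ncid, "DONE_FLAGS", NF90_CHAR, &
>>                   (/ dimid /), varid) )
>>       ! request large chunks so per-chunk overhead stays negligible;
>>       ! NF90_CONTIGUOUS instead of NF90_CHUNKED asks for one unbroken block
>>       call check( nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, &
>>                   (/ 16777216 /)) )
>>       call check( nf90_close(ncid) )
>>     contains
>>       subroutine check(status)
>>         integer, intent(in) :: status
>>         if (status /= nf90_noerr) then
>>           print *, trim(nf90_strerror(status))
>>           stop 1
>>         end if
>>       end subroutine check
>>     end program layout_sketch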
>>
>> Diagnose, please, with ncdump -hs. If it is 64-bit offset, I think
>> ncvalidator can display the hidden pad sizes.
>>
>>
>> On Fri, May 1, 2020 at 3:37 PM Davide Sangalli <davide.sangalli@xxxxxx>
>> wrote:
>>
>>> Dear all,
>>> I'm a developer of a Fortran code which uses netCDF for I/O.
>>>
>>> In one of my runs I created a file with some huge arrays of characters.
>>> The header of the file is the following:
>>> netcdf ndb.BS_COMPRESS0 {
>>> dimensions:
>>>         BS_K_linearized1 = 2025000000 ;
>>>         BS_K_linearized2 = 781887360 ;
>>> variables:
>>>         char BSE_RESONANT_COMPRESSED1_DONE(BS_K_linearized1) ;
>>>         char BSE_RESONANT_COMPRESSED2_DONE(BS_K_linearized1) ;
>>>         char BSE_RESONANT_COMPRESSED3_DONE(BS_K_linearized2) ;
>>> }
>>>
>>> The variables are declared as nf90_char which, according to the
>>> documentation, should be 1 byte per element.
>>> Thus I would expect the total size of the file to be
>>> 1 byte * (2*2025000000 + 781887360) ~ 4.5 GB.
>>> Instead the file size is 16059445323 bytes ~ 14.96 GB, i.e. 10.46 GB
>>> more and a factor of 3.33 bigger.
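>>>
>>> Spelled out per variable:
>>>
>>>   BSE_RESONANT_COMPRESSED1_DONE: 2025000000 bytes
>>>   BSE_RESONANT_COMPRESSED2_DONE: 2025000000 bytes
>>>   BSE_RESONANT_COMPRESSED3_DONE:  781887360 bytes
>>>   expected total:                4831887360 bytes ~  4.5 GB
>>>   actual size:                  16059445323 bytes ~ 14.96 GB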
>>>
>>> This happens consistently if I consider the file:
>>> netcdf ndb {
>>> dimensions:
>>>         complex = 2 ;
>>>         BS_K_linearized1 = 2025000000 ;
>>>         BS_K_linearized2 = 781887360 ;
>>> variables:
>>>         float BSE_RESONANT_LINEARIZED1(BS_K_linearized1, complex) ;
>>>         char BSE_RESONANT_LINEARIZED1_DONE(BS_K_linearized1) ;
>>>         float BSE_RESONANT_LINEARIZED2(BS_K_linearized1, complex) ;
>>>         char BSE_RESONANT_LINEARIZED2_DONE(BS_K_linearized1) ;
>>>         float BSE_RESONANT_LINEARIZED3(BS_K_linearized2, complex) ;
>>>         char BSE_RESONANT_LINEARIZED3_DONE(BS_K_linearized2) ;
>>> }
>>> The float component should weigh ~36 GB while the char component should
>>> be identical to before, i.e. 4.5 GB, for a total of 40.5 GB.
>>> The file is instead ~50.96 GB, i.e. again 10.46 GB bigger than
>>> expected.
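>>>
>>> In detail:
>>>
>>>   floats: (2*2025000000 + 781887360) * 2 * 4 bytes = 38655098880 ~ 36.0 GB
>>>   chars:  (2*2025000000 + 781887360) * 1 byte      =  4831887360 ~  4.5 GB
>>>   total:                                             43486986240 ~ 40.5 GB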
>>>
>>> *Why?*
>>>
>>> My character variables contain something like
>>> "tnnnntnnnntnnnnnnnntnnnnnttnnnnnnnnnnnnnnnnt...",
>>> but the file size is already that large just after the file is created,
>>> i.e. before the variables are filled.
>>>
>>> A few details about the library, which was compiled against HDF5
>>> (hdf5-1.8.18) with parallel I/O support:
>>> Name: netcdf
>>> Description: NetCDF Client Library for C
>>> URL: http://www.unidata.ucar.edu/netcdf
>>> Version: 4.4.1.1
>>> Libs: -L${libdir} -lnetcdf -ldl -lm
>>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5hl_fortran.a
>>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5_fortran.a
>>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5_hl.a
>>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5.a
>>> -lz -lm -ldl -lcurl
>>> Cflags: -I${includedir}
>>>
>>> Name: netcdf-fortran
>>> Description: NetCDF Client Library for Fortran
>>> URL: http://www.unidata.ucar.edu/netcdf
>>> Version: 4.4.4
>>> Requires.private: netcdf > 4.1.1
>>> Libs: -L${libdir} -lnetcdff
>>> Libs.private: -L${libdir} -lnetcdff -lnetcdf
>>> Cflags: -I${includedir}
>>>
>>> Best,
>>> D.
>>> --
>>> Davide Sangalli, PhD
>>> CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX
>>> Centre
>>> Area della Ricerca di Roma 1, 00016 Monterotondo Scalo, Italy
>>> http://www.ism.cnr.it/en/davide-sangalli-cv/
>>> http://www.max-centre.eu/
>>>
>>