First of all, thanks for all the advice.
On 02/05/20 19:01, Dave Allured - NOAA Affiliate wrote:
There it is.
> DATASET "BSE_RESONANT_COMPRESSED1_DONE" {
> DATATYPE H5T_STRING {
> STRSIZE 1;
> STRPAD H5T_STR_NULLTERM;
> CSET H5T_CSET_UTF8;
> CTYPE H5T_C_S1;
Your char arrays are being stored as strings, not 1-byte characters.
This incurs overhead for each character.
Ok, I see from the other email that this is not the issue.
Dimension scales are using the extra space.
I'll try to generate the equivalent file on a much smaller simulation
and let you know if this also happens on a smaller dataset.
I gathered from Wei-Keng's answer that this could be a bug in the version of
netcdf I'm using and that I should upgrade to 4.7.4.
Is that right?
I am not familiar with the exact details of physical storage of HDF5
strings, but it doesn't matter. This scheme is inefficient, and you
should find something better.
I vaguely recall some changes in netcdf-4 character storage in recent
years. Since you are using an older version of the netcdf library,
first try the latest version.
Let me take the opportunity to ask a couple of questions.
The reason we have such a huge character variable is that we use it as
a control variable.
Basically we have simulations where a huge complex matrix is created and
filled.
Sometimes the code may crash during the simulation. The idea is that we
write a character "t" for each complex number stored to disk.
(We do not store the numbers one by one, but we cannot use blocks either, since
the way the matrix is blocked depends on the parallelization scheme.)
1) At first we thought: it would be great if, after the interrupted
run, a call to nf90_get_var could check which values are filled and which
are not.
Let's say my netcdf variable is
1.23423, 4.3452, 5.3453, 7.34534, _, _, _, ...
i.e. only the first 4 values were computed and we need to restart from
the 5th.
But we did not figure out a way. So the first question is: is there a way
to check that?
There are nf90_def_var_fill and nf90_inq_var_fill, but I'm not sure I
can use them.
Maybe we could use a control number for that:
say, first filling the matrix with zeros, but that means we have to
write the whole matrix to disk twice.
Maybe we should go back to this idea ...
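To make the idea concrete, here is a minimal Fortran sketch of the check I have
in mind (my own illustration, not code from our repository): reopen the
partially written file, ask for the fill value with nf90_inq_var_fill, and scan
for the first entry that still holds it. The file name "ndb_example.nc", the
variable name "work" and its length n are made up for the example.

  program check_restart
    use netcdf
    implicit none
    integer, parameter :: n = 1000        ! hypothetical variable length
    integer :: ncid, varid, no_fill, ierr, i, first_unwritten
    real    :: fillv
    real    :: vals(n)

    ierr = nf90_open("ndb_example.nc", nf90_nowrite, ncid)   ! hypothetical file
    ierr = nf90_inq_varid(ncid, "work", varid)

    ! Fill value in effect for this variable.  If no_fill == 1, unwritten
    ! entries are undefined and this check is not reliable.
    ierr = nf90_inq_var_fill(ncid, varid, no_fill, fillv)

    ! Read the data back and look for the first entry still equal to the
    ! fill value: that is where the interrupted run should restart.
    ierr = nf90_get_var(ncid, varid, vals)
    first_unwritten = n + 1
    do i = 1, n
      if (vals(i) == fillv) then
        first_unwritten = i
        exit
      end if
    end do

    ierr = nf90_close(ncid)
  end program check_restart

The exact comparison with the fill value is intentional, since unwritten entries
hold it bit for bit; the obvious caveat is that a legitimately computed value
could coincide with the fill value, which is why we would prefer an explicit
"done" flag.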
2) The second thought was: let's use a logical.
We need to store just 1 bit of information alongside each complex number
(64 bits in single precision), which is not too much.
But we found that netcdf does not have a 1-bit variable, just 1-byte ones, so we
ended up using nf90_char.
Otherwise, as I said elsewhere, go to 64-bit or CDF5 formats.
Here is another question. Our code started with netcdf.
Then we evolved to parallel I/O, and the only way we found to do that was via HDF5.
1) Is there any alternative?
2) A silly question for a netcdf mailing list: for the way we use it, netcdf is
essentially a layer on top of HDF5. We are discussing in our developer team
whether to simply drop netcdf and go straight to HDF5. Is there any reason why
we shouldn't do that, in your opinion?
If netcdf-4 is important to you for some reason, you might also
consider encoding your char data into signed or unsigned bytes.
Ok, I'll try using nf90_byte instead of nf90_char (is this what you
suggest?).
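In case it helps the discussion, here is a rough sketch of the bit-packing
variant we are also considering (again only an illustration, with made-up file
and variable names): pack 8 "done" flags into each nf90_byte element with the
standard Fortran bit intrinsics, so the bookkeeping variable costs one bit per
matrix element instead of one byte.

  program packed_flags
    use netcdf
    implicit none
    integer, parameter :: i1 = selected_int_kind(2)   ! 1-byte integer kind
    integer, parameter :: nflags = 64                 ! hypothetical number of matrix elements
    integer, parameter :: nbytes = (nflags + 7) / 8
    integer :: ncid, dimid, varid, ierr, i
    integer(kind=i1) :: flags(nbytes)

    ! Mark the first 10 elements as done: set bit mod(i-1,8) of byte (i-1)/8 + 1.
    flags = 0_i1
    do i = 1, 10
      flags((i-1)/8 + 1) = ibset(flags((i-1)/8 + 1), mod(i-1, 8))
    end do

    ierr = nf90_create("flags_example.nc", nf90_netcdf4, ncid)   ! hypothetical file
    ierr = nf90_def_dim(ncid, "flag_bytes", nbytes, dimid)
    ierr = nf90_def_var(ncid, "DONE_FLAGS", nf90_byte, (/ dimid /), varid)
    ierr = nf90_enddef(ncid)
    ierr = nf90_put_var(ncid, varid, flags)
    ierr = nf90_close(ncid)

    ! On restart: read the byte array back and test element i with
    ! btest(flags((i-1)/8 + 1), mod(i-1, 8)).
  end program packed_flags

The catch for our parallel writes is that two MPI ranks must not own bits inside
the same byte, otherwise their writes to that byte collide; with one flag per
byte (plain nf90_byte, as you suggest) that problem disappears.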
On Sat, May 2, 2020 at 10:38 AM Davide Sangalli
<davide.sangalli@xxxxxx> wrote:
h5stat -Ss ndb.BS_COMPRESS0.005000_Q1
Filename: ndb.BS_COMPRESS0.005000_Q1
Free-space section threshold: 1 bytes
Small size free-space sections (< 10 bytes):
Total # of small size sections: 0
Free-space section bins:
Total # of sections: 0
File space management strategy: H5F_FILE_SPACE_ALL
Summary of file space information:
File metadata: 4355 bytes
Raw data: 16356758312 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 6216 bytes
Total space: 16356768883 bytes
On Sat, May 2, 2020 at 6:28 PM +0200, "Davide Sangalli"
<davide.sangalli@xxxxxx> wrote:
h5dump -Hp ndb.BS_COMPRESS0.005000_Q1
HDF5 "ndb.BS_COMPRESS0.005000_Q1" {
GROUP "/" {
ATTRIBUTE "_NCProperties" {
DATATYPE H5T_STRING {
STRSIZE 57;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
DATASET "BSE_RESONANT_COMPRESSED1" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 24776792, 2 ) / ( 24776792, 2 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 198214336
OFFSET 16158554547
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE 9.96921e+36
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
}
}
DATASET "BSE_RESONANT_COMPRESSED1_DONE" {
DATATYPE H5T_STRING {
STRSIZE 1;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 2025000000
OFFSET 8100002379
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE ""
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
DATASET "BSE_RESONANT_COMPRESSED2_DONE" {
DATATYPE H5T_STRING {
STRSIZE 1;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 2025000000
OFFSET 10125006475
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE ""
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
DATASET "BSE_RESONANT_COMPRESSED3_DONE" {
DATATYPE H5T_STRING {
STRSIZE 1;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 781887360 ) / ( 781887360 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 781887360
OFFSET 15277557963
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE ""
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
DATASET "BS_K_compressed1" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 24776792 ) / ( 24776792 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 99107168
OFFSET 16059447379
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "CLASS" {
DATATYPE H5T_STRING {
STRSIZE 16;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "NAME" {
DATATYPE H5T_STRING {
STRSIZE 64;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "REFERENCE_LIST" {
DATATYPE H5T_COMPOUND {
H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
H5T_STD_I32LE "dimension";
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
DATASET "BS_K_linearized1" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 8100000000
OFFSET 2379
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "CLASS" {
DATATYPE H5T_STRING {
STRSIZE 16;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "NAME" {
DATATYPE H5T_STRING {
STRSIZE 64;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "REFERENCE_LIST" {
DATATYPE H5T_COMPOUND {
H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
H5T_STD_I32LE "dimension";
}
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
}
}
DATASET "BS_K_linearized2" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 781887360 ) / ( 781887360 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 3127549440
OFFSET 12150006475
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "CLASS" {
DATATYPE H5T_STRING {
STRSIZE 16;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "NAME" {
DATATYPE H5T_STRING {
STRSIZE 64;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "REFERENCE_LIST" {
DATATYPE H5T_COMPOUND {
H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
H5T_STD_I32LE "dimension";
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
DATASET "complex" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
STORAGE_LAYOUT {
CONTIGUOUS
SIZE 8
OFFSET 16059447371
}
FILTERS {
NONE
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_EARLY
}
ATTRIBUTE "CLASS" {
DATATYPE H5T_STRING {
STRSIZE 16;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "NAME" {
DATATYPE H5T_STRING {
STRSIZE 64;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "REFERENCE_LIST" {
DATATYPE H5T_COMPOUND {
H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
H5T_STD_I32LE "dimension";
}
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
}
}
On Sat, May 2, 2020 at 5:55 PM +0200, "Wei-Keng Liao"
<wkliao@xxxxxxxxxxxxxxxx> wrote:
For HDF5 files, the command “h5dump -Hp ndb.BS_COMPRESS0.005000_Q1” shows
the data chunk settings used by all datasets in the file.
The command “h5stat -Ss ndb.BS_COMPRESS0.005000_Q1” shows information about
free space, metadata, raw data, etc.
They may reveal why your file is abnormally big.
Most likely it is the chunk setting you used.
Wei-keng
> On May 1, 2020, at 6:40 PM, Davide Sangalli wrote:
>
> I also add
>
> ncvalidator ndb.BS_COMPRESS0.005000_Q1
> Error: Unknow file signature
> Expecting "CDF1", "CDF2", or "CDF5", but got "�HDF"
> File "ndb.BS_COMPRESS0.005000_Q1" fails to conform with CDF file format specifications
>
> Best,
> D.
>
> On 02/05/20 01:26, Davide Sangalli wrote:
>> Output of ncdump -hs
>>
>> D.
>>
>> ncdump -hs BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_COMPRESS0.005000_Q1
>>
>> netcdf ndb.BS_COMPRESS0 {
>> dimensions:
>> BS_K_linearized1 = 2025000000 ;
>> BS_K_linearized2 = 781887360 ;
>> complex = 2 ;
>> BS_K_compressed1 = 24776792 ;
>> variables:
>> char BSE_RESONANT_COMPRESSED1_DONE(BS_K_linearized1) ;
>> BSE_RESONANT_COMPRESSED1_DONE:_Storage = "contiguous" ;
>> char BSE_RESONANT_COMPRESSED2_DONE(BS_K_linearized1) ;
>> BSE_RESONANT_COMPRESSED2_DONE:_Storage = "contiguous" ;
>> char BSE_RESONANT_COMPRESSED3_DONE(BS_K_linearized2) ;
>> BSE_RESONANT_COMPRESSED3_DONE:_Storage = "contiguous" ;
>> float BSE_RESONANT_COMPRESSED1(BS_K_compressed1, complex) ;
>> BSE_RESONANT_COMPRESSED1:_Storage = "contiguous" ;
>> BSE_RESONANT_COMPRESSED1:_Endianness = "little" ;
>> // global attributes:
>> :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
>> :_SuperblockVersion = 0 ;
>> :_IsNetcdf4 = 1 ;
>> :_Format = "netCDF-4" ;
>>
>>
>>
>> On Sat, May 2, 2020 at 12:24 AM +0200, "Dave Allured - NOAA Affiliate" wrote:
>>
>> I agree that you should expect the file size to be about 1 byte per stored character. IMO the most likely explanation is that you have a netcdf-4 file with inappropriately small chunk size. Another possibility is a 64-bit offset file with crazy huge padding between file sections. This is very unlikely, but I do not know what is inside your writer code.
>>
>> Diagnose, please. Ncdump -hs. If it is 64-bit offset, I think ncvalidator can display the hidden pad sizes.
>>
>>
>> On Fri, May 1, 2020 at 3:37 PM Davide Sangalli wrote:
>> Dear all,
>> I'm a developer of a fortran code which uses netcdf for I/O
>>
>> In one of my runs I created a file with some huge array of characters.
>> The header of the file is the following:
>> netcdf ndb.BS_COMPRESS0 {
>> dimensions:
>> BS_K_linearized1 = 2025000000 ;
>> BS_K_linearized2 = 781887360 ;
>> variables:
>> char BSE_RESONANT_COMPRESSED1_DONE(BS_K_linearized1) ;
>> char BSE_RESONANT_COMPRESSED2_DONE(BS_K_linearized1) ;
>> char BSE_RESONANT_COMPRESSED3_DONE(BS_K_linearized2) ;
>> }
>>
>> The variable is declared as nf90_char which, according to the documentation, should be 1 byte per element.
>> Thus I would expect the total size of the file to be
>> 1 byte*(2*2025000000+781887360) ~ 4.5 GB.
>> Instead the file size is 16059445323 bytes ~ 14.96 GB, i.e.
>> 10.46 GB more and a factor of 3.33 bigger.
>>
>> This happens consistently if I consider the file
>> netcdf ndb {
>> dimensions:
>> complex = 2 ;
>> BS_K_linearized1 = 2025000000 ;
>> BS_K_linearized2 = 781887360 ;
>> variables:
>> float BSE_RESONANT_LINEARIZED1(BS_K_linearized1, complex) ;
>> char BSE_RESONANT_LINEARIZED1_DONE(BS_K_linearized1) ;
>> float BSE_RESONANT_LINEARIZED2(BS_K_linearized1, complex) ;
>> char BSE_RESONANT_LINEARIZED2_DONE(BS_K_linearized1) ;
>> float BSE_RESONANT_LINEARIZED3(BS_K_linearized2, complex) ;
>> char BSE_RESONANT_LINEARIZED3_DONE(BS_K_linearized2) ;
>> }
>> The float component should weigh ~36 GB while the char component
>> should be identical to before, i.e. 4.5 GB, for a total of 40.5 GB.
>> The file is instead ~50.96 GB, i.e. again 10.46 GB bigger than expected.
>>
>> Why?
>>
>> My character variables are something like
>> "tnnnntnnnntnnnnnnnntnnnnnttnnnnnnnnnnnnnnnnt..."
>> but the file size is already like that just after the file
>> creation, i.e. before filling it.
>>
>> A few details about the library, compiled linking against HDF5 (hdf5-1.8.18), with parallel I/O support:
>> Name: netcdf
>> Description: NetCDF Client Library for C
>> URL: http://www.unidata.ucar.edu/netcdf
>> Version: 4.4.1.1
>> Libs: -L${libdir} -lnetcdf -ldl -lm
>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5hl_fortran.a
>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5_fortran.a
>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5_hl.a
>> /nfs/data/bin/Yambo/gcc-8.1.0/openmpi-3.1.0/yambo_ext_libs/gfortran/mpifort/v4/parallel/lib/libhdf5.a
>> -lz -lm -ldl -lcurl
>> Cflags: -I${includedir}
>>
>> Name: netcdf-fortran
>> Description: NetCDF Client Library for Fortran
>> URL: http://www.unidata.ucar.edu/netcdf
>> Version: 4.4.4
>> Requires.private: netcdf > 4.1.1
>> Libs: -L${libdir} -lnetcdff
>> Libs.private: -L${libdir} -lnetcdff -lnetcdf
>> Cflags: -I${includedir}
>>
>> Best,
>> D.
>> --
>> Davide Sangalli, PhD
>> CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
>> Area della Ricerca di Roma 1, 00016 Monterotondo Scalo, Italy
>> http://www.ism.cnr.it/en/davide-sangalli-cv/
>> http://www.max-centre.eu/