Re: [netcdfgroup] nf90_char size

One last important remark.
The expected file size (see my first message) should be ~37 GB.

"ls" file size:
ls -lhtr BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_PAR_Q1
-rw-r--r-- 1 sangalli sangalli 47G May 15 12:43 BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_PAR_Q1

"du" file size:
du -sh BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_PAR_Q1
37G BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_PAR_Q1

See also here:
https://superuser.com/questions/1284942/why-is-there-a-huge-discrepancy-between-file-sizes-reported-by-du-versus-ls/1284955

Indeed:
ls -lshtr BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_PAR_Q1
37G -rw-r--r-- 1 sangalli sangalli 47G May 15 12:46 BSK_2-5B_X59RL-50B_SP_bse-io/ndb.BS_PAR_Q1
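
The ~10 GB gap between the apparent size (ls, 47G) and the allocated size (du, 37G) would be consistent with the two hidden dimension datasets reported by h5dump below (8,100,000,000 + 3,127,549,440 bytes, about 10.5 GiB): they appear in the file layout but are never written, so the filesystem keeps them as holes, while du counts only the ~37G of variable data actually written. The same ls/du discrepancy can be reproduced with plain Fortran stream I/O; a minimal sketch (the file name is just an example, and the hole requires a filesystem with sparse-file support):

program make_sparse
   implicit none
   integer :: u
   ! Open a fresh stream-access file and write a single byte ~1 GB in;
   ! the skipped region becomes a hole on sparse-file-capable filesystems.
   open(newunit=u, file='sparse_test.bin', access='stream', &
        form='unformatted', status='replace')
   write(u, pos=1000000001) achar(0)
   close(u)
   ! Afterwards:  ls -l sparse_test.bin  -> ~1 GB apparent size
   !              du -h sparse_test.bin  -> a few KB actually allocated
end program make_sparse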

D.

On 15/05/20 11:47, Davide Sangalli wrote:
I should also add:
the problem remains even if I do the I/O for just the real variable,
i.e. my file now is

From ncdump:
netcdf ndb {
dimensions:
    complex = 2 ;
    BS_K_linearized1 = 2025000000 ;
    BS_K_linearized2 = 781887360 ;
variables:
    float BSE_RESONANT_LINEARIZED1(BS_K_linearized1, complex) ;
    float BSE_RESONANT_LINEARIZED2(BS_K_linearized1, complex) ;
    float BSE_RESONANT_LINEARIZED3(BS_K_linearized2, complex) ;
}

From h5dump:
Variables (this is OK):
   DATASET "BSE_RESONANT_LINEARIZED1"  SIZE 16200000000  DATASPACE SIMPLE { ( 2025000000, 2 ) / ( 2025000000, 2 ) }
   DATASET "BSE_RESONANT_LINEARIZED2"  SIZE 16200000000  DATASPACE SIMPLE { ( 2025000000, 2 ) / ( 2025000000, 2 ) }
   DATASET "BSE_RESONANT_LINEARIZED3"  SIZE 6255098880   DATASPACE SIMPLE { ( 781887360, 2 ) / ( 781887360, 2 ) }

Dimensions (this is bad): each dimension is stored as an array whose length equals its own value
   DATASET "BS_K_linearized1"  SIZE 8100000000  DATASPACE SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
   DATASET "BS_K_linearized2"  SIZE 3127549440  DATASPACE SIMPLE { ( 781887360 ) / ( 781887360 ) }
   DATASET "complex"           SIZE 8           DATASPACE SIMPLE { ( 2 ) / ( 2 ) }

It seems to me that, in the HDF5 representation, the value of each netCDF dimension is used as the size of a hidden dataset (?).
The dimension is created with the Fortran call
ierr = nf90_def_dim(io_unit, dim_name, dim_value, dim_id)
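
For reference, a minimal sketch of the define-mode sequence that produces a file like the one dumped above (names and sizes copied from the ncdump output, error checking omitted); if the guess above is right, each nf90_def_dim also creates a hidden HDF5 dataset of that length:

program def_sketch
   use netcdf
   implicit none
   integer :: ierr, ncid, dim_cplx, dim_lin1, dim_lin2, varid

   ierr = nf90_create('ndb_sketch.nc', NF90_NETCDF4, ncid)
   ! Each dimension seems to become a hidden dataset of this length.
   ierr = nf90_def_dim(ncid, 'complex',          2,          dim_cplx)
   ierr = nf90_def_dim(ncid, 'BS_K_linearized1', 2025000000, dim_lin1)
   ierr = nf90_def_dim(ncid, 'BS_K_linearized2', 781887360,  dim_lin2)
   ! Fortran dimension order is reversed with respect to the CDL above:
   ! the fastest-varying dimension ("complex") comes first.
   ierr = nf90_def_var(ncid, 'BSE_RESONANT_LINEARIZED1', NF90_FLOAT, &
                       (/ dim_cplx, dim_lin1 /), varid)
   ierr = nf90_enddef(ncid)
   ierr = nf90_close(ncid)
end program def_sketch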


D.


On 15/05/20 11:12, Davide Sangalli wrote:
Dear all,
the problem remains even after moving to the latest versions of the libraries:

pkgname_netcdf=netcdf-c-4.7.4
pkgname_netcdff=netcdf-fortran-4.5.2
pkgname_pnetcdf=pnetcdf-1.12.1
pkgname_hdf5=hdf5-1.12.0

Moreover, I noticed differences between running in serial and running in parallel.
(I interrupted the two runs, so it may be that the I/O was not over.)
Below, BS_K_linearized1 should just be a number (a netCDF dimension).

SERIAL:
   DATASET "BS_K_linearized1" {
      DATATYPE  H5T_IEEE_F32BE
      DATASPACE  SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 0
         OFFSET 18446744073709551615
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }

PARALLEL:
   DATASET "BS_K_linearized1" {
      DATATYPE  H5T_IEEE_F32BE
      DATASPACE  SIMPLE { ( 2025000000 ) / ( 2025000000 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 8100000000
         OFFSET 2387
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_EARLY
      }
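
If I read the two dumps correctly, this is expected HDF5 behaviour: parallel HDF5 requires space to be allocated when a dataset is created (H5D_ALLOC_TIME_EARLY, hence STORAGE_LAYOUT SIZE 8100000000 even though nothing was written), while in serial the allocation is deferred until the first write (H5D_ALLOC_TIME_LATE, hence SIZE 0). On the netCDF side the only difference is how the file is created; a minimal sketch, with flags and names as I understand them for netcdf-fortran 4.5.2:

program create_modes
   use mpi
   use netcdf
   implicit none
   integer :: ierr, ncid

   call MPI_Init(ierr)

   ! Serial create: contiguous datasets are allocated lazily, so
   ! never-written arrays show up as SIZE 0 in h5dump.
   ierr = nf90_create('ndb_serial.nc', NF90_NETCDF4, ncid)
   ierr = nf90_close(ncid)

   ! Parallel create: parallel HDF5 forces early allocation, so every
   ! dataset gets its full size on disk at creation time.
   ierr = nf90_create('ndb_parallel.nc', ior(NF90_NETCDF4, NF90_MPIIO), &
                      ncid, comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
   ierr = nf90_close(ncid)

   call MPI_Finalize(ierr)
end program create_modes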


I also tried to move to PnetCDF, but I am having some issues with it for now.

Best,
D.




On 03/05/20 19:06, Dennis Heimbigner wrote:
One reason to use netCDF over HDF5 is the fact that the
netCDF API is much simpler than HDF5's; the HDF5 API
is some six times larger than the netCDF API.

On 5/3/2020 1:42 AM, Davide Sangalli wrote:
Thanks again.

I'll have a look at PnetCDF.
Another reason why we moved toward HDF5 was that, as far as I know, it can exploit different levels of the memory hierarchy in HPC simulations. Can PnetCDF do that as well?

Besides that, I'd really like some hints on why netCDF could be better than HDF5, or vice versa. Please do your worst.

As for NF90_UNLIMITED, we are already using it in time-dependent simulations, in a way similar to the one you suggest. In the present case, instead, I'm just filling a huge complex matrix. The interruption usually happens because of limits on the simulation wall time, so I'd really need to check which elements were filled and which were not, without having any clue about the status.
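
One way to check which elements were filled might be the netCDF fill value: elements never written keep NF90_FILL_FLOAT, provided the writer did not switch to NF90_NOFILL mode. A minimal sketch (the variable name is taken from the dumps above; the slab size is arbitrary):

program check_filled
   use netcdf
   implicit none
   integer, parameter :: n = 1000000   ! slab length, just an example
   integer :: ierr, ncid, varid
   real    :: slab(2, n)

   ierr = nf90_open('ndb.BS_PAR_Q1', NF90_NOWRITE, ncid)
   ierr = nf90_inq_varid(ncid, 'BSE_RESONANT_LINEARIZED1', varid)
   ierr = nf90_get_var(ncid, varid, slab, start=(/1, 1/), count=(/2, n/))
   ! Unwritten elements still hold the default fill value.
   print *, 'written elements in slab:', count(slab /= NF90_FILL_FLOAT)
   ierr = nf90_close(ncid)
end program check_filled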

Since you mentioned it: I'm very interested in the storage of sparse matrices. My huge matrix is indeed quite sparse. How does that work?


Best,
D



On Sun, May 3, 2020 at 12:48 AM +0200, "Wei-Keng Liao" <wkliao@xxxxxxxxxxxxxxxx> wrote:

    Hi, Dave

    Thanks for following up with the correct information about the dimension objects.
    I admit that I am not familiar with the NetCDF4 dimension representation in HDF5.

    Wei-keng

    > On May 2, 2020, at 5:28 PM, Dave Allured - NOAA Affiliate wrote:
    >
    > Wei-keng, thanks for the info on the latest release. Minor detail: I
    > found that hidden dimension scales are still stored as arrays, but the
    > arrays are left unpopulated. HDF5 stores these as sparse, which means
    > no wasted space in arrays that are never written.
    >
    > For Davide, I concur with Wei-keng that netcdf-C 4.7.4 is okay for
    > your purpose, and should not store wasted space. Version 4.7.3 behaves
    > the same as 4.7.4.
    >
    > I wonder when they changed that, some time between your 4.4.1.1 and
    > 4.7.3. Also, you used HDF5 1.8.18; I used 1.10.5. That should not make
    > any difference here, but perhaps it does.
    >
    > On Sat, May 2, 2020 at 1:01 PM Wei-Keng Liao wrote:
    > > If you used the latest NetCDF 4.7.4, the dimensions will be stored
    > > as scalars.
    > >
    > > Wei-keng
    > >
    > > > On May 2, 2020, at 1:42 PM, Davide Sangalli wrote:
    > > >
    > > > Yeah, but BS_K_linearized1 is just a dimension, how can it be 8 GB
    > > > big? Same for BS_K_linearized2, how can it be 3 GB big? These two
    > > > are just two numbers:
    > > > BS_K_linearized1 = 2,025,000,000
    > > > (it was chosen as a maximum variable size in my code, to avoid
    > > > overflowing the maximum allowed integer in standard precision)
    > > > BS_K_linearized2 = 781,887,360
    > > >
    > > > D.
    > > >
    > > > On 02/05/20 19:06, Wei-Keng Liao wrote:
    > > > > The dump information shows there are actually 8 datasets in the
    > > > > file. Below are the start offsets, sizes, and end offsets of the
    > > > > individual datasets. There is not much padding space in between
    > > > > the datasets. According to this, your file is expected to be of
    > > > > size 16 GB.
    > > > >
    > > > > dataset name                   start offset    size            end offset
    > > > > BS_K_linearized1                        2,379   8,100,000,000   8,100,002,379
    > > > > BSE_RESONANT_COMPRESSED1_DONE   8,100,002,379   2,025,000,000  10,125,002,379
    > > > > BSE_RESONANT_COMPRESSED2_DONE  10,125,006,475   2,025,000,000  12,150,006,475
    > > > > BS_K_linearized2               12,150,006,475   3,127,549,440  15,277,555,915
    > > > > BSE_RESONANT_COMPRESSED3_DONE  15,277,557,963     781,887,360  16,059,445,323
    > > > > complex                        16,059,447,371               8  16,059,447,379
    > > > > BS_K_compressed1               16,059,447,379      99,107,168  16,158,554,547
    > > > > BSE_RESONANT_COMPRESSED1       16,158,554,547     198,214,336  16,356,768,883
    > > > >
    > > > > Wei-keng
    > > > >
    > > > > > On May 2, 2020, at 11:28 AM, Davide Sangalli wrote:
    > > > > > h5dump -Hp ndb.BS_COMPRESS0.005000_Q1





