Re: [netcdfgroup] Content-Based Checksums of a netCDF dataset

On 08/24/2017 09:04 PM, Willi Rath wrote:
Hi Ed,

On 08/24/2017 08:16 PM, Ed Hartnett wrote:
You can turn on HDF5 checksums with nc_def_var_fletcher32() (See: https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-c/nc_005fdef_005fvar_005ffletcher32.html).

Is this what you want?
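
For example, here is a minimal, untested sketch of enabling it at variable definition time (the file and variable names are made up):

    #include <stdio.h>
    #include <netcdf.h>

    /* Abort with a readable message on any netCDF error. */
    #define CHECK(e) do { int _s = (e); if (_s != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(_s)); return 1; } } while (0)

    int main(void) {
        int ncid, dimid, varid;

        /* Fletcher32 checksums require the HDF5-backed netCDF-4 format. */
        CHECK(nc_create("checksummed.nc", NC_NETCDF4 | NC_CLOBBER, &ncid));
        CHECK(nc_def_dim(ncid, "x", 1024, &dimid));
        CHECK(nc_def_var(ncid, "data", NC_FLOAT, 1, &dimid, &varid));

        /* Turn on per-chunk Fletcher32 checksums for this variable; the
           library verifies them automatically whenever the data is read. */
        CHECK(nc_def_var_fletcher32(ncid, varid, NC_FLETCHER32));

        CHECK(nc_close(ncid));
        return 0;
    }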

If I understand the purpose of nc_def_var_fletcher32() correctly, it is meant as an internal integrity check: the library verifies data it reads from disk against a checksum that was created at write time?

What I am aiming at is a way of telling whether, assuming the files are not corrupted, the actual data contained in two data sets are identical, without re-hashing every time I want to know this.

I was a bit short on the background of my question:

Let's consider the problem of ensuring that a file is intact after it was moved around in the file system or over the network as solved. (Or at least, this is not a problem specific to netCDF data sets and hence should be tackled somewhere else.)

I am, however, very often confronted with data files that are equivalent (containing exactly the same information) but that differ on disk due to their netCDF properties (chunking, format, etc.). This started to happen a lot as people became more widely aware of the advantages of netCDF4 and netCDF4 classic. Suddenly, there are five different files, all with the same name and the same header but on different machines, and no way to tell them apart.

Cheers
Willi

Thanks,
Ed Hartnett

On Thu, Aug 24, 2017 at 12:04 PM, dmh@xxxxxxxx wrote:

    A small note. Since the goal is equality testing rather than security,
    you should be able to get by with CRC32 or CRC64 checksums.
    SHA256 is overkill.
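
    For equality testing of the decoded values, something like the
    following sketch would do (file and variable names are made up,
    the variable is assumed to be 1-D float, and a real tool would
    walk all variables in a canonical order, folding in names, types,
    and shapes):

        #include <stdio.h>
        #include <stdlib.h>
        #include <netcdf.h>
        #include <zlib.h>

        int main(void) {
            int ncid, varid, dimid;
            size_t n;

            /* Hypothetical file with a 1-D float variable "data". */
            if (nc_open("example.nc", NC_NOWRITE, &ncid) != NC_NOERR) return 1;
            if (nc_inq_varid(ncid, "data", &varid) != NC_NOERR) return 1;
            if (nc_inq_vardimid(ncid, varid, &dimid) != NC_NOERR) return 1;
            if (nc_inq_dimlen(ncid, dimid, &n) != NC_NOERR) return 1;

            float *buf = malloc(n * sizeof *buf);
            if (buf == NULL) return 1;

            /* nc_get_var_* returns the decoded values, so the checksum is
               independent of on-disk chunking, deflation, and format. */
            if (nc_get_var_float(ncid, varid, buf) != NC_NOERR) return 1;
            nc_close(ncid);

            /* CRC32 over the raw value bytes (assumes hosts of equal
               byte order). */
            uLong crc = crc32(0L, Z_NULL, 0);
            crc = crc32(crc, (const Bytef *)buf, (uInt)(n * sizeof *buf));
            printf("content crc32: %08lx\n", crc);

            free(buf);
            return 0;
        }

    (Compile with -lnetcdf -lz.)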
    =Dennis Heimbigner
      Unidata


    On 8/24/2017 12:00 PM, Willi Rath wrote:

        Hi all,

        I'd like to find a way to verify the contents of a given netCDF
        dataset across different representations on disk.  (Think of the
        data set being defined by its CDL code and different
        representations on disk being realised by different choices of
        format, deflation, chunking, etc. but with identical CDL.)

        There are tools that compare the contents of two netCDF files:
        cdo's diff or nccmp. These tools do, however, rely on both
        files being present on the same file system at the same time.
        A hash-based approach that calculates checksums from the
        contents rather than from the binary representation of the
        data set would be a nice solution to the problem.

        I've tried and collected all attempts made at verifying netCDF
        files in https://github.com/willirath/netcdf-hash (the most
        successful of these circled around including the functionality
        in `ncks` and led to a pair of tools that calculate and verify
        MD5 checksums of netCDF files, storing the checksums within
        the files themselves).

        There is also a demo outlining an approach that digests
        different representations of the same netCDF data set into a
        SHA256 hash and stores the hex value of this hash in a global
        attribute of the respective files.
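
        In essence, the attribute-storing part boils down to a sketch
        like the following (the attribute name is made up, and the
        digest string would come from the hashing step):

            #include <string.h>
            #include <netcdf.h>

            /* Store a hex digest as a global attribute.  Classic-format
               files would need nc_redef()/nc_enddef() around the attribute
               write; netCDF-4 files can be modified directly. */
            int store_digest(const char *path, const char *hex_digest) {
                int ncid, status;

                if ((status = nc_open(path, NC_WRITE, &ncid)) != NC_NOERR)
                    return status;
                status = nc_put_att_text(ncid, NC_GLOBAL, "content_sha256",
                                         strlen(hex_digest), hex_digest);
                nc_close(ncid);
                return status;
            }

            int main(void) {
                /* Shortened placeholder digest, for illustration only. */
                return store_digest("example.nc", "0123abcd") != NC_NOERR;
            }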

        I'd be very happy about any pointers to additional ideas (or
        perhaps existing tools) that solve the problem of netCDF
        content verification, and about any suggestions, remarks, etc.

        Cheers
        Willi





--

Willi Rath
Theorie und Modellierung
GEOMAR
Helmholtz-Zentrum für Ozeanforschung Kiel
Duesternbrooker Weg 20, Room 422
24105 Kiel, Germany
------------------------------------------------------------
Tel. +49-431-600-4010
wrath@xxxxxxxxx
www.geomar.de
-----------------------


