On 08/24/2017 09:04 PM, Willi Rath wrote:
Hi Ed,
On 08/24/2017 08:16 PM, Ed Hartnett wrote:
You can turn on HDF5 checksums with nc_def_var_fletcher32() (See:
https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-c/nc_005fdef_005fvar_005ffletcher32.html).
Is this what you want?
If I understand the purpose of fletcher32() correctly, it is meant as an
internal integrity check where the library checks data it reads from
disk against a checksum that has been created at the time of writing?
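The write-time/read-time check described above can be sketched in a few lines. This is a generic Fletcher-32 over little-endian 16-bit words, and `write_chunk` / `read_chunk` are hypothetical helpers invented for illustration; the sketch is not claimed to be byte-compatible with what HDF5 computes internally.

```python
def fletcher32(data: bytes) -> int:
    """Generic Fletcher-32 over little-endian 16-bit words (zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1


def write_chunk(payload: bytes) -> bytes:
    """Hypothetical writer: append the checksum when the data is stored."""
    return payload + fletcher32(payload).to_bytes(4, "little")


def read_chunk(stored: bytes) -> bytes:
    """Hypothetical reader: recompute the checksum and compare on read."""
    payload, tail = stored[:-4], stored[-4:]
    if fletcher32(payload) != int.from_bytes(tail, "little"):
        raise ValueError("checksum mismatch: chunk is corrupted")
    return payload
```

Such a check only says whether one stored representation is intact. It says nothing about whether two differently stored files hold the same data.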
What I am aiming at is a way of telling whether, under the assumption
that the files are not corrupted, the actual data contained in two data
sets are identical, without re-hashing every time I want to know this.
I was a bit short on the background of my question:
Let's consider the problem of ensuring that a file is intact after it
was moved around in the file system or via the network solved. (Or at
least, this is not a problem specific to netCDF data sets and hence
should be tackled somewhere else.)
I am, however, very often confronted with data files that are equivalent
(containing exactly the same information) but that differ on disk due to
their netCDF properties (chunking, format, etc.). This started to
happen a lot as people became more widely aware of the advantages of
netCDF4 and netCDF4 classic. Suddenly, there are five different files, all
with the same name and the same header but on different machines, and no
way to tell them apart.
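A small sketch of the situation described above: the same values are stored twice with different zlib compression levels, standing in for two on-disk representations of one data set. The values and the use of plain zlib as a stand-in for netCDF deflation are assumptions for illustration only; real netCDF files differ in many more ways.

```python
import hashlib
import struct
import zlib

# Hypothetical payload: the same eight float64 values in both "files".
values = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
raw = struct.pack("<8d", *values)

# Two stand-ins for on-disk representations: stored (level 0) vs. deflated (level 9).
stored = zlib.compress(raw, 0)
deflated = zlib.compress(raw, 9)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hashing the bytes on disk: the two representations do not match.
file_hashes_equal = sha256_hex(stored) == sha256_hex(deflated)

# Hashing the decoded content: the two representations match.
content_hashes_equal = (
    sha256_hex(zlib.decompress(stored)) == sha256_hex(zlib.decompress(deflated))
)
```

A checksum of the file bytes therefore cannot answer the question; only a checksum of the decoded content can.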
Cheers
Willi
Thanks,
Ed Hartnett
On Thu, Aug 24, 2017 at 12:04 PM, dmh@xxxxxxxx wrote:
A small note. Since the goal is equality testing rather than security,
you should be able to get by with CRC32 or CRC64 checksums.
SHA256 is overkill.
=Dennis Heimbigner
Unidata
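For comparison, both kinds of checksum can be computed over the same bytes with the Python standard library. This is a minimal sketch: `canonical_bytes` is a hypothetical helper that packs values as little-endian float64 so the result does not depend on how a file stores them. CRC64 is not shown because the standard library does not provide it.

```python
import hashlib
import struct
import zlib


def canonical_bytes(values):
    """Pack floats as little-endian float64, independent of storage layout."""
    return struct.pack(f"<{len(values)}d", *values)


def crc32_hex(data: bytes) -> str:
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    return f"{zlib.crc32(data):08x}"


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


a = canonical_bytes([1.0, 2.0, 3.0])
b = canonical_bytes([1.0, 2.0, 3.0])  # same content as a
c = canonical_bytes([1.0, 2.0, 3.5])  # one value differs

same = crc32_hex(a) == crc32_hex(b)
different = crc32_hex(a) != crc32_hex(c)
```

The CRC32 digest is 8 hex characters and the SHA256 digest is 64. The trade-off is that a 32-bit checksum has a much higher chance of accidental collisions across a large collection of files, which may or may not matter for a given archive.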
On 8/24/2017 12:00 PM, Willi Rath wrote:
Hi all,
I'd like to find a way to verify the contents of a given netCDF
dataset across different representations on disk. (Think of the
data set being defined by its CDL code and different
representations on disk being realised by different choices of
format, deflation, chunking, etc. but with identical CDL.)
There are tools that compare the contents of two netCDF files:
cdo's diff or nccmp. These tools do, however, rely on both files
being present on the same file system and at the same time. A
hash-based approach calculating checksums from the contents
rather than the binary representation of the data set would be a
nice solution to the problem.
I've tried to collect all attempts made at verifying netCDF files
in: https://github.com/willirath/netcdf-hash (The most successful
of which revolved around the possibility of including the
functionality in `ncks` and led to a pair of tools for
calculation and verification of MD5 checksums of netCDF files
that are stored within the files.)
There is also a demo outlining an approach that digests different
representations of the same netCDF data set into a SHA256 hash
and stores the hex value of this hash in global attributes of
the respective files.
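The idea of such a content digest can be sketched as follows. This is not the code from the demo: the data set is modelled as a plain Python dict, and the names `content_sha256`, `stamp`, `verify` and the attribute name `content_sha256_hex` are all hypothetical. With a real file, the dimensions, attributes and values would be read through a netCDF library instead.

```python
import hashlib
import json
import struct

HASH_ATTR = "content_sha256_hex"  # hypothetical global attribute name


def _feed(h, payload: bytes) -> None:
    # Length-prefix every field so field boundaries are unambiguous.
    h.update(struct.pack("<Q", len(payload)))
    h.update(payload)


def content_sha256(dataset: dict) -> str:
    """Digest dimensions, attributes and values; ignore storage settings."""
    h = hashlib.sha256()
    # Leave out the attribute that holds the hash itself.
    attrs = {k: v for k, v in dataset["attrs"].items() if k != HASH_ATTR}
    header = {"dims": dataset["dims"], "attrs": attrs}
    _feed(h, json.dumps(header, sort_keys=True).encode("utf-8"))
    for name in sorted(dataset["vars"]):
        var = dataset["vars"][name]
        _feed(h, name.encode("utf-8"))
        _feed(h, json.dumps(var["dims"]).encode("utf-8"))
        _feed(h, struct.pack(f"<{len(var['data'])}d", *var["data"]))
    return h.hexdigest()


def stamp(dataset: dict) -> None:
    """Store the digest as a global attribute."""
    dataset["attrs"][HASH_ATTR] = content_sha256(dataset)


def verify(dataset: dict) -> bool:
    """Compare the stored digest with a freshly computed one."""
    return dataset["attrs"].get(HASH_ATTR) == content_sha256(dataset)


# Two stand-ins for the same data set with different chunking.
a = {"dims": {"x": 4}, "attrs": {"title": "demo"},
     "vars": {"temp": {"dims": ["x"], "data": [1.0, 2.0, 3.0, 4.0], "chunks": [2]}}}
b = {"dims": {"x": 4}, "attrs": {"title": "demo"},
     "vars": {"temp": {"dims": ["x"], "data": [1.0, 2.0, 3.0, 4.0], "chunks": [4]}}}
```

Once a file is stamped, two copies on different machines can be compared by reading the stored attribute alone, and `verify` only needs to be run when there is doubt about the stamp itself.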
I'd be very happy about any pointers to additional ideas (or
perhaps existing tools) solving the problem of netCDF-content
verification, about suggestions, remarks, etc.
Cheers
Willi
_______________________________________________
NOTE: All exchanges posted to Unidata maintained email lists are
recorded in the Unidata inquiry tracking system and made publicly
available through the web. Users who post to any of the lists we
maintain are reminded to remove any personal information that they
do not want to be made public.
netcdfgroup mailing list
netcdfgroup@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit:
http://www.unidata.ucar.edu/mailing_lists/
--
Willi Rath
Theorie und Modellierung
GEOMAR
Helmholtz-Zentrum für Ozeanforschung Kiel
Duesternbrooker Weg 20, Raum 422
24105 Kiel, Germany
------------------------------------------------------------
Tel. +49-431-600-4010
wrath@xxxxxxxxx
www.geomar.de
-----------------------