Re: [netcdfgroup] [Hdf-forum] Detecting netCDF versus HDF5 -- PROPOSED SOLUTIONS -- REQUEST FOR COMMENTS

I'll admit to not having time to read this whole email in detail. But, I've 
read enough and wanted to make just a few remarks.


  1.  Silo *does* know whether a given HDF5 file was produced by Silo. It does 
so by storing some key datasets in the HDF5 file that are, in all likelihood, 
unique to Silo. That isn't to say that some other workflow somewhere in the 
world might not generate similarly named, shaped and typed datasets with similar 
contents. But, it is an unlikely enough situation that I claim Silo knows with 
certainty, when it is given an HDF5 file, whether the file was indeed produced 
with Silo.
  2.  IMHO, this issue is totally analogous to the global symbol namespace in C 
applications. Every now and then, when linking together umpteen C libraries, you 
encounter situations where two libraries export the same public symbol and the 
link fails. The best practice is to avoid using *common* symbol names like 
'temp', 'lib', 'status', etc. in the public symbol space. For example, we all 
prepend some 3 or 4 letter moniker to library function names (e.g. MPI_). It 
works, obviously, when everyone in the community observes the best practice. Why 
can't the same approach be taken for HDF5 files? The HDF Group could advocate 
for, and we, the community, could adopt, the best practice of associating, say, a 
string-valued attribute with the root group in the file. The attribute's name 
could be shared or it could be unique. Unique may be a bit better but is not 
required. What is required is the best practice that the contents of that 
attribute be designed to be unique to the upper-level API that is using it.
  3.  I am not sure I appreciate or agree with the distinction others are trying 
to make between an "Upper Level API" and a "pure HDF5 compatible format". 
Anything written with HDF5 can be read with HDF5 (without the upper-level API). 
Of course, there may be conventions that the upper-level API utilizes that the 
HDF5 API itself may be ignorant of. So what? We do this quite frequently with 
Silo and Python. Silo writes HDF5 files and some users write Python scripts to 
read them. Those users understand the conventional ways in which Silo is using 
HDF5, and those conventions become codified in the Python they write to read the 
HDF5 directly (e.g. without Silo). It's sometimes a pain because Silo does 
actually try very hard to obscure the details of how it is using HDF5. But, it 
is nonetheless possible, and so I see this distinction as rather moot.
  4.  I think (not really sure) HDF5 may have some low-level features to insert 
a magic byte sequence into the boot block or other parts of the file header 
*and* such data can be queried back by the application. If so, that solution 
might even be better, as it avoids stuffing anything into the file's "HDF5 
Object Global Namespace".
  5.  We still have a bazillion legacy files out there. We can't fix those, and 
so we still need some heuristics to facilitate workflows using them.
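
To make point 2 concrete, here is a minimal sketch of the detection side of 
that convention, independent of any particular HDF5 binding. The attribute 
names and producer markers below are purely hypothetical illustrations; the 
convention itself would have to define the real ones:

```python
# Sketch of the root-group attribute convention (point 2 above).
# The attribute names/values are hypothetical, not an actual standard.
# root_attrs stands in for the attributes an HDF5 reader would list
# on the root group ("/").

def identify_producer(root_attrs):
    """Return the upper-level API that claims this file, if any."""
    # Each producer registers a marker designed to be unique to it,
    # analogous to the 3-4 letter prefix on C library symbols (MPI_...).
    known_markers = {
        "_SILO_writer": "Silo",
        "_NC_writer": "netCDF",
        "_NX_writer": "NeXus",
    }
    for name, producer in known_markers.items():
        if name in root_attrs:
            return producer
    return "plain HDF5"  # no marker: assume a "native" HDF5 file

print(identify_producer({"_NC_writer": "netCDF 4.4.0"}))  # netCDF
print(identify_producer({}))                              # plain HDF5
```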

Mark


From: Pedro Vicente <pedro.vicente@xxxxxxxxxxxxxxxxxx>
Date: Thursday, April 21, 2016 8:33 AM
To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>, 
"cf-metadata@xxxxxxxxxxxx" <cf-metadata@xxxxxxxxxxxx>, Discussion forum 
for the NeXus data format <nexus@xxxxxxxxxxxxxxx>, 
"netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>
Cc: "Miller, Mark C." <miller86@xxxxxxxx>, 
"Marinelli, Daniel J. (GSFC-5810)" <daniel.j.marinelli@xxxxxxxx>, 
"Richard.E.Ullman@xxxxxxxx" <Richard.E.Ullman@xxxxxxxx>, 
"Christopher.S.Lynnes@xxxxxxxx" <Christopher.S.Lynnes@xxxxxxxx>, 
Kent Yang <myang6@xxxxxxxxxxxx>, John Shalf <jshalf@xxxxxxx>, 
"haley@xxxxxxxx" <haley@xxxxxxxx>, Ward Fisher <wfisher@xxxxxxxx>, 
Ed Hartnett <edwardjameshartnett@xxxxxxxxx>
Subject: Re: [Hdf-forum] Detecting netCDF versus HDF5 -- PROPOSED SOLUTIONS 
-- REQUEST FOR COMMENTS


DETECTING HDF5 VERSUS NETCDF GENERATED FILES
REQUEST FOR COMMENTS

AUTHOR: Pedro Vicente


AUDIENCE:

1) HDF, netCDF developers

Ed Hartnett
Kent Yang

2) HDF, netCDF users that replied to this thread

Miller, Mark C.
John Shalf

3) netCDF tools developers

Mary Haley, NCL

4) HDF, netCDF managers and sponsors

David Pearah, CEO HDF Group
Ward Fisher, UCAR
Marinelli, Daniel J., Richard Ullman, Christopher Lynnes, NASA

5) [CF-metadata] list

After this thread started 2 months ago, there was an announcement on the 
[CF-metadata] mail list about
"a meeting to discuss current and future netCDF-CF efforts and directions.
The meeting will be held on 24-26 May 2016 in Boulder, CO, USA at the UCAR 
Center Green facility."

This would be a good topic to put on the agenda, maybe?


THE PROBLEM:

Currently it is impossible to detect whether an HDF5 file was generated by the 
HDF5 API or by the netCDF API.
See the previous email for the reasons why.
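
As a minimal sketch of the reason: a netCDF-4 file is itself an HDF5 file, so 
both begin with the same 8-byte HDF5 signature; only classic netCDF-3 files 
carry a distinct magic number. (The signatures below are from the HDF5 and 
netCDF format specifications.)

```python
# Why detection fails at the byte level: a netCDF-4 file is
# bit-for-bit an HDF5 file, so both start with the HDF5 superblock
# signature. Only classic netCDF-3 files have their own magic.

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"      # HDF5 superblock signature
NETCDF3_MAGICS = (b"CDF\x01", b"CDF\x02")  # classic, 64-bit-offset

def sniff(first_bytes):
    if first_bytes.startswith(HDF5_SIGNATURE):
        # Could be plain HDF5 *or* netCDF-4: indistinguishable here.
        return "HDF5 (possibly netCDF-4)"
    if first_bytes[:4] in NETCDF3_MAGICS:
        return "netCDF-3 classic"
    return "unknown"

print(sniff(b"\x89HDF\r\n\x1a\n" + b"\0" * 8))  # HDF5 (possibly netCDF-4)
print(sniff(b"CDF\x01" + b"\0" * 8))            # netCDF-3 classic
```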

WHY THIS MATTERS:

Software applications that need to handle both netCDF and HDF5 files cannot 
decide which API to use.
This includes popular visualization tools like IDL, Matlab, NCL, and HDF Explorer.

SOLUTIONS PROPOSED: 2

SOLUTION 1: Add a flag to HDF5 source

The HDF5 format specification, listed here

https://www.hdfgroup.org/HDF5/doc/H5.format.html

describes a sequence of bytes in the file layout that have special meaning for 
the HDF5 API. It is common practice, when designing a data format,
to leave some fields "reserved for future use".

This solution makes use of one of these empty "reserved for future use" spaces 
to store a byte (for example) that describes an enumerator
of "HDF5 compatible formats".

An "HDF5 compatible format" is a data format that uses the HDF5 API at a lower 
level (usually hidden from the user of the upper API)
and provides its own API.

This category can still be divided into 2 sub-categories:
1) A "pure HDF5 compatible format". Example: NeXus

http://www.nexusformat.org/

NeXus just writes some metadata (attributes) on top of the HDF5 API that have 
some special meaning for the NeXus community.

2) A "non-pure HDF5 compatible format". Example: netCDF

Here, the format adds some extra features besides HDF5. In the case of netCDF, 
these are shared dimensions between variables.

This sub-division between 1) and 2) is irrelevant for the problem and solution 
in question.

The solution consists of writing a different enumerator value in the "reserved 
for future use" space. For example:

Value decimal 0 (current value): This file was generated by the HDF5 API 
(meaning the HDF5 only API)
Value decimal 1: This file was generated by the netCDF API (using HDF5)
Value decimal 2: This file was generated by <put here another HDF5 based format>
and so on
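
As a sketch, a reader using Solution 1 would reduce detection to a lookup from 
the reserved byte to a registered producer. Only value 0 exists today; the 
other assignments below are hypothetical, and the registry itself would be 
administered by the HDF Group:

```python
# Hypothetical registry for Solution 1: the value stored in a
# "reserved for future use" byte of the HDF5 file header maps to the
# upper-level API that produced the file. Only value 0 exists today;
# the other entries illustrate the proposal.
PRODUCER_REGISTRY = {
    0: "HDF5 (native API)",
    1: "netCDF",
    2: "NeXus",  # hypothetical assignment
}

def producer_from_reserved_byte(value):
    return PRODUCER_REGISTRY.get(value, "unregistered HDF5-based format")

print(producer_from_reserved_byte(1))   # netCDF
print(producer_from_reserved_byte(99))  # unregistered HDF5-based format
```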

The advantage of this solution is that the process involves 2 parties: the HDF 
Group and the other format's organization.

This allows the HDF Group to "keep track" of new HDF5-based formats. It also 
allows the other format to be made "HDF5 certified".


SOLUTION 2: Add some metadata to the other API on top of HDF5

This is what NeXus uses.
A NeXus file, on creation, writes several attributes on the root group, like 
"NeXus_version" and other numeric data.
This is done using the public HDF5 API calls.

The solution for netCDF consists of the same approach: just write some specific 
attributes, and a special netCDF API to write/read them.

This solution requires the work of only one party (the netCDF group).
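
As a sketch of what such a netCDF-specific attribute could carry, the payload 
below encodes the creating API and its version in one string. The attribute 
layout is a hypothetical illustration, not an agreed convention; the only real 
requirement is that the contents be unique to the upper-level API:

```python
# Sketch of a Solution 2 attribute payload: a single root-group
# attribute whose string value records the creating API and version.
# The "key=value|..." layout is hypothetical.

def make_creator_attribute(api_name, api_version):
    """Build the string the creating API would write on the root group."""
    return f"creator={api_name}|version={api_version}"

def parse_creator_attribute(value):
    """Recover the fields on the reading side."""
    return dict(field.split("=", 1) for field in value.split("|"))

attr = make_creator_attribute("netCDF", "4.3.3.1")
print(attr)                                      # creator=netCDF|version=4.3.3.1
print(parse_creator_attribute(attr)["creator"])  # netCDF
```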

END OF RFC



In reply to people that commented in the thread

@John Shalf

>>Perhaps NetCDF (and other higher-level APIs that are built on top of HDF5) 
>>should include an attribute attached
>>to the root group that identifies the name and version of the API that 
>>created the file?  (adopt this as a convention)

Yes, that's one way to do it: Solution 2 above.

@Mark Miller

>>>Hmmm. Is there any big reason NOT to try to read a netCDF produced HDF5 file 
>>>with the native HDF5 library if someone so chooses?

It's possible to read a netCDF file using HDF5, yes.
There are 2 things that you will miss by doing this:
1) the ability to inquire about shared netCDF dimensions.
2) the ability to read remotely with OPeNDAP.
Reading with HDF5 also exposes metadata that is supposed to be private to 
netCDF. See below.

>>>> And, attempting  to read an HDF5 file produced by Silo using just the HDF5 
>>>> library (e.g. w/o Silo) is a major pain.

This I don't understand. Why not read the Silo file with the Silo API?

That's the whole purpose of this issue: each higher-level API on top of HDF5 
should be able to detect "itself".
I am not familiar with Silo, but if Silo cannot do this, then you have the same 
design flaw that netCDF has.


>>> In a cursory look over the libsrc4 sources in netCDF distro, I see a few 
>>> things that might give a hint a file was created with netCDF. . .
>>>> First, in NC_CLASSIC_MODEL, an attribute gets attached to the root group 
>>>> named "_nc3_strict". So, the existence of an attribute on the root group 
>>>> by that name would suggest the HDF5 file was generated by netCDF.

I think this is done only for the "old" netCDF-3 (classic) model.

>>>>> Also, I tested a simple case of nc_open, nc_def_dim, etc. nc_close to see 
>>>>> what it produced.
>>>> It appears to produce datasets for each 'dimension' defined with two 
>>>> attributes named "CLASS" and "NAME".

This is because netCDF uses the HDF5 Dimension Scales API internally to keep 
track of shared dimensions. These are internal attributes
of Dimension Scales. This approach would not work, because an HDF5-only file 
that uses Dimension Scales would have the same attributes.


>>>> I like John's suggestion here.
>>>>>But, any code you add to any applications now will work *only* for files 
>>>>>that were produced post-adoption of this convention.

Yes. There are 2 actions to take here:
1) fix the issue for the future
2) try to retroactively find some workaround that makes it possible to 
differentiate HDF5/netCDF files made before the adopted convention
See below.


>>>> In VisIt, we support >140 format readers. Over 20 of those are different 
>>>> variants of HDF5 files (H5part, Xdmf, Pixie, Silo, Samrai, netCDF, Flash, 
>>>> Enzo, Chombo, etc., etc.)
>>>>When opening a file, how does VisIt figure out which plugin to use? In 
>>>>particular, how do we avoid one poorly written reader plugin (which may be 
>>>>the wrong one for a given file) from preventing the correct one from being 
>>>>found. Its kinda a hard problem.


Yes, that's the problem we are trying to solve. I have to say, that is quite a 
list of HDF5-based formats there.

>>>> Some of our discussion is captured here. . .
http://www.visitusers.org/index.php?title=Database_Format_Detection

I'll check it out; thank you for the suggestions.

@Ed Hartnett


>>>I must admit that when putting netCDF-4 together I never considered that 
>>>someone might want to tell the difference between a "native" HDF5 file and a 
>>>netCDF-4/HDF5 file.
>>>>>Well, you can't think of everything.

This is a major design flaw.
If you are in the business of designing data file formats, one of the things 
you have to do is make it possible to distinguish your format from the others.


>>> I agree that it is not possible to canonically tell the difference. The 
>>> netCDF-4 API does use some special attributes to track named dimensions,
>>>>and to tell whether classic mode should be enforced. But it can easily 
>>>>produce files without any named dimensions, etc.
>>>So I don't think there is any easy way to tell.

I remember you wrote that code together with Kent Yang from the HDF Group.
At the time I was with the HDF Group, but unfortunately I did not follow 
closely what you were doing.
I don't remember any design document being circulated that explains the 
internals of how to make the netCDF (classic) model of shared dimensions
use the hierarchical group model of HDF5.
I know this was done using the HDF5 Dimension Scales (which I wrote), but is 
there any design document that explains it?

Maybe just some internal email exchange between you and Kent Yang?
Kent, how are you?
Do you remember having any design document that explains this?
Maybe something like a unique private attribute that is written somewhere in 
the netCDF file?


@Mary Haley, NCL

NCL is a widely used tool that handles both netCDF and HDF5

Mary, how are you?
How does NCL deal with the case of reading both pure HDF5 files and netCDF 
files that use HDF5?
Would you be interested in joining a community based effort to deal with this, 
in case this is an issue for you?


@David Pearah  , CEO HDF Group

I volunteer to participate in the effort of this RFC together with the HDF 
Group (and netCDF group).
Maybe we could form a "task force" between the HDF Group, the netCDF group, and 
any volunteers (such as tools developers who happen to be on these mailing lists)?

The "task force" would have 2 tasks:
1) make an HDF5-based convention for the future, and
2) try to retroactively salvage the current design issue of netCDF.
My phone is 217-898-9356; you are welcome to call anytime.

----------------------
Pedro Vicente
pedro.vicente@xxxxxxxxxxxxxxxxxx
https://twitter.com/_pedro__vicente
http://www.space-research.org/




----- Original Message -----
From: Miller, Mark C. <miller86@xxxxxxxx>
To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
Cc: netcdfgroup@xxxxxxxxxxxxxxxx; Ward Fisher <wfisher@xxxxxxxx>
Sent: Wednesday, March 02, 2016 7:07 PM
Subject: Re: [Hdf-forum] Detecting netCDF versus HDF5

I like John's suggestion here.

But, any code you add to any applications now will work *only* for files that 
were produced post-adoption of this convention.

There are probably a bazillion files out there at this point that don't follow 
that convention and you probably still want your applications to be able to 
read them.

In VisIt, we support >140 format readers. Over 20 of those are different 
variants of HDF5 files (H5part, Xdmf, Pixie, Silo, Samrai, netCDF, Flash, Enzo, 
Chombo, etc., etc.) When opening a file, how does VisIt figure out which plugin 
to use? In particular, how do we avoid one poorly written reader plugin (which 
may be the wrong one for a given file) preventing the correct one from being 
found? It's kind of a hard problem.

Some of our discussion is captured here. . .

http://www.visitusers.org/index.php?title=Database_Format_Detection

Mark


From: Hdf-forum <hdf-forum-bounces@xxxxxxxxxxxxxxxxxx> on behalf of John Shalf 
<jshalf@xxxxxxx>
Reply-To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
Date: Wednesday, March 2, 2016 1:02 PM
To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
Cc: "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>, Ward 
Fisher <wfisher@xxxxxxxx>
Subject: Re: [Hdf-forum] Detecting netCDF versus HDF5

Perhaps NetCDF (and other higher-level APIs that are built on top of HDF5) 
should include an attribute attached to the root group that identifies the name 
and version of the API that created the file?  (adopt this as a convention)

-john

On Mar 2, 2016, at 12:55 PM, Pedro Vicente 
<pedro.vicente@xxxxxxxxxxxxxxxxxx> wrote:
Hi Ward
As you know, Data Explorer is going to be a general purpose data reader for 
many formats, including HDF5 and netCDF.
Here
http://www.space-research.org/
Regarding the handling of both HDF5 and netCDF, it seems there is a potential 
issue, which is: how to tell whether a given HDF5 file was saved by the HDF5 
API or by the netCDF API?
It seems to me that this is not possible. Is this correct?
netCDF uses an internal function, NC_check_file_type, to examine the first few 
bytes of a file; for example, for any HDF5 file the test is
/* Look at the magic number */
   /* Ignore the first byte for HDF */
   if(magic[1] == 'H' && magic[2] == 'D' && magic[3] == 'F') {
     *filetype = FT_HDF;
     *version = 5;
The problem is that this test works for any HDF5 file and for any netCDF file, 
which makes it impossible to tell which is which.
Which makes it impossible for any general purpose data reader to decide to use 
the netCDF API or the HDF5 API.
I have a possible solution for this, but before going any further, I would 
just like to confirm:
1)      that it is indeed not possible, and
2)      whether you have a solid workaround for this, excluding the dumb ones, 
for example deciding on an extension .nc or .h5, or traversing the HDF5 file to 
see if it's a non-netCDF-conforming one. Yes, to further complicate things, it 
is possible that the above test says OK for an HDF5 file, but then the read by 
the netCDF API fails because the file is HDF5 but not netCDF conformant.
Thanks
----------------------
Pedro Vicente
pedro.vicente@xxxxxxxxxxxxxxxxxx
http://www.space-research.org/
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@xxxxxxxxxxxxxxxxxx
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

