Re: [netcdfgroup] [Hdf-forum] Detecting netCDF versus HDF5 -- PROPOSED SOLUTIONS --REQUEST FOR COMMENTS

  • To: Pedro Vicente <pedro.vicente@xxxxxxxxxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] [Hdf-forum] Detecting netCDF versus HDF5 -- PROPOSED SOLUTIONS --REQUEST FOR COMMENTS
  • From: Mary Haley <haley@xxxxxxxx>
  • Date: Thu, 21 Apr 2016 10:16:14 -0600
Hi Pedro,

Nice to hear from you again.  I've CC'ed Dave Brown to answer the NCL
question, since he's the person familiar with the code for opening these
files. I also CC'ed Rick who's also been working with some file I/O issues
in NCL.

--Mary


On Thu, Apr 21, 2016 at 9:33 AM, Pedro Vicente <
pedro.vicente@xxxxxxxxxxxxxxxxxx> wrote:

>
> DETECTING HDF5 VERSUS NETCDF GENERATED FILES
> REQUEST FOR COMMENTS
>
> AUTHOR: Pedro Vicente
>
>
> AUDIENCE:
> 1) HDF, netcdf developers,
>
> Ed Hartnett
> Kent Yang
>
> 2) HDF, netcdf users, that replied to this thread
>
> Miller, Mark C.
> John Shalf
>
> 3) netcdf tools developers
>
> Mary Haley, NCL
>
> 4) HDF, netcdf managers and sponsors
>
> David Pearah, CEO HDF Group
> Ward Fisher, UCAR
> Marinelli, Daniel J. , Richard Ullmman, Christopher Lynnes, NASA
>
>
> 5)
> [CF-metadata] list
>
> After this thread started 2 months ago, there was an announcement on the
> [CF-metadata] mailing list about
> "a meeting to discuss current and future netCDF-CF efforts and directions.
> The meeting will be held on 24-26 May 2016 in Boulder, CO, USA at the UCAR
> Center Green facility."
>
> This would be a good topic to put on the agenda, maybe?
>
>
> THE PROBLEM:
>
> Currently it is impossible to detect if an HDF5 file was generated by the
> HDF5 API or by the netCDF API.
> See previous email about the reasons why.
>
> WHY THIS MATTERS:
>
> Software applications that need to handle both netCDF and HDF5 files
> cannot decide which API to use.
> This includes popular visualization tools like IDL, Matlab, NCL, HDF
> Explorer.
>
> SOLUTIONS PROPOSED: 2
>
> SOLUTION 1: Add a flag to HDF5 source
>
> The hdf5 format specification, listed here
>
> https://www.hdfgroup.org/HDF5/doc/H5.format.html
>
> describes a sequence of bytes in the file layout that have special meaning
> for the HDF5 API. It is common practice, when designing a data format,
> to leave some fields "reserved for future use".
>
> This solution uses one of these empty "reserved for future use"
> fields to store a value (one byte, for example) drawn from an enumeration
> of "HDF5 compatible formats".
>
> An "HDF5 compatible format" is a data format that uses the HDF5 API at a
> lower level (usually hidden from the user of the upper API)
> and provides its own API.
>
> This category can be further divided into 2 kinds:
> 1) A "pure HDF5 compatible format". Example, NeXus
>
> http://www.nexusformat.org/
>
> NeXus just writes some metadata (attributes) on top of the HDF5 API that
> have a special meaning for the NeXus community.
>
> 2) A "non pure HDF5 compatible format". Example, netCDF
>
> Here, the format adds some extra feature besides HDF5. In the case of
> netCDF, these are shared dimensions between variables.
>
> This sub-division between 1) and 2) is irrelevant for the problem and
> solution in question
>
> The solution consists of writing a different enumerator value on the
> "reserved for future use" space. For example
>
> Value decimal 0 (current value): This file was generated by the HDF5 API
> (meaning the HDF5 only API)
> Value decimal 1: This file was generated by the netCDF API (using HDF5)
> Value decimal 2: This file was generated by <put here another HDF5 based
> format>
> and so on
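
A minimal sketch of the enumerator idea in Solution 1, in Python. It assumes a version-0 superblock located at file offset 0, where (to my reading of the format specification) byte 11 is a reserved, zero-filled byte. The choice of that byte, the names `Producer` and `read_producer`, and the value assignments beyond 0 and 1 are illustrative assumptions, not anything the HDF Group has adopted.

```python
from enum import IntEnum

# The 8-byte HDF5 format signature that starts the superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Assumption: in a version-0 superblock, byte 11 is reserved (zero).
# Using it for the flag is hypothetical, purely for illustration.
RESERVED_OFFSET = 11


class Producer(IntEnum):
    """Hypothetical enumerator of "HDF5 compatible formats"."""
    HDF5 = 0      # current value: written by the HDF5-only API
    NETCDF = 1    # written by the netCDF API (using HDF5)


def read_producer(header: bytes) -> Producer:
    """Decode the hypothetical producer flag from the first 16 bytes."""
    if header[:8] != HDF5_SIGNATURE:
        raise ValueError("no HDF5 signature at offset 0")
    if header[8] != 0:
        raise ValueError("this sketch only handles superblock version 0")
    # Raises ValueError for values not (yet) registered in the enumerator.
    return Producer(header[RESERVED_OFFSET])


# Usage: a fake 16-byte header with the reserved byte set to 1.
fake_header = HDF5_SIGNATURE + bytes([0, 0, 0, 1]) + bytes(4)
print(read_producer(fake_header).name)
```

Existing files already carry 0 in that byte, so under this sketch they would all decode as plain HDF5, which matches the "current value" line in the list above.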
>
> The advantage of this solution is that this process involves 2 parties:
> the HDF Group and the other format's organization.
>
> This allows the HDF Group to "keep track" of new HDF5 based formats. It
> also allows the other format to be made "HDF5 certified".
>
>
> SOLUTION 2: Add some metadata to the other API on top of HDF5
>
> This is what NeXus uses.
> On creation, a NeXus file writes several attributes on the root group, like
> "NeXus_version" and other numeric data.
> This is done using the public HDF5 API calls.
>
> The solution for netCDF follows the same approach: write some
> specific attributes, and add a special netCDF API to write/read them.
>
> This solution requires the work of only one party (the netCDF group).
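
A minimal sketch of how a reader might act on Solution 2, in Python. The root-group attributes are modeled as a plain mapping so the example stays self-contained; in a real tool they would be read through the HDF5 API. "NeXus_version" is the attribute named above, while "producer_api" and the function name `detect_api` are hypothetical, since no such netCDF attribute has been agreed on in this thread.

```python
from typing import Mapping


def detect_api(root_attrs: Mapping[str, object]) -> str:
    """Guess which higher-level API wrote a file from its root attributes."""
    # NeXus marks its files with a "NeXus_version" root attribute.
    if "NeXus_version" in root_attrs:
        return "NeXus"
    # Hypothetical convention: "<API name> <version>", e.g. "netCDF 4.4.0".
    producer = root_attrs.get("producer_api")
    if isinstance(producer, str) and producer.strip():
        return producer.split()[0]
    # No marker: could be plain HDF5, or a file written before the
    # convention existed, so the sketch does not claim either.
    return "unknown"


print(detect_api({"NeXus_version": "4.3.0"}))
print(detect_api({"producer_api": "netCDF 4.4.0"}))
print(detect_api({}))
```

The "unknown" result is deliberate: an absent attribute cannot distinguish a pure HDF5 file from an older netCDF file, so a reader would still need a fallback for those.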
>
> END OF RFC
>
>
>
> In reply to people that commented in the thread
>
> @John Shalf
>
> >>Perhaps NetCDF (and other higher-level APIs that are built on top of
> HDF5) should include an attribute attached
> >>to the root group that identifies the name and version of the API that
> created the file?  (adopt this as a convention)
>
> yes, that's one way to do it, Solution 2 above
>
> @Mark Miller
>
> >>>Hmmm. Is there any big reason NOT to try to read a netCDF produced HDF5
> file with the native HDF5 library if someone so chooses?
>
> It's possible to read a netCDF file using HDF5, yes.
> There are 2 things that you will miss doing this:
> 1) the ability to inquire about shared netCDF dimensions.
> 2) the ability to read remotely with OPeNDAP.
> Reading with HDF5 also exposes metadata that is supposed to be private to
> netCDF. See below
>
> >>>> And, attempting  to read an HDF5 file produced by Silo using just the
> HDF5 library (e.g. w/o Silo) is a major pain.
>
> This I don't understand. Why not read the Silo file with the Silo API?
>
> That's the whole purpose of this issue: each higher level API on top of HDF5
> should be able to detect "itself".
> I am not familiar with Silo, but if Silo cannot do this, then you have the
> same design flaw that netCDF has.
>
>
> >>> In a cursory look over the libsrc4 sources in netCDF distro, I see a
> few things that might give a hint a file was created with netCDF. . .
> >>>> First, in NC_CLASSIC_MODEL, an attribute gets attached to the root
> group named "_nc3_strict". So, the existence of an attribute on the root
> group by that name would suggest the HDF5 file was generated by netCDF.
>
> I think this is done only by the "old" netCDF3 format.
>
> >>>>> Also, I tested a simple case of nc_open, nc_def_dim, etc. nc_close
> to see what it produced.
> >>>> It appears to produce datasets for each 'dimension' defined with two
> attributes named "CLASS" and "NAME".
>
> This is because netCDF uses the HDF5 Dimension Scales API internally to
> keep track of shared dimensions. These are internal attributes
> of Dimension Scales. This approach would not work because an HDF5 only
> file with Dimension Scales would have the same attributes.
>
>
> >>>> I like John's suggestion here.
> >>>>>But, any code you add to any applications now will work *only* for
> files that were produced post-adoption of this convention.
>
> Yes. There are 2 actions to take here:
> 1) fix the issue for the future
> 2) try to retroactively find some workaround that makes it possible to
> differentiate HDF5 and netCDF files made before the convention is adopted
> (see below).
>
>
> >>>> In VisIt, we support >140 format readers. Over 20 of those are
> different variants of HDF5 files (H5part, Xdmf, Pixie, Silo, Samrai,
> netCDF, Flash, Enzo, Chombo, etc., etc.)
> >>>>When opening a file, how does VisIt figure out which plugin to use? In
> particular, how do we avoid one poorly written reader plugin (which may be
> the wrong one for a given file) from preventing the correct one from being
> found. It's kinda a hard problem.
>
>
> Yes, that's the problem we are trying to solve. I have to say, that is
> quite a list of HDF5 based formats there.
>
> >>>> Some of our discussion is captured here. . .
> http://www.visitusers.org/index.php?title=Database_Format_Detection
>
> I'll check it out, thank you for the suggestions.
>
> @Ed Hartnett
>
>
> >>>I must admit that when putting netCDF-4 together I never considered
> that someone might want to tell the difference between a "native" HDF5 file
> and a netCDF-4/HDF5 file.
> >>>>>Well, you can't think of everything.
>
> This is a major design flaw.
> If you are in the business of designing data file formats, one of the
> things you have to do is make it possible to distinguish your format from
> other formats.
>
>
> >>> I agree that it is not possible to canonically tell the difference.
> The netCDF-4 API does use some special attributes to track named
> dimensions,
> >>>>and to tell whether classic mode should be enforced. But it can easily
> produce files without any named dimensions, etc.
> >>>So I don't think there is any easy way to tell.
>
> I remember you wrote that code together with Kent Yang from the HDF Group .
> At the time I was with the HDF Group but unfortunately I did not follow
> closely what you were doing.
> I don't remember any design document being circulated that explains
> how the netCDF (classic) model of shared dimensions is made to
> use the hierarchical group model of HDF5.
> I know this was done using the HDF5 Dimension Scales (that I wrote), but
> is there any design document that explains it?
>
> Maybe just some internal email exchange between you and Kent Yang?
> Kent, how are you?
> Do you remember having any design document that explains this?
> Maybe something like a unique private attribute that is written somewhere
> in the netCDF file?
>
>
> @Mary Haley, NCL
>
> NCL is a widely used tool that handles both netCDF and HDF5
>
> Mary, how are you?
> How does NCL deal with the case of reading both pure HDF5 files and netCDF
> files that use HDF5?
> Would you be interested in joining a community based effort to deal with
> this, in case this is an issue for you?
>
>
> @David Pearah, CEO HDF Group
>
> I volunteer to participate in the effort of this RFC together with the HDF
> Group (and netCDF Group).
> Maybe we could make a "task force" between HDF Group, netCDF Group and any
> volunteer (such as tools developers that happen to be in these mail lists)?
>
> The "task force" would have 2 tasks:
> 1) make an HDF5 based convention for the future and
> 2) try to retroactively salvage the current design issue of netCDF.
> My phone is 217-898-9356, you are welcome to call in anytime.
>
> ----------------------
> Pedro Vicente
> pedro.vicente@xxxxxxxxxxxxxxxxxx
> https://twitter.com/_pedro__vicente
> http://www.space-research.org/
>
>
>
>
>
> ----- Original Message -----
> *From:* Miller, Mark C. <miller86@xxxxxxxx>
> *To:* HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
> *Cc:* netcdfgroup@xxxxxxxxxxxxxxxx ; Ward Fisher <wfisher@xxxxxxxx>
> *Sent:* Wednesday, March 02, 2016 7:07 PM
> *Subject:* Re: [Hdf-forum] Detecting netCDF versus HDF5
>
> I like John's suggestion here.
>
> But, any code you add to any applications now will work *only* for files
> that were produced post-adoption of this convention.
>
> There are probably a bazillion files out there at this point that don't
> follow that convention and you probably still want your applications to be
> able to read them.
>
> In VisIt, we support >140 format readers. Over 20 of those are different
> variants of HDF5 files (H5part, Xdmf, Pixie, Silo, Samrai, netCDF, Flash,
> Enzo, Chombo, etc., etc.) When opening a file, how does VisIt figure out
> which plugin to use? In particular, how do we avoid one poorly written
> reader plugin (which may be the wrong one for a given file) from preventing
> the correct one from being found. It's kinda a hard problem.
>
> Some of our discussion is captured here. . .
>
> http://www.visitusers.org/index.php?title=Database_Format_Detection
>
> Mark
>
>
> From: Hdf-forum <hdf-forum-bounces@xxxxxxxxxxxxxxxxxx> on behalf of John
> Shalf <jshalf@xxxxxxx>
> Reply-To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
> Date: Wednesday, March 2, 2016 1:02 PM
> To: HDF Users Discussion List <hdf-forum@xxxxxxxxxxxxxxxxxx>
> Cc: "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>, Ward
> Fisher <wfisher@xxxxxxxx>
> Subject: Re: [Hdf-forum] Detecting netCDF versus HDF5
>
> Perhaps NetCDF (and other higher-level APIs that are built on top of HDF5)
> should include an attribute attached to the root group that identifies the
> name and version of the API that created the file?  (adopt this as a
> convention)
>
> -john
>
> On Mar 2, 2016, at 12:55 PM, Pedro Vicente <
> pedro.vicente@xxxxxxxxxxxxxxxxxx> wrote:
> Hi Ward
> As you know, Data Explorer is going to be a general purpose data reader
> for many formats, including HDF5 and netCDF.
> Here
> http://www.space-research.org/
> Regarding the handling of both HDF5 and netCDF, it seems there is a
> potential issue, which is, how to tell if any HDF5 file was saved by the
> HDF5 API or by the netCDF API?
> It seems to me that this is not possible. Is this correct?
> netCDF uses an internal function NC_check_file_type to examine the first
> few bytes of a file, and for example for any HDF5 file the test is
> /* Look at the magic number */
>    /* Ignore the first byte for HDF */
>    if(magic[1] == 'H' && magic[2] == 'D' && magic[3] == 'F') {
>      *filetype = FT_HDF;
>      *version = 5;
> The problem is that this test succeeds both for any HDF5 file and for any
> netCDF file stored in HDF5, which makes it impossible to tell which is which.
> This in turn makes it impossible for any general purpose data reader to
> decide whether to use the netCDF API or the HDF5 API.
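
The limitation described above can be illustrated with a small Python sketch. It is not the netCDF implementation: it compares the full 8-byte HDF5 format signature rather than only bytes 1 to 3 as the C excerpt does, and the helper name `looks_like_hdf5` and the file names are made up for the example. The two files are stand-ins that contain only a signature, not real HDF5 or netCDF-4 content.

```python
import os
import tempfile

# The 8-byte HDF5 format signature; bytes 1..3 spell "HDF".
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path: str) -> bool:
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Two stand-in files: both begin with the same signature, as a file
# written by the HDF5 API and one written by the netCDF-4 API would.
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, name) for name in ("plain.h5", "model.nc")]
for p in paths:
    with open(p, "wb") as f:
        f.write(HDF5_SIGNATURE + bytes(8))

# The check gives the same answer for both, so it cannot pick an API.
print([looks_like_hdf5(p) for p in paths])
```

Both entries come out the same, which is the point: the signature says "HDF5 container" and nothing about which library produced it.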
> I have a possible solution for this, but before going any further, I
> would just like to
> 1) confirm that this is indeed not possible
> 2) see if you have a solid workaround for this, excluding the dumb
> ones, for example deciding on an extension .nc or .h5, or traversing the
> HDF5 file to see if it's a non netCDF conforming one. Yes, to further
> complicate things, it is possible that the above test says OK for an HDF5
> file, but then the read by the netCDF API fails because the file is an HDF5
> file that is not netCDF conformant.
> Thanks
> ----------------------
> Pedro Vicente
> pedro.vicente@xxxxxxxxxxxxxxxxxx
> http://www.space-research.org/
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> Hdf-forum@xxxxxxxxxxxxxxxxxx
>
> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> Twitter: https://twitter.com/hdf5