Hi John, Pedro and Russ,
Thank you very much for your very useful replies.
My strategy was to start off with the simplest method of translating our XML
schema of interest (http://www.peptideatlas.org/tmp/mzML1.1.0.html) to HDF5,
and then to iteratively optimise it towards HDF5/NetCDF4's strengths, measuring
the improvement at each step. It seems I cannot start with this simplistic
mapping because of the limit on the number of groups (for example, there could
be 100,000s of groups under 'SpectrumList', since 100,000s of individual
spectra can potentially be recorded in one run of the instrument).
I'm thinking now that it is probably best to give up on a 'pure' HDF5
implementation and instead encapsulate the XML metadata as an HDF5 string,
with the binary data (which is currently Base64-encoded in the XML) stored as
HDF5 datasets. However, there could potentially be 100,000s of these 1D
datasets, so it might be best either to store them in a single ragged array,
or to concatenate them and store a separate index, as sketched below.
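For illustration, something along these lines is what I have in mind for the
'concatenate and index' option, using the HDF5 C API from C++ (the dataset
names and the fixed double type are just placeholders for discussion, not a
proposed layout):

// Sketch: concatenate all spectra into one 1D dataset and record the start
// offset of each spectrum in a second dataset, so spectrum i spans
// [offsets[i], offsets[i+1]).
#include <hdf5.h>
#include <cstdint>
#include <vector>

void write_concatenated(hid_t fid,
                        const std::vector<std::vector<double> >& spectra)
{
    std::vector<double>  values;        // all intensities, back to back
    std::vector<int64_t> offsets(1, 0); // N+1 start offsets
    for (size_t i = 0; i < spectra.size(); ++i) {
        values.insert(values.end(), spectra[i].begin(), spectra[i].end());
        offsets.push_back(static_cast<int64_t>(values.size()));
    }

    hid_t gid = H5Gcreate2(fid, "/spectra",
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Gclose(gid);

    // one long 1D dataset with every spectrum concatenated
    hsize_t vdim = values.size();
    hid_t vspace = H5Screate_simple(1, &vdim, NULL);
    hid_t vset   = H5Dcreate2(fid, "/spectra/values", H5T_NATIVE_DOUBLE, vspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(vset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
             values.data());

    // separate index dataset of offsets
    hsize_t odim = offsets.size();
    hid_t ospace = H5Screate_simple(1, &odim, NULL);
    hid_t oset   = H5Dcreate2(fid, "/spectra/offsets", H5T_NATIVE_INT64, ospace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(oset, H5T_NATIVE_INT64, H5S_ALL, H5S_ALL, H5P_DEFAULT,
             offsets.data());

    H5Dclose(vset); H5Sclose(vspace);
    H5Dclose(oset); H5Sclose(ospace);
}

The XML metadata would then sit alongside these as a single string dataset.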
Best regards,
Andy
From: John Caron <caron@xxxxxxxx>
Date: Tuesday, 18 November 2014 15:28
To: Pedro Vicente <pedro.vicente@xxxxxxxxxxxxxxxxxx>
Cc: Andrew Dowsey <andrew.dowsey@xxxxxxxxxxxxxxxx>, Russ Rew <russ@xxxxxxxxxxxxxxxx>, "netcdfgroup@xxxxxxxxxxxxxxxx List" <netcdfgroup@xxxxxxxxxxxxxxxx>, "hdf-forum@xxxxxxxxxxxxxxxxxx" <hdf-forum@xxxxxxxxxxxxxxxxxx>
Subject: Re: [netcdfgroup] Problem with nc_inq_grps and ncdump reading HDF5 file
Hi Andrew:
It's unlikely that having that many groups is a good design, although there
might be exceptions. What are you using them for?
John
On Mon, Nov 17, 2014 at 5:18 PM, Pedro Vicente <pedro.vicente@xxxxxxxxxxxxxxxxxx> wrote:
Andrew
If you are writing a new format, my guess is that you are also writing an API
for it. Since I have been writing these high-level APIs for years (starting
with the HDF5 High-Level API itself), and I am writing one now, here's my
advice :-)
1) Do not use groups as a means to access your data (more on this later).
2) Encapsulate all HDF5 IDs.
In my case, I am using a model where datasets are accessed for read/write
using their full path name, e.g. "/path/to/dataset". So "/path/to/" is the
group path and "dataset" is the dataset's relative name.
API usage for a write is

write("/path/to/dataset", buffer);

so the model for read/write is just the full dataset name as a string plus a
buffer with the data. The group part is implicit in the string; as you can
see, there are no HDF5 group IDs or group paths in the read/write calls (that
is what I mean by "do not use groups as a means to access your data").
How can you accomplish this?
Here's an example of the member functions of a C++ class covering this part:

void create_group(const std::string& group_name);
void create_dataset(const std::string& group_path,
                    const std::string& dataset_name
                    /* other parameters here, like dataset size or chunk size */);
void write(const std::string& path, const void* buf);
Let's start with the create_group function. Here, just create the group using
the HDF5 C API (I use the C API even in a C++ program; the HDF5 C++ API adds
nothing of use here):

hid_t gid = H5Gcreate2(this->m_fid, group_name.c_str(),
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Gclose(gid);

As can be seen, the group is created and closed immediately. This means you
will have no groups open at all during your read/write calls (typically the
core part of a program's execution). Since groups are closed immediately, you
can even exceed the maximum number of open groups Russ mentioned.
For the dataset creation, inside create_dataset(group_path, dataset_name, ...),
concatenate the group path with the dataset name and store it somewhere in
your C++ class:

std::string absolute_dataset_name = group_path + "/" + dataset_name;

In my case, I am using a map from the dataset's full path to its HDF5 ID,
because I want to keep the datasets open (in memory), but any other way to
store the path will do:

// map of dataset name (full path) to HDF5 dataset ID
std::map<std::string, hid_t> m_map_datasets;
Here is the HDF5 create call (type_id and space_id come from the size/type
parameters passed in):

hid_t did = H5Dcreate2(this->m_fid, absolute_dataset_name.c_str(),
                       type_id, space_id,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

The trick here is to use the file ID (which must be stored in the class) and
the full path. This takes group IDs out of the equation.
For the write/read call, use simply

write(const std::string& path, const void* buf)

Get the dataset ID from the full path:

// get the dataset ID from the map <path, ID>
hid_t did = this->m_map_datasets[path];

and use it in the write call:

H5Dwrite(did, mem_type_id, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
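To make it concrete, here is a minimal, self-contained sketch of such a class,
hard-wired to 1D double datasets just to keep it short (adapt the
create_dataset parameters to your own needs):

#include <hdf5.h>
#include <map>
#include <string>

class store
{
public:
    explicit store(const std::string& file_name)
    {
        m_fid = H5Fcreate(file_name.c_str(), H5F_ACC_TRUNC,
                          H5P_DEFAULT, H5P_DEFAULT);
    }

    ~store()
    {
        // close the datasets kept open in the map, then the file
        for (std::map<std::string, hid_t>::iterator it = m_map_datasets.begin();
             it != m_map_datasets.end(); ++it)
            H5Dclose(it->second);
        H5Fclose(m_fid);
    }

    // create the group and close it immediately; no group ID is kept
    void create_group(const std::string& group_name)
    {
        hid_t gid = H5Gcreate2(m_fid, group_name.c_str(),
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Gclose(gid);
    }

    // create a 1D double dataset and remember its ID, keyed by full path
    void create_dataset(const std::string& group_path,
                        const std::string& dataset_name, hsize_t size)
    {
        std::string absolute_dataset_name = group_path + "/" + dataset_name;
        hid_t sid = H5Screate_simple(1, &size, NULL);
        hid_t did = H5Dcreate2(m_fid, absolute_dataset_name.c_str(),
                               H5T_NATIVE_DOUBLE, sid,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Sclose(sid);
        m_map_datasets[absolute_dataset_name] = did;
    }

    // write a whole dataset given only its full path and a data buffer
    void write(const std::string& path, const void* buf)
    {
        hid_t did = m_map_datasets[path];
        H5Dwrite(did, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    }

private:
    hid_t m_fid;                                  // file ID, the only long-lived ID besides datasets
    std::map<std::string, hid_t> m_map_datasets;  // full path -> HDF5 dataset ID
};

Usage is then just strings and buffers:

store s("run.h5");
s.create_group("/spectrum");
s.create_dataset("/spectrum", "intensity", 3);
double data[3] = { 1.0, 2.0, 3.0 };
s.write("/spectrum/intensity", data);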
Hope this helps and ask any further questions if needed.
By the way, I downloaded your groups-only file; it has a size of 37 MB, so
that is a lot of group metadata.
-Pedro
----------------------
Pedro Vicente
pedro.vicente@xxxxxxxxxxxxxxxxxx
http://www.space-research.org/
----- Original Message -----
From: Russ Rew <russ@xxxxxxxxxxxxxxxx>
To: Andrew Dowsey <andrew.dowsey@xxxxxxxxxxxxxxxx>
Cc: netcdfgroup@xxxxxxxxxxxxxxxx
Sent: Monday, November 17, 2014 2:11 PM
Subject: Re: [netcdfgroup] Problem with nc_inq_grps and ncdump reading HDF5 file
Hi Andrew,
You've run across a limitation in the number of simultaneously open groups
permitted in netCDF-4. The groups documentation
(http://www.unidata.ucar.edu/netcdf/docs/group__groups.html) says:

... Encoding both the open file id and group id in a single integer currently
limits the number of groups per netCDF-4 file to no more than 32767. Similarly,
the number of simultaneously open netCDF-4 files in one program context is
limited to 32767.
I think both those limits should actually be 65535 (== 2**16 - 1), but in any
case, your HDF5 file has 119254 groups, which is too many for netCDF-4 to
handle.
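As an illustration only (this is not the actual netCDF-4 source, just the
arithmetic behind that kind of encoding): packing two numbers into the 16-bit
halves of one 32-bit id gives each half at most 2**16 = 65536 distinct values,
and if the id is a signed int whose sign bit must stay clear, the upper half
effectively loses a bit, which is where a figure like 32767 (2**15 - 1) comes
from.

#include <cstdint>

// pack a file number and a group number into one 32-bit id
inline uint32_t make_id(uint16_t file_no, uint16_t group_no)
{
    return (static_cast<uint32_t>(file_no) << 16) | group_no;
}

inline uint16_t file_of(uint32_t id)  { return static_cast<uint16_t>(id >> 16); }
inline uint16_t group_of(uint32_t id) { return static_cast<uint16_t>(id & 0xFFFFu); }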
The only workaround I can think of would be to close some groups if you don't
need to have them all open simultaneously.
--Russ
On Mon, Nov 17, 2014 at 8:14 AM, Andrew Dowsey <andrew.dowsey@xxxxxxxxxxxxxxxx> wrote:
Hi,
I’m trying to create HDF5 files that can be read by NetCDF4 and I’ve run into a
problem in that nc_inq_grps seems to report some bad ids. ncdump bails with
this error too. h5dump works fine. The problem is deterministic but I haven’t
been able to figure out what causes it because slightly different HDF5 files
work fine. I have created a test file that has this problem, which contains
nothing but groups. It can be downloaded from
http://personalpages.manchester.ac.uk/staff/andrew.dowsey/test.h5
I am creating a new format for a type of instrument data we use, and for
flexibility I would like it to be writeable/readable both by HDF5 and netCDF4
libraries.
Any insight would be greatly appreciated!
Kind regards,
Andy
Andrew Dowsey PhD
Lecturer and CADET Bioinformatics Research Lead
Institute of Human Development, The University of Manchester
t: +44 161 701 0244
f: +44 161 701 0242
http://www.manchester.ac.uk/research/andrew.dowsey
Centre for Advanced Discovery and Experimental Therapeutics (CADET)
Central Manchester University Hospitals NHS Foundation Trust
Oxford Road
Manchester M13 9WL
UK
_______________________________________________
netcdfgroup mailing list
netcdfgroup@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit:
http://www.unidata.ucar.edu/mailing_lists/