Re: [netcdfgroup] NetCDF4 for Fusion Data

Hi John and Ed,

Thanks for your responses.

John Storrs wrote:
> > Further to my previous postings about 'unlimited' dimensions, I now
> > understand the semantics better, and it's apparent that there is a
> > mismatch with the needs of our application.
> >
> > As previously described, we need to archive say 96 digitizer channels
> > which have the same sample times but potentially different sample counts.
> > From a logical point of view, the channel measurements share a single
> > time dimension - some move further along it than others, that's all. They
> > should clearly all reference a single time coordinate variable.
To clarify, the case I'm describing is archiving digitizer data from a
5-second plasma shot. The data is acquired from the digitizers after each
shot (~3000 channels in total), and a number of archive files are written
and permanently stored. The data for each channel has a fixed size, known
at the time the archive file is written. The proposal to use an unlimited
time dimension here is a stratagem to allow a set of variables storing
arrays of different (fixed) lengths to reference a single logical dimension
and a single coordinate array. That's the requirement; if there's a better
way of doing it, I need to know.
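
To make that concrete, here's a minimal sketch of the layout I'm proposing,
written against the netCDF-4 C API (file name, variable names and sizes are
invented for illustration):

#include <stdio.h>
#include <netcdf.h>

#define CHECK(e) do { int rc_ = (e); if (rc_ != NC_NOERR) { \
    fprintf(stderr, "%s\n", nc_strerror(rc_)); return 1; } } while (0)

int main(void)
{
    int ncid, time_dim, v1, v2;
    short a[8] = {0}, b[5] = {0};        /* stand-ins for digitizer data */
    size_t start = 0, na = 8, nb = 5;    /* different fixed sample counts */

    CHECK(nc_create("shot.nc", NC_NETCDF4, &ncid));

    /* One shared logical time dimension, made unlimited only so that
     * each variable can be written to its own (fixed, known) length. */
    CHECK(nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim));
    CHECK(nc_def_var(ncid, "ch01", NC_SHORT, 1, &time_dim, &v1));
    CHECK(nc_def_var(ncid, "ch02", NC_SHORT, 1, &time_dim, &v2));
    CHECK(nc_enddef(ncid));

    /* Each channel is written only to its own sample count; the shared
     * dimension grows to the maximum count across the variables. */
    CHECK(nc_put_vara_short(ncid, v1, &start, &na, a));
    CHECK(nc_put_vara_short(ncid, v2, &start, &nb, b));
    CHECK(nc_close(ncid));
    return 0;
}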

> > Also, we may want to stick with our present compression strategy for
> > time, storing it as a (sequence of) triple: start time, time increment,
> > and count. We might put these values in an attribute of the time
> > coordinate variable, leaving the variable itself empty.
> storing data values in attributes is a bad idea (IMHO). what would be the
> motivation?
I was seeing them as similar to scale_factor and add_offset attributes,
which we will use for digitizer data. They would be components of a new
convention. In the usual simple case of a start time of 0.0 seconds, a
fixed sample clock period P seconds (e.g. 1.0e-6), and maximum size T of
the unlimited time dimension, time coordinates can be stored as "0.0, P,
T". This is a compression strategy. If we want to store the actual data
we'll have to use a double array to avoid loss of precision (single-
precision floats don't have enough) - compressing with the shuffle and
deflate filters reduces the data size by ~95% in some tests, but still
leaves ~2 MB. What's really needed for this type of data is a differential
filter which reduces it to the triples described. Does HDF5 offer that? I
haven't seen it.
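
As a sketch of the convention I have in mind (the attribute name
"time_triple" is just a placeholder, not an existing convention):

#include <stdio.h>
#include <netcdf.h>

#define CHECK(e) do { int rc_ = (e); if (rc_ != NC_NOERR) { \
    fprintf(stderr, "%s\n", nc_strerror(rc_)); return 1; } } while (0)

int main(void)
{
    int ncid, time_dim, time_var;
    /* start time (s), clock period P (s), sample count T */
    double triple[3] = { 0.0, 1.0e-6, 5.0e6 };

    CHECK(nc_create("shot.nc", NC_NETCDF4, &ncid));
    CHECK(nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim));
    CHECK(nc_def_var(ncid, "time", NC_DOUBLE, 1, &time_dim, &time_var));

    /* Store the triple on the (otherwise empty) coordinate variable.
     * A reader reconstructs coordinate i as triple[0] + i * triple[1],
     * for i = 0 .. triple[2]-1, instead of reading ~2 MB of doubles. */
    CHECK(nc_put_att_double(ncid, time_var, "time_triple",
                            NC_DOUBLE, 3, triple));
    CHECK(nc_close(ncid));
    return 0;
}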

> > The 'unlimited' semantics go only halfway to matching this requirement.
> > At the HDF5 storage level, all is well. H5dump shows that the stored
> > size of each variable is its own initialized size, not the maximum
> > initialized size over all the variables, which is evidently what the
> > dimension is set to. So far so good, but ncdump shows all the data
> > padded to that size, reducing its usefulness. This is presumably
> > because the dimension provides the only size exposed by the API, unless
> > I'm overlooking something. HDF5 knows about the initialized sizes, but
> > NetCDF doesn't expose them. So we cannot easily read the data and
> > nothing but the data. Do you have an initialized size inquiry function
> > tucked away somewhere, or do we have to store the value as an attribute
> > with each variable?
>
> I would store the "actual size" of each variable as another variable. if
> you are constantly changing it, don't store it as an attribute. if it
> only changes occasionally, an attribute would probably be ok.
>
> Since you want these variables to share the same time coordinate, you
> have to pay the price that all of them logically have the same length.
> Generic applications will work fine, seeing missing values. Your
> specialized programs can take advantage of the extra info and operate
> more efficiently.
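
Our archive files are write-once, so the count never changes after the
file is created; an attribute per channel therefore seems the natural fit.
Continuing the sketch above (the attribute name "actual_size" is again
just a placeholder):

/* Record the channel's true sample count next to its data;
 * "actual_size" is a placeholder name, not an existing convention. */
int na_attr = (int)na;
CHECK(nc_put_att_int(ncid, v1, "actual_size", NC_INT, 1, &na_attr));
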
>
> > I don't think I want to explore VLEN to crack this, because it's new and
> > would complicate things. It seems to me that this is a use case others
> > will encounter, which needs a tidy solution. Any thoughts? I have to
> > present a strong case for NetCDF here next week, to counter an HDF5
> > proposal which doesn't have this problem, though it has many others.
> >
> > Another point: nc_inq_ncid returns NC_NOERR if the named group doesn't
> > exist. Is this intentional?
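
A minimal demonstration of what I mean (the group name is invented); I
would expect an error such as NC_ENOGRP here, but NC_NOERR comes back:

/* grpid is an output handle; "no_such_group" was never created. */
int grpid, status;
status = nc_inq_ncid(ncid, "no_such_group", &grpid);
printf("status = %d (%s)\n", status, nc_strerror(status));
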
> >
> > Regards
> > John
> >
> > On Tuesday 20 January 2009, John Storrs wrote:
> >> I've uncovered a couple of problems:
> >>
> >> (1) Variables in an explicitly defined group, using the same 'unlimited'
> >> dimension but of different initialized sizes, result in an HDF error
> >> when ncdump is run (without flags) on the generated NetCDF4 file. No
> >> problems are reported when the file is generated (all netcdf call return
> >> values are checked in the usual way). The dimension is defined in the
> >> root group. Try writing data of size S to one variable, and size < S to
> >> the next. This error isn't seen if the variables are all in the root
> >> group. In that case, ncdump fills all variables to the maximum size,
> >> which I suppose is a feature and not a bug. An ncdump flag to disable
> >> this feature would be useful.
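
For reference, the failing case is the sketch at the top of this message
with the two channel variables moved into a subgroup, roughly (group name
invented):

CHECK(nc_def_grp(ncid, "raw", &grpid));                /* subgroup */
CHECK(nc_def_var(grpid, "ch01", NC_SHORT, 1, &time_dim, &v1));
CHECK(nc_def_var(grpid, "ch02", NC_SHORT, 1, &time_dim, &v2));
/* writing nb < na samples to ch02, then running ncdump on the
 * resulting file, produced the HDF error described above */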



--
John Storrs, Experiments Dept      e-mail: john.storrs@xxxxxxxxxxxx
Building D3, UKAEA Fusion                               tel: 01235 466338
Culham Science Centre                                    fax: 01235 466379
Abingdon, Oxfordshire OX14 3DB              http://www.fusion.org.uk



