Ed, Jeff
Thanks for your prompt and useful comments.
I attach minimal C code showing the 'unlimited dimension' problem I reported.
It takes forever to run when the chunk definition call is disabled, but runs
fine when the call is enabled. I also attach a synthetic cdl file, generated
from a data acquisition configuration file, showing typical data
organisation. Our current archive file format (an old in-house system)
allows '/' as a character in flat variable names. Initially with NetCDF/HDF5
we'll stick to the old names, using groups for the hierarchy.
Further comments below:
On Friday 16 January 2009, Ed Hartnett wrote:
> John Storrs <john.storrs@xxxxxxxxxxxx> writes:
> > We are evaluating HDF5 and NetCDF4 as archive file formats for fusion
> > research data. We would like to use the same format for experimental
> > (shot-based) data and modelling code data, to get the benefits of
> > standardisation (one API to learn, one interface module to write for
> > visualization tool access, etc). A number of fusion modelling codes use
> > NetCDF. NetCDF for experimental data will be new though, so far as I
> > know. I've found some problems in shot data archiving tests which need
> > to be resolved for it to be considered further.
> >
> > MAST (Mega-Amp Spherical Tokamak) shot data (from magnetic sensors etc)
> > is mostly digitized in the range 1kHz to 2MHz. MAST shots are currently
> > less than 1 second in duration, but 5 second shots are foreseen (some
> > other experiments have much longer shot times). We use up to 96-channel
> > digitizers. Acquisition start time and sample period are common to a
> > digitizer, but the number of samples per channel sometimes varies - that
> > is, some channels may be sampled for a longer time than others. Channel
> > naming is hierarchical.
>
> I wonder if you could send the CDL of the test files you've come up
> with (i.e. run ncdump -h on the files).
> This would make your proposed data structures more clear.
> Also is this code in C, fortran, C++, Java? Or something else?
C is the only thing I've tried so far. If we go for NetCDF4 we will need IDL.
Has anyone provided a full IDL interface yet? We could do that if not.
> > There are two NetCDF-related issues here. The first is how to store the
> > channel data, the second how to store time, both efficiently of course.
> > We want per-variable compression. We don't want uninitialised value
> > padding in variable data, even if it would be efficiently compressed. In
> > the normal case where acquisition start time and sample period are common
> > to all channels in a dataset, we would prefer to define just one
> > dimension, not many if channel data array sizes vary.
>
> Do you then store each channel in a different variable? Would it make
> sense to use two dimensions, time and channel?
Digitizer channels have names identifying detectors. Different channels
potentially have different attributes (units, scale, offset etc). So each
channel has to be stored in a separate variable in NetCDF/HDF5. No problem
with that.
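For example, a single channel might look like this in CDL (a sketch only: the
group and variable names, and the 'scale'/'offset' attribute names, are
illustrative, not a NetCDF convention):

```
group: mirnov {
  variables:
    short c01(time);
      c01:units = "T";
      c01:scale = 0.001;
      c01:offset = 0.0;
}
```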
>
> Can I ask why you don't want the uninitialized values stored in the
> file, even if they are compressed away? (Which will not happen unless
> you set fill mode, BTW, since unfilled values will contain random bits
> that will not compress well).
We could do that if writes are very efficient, and if a read can return the
initialized data efficiently. Digitized channel data will be stored as shorts
with float or double offset and scale, and we will have to provide wrapper
read functions returning the initialized data in float or double arrays. Is
the initialized data size known from a system attribute, without looking
for trailing fill bytes?
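As a sketch of the sort of wrapper read we have in mind (pure C, names
illustrative; in real code 'raw' would come from nc_get_vara_short and
'nvalid' from nc_inq_dimlen on the time dimension, so trailing fill bytes
are never touched):

```c
#include <stddef.h>

/* Convert raw digitizer counts to engineering units, applying the
 * channel's scale and offset. Only the first 'nvalid' samples are
 * converted, so any trailing padding is ignored. */
void counts_to_double(const short *raw, size_t nvalid,
                      double scale, double offset, double *out)
{
    size_t i;
    for (i = 0; i < nvalid; i++)
        out[i] = raw[i] * scale + offset;
}
```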
>
> An alternative would be to use the newly introduced variable length
> arrays in netCDF-4. In this case you don't store any padding
> values. But using VLENs means that existing netcdf code will not work
> on the resulting data files, as VLEN was just introduced, and existing
> code will not know how to deal with it.
I haven't looked at this yet. Using 'unlimited' with a chunk size defined
seems to do the required job.
>
> For your code this might not matter much, because you are writing it
> from scratch. But it also means that existing visualization packages
> will not cope with the data.
>
> > NetCDF4 tests with a single fixed dimension, writing varying amounts of
> > data to uncompressed channel variables, show that the variables are
> > written to the archive file with padding, even in no_fill mode. The file
> > size is independent of the amount of data written.
>
> No_fill mode doesn't mean that the file size will change, just that
> the program will not take the time to initialize all the data to the
> fill value.
>
> Are you saying that it *is* initializing the data to a fill value?
> (That would be a bug.) Or just that the file size indicates that the
> data values are there (but filled with junk)?
I just looked at the file size, not the contents.
>
> For example, a 10x10 array will have 100 values whether fill mode is
> off or on. If it is on, then they will all be initialized to the fill
> value. If it is off they will just contain garbage.
>
> > NetCDF4 tests with a single unlimited dimension work for very small
> > dimension sizes, but take forever to write even a single 4 MSample
> > channel variable (we are using HDF5 1.8.2 if that's relevant to this
> > problem). That looks the right way to go if the processing time and
> > memory overhead is small, but we can't test it.
>
> Probably this is a chunksize problem. NetCDF-4 selects a default set
> of chunk sizes when you create a variable. Change them with the
> nc_def_var_chunking function (after defining the variable, but before
> the next enddef).
Correct. It works fine with any reasonably defined chunk size. The inquiry
function reports zero chunk size before one is defined. Is that the default?
>
> For very large variables, the default chunksizes don't work well in
> the 4.0 release. (This is fixed for the upcoming 4.0.1 release, so you
> can get the daily snapshot and see if this problem is better. Get the
> daily snapshot here:
> ftp://ftp.unidata.ucar.edu/pub/netcdf/snapshot/netcdf-4-daily.tar.gz)
I haven't tried this yet.
>
> Try setting the chunksizes to something reasonable by inserting the
> nc_def_var_chunking function call. For example, try setting it to the
> size of the array of data that you are writing in one call to
> nc_put_vara_*. (Or some integer multiple of that).
This works.
>
> Chunking is complex, but only important if you are I/O bound. (As you
> apparently will be, with your 4 MSample case, using the 4.0 default
> chunksizes.)
>
> > Coming to storage of the time coordinate variable. If we actually store
> > the data, it will need to be in a double array to avoid loss of
> > precision. Alternatively we could define the variable as an integer with
> > a double scale and offset. Both of these sound inefficient.
> > Traditionally we store this type of data as a (sequence of) triple: start
> > time, time increment, count. Clearly we can do that within a convention,
> > expanding it in reader code. How should we handle this?
>
> Would the CF conventions time coordinate work for you? This is a start
> time stored as an attribute, and then the time of each observation
> stored in the coordinate variable. For example:
>
> double time(time) ;
> time:long_name = "time" ;
> time:units = "days since 1990-1-1 0:0:0" ;
>
> (Of course, you would want seconds, not days).
>
> For more, see:
> http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.4/cf-conventions.html#time-coordinate
We'll have to think more about this. Now that I've seen the high compression
(e.g. 95%) you get with shuffle and deflate on a double array containing
sequential time values, I'm wondering if the old compression techniques we
use can be consigned to the trash can.
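For comparison, expanding one of our (start, increment, count) triples in
reader code is trivial; a minimal sketch (function name illustrative, and a
real timebase may be a sequence of such triples):

```c
#include <stddef.h>

/* Expand a (start, increment, count) time triple into an explicit
 * array of sample times, as a reader wrapper would before handing
 * data to a caller. */
void expand_timebase(double start, double step, size_t count, double *t)
{
    size_t i;
    for (i = 0; i < count; i++)
        t[i] = start + (double)i * step;
}
```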
>
> Thanks,
>
> Ed
--
John Storrs, Experiments Dept e-mail: john.storrs@xxxxxxxxxxxx
Building D3, UKAEA Fusion tel: 01235 466338
Culham Science Centre fax: 01235 466379
Abingdon, Oxfordshire OX14 3DB http://www.fusion.org.uk
___________________________________________________________________________
netcdf4test.c:
#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

#define FILE_NAME "netcdf4test.nc"
#define SIZE 500000

/* Handle errors by printing an error message and exiting with a
 * non-zero status. */
#define ERRCODE 2
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); exit(ERRCODE);}

int
main()
{
    int retval, i, ncid, t_dimid, tid, chunked;
    double *time;
    size_t start, count, chunksize;

    time = malloc(SIZE * sizeof(double));
    for (i = 0; i < SIZE; i++)
        time[i] = i * 2.0e-6;

    if ((retval = nc_create(FILE_NAME, NC_NETCDF4|NC_CLOBBER, &ncid)))
        ERR(retval);
    if ((retval = nc_def_dim(ncid, "t", NC_UNLIMITED, &t_dimid)))
        ERR(retval);
    /* NC_DOUBLE, not NC_FLOAT: microsecond-resolution time values
     * lose precision when stored as floats. */
    if ((retval = nc_def_var(ncid, "t", NC_DOUBLE, 1, &t_dimid, &tid)))
        ERR(retval);
    if ((retval = nc_inq_var_chunking(ncid, tid, &chunked, &chunksize)))
        ERR(retval);
    printf("before setting chunksize, chunked = %d, chunksize = %d\n",
           chunked, (int)chunksize);

#if 1   /* disable this block to reproduce the slow unlimited-dimension case */
    chunksize = SIZE;
    if ((retval = nc_def_var_chunking(ncid, tid, NC_CHUNKED, &chunksize)))
        ERR(retval);
#endif
#if 0
    if ((retval = nc_def_var_deflate(ncid, tid, NC_SHUFFLE, 1, 1)))
        ERR(retval);
    if ((retval = nc_def_var_fletcher32(ncid, tid, 1)))
        ERR(retval);
#endif

    start = 0;
    count = SIZE;
    if ((retval = nc_put_vara_double(ncid, tid, &start, &count, &time[0])))
        ERR(retval);
    if ((retval = nc_close(ncid)))
        ERR(retval);
    free(time);
    printf("*** SUCCESS writing example file netcdf4test.nc!\n");
    return 0;
}
_________________________________________________________________________
xmb.cdl
netcdf xmb {
dimensions:
  time = unlimited;
variables:
  double time(time);

group: xmb_halo {
  group: elm {
    group: l {
      variables:
        short 1(time);
        short 2(time);
        short 3(time);
        short 4(time);
    }
    group: u {
      variables:
        short 1(time);
        short 2(time);
        short 3(time);
        short 4(time);
    }
  }
  group: mbd {
    variables:
      short 1(time);
      short 2(time);
      short 3(time);
      short 4(time);
      short 5(time);
      short 6(time);
  }
  group: p2l {
    variables:
      short 1(time);
      short 2(time);
      short 3(time);
      short 4(time);
      short 5(time);
      short 6(time);
  }
  group: p2u {
    variables:
      short 1(time);
      short 2(time);
      short 3(time);
      short 4(time);
      short 5(time);
      short 6(time);
  }
  group: p3l {
    variables:
      short 1(time);
      short 2(time);
      short 3(time);
      short 4(time);
      short 5(time);
      short 6(time);
  }
  group: p3u {
    variables:
      short 1(time);
      short 2(time);
      short 3(time);
      short 4(time);
      short 5(time);
      short 6(time);
  }
}

group: xmb_phalo {
  group: epl {
    variables:
      short i1(time);
      short i2(time);
      short i3(time);
      short i4(time);
      short o1(time);
      short o2(time);
      short o3(time);
      short o4(time);
      short o5(time);
      short o6(time);
  }
  group: epu {
    variables:
      short i1(time);
      short i2(time);
      short i3(time);
      short i4(time);
      short o1(time);
      short o2(time);
      short o3(time);
      short o4(time);
      short o5(time);
      short o6(time);
  }
  group: p2l {
    variables:
      short i1(time);
      short i2(time);
      short i3(time);
      short i4(time);
      short i5(time);
      short i6(time);
      short o1(time);
      short o2(time);
      short o3(time);
      short o4(time);
  }
  group: p2u {
    variables:
      short i1(time);
      short i2(time);
      short i3(time);
      short i4(time);
      short i5(time);
      short i6(time);
      short o1(time);
      short o2(time);
      short o3(time);
      short o4(time);
  }
}

group: xmb_sad {
  group: out {
    variables:
      short l01(time);
      short l02(time);
      short m01(time);
      short m010(time);
      short m011(time);
      short m012(time);
      short m02(time);
      short m03(time);
      short m04(time);
      short m05(time);
      short m06(time);
      short m07(time);
      short m08(time);
      short m09(time);
      short u01(time);
      short u07(time);
  }
}
}