Lloyd Treinish has done an excellent job of thinking through how to use netCDF "as is"
to represent complex datatypes that are not inherently supported in netCDF
now. We in the analytical instrument community have requirements that
go far beyond where netCDF is currently. We need to support more complex
data models sooner rather than later. Many kudos to Lloyd for taking the
next major step -- again!!
From my quick reading of Lloyd's comments, the conventions used in Data
Explorer (included below) detail a way to implement some parts of a more
extensive data model using conventions in CDL. I think the description
is very useful to describe how one might use netCDF and CDL to store more
complex datatypes in netCDF files.
I'd like to see the technical requirements and scope of those requirements
that the Data Explorer data model addresses now. Lloyd, is the Data Explorer
data model specification a public-domain document?
Data Explorer sounds like a great package. It appears to address many
requirements for several different domains of science.
--------------------------------------------------------------------------
We need to advance to other issues not addressed by Lloyd's input. I feel
that we still haven't fully addressed the question of standards.
There are very important business, organizational, and people constraints
on any solution that will be WIDELY accepted, i.e., become a standard.
The analytical instrument vendors, universities, government agencies, and
end user companies that I've been working with on analytical data standards
over the past 4 years have said, "If we have to buy it from 'company x', then
it's not an open standard, and we don't want it."
A major problem with Data Explorer (and other commercial systems "more
advanced" than netCDF) is that it is proprietary, and requires paid royalties
to a for-profit company. I've been shot down hard for proposing proprietary
technologies to standards groups and other researchers that, for whatever
reason, feel they must base their work on public-domain standards.
Until a public-domain version is made available that is free of charge,
available over Internet, and is supported by a vendor-independent software
engineering support group like Unidata or NASA, Data Explorer (or any other
commercial package) doesn't serve the major needs for universities, standards
communities, and even many sections of industry for scientific data
interchange and storage. I've hit up against this hard "reality" many
times.
If such a public-domain version of a generic package (Data Explorer or any
other package) for scientific data interchange and storage were made
available to the scientific community, it must not be made available as
a "scaled-down" version, that requires someone to buy the commercial version
to get the full functionality. Unidata doesn't use such "hooks", because
they don't serve Unidata's clientele.
We must not lose sight of the fact that technical solutions by themselves
are not complete solutions or business solutions, whether your "business"
is university research, industrial R&D, or government R&D. Too many technical
solutions fail to make it "to market" because they are technical solutions
only, and fail to satisfy all the other requirements, particularly business,
organizational, and people.
This is not a soapbox conversation. I've had to take long hard looks at
what is making the analytical data standards successful. The technical part
of it (netCDF software) is an important, yet small part of the solution.
This is not always easy for technical people (including myself) to accept.
The vendor-independent software support center (Unidata) is an organizational
factor that is crucial to the success of netCDF. However, to be successful,
the full range of requirements must be included in the solution. Unidata has
done a better job of addressing the full range of requirements than most other
organizations I've seen.
Unidata does an enormous amount of work to make sure their codes are fully
available on all the major platforms, with no particular bias toward any group
of users or vendors. They should be commended for all their great work.
I hope that this discussion leads to a broader discussion of requirements
for systems and solutions in the future. This may be controversial, but it
is meant to advance the scientific community at large.
NetCDF has a broad applicability, and it needs to be extended to meet some
of the requirements beyond those that Lloyd and others have begun to address
in various memos. This is a good time to start discussing the broader
requirements for the future versions.
Your feedback on this note will be much appreciated.
Rich Lysakowski
Director, Analytical Data Interchange and Storage Standards Project
=======================================================================
From: DECWRL::"lloydt@xxxxxxxxxxxxxx" "Lloyd A. Treinish" 25-Nov-92 12:17
To: netcdfgroup@xxxxxxxxxxxxxxxx
CC: goucher@xxxxxxxxxxxxxxxxxxxx, ravi@xxxxxxxxxxx, hdf-netcdf@xxxxxxxxxxxxx
Subj: netCDF and "complex" data
There has been much discussion (mostly last month) about storing complex
structures in netCDF. I have been meaning to respond, but I have been very
busy. A lot of the e-mail traffic duplicates a few things that we talked
about at NSSDC a few years ago for CDF, but had not implemented at that time.
However, here at the Visualization Systems Group at IBM T. J. Watson Research
Center, the discussion also duplicates much of what was considered
and then implemented almost two years ago as part of the development of the
IBM Visualization Data Explorer software.
Therefore, I have attached a brief document that describes some of what we
have done in the context of importing data stored in netCDF for your
consideration. I believe it addresses many of the issues that have been
discussed in this forum. Keep in mind this is NOT a proposal, but an
outline of an actual implementation that is available in a commercial
software product today. Although the ideas can be cast in a form
independently of Data Explorer, that software does fully support them. If
you have any questions, and especially if you have any comments (positive
or negative), please let me know.
Lloyd Treinish
-------------------------------------------------------------------------------
Importing netCDF data into IBM Visualization Data Explorer
Lloyd A. Treinish
Visualization Systems Group
IBM T. J. Watson Research Center
P. O. Box 704
Yorktown Heights, NY 10598
lloydt@xxxxxxxxxxxxxx
The IBM Visualization Data Explorer (DX) is a general-purpose software package
for scientific data visualization. It employs a data-flow-driven client-
server execution model and is currently available on five platforms: IBM POWER
Visualization Systems (a medium-grain, shared memory parallel supercomputer)
and workstations -- IBM RISC System/6000, Silicon Graphics Indigo and Crimson,
Hewlett-Packard 700 and Sun Sparcstation 2. DX is built on a foundation of an
internal data model, which describes and provides uniform access services for
any data brought into, generated by, or exported from the software. Hence, it
has a notion of supporting a number of different classes of interesting
scientific data, which can be described by its shape (size and number of
dimensions), rank (e.g., scalar, vector, tensor), type (float, integer, byte,
etc. or real, complex, quaternion), where the data are located in space
(positions), how the locations are related to each other (connections), and
aggregates or groups (e.g., hierarchies, series, composites, etc.). It also
supports those entities required for graphics and imaging operations within
the context of Data Explorer. Generically, these are called "objects". The
DX data model is supported with an applications programming interface (API)
for users or developers to create functions or operations (i.e., modules) for
DX. At the user-level (i.e., via a graphical user interface, visual
programming or scripting-language programming) the details of the data model
and this interface are hidden. An important consequence of this approach is
that modules are polymorphic. In addition, there is an external
representation, the native dx format. It is a multiple sequential file
representation of DX objects. The DX data model is quite rich. Most of what
it can support is not directly expressible by netCDF. Therefore, a
methodology to extend netCDF for use with Data Explorer was developed. For
more information about the DX architecture and data model see, for example,
R. Haber et al., "A Data Model for Scientific Visualization with Provisions for
Regular and Irregular Grids", Proceedings IEEE Visualization '91 Conference,
pp. 298-305, October 1991; B. Lucas et al., "An Architecture for a Scientific
Visualization System", Proceedings IEEE Visualization '92, pp. 107-113,
October 1992; and "IBM Visualization Data Explorer User's Guide, Second
Edition", IBM Document Number SC38-0496-1, August 1992.
NetCDF is a data abstraction for (self-describing) multi-dimensional blocks.
The descriptions are in terms of attributes, which may be assigned globally or
to one or more variables (i.e., a multi-dimensional block). NetCDF in the DX
context provides a portable and commonly-used API (C and a veneer layer for
FORTRAN 77), and a fixed, portable physical file structure (a single XDR file)
in the public domain. NetCDF only knows about arrays of scalars and is a
carrier for them and their descriptions. There is NO knowledge or semantics
embedded with regard to any other structure. Since such arrays are inherently
flat and rectilinear, there is typically insufficient information to define
suitable objects for import to DX, especially for irregular or hierarchical
data. A netCDF user is free to define custom conventions for the array
storage and attribute nomenclature. In this sense it is possible to create a
mechanism to support a limited set of other structures on top of the array
"protocol". However, this also means that a generic netCDF reader would only
be able to report contents and be unable to operate on any underlying context.
Such a context for the creation of DX objects for their importation has been
defined. Any system that attempts to support structures more complex than
what raw netCDF handles would have to deal with this situation. The notion of
being able to import any random netCDF and create the correct DX object is NOT
possible given the limited netCDF vocabulary. The exception would be for a
very limited class of regular/rectilinear arrays, in which any more complex
structure is ignored (e.g., a simple image). The DX convention for simple
regular data is essentially based on that idea. Hence, a visualization
package that is only capable of dealing with "native" netCDF data would have to
have limited functionality. DX is capable of dealing with a far greater
variety of complex data, only a subset of which can be expressed effectively in
netCDF even when one does so via external constraints.
What are these aforementioned conventions? The netCDF vocabulary is not
sufficiently rich nor at a high enough level to adequately describe the kinds
of objects that must be supported for general visualization and analysis.
This is a result of the heritage of the original CDF implementation at
NASA/GSFC in the mid-1980s. Although the current CDF implementation at NASA
does address a few of these limitations, both netCDF and CDF still are focused
primarily on a relatively low-level abstraction -- multidimensional blocks.
DX objects can be decomposed to a lower level, that of multidimensional
arrays. However, the DX array objects are more flexible than those of the
CDF/netCDF model because they support rank and shape/dimensionality
independently. Nevertheless, netCDF can be used as a carrier of
self-describing multidimensional arrays, whose descriptions, when following a
certain convention, can be used by DX to create proper objects. Of course, this may
not always be practical since there are significant limitations on the kinds
of arrays that a single netCDF may contain based upon constraints such as
size, number of named dimensions, etc. due to what the netCDF software
supports and its physical file structure. This is an additional justification
among other reasons for requiring a native structure. The best way to
illustrate these ideas is with a few examples.
Scalar data that is on a regular grid can be imported into Data Explorer from
a "standard" netCDF file. To import vector data, data on irregular grids, or
time series data, additional attributes must be added to the netCDF file.
These attributes allow you to specify the data, positions, and connections
components of your data set.
REGULAR GRIDS
To import scalar data on a regular grid, specify the netCDF file name as the
"name" parameter in the Import module. By default, all netCDF variables will
be imported and collected into a group. To import one or more particular
variables, specify their names as the "variable" parameter. The "format"
parameter must be "netCDF."
Data Explorer automatically constructs positions and connections for each
variable, with an origin of 0.0 and spacings of 1.0 along each dimension.
For data that is logically a vector field, but whose values are stored in
three separate netCDF variables, each component of the vector can be imported
separately; the Compute module can then be used to create a single vector
field. For data that is logically a vector field, but whose values are stored
as an n+1 dimensional regular grid, use the Slice and Compute modules to
separate the components of the vector, and then recombine them into a single
vector field.
Example of a Simple Regular Grid
The following netCDL describes a 3 x 3 x 3 regular grid at origin (0, 0, 0)
with deltas of 1.0 along each axis.
netcdf volume {
dimensions:
nx = 3;
ny = 3;
nz = 3;
variables:
float field_data(nx, ny, nz);
data:
field_data =
0, 0, 0, 0, 0, 0, 0, 5, 0,
0, 0, 5, 0, 0, 0, 0, 0, 0,
5, 0, 0, 0, 0, 0, 0, 0, 0;
}
NetCDF data on completely regular grids can be imported directly by Data Explorer
without modifying the netCDF file as indicated earlier.
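As an illustration of the default described above (origin 0.0, spacing 1.0
along each dimension), the following Python sketch generates the implied
positions for a 3 x 3 x 3 grid. The function name default_positions is
hypothetical and is not part of Data Explorer; the ordering of points (first
dimension varying slowest) is an assumption made for the example.

```python
from itertools import product

def default_positions(shape, origin=0.0, spacing=1.0):
    """Return one coordinate tuple per grid point for a regular grid.

    Assumption: the first dimension varies slowest, as in a C-ordered array.
    """
    axes = [[origin + spacing * i for i in range(n)] for n in shape]
    return list(product(*axes))

positions = default_positions((3, 3, 3))
print(len(positions))   # 27
print(positions[0])     # (0.0, 0.0, 0.0)
print(positions[-1])    # (2.0, 2.0, 2.0)
```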
COMPLEX FIELDS
For data with more complex structure, conventions have been established for
netCDF variable attributes, as described in the format below. There are two
key variable attributes that you will need to define for each netCDF variable:
"field", which specifies the rank of the parameter, and "positions", which
specifies where the information containing the locations of the data in space
is stored. The default for connections (i.e., the topological primitive) is
quads, cubes, etc., depending on the shape of the field. If you do not specify
positions, regularity is
assumed with origin at 0.0 and a spacing of 1.0. Data Explorer does support
dimensional or array products. This is a generalization of the notion of
product specification for rectilinear grids that is employed in CDF and
netCDF. Hence, this idea is exploited in the netCDF conventions.
It should be noted that netCDF does not make a distinction about the
relationship between data dependency and mesh structure -- it is just arrays.
Such a distinction is at an applications level above netCDF. Data Explorer
allows you to specify whether the values associated with a grid or mesh are to
be assigned at the node points of the mesh or the center of the grid cells.
For data in netCDF to be imported into DX, it is assumed that the data are
associated with node points (i.e., data are dependent on positions). If this
is not appropriate for the data of interest, the Post module can be used to
convert to a cell-centered form (i.e., data are dependent on connections)
after importing. Alternately, the additional field components described below
can be used.
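The difference between the two kinds of dependency shows up in how many data
values a grid needs. The Python sketch below counts node points and cells
for a regular grid; it assumes regular connections (lines, quads, cubes) with
no wrap-around, and the function names are hypothetical.

```python
from math import prod

def count_positions(shape):
    """Number of node points: one per grid point."""
    return prod(shape)

def count_cells(shape):
    """Number of connection elements: one fewer per axis than the node count."""
    return prod(n - 1 for n in shape)

shape = (3, 3, 3)
print(count_positions(shape))  # 27 values if data are dependent on positions
print(count_cells(shape))      # 8 values if data are dependent on connections
```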
IRREGULAR ARRAYS
Data
To indicate that a netCDF variable contains values corresponding to the data
component, it must have the following attribute:
variable1:field = "fieldname";
Variable1 is the name of the netCDF variable containing data values to be
imported. fieldname is the name of the Data Explorer field by which the user
refers to the data (for example, "temperature," "pressure," "wind"). If more
than one variable is tagged with the same field name, each variable is read
into a field, and the fields are collected into a group.
The data are read in as an array of values, one number per grid point. If the
data are actually a vector or a matrix at each grid point, use one of the
following modifiers:
variable1:field = "fieldname, vector";
variable1:field = "fieldname, matrix";
The nonscalar data are stored in additional dimensions for the variable. For
a static three-dimensional 3-vector, the three components are stored in a
fourth dimension of size 3.
If the data have both regular connections and regular positions, no other
attributes are required. A regular grid is assumed, with the origin at 0.0,
and a spacing of 1.0 along each axis. The number of axes will be determined
from the number of dimensions in the data array.
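A reader of these conventions has to split the "field" attribute into a field
name and a rank, and then collect variables that share a field name. The
following Python sketch shows one way this could be done. It is not the Data
Explorer implementation; the function names and the sample variables (u, t,
vel) are hypothetical, and it assumes field names contain no commas.

```python
RANKS = ("scalar", "vector", "matrix")

def parse_field(value):
    """Split 'fieldname[, modifier...]' into (fieldname, rank).

    With no rank modifier the data are treated as scalar (one number per
    grid point).
    """
    parts = [p.strip() for p in value.split(",")]
    rank = next((p for p in parts[1:] if p in RANKS), "scalar")
    return parts[0], rank

def group_by_field(field_attrs):
    """Map each field name to the (variable, rank) pairs tagged with it."""
    groups = {}
    for variable, value in field_attrs.items():
        name, rank = parse_field(value)
        groups.setdefault(name, []).append((variable, rank))
    return groups

# Hypothetical attribute values, keyed by netCDF variable name.
attrs = {"u": "wind", "t": "temperature", "vel": "wind, vector"}
groups = group_by_field(attrs)
print(groups["wind"])  # [('u', 'scalar'), ('vel', 'vector')]
```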
Positions
If the locations of the data values in variable1 do not form a regular lattice
(with origins at 0.0 and spacings of 1.0), the name of a netCDF variable that
contains the position information must be specified as an attribute for
variable1.
There are five different types of position specifications: none, completely
regular, completely irregular, and two types of partially regular.
Completely irregular is assumed if the following attribute is specified:
variable1:positions = "variable2";
where variable2 is an array of vectors, one for each grid point, defining its
location. The dimensionality of the data space is determined by the number of
items in a vector.
Regular positions can be specified with just the origin and spacing between
grid points along each axis in compact form. The following attribute is used:
variable1:positions = "variable2, compact";
where variable2 is the name of an n x 2 array containing (origin, delta)
pairs for the location and spacing of positions along each axis. The number
of positions along each axis is determined from the shape of variable1.
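The compact form can be expanded into explicit per-axis coordinates as in the
Python sketch below. The function name expand_compact and the sample values
are hypothetical; the only inputs are the (origin, delta) pairs and the shape
of variable1.

```python
def expand_compact(origin_delta, shape):
    """Expand n (origin, delta) pairs into n coordinate lists.

    origin_delta: one (origin, delta) pair per axis.
    shape: number of positions along each axis, taken from the data array.
    """
    if len(origin_delta) != len(shape):
        raise ValueError("need one (origin, delta) pair per axis")
    return [[origin + delta * i for i in range(n)]
            for (origin, delta), n in zip(origin_delta, shape)]

axes = expand_compact([(0.0, 0.5), (10.0, 2.0)], (3, 4))
print(axes[0])  # [0.0, 0.5, 1.0]
print(axes[1])  # [10.0, 12.0, 14.0, 16.0]
```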
Positions that can be specified as the product of arrays containing the
location of points along each axis can be input in product form. Use the
following attribute:
variable1:positions = "variable2a, product;
variable2b, product;
.
.
.
variable2x, product";
where the variable2's are each the name of an array containing a list of
positions along that axis. The number of items in each array must match the
length of the corresponding axis in the original variable1 data array.
If any of the axes in a partially regular product array are actually regular,
they can be specified in "compact" form:
variable1:positions = "variable2a, product, compact;
variable2b, product;
.
.
.
variable2x, product";
where variable2a is the name of an origin, delta array, and the rest are
position lists as before.
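The product form can be read as follows: split the attribute on semicolons,
expand any "compact" term from its origin and delta, and take the product of
the resulting per-axis lists. The Python sketch below assumes scalar
coordinates along each axis; the function names, variable names (x, y) and
values are hypothetical and this is not the Data Explorer code.

```python
from itertools import product

def parse_positions(value):
    """Return (variable, modifiers) for each ';'-separated term."""
    terms = []
    for term in value.split(";"):
        parts = [p.strip() for p in term.split(",")]
        if parts[0]:
            terms.append((parts[0], tuple(parts[1:])))
    return terms

def build_product(value, arrays, shape):
    """Build explicit grid positions from a product specification."""
    axes = []
    for (name, mods), n in zip(parse_positions(value), shape):
        if "compact" in mods:
            origin, delta = arrays[name]          # origin, delta array
            axes.append([origin + delta * i for i in range(n)])
        else:
            if len(arrays[name]) != n:            # must match the axis length
                raise ValueError(f"{name}: expected {n} positions")
            axes.append(list(arrays[name]))
    return list(product(*axes))

spec = "x, product, compact; y, product"
arrays = {"x": (0.0, 1.0), "y": [5.0, 7.5, 20.0]}
points = build_product(spec, arrays, (2, 3))
print(points[:3])  # [(0.0, 5.0), (0.0, 7.5), (0.0, 20.0)]
```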
Connections
If the connections between positions form a regular lattice, no additional
attributes are necessary. For 1D data, connections of "lines" is assumed. 2D
data implies "quads," 3D data implies "cubes" and for higher dimensions,
"hypercubes" is assumed.
If the connections are irregular, use one of the following attributes:
variable1:connections = "variable3, tetrahedra";
variable1:connections = "variable3, triangles";
variable1:connections = "variable3, cubes";
variable1:connections = "variable3, quads";
where variable3 is the name of an array containing a vector of point numbers,
defining each connection element item. The length of this vector depends on
the choice of connections. If the shape is not explicitly specified,
tetrahedra are assumed.
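Since the length of each connection vector depends on the element type, a
reader can check an irregular connections array before building an object. The
Python sketch below is one possible check. The function names are hypothetical,
and zero-based point numbers are an assumption; the text above does not state
the numbering base.

```python
# Number of point numbers per element for each irregular connection type.
VERTICES = {"triangles": 3, "quads": 4, "tetrahedra": 4, "cubes": 8}

def parse_connections(value):
    """Split 'variable3[, shape]'; tetrahedra if no shape is given."""
    parts = [p.strip() for p in value.split(",")]
    shape = parts[1] if len(parts) > 1 else "tetrahedra"
    return parts[0], shape

def check_elements(elements, shape, n_points):
    """Validate element length and point numbers; return the element count."""
    expected = VERTICES[shape]
    for element in elements:
        if len(element) != expected:
            raise ValueError(f"{shape} need {expected} points per element")
        if any(i < 0 or i >= n_points for i in element):
            raise ValueError("point number out of range")  # assumes 0-based
    return len(elements)

name, shape = parse_connections("tris, triangles")
print(name, shape)                                        # tris triangles
print(check_elements([(0, 1, 2), (1, 2, 3)], shape, 4))   # 2
```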
Additional Components
If additional component information is present in the file, the following
attributes are valid:
variable1:component = "variable4, componentname, scalar;
variable5, componentname, vector;
variable6, componentname, matrix";
and
variable4:attributes = "ref, componentname;
dep, componentname";
SERIES DATA
The DX data model does support aggregates of data, which can be treated as a
single entity. Such aggregates may be hierarchical or a simple flat
collection of low-level objects like a (time) series. There are three ways to
specify the import of datasets that should be treated as series: single
variable, separate variables or separate files.
Single Variable
When all data values are defined as a single netCDF variable, and the
unlimited dimension of the variable is to be interpreted as the series
dimension, then use one of the following forms of the "field" attribute:
variable1:field = "fieldname, scalar, series";
variable1:field = "fieldname, vector, series";
variable1:field = "fieldname, matrix, series";
All other specifications are the same as for simple fields.
The position and connection information is assumed to be constant for all
members of the series and hence, is not stored redundantly. If the positions
or connections change for each step of the series, then the variables used for
those arrays must also have an unlimited dimension that corresponds one-for-
one with the data array. An example using this method is shown below.
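The single-variable form can be pictured as slicing the variable along its
unlimited dimension while every member shares one set of positions. The Python
sketch below uses nested lists in place of netCDF arrays. The function name is
hypothetical, and the series values (0.0, 1.0, ...) are an assumption; the
text above does not say how members of a single-variable series are numbered.

```python
def split_series(records, positions, step=1.0):
    """Return (series_value, member) pairs, one per record.

    records: data indexed first by the unlimited dimension.
    positions: shared by reference, so it is not stored redundantly.
    """
    return [(index * step, {"data": record, "positions": positions})
            for index, record in enumerate(records)]

records = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three steps, two points
positions = [(0.0,), (1.0,)]
members = split_series(records, positions)
print(len(members))            # 3
print(members[1][0])           # 1.0
print(members[1][1]["data"])   # [3.0, 4.0]
```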
Separate Variables
When there are separate netCDF variables defined for each step in the series,
but all variables are in the same file, use the following global attribute
tags:
:seriesxxx = "fieldname;
variable1a;
variable1b;
.
.
.
variable1x";
or
:seriesxxx = "fieldname;
variable1a, float_value;
variable1b, float_value;
.
.
.
variable1x, float_value";
where the global tag must have the first 6 characters "series". Global tags
must be unique, so additional characters can be added to distinguish them.
Each variable1x is the name of the array containing the data for that step. In the
first format, the spacing of the steps is assumed to be 1.0. In the second
format, the float_value is the value of each step. All other specifications
are the same as for simple fields. For example,
:series_temp = "temp; temp001; temp002; temp003; . . . ; temp999";
or
:series_temp = "temp; temp001, 0.0; temp002, 0.3; temp003, 0.7";
Each name, tempnnn, is the name of a variable (array) containing the data for
each member of the series.
Separate Files
When there are netCDF variables in separate files which make up the steps of a
series, use the following global attribute tags:
:seriesxxx = "fieldname, files;
filename1;
filename2;
.
.
.
filenameN";
or
:seriesxxx = "fieldname, files;
filename1, float_value;
filename2, float_value;
.
.
.
filenameN, float_value";
where the global tag must have the first 6 characters "series". Global tags
must be unique, so additional characters can be added to distinguish them.
Each filenameN is the name of the netCDF file which contains the data
variables for that step. In the first format, the spacing of the steps is
1.0. In the second format, the float_value is the value of each step. All
other specifications are the same as for simple fields.
This format can be used to create short term series within a file, and then
have a series of these smaller series. The syntax is an extension of what is
done for multiple steps being multiple variables within a file. For example,
:series_temp = "temp, files; temp_file1; temp_file2; temp_file3; . . . ;
temp_fileN";
or
:series_temp = "temp, files; temp_file1, 1001.0; temp_file2, 1001.5;
temp_file3, 1002.0; . . . ; temp_fileN, 1231.5";
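The separate-variable and separate-file forms share one syntax, so a single
parser can handle both. The Python sketch below is illustrative only, not the
Data Explorer implementation. The function name is hypothetical, and starting
the default steps at 0.0 is an assumption; the text specifies only a spacing of
1.0.

```python
def parse_series(value):
    """Parse a global 'series...' attribute value.

    Returns (fieldname, from_files, steps), where steps is a list of
    (variable_or_file_name, series_value) pairs.
    """
    items = [item.strip() for item in value.split(";") if item.strip()]
    head = [p.strip() for p in items[0].split(",")]
    steps = []
    for index, item in enumerate(items[1:]):
        parts = [p.strip() for p in item.split(",")]
        # Without an explicit float_value, steps are spaced 1.0 apart
        # (starting at 0.0 is an assumption).
        value_at_step = float(parts[1]) if len(parts) > 1 else float(index)
        steps.append((parts[0], value_at_step))
    return head[0], "files" in head[1:], steps

print(parse_series("temp; temp001, 0.0; temp002, 0.3; temp003, 0.7"))
print(parse_series("temp, files; temp_file1; temp_file2"))
```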
Compact Specifications of Regular Dimensions
This example describes a single two-dimensional scalar field on a latitude-
longitude, regular, rectangular grid. The example data are temperature on a
one-degree grid with global coverage. For regular dimensions, storing all the
grid locations is redundant and wasteful of storage, even if you use a product
notation that netCDF can handle. Because Data Explorer array objects can be
specified compactly, you can use this method to specify a netCDF with regular
dimensions efficiently. For each dimension, you need to specify its value at
the origin and its spacing along the dimension.
In this example, two variable attributes are defined for the netCDF variables.
"field" specifies the rank of the field parameter, and "positions" specifies
where the information containing the locations of the data in space is
stored.
dimensions:
lon = 360;
lat = 180;
naxes = 2;
ndeltas = 2;
variables:
float locations(naxes, ndeltas);
float temperature(lat, lon);
temperature:field = "temperature, scalar";
temperature:positions = "locations, regular";
data:
locations = 89.5, -1., // compact specification, origin and
-179, 1.; // spacing for lat and lon
temperature = ... ; // Data for temperature
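The storage argument can be checked with a little arithmetic. The Python
sketch below compares the number of floats needed to describe the positions of
this 180 x 360 grid in three ways, and expands the compact specification to
confirm where each axis ends. The variable names are chosen for the example
only.

```python
lat, lon, naxes = 180, 360, 2

full = lat * lon * naxes   # one (lat, lon) pair for every grid point
product_form = lat + lon   # one coordinate list per axis
compact = naxes * 2        # one (origin, delta) pair per axis

print(full, product_form, compact)  # 129600 540 4

# Expanding the compact specification from the example above.
locations = [(89.5, -1.0), (-179.0, 1.0)]
lat_axis = [locations[0][0] + locations[0][1] * i for i in range(lat)]
lon_axis = [locations[1][0] + locations[1][1] * i for i in range(lon)]
print(lat_axis[0], lat_axis[-1])    # 89.5 -89.5
print(lon_axis[0], lon_axis[-1])    # -179.0 180.0
```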
Partially Regular Grids and Time Series
This example describes an ocean circulation model, which consists of a time
series of four three-dimensional scalars (temp, sali, wata and conv) and one
three-dimensional 3-vector (vel). NetCDF would typically require that there
be seven variables (all scalars, with the vector stored as three scalars).
The coordinate system for the velocity vectors corresponds to that of the grid
(that is, +u implies north, +v implies east, and +w implies down).
These grids are partially regular in that the "time," "tlat," and "tlon"
portions (three out of the four dimensions) are all regularly spaced. "time"
is to be mapped to members of a series group. The fourth dimension, "tlvl,"
is irregularly spaced. The compact notation can be used for the regular
dimensions, while all the values along the irregular dimension must be
specified; a product is formed from the dimensions. The specification in
netCDL notation is:
dimensions:
time = UNLIMITED;
tlat = 30;
tlon = 50;
tlvl = 30;
vsize = 3; // At each grid cell for variable vel, there are
// three floats for the u, v, and w components of the
// vector field.
naxes = 3;
ndeltas = 2;
variables:
float lat_axis(ndeltas, naxes);
float lon_axis(ndeltas, naxes);
float level_axis(tlvl, naxes);
float temp(time, tlat, tlon, tlvl);
temp:field = "temperature, scalar, series";
temp:positions = "lat_axis, product, compact; lon_axis, product,
compact; level_axis, product";
float sali(time, tlat, tlon, tlvl);
sali:field = "salinity, scalar, series";
sali:positions = "lat_axis, product, compact; lon_axis, product,
compact; level_axis, product";
float wata(time, tlat, tlon, tlvl);
wata:field = "water parage, scalar, series";
wata:positions = "lat_axis, product, compact; lon_axis, product,
compact; level_axis, product";
float conv(time, tlat, tlon, tlvl);
conv:field = "convective index, scalar, series";
conv:positions = "lat_axis, product, compact; lon_axis, product,
compact; level_axis, product";
float vel(time, tlat, tlon, tlvl, vsize);
vel:field = "velocity, vector, series";
vel:positions = "lat_axis, product, compact; lon_axis, product,
compact; level_axis, product";
data:
lat_axis = -14.667, 0., 0.,
0.333, 0., 0.;
lon_axis = 0.0, -99.8, 0.0,
0.0, 0.5, 0.0;
level_axis = 0.0, 0.0, 17.5,
0.0, 0.0, 53.425,
.
.
.
0.0, 0.0, 5374.98;
temp = ... ;
sali = ... ;
wata = ... ;
conv = ... ;
vel = ... ;
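In this example each axis array holds 3-vectors that are nonzero in only one
component. One plausible reading, assumed here rather than stated above, is
that the position of grid point (i, j, k) is the sum of one term from each
axis. The Python sketch below works that out using only the values shown in
the data section; level_axis is truncated to its first two rows, and the
function names are hypothetical.

```python
def axis_term(origin, delta, index):
    """Vector contribution of a compact (regular) axis at a given index."""
    return [o + d * index for o, d in zip(origin, delta)]

def grid_position(i, j, k, lat_axis, lon_axis, level_axis):
    """Sum the per-axis vectors (assumed meaning of the product form)."""
    terms = [
        axis_term(lat_axis[0], lat_axis[1], i),   # row 0 origin, row 1 delta
        axis_term(lon_axis[0], lon_axis[1], j),
        list(level_axis[k]),                      # irregular: listed values
    ]
    return tuple(sum(component) for component in zip(*terms))

lat_axis = [(-14.667, 0.0, 0.0), (0.333, 0.0, 0.0)]
lon_axis = [(0.0, -99.8, 0.0), (0.0, 0.5, 0.0)]
level_axis = [(0.0, 0.0, 17.5), (0.0, 0.0, 53.425)]  # first two rows only

print(grid_position(0, 0, 0, lat_axis, lon_axis, level_axis))
print(grid_position(3, 2, 1, lat_axis, lon_axis, level_axis))
```

The first call gives the grid origin (-14.667, -99.8, 17.5); the second moves
three steps in latitude, two in longitude and one level down.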
Irregular Surface
This example is the netCDL description of a netCDF for an irregular surface,
that of the classic teapot. It has precomputed normals, which are imported as
the "normals" component, in addition to positions and connections.
netcdf teapot { // name of datafile is "teapot.ncdf"
// name of field is "surface"
dimensions:
pointnums = 2268;
trinums = 3584;
axes = 3;
sides = 3;
variables:
float locations(pointnums, axes);
float normalvect(pointnums, axes);
long tris(trinums, sides);
float surfacedata(pointnums);
// global attributes:
:source = "Classic Teapot, data from Turner Whitted";
// specific attributes:
surfacedata:field = "surface";
surfacedata:connections = "tris, triangles";
surfacedata:positions = "locations";
surfacedata:component = "normalvect, normals, vector";
normalvect:attributes = "dep, positions";
// This is the start of a large data section
data:
.
.
.
}