We've been discussing conventions for netCDF data that will facilitate
the development of generic applications for multidimensional scientific
data. It occurred to me that maybe we should take a close look at
applications that already deal with this type of data in this manner.
One such application that I am familiar with is AVS (the Application
Visualization System). AVS is essentially a collection of tools that
read, write, process, and display data from a broad class that it calls
"field" data (it also handles "unstructured cell data" for finite
element applications, but since I don't use this I won't discuss it).
Using AVS we have manipulated time series and depth profiles, rendered
gridded bathymetric data draped with sidescan imagery, displayed
scattered earthquake and CTD data, plotted velocity vectors in 3D from
shipboard acoustic Doppler transects, and explored data from a 3D
orthogonal curvilinear sigma coordinate model. All of these data were
encompassed by the AVS "field" data class.
I am proposing that we consider using the field data model as the basis
for our oceanographic netCDF files, which will require adopting a few
conventions. First, here is a description of the "field" data class, which
encompasses a broad range of scientific data.
To define the nature of the field data, AVS needs to know 5 pieces of
information:
1. The dimensions of the data space.
The data array can have any number of dimensions, and the
dimensions can be of any size.
2. The number of data components at each coordinate node.
Each data element in the array can consist of one value or
a vector of values.
3. The data representation (e.g., byte, integer, float).
4. The dimensions of the coordinate space.
This is not necessarily the same number as the number of dimensions
of the data space. A drifter trajectory, for example, has only 1
data dimension (time, or record number), but its location in the
water column needs to be described by 3 spatial coordinates.
5. The nature of the mapping between the data and coordinate space.
AVS allows for uniform, rectilinear or irregular mapping of data
space to coordinate space.
Uniform means that the data is equally spaced along each dimension,
so that the coordinates can be determined from min and max extents.
Rectilinear means that each dimension of data space is mapped to
a corresponding dimension of coordinate space through a coordinate
variable vector. This corresponds to the type of data that can be
processed by PMEL's EPIC format and the UNIDATA netCDF operators.
In rectilinear mappings, the number of dimensions in data space
and coordinate space is the same.
Irregular means that there is no simple mapping between data
and coordinate space, and the coordinate location of each data
point is explicitly defined. The number of data and
coordinate dimensions need not be the same. It is this class that
allows AVS to handle curvilinear model output, scattered data in
x,y,z space (like Doppler data, drifter data, and CTD data), and
air-temperature on the surface of the Earth.
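
The five items above can be collected in one small record. This is a minimal sketch in Python, not an AVS API: the class name FieldDescription and its field names are hypothetical, and the only consistency check is the one stated above for rectilinear mappings.

```python
from dataclasses import dataclass
from typing import Tuple

MAPPINGS = ("uniform", "rectilinear", "irregular")


@dataclass(frozen=True)
class FieldDescription:
    """Hypothetical container for the five items listed above."""
    data_dims: Tuple[int, ...]  # 1. size of each data-space dimension
    n_components: int           # 2. values stored at each coordinate node
    dtype: str                  # 3. data representation, e.g. "float"
    n_coord_dims: int           # 4. number of coordinate-space dimensions
    mapping: str                # 5. one of MAPPINGS

    def __post_init__(self):
        if self.mapping not in MAPPINGS:
            raise ValueError("unknown mapping: %r" % (self.mapping,))
        # Rectilinear: data space and coordinate space have equal rank.
        if (self.mapping == "rectilinear"
                and len(self.data_dims) != self.n_coord_dims):
            raise ValueError("rectilinear fields need equal ranks")


# The drifter trajectory: 1 data dimension, 3 spatial coordinates.
drifter = FieldDescription((1000,), 1, "float", 3, "irregular")
print(drifter)
```

The drifter example has to be irregular, because its coordinate space has more dimensions than its data space.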
Some AVS tools work on any data in field format; others work only with field
data that have certain attributes. For example, "print field" works no matter
what type of field data it is, while "compute divergence" requires that you
have a 2 or 3-vector field on a uniform grid. The point is, with these
attributes, you can develop tools that work in a generic manner on a WIDE
class of common oceanographic data types.
The bad news is that a bare-bones netCDF file supplies us with only 1 of these
critical pieces of information: the data representation (e.g., float, int).
The good news is that with just two conventions, we could supply all the
rest of the information.
*****************
Convention 1: define a dimension named "components"
*****************
In netCDF, we know how many dimensions the variable has, but we don't
know which of these dimensions correspond to data space and which
dimensions correspond to components of a vector. For example, a
three-component velocity vector defined on a 2D coordinate grid might
be defined
dimensions:
lat=20,lon=20,components=3;
variables:
float velocity(lat,lon,components);
A priori, we don't know that "components" doesn't refer to depth, or
some other coordinate dimension. Luckily, netCDF uses named dimensions,
so all we need is to adopt a convention that defines a special dimension
name that tells the application "this dimension denotes the number of
data components at each coordinate node". The design plan for the
SIEVE system that is being developed by the USGS in Reston incorporates
this convention, suggesting "components" as the special dimension
name. Seems logical to me.
By adopting this convention, we pick up two more critical pieces of
information, numbers 1 and 2 on the list above: the dimensions of the
data space and the number of components at each coordinate node.
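
As a rough illustration of how an application might use this convention, the sketch below splits a variable's dimensions into the data-space shape and the component count. It works on plain Python tuples and dicts rather than on a real netCDF library, and the function name split_dimensions is hypothetical.

```python
COMPONENTS_DIM = "components"  # the special dimension name proposed above


def split_dimensions(var_dims, dim_sizes):
    """Return (data_space_shape, n_components) for one variable.

    var_dims:  tuple of the variable's dimension names
    dim_sizes: dict mapping every dimension name to its size
    """
    # Every dimension except "components" belongs to data space.
    data_shape = tuple(dim_sizes[d] for d in var_dims if d != COMPONENTS_DIM)
    # A variable without the special dimension holds one value per node.
    n_components = dim_sizes[COMPONENTS_DIM] if COMPONENTS_DIM in var_dims else 1
    return data_shape, n_components


# The velocity example above: float velocity(lat,lon,components)
dim_sizes = {"lat": 20, "lon": 20, "components": 3}
shape, n_components = split_dimensions(("lat", "lon", "components"), dim_sizes)
print(shape, n_components)  # (20, 20) 3
```

Treating a variable that lacks the "components" dimension as a scalar field is an assumption of this sketch, not part of the proposal.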
************
Convention 2: define a variable attribute called "independent_variables"
************
To define the mapping from data space to coordinate space, we need to
specify the coordinate variables on which each data variable depends.
In other words, for each dependent variable, we need to supply the
independent variables. This could be accomplished by a string
attribute which simply lists the independent variables. For example, a
temperature record from an ocean surface drifter might be defined:
Dimensions: position=1000;
Variables:
float temp(position);
temp:long_name = "Temperature";
temp:units = "Celsius";
temp:independent_variables = "lat lon";
float lon(position);
lon:long_name = "Longitude";
lon:units = "degrees";
float lat(position);
lat:long_name = "Latitude";
lat:units = "degrees";
which would be taken to mean that each temperature point corresponds to
a 2-space coordinate given by lat and lon. Actually, due to the power
of netCDF, we would only *need* to supply the attribute
"independent_variables" for irregular mappings where the number of
coordinate dimensions exceeds the number of data dimensions.
Rectilinear mappings and irregular mappings where the data and
coordinate dimensions are the same can be determined from the data and
coordinate variables themselves. Uniform mappings are ugly, since
they require origin and coordinate interval info to be supplied. I
would propose that data coordinates must be supplied (even if evenly
spaced), or else the application would assume data indices as
coordinates. Getting complicated with attribute schemes for uniform
data just doesn't seem worth it.
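
The lookup order described above could be sketched as follows: use "independent_variables" when it is present, otherwise look for a variable named after each dimension, and otherwise fall back to data indices. The function name coordinate_sources and the "variable"/"index" labels are hypothetical, and the file is modelled with plain Python containers rather than a netCDF library.

```python
COMPONENTS_DIM = "components"  # special dimension name from Convention 1


def coordinate_sources(var_dims, attrs, variable_names):
    """Map each coordinate name to where its values would come from.

    var_dims:       tuple of the data variable's dimension names
    attrs:          dict of the data variable's attributes
    variable_names: set of all variable names in the file
    """
    listed = attrs.get("independent_variables", "").split()
    if listed:
        # Explicit list: every named variable supplies coordinate values.
        return {name: "variable" for name in listed}
    sources = {}
    for dim in var_dims:
        if dim == COMPONENTS_DIM:
            continue  # vector components are not a coordinate axis
        # A variable named after the dimension supplies the coordinates;
        # otherwise the application falls back to the data indices.
        sources[dim] = "variable" if dim in variable_names else "index"
    return sources


names = {"temp", "lat", "lon"}
with_attr = coordinate_sources(
    ("position",), {"independent_variables": "lat lon"}, names)
without_attr = coordinate_sources(("position",), {}, names)
print(with_attr)
print(without_attr)
```

For the drifter file, the attribute yields two coordinates (lat and lon) for one data dimension. Without it, the only choice left is the position index.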
Rectilinear mappings would be determined by checking to see if 1D
variables exist with the same name as the named dimensions, just as
defined in conventions.info. In these cases, the independent_variables
attribute would be unnecessary. An example of the rectilinear mapping
is the familiar:
Dimensions: lat=10,lon=10;
Variables:
float temp(lat,lon);
temp:long_name = "Temperature";
temp:units = "Celsius";
float lon(lon);
lon:long_name = "Longitude";
lon:units = "degrees";
float lat(lat);
lat:long_name = "Latitude";
lat:units = "degrees";
If a variable name is also a dimension name, but it is not 1D, then the
mapping is irregular. For example, salinity data from a time-dependent
curvilinear, sigma coordinate numerical model might look like:
Dimensions: x=40,y=40,z=10,time=1000;
Variables:
float sal(time,z,y,x);
sal:long_name="Salinity";
sal:units="psu";
float time(time);
float z(z,y,x);
float x(y,x);
float y(y,x);
The application would find the coordinate variables and deduce that,
since some of them have more than one dimension, the field must be
irregular. Since the variable time is 1D, it would then assume that the
entire salinity field at a given time index corresponds to the value of
the variable time at that index. Similarly, the application would deduce
that the z locations hold for all times, and that the x and y locations
hold for all depths and all times.
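
The deduction described above might look like the following sketch. It assumes the file is represented as a plain dict from variable names to dimension-name tuples. The function name classify_mapping and the "index" result (for a dimension with no matching coordinate variable) are hypothetical, and the "independent_variables" attribute is deliberately not consulted here.

```python
COMPONENTS_DIM = "components"  # special dimension name from Convention 1


def classify_mapping(var_name, variables):
    """Classify one data variable as 'rectilinear', 'irregular' or 'index'.

    variables: dict mapping each variable name to its tuple of dimension names
    """
    dims = [d for d in variables[var_name] if d != COMPONENTS_DIM]
    # No variable named after some dimension: fall back to data indices.
    if any(d not in variables for d in dims):
        return "index"
    # Every coordinate variable is 1D and defined on its own dimension.
    if all(variables[d] == (d,) for d in dims):
        return "rectilinear"
    # At least one coordinate variable is not 1D.
    return "irregular"


rectilinear_file = {
    "temp": ("lat", "lon"),
    "lat": ("lat",),
    "lon": ("lon",),
}
model_file = {
    "sal": ("time", "z", "y", "x"),
    "time": ("time",),
    "z": ("z", "y", "x"),
    "x": ("y", "x"),
    "y": ("y", "x"),
}
print(classify_mapping("temp", rectilinear_file))
print(classify_mapping("sal", model_file))
```

The model file is classified as irregular because z, x and y are not 1D, even though time is an ordinary 1D coordinate variable.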
Adopting these two conventions would allow generic applications to
be developed which understand a much wider range of data types, many
common in the oceanographic community.
Comments?
--
Rich Signell | rsignell@xxxxxxxxxxxxxxxxxx
U.S. Geological Survey | (508) 457-2229 | FAX (508) 457-2310
Quissett Campus | "George promised to be good...
Woods Hole, MA 02543 | ... but it is easy for little monkeys to forget."