Hello again,
There's been so much to talk about in the past few days I'm not
sure I know how to respond. First, I think we all agree that conventions
are needed and these conventions should support the "possibility of writing
applications that access generic netCDF files and operate on them." So,
what do we need to do to support this? I contend that the conventions.info
document, in its current form, is somewhat inadequate, for a few reasons which
I intend to cover.
I'd like to start by looking at some examples based on what Tim
Holt stated. Before I start I should state that I am a "computer tweak?".
Although I've never heard this term before, I'm sure it could be applied to me.
Tim writes:
What it comes down to for the average tech/PI with raw data is this --
"I want to make a graph of time vs temperature", "I want to plot the tracklines
from the cruise", or "Where were we when we took water sample 38, and what was
the flow-through temp and conductivity?"
I think discussing what a "generic application" will need to know about
the data in order to accomplish these tasks will highlight some areas where
conventions will do a lot of good, and some areas where conventions of the
wrong type may inhibit the production of general tools. So let's look at
each of these examples and what's involved in implementing them from the
application's perspective.
"I want to make a graph of time vs temperature"
Seems simple enough. First, a generic application may know about the
concept of time, but it really doesn't need to know that the dependent variable
is temperature. In fact, all the application really needs is an array that
represents coordinates in the X direction, an array that represents
coordinates in the Y direction, and an indication of which one is the
independent variable. With these arrays it can determine the ranges of the
values in the data and then set up a window->viewport mapping for transforming
the data onto a location on the screen. Not really much of a problem, except:
how does the application know which of the possibly many variables in the file
are the appropriate ones to use to make this plot?
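        Just to make the mechanics concrete, here is a rough sketch in C of
that plotting step, with the two arrays assumed to be sitting in memory
already; the function name, screen size and sample data are of course only
illustrative:

/* Sketch only: push two data arrays through a window->viewport mapping.
 * How the x[] and y[] arrays were located in the file and read is left
 * out on purpose -- that is exactly the open question. */
#include <stdio.h>

#define SCREEN_W 640    /* illustrative viewport size */
#define SCREEN_H 480

void plot_xy(const float *x, const float *y, int n)
{
    float xmin = x[0], xmax = x[0], ymin = y[0], ymax = y[0];
    float xr, yr;
    int i, px, py;

    /* Determine the ranges (the "window") of the data. */
    for (i = 1; i < n; i++) {
        if (x[i] < xmin) xmin = x[i];
        if (x[i] > xmax) xmax = x[i];
        if (y[i] < ymin) ymin = y[i];
        if (y[i] > ymax) ymax = y[i];
    }

    /* Guard against a degenerate range. */
    xr = (xmax > xmin) ? xmax - xmin : 1.0f;
    yr = (ymax > ymin) ? ymax - ymin : 1.0f;

    /* Map each data point into the viewport (screen coordinates). */
    for (i = 0; i < n; i++) {
        px = (int)((x[i] - xmin) / xr * (SCREEN_W - 1));
        py = (int)((y[i] - ymin) / yr * (SCREEN_H - 1));
        printf("point %d -> pixel (%d, %d)\n", i, px, py); /* stand-in for drawing */
    }
}

int main(void)
{
    float t[]    = { 0.0f, 1.0f, 2.0f, 3.0f };      /* "time" */
    float temp[] = { 14.1f, 14.3f, 15.0f, 14.8f };  /* "temperature" */
    plot_xy(t, temp, 4);
    return 0;
}

Nothing in that routine cares that the data are time and temperature; the only
hard part is everything that has to happen before it is called.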
Now what happens when the data is not stored in two simple arrays?
Whose responsibility is it to state how variables in the netCDF file should
be selected and ordered to produce the two arrays needed for this task?
For example, a data set could be collected that contains temperature,
pressure and humidity, along with the latitude and longitude of each
measurement. The following are a couple of the many possible ways to put this
data into a netCDF file.
netcdf file1 {
dimensions:
        values = 5 ;
        time = UNLIMITED ;
variables:
        float dataset1(time, values) ;
                dataset1:index0 = "temperature" ;
                dataset1:index1 = "pressure" ;
                dataset1:index2 = "humidity" ;
                dataset1:index3 = "latitude" ;
                dataset1:index4 = "longitude" ;
        long time(time) ;
}
netcdf file2 {
dimensions:
        values = 3 ;
        latlon = 2 ;
        time = UNLIMITED ;
variables:
        float dataset2(time, values) ;
                dataset2:index0 = "temperature" ;
                dataset2:index1 = "pressure" ;
                dataset2:index2 = "humidity" ;
        long time(time) ;
        float location(time, latlon) ;
                location:index0 = "latitude" ;
                location:index1 = "longitude" ;
}
The reason someone would want to organize their data in this fashion
is inconsequential; it may simply be related to how the instrument measuring
the data works. In these two examples one file uses three netCDF dimensions
and three variables, while the other uses two netCDF dimensions and two
variables, to represent the same data. So now I ask the question again: how is
the application supposed to know what it means to plot time vs temperature?
These are VERY VERY simple examples. The complexity of "understanding" the
organization of the data, from simply looking at the organization of the
variables and dimensions in a file, grows as higher dimensional datasets are
looked at. The number of permutations in the organization of a dataset grows
as the dimensionality of the data grows.
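        To see how quickly this bites, here is roughly what a program has to
do just to pull the latitude values out of each of the two example files (a
sketch using the netCDF C interface; the file names, variable names and column
numbers are hard-coded knowledge taken straight from my examples, which is
precisely what a generic application would not have):

/* Sketch: extracting latitude from the two example files above.  The
 * hard-coded names ("dataset1", "location") and column numbers (3 and 0)
 * are exactly the file-specific knowledge a generic application lacks. */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

float *read_latitude(const char *path, const char *varname,
                     size_t column, size_t *ntime)
{
    int ncid, varid, timedim;
    size_t start[2], count[2];
    float *lat;

    if (nc_open(path, NC_NOWRITE, &ncid) != NC_NOERR) return NULL;
    nc_inq_dimid(ncid, "time", &timedim);
    nc_inq_dimlen(ncid, timedim, ntime);
    nc_inq_varid(ncid, varname, &varid);

    lat = malloc(*ntime * sizeof(float));
    start[0] = 0;       count[0] = *ntime;   /* every time step        */
    start[1] = column;  count[1] = 1;        /* just the latitude slot */
    nc_get_vara_float(ncid, varid, start, count, lat);
    nc_close(ncid);
    return lat;
}

int main(void)
{
    size_t n;
    /* file1: latitude lives in column 3 of "dataset1" */
    float *lat1 = read_latitude("file1.nc", "dataset1", 3, &n);
    /* file2: the same quantity lives in column 0 of "location" */
    float *lat2 = read_latitude("file2.nc", "location", 0, &n);
    if (lat1) { printf("file1: first latitude %f\n", lat1[0]); free(lat1); }
    if (lat2) { printf("file2: first latitude %f\n", lat2[0]); free(lat2); }
    return 0;
}

Two files, identical information, two different chunks of code; and this was
only the latitude.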
"I want to plot the tracklines from the cruise"
What information is needed by the application in this case? The app
needs to know which variables in the netCDF file are "latitude" and "longitude"
and that the data is in fact geographic data. It then needs to determine what
the extent of the latitude and longitude variables is so it can select the
appropriate map projection. Again, as in the previous example, this data could
exist in the netCDF file in various organizations.
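        Assuming the application somehow got its hands on the latitude and
longitude arrays, the projection step itself is again simple and generic;
something along these lines, where the selection rule is just an illustrative
guess on my part, not a standard of any kind:

/* Sketch: choose a map projection from the extent of the track.  The
 * lat[]/lon[] arrays are assumed to be found and read already (the hard
 * part); the thresholds below are purely illustrative. */
const char *pick_projection(const float *lat, const float *lon, int n)
{
    float latmin = lat[0], latmax = lat[0], lonmin = lon[0], lonmax = lon[0];
    int i;

    /* Determine the geographic extent of the trackline. */
    for (i = 1; i < n; i++) {
        if (lat[i] < latmin) latmin = lat[i];
        if (lat[i] > latmax) latmax = lat[i];
        if (lon[i] < lonmin) lonmin = lon[i];
        if (lon[i] > lonmax) lonmax = lon[i];
    }

    if (lonmax - lonmin > 180.0f)            /* track spans half the globe  */
        return "cylindrical equidistant";
    else if (latmax > 60.0f || latmin < -60.0f)
        return "polar stereographic";        /* high-latitude cruise        */
    else
        return "mercator";                   /* modest regional extent      */
}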
"Where were we when we took water sample 38, and what was the flow-through temp
and conductivity?"
This type of request, if made directly to a generic application, would
require the application to "know" what "sample 38", "flow-through", "temp" and
"conductivity" are, where they're stored and how to access and display them.
This certainly seems to be beyond the reasonable scope of a generic
application.
As can be seen there are several things that a self-describing netCDF
file cannot possibly describe to an application. IMHO the primary problem is a
lack of standard organizations of data or a lack of a mechanism for
communicating the organization of the data. By organization I mean: what is
the geometry of the data (1D, 2D, 3D, ...); what set of variables and
dimensions make up a single data set; to what class of data does the set belong
(rectilinear grid, scattered, line, irregular grid, mesh, ...); and does a
given variable represent an independent or dependent variable. I maintain that
these are the types of information for which conventions are needed in order
to realize "applications that access generic netCDF files and operate on them."
The current conventions.info document only standardizes names; although
that is important for allowing humans to understand the data, it is inadequate
for communicating to the application how the data is organized. Understanding
the organization is needed to allow the application to determine which methods
could be used to visualize the data. If the intention of standardizing names
is to allow applications to "understand" the data based on the names of
variables in a netCDF file, it won't work as well as standardizing data
representations (organizations). Why? Because many types of data from
different disciplines can be classified and visualized based on the geometry
information (coordinate system) of the data, which does not depend on the
names or type of data, but on the structure. Using names like "sfc_t" for
surface temperature does nothing to communicate the organization of the data
or to allow an application to infer a visualization method, unless the
application has been configured to "understand" all of the names in the
conventions.info document. This is completely unnecessary, due to the fact
that most data fit into simple classes (organizations, structures) of data.
        Consider a boat moving around on the surface of the ocean collecting
data. The structure or class of such a data set can be classified as a
"2D random data set." Why 2D? Because there are two coordinates (lat, lon)
that define the location of each sample point. Why random? Because there is no
functional relationship between the coordinate pairs. Similar abstractions can
be made for gridded data and other classes of data. I feel very strongly that
these are the areas that need to be standardized: not names, but structures.
Until there is a method of grouping variables in a netCDF file such that the
geometric properties of the data can be inferred, a generic visualization
application is really impossible.
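        To give a flavor of what I mean (and to be clear, the attribute name
below is invented purely for illustration, not a proposal of specific
conventions), imagine the file carried structural information the application
could dispatch on:

/* Hypothetical sketch: if a convention recorded the *structure* of a data
 * set, a generic application could pick a visualization method without
 * knowing any discipline-specific variable names.  The "geometry"
 * attribute is invented here for illustration only. */
#include <stdio.h>
#include <string.h>
#include <netcdf.h>

void visualize(int ncid, int varid)
{
    char geometry[NC_MAX_NAME + 1];
    memset(geometry, 0, sizeof(geometry));

    /* Read the (hypothetical) structural class of this data set.
     * Real code would query the attribute length first. */
    if (nc_get_att_text(ncid, varid, "geometry", geometry) != NC_NOERR) {
        printf("no structural information; cannot choose a method\n");
        return;
    }

    if (strcmp(geometry, "2D_random") == 0)
        printf("scattered samples in two coordinates: track/scatter plot\n");
    else if (strcmp(geometry, "2D_rectilinear_grid") == 0)
        printf("gridded field: contour or image plot\n");
    else
        printf("unrecognized class \"%s\"\n", geometry);
}

The application never has to know whether the two coordinates are lat/lon or
anything else; the structure alone tells it what it can do with the data.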
-ethan
--
Ethan Alpert internet: ethan@xxxxxxxxxxxxx | Standard Disclaimer:
Scientific Visualization Group, | I represent myself only.
Scientific Computing Division |-------------------------------
National Center for Atmospheric Research, PO BOX 3000, Boulder Co, 80307-3000