There has been much discussion (mostly last month) about storing complex structures in netCDF. I have been meaning to respond, but I have been very busy. A lot of the e-mail traffic duplicates a few things that we talked about at NSSDC a few years ago for CDF, but had not implemented at that time. However, here at the Visualization Systems Group at IBM T. J. Watson Research Center, the discussion also duplicates much of what was considered and then implemented almost two years ago as part of the development of the IBM Visualization Data Explorer software.

Therefore, I have attached a brief document that describes some of what we have done in the context of importing data stored in netCDF for your consideration. I believe it addresses many of the issues that have been discussed in this forum. Keep in mind this is NOT a proposal, but an outline of an actual implementation that is available in a commercial software product today. Although the ideas can be cast in a form independently of Data Explorer, that software does fully support them.

If you have any questions, and especially if you have any comments (positive or negative), please let me know.

Lloyd Treinish

-------------------------------------------------------------------------------

Importing netCDF data into IBM Visualization Data Explorer

Lloyd A. Treinish
Visualization Systems Group
IBM T. J. Watson Research Center
P. O. Box 704
Yorktown Heights, NY 10598
lloydt@watson.ibm.com

The IBM Visualization Data Explorer (DX) is a general-purpose software package for scientific data visualization. It employs a data-flow-driven client-server execution model and is currently available on five platforms: IBM POWER Visualization Systems (a medium-grain, shared memory parallel supercomputer) and workstations -- IBM RISC System/6000, Silicon Graphics Indigo and Crimson, Hewlett-Packard 700 and Sun Sparcstation 2.

DX is built on a foundation of an internal data model, which describes and provides uniform access services for any data brought into, generated by, or exported from the software. Hence, it has a notion of supporting a number of different classes of interesting scientific data, which can be described by their shape (size and number of dimensions), rank (e.g., scalar, vector, tensor), type (float, integer, byte, etc. or real, complex, quaternion), where the data are located in space (positions), how the locations are related to each other (connections), and aggregates or groups (e.g., hierarchies, series, composites, etc.). It also supports those entities required for graphics and imaging operations within the context of Data Explorer. Generically, these are called "objects".

The DX data model is supported with an applications programming interface (API) for users or developers to create functions or operations (i.e., modules) for DX. At the user-level (i.e., via a graphical user interface, visual programming or scripting-language programming) the details of the data model and this interface are hidden. An important consequence of this approach is that modules are polymorphic. In addition, there is an external representation, the native dx format. It is a multiple sequential file representation of DX objects.

The DX data model is quite rich. Most of what it can support is not directly expressible by netCDF. Therefore, a methodology to extend netCDF for use with Data Explorer was developed.

For more information about the DX architecture and data model see, for example, R. Haber et al, "A Data Model for Scientific Visualization with Provisions for Regular and Irregular Grids", Proceedings IEEE Visualization '91 Conference, pp. 298-305, October 1991; B. Lucas et al, "An Architecture for a Scientific Visualization System", Proceedings IEEE Visualization '92, pp.
107-113, October 1992, and "IBM Visualization Data Explorer User's Guide, Second Edition", IBM Document Number SC38-0496-1, August 1992.

NetCDF is a data abstraction for (self-describing) multi-dimensional blocks. The descriptions are in terms of attributes, which may be assigned globally or to one or more variables (i.e., a multi-dimensional block). NetCDF in the DX context provides a portable and commonly-used API (C and a veneer layer for FORTRAN 77), and a fixed, portable physical file structure (a single XDR file) in the public domain. NetCDF only knows about arrays of scalars and is a carrier for them and their descriptions. There is NO knowledge or semantics embedded with regard to any other structure. Since such arrays are inherently flat and rectilinear, there is typically insufficient information to define suitable objects for import to DX, especially for irregular or hierarchical data.

A netCDF user is free to define custom conventions for the array storage and attribute nomenclature. In this sense it is possible to create a mechanism to support a limited set of other structures on top of the array "protocol". However, this also means that a generic netCDF reader would only be able to report contents and be unable to operate on any underlying context. Such a context for the creation of DX objects for their importation has been defined.

Any system that attempts to support structures more complex than what raw netCDF handles would have to deal with this situation. The notion of being able to import any random netCDF and create the correct DX object is NOT possible given the limited netCDF vocabulary. The exception would be for a very limited class of regular/rectilinear arrays, in which any more complex structure is ignored (e.g., a simple image). The DX convention for simple regular data is essentially based on that idea. Hence, a visualization package that is only capable of dealing with "native" netCDF data would have to have limited functionality.
DX is capable of dealing with a far greater variety of complex data, only a subset of which can be expressed effectively in netCDF even when one does so via external constraints.

What are these aforementioned conventions? The netCDF vocabulary is neither sufficiently rich nor at a high enough level to adequately describe the kinds of objects that must be supported for general visualization and analysis. This is a result of the heritage of the original CDF implementation at NASA/GSFC in the mid-1980s. Although the current CDF implementation at NASA does address a few of these limitations, both netCDF and CDF still are focused primarily on a relatively low-level abstraction -- multidimensional blocks. DX objects can be decomposed to a lower level, that of multidimensional arrays. However, the DX array objects are more flexible than those of the CDF/netCDF model because they support rank and shape/dimensionality independently. Nevertheless, netCDF can be used as a carrier of self-describing multidimensional arrays, whose descriptions, when following a certain convention, can be used by DX to create proper objects. Of course, this may not always be practical since there are significant limitations on the kinds of arrays that a single netCDF may contain based upon constraints such as size, number of named dimensions, etc. due to what the netCDF software supports and its physical file structure. This is an additional justification, among other reasons, for requiring a native structure.

The best way to illustrate these ideas is with a few examples. Scalar data that is on a regular grid can be imported into Data Explorer from a "standard" netCDF file. To import vector data, data on irregular grids, or time series data, additional attributes must be added to the netCDF file. These attributes allow you to specify the data, positions, and connections components of your data set.

REGULAR GRIDS

To import scalar data on a regular grid, specify the netCDF file name as the "name" parameter in the Import module. By default, all netCDF variables will be imported and collected into a group. To import one or more particular variables, specify their names as the "variable" parameter. The "format" parameter must be "netCDF". Data Explorer automatically constructs positions and connections for each variable, with an origin of 0.0 and spacings of 1.0 along each dimension.

For data that is logically a vector field, but whose values are stored in three separate netCDF variables, each component of the vector can be imported separately; the Compute module can then be used to create a single vector field. For data that is logically a vector field, but whose values are stored as an n+1 dimensional regular grid, use the Slice and Compute modules to separate the components of the vector, and then recombine them into a single vector field.

Example of a Simple Regular Grid

The following netCDL describes a 3 x 3 x 3 regular grid at origin (0, 0, 0) with deltas of 1.0 along each axis.

    netcdf volume {
    dimensions:
        nx = 3;
        ny = 3;
        nz = 3;
    variables:
        float field_data(nx, ny, nz);
    data:
        field_data = 0, 0, 0,
                     0, 0, 0,
                     0, 5, 0,
                     0, 0, 5,
                     0, 0, 0,
                     0, 0, 0,
                     5, 0, 0,
                     0, 0, 0,
                     0, 0, 0;
    }

NetCDF data on completely regular grids can be imported directly by Data Explorer without modifying the netCDF file, as indicated earlier.

COMPLEX FIELDS

For data with more complex structure, conventions have been established for netCDF variable attributes, as described in the format below. There are two key variable attributes that you will need to define for each netCDF variable: "field", which as far as you are concerned is used to specify the rank of the parameter, and "positions", which is used to specify where the information containing the locations of the data in space is stored. The default for connections (i.e., the topological primitive) is quads, cubes, etc., depending on the shape of the field.
If you do not specify positions, regularity is assumed with origin at 0.0 and a spacing of 1.0.

Data Explorer does support dimensional or array products. This is a generalization of the notion of product specification for rectilinear grids that is employed in CDF and netCDF. Hence, this idea is exploited in the netCDF conventions. It should be noted that netCDF does not make a distinction about the relationship between data dependency and mesh structure -- it is just arrays. Such a distinction is at an applications level above netCDF.

Data Explorer allows you to specify whether the values associated with a grid or mesh are to be assigned at the node points of the mesh or the center of the grid cells. For data in netCDF to be imported into DX, it is assumed that the data are associated with node points (i.e., data are dependent on positions). If this is not appropriate for the data of interest, the Post module can be used to convert to a cell-centered form (i.e., data are dependent on connections) after importing. Alternately, the additional field components described below can be used.

IRREGULAR ARRAYS

Data

To indicate that a netCDF variable contains values corresponding to the data component, it must have the following attribute:

    variable1:field = "fieldname";

Variable1 is the name of the netCDF variable containing data values to be imported. fieldname is the name of the Data Explorer field by which the user refers to the data (for example, "temperature," "pressure," "wind"). If more than one variable is tagged with the same field name, each variable is read into a field, and the fields are collected into a group.

The data are read in as an array of values, one number per grid point. If the data are actually a vector or a matrix at each grid point, use one of the following modifiers:

    variable1:field = "fieldname, vector";
    variable1:field = "fieldname, matrix";

The nonscalar data are stored in additional dimensions for the variable.
For a static three-dimensional 3-vector, the three components are stored in a fourth dimension of size 3.

If the data have both regular connections and regular positions, no other attributes are required. A regular grid is assumed, with the origin at 0.0, and a spacing of 1.0 along each axis. The number of axes will be determined from the number of dimensions in the data array.

Positions

If the locations of the data values in variable1 do not form a regular lattice (with origins at 0.0 and spacings of 1.0), the name of a netCDF variable that contains the position information must be specified as an attribute for variable1. There are five different types of position specifications: none, completely regular, completely irregular, and two types of partially regular.

Completely irregular is assumed if the following attribute is specified:

    variable1:positions = "variable2";

where variable2 is an array of vectors, one for each grid point, defining its location. The dimensionality of the data space is determined by the number of items in a vector.

Regular positions can be specified with just the origin and spacing between grid points along each axis in compact form. The following attribute is used:

    variable1:positions = "variable2, compact";

where variable2 is the name of an n x 2 array containing origin, delta pairs for the spacing and location of positions along each axis. The number of positions along each axis is determined from the shape of variable1.

Positions that can be specified as the product of arrays containing the location of points along each axis can be input in product form. Use the following attribute:

    variable1:positions = "variable2a, product; variable2b, product; . . . variable2x, product";

where the variable2's are each the name of an array containing a list of positions along that axis. The number of items in each array must match the length of the corresponding axis in the original variable1 data array.
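
The compact and product forms described above can be illustrated with a small Python sketch. It is not Data Explorer code: the function names and the 3 x 2 example grid are hypothetical, and it assumes that each axis is described by a scalar origin and delta and that the last axis varies fastest, as in the row-major layout of a netCDF variable.

```python
from itertools import product


def axis_from_compact(origin, delta, count):
    """Expand one (origin, delta) pair into `count` coordinates along an axis."""
    return [origin + i * delta for i in range(count)]


def grid_positions(axes):
    """Form the product of per-axis coordinate lists.

    Returns one tuple per grid point; the last axis varies fastest
    (an assumption of this sketch).
    """
    return list(product(*axes))


# Hypothetical 3 x 2 grid: the first axis is given compactly,
# the second as an explicit list of positions.
rows = axis_from_compact(10.0, 0.5, 3)
cols = [0.0, 7.0]
positions = grid_positions([rows, cols])
```

Only the origin, delta, and count are stored for a compact axis, so the full coordinate list never has to be written to the file; the product is formed at import time.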

If any of the axes in a partially regular product array are actually regular, they can be specified in "compact" form:

    variable1:positions = "variable2a, product, compact; variable2b, product; . . . variable2x, product";

where variable2a is the name of an origin, delta array, and the rest are position lists as before.

Connections

If the connections between positions form a regular lattice, no additional attributes are necessary. For 1D data, connections of "lines" are assumed. 2D data implies "quads," 3D data implies "cubes" and for higher dimensions, "hypercubes" are assumed.

If the connections are irregular, use one of the following attributes:

    variable1:connections = "variable3, tetrahedra";
    variable1:connections = "variable3, triangles";
    variable1:connections = "variable3, cubes";
    variable1:connections = "variable3, quads";

where variable3 is the name of an array containing a vector of point numbers, defining each connection element item. The length of this vector depends on the choice of connections. If the shape is not explicitly specified, tetrahedra are assumed.

Additional Components

If additional component information is present in the file, the following attributes are valid:

    variable1:component = "variable4, componentname, scalar; variable5, componentname, vector; variable6, componentname, matrix";

and

    variable4:attributes = "ref, componentname; dep, componentname";

SERIES DATA

The DX data model does support aggregates of data, which can be treated as a single entity. Such aggregates may be hierarchical or a simple flat collection of low-level objects like a (time) series. There are three ways to specify the import of datasets that should be treated as series: single variable, separate variables or separate files.

Single Variable

When all data values are defined as a single netCDF variable, and the unlimited dimension of the variable is to be interpreted as the series dimension, then use one of the following forms of the "field" attribute:

    variable1:field = "fieldname, scalar, series";
    variable1:field = "fieldname, vector, series";
    variable1:field = "fieldname, matrix, series";

All other specifications are the same as for simple fields. The position and connection information is assumed to be constant for all members of the series and hence, is not stored redundantly. If the positions or connections change for each step of the series, then the variables used for those arrays must also have an unlimited dimension that corresponds one-for-one with the data array. An example using this method is shown below.

Separate Variables

When there are separate netCDF variables defined for each step in the series, but all variables are in the same file, use the following global attribute tags:

    :seriesxxx = "fieldname; variable1a; variable1b; . . . variable1x";

or

    :seriesxxx = "fieldname; variable1a, float_value; variable1b, float_value; . . . variable1x, float_value";

where the global tag must have the first 6 characters "series". Global tags must be unique, so additional characters can be added to distinguish them. Each variable1x is the name of the array containing the data for that step. In the first format, the spacing of the steps is assumed to be 1.0. In the second format, the float_value is the value of each step. All other specifications are the same as for simple fields. For example,

    :series_temp = "temp; temp001; temp002; temp003; . . . ; temp999";

or

    :series_temp = "temp; temp001, 0.0; temp002, 0.3; temp003, 0.7";

Each name, tempnnn, is the name of a variable (array) containing the data for each member of the series.
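
To make the separate-variables syntax concrete, here is a minimal Python sketch of how such a global attribute string could be split into a field name and a list of (variable, step value) pairs. The function name parse_series is hypothetical and this is not the Data Explorer importer. The text only says that the default spacing is 1.0; starting the default count at 0.0 is an assumption of the sketch.

```python
def parse_series(value):
    """Split a ':seriesxxx' attribute string into (fieldname, steps).

    `steps` is a list of (variable_name, step_value) pairs.
    """
    entries = [e.strip() for e in value.split(";") if e.strip()]
    fieldname, members = entries[0], entries[1:]
    steps = []
    for index, entry in enumerate(members):
        name, _, position = entry.partition(",")
        # Without an explicit float value the steps are spaced 1.0 apart.
        # Counting from 0.0 is an assumption of this sketch.
        step_value = float(position) if position.strip() else float(index)
        steps.append((name.strip(), step_value))
    return fieldname, steps


explicit = parse_series("temp; temp001, 0.0; temp002, 0.3; temp003, 0.7")
implicit = parse_series("temp; temp001; temp002; temp003")
```

The sketch does not handle the ". . ." elision used in the documentation examples; a real attribute would list every member.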

Separate Files

When there are netCDF variables in separate files which make up the steps of a series, use the following global attribute tags:

    :seriesxxx = "fieldname, files; filename1; filename2; . . . filenameN";

or

    :seriesxxx = "fieldname, files; filename1, float_value; filename2, float_value; . . . filenameN, float_value";

where the global tag must have the first 6 characters "series". Global tags must be unique, so additional characters can be added to distinguish them. Each filenameN is the name of the netCDF file which contains the data variables for that step. In the first format, the spacing of the steps is 1.0. In the second format, the float_value is the value of each step. All other specifications are the same as for simple fields.

This format can be used to create short term series within a file, and then have a series of these smaller series. The syntax is an extension of what is done for multiple steps being multiple variables within a file. For example,

    :series_temp = "temp, files; temp_file1; temp_file2; temp_file3; . . . temp_fileN";

or

    :series_temp = "temp, files; temp_file1, 1001.0; temp_file2, 1001.5; temp_file3, 1002.0; . . . temp_fileN, 1231.5";

Compact Specifications of Regular Dimensions

This example describes a single two-dimensional scalar field on a latitude-longitude, regular, rectangular grid. The example data are temperature on a one-degree grid with global coverage. For regular dimensions, storing all the grid locations is redundant and wasteful of storage, even if you use a product notation that netCDF can handle. Because Data Explorer array objects can be specified compactly, you can use this method to specify a netCDF with regular dimensions efficiently. For each dimension, you need to specify its value at the origin and its spacing along the dimension. In this example, two variable attributes are defined for the netCDF variables. "field" specifies the rank of the field parameter, and "positions" specifies where the information containing the locations of the data in space is located.

    dimensions:
        lon = 360;
        lat = 180;
        naxes = 2;
        ndeltas = 2;
    variables:
        float locations(naxes, ndeltas);
        float temperature(lat, lon);
        temperature:field = "temperature, scalar";
        temperature:positions = "locations, regular";
    data:
        locations = 89.5, -1.,    // compact specification, origin and
                    -179, 1.;     // spacing for lat and lon
        temperature = ... ;       // Data for temperature

Partially Regular Grids and Time Series

This example describes an ocean circulation model, which consists of a time series of four three-dimensional scalars (temp, sali, wata and conv) and one three-dimensional 3-vector (vel). NetCDF would typically require seven variables (all scalars, with the vector being stored as three scalars). The coordinate system for the velocity vectors corresponds to that of the grid (that is, +u implies north, +v implies east, and +w implies down).

These grids are partially regular in that the "time," "tlat," and "tlon" portions (three out of the four dimensions) are all regularly spaced. "time" is to be mapped to members of a series group. The fourth dimension, "tlvl," is irregularly spaced. The compact notation can be used for the regular dimensions, while all the values along the irregular dimension must be specified; a product is formed from the dimensions. The specification in netCDL notation is:

    dimensions:
        time = UNLIMITED;
        tlat = 30;
        tlon = 50;
        tlvl = 30;
        vsize = 3;    // At each grid cell for variable vel, there are
                      // three floats for the u, v, and w components of the
                      // vector field.
        naxes = 3;
        ndeltas = 2;
    variables:
        float lat_axis(ndeltas, naxes);
        float lon_axis(ndeltas, naxes);
        float level_axis(tlvl, naxes);

        float temp(time, tlat, tlon, tlvl);
        temp:field = "temperature, scalar, series";
        temp:positions = "lat_axis, product, compact; lon_axis, product, compact; level_axis, product";

        float sali(time, tlat, tlon, tlvl);
        sali:field = "salinity, scalar, series";
        sali:positions = "lat_axis, product, compact; lon_axis, product, compact; level_axis, product";

        float wata(time, tlat, tlon, tlvl);
        wata:field = "water parage, scalar, series";
        wata:positions = "lat_axis, product, compact; lon_axis, product, compact; level_axis, product";

        float conv(time, tlat, tlon, tlvl);
        conv:field = "covective index, scalar, series";
        conv:positions = "lat_axis, product, compact; lon_axis, product, compact; level_axis, product";

        float vel(time, tlat, tlon, tlvl, vsize);
        vel:field = "velocity, vector, series";
        vel:positions = "lat_axis, product, compact; lon_axis, product, compact; level_axis, product";
    data:
        lat_axis   = -14.667, 0., 0.,
                     0.333, 0., 0.;
        lon_axis   = 0.0, -99.8, 0.0,
                     0.0, 0.5, 0.0;
        level_axis = 0.0, 0.0, 17.5,
                     0.0, 0.0, 53.425,
                     . . .
                     0.0, 0.0, 5374.98;
        temp = ... ;
        sali = ... ;
        wata = ... ;
        conv = ... ;
        vel = ... ;

Irregular Surface

This example is the netCDL description of a netCDF for an irregular surface, that of the classic teapot. It has precomputed normals, which are imported as the "normals" component, in addition to positions and connections.

    netcdf teapot {
        // name of datafile is "teapot.ncdf"
        // name of field is "surface"
    dimensions:
        pointnums = 2268;
        trinums = 3584;
        axes = 3;
        sides = 3;
    variables:
        float locations(pointnums, axes);
        float normalvect(pointnums, axes);
        long tris(trinums, sides);
        float surfacedata(pointnums);

        // global attributes:
        :source = "Classic Teapot, data from Turner Whitted";

        // specific attributes:
        surfacedata:field = "surface";
        surfacedata:connections = "tris, triangles";
        surfacedata:positions = "locations";
        surfacedata:component = "normalvect, normals, vector";
        normalvect:attributes = "dep, positions";

        // This is the start of a large data section
    data:
        . . .
    }
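
As a rough illustration of how a reader might check the teapot's "connections" attribute against the shape of the connection array, consider the Python sketch below. The names VERTICES_PER_ELEMENT, parse_connections and check_connections are hypothetical and do not come from Data Explorer. The vertex counts are the usual geometric ones, and the tetrahedra default follows the rule stated in the Connections section.

```python
# Vertices in each irregular connection element type.
VERTICES_PER_ELEMENT = {
    "triangles": 3,
    "quads": 4,
    "tetrahedra": 4,
    "cubes": 8,
}


def parse_connections(value):
    """Split a 'variable3, elementtype' string; default to tetrahedra."""
    parts = [p.strip() for p in value.split(",")]
    variable = parts[0]
    element = parts[1] if len(parts) > 1 else "tetrahedra"
    return variable, element


def check_connections(value, shape):
    """Verify that the connection array shape matches the element type.

    `shape` is (number_of_elements, vertices_per_element), as declared
    for the connection variable in the netCDF file.
    """
    variable, element = parse_connections(value)
    n_elements, n_vertices = shape
    expected = VERTICES_PER_ELEMENT[element]
    if n_vertices != expected:
        raise ValueError(
            f"{variable}: {element} need {expected} vertices, got {n_vertices}"
        )
    return variable, element, n_elements


# The teapot declares 'long tris(trinums, sides)' with trinums = 3584, sides = 3.
teapot = check_connections("tris, triangles", (3584, 3))
```

A mismatch, such as declaring "tris, quads" for the same array, raises ValueError rather than silently building a malformed surface.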