© Copyright 1997 American Meteorological Society. To appear in the Proceedings of the Thirteenth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, California, American Meteorological Society, February 1997.
Permission to place a copy of this work on this server has been provided by the AMS. The AMS does not guarantee that the copy provided here is an accurate copy of the published work.

UNIDATA'S NETCDF INTERFACE FOR DATA ACCESS:
STATUS AND PLANS

Russell K. Rew [1] and Glenn P. Davis

Unidata Program Center
University Corporation for Atmospheric Research
Boulder, Colorado

1. INTRODUCTION

Unidata's netCDF (network Common Data Form) is a data model for array-oriented scientific data access, a package of freely available software that implements the data model, and a machine-independent data format. NetCDF supports the creation, manipulation, and sharing of scientific data sets that are self-describing, portable, directly accessible, and appendable.

A data model specifies data components, relationships, and operations, independent of any particular programming language. The components of a netCDF data set are its variables, dimensions, and attributes. Each variable has a name, a shape determined by its dimensions, a type, some attributes, and values. Variable attributes represent ancillary information, such as units and special values used for missing data. Operations on netCDF components include creation, renaming, inquiring, writing, and reading.
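As a rough illustration of these components, the data model can be sketched with plain Python data structures. This is not the netCDF library interface: the class names, field names, and the example variable with its "units" and "missing_value" attributes are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from math import prod
from typing import Dict, Tuple, Union

AttValue = Union[str, int, float]


@dataclass
class Variable:
    """A named variable: a type, a shape given by dimension names, attributes."""
    name: str
    dtype: str                      # external type name (illustrative)
    dims: Tuple[str, ...]           # names of the dimensions that give the shape
    attributes: Dict[str, AttValue] = field(default_factory=dict)


@dataclass
class Dataset:
    """A data set: named dimensions, variables, and data-set-wide attributes."""
    dimensions: Dict[str, int] = field(default_factory=dict)
    variables: Dict[str, Variable] = field(default_factory=dict)
    attributes: Dict[str, AttValue] = field(default_factory=dict)

    def shape(self, var_name: str) -> Tuple[int, ...]:
        """Shape of a variable, looked up from its dimension names."""
        var = self.variables[var_name]
        return tuple(self.dimensions[d] for d in var.dims)

    def size(self, var_name: str) -> int:
        """Number of values the variable holds."""
        return prod(self.shape(var_name))


# A hypothetical data set with one three-dimensional variable.
ds = Dataset(dimensions={"time": 4, "lat": 3, "lon": 5})
ds.variables["temp"] = Variable(
    name="temp",
    dtype="float",
    dims=("time", "lat", "lon"),
    attributes={"units": "kelvin", "missing_value": -9999.0},
)
```

Because a variable refers to dimensions by name, several variables can share a dimension, and the shape of each is determined by the data set's dimension table.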

The netCDF software includes interfaces for C, Fortran, C++, perl, and Java. Utilities are available for displaying the structure and contents of a netCDF data set, as well as for generating a netCDF data set from a simple text representation.

The netCDF format provides a platform-independent binary representation for self-describing data in a form that permits efficient access to a small subset of a large data set, without first reading through all the preceding data. The format also allows appending data along one dimension without copying the data set or redefining its structure.
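Direct access of this kind is possible when values of fixed size are stored in a regular order, so that the position of any element can be computed from its indices. The sketch below assumes a row-major array of 4-byte big-endian floating-point values that begins at a known byte offset; the layout, the 16-byte placeholder header, and the function names are illustrative assumptions rather than a description of the netCDF format.

```python
import io
import struct


def element_offset(start, shape, index, item_size):
    """Byte offset of the element at `index` in a row-major array."""
    linear = 0
    for extent, i in zip(shape, index):
        if not 0 <= i < extent:
            raise IndexError("index out of range")
        linear = linear * extent + i
    return start + linear * item_size


def read_element(f, start, shape, index):
    """Seek directly to one 4-byte float and read it, skipping earlier data."""
    f.seek(element_offset(start, shape, index, 4))
    (value,) = struct.unpack(">f", f.read(4))
    return value


# Stand-in for a file: a 16-byte placeholder header, then a 3 x 4 array
# holding the values 0.0 through 11.0.
shape = (3, 4)
data_start = 16
values = [float(i) for i in range(12)]
buf = io.BytesIO(b"\x00" * data_start + struct.pack(">12f", *values))

value = read_element(buf, data_start, shape, (2, 1))
```

Only the four bytes of the requested element are read; the cost of the access does not depend on how much data precedes it.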

Since Unidata developed netCDF, other groups and projects in the geosciences have adopted the netCDF interfaces and format, and its use has also spread to other disciplines. Below, we summarize the growth in the use of netCDF, describe the current status of the software including the recent addition of new interfaces, present the benefits of using netCDF for platform-independent data representation, list some current limitations of the netCDF model and format, and discuss how some of these limitations are addressed by new features that are under development for the next version.

2. USAGE

Since netCDF was made available in 1989 and described in (Rew 1990), the popularity of the interface and format has continued to grow. Now widely used in the atmospheric sciences, it is one of only a handful of data-access interfaces and formats that are used across diverse scientific disciplines (Brown 1993). For example, as part of the Distributed Ocean Data System (DODS), developers have implemented a client-server-based distributed system for access to oceanographic data over the Internet that supports use of the netCDF interface for clients (DODS Web Site, August 1996). Descriptions of some of the other projects and groups that are now using netCDF are available from Unidata (NetCDF Web Site, September 1996).

As a measure of recent usage, during April and May 1996, over 900 distinct hosts downloaded version 2.4 of the netCDF software, and over 1600 distinct hosts in more than 50 countries accessed information on netCDF from the Web site. NetCDF data may now be accessed from over 20 packages of freely available software, including DDI, DODS, EPIC, FAN, FERRET, GMT, GrADS, HDF interface, LinkWinds, SciAn, and Zebra. Access to netCDF data is also available from commercial or licensed packages for data analysis and visualization, including IBM Data Explorer, IDL, GEMPAK, MATLAB, PPLUS, PV-Wave, PolyPaint+, and NCAR Graphics. For more information on these and other packages for manipulating and displaying netCDF data, see (NetCDF Software Web Site, September 1996).

Use of netCDF library interfaces to access data makes knowledge of the format unnecessary, but lack of a published format specification had proved an obstacle to the adoption of netCDF in some cases. This obstacle was recently removed, with the publication of detailed documentation for the netCDF format (Rew 1996).

The unexpectedly widespread use of netCDF means that any future changes to the data model, interfaces, or format must be planned and implemented with great care. Backward compatibility with existing software and data archives is very important to netCDF users and must be part of future development plans.

3. CURRENT STATUS

During the last year, support was added to netCDF version 2 (hereafter referred to as netCDF-2) for new platforms, and significant optimizations were implemented for supercomputer architectures. This improved the performance of netCDF sufficiently that it is now used for storing and comparing model results for several atmospheric and climate models (Kuehn 1996).

The recently released netCDF-3 includes a complete rewrite of the netCDF library. The netCDF-3 file format is unchanged, so files written with the new version can be read with previous versions and vice versa.

Starting with netCDF-3, the library is no longer dependent on a vendor-supplied XDR library for external data representation, making it easier to build applications that use netCDF. Replacement of the XDR layer also made the library about twice as fast as the previous version.

The netCDF-3 library is now written in ANSI C. The conversion to ANSI C offered an opportunity to implement a completely new C interface that provided significant benefits to C programs that use netCDF: type-safety, automatic type conversions, improved readability, and more standard error behavior. The new interface also removes some obstacles to adding future enhancements, such as packed data and enhanced concurrency.

NetCDF-3 also includes a new Fortran interface that provides analogous benefits: enhanced type safety, automatic type conversion, clean separation of external and language-native types, and a new function-naming scheme for improved readability in applications.

The netCDF-3 library includes support for all netCDF-2 function interfaces, global variables, and behavior. The benefits of the new C and Fortran interfaces will be an incentive to use them in future applications, but current applications that use the netCDF-2 interfaces will continue to work. Programs may be converted to the new interfaces incrementally, since a mixture of netCDF-2 and netCDF-3 calls is permitted.

The facilities for automatic type conversion in the new C and Fortran interfaces permit accessing numeric data using any convenient numeric type, independent of the external type of the data. For example, a user may access a variable as an array of double-precision floating-point numbers, even if the data is stored externally as 8-, 16-, or 32-bit integers or 32-bit floating-point numbers. Application programs can be simpler, since they don't have to deal with multiple external types, and can be more robust, since they continue to work even after a change to the external type of the data. This capability will be required in netCDF-4, when data may be represented externally in a packed form (for example, a packed array of 10-bit data) for which there is no natural corresponding native type.
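The idea behind automatic type conversion can be sketched without the netCDF library. The example below assumes that values are stored externally in big-endian order and decodes them into double-precision values whatever their external type; the type names, the lookup table, and the function name are assumptions for this sketch and are not the netCDF-3 interface.

```python
import struct

# Illustrative external type names mapped to struct format codes:
# 8-, 16-, and 32-bit integers, and 32- and 64-bit floating point.
EXTERNAL_CODES = {"byte": "b", "short": "h", "int": "i", "float": "f", "double": "d"}


def read_as_double(raw, external_type):
    """Decode big-endian external values into Python floats (doubles)."""
    code = EXTERNAL_CODES[external_type]
    count = len(raw) // struct.calcsize(">" + code)
    return [float(v) for v in struct.unpack(f">{count}{code}", raw)]


# The same three values stored with two different external types.
stored_as_short = struct.pack(">3h", 10, -20, 30)
stored_as_float = struct.pack(">3f", 10.0, -20.0, 30.0)

from_short = read_as_double(stored_as_short, "short")
from_float = read_as_double(stored_as_float, "float")
```

The calling code receives the same list of doubles in both cases, so it is unaffected if the external type of the data later changes.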

Other new features of netCDF-3 include the ability to easily suppress buffering to facilitate sharing data among concurrent programs, the ability to specify whether 8-bit data is treated as signed or unsigned, improved support for 64-bit platforms, and new simple inquiry functions.

FAN (File Array Notation), a new package of utilities for netCDF, was recently made available (Davies 1995). The capabilities of the FAN utilities include extracting and manipulating array data from netCDF files, printing selected data from netCDF arrays, copying ASCII data into netCDF arrays, and performing various operations (sum, mean, max, min, product, ...) on netCDF arrays.

4. BENEFITS

Benefits of using netCDF or other similar higher-level data-access interfaces for portable and self-describing data include:

5. LIMITATIONS

While the netCDF data model is widely applicable to data that can be organized into a collection of named scalar or array variables with named attributes, there are some important limitations to the model and its implementation in software. Some of these limitations are inherent in the trade-offs among conflicting requirements that netCDF embodies, but we plan to address other limitations in the next version of the software.

5.1 Compression and File Size
Currently, netCDF offers a limited number of external numeric data types: 8-, 16-, or 32-bit integers, and 32- or 64-bit floating-point numbers. This limited set of sizes may use file space inefficiently. For example, arrays of 9-bit values must be stored in 16-bit short integers. Storing arrays of 1- or 2-bit values in 8-bit values is even less efficient.
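The cost of this mismatch is simple arithmetic, sketched below. The function names are assumptions for this sketch; the integer sizes are the 8-, 16-, and 32-bit external types named above.

```python
INTEGER_SIZES = (8, 16, 32)  # external integer widths, in bits


def smallest_container(value_bits):
    """Smallest external integer width that can hold a value of this width."""
    for size in INTEGER_SIZES:
        if value_bits <= size:
            return size
    raise ValueError("no external integer type is wide enough")


def wasted_fraction(value_bits):
    """Fraction of each stored value that carries no information."""
    container = smallest_container(value_bits)
    return (container - value_bits) / container
```

For 9-bit values the wasted fraction is 7/16, a little under half of the file space used by the array; for 1-bit values stored in 8 bits it is 7/8.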

With the current netCDF file format, no more than 2 gigabytes of data can be stored in a single netCDF file. This limitation is a result of 32-bit offsets currently used for storing positions within a file.
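The size limit can be checked with a short calculation. The sketch below assumes that offsets are treated as signed integers, which is consistent with a 2-gigabyte limit, and it ignores the space taken by the file header; the function names are assumptions for this sketch.

```python
def max_offset(offset_bits, signed=True):
    """Largest byte position representable in an offset of the given width."""
    usable_bits = offset_bits - 1 if signed else offset_bits
    return 2 ** usable_bits - 1


def data_fits(shape, item_size, offset_bits=32):
    """True if an array of this shape could be addressed with such offsets."""
    total_bytes = item_size
    for extent in shape:
        total_bytes *= extent
    return total_bytes <= max_offset(offset_bits)
```

Under these assumptions, a 1000 x 1000 x 1000 array of 4-byte values (4,000,000,000 bytes) exceeds what a 32-bit offset can address, but would fit easily with 64-bit offsets.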

5.2 Indirect Access
Currently, if data in one netCDF file is also needed in another file, the data must either be copied, or an application must know about the location of the data in multiple files. There are no interfaces for defining variables in one file that point to other variables or variable data cross-sections in other files. This limits data sharing, and may even require maintaining multiple copies of data that is used in several files.

If it were possible to use a link variable to point to a specified cross-section of data in one or more other files, data could be shared by reference, without copying it. For example, an image loop could be represented by a small file containing a link variable pointing to image data in other files. To an application reading the link variable, it would appear as if the image data were in the file.

5.3 Necessity for Conventions
The extent to which data can be completely self-describing is limited: there is always some assumed context without which sharing and archiving data would be impractical. NetCDF permits storing meaningful names for variables, dimensions, and attributes; units of measure in a form that can be used in computations; text strings for attribute values that apply to an entire data set; and simple kinds of coordinate system information. But for more complex kinds of metadata (for example, the information necessary to provide accurate georeferencing of data on unusual grids or from satellite images), it is often necessary to develop conventions (Fulker 1991; NetCDF Conventions Web Site, May 1996).

Specific additions to the netCDF data model might make some of these conventions unnecessary or allow some forms of metadata to be represented in a uniform and compact way. For example, adding explicit georeferencing to the netCDF data model would simplify elaborate georeferencing conventions at the cost of complicating the model. The problem is finding an appropriate trade-off between the richness of the model and its generality (i.e., its ability to encompass many kinds of data). A data model tailored to capture the shared context among researchers within one discipline may not be appropriate for sharing or combining data from multiple disciplines.

5.4 Limitations of the Data Model
The netCDF data model does not support nested data structures such as trees, nested arrays, or other recursive structures, primarily because the current Fortran interface must be able to read and write any netCDF data set. Through use of indirection and conventions it is possible to represent some kinds of nested structures, but the result falls short of the netCDF goal of self-describing data.

Another limitation of the current model is that only one unlimited (changeable) dimension is permitted for each netCDF data set. Multiple variables can share an unlimited dimension, but then they must all grow together. Hence the netCDF model does not permit variables with several unlimited dimensions or the use of multiple unlimited dimensions in different variables within the same file. As a result, variables that have non-rectangular shapes (for example, ragged arrays) cannot be represented conveniently.
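The consequence of sharing a single unlimited dimension can be sketched as follows. This is a toy model rather than the netCDF library: the class name, the method name, and the use of None as a placeholder for values that have not been written are assumptions for this sketch.

```python
FILL = None  # placeholder for unwritten values (an assumption of this sketch)


class RecordVariables:
    """Variables that share one unlimited dimension and so grow together."""

    def __init__(self, names):
        self.num_records = 0                      # length of the shared dimension
        self.data = {name: [] for name in names}  # one value per record per variable

    def put(self, name, record, value):
        """Write one value; extending one variable extends all of them."""
        if record < 0:
            raise IndexError("record index must be non-negative")
        if record >= self.num_records:
            self.num_records = record + 1
        for values in self.data.values():
            values.extend([FILL] * (self.num_records - len(values)))
        self.data[name][record] = value


store = RecordVariables(["temp", "pressure"])
store.put("temp", 0, 280.5)
store.put("temp", 2, 281.0)
```

After the two writes, both variables have three records even though nothing was written to "pressure". A ragged array, in which each variable has its own length, cannot be expressed in this scheme without padding.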

6. NETCDF-4 AND FUTURE DIRECTIONS

The ability to store packed arrays of n-bit values without wasting space will be the first of the above limitations addressed in netCDF-4. Interfaces have already been designed to add new external data types for packed data, and to permit transparent packing and scaling of limited-precision floating-point values as n-bit arrays.

Both predefined and adaptive scaling will be supported. Parameters (scales and offsets) for packing will be permitted to vary along one or more variable dimensions. One or more exact and extreme values may be specified that will be preserved in packing and unpacking. Whether data is packed or not will be transparent to data readers, since the unpacking will be handled by the library. It will be possible to suppress unpacking and read the raw packed data, if desired.
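A minimal sketch of packing with a single predefined scale and offset is given below. It is not the planned netCDF-4 interface: the function names, the big-endian bit order, and the zero padding to a whole number of bytes are assumptions for this sketch, and features such as adaptive scaling or preserved exact values are not shown.

```python
def pack_values(values, n_bits, scale, offset):
    """Quantize values to n-bit unsigned codes and concatenate the codes."""
    max_code = (1 << n_bits) - 1
    bits = 0
    for v in values:
        code = round((v - offset) / scale)
        if not 0 <= code <= max_code:
            raise ValueError("value does not fit in n_bits with this scaling")
        bits = (bits << n_bits) | code
    total_bits = n_bits * len(values)
    pad = (-total_bits) % 8          # zero bits added to fill the last byte
    return (bits << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack_values(packed, count, n_bits, scale, offset):
    """Recover the scaled values from a packed byte string."""
    pad = (-(n_bits * count)) % 8
    bits = int.from_bytes(packed, "big") >> pad
    mask = (1 << n_bits) - 1
    codes = [(bits >> (n_bits * (count - 1 - i))) & mask for i in range(count)]
    return [code * scale + offset for code in codes]


# Four temperatures with a resolution of 0.25, packed as 10-bit values.
temps = [250.0, 250.5, 300.0, 275.25]
packed = pack_values(temps, n_bits=10, scale=0.25, offset=250.0)
restored = unpack_values(packed, len(temps), n_bits=10, scale=0.25, offset=250.0)
```

Here four values occupy 5 bytes instead of the 8 bytes needed to store them as 16-bit integers. A reader that calls only the unpacking function never needs to know how the values were stored.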

The current 2-gigabyte file size limitation will also be eliminated with the use of 64-bit offsets in netCDF-4.

To support packed data and larger file sizes, netCDF-4 will require a new format, the first format change for netCDF. For backward compatibility with programs and data archives that use the current netCDF format, the netCDF-4 software must support access to data in both the old and new formats. Fortunately, netCDF already includes a format version number in the file format, so users and programs need not know whether they are accessing data in the old or new format. It will not be possible to add packed data to old format files, but otherwise the change should be relatively transparent.
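Dispatch on the format version can be sketched as below. The sketch assumes that a file begins with the characters "CDF" followed by a one-byte version number, with 1 denoting the current format; the value 2 for a new format, the descriptive strings, and the function names are hypothetical.

```python
CURRENT_FORMAT = 1
NEW_FORMAT = 2  # hypothetical version number for a future format

DESCRIPTIONS = {
    CURRENT_FORMAT: "current format, 32-bit offsets",
    NEW_FORMAT: "new format, 64-bit offsets and packed data",
}


def format_version(header):
    """Return the version byte that follows the 'CDF' signature."""
    if len(header) < 4 or header[:3] != b"CDF":
        raise ValueError("not a netCDF file")
    return header[3]


def describe_format(header):
    """Select behavior from the version number, as a library might."""
    version = format_version(header)
    if version not in DESCRIPTIONS:
        raise ValueError(f"unsupported format version {version}")
    return DESCRIPTIONS[version]
```

A library structured this way can choose the appropriate reader when a file is opened, so that an application makes the same calls for either format.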

It may be possible to add link variables for indirect data access to netCDF-4 as well. Our plans for this addition tentatively include the use of a combination of URL and FAN notation for specifying references to cross-sections of data in other files or on other hosts. This has the potential to make the usefulness of data independent of its location, permitting all the members of a virtual community to view and make use of their data holdings as a common resource.

Finally, we were surprised at how easy it was to provide a Java interface for netCDF data, based on an initial read-only Java interface from Joe Sirott (Java Climate Atlas Web Site, July 1996). The use of a Java-based approach to the design and implementation of distributed data access systems appears very promising. Systems based on Java's Remote Method Invocation package (RMI Web Site, September 1996) with portable data in forms such as netCDF may be able to provide powerful new capabilities that will be important for future applications, including independence from data location; executable content that is part of the metadata (for example, for georeferencing data); platform-independence for applications; the ability to write data clients and servers using simple abstract interfaces for data access; and a hierarchy of rich object models that make it easy to customize data access for particular applications.

7. REFERENCES

Brown, S. A., M. Folk, G. Goucher, and R. Rew, 1993. "Software for portable scientific data management," Computers in Physics, Am. Inst. Phys., Vol. 7, No. 3, 304-308.

Davies, H. L., 1995. "FAN - An array-oriented query language," Second Workshop on Database Issues for Data Visualization (Visualization '95), Atlanta, Georgia, IEEE.

DODS Web Site
http://dods.gso.uri.edu/DODS/

Fulker, D. W., 1991. "Unidata strawman for storing earth-referencing data," Seventh International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, New Orleans, Am. Meteor. Soc.

Kuehn, J. A., 1996. "Faster libraries for creating network-portable self-describing datasets," Proceedings of the 37th Cray User Group Meeting, Barcelona, Spain, Cray User Group.

NetCDF Web Site
http://www.unidata.ucar.edu/software/netcdf/

NetCDF Software Web Site
http://www.unidata.ucar.edu/software/netcdf/software.html

NetCDF Conventions Web Site
http://www.unidata.ucar.edu/software/netcdf/conventions.html

RMI Web Site
http://chatsubo.javasoft.com/current/

Rew, R. K. and G. P. Davis, 1990. "The Unidata netCDF: software for scientific data access," Sixth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, California, Am. Meteor. Soc., 33-40.

Rew, R. K., G. P. Davis, S. Emmerson, and H. Davies, 1996. NetCDF User's Guide, An Interface for Data Access. (Available as PostScript or on the Web at <URL:http://www.unidata.ucar.edu/software/netcdf/docs.html>.)

Sirott Web Site
http://cosmo.atmos.washington.edu/


[1] Corresponding author address: Russ Rew, UCAR Unidata, P.O. Box 3000, Boulder, CO 80307-3000; e-mail <russ@unidata.ucar.edu>. The Unidata Program Center is sponsored by the National Science Foundation and managed by the University Corporation for Atmospheric Research.