NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
Mike,

Several of us (Mitch Baltuch, Peggy Bruehl, Glenn Davis, Steve Emmerson, Dave Fulker, and Russ Rew) have had a chance to read and think about the draft ``netCDF/HDF Design Document'' and we have some questions and comments, which I've collected into the following response. Please pass these on to whoever else should see such comments at NCSA.

First, it is not clear from the draft what you intend with regard to data stored in the current netCDF format. More specifically, will it be possible to use tools written to the planned netCDF/HDF interface on archives of current netCDF files? Or will such archives have to be converted to HDF first? We would naturally prefer that the implementation be able to recognize whether it is dealing with HDF or netCDF files. Symmetrically, you should expect that if a program uses the netCDF/HDF interface implementation to create a file, then our library should be able to deal with it (though we currently don't have the resources to be able to commit to this).

In fact this goal could be stated more strongly:

    Data created by a program that uses the netCDF interface should be accessible to other programs that use the netCDF interface without relinking.

This principle covers portability of data across different platforms, but also implies that a consolidated library must handle both netCDF/HDF and netCDF file formats and must maintain backward compatibility for archives stored using previous versions of the formats. This level of interoperability may turn out to be impractical, but we feel it is a desirable goal. It seems to be implied by the sentence on page 4:

    A hybrid implementation will also give HDF users the power of the netCDF tools while at the same time making the HDF tools available to netCDF users.

Note that one possible way to achieve this goal is to recognize the file type when a file is opened, and use the current implementations of the HDF and netCDF libraries as appropriate.
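The idea of recognizing the file type at open time can be sketched as below. This is only an illustration, not part of either library: the function names `detect_format` and `detect_file` are hypothetical, and the two signatures are assumptions based on the leading bytes commonly documented for netCDF classic files ('C', 'D', 'F', then a version byte of 1) and for HDF files; they should be checked against the format documentation before being relied on.

```python
# Hypothetical sketch: decide which library should handle a file by
# inspecting its first four bytes. Names and signatures are assumptions.

NETCDF_MAGIC = b"CDF\x01"        # assumed netCDF classic signature
HDF_MAGIC = b"\x0e\x03\x13\x01"  # assumed HDF signature


def detect_format(header: bytes) -> str:
    """Classify a file from its leading bytes."""
    if header.startswith(NETCDF_MAGIC):
        return "netcdf"
    if header.startswith(HDF_MAGIC):
        return "hdf"
    return "unknown"


def detect_file(path: str) -> str:
    """Read only the first four bytes of a file and classify it."""
    with open(path, "rb") as f:
        return detect_format(f.read(4))


if __name__ == "__main__":
    # A dispatching open routine would branch on this result and call
    # the existing netCDF or HDF implementation accordingly.
    print(detect_format(b"CDF\x01" + b"\x00" * 28))
    print(detect_format(b"\x0e\x03\x13\x01" + b"\x00" * 28))
```

A consolidated library could perform this check once per open call, so the cost to programs that only ever see one format would be a single small read.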
A new flag used when creating a file could specify which type of file representation was desired.

This use of two different representations for data accessed by the same interface can be justified if each representation has clear benefits; otherwise, we should agree on using the single superior representation and relegating the other to read-only support as long as useful archives in that form exist. If a representation based on VSets is superior to the current netCDF representation in some ways and inferior in other significant ways, then the use of both representations is likely to continue. For example, it may be possible to support variable-length records with the VSet implementation at the cost of slower hyperslab access. In such a case, users would benefit most if the alternative tradeoffs captured in the two different representations were available from a single library at file creation time.

Although it may be too early to determine the advantages or disadvantages of one representation over the other, perhaps it needs to be made more clear how the benefits of the VSet-based implementation compare with the implementation costs and the potential space and performance penalties discussed in section 3.

We could not determine from the draft whether this project includes resources for rewriting existing HDF tools to use the netCDF/HDF interface. If so, will these tools also use other HDF interfaces or low-level HDF calls? If they do, they may not be very useful to the netCDF user community. This is a question of completeness of the interface. If the netCDF/HDF interface is still missing some functionality needed by the tools and requiring the use of other HDF interfaces, perhaps it would be better to augment the netCDF/HDF interface to make it completely adequate for such tools.

Here are some more specific comments on the draft design document, in order of appearance in the draft document:

On page 1, paragraph 1, you state:

    [netCDF] has a number of limiting factors.
    Foremost among them are problems of speed, extensibility and supporting code.

If the netCDF model permitted more extensibility by allowing users to define their own basic data types, for example, it might be impractical to write fully general netCDF programs like the netCDF operators we have specified. There is a tradeoff between extensibility and generality of programs that may be written to a particular data model. The ultimate extensibility is to permit users to write any type of data to a file, e.g. with fwrite(), but then no useful high-level tools can be written that exploit the data model; it becomes equivalent to a low-level data-access interface. The lack of extensibility may thus be viewed as a carefully chosen tradeoff rather than a correctable disadvantage.

On page 2, paragraph 2:

    The Unidata implementation only allows for a single unlimited dimension per data set. Expectations are that the HDF implementation will not have such a limitation.

We are somewhat skeptical about the practicality of supporting both multiple unlimited dimensions and efficient direct access to hyperslabs. Consider a single two-dimensional array with both dimensions unlimited. Imagine starting with a 2 by 2 array, then adding a third column (making it 2 by 3), then adding a third row (making it 3 by 3), then adding a fourth column (making it 3 by 4), and so on, until you have an N by N array. Keeping the data contiguous is impractical, because it would require about 2*N copying operations, resulting in an unacceptably slow O(N**3) access algorithm for O(N**2) data elements. The alternative of keeping each incremental row and column in its own VData would mean that accessing either the first row or the first column, for example, would require O(N) reads, and there would be no easy way of reading all the elements in the array by row or by column that did not require multiple reads for many of the data blocks.
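The copying cost in the growing-array example can be tallied with a short sketch. It assumes, as the argument above does, that every growth step rewrites all elements already stored in order to keep the data contiguous; the function name `grow_cost` is hypothetical and the model ignores any I/O details.

```python
# Hypothetical cost model: grow a contiguous 2-D array from 2 by 2 to
# n by n, alternately adding one column and one row. Assumption: each
# growth step copies every element already stored.

def grow_cost(n: int) -> tuple[int, int]:
    """Return (growth_steps, elements_copied) needed to reach n by n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    rows = cols = 2
    steps = 0
    copied = 0
    while rows < n:
        copied += rows * cols  # rewrite existing data, then add a column
        cols += 1
        steps += 1
        copied += rows * cols  # rewrite existing data, then add a row
        rows += 1
        steps += 1
    return steps, copied


if __name__ == "__main__":
    for n in (10, 100, 1000):
        steps, copied = grow_cost(n)
        # The ratio copied / n**3 settles toward a constant, which is
        # what cubic growth in the copying work looks like.
        print(n, steps, copied, copied / n**3)
```

Under this model the number of growth steps is 2*(N-2), matching the "about 2*N copying operations" estimate, and the total number of elements copied grows in proportion to N**3 even though only N**2 elements are ever stored.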
With the current implementation, each row requires only 1 read and all the elements in the array may be read efficiently from the N row records.

Most netCDF programs we have seen use direct access to hyperslabs, and we think maintaining efficient direct access to hyperslabs of multidimensional data should be an important goal. If you can eliminate the current netCDF restriction of only a single unlimited dimension while preserving efficient hyperslab access, we would be very impressed.

Page 2, paragraph 5:

    One of the primary drawbacks of the existing Unidata implementation is that it is based on XDR.

This is another case where a particular tradeoff can be viewed as a drawback or a feature, depending on the requirements. Use of a single specific external data format is an advantage when maintaining the code, comparing files written on different platforms, or supporting a large number of platforms. Use of native format and converters, as in HDF, means that the addition of a new platform requires writing conversions to all other existing representations, whereas netCDF requires only conversion to and from XDR.

The performance of netCDF in some common applications relates more to the stdio layer below XDR than to XDR: the buffering scheme of stdio is not optimal for styles of access used by netCDF. We have evidence that this can be fixed without abandoning XDR or the advantages of a single external representation.

Page 4, paragraph 2:

    In fact, the people at Unidata are reluctant to divulge how a netCDF structure is actually stored on disk ...

This is a slight overstatement. We have only been reluctant to document the netCDF structure in early versions of the netCDF User's Guide, but the structure of netCDF files has always been derivable from the code, which we make freely available. We added a chapter to the User's Guide, ``The NetCDF File Structure and Performance'', which discusses the parts of a netCDF file and their order.

Page 4, paragraph 6:

    For instance, it will then be possible to associate a 24-bit raster image with a [netCDF] variable.

We're not sure how it would be possible to access such data using the existing netCDF interface. For example, if you used ncvarget(), would you have to provide the address of a structure for the data to be placed in? If other new types are added, how can generic programs handle the data? What is returned by ncvarinq() for the type of such data? Do you intend that attributes can have new types like "24-bit raster image" also? As for storing 24-bit data efficiently, we have circulated a proposal for packed netCDF data using three new reserved attributes that would support this.

Page 5, paragraph 4:

    Then if the user wants to associate any attributes with that dimension, they are forced to create a variable with the same name (i.e. time(time) in the variable section of Figure 1) and associate any attributes with the variable. ... Since a dimension can have any number of attributes, it is necessary ...

Strictly speaking, a netCDF dimension can't have attributes, only a name and a size. If a variable has the same name as a netCDF dimension and the variable's shape is specified by that dimension, it is treated by convention only as a coordinate variable for the dimension.

The amount of space saved by merging dimensions with their coordinate variables seems small, since netCDF datasets typically have a small number of dimensions compared to the amount of data. It might even end up taking more space for some datasets, since you presumably would have to generate dimension values for dimensions that had no corresponding coordinate variable.

Page 7, paragraph 2:

    ... it is not readily clear that a distinction needs to be made between dimensions and variables.

Dimensions serve to interrelate variables that are defined on a common grid, as well as specifying shapes and sizes of variables.
It seems necessary to preserve the distinction between netCDF dimensions and variables for several reasons. First, some variables cannot serve in the role of dimensions, for example multidimensional variables, or single-dimension variables with non-monotonic values. Second, some of the properties of variables make no sense for dimensions, for example missing values, type, and associated attributes. Some characteristics of dimensions also do not make sense for variables: for example, it is easy to define what is meant by an "unused dimension" (not used to define the shapes of any variables), but what would an "unused variable" mean? We think you are right when you say

    Representing these two object the same way may cause more problems than it solves ...

Page 7, paragraph 5:

    However, people have asked that the netCDF be able to handle 300,000 records, each record containing a single 8-bit data element.

We currently round the size of each record up to the nearest 32-bit boundary, so you may be trying something too ambitious if you plan to make this much more space-efficient than under the current implementation. However, the 50-byte overhead for each record under HDF, if each record is stored as a VData, does seem too extravagant.

Page 8, paragraph 1:

    The current VGroup access routines would require a linear search through the contents of a VGroup when performing lookup functions. ... Because a variable's VGroup may contain other elements (dimensions, attributes, etc. ...) it is not sufficient to go to the Xth child of the VGroup when looking for the Xth record.

As stated above, we think it is very important to preserve direct access to netCDF data, and to keep hyperslab access efficient.

Page 8, paragraph 6:

    Furthermore, Unidata is in the process of adding operators to netCDF, which may be lost by adopting SILO as a front-end.

The netCDF operators do not currently involve any extensions to the netCDF library; they are written entirely on top of the current library interface. It is possible that we will want to add an additional library function later to provide more efficient support for some of the netCDF operators (e.g. ncvarcpy() which would copy a variable from one netCDF file to another without going through the XDR layer). We agree with your decision to use the Unidata netCDF library rather than SILO as the "front-end".

We have set up a mailing list here for Unidata staff who are interested in the netCDF/HDF project:

    netcdf-hdf@xxxxxxxxxxxxxxxxx

Feel free to send additional responses or draft documents to that address or to individual Unidata staff members.

----
Russ Rew                                russ@xxxxxxxxxxxxxxxx
Unidata Program Center
University Corporation for Atmospheric Research
P.O. Box 3000
Boulder, Colorado 80307-3000