NOTE: The netcdf-hdf mailing list is no longer active. The list archives are
made available for historical reasons.
netcdf-hdf group:

This is a response to the response from the Unidata team to the netCDF/HDF
Design Document.  The original response was posted to
netcdf-hdf@xxxxxxxxxxxxxxxx on April 21.

Mike and Chris

=============================================================

Russ et al:

Thanks for your response to the "netCDF/HDF Design Document."  Now that we
really have to get the project going, things aren't nearly so simple, and
this kind of feedback is extremely useful.  We have gone over your response,
and I've put together some responses and clarifications, which follow.

Mike & Chris

>Mike,
> ...
>
>First, it is not clear from the draft what you intend with regard to data
>stored in the current netCDF format.  More specifically, will it be
>possible to use tools written to the planned netCDF/HDF interface on
>archives of current netCDF files?  Or will such archives have to be
>converted to HDF first?  We would naturally prefer that the implementation
>be able to recognize whether it is dealing with HDF or netCDF files.
>Symmetrically, you should expect that if a program uses the netCDF/HDF
>interface implementation to create a file, then our library should be able
>to deal with it (though we currently don't have the resources to be able
>to commit to this).  In fact this goal could be stated more strongly:
>
>    Data created by a program that uses the netCDF interface should be
>    accessible to other programs that use the netCDF interface without
>    relinking.
>
>This principle covers portability of data across different platforms, but
>also implies that a consolidated library must handle both netCDF/HDF and
>netCDF file formats and must maintain backward compatibility for archives
>stored using previous versions of the formats.  This level of
>interoperability may turn out to be impractical, but we feel it is a
>desirable goal.

We agree that it is desirable that users not have to think about (or even
know) how their data is organized.
The difficulties involved in maintaining two or more storage formats are
ones we already have to deal with just within HDF.  There are instances
where we've developed newer, better ways of organizing a particular object.
It isn't fun, but so far it's been manageable.  What worries me about this
policy over the long term is the cumulative work involved as new platforms
are introduced, and new versions of operating systems and programming
languages are introduced.  As these sorts of things happen, we would like
not to be committed to supporting all old "outdated" formats.  Initially we
definitely will support both old and new netCDF formats.  We just don't
want to guarantee that we will carry it over to new platforms and machines.

There is another issue that has to do with supporting "old" things.  Based
on feedback we're getting from loyal HDF users, we'll probably want to
extend that idea to data models, too.  For example, some heavy users would
rather stick with the predefined SDS model than the more general netCDF
model.  In a sense, that's no problem, since netCDF provides a superset of
SDS.  We might define SDS as a standard netCDF data abstraction for a
certain range of applications.  The same has been suggested of raster
images.  Still, this kind of thing could be very confusing to users trying
to decide whether to use one or the other interface.  In addition, we would
want all software to know that it could treat something stored as an SDS
the same way it treats an equivalent netCDF.  I suspect you people have
already faced this problem with differently defined netCDFs.  My guess
would be that the problem is manageable if the number of different
abstractions is small.  I'd be interested in your observations.

>It seems to be implied by the sentence on page 4:
>
>    A hybrid implementation will also give HDF users the power of the
>    netCDF tools while at the same time making the HDF tools available to
>    netCDF users.
>
>Note that one possible way to achieve this goal is to recognize the file
>type when a file is opened, and use the current implementations of the HDF
>and netCDF libraries as appropriate.  A new flag used when creating a file
>could specify which type of file representation was desired.

Yes, this would be a way to do it.  I would like to encourage one format
only, however, because in the long run it would make for greater
interoperability among programs.

>
>This use of two different representations for data accessed by the same
>interface can be justified if each representation has clear benefits;
>otherwise, we should agree on using the single superior representation and
>relegating the other to read-only support as long as useful archives in
>that form exist.  If a representation based on VSets is superior to the
>current netCDF representation in some ways and inferior in other
>significant ways, then the use of both representations is likely to
>continue.  For example, it may be possible to support variable-length
>records with the VSet implementation at the cost of slower hyperslab
>access.  In such a case, users would benefit most if the alternative
>tradeoffs captured in the two different representations were available
>from a single library at file creation time.

Good example.  I think there will be times when the current netCDF format
is definitely superior.  For example, suppose I have three variables with
the unlimited dimension, stored in an interleaved fashion.  If I access a
hyperslab of "records", taking the same slab from all three variables, I
might be able to avoid the three seeks I would have to make using the Vset
approach (as currently designed--could change).  Another option would be to
implement the netCDF physical format as an option within HDF, so that the
basic file format would still be HDF but the physical storage would follow
the old netCDF scheme.  (This is a little tricky for the example I've
given, and may be really dumb.)
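The file-type recognition mentioned above is straightforward in practice,
because both formats carry a magic number: classic netCDF files begin with
the bytes CDF followed by a version byte, and HDF4 files begin with the
bytes 0x0e 0x03 0x13 0x01.  A minimal sketch in Python (the function names
are mine, not from either library):

```python
def classify_magic(magic):
    """Classify a file's first four bytes as netCDF, HDF4, or unknown."""
    if magic[:3] == b"CDF" and magic[3:4] in (b"\x01", b"\x02"):
        return "netcdf"               # classic (XDR-based) netCDF format
    if magic == b"\x0e\x03\x13\x01":
        return "hdf"                  # HDF4 native format
    return "unknown"

def detect_format(path):
    """Guess a file's format from its magic number."""
    with open(path, "rb") as f:
        return classify_magic(f.read(4))
```

A consolidated open() could dispatch to the appropriate library based on
this result, while a creation flag selects which representation to write.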
We already have the option of different physical storage schemes for
individual objects (contiguous, linked blocks, and external), so the
concept is there, sort of.

>Although it may be too early to determine the advantages or disadvantages
>of one representation over the other, perhaps it needs to be made more
>clear how the benefits of the VSet-based implementation compare with the
>implementation costs and the potential space and performance penalties
>discussed in section 3.

Good idea.  We will try to expand that section.  Meantime, it would help us
if you could share with us anything you've written on why you chose the
format you did.  We have tried to determine the strengths and weaknesses of
the current format, but you have certainly thought about it more than we
have.

>
>We could not determine from the draft whether this project includes
>resources for rewriting existing HDF tools to use the netCDF/HDF
>interface.

That isn't covered in the draft, but in the NSF proposal we say we'll do
that during the second year of the project.  With the EOS decision and
possible extra funding, we may do it sooner.  It depends a lot on what EOS
decides should be given priority.  We've already had meetings with our tool
developers and others about doing this, and it seems pretty
straightforward, especially if we ignore attributes that NCSA tools don't
yet know about.  By the way, Ben Domenico mentioned some time ago that he
might assign somebody the task of adapting X-DataSlice to read netCDF.  Did
that ever happen?

>If so, will these tools also use other HDF interfaces or low-level HDF
>calls?  If so, they may not be very useful to the netCDF user community.

Good point.  We now have a situation in which any of a number of different
types of data can be usefully read by the same tool.  8-bit raster, 24-bit
raster, 32-bit float, 16-bit integer, etc., all can be thought of as
"images."  How we sort this out, or let the users sort it out, is going to
be tricky.
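Treating a 32-bit float or 16-bit integer array as an "image" usually means
rescaling its values into the 0-255 range of an 8-bit raster.  A minimal
sketch of that idea (the linear-scaling choice and function name are mine,
purely illustrative, not how any NCSA tool actually does it):

```python
def to_8bit(values):
    """Linearly rescale a flat list of numbers into 0..255 gray levels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [128 for _ in values]   # flat data: map everything to mid-gray
    scale = 255.0 / (hi - lo)
    return [int(round((v - lo) * scale)) for v in values]
```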
>This is a question of completeness of the interface.  If the netCDF/HDF
>interface is still missing some functionality needed by the tools and
>requiring the use of other HDF interfaces, perhaps it would be better to
>augment the netCDF/HDF interface to make it completely adequate for such
>tools.

This is an issue that we now need to really tackle.  It highlights the fact
that HDF has a number of interfaces (and correspondingly a number of data
models, I guess), whereas netCDF presents a single data model (I guess).
There are pros and cons to each approach, which we probably should
explicate at some point.

Pros and cons aside, netCDF seems to cover a goodly portion of what the
other HDF interfaces cover.  The SDS interface obviously fits well into
netCDF.  The raster image interface can be described in terms of netCDF
(8-bit for sure, 24-bit less well), though it seems to work so well with
its current organization that we'll have to think hard about whether to
convert it to netCDF.  Palettes, same.  Annotations, maybe not as well,
especially when we support appending to annotations and multiple
annotations per object.  What's left is Vsets, which we put in to support
unstructured grids, as well as providing a general grouping structure.
Vsets have become very popular, and seem to fill a number of needs.  I
think the SILO extensions to netCDF may actually give us a nice "extended"
netCDF that will cover many of the high-level applications of Vsets.  We
never did think of Vsets as a high-level interface, but rather as a
collection of routines that would facilitate building complex organizations
for certain applications, such as graphics and finite element applications.
SILO appears to give us that higher-level extension.

>
>Here are some more specific comments on the draft design document, in
>order of appearance in the draft document:
>
>On page 1, paragraph 1, you state:
>
>    [netCDF] has a number of limiting factors.
>    Foremost among them are problems of speed, extensibility and
>    supporting code.
>
>If the netCDF model permitted more extensibility by allowing users to
>define their own basic data types, for example, it might be impractical to
>write fully general netCDF programs like the netCDF operators we have
>specified.  There is a tradeoff between extensibility and generality of
>programs that may be written to a particular data model.  The ultimate
>extensibility is to permit users to write any type of data to a file,
>e.g. fwrite(), but then no useful high-level tools can be written that
>exploit the data model; it becomes equivalent to a low-level data-access
>interface.  The lack of extensibility may thus be viewed as a carefully
>chosen tradeoff rather than a correctable disadvantage.

Good point.  This highlights the fact that HDF concentrated in its early
days on providing a format that would support a variety of data models,
whereas CDF went for a single, more general model, taking the position that
the file format was not nearly as important.  It also highlights the fact
that, for the time being at least, we feel there is enough value in the
multiple-model/extensibility aspects of HDF that we want to keep them.
netCDF would be one of several data models supported in HDF, at least
initially.

>
>On page 2, paragraph 2:
>
>    The Unidata implementation only allows for a single unlimited
>    dimension per data set.  Expectations are that the HDF implementation
>    will not have such a limitation.
>
>We are somewhat skeptical about the practicality of supporting both
>multiple unlimited dimensions and efficient direct-access to hyperslabs.
>Consider a single two-dimensional array with both dimensions unlimited.
>Imagine starting with a 2 by 2 array, then adding a third column (making
>it 2 by 3), then adding a third row (making it 3 by 3), then adding a
>fourth column (making it 3 by 4), and so on, until you have an N by N
>array.
>Keeping the data contiguous is impractical, because it would require about
>2*N copying operations, resulting in an unacceptably slow O(N**3) access
>algorithm for O(N**2) data elements.  The alternative of keeping each
>incremental row and column in its own VData would mean that accessing
>either the first row or the first column, for example, would require O(N)
>reads, and there would be no easy way of reading all the elements in the
>array by row or by column that did not require multiple reads for many of
>the data blocks.  With the current implementation, each row requires only
>1 read and all the elements in the array may be read efficiently from the
>N row records.

Yes, this was less clear in the paper than it should have been.  For
exactly the reasons you have outlined above, the restriction that any
variable can have only a single unlimited dimension would have to remain.
However, it should be possible to have a variable X dependent on unlimited
dimension 'time' and a variable Y dependent on unlimited dimension 'foo'
in the same file.

>
>
>Most netCDF programs we have seen use direct access to hyperslabs, and we
>think maintaining efficient direct access to hyperslabs of
>multidimensional data should be an important goal.  If you can eliminate
>the current netCDF restriction of only a single unlimited dimension while
>preserving efficient hyperslab access, we would be very impressed.

So would we :-).

>
>Page 2, paragraph 5:
>
>    One of the primary drawbacks of the existing Unidata implementation is
>    that it is based on XDR.
>
>This is another case where a particular tradeoff can be viewed as a
>drawback or a feature, depending on the requirements.  Use of a single
>specific external data format is an advantage when maintaining the code,
>comparing files written on different platforms, or supporting a large
>number of platforms.
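The copying-cost argument above can be checked with a small simulation:
keep a row-major array contiguous while alternately appending columns and
rows, and count how many existing elements must be moved.  Appending a row
is a cheap append at the end of the buffer, but each appended column forces
a rewrite of every existing row, so the total work grows roughly as N**3
for an N by N result.  A sketch (the counting scheme is mine, purely
illustrative):

```python
def moves_to_grow(n):
    """Count element moves needed to grow a contiguous row-major array
    from 2 by 2 to n by n by alternately adding columns and rows."""
    rows = cols = 2
    moves = 0
    while cols < n or rows < n:
        if cols < n:
            moves += rows * cols   # adding a column rewrites every row
            cols += 1
        if rows < n:
            rows += 1              # adding a row just appends at the end
    return moves
```

Doubling N multiplies the count by roughly eight, which is exactly the
cubic blow-up the paragraph above describes.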
>Use of native format and converters, as in HDF, means that the addition of
>a new platform requires writing conversions to all other existing
>representations, whereas netCDF requires only conversion to and from XDR.
>The performance of netCDF in some common applications relates more to the
>stdio layer below XDR than to XDR: the buffering scheme of stdio is not
>optimal for styles of access used by netCDF.  We have evidence that this
>can be fixed without abandoning XDR or the advantages of a single external
>representation.

Just one clarification here: HDF offers native mode only on the condition
that there will be no conversion.  Some day we might offer conversions from
and to all representations, but not now.  We've only gotten a little flak
about that.

> ...
>Page 4, paragraph 6:
>
>    For instance, it will then be possible to associate a 24-bit raster
>    image with a [netCDF] variable.
>
>We're not sure how it would be possible to access such data using the
>existing netCDF interface.  For example, if you used ncvarget(), would you
>have to provide the address of a structure for the data to be placed in?
>If other new types are added, how can generic programs handle the data?
>What is returned by ncvarinq() for the type of such data?  Do you intend
>that attributes can have new types like "24-bit raster image" also?  As
>for storing 24-bit data efficiently, we have circulated a proposal for
>packed netCDF data using three new reserved attributes that would support
>this.
>

Yeah.  Good questions.  We haven't tackled them yet.

...
>Page 8, paragraph 1:
>
>    The current VGroup access routines would require a linear search
>    through the contents of a VGroup when performing lookup functions. ...
>    Because a variable's VGroup may contain other elements (dimensions,
>    attributes, etc. ...) it is not sufficient to go to the Xth child of
>    the VGroup when looking for the Xth record.
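The "single external representation" tradeoff is easy to illustrate: XDR
fixes one on-disk layout (big-endian, 4-byte units), so each platform needs
only one pair of conversions, native-to-XDR and XDR-to-native, rather than
a conversion to every other platform's native form.  A sketch of the
integer case using Python's struct module (XDR itself is a C library; this
only mimics its 32-bit integer encoding):

```python
import struct

def xdr_encode_int(value):
    """Encode a signed 32-bit integer in XDR's big-endian layout."""
    return struct.pack(">i", value)

def xdr_decode_int(data):
    """Decode a 4-byte XDR integer back to a native Python int."""
    return struct.unpack(">i", data)[0]
```

With N platforms, the native-plus-converters approach needs on the order of
N*N conversion routines, while the XDR approach needs 2*N.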
>
>As stated above, we think it is very important to preserve direct access
>to netCDF data, and to keep hyperslab access efficient.
>

For the time being, we have decided to place all of a record variable's
data into a single VData.  In doing so, we have retained fast hyperslab
access (in fact it is even faster, because all of a variable's data is
contiguous).  As a side note, VDatas are able to efficiently store and
retrieve 8-bit data.  It is not yet clear whether people will require the
flexibility of storing data in separate objects.  If it does seem that
users wish to be able to store data in a distributed fashion, we will add
that capability later.  Rather than using a "threshold" as outlined in the
draft you received, we are now leaning towards providing a reserved
attribute that the user can set to indicate whether they require all of
the data to be in a single VData or in multiple ones.

The problem with representing this information at the level of an attribute
is how to differentiate between "user" and "system" attributes.  For
instance, if someone writes out some data, goes into redef() mode, changes
the "contiguousness" / packing / fill-values, and tries to write more data,
things are going to be all messed up.  Are there plans to logically
separate the two types of attributes (i.e. define_sys_attr() and
define_user_attr())?  Or is the distinction just based on syntactic
convention (i.e. names with leading underscores...)?  What happens when the
user wants a mutable attribute whose name has a leading underscore?

>Page 8, paragraph 6:
>
>    Furthermore, Unidata is in the process of adding operators to netCDF,
>    which may be lost by adopting SILO as a front-end.
>
>The netCDF operators do not currently involve any extensions to the netCDF
>library; they are written entirely on top of the current library
>interface.  It is possible that we will want to add an additional library
>function later to provide more efficient support for some of the netCDF
>operators (e.g.
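Fast hyperslab access in a single contiguous VData comes down to address
arithmetic: in a row-major layout the position of any element follows
directly from the variable's shape, so a slab can be located with a seek
rather than a search through VGroup children.  A minimal sketch of that
arithmetic (the function name is mine, not from either library):

```python
def element_offset(index, shape):
    """Row-major offset (in elements) of a multidimensional index,
    computed by Horner's rule over the dimension sizes."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset
```

Multiplying the result by the element size gives the byte position within
the variable's contiguous block.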
>ncvarcpy(), which would copy a variable from one netCDF file to another
>without going through the XDR layer).  We agree with your decision to use
>the Unidata netCDF library rather than SILO as the "front-end".

Because SILO was developed at Lawrence Livermore, it will be impossible to
use the existing SILO code in any public domain software.  We are currently
investigating whether we will even be able to use the *ideas* developed
within the Lab in the public domain.  We plan to release a description of
the SILO interface over the netCDF/HDF mailing list in the near future to
see if anyone has different suggestions about how to model mesh data within
the context of netCDF.

>
>We have set up a mailing list here for Unidata staff who are interested in
>the netCDF/HDF project: netcdf-hdf@xxxxxxxxxxxxxxxxx  Feel free to send
>additional responses or draft documents to that address or to individual
>Unidata staff members.
>
>----
>Russ Rew                                    russ@xxxxxxxxxxxxxxxx
>Unidata Program Center
>University Corporation for Atmospheric Research
>P.O. Box 3000
>Boulder, Colorado 80307-3000