In a previous post, I described the internals of an HDF5 file that uses netCDF4 shared dimensions. While that description remains valid, I've since discovered some holes in my implementation, and I have some new thoughts on how shared dimensions could be done in a simpler way. Please turn off your cell phones.
My current algorithm for finding shared dimensions goes like this:
- Find all variables (datasets in HDF5 parlance) with attribute CLASS = "DIMENSION_SCALE". These are the dimension scales, and they correspond 1-1 with netCDF4 dimensions. So for each dimension scale, make a dimension using the variable's name as the dimension name, and using the variable's shape(0) as dimension length.
- Find all variables with a "DIMENSION_LIST" attribute. These are the data variables, and the DIMENSION_LIST contains a list of references to the dimension scales, and hence to the shared dimensions, used by that data variable.
- When a dimension scale is also a coordinate variable, some special processing has to happen, because the "DIMENSION_LIST" is not present. There are two cases:
  - The dimension scale has rank 1 (is 1 dimensional). This is the easy and common case, since it means that it has one dimension with the same name as itself.
  - The dimension scale has rank 2. These are coordinate variables of type char. The first dimension has the same name as the variable itself, but the second dimension is tricky to find. We know its length, so the algorithm I'm using is to look through the dimensions and match on length. If the match is unique, use that dimension. If it is not unique, create an anonymous dimension. I may modify this later to find the real shared dimension.
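The steps above can be sketched in a few lines of Python. This is a simplified model, not the actual library code: `Var`, `find_dimensions`, and `resolve_dimensions` are hypothetical names, HDF5 datasets are modelled as plain objects, DIMENSION_LIST is modelled as a list of names rather than object references, and every dimension scale without a DIMENSION_LIST is assumed to be a coordinate variable.

```python
from dataclasses import dataclass, field


@dataclass
class Var:
    """Hypothetical stand-in for an HDF5 dataset: name, shape, attributes."""
    name: str
    shape: tuple
    attrs: dict = field(default_factory=dict)


def is_dimension_scale(var):
    return var.attrs.get("CLASS") == "DIMENSION_SCALE"


def find_dimensions(variables):
    """Each dimension scale yields a dimension named after it, of length shape[0]."""
    return {v.name: v.shape[0] for v in variables if is_dimension_scale(v)}


def resolve_dimensions(var, dims):
    """Return [(dimension name, length), ...]; a name of None means anonymous."""
    if "DIMENSION_LIST" in var.attrs:
        # Data variable. Assumption: the list holds names; real files hold references.
        return [(name, dims[name]) for name in var.attrs["DIMENSION_LIST"]]
    if not is_dimension_scale(var):
        raise ValueError(f"{var.name} has no DIMENSION_LIST and is not a dimension scale")
    if len(var.shape) == 1:
        # Rank 1 coordinate variable: its single dimension has its own name.
        return [(var.name, var.shape[0])]
    if len(var.shape) == 2:
        # Rank 2 (char) coordinate variable: match the second dimension on length.
        length = var.shape[1]
        matches = [name for name, n in dims.items() if n == length]
        second = matches[0] if len(matches) == 1 else None  # not unique: anonymous
        return [(var.name, var.shape[0]), (second, length)]
    raise ValueError(f"unexpected rank {len(var.shape)} for {var.name}")


# Made-up example file contents.
variables = [
    Var("time", (4,), {"CLASS": "DIMENSION_SCALE"}),
    Var("station", (3, 8), {"CLASS": "DIMENSION_SCALE"}),
    Var("name_len", (8,), {"CLASS": "DIMENSION_SCALE"}),
    Var("temp", (4, 3), {"DIMENSION_LIST": ["time", "station"]}),
]
dims = find_dimensions(variables)
station_dims = resolve_dimensions(variables[1], dims)
temp_dims = resolve_dimensions(variables[3], dims)
```

In this made-up file the second dimension of `station` resolves to `name_len` because only one dimension has length 8. Add a second dimension of length 8 and the match is no longer unique, so the sketch falls back to an anonymous dimension.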
If you actually read that last part, you may note that the complication is mostly with finding the second dimension of a 2D dimension scale. There are a lot of ironies with this. The first is that this case probably should just be handled by creating a separate 1D dimension scale that is not also a variable, and a data variable that is not a dimension. The second is that the 2D char coordinate variables are usually really string valued coordinates, and if there were string types in the classic model, then we wouldn't need the second troublesome dimension. The third irony is that this second dimension never needs to be shared and should be anonymous, but netCDF4 does not currently have anonymous dimensions.
Another wrinkle is that the netCDF4 library uses "dimension ids" to identify the dimensions used, using the _Netcdf4Dimid and/or _Netcdf4Coordinates internal attributes. However, AFAICT the ids are not well defined, in the sense that they are not stored in the file format, but are indexes into a list of dimensions that relies on some (apparently undocumented) ordering that the HDF5 C library imposes. So unless I am misunderstanding this, I can't use those ids in a pure Java library that doesn't have access to the HDF5 C library. If I am correct, then this is an example where the file format and the reference library have gotten confused, a mistake we library writers often make.
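To make the concern concrete, here is a toy illustration of why an id that is only a list index depends on how the list is ordered. It is a hypothetical model of the worry, not a description of how the netCDF4 C library actually assigns or stores ids; `resolve_by_index` and the dimension names are made up.

```python
def resolve_by_index(dim_ids, enumerated_dims):
    """Interpret each id as an index into the list of dimensions as enumerated."""
    return [enumerated_dims[i] for i in dim_ids]


# Ids attached to some data variable (made-up values).
dim_ids = [0, 1]

# The same three dimensions, enumerated in two different orders.
creation_order = ["time", "station", "name_len"]
alphabetical_order = sorted(creation_order)

by_creation = resolve_by_index(dim_ids, creation_order)
by_alphabet = resolve_by_index(dim_ids, alphabetical_order)
```

The same ids name different dimensions under the two enumerations, so a reader that cannot reproduce the writer's ordering cannot trust the ids.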
After working with this issue again, I realized that in principle an easier thing to do is to just put a DIMENSION_LIST on any variable that is supposed to be a data variable. For completeness, here is a proposal to implement netCDF4 shared dimensions with HDF5 dimension scales:
- an object may have one or both "DIMENSION_SCALE" and "DIMENSION_LIST" attributes.
- an object that has the "DIMENSION_SCALE" attribute defines a dimension of length shape(0).
- an object that has the "DIMENSION_LIST" attribute defines a data variable.
- a dimension must exist in the same group or a parent group of a variable that uses it.
I think this would cover the matter, and is significantly simpler than what is implemented now.
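The proposed rules can be sketched as follows. This is a minimal model under stated assumptions: `Obj`, `Group`, `find_dimension`, and `resolve` are hypothetical names, "DIMENSION_SCALE" is modelled as a simple marker attribute, and "DIMENSION_LIST" is modelled as a list of dimension names rather than HDF5 object references.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Obj:
    """Hypothetical stand-in for an HDF5 dataset."""
    name: str
    shape: tuple
    attrs: dict = field(default_factory=dict)


@dataclass
class Group:
    """Hypothetical stand-in for an HDF5 group with a link to its parent."""
    name: str
    parent: Optional["Group"] = None
    objects: list = field(default_factory=list)

    def dimensions(self):
        # Rule: an object with DIMENSION_SCALE defines a dimension of length shape[0].
        return {o.name: o.shape[0] for o in self.objects if "DIMENSION_SCALE" in o.attrs}

    def data_variables(self):
        # Rule: an object with DIMENSION_LIST defines a data variable.
        return [o for o in self.objects if "DIMENSION_LIST" in o.attrs]


def find_dimension(group, name):
    """Rule: a dimension must exist in the variable's group or in a parent group."""
    g = group
    while g is not None:
        dims = g.dimensions()
        if name in dims:
            return name, dims[name]
        g = g.parent
    raise LookupError(f"dimension {name!r} not found in {group.name!r} or its parents")


def resolve(group, var):
    return [find_dimension(group, name) for name in var.attrs["DIMENSION_LIST"]]


# Made-up example: 'station' carries both attributes, so it is a dimension
# and a data variable at once, and its second dimension is stated explicitly.
root = Group("/")
root.objects = [
    Obj("time", (4,), {"DIMENSION_SCALE": True}),
    Obj("name_len", (8,), {"DIMENSION_SCALE": True}),
    Obj("station", (3, 8), {"DIMENSION_SCALE": True,
                            "DIMENSION_LIST": ["station", "name_len"]}),
]
obs = Group("obs", parent=root)
temp = Obj("temp", (4, 3), {"DIMENSION_LIST": ["time", "station"]})
obs.objects = [temp]

temp_dims = resolve(obs, temp)
station_dims = resolve(root, root.objects[2])
```

Because the 2D `station` object lists its own dimensions, nothing has to be guessed by matching on length; the lookup only needs to walk up the group hierarchy.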
However, this proposal doesn't try to capture the creation order of the dimensions, which is one of the purposes of the current _Netcdf4Dimid and _Netcdf4Coordinates internal attributes. The netCDF-Java library currently ignores creation order, while the netCDF C library preserves it. We are debating whether that's an acceptable state of affairs.