NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
Hi Russ,

> I'd like to reconsider the Unicode issue, and specifically ask about
> the feasibility of what we hope is a small addition to HDF5 to allow
> netCDF to support UTF-8 encoded names for variables, dimensions, and
> attributes without HDF5 having to support such encoded names.
>
> We would like to just declare in netCDF documentation that the names
> for netCDF variables, dimensions, and attributes are UTF-8 encoded
> when provided to or returned from netCDF interfaces. This is
> backwards compatible, because we currently only support ASCII strings
> (with some restrictions), and what we're proposing would just remove
> the restrictions and allow non-ASCII bytes (with the upper bit set),
> to allow for UTF-8 encoding of other Unicode characters.
>
> What we would need from HDF5 is a way to request that names for
> Datasets and Attributes allow an arbitrary byte array, so we can use
> UTF-8 encoding for non-ASCII characters.
>
> Is this feasible?

After rooting through the group API as much as I have recently, I think
it's probably quite feasible for the names of objects & attributes to use
UTF-8 encoding for their strings. There are only two hangups I can see:

- The names will be sorted in byte-value order, since there's no
  locale information embedded in the file, which may disconcert
  international users.
- The strings are nul-terminated and I'm not certain if part of a
  UTF-8 string can be nul.

I'll write some tests that check for proper insertion of non-ASCII
strings as object & attribute names and let you know what I find out.

Note that Unicode strings as elements of a dataset are harder and
probably won't work correctly currently.

Quincey

> Otherwise there are no library changes in netCDF that we would need to
> support UTF-8 encoding for Unicode names. Some applications such as
> ncdump and ncgen will have to know how to handle encoded names, but we
> are willing to deal with that.
>
> Note that we're not requesting that you drop restrictions on all
> names, just that you provide a way for netCDF-4 to be able to use
> names with non-ASCII bytes, for example a call to a function that says
> checking on new names will subsequently be lenient (e.g. you could still
> disallow empty names, names with embedded null characters, or names
> that are too long). Existing code that didn't invoke this call would
> still have to abide by the current name restrictions.
>
> Also I notice that the documentation for H5Acreate and H5Dcreate at
>
>   http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5A.html#Annot-Create
>   http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5D.html#Dataset-Create
>
> currently list no restrictions on names to use only ASCII characters,
> but the Introduction to HDF5 says
>
>   A dataset name is a sequence of alphanumeric ASCII characters.
>
> --Russ
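
On the nul-termination question above: in UTF-8 the byte 0x00 encodes only the character U+0000, and every byte of a multi-byte sequence has its upper bit set, so a name that contains no NUL character cannot produce an embedded nul byte. The Python sketch below (not HDF5 or netCDF code) illustrates that, shows what byte-value ordering does to non-ASCII names, and includes a lenient name check along the lines Russ describes. The function names and the 256-byte length limit are assumptions made up for illustration.

```python
# Illustrative sketch only: none of these helpers exist in HDF5 or netCDF.

MAX_NAME_BYTES = 256  # hypothetical limit, chosen only for this example


def utf8_has_nul(name: str) -> bool:
    """Return True if the UTF-8 encoding of `name` contains a 0x00 byte."""
    return b"\x00" in name.encode("utf-8")


def byte_order_sort(names):
    """Sort names by the raw byte values of their UTF-8 encodings,
    i.e. without any locale-aware collation."""
    return sorted(names, key=lambda s: s.encode("utf-8"))


def is_acceptable_name(name: str) -> bool:
    """Lenient check: allow non-ASCII bytes, but still reject empty names,
    names with an embedded nul byte, and names that are too long."""
    encoded = name.encode("utf-8")
    return 0 < len(encoded) <= MAX_NAME_BYTES and b"\x00" not in encoded


if __name__ == "__main__":
    names = ["zeta", "température", "Älvdalen", "alpha", "温度"]

    # No nul byte appears unless the string itself contains U+0000.
    print([utf8_has_nul(n) for n in names])
    print(utf8_has_nul("a\x00b"))

    # Byte-value order puts every non-ASCII initial after all ASCII ones,
    # so "Älvdalen" lands after "zeta" rather than near "alpha".
    print(byte_order_sort(names))

    print([is_acceptable_name(n) for n in names + ["", "a\x00b"]])
```

Byte-value ordering of UTF-8 strings is the same as ordering by Unicode code point, which is predictable but differs from what a locale-aware collation would give.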