Hi Jed,
> Is there any way - currently or planned for the future - to specify
> netCDF4-specific variable definitions (e.g., deflate) in a CDL file?
There is currently no way to do this, but it's in our plans (see below).
> As I often use ncgen on cdl files to create new netCDF files, this
> capability would be highly desirable.
>
> As a workaround, I did try creating an empty netCDF-4 file with ncgen
> -v4 -b ... and then writing a little f90 program to run nf90_redef()
> and then set the "deflate" flags for each variable. But the library
> returned an error, saying this action is not possible after the
> variables are defined. According to the documentation
>
> "Once enddef has been called, it is impossible to set the deflate for
> a variable."
>
> But I might imagine that one could call redef to change these flags?
> Unfortunately, this did not work for me.
No, it's actually not possible to change the compression (or chunking)
of a netCDF variable after it has been created, so doing a redef won't
work.
But this is a timely question, as ncdump is being reimplemented for
netCDF-4.
There are several "performance characteristics" of netCDF data that are
currently not represented in ncdump output (or in NcML):
- format variant (classic, 64-bit-offset, netcdf-4, netcdf-4-classic)
- netCDF-4 variable compression
- netCDF-4 variable chunking parameters
- netCDF-4 endianness
At issue are two different views of the purpose of CDL/NcML:

1. CDL and NcML are abstract textual representations of metadata and
   data, without details of optimizations for performance. This makes
   it easy to compare ncdump output of two files that use different
   performance-related format, compression, chunking, or endianness
   settings.

   The current implementation of ncdump follows this philosophy,
   creating CDL/NcML with no information about the file format
   variant of the input, although that variant can be determined
   separately with the "-k" (kind) option.

2. CDL and NcML are a completely faithful textual representation of
   data with all the details needed to generate performance-tuned
   binary data via a program such as ncgen, permitting ncdump and
   ncgen to be true inverses.

One way to implement the second philosophy is to optionally include in
ncdump output extra syntax to specify performance characteristics. We
have plans for this, but are still discussing how to do it.
One approach would simply add new syntax to CDL to represent
performance-related characteristics. For example, performance
characteristic specifications could be included after a variable
definition in parentheses:

    float relhum(time, level, lat, lon) (Compression: deflate=5) ;

It would then be the job of ncgen to parse the new syntax and make the
appropriate API calls.
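
To make the parsing half of that job concrete, here is a minimal Python
sketch that recognizes the parenthesized clause in the example line
above. The clause syntax is only the proposal from this message, not
anything ncgen accepts today, and the function name parse_decl, the
returned dictionary layout, and the assumption that settings are
integers are all made up for illustration.

```python
import re

# Hypothetical syntax, taken from the example line above: an optional
# parenthesized "Key: name=value" clause between the dimension list
# and the terminating ";". This is NOT syntax that ncgen supports.
DECL = re.compile(
    r"^\s*(?P<type>\w+)\s+(?P<name>\w+)\s*"
    r"\((?P<dims>[^)]*)\)\s*"
    r"(?:\((?P<perf>[^)]*)\))?\s*;\s*$"
)


def parse_decl(line):
    """Split one CDL-like variable declaration into its parts."""
    m = DECL.match(line)
    if m is None:
        raise ValueError(f"not a variable declaration: {line!r}")
    dims = [d.strip() for d in m.group("dims").split(",") if d.strip()]
    perf = {}
    if m.group("perf"):
        key, _, settings = m.group("perf").partition(":")
        opts = {}
        for item in settings.split(","):
            if not item.strip():
                continue
            k, _, v = item.partition("=")
            opts[k.strip()] = int(v)  # assumption: settings are integers
        perf[key.strip()] = opts
    return {
        "type": m.group("type"),
        "name": m.group("name"),
        "dims": dims,
        "perf": perf,
    }


decl = parse_decl(
    "float relhum(time, level, lat, lon) (Compression: deflate=5) ;"
)
print(decl["name"], decl["perf"])
```

A real ncgen would then translate the parsed clause into the
corresponding API calls at variable definition time; that step is not
shown here.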
A second approach would require ncdump and similar utilities to generate
synthetic attributes that don't really exist in the data file but that
contain information about format, chunking, compression, and
endianness. These synthetic attributes would represent
performance-related properties of the data that ncgen could use in
generating binary files from CDL/NcML data. File-level, group-level,
or variable-level attributes with names "_Compression", "_Chunking",
and "_Endianness" could be used for these attributes, with
variable-level attributes overriding group-level attributes, which in
turn could override file-level specifications. Although the ncgen
utility would respect these special attributes, they would not
actually be stored in the file, since that information is already
available through the API.
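
The override order described above could be resolved roughly as in the
following Python sketch. It only illustrates the lookup rule (variable
level first, then group level, then file level); the function name
resolve, the use of plain dictionaries, and the attribute value formats
such as "deflate=5" are assumptions, since none of that has been
decided.

```python
def resolve(name, file_atts, group_atts, var_atts):
    """Return the effective setting for a synthetic attribute.

    Variable-level values override group-level values, which in turn
    override file-level values. Returns None if no level sets it.
    """
    for scope in (var_atts, group_atts, file_atts):
        if name in scope:
            return scope[name]
    return None


# Hypothetical attribute values; the real value syntax is undecided.
file_atts = {"_Compression": "deflate=1", "_Endianness": "little"}
group_atts = {"_Compression": "deflate=5"}
var_atts = {"_Chunking": (1, 10, 90, 180)}

for att in ("_Compression", "_Chunking", "_Endianness"):
    print(att, resolve(att, file_atts, group_atts, var_atts))
```

Here the group-level "_Compression" wins over the file-level one, while
"_Endianness" falls through to the file-level value because neither the
variable nor the group sets it.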
A third approach would implement these attributes under the C and Java
APIs, so that whenever a variable is represented as compressed, the
API would behave as if performance-related attributes had been defined
even though such attributes do not actually exist in the file. Users
could specify performance characteristics by defining such attributes
instead of through existing API calls, but such attributes could only
be defined at certain times. For example, it's not possible to change
the file format of a file through an API call, so adding a new
"_Format" attribute to an existing file that used a different format
would result in an error. Similarly, performance characteristics of
variables fixed at variable definition time could not later be altered
by adding a variable attribute. But ncgen could use such attributes
to allow CDL to specify performance characteristics in creating new
files.
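
As a toy model of the "only at certain times" rule in this third
approach, the Python sketch below accepts performance attributes when a
variable is defined and rejects later attempts to add or change them.
The ToyVariable class and its put_att method are invented for
illustration and are not part of the netCDF C or Java APIs; the set of
restricted names is likewise an assumption.

```python
class ToyVariable:
    """Toy stand-in for a netCDF variable; not the real netCDF API."""

    # Assumed names of attributes that are fixed at definition time.
    FIXED_AT_DEFINITION = {"_Compression", "_Chunking", "_Endianness"}

    def __init__(self, name, atts=None):
        self.name = name
        # Attributes supplied here count as set at definition time.
        self.atts = dict(atts or {})

    def put_att(self, key, value):
        """Add or change an attribute after the variable is defined."""
        if key in self.FIXED_AT_DEFINITION:
            raise ValueError(
                f"{key} can only be set when {self.name} is defined"
            )
        self.atts[key] = value


relhum = ToyVariable("relhum", {"_Compression": "deflate=5"})
relhum.put_att("units", "percent")  # ordinary attribute: allowed
try:
    relhum.put_att("_Compression", "deflate=9")  # too late: rejected
except ValueError as err:
    print("error:", err)
```

Ordinary attributes such as "units" can still be added at any time; only
the performance-related ones are frozen once the variable exists.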
If you have reasons why you think we should reject or favor one of
these approaches, please let us know soon. Thanks!
--Russ