NCZarr Support for Zarr Filters

[Note: See github issue 2006 for additional comments.]

To date, filters in the netcdf-c library have referred to HDF5-style filters. This style of filter is represented in a netcdf-c/HDF5 file by the following information:

  1. An unsigned integer, the "id", and
  2. A vector of unsigned integers that encode the "parameters" for controlling the behavior of the filter.

The "id" is a unique number assigned to the filter by the HDF Group filter authority. It identifies a specific filter algorithm. The "parameters" of the filter are not defined explicitly but only by the implementation of the filter.
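
For example, a filter declared in HDF5 style might look like the following C fragment. This is a sketch only: the id value 307 follows the HDF Group registry entry for bzip2, and the single compression-level parameter is illustrative.

unsigned int filter_id = 307;          /* "id" assigned by the HDF Group registry (bzip2, illustrative) */
unsigned int filter_params[1] = { 9 }; /* parameter vector; here a single compression level */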

The inclusion of Zarr support in the netcdf-c library (called NCZarr) creates the need to provide a new representation consistent with the way that Zarr files store filter information. For Zarr, filters are represented using the JSON notation. Each filter is defined by a JSON dictionary, and each such filter dictionary is guaranteed to have a key named "id" whose value is a unique string defining the filter algorithm: "lz4" or "bzip2", for example.

The parameters of the filter are defined by additional -- algorithm-specific -- keys in the filter dictionary. One commonly used filter is "blosc", which has a JSON dictionary of this form:

{
    "id": "blosc",
    "cname": "lz4",
    "clevel": 5,
    "shuffle": 1
}

So in HDF5 terms, it has three parameters:

  1. "cname" -- the sub-algorithm used by the blosc compressor, LZ4 in this case.
  2. "clevel" -- the compression level, 5 in this case.
  3. "shuffle" -- whether the input is shuffled before compression; yes (1) in this case.

NCZarr (netcdf Zarr) is required to store its filter information in its metadata in the above JSON dictionary format. Simultaneously, NCZarr expects to use many of the existing HDF5 filter implementations. This means that some mechanism is needed to translate between the HDF5 id+parameter model and the Zarr JSON dictionary model.
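
As an illustration of the translation problem, the blosc JSON example above might correspond to an HDF5-style parameter vector roughly like the following. This is a sketch only; the registered HDF5 blosc filter defines its own (longer) parameter layout, so the ordering and code values shown here are assumptions.

unsigned int blosc_id = 32001;     /* HDF Group registry id for blosc */
unsigned int blosc_params[3] = {
    5,   /* "clevel": compression level */
    1,   /* "shuffle": shuffle enabled */
    1    /* "cname": sub-compressor code; 1 == lz4 is an assumed mapping */
};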

In effect, the standardization authority for defining Zarr filters is the set of codecs supported by the NumCodecs project. Comparing the set of standard filters (aka codecs) defined by NumCodecs to the set of standard filters defined by HDF5, it can be seen that the two sets overlap, but each has filters not defined by the other.

Note also that it is undesirable for a specific set of filters/codecs to be built into the NCZarr implementation. Rather, it is preferable for there to be some extensible way to associate the JSON with the code implementing the codec. This mirrors the plugin model used by HDF5.

Currently, each HDF5 filter is implemented by a shared library that has certain well-defined entry points that allow the netcdf/HDF5 libraries to determine information about the filter, notably its id. In order to use the codec JSON format, these entry points must be extended in some way to obtain the corresponding defining JSON. But there is another desirable constraint. It should be possible to associate an existing HDF5 filter -- one without codec JSON information -- with the corresponding codec JSON. This association needs to be implemented by some mechanism external to the HDF5 filter.

Pre-Processing Filter Libraries

The process for using filters for NCZarr is defined to operate in several steps. First, as with HDF5, all shared libraries in a specified directory (HDF5_PLUGIN_PATH) are scanned. They are interrogated to see what kind of library they implement, if any. This interrogation operates by seeing if certain well-known (function) names are defined in this library.

There are two library types:

  1. HDF5 -- exports a specific API: "H5Z_plugin_type" and "H5Z_get_plugin_info".
  2. Codec -- exports a specific API: "NCZ_codec_type" and "NCZ_get_codec_info".

Note that a given library can export either or both of these APIs. This means that we can have three types of libraries:

  1. HDF5 only
  2. Codec only
  3. HDF5 + Codec

Suppose that our HDF5_PLUGIN_PATH location has an HDF5-only library. Then by adding a corresponding, separate, Codec-only library to that same location, it is possible to make the HDF5-only library usable by NCZarr without having to modify it. Over time, any given HDF5-only library can be merged with its Codec-only counterpart to produce a single, combined library.
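
As a rough sketch, a Codec-only library might consist of little more than the following, assuming the NCZ_codec_t struct from Appendix A. The header name, the exact signature of NCZ_codec_type, and the HDF5 id value are assumptions made for illustration.

#include "NCZ_codec.h"   /* hypothetical header declaring NCZ_codec_t and NCZ_CODEC_HDF5 */

static int fake_codec_to_hdf5(const char* codec, int* nparamsp, unsigned** paramsp)
{ /* ... build the HDF5 parameter vector from the codec JSON ... */ return 0; }

static int fake_hdf5_to_codec(int nparams, unsigned* params, char** codecp)
{ /* ... build the codec JSON from the HDF5 parameter vector ... */ return 0; }

static const NCZ_codec_t codec_info = {
    1,                   /* version of this struct */
    NCZ_CODEC_HDF5,      /* sort */
    "fake",              /* codec "id" string */
    32768,               /* corresponding HDF5 filter id (illustrative) */
    fake_codec_to_hdf5,
    fake_hdf5_to_codec
};

/* Well-known entry points interrogated by the netcdf-c library */
int NCZ_codec_type(void) { return NCZ_CODEC_HDF5; }
const NCZ_codec_t* NCZ_get_codec_info(void) { return &codec_info; }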

Using Plugin Libraries

The approach used by NCZarr is to have the netcdf-c library process all of the libraries by interrogating each one for the well-known APIs and recording the result. Any library that exports neither of the well-known APIs is ignored.
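
A rough sketch of this interrogation step, assuming a POSIX dlopen/dlsym based loader and using the well-known names described earlier (the real loader may differ in detail):

#include <dlfcn.h>
#include <stdio.h>

/* Probe one shared library for the two well-known APIs and record the result. */
static void probe_library(const char* path)
{
    void* lib = dlopen(path, RTLD_LAZY);
    if (lib == NULL) return;                                  /* not a loadable library */
    int has_hdf5  = (dlsym(lib, "H5Z_get_plugin_info") != NULL);
    int has_codec = (dlsym(lib, "NCZ_get_codec_info") != NULL);
    printf("%s: hdf5=%d codec=%d\n", path, has_hdf5, has_codec);
    /* the library handle would be kept open and (path, has_hdf5, has_codec) recorded for pairing */
}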

Internally, the netcdf-c library pairs up each HDF5 library API with a corresponding Codec API by invoking the relevant well-known functions (see Appendix A). The result is the following table of actions for the associated Codec and HDF5 libraries.

HDF5 API      Codec API      Action
Not defined   Not defined    Ignore
Defined       Not defined    Ignore
Defined       Defined        NCZarr usable

Using the Codec API

Given a set of filters for which the HDF5 API and the Codec API are defined, it is then possible to use the APIs to invoke the filters and to process the meta-data in Codec JSON format.

Writing an NCZarr Container

When writing, the user program invokes the NetCDF API function nc_def_var_filter. This function is currently defined to operate using HDF5-style id and parameters (unsigned ints). The netcdf-c library examines its list of known filters to find one matching the HDF5 id provided by nc_def_var_filter. The set of parameters provided is stored internally. Then during writing of data, the corresponding HDF5 filter is invoked to encode the data.
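
A minimal sketch of the write-side call, assuming a netCDF-4 or NCZarr file is already open and the variable already defined. The filter id 307 and its single level parameter follow the bzip2 illustration used earlier.

#include <netcdf.h>
#include <netcdf_filter.h>

/* Attach an HDF5-style filter (id + unsigned-int parameters) to a variable. */
static int set_bzip2(int ncid, int varid)
{
    const unsigned int params[1] = { 9 };   /* compression level (illustrative) */
    return nc_def_var_filter(ncid, varid, 307U, 1, params);
}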

When it comes time to write out the meta-data, the stored HDF5-style parameters are passed to a specific Codec function to obtain the corresponding JSON representation. Again see Appendix A. This resulting JSON is then written in the NCZarr metadata.

Reading an NCZarr Container

When reading, the netcdf-c library reads the metadata for a given variable and sees that a set of filters is applied to this variable. The metadata is encoded as Codec-style JSON.

Each Codec JSON object is parsed into a JSON dictionary containing the "id" string plus the algorithm-specific parameter keys. The netcdf-c library examines its list of known filters to find one matching the Codec "id" string. The JSON is then passed to a Codec function to obtain the corresponding HDF5-style unsigned int parameter vector. These parameters are stored for later use.
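
Schematically, the read-side conversion might look like the following sketch, which assumes the NCZ_codec_t struct from Appendix A; the lookup helper find_codec_by_id and the hard-coded "zlib" id are hypothetical, since in practice the id would be extracted from the parsed JSON.

/* Sketch: turn one codec JSON object back into HDF5-style parameters. */
static void load_filter_from_metadata(const char* codec_json)
{
    const NCZ_codec_t* c = find_codec_by_id("zlib");   /* hypothetical lookup by codec "id" */
    int nparams = 0;
    unsigned* params = NULL;
    if (c != NULL && c->NCZ_codec_to_hdf5(codec_json, &nparams, &params) == 0) {
        /* store (c->hdf5id, nparams, params) for use when reading chunk data */
    }
}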

When it comes time to read the data, the stored HDF5-style filter is invoked with the parameters already produced and stored.

Supporting Filter Chains

HDF5 supports filter chains: a filter chain is a sequence of filters in which the output of one filter is provided as input to the next filter in the sequence. When encoding, the filters are executed in the "forward" direction; when decoding, they are executed in the "reverse" direction.

In the Zarr meta-data, a filter chain is divided into two parts: the "compressor" and the "filters". The former is a single JSON codec as described above; the latter is an ordered JSON array of codecs. So if the compressor is "compressor": {"id": "c"...} and the filters array is "filters": [ {"id": "f1"...}, {"id": "f2"...}, ... {"id": "fn"...} ], then the filter chain is (f1, f2, ..., fn, c), with f1 applied first and c applied last when encoding. On decode, the filter chain is executed in the order (c, fn, ..., f2, f1).
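
For instance, a hypothetical fragment of the Zarr array metadata combining one filter with a blosc compressor might look like this (codec names and keys are illustrative, loosely following NumCodecs conventions):

"filters": [ { "id": "shuffle", "elementsize": 4 } ],
"compressor": { "id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1 }

Here the shuffle filter would be applied first when encoding and the blosc compressor last, matching the (f1, ..., fn, c) ordering described above.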

So, an HDF5 filter chain is divided into two parts, where the last filter in the chain is assigned as the "compressor" and the remaining filters are assigned as the "filters". But independent of this, each codec, whether a compressor or a filter, is stored in the JSON dictionary form described earlier.

Extensions

The Codec style, using JSON, has the ability to provide very complex parameters that may be hard to encode as a vector of unsigned integers. It might be desirable to export a JSON-based API from the netcdf-c library to give users access to this expressiveness. This would mean providing some alternate version of "nc_def_var_filter" that takes a string-valued argument instead of a vector of unsigned ints.
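
A hypothetical prototype for such a function might look like the following; the name and signature are purely speculative and not part of the current netcdf-c API.

/* Hypothetical alternate filter-definition function taking codec JSON directly. */
int nc_def_var_filter_json(int ncid, int varid, const char* codec_json);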

One bad side-effect of this is that we may then have two classes of plugins: one class that can be used by both HDF5 and NCZarr, and a second class that is usable only with NCZarr.

This extension is unlikely to be implemented until a compelling use-case is encountered.

Summary

This document outlines the proposed process by which NCZarr utilizes existing HDF5 filters. At the same time, it describes the mechanisms to support storing filter metadata in the NCZarr container using the Zarr compliant Codec style representation of filters and their parameters.

Appendix A. Codec API

The Codec API mirrors the HDF5 API closely. It has one well-known function that can be invoked to obtain information about the Codec as well as pointers to special functions to perform conversions.

Note that this Appendix is only an initial proposal and is subject to change.

NCZ_get_codec_info

This function returns a pointer to a C struct that provides detailed information about the codec converter.

Signature

const NCZ_codec_t* NCZ_get_codec_info(void);

NCZ_codec_t

typedef struct NCZ_codec_t {
    int version; /* Version number of the struct */
    int sort; /* Format of remainder of the struct;
                 Currently always NCZ_CODEC_HDF5 */
    const char* id;            /* The name/id of the codec */
    const unsigned int hdf5id; /* corresponding hdf5 id */
    int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);
    int (*NCZ_hdf5_to_codec)(int nparams, unsigned* params, char** codecp);
} NCZ_codec_t;

The key to this struct is the two function pointers that do the conversion between codec JSON and HDF5 parameters.
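
As a concrete but hedged sketch, a converter pair for a simple one-parameter codec such as zlib/deflate might look roughly like the following. The JSON handling is deliberately naive, and the convention that the HDF5 deflate filter takes a single level parameter is assumed here for illustration; a real implementation would use a proper JSON parser and netcdf-c error codes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: convert {"id": "zlib", "level": N} into a one-element HDF5 vector. */
static int demo_codec_to_hdf5(const char* codec, int* nparamsp, unsigned** paramsp)
{
    unsigned level = 0;
    const char* p = strstr(codec, "\"level\"");           /* naive JSON handling */
    if (p == NULL || sscanf(p, "\"level\" : %u", &level) != 1)
        return 1;                                         /* a netcdf error code in a real implementation */
    unsigned* params = (unsigned*)malloc(sizeof(unsigned));
    if (params == NULL)
        return 1;
    params[0] = level;
    *nparamsp = 1;
    *paramsp = params;                                    /* caller must free */
    return 0;
}

/* Sketch: convert a one-element HDF5 vector back into {"id": "zlib", "level": N}. */
static int demo_hdf5_to_codec(int nparams, unsigned* params, char** codecp)
{
    char buf[64];
    if (nparams != 1 || params == NULL || codecp == NULL)
        return 1;
    snprintf(buf, sizeof(buf), "{\"id\": \"zlib\", \"level\": %u}", params[0]);
    *codecp = strdup(buf);                                /* caller must free */
    return (*codecp == NULL) ? 1 : 0;
}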

NCZ_codec_to_hdf5

Given a JSON Codec representation, it returns a corresponding vector of unsigned integers for use with HDF5.

Signature

int (*NCZ_codec_to_hdf5)(const char* codec, int* nparamsp, unsigned** paramsp);

Arguments

  1. codec -- (in) ptr to JSON string representing the codec.
  2. nparamsp -- (out) store the length of the converted HDF5 unsigned vector.
  3. paramsp -- (out) store a pointer to the converted HDF5 unsigned vector; caller must free the returned vector. Note the double indirection.

Return Value: a netcdf-c error code.

NCZ_hdf5_to_codec

Given an HDF5 vector of unsigned integers and its length, return the corresponding JSON codec representation.

Signature

int (*NCZ_hdf5_to_codec)(int nparams, unsigned* params, char** codecp);

Arguments

  1. nparams -- (in) the length of the HDF5 unsigned vector.
  2. params -- (in) pointer to the HDF5 unsigned vector.
  3. codecp -- (out) store the string representation of the codec; caller must free.

Return Value: a netcdf-c error code.

Comments:

So is the intention to continue to use nc_def_var_filter() for the new string info as well? How does that work?

Something to bear in mind is that whatever solution is selected, it must be implemented in the Fortran APIs as well. So it should have no void pointers, unfortunately, which would otherwise be a tempting choice.

I could imagine a new function: nc_def_var_filter_json() which takes a path to a JSON file, which specifies compression parameters.

Perhaps also introduce a new def/inq pair for each new compression method, which takes the exact parameters that method needs, and then, behind the scenes, creates the JSON and calls nc_def_var_filter_json(). This is similar to what CCR does, and a simple extension of what we already have with nc_def_var_deflate() and nc_def_var_szip().

This new compression work is fantastic and I'm sure will bring a lot of science benefit to our
users. NOAA, NASA, ESA - they are drowning in data! And the problem is only getting worse each year...

Ed

Posted by Edward Hartnett on May 23, 2021 at 11:40 PM MDT #

Additional comments can be read/written at
https://github.com/Unidata/netcdf-c/issues/2006

Posted by Dennis Heimbigner on May 24, 2021 at 07:08 AM MDT #

Dennis,

Thank you for drafting a sensible proposal to leverage Zarr filters in NCZarr.

This is my first exposure to Zarr filters, though I have a good understanding of HDF5 filters. I do not understand from this material how Zarr or NCZarr handle the multiple filter dictionaries needed for filter-chaining. The real world example would be, say, a lossy codec followed by a lossless codec. The netCDF filter documentation explicitly states that filters are applied in first-defined order on writing, and reverse order on reading. I expect that NCZarr allows for multiple codec dictionaries in the metadata storage, and will apply filters in the order of their appearance in the metadata. Clarifying how Zarr and NCZarr support multiple filters for a variable, even if it is just a re-statement of the treatment of multiple HDF5 filters, would be helpful in making this proposal more airtight.

I like the HDF5/NCZarr symmetry in the exported symbols in the shared libraries. If this proposal is adopted then we would certainly try to provide the NCZarr symbols in the CCR filter shared libraries.

Charlie

Posted by Charlie Zender on May 24, 2021 at 11:02 AM MDT #
