NetCDF has historically offered two different storage formats for the netCDF data model: files based on the original netCDF binary format, and files based on the HDF5 format. While this has proven effective in the past for traditional disk storage, it is less efficient for modern cloud-focused technologies such as those provided by Amazon S3, Microsoft Azure, IBM Cloud Object Storage, and other cloud service providers. To that end, the Unidata development team is happy to announce that we are expanding the storage solutions available through the netCDF software libraries.
Selecting an Underlying Technology
As with the decision to base the netCDF Extended Data Model and File Format on the HDF5 technology, we do not want to reinvent the wheel when it comes to cloud storage. There are a number of existing technologies that the netCDF team can use to implement native object storage capabilities. We considered the following criteria as we evaluated various storage formats:
- Its data model must be compatible with existing netCDF data models (or very nearly so).
- It must have a well-defined API (programmer interface) that allows for easy subsetting, chunking, compression, etc.
- It must not require a back-end service (such as TDS), and should not incur costs beyond standard fees charged by cloud providers (typically for data egress).
- It must not be tied to any particular cloud provider.
- Existing software libraries supporting the technology are a plus, but are not necessary.
The most important question, however, was the following: What formats already have broad support from our community?
Based on this final question, it became clear to us that the Zarr object model and storage format was the right choice for netCDF's first foray into adding cloud-focused storage options.
Zarr
Zarr enjoys broad popularity within the Unidata community, particularly among our Python users. By integrating support for the latest Zarr specification (while not locking ourselves into a specific version), we should be able to provide the broadest support for data written by other software packages that use that specification. An additional benefit is the active and engaged Zarr developer community; through our interactions with them, we are confident that they will be able to help us tackle any issues we might encounter.
Current Status
Adding storage format functionality to the netCDF library is not a speculative task. Work has started on implementing Zarr compatibility within netCDF. Unidata developer Dennis Heimbigner has written a series of internal documents outlining the approach we will take, as well as a mapping of the netCDF data model to the Zarr data model. He has also published a blog post, motivated by the question "How does Zarr handle chunking?" Additional developer blog posts related to netCDF Zarr development will be forthcoming, and we hope to begin soliciting community feedback on beta releases within the next several months.
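As a rough illustration of why chunking matters for object storage, the sketch below computes which chunks a rectangular subset of an array would touch. It assumes the Zarr version 2 convention of naming each chunk by its grid indices joined with a "." separator, with each chunk stored as a separate object. The function name `chunk_keys` and its parameters are hypothetical, invented for this example; they are not part of the netCDF or Zarr APIs, and this is not how the netCDF implementation is written.

```python
from itertools import product


def chunk_keys(selection, chunk_shape, separator="."):
    """Return the chunk keys touched by a rectangular selection.

    selection   -- one (start, stop) pair per dimension, stop exclusive
    chunk_shape -- chunk length along each dimension
    separator   -- assumed Zarr v2 default of "." between grid indices
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # grid index of the first chunk touched
        last = (stop - 1) // size    # grid index of the last chunk touched
        per_dim.append(range(first, last + 1))
    # Every combination of per-dimension grid indices names one chunk object.
    return [separator.join(str(i) for i in idx) for idx in product(*per_dim)]


# Rows 0-149 and columns 250-299 of an array stored in 100 x 100 chunks.
keys = chunk_keys(selection=[(0, 150), (250, 300)], chunk_shape=(100, 100))
print(keys)  # ['0.2', '1.2']
```

In this example, a reader only needs to fetch two objects from the store, rather than the whole array, to satisfy the subset request.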
Future Work
Having laid the groundwork for adding Zarr functionality to the netCDF library, we are preparing to start the technical work. Our work and progress will be visible via the netCDF GitHub page, and we will continue to post updates as regularly as possible. Once we have Zarr functionality implemented, we will have a clearer idea of what it takes to add further storage formats to support the netCDF data models. We will then be ready, based on community feedback, to begin evaluating the other storage options available to us.
If you have questions or comments on Unidata's work to incorporate Zarr functionality into netCDF, contact Unidata netCDF support.