netCDF vs Zarr, an Incomplete Comparison

Figure: Visualization created efficiently from netCDF data using kerchunk.

At NSF Unidata, we have been supporting and developing netCDF standards and packages since the original release of netCDF in 1990. We strongly believe in the usefulness of the netCDF Common Data Model for Earth Systems Science data, and for other types of data as well. NetCDF files can be used efficiently in machine learning applications (see Loading NetCDFs in TensorFlow by Noah Brenowitz) and can be presented as virtual Zarr datasets using the Python package kerchunk: check out Using Kerchunk with uncompressed NetCDF 64-bit offset files: Cloud-optimized access to HYCOM Ocean Model output on AWS Open Data, a nice oceanographic demo by Rich Signell, and Fake it until you make it — Using Kerchunk to read NetCDF4 data on AWS S3 as Zarr for rapid data access by Lucas Sterzinger.
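
As a minimal sketch of the kerchunk workflow mentioned above (the bucket path and file name here are hypothetical, and the example assumes the kerchunk, fsspec, s3fs, and xarray packages are available), a netCDF-4/HDF5 file on S3 can be scanned once and then opened as a virtual Zarr dataset without copying any data:

    import fsspec
    import xarray as xr
    from kerchunk.hdf import SingleHdf5ToZarr

    # Hypothetical object path; any netCDF-4/HDF5 file on S3 would do.
    url = "s3://example-bucket/model_output.nc"

    # Scan the file once to build a JSON-style index of its internal chunks.
    with fsspec.open(url, "rb", anon=True) as f:
        refs = SingleHdf5ToZarr(f, url).translate()

    # Open the original file as if it were a Zarr store.
    mapper = fsspec.get_mapper("reference://", fo=refs,
                               remote_protocol="s3",
                               remote_options={"anon": True})
    ds = xr.open_dataset(mapper, engine="zarr",
                         backend_kwargs={"consolidated": False})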

Zarr is an emerging data standard, first introduced in 2016, that offers strong support for efficient subsetting and chunking, cloud optimization, and flexible metadata handling. Zarr was born out of the need for scientific data formats optimized for object storage rather than traditional file-/block-based storage, a need driven by the explosion of cloud-hosted scientific data over the last decade. As a result, Zarr has some distinct cloud optimization features not found in the file formats previously supported by netCDF.
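
For readers new to Zarr, here is a minimal sketch of writing a chunked array to a local directory store (using the zarr-python v2-style API; the array shape, chunk sizes, and attribute are illustrative only):

    import numpy as np
    import zarr

    # Create a chunked 2-D array backed by a local directory store.
    z = zarr.open("example.zarr", mode="w", shape=(1000, 1000),
                  chunks=(100, 100), dtype="f4")
    z[:] = np.random.random((1000, 1000))

    # Metadata is stored as flexible, JSON-style attributes.
    z.attrs["units"] = "kelvin"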

netCDF and Zarr

In 2016, NSF Unidata was urged by our community to investigate options to allow netCDF to work more easily with modern cloud-based infrastructure. At that time, Zarr was identified as one of several possible initial avenues of interest. Based on the strong interest and rapid adoption of Zarr by the community, the netCDF team decided to begin working with the Zarr community to leverage the good work and contributions being made. Since then, NSF Unidata has been an active participant in Zarr community meetings. Since 2022, NSF Unidata has had a voting seat on the Zarr Implementation Committee (ZIC), giving our community a formal voice in the technical development process adopted by the Zarr project.

At NSF Unidata we are interested and invested in the success of Zarr, and see it as a complement to our netCDF efforts. Since our initial introduction to the Zarr community, the netCDF team has implemented NCZarr, a data storage format largely compatible with the netCDF-4 enhanced data model. This new format has been integrated into the netCDF libraries so that users can take advantage of cloud-based object storage without having to overhaul their existing code or move away from netCDF software. A side effect of this adoption is the ability to convert compatible datasets between NCZarr-based storage and more traditional netCDF files in block-based storage.
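
As a rough sketch of what this looks like from the netCDF side (assuming a netCDF-C build with NCZarr support and the netCDF4 Python package; the path, dimension, and variable names are hypothetical), a dataset can be written to an NCZarr store simply by passing a URL-style path with a mode fragment:

    import netCDF4
    import numpy as np

    # The "#mode=nczarr,file" fragment asks the netCDF library to treat the
    # target as an NCZarr store on the local filesystem rather than a classic file.
    ds = netCDF4.Dataset("file:///tmp/example.zarr#mode=nczarr,file", "w")
    ds.createDimension("time", 10)
    temp = ds.createVariable("temperature", "f4", ("time",))
    temp[:] = np.arange(10, dtype="f4")
    ds.close()

The same code, pointed at an ordinary .nc path, writes a traditional netCDF-4 file, which is what makes converting between the two layouts straightforward.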

Unexpected Interactions with HPC

By design, Zarr's primary focus is object storage. As the scientific community has investigated the use of Zarr in research activities, situations have come to light where Zarr is not an appropriate choice. There have been surprising observations, particularly on High Performance Computing (HPC) systems, as the community moves beyond sample datasets and begins exploring real-world data.

As part of its approach to object storage, Zarr generates a large number of 'files', typically one per chunk, that together represent the dataset. An unintended consequence of this becomes apparent when Zarr is not operating in an object store environment but is instead used on a traditional block-storage filesystem (such as ext3/ext4, HFS+, or NTFS). The proliferation of files and directories can be a tremendous problem for large HPC systems, which by design serve many different types of users, file types, and software systems.
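
Here is a back-of-the-envelope sketch of why the file count grows so quickly, assuming the default directory store with one file per chunk (the array shape and chunk sizes below are purely illustrative):

    import math

    # Hypothetical variable: one year of hourly global fields on a 0.25-degree grid.
    shape  = (8760, 721, 1440)   # (time, lat, lon)
    chunks = (24, 180, 360)      # chunk sizes along each dimension

    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    print(n_chunks)  # 365 * 5 * 4 = 7300 chunk files for this single variable

Multiply that by a handful of variables and a few hundred users, and the pressure on a shared parallel filesystem adds up quickly.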

While this issue does not arise with object storage, which is common on cloud systems like Amazon S3 or Azure Blob Storage (and increasingly common for Earth Systems Science data), it illustrates that there is seldom a one-size-fits-all solution for scientific data management. While object-storage-hosted scientific data is becoming more common, the bulk of the scientific data used for data analysis, machine learning, and historic archival still lives in traditional computing ecosystems.

We have put together a short and extremely (perfectly?) imperfect Jupyter Notebook that illustrates this: netCDF vs zarr, an imperfect comparison.

While the Zarr store was faster to write, the example test case we used created more than 2,000 files, compared to just 2 netCDF files. With some effort, this notebook could probably be optimized for both netCDF and Zarr generation (we are hoping to get pull requests and comments from you about this!), but it serves to illustrate the situation.
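
If you want to reproduce the file counts on your own outputs, a small helper like the following works (the output paths are hypothetical):

    import pathlib

    def count_files(path):
        """Count regular files beneath a directory, or 1 if the path is a single file."""
        p = pathlib.Path(path)
        return 1 if p.is_file() else sum(1 for f in p.rglob("*") if f.is_file())

    print(count_files("output.zarr"))  # hypothetical Zarr directory store
    print(count_files("output.nc"))    # hypothetical netCDF file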

As more and more HPC centers move to object storage, this potential downside might fade away in the future.

Thomas Martin is an AI/ML Software Engineer at the NSF Unidata Program Center. Have questions? Contact support-ml@unidata.ucar.edu or book an office hours meeting with Thomas on his Calendar.

Ward Fisher is the lead developer for NSF Unidata's netCDF efforts.

Comments:

Yes, a Zarr dataset has a lot of files, but what problems does that actually cause? I tried running the notebook on an HPC system and the read/write performance of Zarr was great.

Posted by Rich Signell on October 02, 2024 at 12:07 AM MDT #

Hi Rich, there is no problem at this notebook scale (and agreed, the I/O speed is impressive!), but it can cause problems if you're on an HPC system with thousands of users who each have hundreds of thousands of files. Just wanted to point out that for specific use cases it might not be the best. For many, it is! Best, Thomas

Posted by Thomas Martin on October 02, 2024 at 02:58 AM MDT #

Hi Thomas, just a couple of remarks: 1. I think it should be mentioned that Zarr can also create a single file as output, e.g. by using a ZipStore or a similar store: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore I haven't tested the performance implications at large scale. 2. The notebook uses different compression methods for Zarr and netCDF. This doesn't affect the number of files, but it does affect the runtimes and the file sizes.

Posted by Panagiotis Mavrogiorgos on October 30, 2024 at 08:49 AM MDT #
