Data Format Summit Meeting

Last week, on Wednesday, the Unidata netCDF team spent the day with Quincey and Larry of the HDF5 team. This was great because we usually don't get to spend this much time with Quincey, and we worked out a lot of issues relating to netCDF/HDF5 interoperability.

I came away with the following action items:

  • switch to WEAK file close
  • enable write access for HDF5 files without creation ordering
  • deferred metadata read
  • show multi-dimensional atts as 1D, like Java
  • ignore reference types
  • try to allow attributes on user defined types
  • forget about stored property lists
  • throw away extra links to groups and objects (like Java does)
  • work with Kent/Elena on docs for NASA/GIP
  • the HDF4 netCDF v2 API writes as well as reads HDF4 files. How should this be handled?
  • John suggests not using EOS libraries but just recoding that functionality.
  • HDF5 team will release tool for those in big-endian wasteland. It will rewrite the file.
  • should store software version in netcdf-4 file somewhere in hidden att.
  • use the HDF5 function to find file type, since it supports user blocks (see the sketch after this list)
  • read gip article
  • update netCDF wikipedia page with format compatibility info
  • data models document for GIP?
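
For the file-type detection item above, here is a minimal sketch of what I have in mind, assuming the HDF5 call in question is H5Fis_hdf5(), which looks for the HDF5 signature at the offsets a user block can push it to; the file name is just for illustration:

#include <stdio.h>
#include <hdf5.h>

int main(void)
{
    /* H5Fis_hdf5() returns positive if the file is HDF5, zero if it is not,
       and negative on error. Because it searches for the HDF5 signature at
       the offsets a user block can push it to, it also works for files that
       have a user block. The file name here is hypothetical. */
    htri_t ret = H5Fis_hdf5("sample_file.nc");

    if (ret > 0)
        printf("HDF5 (or netCDF-4) file\n");
    else if (ret == 0)
        printf("not an HDF5 file\n");
    else
        printf("error checking file\n");

    return 0;
}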

I have been assured that this blog is write-only, so I don't have to explain any of the above, because no one is reading this! ;-)

The tasks above, when complete, will together add up to a lot more interoperability between netCDF-4 and existing HDF5 data files, allowing netCDF tools to be used on HDF5 files.

NetCDF Presentation at HDF5 Workshop

This week I am attending the HDF5 workshop in Champaign, Illinois. I am learning a lot of interesting things about HDF5, and I gave a presentation on netCDF, which is now available on the netCDF web site for those who are interested:

Hartnett, E., 2010-09: NetCDF and HDF5 - HDF5 Workshop 2010.

It's great to see the HDF5 team again!

NPP Data and HDF5 Reference Type

The NPP satellite mission will produce its data in HDF5. There is great interest in seeing these data through the netCDF API, so that netCDF tools can work with NPP data.

At the HDF5 workshop this week, Elena gave me a sample file that is like the NPP files we will eventually see.

NetCDF can read HDF5 files as long as they follow certain rules, but the NPP files don't follow all of those rules. In particular, they use the HDF5 reference type, which is not currently handled by netCDF. My plan is to have netCDF not give up when it runs into a reference object.
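
To illustrate the behavior I'm after, here is a minimal sketch (not the actual library code; the file name is hypothetical) of opening an HDF5 file through the netCDF API and listing the variables netCDF does understand. The idea is that anything of an unsupported type, such as a reference, would simply be absent from the list rather than making the open fail:

#include <stdio.h>
#include <netcdf.h>

int main(void)
{
    int ncid, nvars, varid;
    char name[NC_MAX_NAME + 1];
    nc_type xtype;

    /* Open a hypothetical NPP-like HDF5 file read-only through netCDF. */
    if (nc_open("npp_sample.h5", NC_NOWRITE, &ncid))
        return 1;

    /* List the variables netCDF was able to map. Variables of unsupported
       types (like the HDF5 reference type) should just be skipped, not
       cause the whole open to fail. */
    if (nc_inq_nvars(ncid, &nvars))
        return 1;
    for (varid = 0; varid < nvars; varid++)
    {
        if (nc_inq_varname(ncid, varid, name) ||
            nc_inq_vartype(ncid, varid, &xtype))
            return 1;
        printf("variable %s has netCDF type %d\n", name, (int)xtype);
    }

    return nc_close(ncid);
}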

Developments in NetCDF C Library For 4.1.2 Release

There have been many performance improvements in the upcoming netCDF-4.1.2 release.

One improvement is a complete refactor of all netCDF-4 memory structures. Now the metadata of a netCDF file occupies the smallest possible amount of memory. I have added many more Valgrind tests, and the HDF5 team has worked hard to track down memory issues in HDF5. (Most were not really bugs, just things that Valgrind doesn't like.)

It's particularly important on high performance platforms that memory use be minimized. If you run a program on 10,000 processors, and each of them uses too much memory for the metadata, that adds up to a lot of wasted memory. And in HPC they have better uses for their memory.

The biggest improvement in performance came from a rewrite of the way that netCDF-4 reads the HDF5 file. The code has been rewritten in terms of the H5Literate() function, and this has resulted in a huge performance gain. Here's an email from Russ quantifying this gain:

From: Russ Rew <russ-AT-unidata.ucar-DOT-edu>
Subject: timings of nc_open speedup
To: ed-AT-unidata.ucar-DOT-edu
Date: Thu, 23 Sep 2010 15:23:12 -0600
Organization: UCAR Unidata Program
Reply-to: russ-AT-unidata.ucar-DOT-edu                                                                                                                                                    

Ed,

On Jennifer Adam's file, here's the before and after timings on buddy (on the file and a separate copy, to defeat caching):

  real  0m32.60s
  user  0m0.15s
  sys   0m0.46s

  real  0m0.14s
  user  0m0.01s
  sys   0m0.02s

which is a 233x speedup.

Here's before and after for test files I created that have twice as many levels as Jennifer Adam's and much better compression:

  real  0m23.78s
  user  0m0.24s
  sys   0m0.60s

  real  0m0.05s
  user  0m0.01s
  sys   0m0.01s

which is a 475x speedup.  By using even more levels, the speedup becomes arbitrarily large, because now nc_open takes a fixed amount of time that depends on the amount of metadata, not the amount of data.

--Russ

As Russ notes, this speedup can be made arbitrarily large if we tailor the input file appropriately. But Jennifer's file is a real one, and at 18.4 gigabytes (name: T159_1978110112.nc4) it is a real disk-buster. Yet it has a simple metadata structure, and a more than 200x speedup is very nice. We had been talking about a new file open mode which would not read the metadata at open time, all because opening was taking so long. I guess I don't have to code that up now, so that's at least a couple of weeks of work saved by this fix! (Not to mention that now netCDF-4 will work much better for these really big files, which are becoming more and more common.)
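
For anyone curious, here is a rough sketch of the H5Literate()-style iteration the new open code is built around; this is a simplified illustration, not the actual netCDF-4 source:

#include <stdio.h>
#include <hdf5.h>

/* Called once per link in the group; name is the link (object) name. */
static herr_t
visit_link(hid_t grp_id, const char *name, const H5L_info_t *info, void *op_data)
{
    (void)grp_id; (void)info; (void)op_data;
    printf("found object: %s\n", name);
    return 0; /* zero means: keep iterating */
}

int main(void)
{
    hid_t fileid = H5Fopen("T159_1978110112.nc4", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t grpid = H5Gopen2(fileid, "/", H5P_DEFAULT);

    /* One pass over the group's links; the library calls visit_link for
       each one, instead of our code looking objects up by index. */
    H5Literate(grpid, H5_INDEX_NAME, H5_ITER_INC, NULL, visit_link, NULL);

    H5Gclose(grpid);
    H5Fclose(fileid);
    return 0;
}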

Here's the ncdump -h of this lovely test file:

netcdf T159_1978110112 {
dimensions:
        lon = 320 ;
        lat = 160 ;
        lev = 11 ;
        time = 1581 ;
variables:
        double lon(lon) ;
                lon:units = "degrees_east" ;
                lon:long_name = "Longitude" ;
        double lat(lat) ;
                lat:units = "degrees_north" ;
                lat:long_name = "Latitude" ;
        double lev(lev) ;
                lev:units = "millibar" ;
                lev:long_name = "Level" ;
        double time(time) ;
                time:long_name = "Time" ;
                time:units = "minutes since 1978-11-01 12:00" ;
        float temp(time, lev, lat, lon) ;
                temp:missing_value = -9.99e+08f ;
                temp:longname = "Temperature [K]" ;
                temp:units = "K" ;
        float geop(time, lev, lat, lon) ;
                geop:missing_value = -9.99e+08f ;
                geop:longname = "Geopotential [m^2/s^2]" ;
                geop:units = "m^2/s^2" ;
        float relh(time, lev, lat, lon) ;
                relh:missing_value = -9.99e+08f ;
                relh:longname = "Relative Humidity [%]" ;
                relh:units = "%" ;
        float vor(time, lev, lat, lon) ;
                vor:missing_value = -9.99e+08f ;
                vor:longname = "Vorticity [s^-1]" ;
                vor:units = "s^-1" ;
        float div(time, lev, lat, lon) ;
                div:missing_value = -9.99e+08f ;
                div:longname = "Divergence [s^-1]" ;
                div:units = "s^-1" ;
        float uwnd(time, lev, lat, lon) ;
                uwnd:missing_value = -9.99e+08f ;
                uwnd:longname = "U-wind [m/s]" ;
                uwnd:units = "m/s" ;
        float vwnd(time, lev, lat, lon) ;
                vwnd:missing_value = -9.99e+08f ;
                vwnd:longname = "V-wind [m/s]" ;
                vwnd:units = "m/s" ;
        float sfp(time, lat, lon) ;
                sfp:missing_value = -9.99e+08f ;
                sfp:longname = "Surface Pressure [Pa]" ;
                sfp:units = "Pa" ;

// global attributes:
                :NCO = "4.0.2" ;
}

Special thanks to Jennifer Adams, from the GrADS project. Not only did she provide this great test file, but she also built my branch distribution and tested the fix for me! Thanks Jennifer! Thanks also to Quincey of the HDF5 team for helping me sort out the best way to read an HDF5 file.

Now I just have to make sure that parallel I/O is working OK, and then 4.1.2 will be ready for release!
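
The sort of check I have in mind is a small MPI program that opens a netCDF-4 file for parallel access; here's a minimal sketch, assuming a parallel (MPI) build of the library, with the file name reused from above just for illustration:

#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int main(int argc, char **argv)
{
    int ncid;

    MPI_Init(&argc, &argv);

    /* All processes open the same netCDF-4/HDF5 file for parallel reads. */
    if (nc_open_par("T159_1978110112.nc4", NC_NOWRITE | NC_MPIIO,
                    MPI_COMM_WORLD, MPI_INFO_NULL, &ncid))
    {
        fprintf(stderr, "parallel open failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... each process would read its own slab of the data here ... */

    nc_close(ncid);
    MPI_Finalize();
    return 0;
}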

NetCDF-4.1 Released!

Now we wait for any bugs to be reported...

It is always a tremendous amount of effort to make a netCDF release. There are so many people depending on it, and I really don't want to mess it up.

The 4.1 release, which I did last Friday, was my sixth netCDF release. This release has a lot of new features - more than usual, since we have had more than the usual number of programmers working on netCDF. It hasn't been just me and (part-time) Russ, as in previous releases. This is the first release that contains work from Dennis, and what a large piece of work it does contain: the new OPeNDAP client.

In addition, the 4.1 release contains a new utility, nccopy, which copies netCDF files, changing the format on the way if desired. (Russ wrote nccopy.) There is also a new ncgen, finally up to speed on all the netCDF-4 extensions, which Dennis wrote.

As for me, I put in a layer that can read (many but not all) HDF4 files, a layer that can use parallel-netcdf for parallel I/O to classic and 64-bit offset files, and a modification that allows most existing HDF5 data files to be read by netCDF. At the last minute, I also changed the default settings of caching and chunking so that the users producing the giant AR-5 datasets would get good performance. Finally, I added the nc-config utility to help users come up with correct compiler flags when building netCDF programs, and two new libraries are now part of the distribution: UDUNITS (Steve Emmerson) and libcf (yours truly).
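
For anyone curious what chunking and chunk cache settings look like from the API side, here is a minimal sketch of the calls involved; the dimensions, chunk sizes, and cache numbers are made up for illustration and are not the new defaults:

#include <netcdf.h>

int main(void)
{
    int ncid, dimids[2], varid;
    size_t chunks[2] = {100, 100}; /* chunk sizes chosen for illustration */

    /* Create a netCDF-4 file with one chunked, compressed variable. */
    nc_create("chunking_example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "y", 1000, &dimids[0]);
    nc_def_dim(ncid, "x", 1000, &dimids[1]);
    nc_def_var(ncid, "temp", NC_FLOAT, 2, dimids, &varid);
    nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
    nc_def_var_deflate(ncid, varid, 0, 1, 1);

    /* Per-variable chunk cache: size in bytes, number of slots, preemption. */
    nc_set_var_chunk_cache(ncid, varid, 16 * 1024 * 1024, 1009, 0.75f);

    nc_close(ncid);
    return 0;
}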

Probably we should have released 6 months ago, and held some of those features back for another release. The more new features, the harder it is to test and release.

But it's out now, and we have started working on release 4.2. For that release we have more limited ambitions, and I hope it will be out this year.