Dataset Schemas are lost in GRIB datasets

Model data output, such as from climate models, is typically stored as multidimensional arrays in a scientific file format such as netCDF, HDF, or GRIB. For each parameter or variable, the canonical form of the data looks, for example, like:

   float temperature(time=100, z=48, y=180, x=360);

which indicates a four-dimensional variable stored as 32-bit floats, with dimension sizes that are typical for global models. The dimension sizes, in this example (100, 48, 180, 360), are called the variable's shape. For each dimension there is a coordinate variable that allows the data to be georeferenced, for example:

   int time(time=100);
   int z(z=48);
   float y(y=180);
   float x(x=360);

Some variables, such as surface data, don't have a z coordinate, so they have the form:

   float surfaceFlux(time=100, y=180, x=360);

Ensemble models add another dimension, so an ensemble variable looks like:

   float temperatureEnsemble(time=100, ens=20, z=48, y=180, x=360);

These declarations constitute the dataset schema, analogous to the database schema in relational databases. The schema defines what is stored in the dataset and what the valid subset requests, or queries, are.
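
As a small illustration of how the schema constrains queries, the sketch below (generic code, not any particular library's API) checks whether a subset request, given as per-dimension start and count indices, fits within a variable's shape:

   class SubsetCheck {
     // A request is valid only if every index range lies within the variable's shape.
     static boolean isValid(int[] shape, int[] start, int[] count) {
       if (start.length != shape.length || count.length != shape.length) return false;
       for (int i = 0; i < shape.length; i++) {
         if (start[i] < 0 || count[i] < 1 || start[i] + count[i] > shape[i]) return false;
       }
       return true;
     }

     public static void main(String[] args) {
       int[] tempShape = {100, 48, 180, 360};  // temperature(time, z, y, x)
       // one time, one level, the full horizontal grid: valid
       System.out.println(isValid(tempShape, new int[]{0, 10, 0, 0}, new int[]{1, 1, 180, 360}));
       // z index 50 is outside z=48: invalid
       System.out.println(isValid(tempShape, new int[]{0, 50, 0, 0}, new int[]{1, 1, 180, 360}));
     }
   }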

When model data is encoded in GRIB (both GRIB-1 and GRIB-2), the dataset schema is not explicitly recorded. Instead, the data is represented as a collection of records, each containing a 2-D (y, x) slice. Each record is self-contained, with no references to any other record, and there is no way to describe the overall shape of a variable. Instead, each GRIB record contains the coordinate values (time, level, ensemble) that apply to that record, as well as a description of the variable. A GRIB dataset is thus an unordered collection of GRIB records, and here we assume that this collection is coherent, e.g. represents the output from a single model. The NCEP GFS half-degree global forecast model generates more than 22,250 GRIB records in 126 variables every 6 hours. Collections of model forecast runs, and long-running models like climate simulations, may comprise many hundreds of thousands of GRIB records.

In principle, if one wants to figure out the dataset schema of a GRIB dataset, one reads through the records, identifies the variables, collects the coordinate values that each variable uses, and thereby finds each variable's shape and coordinates. In practice this is surprisingly hard to do, for the reasons discussed below.
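
Before getting to those reasons, here is a rough outline of that in-principle procedure, accumulating the coordinates seen for each variable and deriving its shape from them; the accessors and the string variable key are hypothetical placeholders, not the actual CDM classes:

   import java.util.*;

   class SchemaBuilder {
     // For each variable, the distinct time and level coordinate values
     // collected across all of its records.
     private final Map<String, SortedSet<Long>> timesByVar = new HashMap<>();
     private final Map<String, SortedSet<Double>> levelsByVar = new HashMap<>();

     // Called once per GRIB record; varKey identifies the variable (see below),
     // and the coordinate values are read from the record itself.
     void addRecord(String varKey, long timeCoord, double levelCoord) {
       timesByVar.computeIfAbsent(varKey, k -> new TreeSet<>()).add(timeCoord);
       levelsByVar.computeIfAbsent(varKey, k -> new TreeSet<>()).add(levelCoord);
     }

     // The derived shape is (ntimes, nlevels, ny, nx); ny and nx come from the GDS.
     int[] shape(String varKey, int ny, int nx) {
       return new int[] { timesByVar.get(varKey).size(),
                          levelsByVar.get(varKey).size(), ny, nx };
     }
   }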

First and foremost, variables are not given unique names in a GRIB record. Instead, they are represented as collections of attributes in the GRIB record's Product Definition Section (PDS). A typical PDS has several dozen such attributes. There is no formal way to distinguish which attributes should be used to create unique variables. In database terminology, the record table is denormalized, with no key that would allow one to group the records by variable. The job of choosing which attributes should distinguish one variable from another is left to the reader; there is no mechanism in the GRIB specification for conveying that choice.

In Unidata's CDM library, a careful examination of 30 or 40 operational GRIB datasets, mostly from NCEP, has led us to choose the following fields from the PDS to define unique variables: GDS, PDS template, discipline, category, parameter, and level type, and, where they apply, the statistical process type (code table 4.10) and the ensemble derived type (code table 4.7). These seem to be correct for the GRIB datasets we have examined. There is no guarantee they are correct for all GRIB datasets, since there is no way for the writers of GRIB datasets to convey their intentions in the GRIB records themselves. One can (and we do) examine external documentation for the model output and verify that the variables listed there are the same as the ones our algorithm creates. This validation underscores the fact that each model really does have a dataset schema as described above; the schema is unfortunately just not encoded in the GRIB dataset.
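
In code, grouping records by those fields amounts to building a hashable key from them. The sketch below is illustrative, not the CDM's actual implementation (in particular, using -1 to stand for "not applicable" is an assumption made here):

   import java.util.Arrays;

   // GRIB records whose keys compare equal are grouped into the same variable.
   final class VariableKey {
     private final int[] fields;

     VariableKey(int gdsHash, int pdsTemplate, int discipline, int category,
                 int parameter, int levelType, int statType, int ensDerivedType) {
       // statType (code table 4.10) and ensDerivedType (code table 4.7) are
       // set to -1 when they do not apply to the record's PDS template.
       this.fields = new int[] { gdsHash, pdsTemplate, discipline, category,
                                 parameter, levelType, statType, ensDerivedType };
     }

     @Override public boolean equals(Object o) {
       return (o instanceof VariableKey) && Arrays.equals(fields, ((VariableKey) o).fields);
     }

     @Override public int hashCode() {
       return Arrays.hashCode(fields);
     }
   }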

There are other things that complicate deriving dataset schemas from collections of GRIB records. The collection may be spread over more than one file. It may be missing some of the records, or some of the files. Encoding errors are not unusual. Duplicate records may exist in the collection. Local tables, templates, and codes, unknown to the reader, may be used. None of these is hard to deal with individually, but writing general purpose GRIB software that can read any GRIB dataset with an arbitrary number of minor defects is challenging.
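
Duplicates, for example, can be handled with a simple first-one-wins rule while scanning; the sketch below is a hypothetical illustration, not the CDM's actual duplicate handling:

   import java.util.*;

   // Keep only the first record seen for each (variable, time, level, ensemble)
   // slot, even when the collection is spread across several files.
   class Deduplicator {
     private final Set<List<Object>> seen = new HashSet<>();

     /** Returns true if this slot has not been seen before; duplicates return false. */
     boolean accept(Object varKey, long time, double level, int ensMember) {
       return seen.add(Arrays.asList(varKey, time, level, ensMember));
     }
   }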

GRIB has been in use for over 20 years, and you might wonder how GRIB readers and writers have managed this state of affairs. My personal guess is that, for the most part, GRIB readers are working with data that they themselves have written, and make whatever assumptions are needed in order to read it correctly. Most GRIB reading libraries that I have seen come from specific national centers and are specialized to read the GRIB datasets that those centers produce. This is most obvious in the lack of local tables from other centers, but there may be more subtle problems if one tries to use software from center A to read GRIB datasets from center B.

Creating general purpose GRIB reading software, as in the CDM library, uncovers some of these problems. But Unidata is most familiar with NCEP datasets, which are used in the US research and educational community that we serve, and it's possible, even likely, that not all our assumptions will be the right ones for all datasets. As our software becomes more widely used in other communities, and as astute users notice and report problems, we expect to be able to correct these faulty assumptions.

A more systematic approach to these issues will require the WMO and the national centers to create a GRIB reference library and interoperability testing software, as well as a GRIB users group that can share knowledge informally, ask questions, and get timely answers.

Comments:

I think it might be more practical to try to form a working group of GRIB users at important national centers, and see if some informal arrangements might be made to improve the situation.

Posted by ed on April 27, 2011 at 12:33 AM MDT #
