NetCDF ZARR Data Model Specification

Introduction
Notation
Data Model
Excluded Elements
Appendix A. Supporting Lexical Tokens
1. Fully Qualified Names
Appendix B. Supplementary Material
1. Specifying Context-Sensitive Elements
Appendix C. Complete Version of the Abstract Representation Specification

Introduction

This document describes the to-be-implemented NCZarr data model by reference to the netcdf-4 (aka netcdf enhanced) data model. Elements of the enhanced model included in this model will be listed. Elements of the enhanced model not included are listed in a later section.

Notation

In order to represent the abstract structure of the NCZarr data model, we must choose some suitable notation. This notation must meet the requirement that it is typed, meaning that the nodes of the tree have a type and the structure of the node must conform to that type.

Ideally, we would use JSON as our notation since that is the target representation used by the Zarr specification. Unfortunately, JSON is effectively typeless, so we do not consider it powerful enough to properly represent the data model. If some way of expressing the required type constraints in JSON is found, then it may become a viable notation.

We choose Antlr4 [1] as our formalism because it is designed for such uses as this one, and it is quite concise. In the following specification, upper-case names (such as NAME or ZARRVERSION) are terminals in the parsing sense and are specified in Appendix A.

Data Model

Dataset

dataset : NAME ZARRVERSION (dimension | variable | attribute | group)*

The unit of data storage in NCZarr, as with netcdf-4, is the Dataset. A Dataset is also a Group (see below), so it can contain variables, attributes, and (sub-)groups. These semantics are consistent with the netcdf-4 Dataset semantics.

Group

group: NAME (dimension | variable | attribute | group)*

A Group contains a collection of dimension declarations, variable declarations, attributes, and (sub-)groups. Note that user-defined type declarations are not (yet) included.

Attribute

attribute : NAME value_type (CONSTANT)+

An Attribute contains an ordered sequence of values, where the values are constants consistent with the specified type of the attribute. An attribute must have at least one value.
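The two constraints on an attribute (at least one value, and every value consistent with the declared type) can be checked mechanically. The following is a minimal Python sketch and not part of the specification. The names `INT_TYPES`, `value_range`, and `check_attribute` are illustrative assumptions, and the ranges are derived from the integer widths listed in the Types section.

```python
# Hypothetical table: type lexeme -> (bit width, signed?).
# "char" is listed as an unsigned 8-bit value, following the Character Type section.
INT_TYPES = {
    "byte": (8, True), "ubyte": (8, False),
    "short": (16, True), "ushort": (16, False),
    "int": (32, True), "uint": (32, False),
    "int64": (64, True), "uint64": (64, False),
    "char": (8, False),
}


def value_range(type_name):
    """Return the inclusive (low, high) range for a type lexeme."""
    bits, signed = INT_TYPES[type_name]
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def check_attribute(name, type_name, values):
    """Validate an attribute and return its values as an ordered list."""
    if len(values) < 1:
        raise ValueError("attribute %r must have at least one value" % name)
    low, high = value_range(type_name)
    for v in values:
        if not low <= v <= high:
            raise ValueError("%r is out of range for %s" % (v, type_name))
    return list(values)
```

For example, `check_attribute("flags", "ubyte", [1, 2, 255])` is accepted, while an empty value list or the value 256 is rejected.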

Dimension

dimension: NAME SIZE

A Dimension declaration defines a named dimension with a specified size.

Variable

variable: NAME type (dimref)* (attribute)*

A Variable declaration defines a named variable of a specified type. It also can reference a set of dimensions defining the rank and size of the variable. If no dimensions are referenced, then the variable is a scalar.

Additionally, any number of attributes can be associated with the variable to define properties about the variable.
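As an informal illustration of the typed tree that these rules describe, the node kinds can be written as Python dataclasses. This is only a sketch under assumptions: the class and field names are invented here, the members of a group are split into separate lists (the grammar allows them to be interleaved), and the default Zarr version string is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Attribute:
    name: str
    type: str
    values: List[int]  # at least one value is required by the model


@dataclass
class Dimension:
    name: str
    size: int


@dataclass
class Variable:
    name: str
    type: str
    # Each dimension reference is either an int (anonymous size) or an FQN string.
    dimrefs: List[Union[int, str]] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)

    @property
    def rank(self) -> int:
        return len(self.dimrefs)

    @property
    def is_scalar(self) -> bool:
        # No dimension references means the variable is a scalar.
        return self.rank == 0


@dataclass
class Group:
    name: str
    dimensions: List[Dimension] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)
    attributes: List[Attribute] = field(default_factory=list)
    groups: List["Group"] = field(default_factory=list)


@dataclass
class Dataset(Group):
    # A Dataset is a Group that also carries a Zarr version (placeholder value).
    zarr_version: str = "2.0.0"


ds = Dataset(
    name="example",
    dimensions=[Dimension("time", 10)],
    variables=[
        Variable("t", "int", ["/time"], [Attribute("scale", "int", [1])]),
        Variable("count", "uint"),
    ],
)
```

Here `ds.variables[0]` has rank 1 and `ds.variables[1]` is a scalar.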

Dimension Reference

dimref: SIZE | FQN

A Dimension reference specifies one of the dimensions of a variable, either by defining an anonymous dimension whose size is given directly, or by providing the fully qualified name referring to a dimension defined in some Group via a Dimension declaration.
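The way a list of dimension references determines the rank and shape of a variable can be sketched as follows. This is an illustrative Python fragment, not an NCZarr API: `resolve_shape` is a hypothetical helper, and the named dimensions are assumed to be available as a dictionary from FQN to size.

```python
def resolve_shape(dimrefs, dimensions):
    """Turn a list of dimension references into a shape tuple.

    dimrefs    -- list whose items are ints (anonymous sizes) or FQN strings
    dimensions -- dict mapping the FQN of each declared dimension to its size
    """
    shape = []
    for ref in dimrefs:
        if isinstance(ref, int):
            if ref < 0:
                raise ValueError("dimension size must be non-negative")
            shape.append(ref)  # anonymous dimension: the size is given directly
        elif ref in dimensions:
            shape.append(dimensions[ref])  # named dimension: look up by FQN
        else:
            raise KeyError("undefined dimension %r" % ref)
    return tuple(shape)


dims = {"/time": 10, "/g1/x": 4}
```

With these declarations, `resolve_shape(["/time", 3], dims)` gives the shape `(10, 3)`, and an empty reference list gives `()`, the shape of a scalar.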

Types

type: atomic_type ;
atomic_type: fixed_atomic_type | char_type ;
fixed_atomic_type: BYTE_T // A signed 8 bit integer
    | UBYTE_T // An unsigned 8 bit integer
    | SHORT_T // A signed 16 bit integer
    | USHORT_T // An unsigned 16 bit integer
    | INT_T // A signed 32 bit integer
    | UINT_T // An unsigned 32 bit integer
    | INT64_T // A signed 64 bit integer
    | UINT64_T // An unsigned 64 bit integer
    ;
char_type: CHAR_T ;

For now, NCZarr only supports the signed and unsigned integer types of sizes 8, 16, 32, and 64 bits. It also supports an approximation to the character type. Addition of more complex types such as strings must await the Zarr version 3 specification.

These atomic types are the ones that can be used when specifying the type of a variable or an attribute. The names are taken from the corresponding netCDF-4 specification.
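To make the storage sizes of these types concrete, the sketch below maps each type lexeme to a format code of Python's standard `struct` module. The mapping table and the helper names are assumptions made for illustration, as is the little-endian byte order. The specification itself does not prescribe a byte order here.

```python
import struct

# Hypothetical mapping from type lexeme to a struct format code.
STRUCT_CODES = {
    "byte": "b", "ubyte": "B",
    "short": "h", "ushort": "H",
    "int": "i", "uint": "I",
    "int64": "q", "uint64": "Q",
    "char": "B",  # char is treated as an unsigned 8-bit value
}


def size_in_bytes(type_name):
    """Size of one value of the given type, using struct's standard sizes."""
    return struct.calcsize("<" + STRUCT_CODES[type_name])


def pack_values(type_name, values):
    """Pack a sequence of values as little-endian bytes (assumed byte order)."""
    fmt = "<%d%s" % (len(values), STRUCT_CODES[type_name])
    return struct.pack(fmt, *values)
```

`struct.pack` raises `struct.error` when a value does not fit the chosen code, so the same table also acts as a range check.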

Character Type

The character type is almost universally (except for Java) associated with an 8-bit unsigned value. But this has always caused problems because, historically, multiple encodings have been associated with it, for example ASCII, ISO-8859-1 (Latin-1), and UTF-8.

Each encoding may support only a subset of the 256 possible values that can be represented by an 8-bit unsigned value. In the case of UTF-8, which supports multi-byte characters, a single 8-bit value may not even be able to represent a legal UTF-8 character.

To deal with this, we essentially punt by declaring the character type to be the same as UBYTE_T (an 8-bit unsigned integer). Interpretation of the encoding of a character is then outside the scope of this document.
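The point about multi-byte characters can be seen directly in a few lines of Python. This is an illustrative sketch only, and the helper names `to_ubytes` and `is_complete_utf8` are invented for it.

```python
def to_ubytes(text, encoding="utf-8"):
    """Encode text and return the result as a list of unsigned 8-bit values."""
    return list(text.encode(encoding))


def is_complete_utf8(values):
    """True if the list of unsigned 8-bit values decodes as UTF-8."""
    try:
        bytes(values).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_a = to_ubytes("A")                      # one value
e_acute_utf8 = to_ubytes("\u00e9")            # U+00E9 needs two values in UTF-8
e_acute_latin1 = to_ubytes("\u00e9", "latin-1")  # one value in ISO-8859-1
```

The first of the two UTF-8 values for U+00E9 is not a legal character on its own, which is why a single char element cannot be interpreted without knowing the encoding and its neighbours.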

Excluded Elements

The initial data model for NCZarr deliberately excludes a number of netcdf-4 concepts so that a working implementation can be achieved as rapidly as possible. Additionally, implementation of some netcdf-4 features needs to be coordinated with the new version 3 Zarr specification.

Strings

The biggest omission is the netcdf-4 String type. The reason is that it is a varying length type and proper representation in Zarr is still incomplete. It is expected that this will be the first new type to be added since it is so useful. For now, the netcdf-3 approach of using arrays of characters will need to be used.
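The character-array approach can be sketched as below: each string becomes one fixed-width row of unsigned 8-bit values. The helper names, the UTF-8 encoding, and the use of zero as the padding value are assumptions made for this illustration, not requirements of NCZarr.

```python
def strings_to_char_array(strings, width=None, pad=0):
    """Store strings as rows of a fixed-width 2-D array of 8-bit values."""
    encoded = [s.encode("utf-8") for s in strings]
    if width is None:
        # Default to the longest encoded string.
        width = max((len(e) for e in encoded), default=0)
    rows = []
    for e in encoded:
        if len(e) > width:
            raise ValueError("string does not fit in %d characters" % width)
        rows.append(list(e) + [pad] * (width - len(e)))
    return rows


def char_array_to_strings(rows, pad=0):
    """Recover the strings by stripping trailing padding from each row."""
    return [bytes(row).rstrip(bytes([pad])).decode("utf-8") for row in rows]


rows = strings_to_char_array(["ab", "c"])
```

The second dimension of such an array is the maximum string length, which is the cost of not having a varying-length type.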

User-Defined Types

The netcdf-4 user-defined type constructors are enumeration, compound, opaque, and vlen. Of these, the most problematic is vlen because of its varying length. Without it, the others would all be fixed size and could be implemented. In fact, the v2 Zarr specification does provide for compound types, but we choose to wait for v3 before implementing them.

Unlimited Dimension Size

The netcdf-4 notion of unlimited allows for the definition of a dimension whose size is known at any given point in time, but whose size can vary over time. It is still the case that all references to it are required to have the same size and this can cause some difficulties at the storage level where it can introduce undefined values into existing variables.
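The storage-level difficulty can be illustrated with a toy model in which several one-dimensional variables share one unlimited dimension. Everything in this Python sketch is hypothetical, including the `grow_dimension` helper and the use of `None` to stand for an undefined value.

```python
UNDEFINED = None  # hypothetical marker for a value that was never written


def grow_dimension(variables, new_size, fill=UNDEFINED):
    """Extend every variable sharing the dimension to the new size.

    variables -- dict mapping a variable name to its list of values
    """
    for name, data in variables.items():
        if len(data) > new_size:
            raise ValueError("shrinking is not modelled in this sketch")
        # All references must keep the same size, so pad with the fill value.
        data.extend([fill] * (new_size - len(data)))


variables = {"temp": [1, 2, 3], "pressure": [7, 8, 9]}
grow_dimension(variables, 4)   # the dimension grows from 3 to 4
variables["temp"][3] = 5       # only "temp" receives a new record
```

After the dimension grows, `pressure` also has four elements, and the last one is undefined until something is written to it.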

Appendix A. Supporting Lexical Tokens

In order to completely interpret the above data model, a number of supporting lexical definitions are required and are described here.

NAME: IDCHAR+
FQN: ([/])|([/](IDCHAR)+)+
SIZE: DIGITS // Non-negative integer
ZARRVERSION: DIGITS '.' DIGITS '.' DIGITS
// Type Lexemes
BYTE_T: 'byte'
UBYTE_T: 'ubyte'
SHORT_T: 'short'
USHORT_T: 'ushort'
INT_T: 'int'
UINT_T: 'uint'
INT64_T: 'int64'
UINT64_T: 'uint64'
CHAR_T: 'char'
// Exact form is as usual, but will leave out for now
CONSTANT: INTEGER | UNSIGNED | FLOAT | CHAR
fragment DIGITS: [0-9]+
fragment UTF8: // Assume base character set is UTF8
fragment ASCII: [0-9a-zA-Z !#$%()*+:;<=>?@[]\^_`|{}~] // Printable ASCII
fragment IDCHAR: (IDASCII|UTF8)
fragment IDASCII: [0-9a-zA-Z!#$%()*+:;<=>?@[]^_`|{}~] | '\\' | '\/' | '\ '

A NAME consists of a sequence of any legal non-control UTF-8 characters. A control character is any UTF-8 character in the inclusive range 0x00 — 0x1F.
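The prose rule for names can be expressed as a short check. This Python sketch follows only the sentence above (non-empty, no characters in the range 0x00 to 0x1F); it does not attempt to reproduce the IDCHAR fragment of the grammar, and `is_valid_name` is a hypothetical helper name.

```python
def is_valid_name(name):
    """True if name is non-empty and contains no control characters."""
    if not name:
        return False  # NAME is IDCHAR+, so at least one character is needed
    # Control characters are those with code points 0x00 through 0x1F.
    return all(ord(ch) > 0x1F for ch in name)
```

Under this rule `"temperature"` and `"temp\u00e9rature"` are accepted, while the empty string and a name containing a tab are rejected.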

Fully Qualified Names

Every dimension and variable in an NCZarr Dataset has a Fully Qualified Name (FQN), which provides a way to unambiguously reference it in a dataset. Currently, the only case where this is used is for referencing named dimensions from within variable declarations.

These FQNs follow the common conventions of names for lexically scoped identifiers. In NCZarr scoping is provided by Groups (and the group subtype dataset). Just as with hierarchical file systems or variables in many programming languages, a simple grammar formally defines how the names are built using the names of the FQN's components (see lexical grammar above).

The FQN for a "top-level" variable or dimension is defined purely by the sequence of enclosing groups plus the variable's simple name.

Notes:

Every dataset has a single outermost dataset node, which semantically acts like the root group. Whatever name that dataset has is ignored for the purposes of forming the FQN; it is instead treated as if it had the empty name ("").
There is no limit to the nesting of groups.
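The construction described above can be sketched by walking a group tree and joining the names of the enclosing groups. In this Python illustration the tree is a nested dictionary whose layout is invented for the example, `collect_fqns` is a hypothetical helper, and the names are assumed to contain no characters that need escaping.

```python
def collect_fqns(group, prefix=""):
    """Return the FQNs of all dimensions and variables under a group.

    The outermost call uses an empty prefix, so the name of the dataset
    node itself never appears in any FQN.
    """
    fqns = []
    for kind in ("dimensions", "variables"):
        for name in group.get(kind, []):
            fqns.append(prefix + "/" + name)
    for sub in group.get("groups", []):
        # Nested groups contribute their own name to the prefix.
        fqns.extend(collect_fqns(sub, prefix + "/" + sub["name"]))
    return fqns


dataset = {
    "name": "ignored",
    "dimensions": ["time"],
    "variables": ["t"],
    "groups": [
        {
            "name": "g1",
            "variables": ["v"],
            "groups": [{"name": "g2", "dimensions": ["x"]}],
        }
    ],
}
fqns = collect_fqns(dataset)
```

The recursion has no depth limit of its own beyond that of the Python interpreter, matching the note that groups may nest arbitrarily.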

The character "/" has special meaning in the context of a fully qualified name. This means that if a name added to the FQN contains this character, then that character must be escaped so that it is not misinterpreted. The escape character itself must also be escaped, as must a blank.

The defined escapes are as follows.

Character	Escaped Form
/	\/
\	\\
blank	\blank
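Escaping and unescaping can be sketched as follows, assuming that the escape character is the backslash, as the IDASCII fragment in Appendix A suggests. The helper names `escape_name`, `make_fqn`, and `split_fqn` are invented for this Python illustration.

```python
ESCAPE = "\\"
SPECIAL = {"/", "\\", " "}  # separator, escape character, and blank


def escape_name(name):
    """Prefix every special character in a simple name with the escape."""
    return "".join(ESCAPE + ch if ch in SPECIAL else ch for ch in name)


def make_fqn(groups, name):
    """Build an FQN from the enclosing group names and a simple name."""
    return "".join("/" + escape_name(part) for part in [*groups, name])


def split_fqn(fqn):
    """Split an FQN back into its unescaped components."""
    if not fqn.startswith("/"):
        raise ValueError("an FQN must start with '/'")
    if fqn == "/":
        return []  # the root group has no components
    parts, current, i = [], [], 1
    while i < len(fqn):
        ch = fqn[i]
        if ch == ESCAPE:
            if i + 1 >= len(fqn):
                raise ValueError("dangling escape character")
            current.append(fqn[i + 1])  # take the escaped character literally
            i += 2
        elif ch == "/":
            parts.append("".join(current))  # unescaped '/' ends a component
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts
```

For instance, a variable named `a/b` inside a group named `g 1` gets an FQN in which both the slash and the blank are preceded by a backslash, and `split_fqn` recovers the two original names.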

Appendix B. Supplementary Material

Specifying Context-Sensitive Elements

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Appendix C. Complete Version of the Abstract Representation Specification

This is the complete Antlr specification in a form that can be processed by Antlr.

grammar z ;

dataset : NAME ZARRVERSION (dimension | variable | attribute | group)* ;
group: NAME (dimension | variable | attribute | group)* ;
attribute : NAME value_type (CONSTANT)+ ;
dimension: NAME SIZE ;
variable: NAME type (dimref)* (attribute)* ;
dimref: SIZE | FQN ;
type: atomic_type ;
atomic_type: fixed_atomic_type | char_type ;
fixed_atomic_type: BYTE_T // A signed 8 bit integer
    | UBYTE_T // An unsigned 8 bit integer
    | SHORT_T // A signed 16 bit integer
    | USHORT_T // An unsigned 16 bit integer
    | INT_T // A signed 32 bit integer
    | UINT_T // An unsigned 32 bit integer
    | INT64_T // A signed 64 bit integer
    | UINT64_T // An unsigned 64 bit integer
    ;
char_type: CHAR_T ;

// Lexemes
NAME: IDCHAR+ ;
FQN: ([/])|([/](IDCHAR)+)+ ;
SIZE: DIGITS ; // Non-negative integer
ZARRVERSION: DIGITS '.' DIGITS '.' DIGITS ;

// Type Lexemes
BYTE_T: 'byte' ;
UBYTE_T: 'ubyte' ;
SHORT_T: 'short' ;
USHORT_T: 'ushort' ;
INT_T: 'int' ;
UINT_T: 'uint' ;
INT64_T: 'int64' ;
UINT64_T: 'uint64' ;
CHAR_T: 'char' ;

// Exact form is as usual, but will leave out for now
CONSTANT: INTEGER | UNSIGNED | FLOAT | CHAR ;
fragment INTEGER: [+-]?DIGITS ;
fragment UNSIGNED: DIGITS ;
fragment FLOAT: [+-]?DIGITS '.' DIGITS ;
fragment STRING: ''' ~['] ''' ;
fragment DIGITS: [0-9]+ ;
fragment UTF8: ASCII ; // Assume base character set is UTF8
fragment IDCHAR: (IDASCII|UTF8) ;
fragment IDASCII: [0-9a-zA-Z]|[!#$%()*+:;<=>?@]|'['|']'|'\'|[^_`|{}~] |'\\'|'\/'|'\ ' ;
fragment ASCII: [0-9a-zA-Z]|[ !#$%()*+:;<=>?@]|'['|']'|'\'|[^_`|{}~] ; // Printable ASCII

References

[1] https://www.antlr.org/

Copyright

Point of Contact

Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 11/28/2018
Last Revised: 07/2/2019

Posted by: dmh

Jul 2, 2019