DAP4 Commentary: DAP4 On-The-Wire Format
08 April 2012
Background
The current DAP2 clients use two different approaches to
managing the packet of data that is sent by the server.
The C++ libdap library uses what I will call an "eager"
evaluation method. By this I mean that the whole packet is
processed when received, is decomposed into its constituent
parts (e.g. data arrays, sequence records, etc) and those parts
are used to annotate the parsed DDS.
Posted by $entry.creator.screenName
DAP4 Commentary: DDX Lexical Elements
27 March 2012
This document describes the lexical elements that occur in
the DAP4 grammar.
Within the
Relax-NG (rng) DAP4 grammar,
there are markers for occurrences of
primitive types such as integers, floats, or strings.
The markers typically look like this when defining
an attribute that can occur in the DAP4 DDX.
<attribute name="namespace"><data type="string"/></attribute>
The "<data type="string"/>"
specifies the lexical class for the values that this
attribute can have. In this case, the namespace attribute is
defined to have a String value. Similar notation is used
for values occurring as text within an xml element. The
lexical specification later in this document defines the
legal lexical structure for such lexical items.
Specifically, it defines the format of the following lexical
items.
- Constants, namely: string, float, integer, and character.
- Identifiers
The specification is written using extended POSIX regular expression
notation, as specified in
ISO/IEC 9945-2:2003 Information technology -- Portable Operating System Interface (POSIX) -- Part 2: System Interfaces.
I have augmented it in the following ways.
- Names are assigned to regular expressions using the notation
name = regular-expression
- Named expressions can be used in subsequent regular
expressions by using the notation {name}. Such occurrences
are equivalent to textually substituting the expression
associated with name for the {name} occurrence, more or less
like a macro.
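The substitution rule above can be sketched in Python as follows. This is only an illustration, not part of the DAP4 specification. The function name expand and the sample definitions are hypothetical, HEXCHAR is assumed here to mean hexadecimal digits, and each substituted expression is wrapped in a non-capturing group (Python syntax, not POSIX) so that operator precedence survives the substitution.

```python
import re

# Matches a {name} reference; a bare "{}" inside a character class is left alone.
REFERENCE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand(name, definitions, _active=()):
    """Return the pattern for `name` with every {name} reference substituted."""
    if name in _active:
        raise ValueError("circular definition: " + name)

    def substitute(match):
        inner = expand(match.group(1), definitions, _active + (name,))
        return "(?:" + inner + ")"

    return REFERENCE.sub(substitute, definitions[name])

# Hypothetical sample definitions, adapted from the numeric classes
# defined later in this document.
DEFINITIONS = {
    "HEXCHAR": "[0-9a-fA-F]",
    "INTTYPE": "([BbSsLl]|ll|LL)",
    "HEXSTRING": "(0[xX]{HEXCHAR}+)",
    "HEXINT": "{HEXSTRING}{INTTYPE}?",
}

HEXINT = expand("HEXINT", DEFINITIONS)
print(HEXINT)
print(bool(re.fullmatch(HEXINT, "0x1FLL")))
```

Because expansion is recursive, the order in which names are defined does not matter in this sketch; a definition that refers to itself, directly or indirectly, is reported as an error.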
DAP4 Lexical elements
Notes:
- The definition of {UTF8} is deferred to the next section.
- Comments are indicated using the "//" notation.
- Standard xml escape formats (&xDD;) are assumed to be allowed anywhere.
Basic character set definitions
CONTROLS = [\x00-\x1F] // ASCII control characters
WHITESPACE = [ \r\t\f]+
HEXCHAR = [0-9a-fA-F]
// ASCII printable characters
ASCII = [0-9a-zA-Z !"#$%&'()*+,-./:;<=>?@[\\\]\\^_`|{}~]
ASCII characters that may appear unescaped in Identifiers
This is assumed to be basically all ASCII printable characters
except the characters ' ', '.', '/', '"', ''', and '&'.
Occurrences of these characters are assumed to be representable
using the standard xml '&xx;' notation.
IDASCII = [0-9a-zA-Z!#$%()*+,:;<=>?@[\\\]\\^_`|{}~-]
The numeric classes: integer and float
INTEGER = {INT}|{UINT}|{HEXINT}
INT = [+-][0-9]+{INTTYPE}?
UINT = [0-9]+{INTTYPE}?
HEXINT = {HEXSTRING}{INTTYPE}?
INTTYPE = ([BbSsLl]|"ll"|"LL")
HEXSTRING = (0[xX]{HEXCHAR}+)
FLOAT = ({MANTISSA}{EXPONENT}?)|{NANINF}
EXPONENT = ([eE][+-]?[0-9]+)
MANTISSA = [+-]?[0-9]*\.[0-9]*
NANINF = (-?inf|nan|NaN)
The Character classes
STRING = ([^"\&]|{XMLESCAPE})*
CHARACTER = ([^'\&]|{XMLESCAPE})
Note that the character type only supports ASCII characters because
it can only hold a single 8-bit byte.
The Identifier class
ID = {IDCHAR}+
IDCHAR = ({IDASCII}|{XMLESCAPE}|{UTF8})
XMLESCAPE = &x{HEXCHAR}{HEXCHAR};
Note that the above lexical element classes are not
disjoint. For example, the sequence of characters 1234 can
be either an identifier, a float, or an integer. So the order
of testing is assumed to be this.
- INTEGER
- FLOAT
- ID
- STRING
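The testing order above can be sketched as a small Python classifier. This is a simplified illustration under stated assumptions, not a reference implementation: the patterns are my transcription of the definitions into Python's re syntax, the {UTF8} alternative of IDCHAR is omitted, HEXCHAR is taken to mean hexadecimal digits, the identifier class excludes the apostrophe as the prose describes, and STRING is read as excluding only the double quote and the ampersand. The name classify is hypothetical.

```python
import re

HEXCHAR = r"[0-9a-fA-F]"
INTTYPE = r"(?:ll|LL|[BbSsLl])"
XMLESCAPE = r"&x" + HEXCHAR + HEXCHAR + ";"
# ASCII-only identifier characters; the {UTF8} alternative is omitted here.
IDASCII = r"[0-9a-zA-Z!#$%()*+,:;<=>?@\[\\\]^_`|{}~-]"

INTEGER = rf"(?:[+-][0-9]+{INTTYPE}?|[0-9]+{INTTYPE}?|0[xX]{HEXCHAR}+{INTTYPE}?)"
FLOAT = r"(?:[+-]?[0-9]*\.[0-9]*(?:[eE][+-]?[0-9]+)?|-?inf|nan|NaN)"
ID = rf"(?:{IDASCII}|{XMLESCAPE})+"
STRING = rf'(?:[^"&]|{XMLESCAPE})*'

# The classes overlap, so the first rule that matches the whole text wins.
RULES = [("INTEGER", INTEGER), ("FLOAT", FLOAT), ("ID", ID), ("STRING", STRING)]

def classify(text):
    """Return the name of the first lexical class matching all of `text`."""
    for name, pattern in RULES:
        if re.fullmatch(pattern, text):
            return name
    return None

for sample in ["1234", "0x1F", "1.5e3", "nan", "temp_max", "sea level"]:
    print(sample, classify(sample))
```

Note that "nan" is reported as FLOAT even though it also matches ID, which is exactly the effect of testing FLOAT before ID.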
UTF-8 Character Encodings
We discuss UTF-8 character encoding in the context
of this document:
http://www.w3.org/2005/03/23-lex-U.
The most correct (validating) version of the UTF8 character set is as follows.
UTF8 = ([\xC2-\xDF][\x80-\xBF])
| (\xE0[\xA0-\xBF][\x80-\xBF])
| ([\xE1-\xEC][\x80-\xBF][\x80-\xBF])
| (\xED[\x80-\x9F][\x80-\xBF])
| ([\xEE-\xEF][\x80-\xBF][\x80-\xBF])
| (\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF])
| ([\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF])
| (\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF])
The lines of the expression cover the UTF8 characters as follows:
- non-overlong 2-byte
- excluding overlongs
- straight 3-byte
- excluding surrogates
- straight 3-byte
- planes 1-3
- planes 4-15
- plane 16
Note that ASCII and control characters are not included.
The above reference also defines some alternative regular expressions.
The most relaxed version of UTF8 is this.
UTF8 = ([\xC0-\xDF].)
|([\xE0-\xEF]..)
|([\xF0-\xF7]...)
The partially relaxed version of UTF8 is this.
UTF8 = ([\xC0-\xDF][\x80-\xBF])
| ([\xE0-\xEF][\x80-\xBF][\x80-\xBF])
| ([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])
We deem it acceptable to use this last relaxed expression
for validating UTF-8 character strings.
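For comparison, the validating expression given earlier in this section can be exercised directly with Python's bytes regular expressions. This is a sketch for experimentation only; the names UTF8_STRICT and is_utf8_char are hypothetical, and the pattern matches exactly one multi-byte character, as the {UTF8} class does.

```python
import re

# One alternative per line of the validating expression, in the same order.
UTF8_STRICT = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"                         # non-overlong 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"                    # excluding overlongs
    rb"|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]"             # straight 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"                    # excluding surrogates
    rb"|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]"             # straight 3-byte
    rb"|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]"         # planes 1-3
    rb"|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]"  # planes 4-15
    rb"|\xF4[\x80-\x8F][\x80-\xBF][\x80-\xBF]"         # plane 16
)

def is_utf8_char(data: bytes) -> bool:
    """True if `data` is exactly one well-formed multi-byte UTF-8 character."""
    return UTF8_STRICT.fullmatch(data) is not None

print(is_utf8_char("\u00e9".encode("utf-8")))      # 2-byte character
print(is_utf8_char("\U0001F600".encode("utf-8")))  # 4-byte character, plane 1
print(is_utf8_char(b"\xc0\xaf"))                   # overlong encoding of "/"
print(is_utf8_char(b"\xed\xa0\x80"))               # encoded surrogate
```

The last two inputs are the kind of byte sequences that the relaxed expressions accept and the validating expression rejects.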
DAP4 Commentary: DAP4 Grammar
27 March 2012
[Version: 1.0]
At the end of this document are instructions for accessing and
testing a formal grammar for the DAP4 DDX using the Relax-NG
schema language. I constructed it initially without any reference to any
other explicit or implicit grammars so I could record my ideas. I
have since modified it based on James' implied grammar,
on comments from others, and on
a comparison with the xsd grammar.
Differences with DAP4 xsd Grammar
I converted the
xsd grammar
to an equivalent
relax-ng (rng) grammar.
One major difference I see is in dimension handling.
- I just used the name "dimension" rather
than "shareddimension". For me, all dimensions (except anonymous
ones) are shared.
- The xsd separates out scalars from arrays. I always allowed
the dimensions for a variable to be optional to handle the scalar
case.
- I attempted to be as consistent as possible, so I allowed
any type including sequences and structures to be dimensioned.
(but see
previous commentary).
- The dimensions of a variable are currently specified in the
rng grammar as a sequence of elements named "Dimension" contained
in the "variables" element type.
Other differences:
Testing the Relax-NG Grammar
You will need to copy three files:
- dap4.rng - this is the grammar file. It uses the
Relax-NG schema language.
This grammar file can be obtained from
http://dl.dropbox.com/u/53929684/dap4.rng.
- test.xml - this is a test file that I am growing to cover the whole grammar.
This can be obtained from
http://dl.dropbox.com/u/53929684/test.xml.
- jing.jar - Jing is a validator that takes the grammar and a
test file and checks that the test file conforms to the
grammar. This can be obtained from
http://dl.dropbox.com/u/53929684/jing.jar.
To use this jar file, run the command:
java -jar jing.jar dap4.rng test.xml
If the validation succeeds, no output is produced; otherwise,
error messages are produced.
DAP4 Commentary: Possible Notation for Server Commands
27 March 2012
Looking to the future, it is clear that our query language (or,
more generically, the URL Annotations of our previous discussion)
must eventually encompass three classes of computations.
- Queries in the DAP2 sense,
- Commands to control the client-side processing of requests on the server
(i.e. things like caching),
- Server-side processing.
I want to propose a notation for everything in the URL after
the "?". I think this notation has the ability to represent a wide
variety of features without, I hope, being too generic.
The notation is basically nested functions combined with single
assignment variables. A semantically nonsensical, but
grammatical example would look something like this:
"?_x=f(17,g(h(12))),f2(_x,[0:3:10])".
Everything past the "?" is in the form of a comma separated
list of nested function invocations. Anything that begins with an
underscore is considered a local, temporary variable. Anything
that does not look like a function call (i.e. that is not a name followed
immediately by a left parenthesis) is assumed to be a string constant. Each
function has an arbitrary number of argument expressions
separated by commas.
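To make the proposed notation concrete, here is a minimal recursive-descent parser sketch in Python. It is only my reading of the description above and not a settled grammar: the function name parse, the tuple-based tree shape, and the decision to treat any run of characters other than "(", ")", "," and "=" as a single name or constant are all assumptions, and whitespace is not handled.

```python
import re

PUNCTUATION = {"(", ")", ",", "="}
TOKEN = re.compile(r"[(),=]|[^(),=]+")

def parse(query):
    """Parse the text after "?" into a list of tuple-based syntax trees."""
    if query.startswith("?"):
        query = query[1:]
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected!r}, found {token!r}")
        pos += 1
        return token

    def term():
        word = take()
        if word in PUNCTUATION:
            raise ValueError(f"unexpected {word!r}")
        if peek() == "(":  # a name followed immediately by "(" is a call
            take("(")
            args = []
            if peek() != ")":
                args.append(term())
                while peek() == ",":
                    take(",")
                    args.append(term())
            take(")")
            return ("call", word, args)
        if word.startswith("_"):  # leading underscore: temporary variable
            return ("var", word)
        return ("const", word)  # everything else is a string constant

    def statement():
        is_assignment = (pos + 1 < len(tokens)
                         and tokens[pos].startswith("_")
                         and tokens[pos + 1] == "=")
        if is_assignment:
            target = take()
            take("=")
            return ("assign", target, term())
        return term()

    result = [statement()]
    while peek() == ",":
        take(",")
        result.append(statement())
    if peek() is not None:
        raise ValueError(f"unexpected {peek()!r}")
    return result

for tree in parse("?_x=f(17,g(h(12))),f2(_x,[0:3:10])"):
    print(tree)
```

In this sketch the range [0:3:10] comes out as an ordinary string constant, which matches the rule that anything not shaped like a function call is a constant.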
There would be several semantic rules.
- A variable may only be assigned to once (single assignment),
but may be referenced as many times as desired after that.
- All functions have a defined "return type", which looks like
a legal DDX minus certain things like groups, enumeration
declarations, and dimension declarations. In addition, a function
may be defined to have a "void" return type, which means it is
executed for its side-effects on the server.
- Any expression that is not assigned to a variable and does
not have a void return type will have its return value returned
to the caller as part of a DATADDX.
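The single-assignment rule in the list above can be checked mechanically. The following Python sketch is one possible check, assuming (my interpretation) that a variable must be assigned in an earlier top-level expression before it is referenced; the names split_top_level and check_single_assignment are hypothetical.

```python
import re

# A variable is an underscore-led name that does not continue another name.
VARIABLE = re.compile(r"(?<![A-Za-z0-9_])_[A-Za-z0-9_]*")

def split_top_level(query):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for index, char in enumerate(query):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 0:
            parts.append(query[start:index])
            start = index + 1
    parts.append(query[start:])
    return parts

def check_single_assignment(query):
    """Return the set of assigned variables, or raise ValueError."""
    assigned = set()
    for part in split_top_level(query):
        target, separator, body = part.partition("=")
        if separator and VARIABLE.fullmatch(target):
            expression = body
        else:
            target, expression = None, part
        for reference in VARIABLE.findall(expression):
            if reference not in assigned:
                raise ValueError(f"{reference} referenced before assignment")
        if target is not None:
            if target in assigned:
                raise ValueError(f"{target} assigned more than once")
            assigned.add(target)
    return assigned

print(check_single_assignment("_x=f(17,g(h(12))),f2(_x,[0:3:10])"))
```

Checking return types and the "void" case would need function signatures, which this sketch does not attempt to model.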
Notes
My hypothesis is that this notation should also be able to handle
most kinds of server side processing by defining and composing
functions.
The standard projection+selection constraints of DAP2 can be
represented using a special query() function whose argument is
the standard DAP2 constraint, or alternatively, one could define
a collection of nested functions to do the same thing, or
alternatively, we could split the query part into two pieces
separated by a semicolon. The first piece would be a constraint
expression and the second piece (after the semicolon) would be in
the nested function call form defined above.
An important aspect has to do with the construction of what may
be referred to as a DATADDX. It defines the structure of a DDX
that is the composition of the return types of the invoked
functions that will return a (possibly structured) value. I need
to work this out. But in any case, the resulting DATADDX may
have only a loose relation to any DDX representing the raw
dataset. This is because server-side computations will not have
been represented in the original DDX, but only in the DATADDX.
I also hypothesize that Ferret notations
http://.../thredds/dodsC/hfrnet/agg/6km_expr_{}{let deq1ubar=u[d=1,l=1:24@ave]}
could be represented in my proposed
function notation without having to clutter up the URL format.
DAP4 Commentary: Sequences and Vlens
27 March 2012
"Oh what a tangled web we weave
when first we practice to build a type system."
Recently, I made the claim to James that the Sequence construct
could serve to represent CDM/HDF5 vlen constructs.
His reply was as follows:
In my opinion we will want vlens to be in the data model, if not
as a type, then as a feature of arrays - see the schema and my
text on the Data Model page. The reason for that [is that] I have
already tried using Sequence for vlen (in hdf4) and it was a
failure. Not a failure from the technical POV but because neither
server writers nor client writers nor client users ever 'got it.'
I think one part of the problem for those people was that vlens
in HDF4 do not support relational operators via the API while
Sequences are supposed to. So the conceptual mismatch doomed the idea.
This is a very important observation, and has caused me to
re-think how sequences and vlens should be handled in DAP4.
Vlens
Let us start by addressing the addition of vlens to the DAP4 data model.
Some possible ways to insert vlens into the dap4 data model
include the following.
- In CDM, a vlen is marked by a "*" as a dimension name in the
set of dimensions associated with a variable. The list of
dimensions is allowed to have any number of occurrences
of "*" [Aside, I will note that "unlimited" is also an option
for CDM, but should not be needed for DAP4 because at the time of
a request for data, the size of the unlimited dimension is known].
- James has proposed something similar except that the "*" is
restricted to occurring as the last dimension.
- Another possibility is to create a new container object,
call it Vlen, that (like Sequence) is inherently of variable
length and is not dimensionable. [Aside: The term "vlen" is kind
of odd. The term "list" would actually make more sense.]
If we were to choose option 2 (terminal * only),
then we must address the question
of translation between CDM and DAP4. Going from DAP4 to CDM is
straightforward because having the last dimension be "*" is legal
in CDM. Going from CDM to DAP4 requires the introduction of a
number of Structure elements.
Consider the following CDM example (using a pseudo-syntax)
Int32 v[d1][*][d2][*][d3].
This would have to be represented in DAP4 something like this.
Structure v_1 {
Structure v_2 {
Int32 v[d3];
}[d2][*];
}[d1][*];
In the lucky case that the last dimension is already a "*":
Int32 v[d1][*][d2][*];
then we have a simpler representation.
Structure v_1 {
Int32 v[d2][*];
}[d1][*];
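The two translations above follow a mechanical pattern: cut the dimension list after each "*", give the innermost variable the last piece, and wrap it in one Structure per remaining piece. A minimal Python sketch of that pattern follows. The function name cdm_to_dap4 is hypothetical, the output uses the same pseudo-syntax as the examples (with indentation added for readability), and the v_1, v_2 naming simply mirrors the examples rather than any agreed convention.

```python
def cdm_to_dap4(base_type, name, dims):
    """Render a CDM variable so that "*" only ever appears as a last dimension."""
    groups, current = [], []
    for dim in dims:
        current.append(dim)
        if dim == "*":  # a vlen dimension closes the current group
            groups.append(current)
            current = []
    if current or not groups:
        groups.append(current)  # trailing fixed dimensions, or a scalar

    *outer, inner = groups
    lines = []
    for depth in range(len(outer)):
        lines.append("  " * depth + f"Structure {name}_{depth + 1} {{")
    suffix = "".join(f"[{dim}]" for dim in inner)
    lines.append("  " * len(outer) + f"{base_type} {name}{suffix};")
    for depth in range(len(outer) - 1, -1, -1):
        suffix = "".join(f"[{dim}]" for dim in outer[depth])
        lines.append("  " * depth + f"}}{suffix};")
    return "\n".join(lines)

print(cdm_to_dap4("Int32", "v", ["d1", "*", "d2", "*", "d3"]))
print(cdm_to_dap4("Int32", "v", ["d1", "*", "d2", "*"]))
```

A variable with no "*" at all passes through unchanged, which is the case where CDM and DAP4 already agree.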
Commentary
- As a personal matter, I would prefer to use the CDM
representation. Adding a new container type (Vlen), while
appealing semantically, only complicates the model more. James'
approach requires the use of additional Structure definitions
which, in my opinion, obscures the underlying semantics.
- One thing that I need to check is how this affects the
proposed on-the-wire format.
Sequences
Nathan Potter noted the following.
At one time we considered using "nested" sequences as a
representation of an SQL database, where essentially the keys
that link the tables in the DB define the structure of some
nested sequence thing. I think it's a useful idea, but there is
no implementation of it to play with. [Aside: I could swear that
someone told me that they actually used a relational database as
the back end to a sequence]
And later Nathan Potter noted the following:
I mentioned (in the same email that contained the nested sequence
comment quoted above) that we wrote and released a server that
represents a single database table or view as a single
sequence. This is quite different from the point that I was
making in the section quoted above that there may be a use case
for nested sequences. (which was in response to your
question "Are there any other legitimate examples for using
NESTED sequences.") [Ndp 12:08, 26 February 2012 (PST)]
It is this ability to have selection constraints applied that
separates sequences from vlens. It is clear that in the absence
of selection constraints, there is no essential difference
between sequences and vlens. Further, in the few examples I could
find or were sent to me, it seemed that nested sequences were
being used as, in effect, vlens.
So, we seem to have two very similar concepts (vlen and
sequence), which complicates the DAP4 model. The question for me is:
Do we get rid of the Sequence concept, or at least define it as equivalent to the following?: Structure {...} [*].
My current belief is that we should keep sequences but with the following restrictions:
- Sequences can only occur as top-level, scalar, variables within a group.
- Sequences may not be nested in any other container (i.e. other sequences or structures)
This keeps sequences for the original purpose of acting
as "relations". All other places where we might previously have used
a sequence will now use a vlen.
As with vlens, the translation between DAP4 and CDM needs to be addressed.
- The conversion from DAP4 to CDM can be addressed using the
rule above, namely that a sequence is, in CDM, represented as
Structure {...} [*].
- Translation from CDM to DAP4 allows for the option of never
using sequences, but always using vlens. An alternate translation
might be to say that if you have a top-level CDM structure whose
only dimension is a vlen, then translate that to a sequence (in
effect inverting the DAP4->CDM translation).