DAP4 Commentary: Sequences and Vlens
27 March 2012
"Oh what a tangled web we weave
when first we practice to build a type system."
Recently, I made the claim to James that the Sequence construct
could serve to represent CDM/HDF5 vlen constructs.
His reply was as follows:
In my opinion we will want vlens to be in the data model, if not
as a type, then as a feature of arrays - see the schema and my
text on the Data Model page. The reason for that [is that] I have
already tried using Sequence for vlen (in hdf4) and it was a
failure. Not a failure from the technical POV but because neither
server writers nor client writers nor client users ever 'got it.'
I think one part of the problem for those people was that vlens
in HDF4 do not support relational operators via the API while
Sequences are supposed to. So the conceptual mismatch doomed the idea.
This is a very important observation, and has caused me to
re-think how sequences and vlens should be handled in DAP4.
Vlens
Let us start by addressing the addition of vlens to the DAP4 data model.
Some possible ways to insert vlens into the DAP4 data model
include the following.
- In CDM, a vlen is marked by a "*" as a dimension name in the
set of dimensions associated with a variable. The list of
dimensions is allowed to have any number of occurrences
of "*". [Aside: I will note that "unlimited" is also an option
for CDM, but it should not be needed for DAP4 because at the time of
a request for data, the size of the unlimited dimension is known.]
- James has proposed something similar except that the "*" is
restricted to occurring as the last dimension.
- Another possibility is to create a new container object,
call it Vlen, that (like Sequence) is inherently of variable
length and is not dimensionable. [Aside: The term "vlen" is kind
of odd. The term "list" would actually make more sense.]
If we were to choose option 2 (terminal * only),
then we must address the question
of translation between CDM and DAP4. Going from DAP4 to CDM is
straightforward because having the last dimension be "*" is legal
in CDM. Going from CDM to DAP4 requires the introduction of a
number of Structure elements.
Consider the following CDM example (using a pseudo-syntax):
Int32 v[d1][*][d2][*][d3];
This would have to be represented in DAP4 something like this:
Structure v_1 {
  Structure v_2 {
    Int32 v[d3];
  }[d2][*];
}[d1][*];
In the lucky case that the last dimension is already a "*":
Int32 v[d1][*][d2][*];
then we have a simpler representation:
Structure v_1 {
  Int32 v[d2][*];
}[d1][*];
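The translation illustrated by these two examples can be sketched in Python. This is only a sketch of the rule as I read it from the examples: the function names split_at_vlens and cdm_to_dap4 are hypothetical, and the v_1, v_2 wrapper names simply follow the pseudo-syntax used above.

```python
def split_at_vlens(dims):
    """Split a dimension list into chunks, each chunk ending at a "*" (vlen).

    A trailing run of dimensions with no "*" becomes the final chunk.
    """
    chunks, current = [], []
    for dim in dims:
        current.append(dim)
        if dim == "*":
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def cdm_to_dap4(base_type, name, dims):
    """Render a CDM variable so that "*" only occurs as a last dimension."""
    chunks = split_at_vlens(dims)
    if not chunks:
        return f"{base_type} {name};"

    def fmt(chunk):
        return "".join(f"[{dim}]" for dim in chunk)

    # The last chunk stays on the variable itself; every earlier chunk
    # becomes the dimensions of a wrapping Structure, outermost first.
    *outer, inner = chunks
    lines = []
    for depth in range(len(outer)):
        lines.append("  " * depth + f"Structure {name}_{depth + 1} {{")
    lines.append("  " * len(outer) + f"{base_type} {name}{fmt(inner)};")
    for depth in reversed(range(len(outer))):
        lines.append("  " * depth + "}" + fmt(outer[depth]) + ";")
    return "\n".join(lines)


print(cdm_to_dap4("Int32", "v", ["d1", "*", "d2", "*", "d3"]))
```

Each interior "*" costs one extra Structure, which is the source of the clutter discussed in the commentary below.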
Commentary
- As a personal matter, I would prefer to use the CDM
representation. Adding a new container type (Vlen), while
appealing semantically, only complicates the model more. James'
approach requires the use of additional Structure definitions
which, in my opinion, obscure the underlying semantics.
- One thing that I need to check is how this affects the
proposed on-the-wire format.
Sequences
Nathan Potter noted the following.
At one time we considered using ''nested'' sequences as a
representation of an SQL database, where essentially the keys
that link the tables in the DB define the structure of some
nested sequence thing. I think it's a useful idea, but there is
no implementation of it to play with. [Aside: I could swear that
someone told me that they actually used a relational database as
the back end to a sequence]
And later Nathan Potter noted the following:
I mentioned (in the same email that contained the nested sequence
comment quoted above) that we wrote and released a server that
represents a single database table or view as a single
sequence. This is quite different from the point that I was
making in the section quoted above that there may be a use case
for nested sequences. (which was in response to your
question "Are there any other legitimate examples for using
NESTED sequences.") [Ndp 12:08, 26 February 2012 (PST)]
It is this ability to have selection constraints applied that
separates sequences from vlens. It is clear that in the absence
of selection constraints, there is no essential difference
between sequences and vlens. Further, in the few examples I could
find or were sent to me, it seemed that nested sequences were
being used as, in effect, vlens.
So, we seem to have two very similar concepts (vlen and
sequence), which complicates the DAP4 model. The question for me is:
Do we get rid of the Sequence concept, or at least define it as equivalent to Structure {...} [*]?
My current belief is that we should keep sequences but with the following restrictions:
- Sequences can only occur as top-level, scalar variables within a group.
- Sequences may not be nested in any other container (i.e. other sequences or structures).
This keeps sequences for the original purpose of acting
as "relations". All other places where we might previously have
used a sequence will now use a vlen.
As with vlens, the translation between DAP4 and CDM needs to be addressed.
- The conversion from DAP4 to CDM can be addressed using the
rule above, namely that a sequence is, in CDM, represented as
Structure {...} [*].
- Translation from CDM to DAP4 allows for the option of never
using sequences, but always using vlens. An alternate translation
might be to say that if you have a top-level CDM structure whose
only dimension is a vlen, then translate that to a sequence (in
effect inverting the DAP4->CDM translation).
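The alternate CDM-to-DAP4 translation can be written as a one-line test. The sketch below is hypothetical throughout: CdmVar is a stand-in record I made up for illustration, not a class from any CDM or DAP4 library.

```python
from dataclasses import dataclass, field


@dataclass
class CdmVar:
    """Hypothetical stand-in for a CDM variable declaration."""
    name: str
    kind: str                      # "Structure" or an atomic type such as "Int32"
    dims: list = field(default_factory=list)
    top_level: bool = True         # declared directly in a group, not nested


def dap4_kind(var):
    """Pick the DAP4 construct under the alternate translation above.

    Only a top-level Structure whose single dimension is a vlen becomes a
    Sequence; everything else keeps its kind and uses vlen dimensions.
    """
    if var.kind == "Structure" and var.top_level and var.dims == ["*"]:
        return "Sequence"
    return var.kind


print(dap4_kind(CdmVar("obs", "Structure", ["*"])))
```

A nested Structure {...} [*] stays a Structure, which matches the restriction that sequences may not appear inside other containers.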
DAP4 Commentary: Characterization of URL Annotations
27 March 2012
Characterization of URL Annotations
Requests for data using the DAP4 protocol will require a
significant number of annotations specifying what is to be
retrieved, commands to the server, and commands to the client.
This document is intended just to describe the information with
which URLs need to be annotated, based on past experience. It also
enumerates the possible URL components that can be used to encode
the annotations. I will consider a specific encoding in a
separate document.
Looking at the DAP2 URLs, we see, in addition to the protocol
indication, three classes of annotations: server commands, client
commands, and queries (aka constraints).
Protocol
For DAP2, the fact that the DAP2 protocol is being used is inferred
from context. In netcdf-C, for example, the fact that the
dataset name is a URL is sufficient to indicate the use of the
DAP2 protocol (although that will change). For some servers,
such as TDS, the protocol is also inferrable from elements of the
URL path. For example, in the URL
http://.../thredds/dodsC/...
the "dodsC" indicates the use of the DAP2 protocol. TDS also
supports a scheme called "dods:" that also indicates the use of DAP2.
Server Commands
Server commands in DAP2 are appended to the dataset URL to
indicate attributes of the request. For example:
http://test.opendap.org/dataset.nc.dds
The defined kinds of server commands for DAP2 are as follows:
- Component requests: ".dds", ".das"
- Data requests with format: ".dods", ".asc"
- Miscellaneous: ".html", ".ver"
Client commands
Client commands are interpreted by the client-side library to
specify actions to be performed by the library. The existence of
client commands is important because we want to communicate from
the user to the library without requiring any knowledge by
intermediate code layers. For example, netcdf C tools such as
ncdump send URLs to the underlying netcdf DAP2 library without
having to be cognizant of their structure.
Currently, the primary use for client commands is caching, to
indicate the degree of caching and prefetch to be used with a
given request to the server.
Currently, client commands are represented as "name=value" pairs
or just "name" enclosed in square brackets: "[nocache]", for
example. These commands are prefixed to the URL, as in this example.
[show=fetch]http://test.opendap.org/dataset.nc
The legal set of client commands is client library specific.
One notable problem with this form of client command is that it
prevents generic URL parsers from parsing the URL because, of
course, the square bracket notation is non-standard.
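As an illustration, here is a minimal Python sketch of what a client library has to do before it can hand such a URL to a generic parser. The function name split_client_commands is hypothetical, and this is not how any particular DAP2 client is implemented; it only assumes the bracket syntax described above.

```python
import re
from urllib.parse import urlsplit

# One leading "[name]" or "[name=value]" group.
_COMMAND = re.compile(r"\[([^\]]*)\]")


def split_client_commands(text):
    """Peel bracketed client commands off the front of a URL.

    Returns (commands, url), where commands maps each name to its value,
    or to None for a bare "[name]".
    """
    commands = {}
    pos = 0
    while True:
        match = _COMMAND.match(text, pos)
        if match is None:
            break
        name, sep, value = match.group(1).partition("=")
        commands[name] = value if sep else None
        pos = match.end()
    return commands, text[pos:]


commands, url = split_client_commands(
    "[show=fetch][nocache]http://test.opendap.org/dataset.nc"
)
print(commands)
print(urlsplit(url).netloc)
```

Only after the prefix has been stripped does urlsplit see a scheme and a host, which is why every layer that wants to inspect the URL must know about the bracket convention.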
It should be noted that an alternative to using client commands
in the URL is to use a configuration file (often referred to as
the ".rc" file such as ".dodsrc"). This configuration file is
assumed to be either in the caller's home directory or in the
current working directory. It contains the necessary client
commands to be applied. It is mildly less convenient for the user
to use a .rc file than to embed a client command in the URL.
Queries
The third class of URL annotations specifies some form of query
to control the information to be extracted from a dataset on the
server. This information is then passed back to the client.
In DAP2, queries consisted of projections and selections
specifying a subset of the data in a dataset.
A projection represents a path through the DDS parse tree
annotated with constraints on the dimensions, as in this query:
"?P1.P2[0:2:10].F[1:3][4:4]".
A selection represents a boolean expression to be applied to the
records of a sequence. Syntactically, a selection could cross
sequences, thus implying a join of the sequences, but in practice
this is not allowed.
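To make the projection syntax concrete, the following Python sketch parses the example projection into path segments and index ranges. The function names are hypothetical, and the sketch assumes the DAP2 convention that a three-part index reads start:stride:stop and a two-part index reads start:stop. It handles only the simple form shown above, not functions or selections.

```python
import re

_SEGMENT = re.compile(r"([A-Za-z_]\w*)((?:\[[^\]]*\])*)")
_INDEX = re.compile(r"\[([^\]]*)\]")


def parse_index(text):
    """Turn "i", "start:stop" or "start:stride:stop" into (start, stride, stop)."""
    parts = [int(part) for part in text.split(":")]
    if len(parts) == 1:
        return (parts[0], 1, parts[0])
    if len(parts) == 2:
        return (parts[0], 1, parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"bad index: {text!r}")


def parse_projection(projection):
    """Parse a projection into a list of (name, [(start, stride, stop), ...])."""
    path = []
    for segment in projection.lstrip("?").split("."):
        match = _SEGMENT.fullmatch(segment)
        if match is None:
            raise ValueError(f"bad path segment: {segment!r}")
        name, indices = match.groups()
        path.append((name, [parse_index(i) for i in _INDEX.findall(indices)]))
    return path


print(parse_projection("?P1.P2[0:2:10].F[1:3][4:4]"))
```

The result is a path P1, P2, F through the DDS, with one index range per constrained dimension attached to the segment it applies to.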
DAP2 queries also allowed the use of functions in the projections
and selections to compute, for example, sums or averages, but the
semantics were never very well defined. The set of allowable
functions is server dependent.
Annotation Mechanisms
DAP4 will need to support at least the three classes of
annotations described above. Whatever annotation mechanisms are
chosen, the following properties seem desirable.
- The resulting URL should be parseable by generic URL parsers.
=> Client commands should be embedded at the end of URLs, not the beginning.
- Whatever annotation encoding is used, it should be as
uniform as possible.
As mechanisms, we have the following available to us:
- The URL scheme -- "http:" for example, or the TDS "dods:"
scheme. Using this is somewhat undesirable because it would also
need to encode an underlying encrypted protocol such as
https: (versus http:).
- URL path elements such as the current use of
e.g. http://host/../dodsC/... by TDS.
- URL query -- everything after the first '?' in the URL. URL
queries technically have a defined form as name=value pairs, but
in practice are pretty much free form.
- URL fragment -- everything after the last '#'. Again these
are pretty much free form.
- Filename extensions -- everything between the data set name
in the path and the start of the query. The DAP2 ".dds"
and ".dods" are examples of this.
- Alternate extension formats. Ethan Davis has proposed
the use of a "+" notation instead of filename extensions:
"+ddx+ascii", for example. This has the advantage of clearly not
being confused with filename extensions while also making clear
the additive nature of such annotation.
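The mechanisms listed above correspond to the pieces that a generic URL parser hands back. The Python sketch below shows where each one lands, using the standard urllib.parse module. The example URL is made up for illustration; in particular, putting a client command in the fragment is only one hypothetical placement, not a proposed encoding.

```python
import posixpath
from urllib.parse import urlsplit


def annotation_slots(url):
    """Return the URL components that are available for carrying annotations."""
    parts = urlsplit(url)
    # Treat the last ".xxx" of the path as the filename extension slot.
    dataset_path, extension = posixpath.splitext(parts.path)
    return {
        "scheme": parts.scheme,
        "path": dataset_path,
        "extension": extension,
        "query": parts.query,
        "fragment": parts.fragment,
    }


slots = annotation_slots(
    "http://test.opendap.org/dataset.nc.dds?P1.P2[0:2:10]#show=fetch"
)
for name, value in slots.items():
    print(f"{name:10} {value}")
```

Because every annotation sits in a standard component, the URL stays parseable without any DAP-specific preprocessing.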
I should note that the Ferret server has taken to seriously
abusing the URL format with URLs like this.
http://.../thredds/dodsC/hfrnet/agg/6km_expr_{}{let deq1ubar=u[d=1,l=1:24@ave]}
so we have much to aspire to :-)
DAP4 Commentary: The on-the-wire format
27 March 2012
Background
The current DAP2 clients use two different approaches to
managing the packet of data that is sent by the server.
The C++ libdap library uses what I will call an "eager"
evaluation method. By this I mean that the whole packet is
processed when received, is decomposed into its constituent
parts (e.g. data arrays, sequence records, etc) and those parts
are used to annotate the parsed DDS.
In contrast, the oc library uses a "lazy" evaluation method. That
is, the incoming packet is sent immediately into a file or into a
chunk of heap memory. Almost no preprocessing occurs. Data
extraction occurs only when requested by the user code through
the API.
Problem addressed
The relative merits and demerits of lazy versus eager are well
known and will not be repeated here.
Lazy evaluation of the DAP2 packet is hampered by the inlining of
variable length data: sequences and strings specifically. If it
were not for those, the lazy evaluator could compute directly the
location of the desired subset of data as requested by the user,
and do so without having to read any intermediate information.
But when, for example, Strings are inlined, then it is necessary
to walk the packet piece by piece to step over the strings.
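The difference can be seen in a small Python sketch. It assumes an XDR-style layout (big-endian 4-byte integers, and strings stored as a 4-byte length followed by the bytes, padded to a multiple of 4); the helper names are hypothetical and the layout is a simplification, not the exact DAP2 packet.

```python
import struct


def pack_strings(strings):
    """Inline strings as: 4-byte length, utf-8 bytes, zero padding to 4 bytes."""
    out = bytearray()
    for text in strings:
        data = text.encode("utf-8")
        out += struct.pack(">I", len(data)) + data + b"\x00" * (-len(data) % 4)
    return bytes(out)


def int32_at(buffer, index):
    """Fixed-size elements: the offset is computed directly."""
    return struct.unpack_from(">i", buffer, 4 * index)[0]


def string_at(buffer, index):
    """Inlined strings: every earlier string must be stepped over first."""
    pos = 0
    for _ in range(index):
        (length,) = struct.unpack_from(">I", buffer, pos)
        pos += 4 + length + (-length % 4)
    (length,) = struct.unpack_from(">I", buffer, pos)
    return buffer[pos + 4:pos + 4 + length].decode("utf-8")


ints = struct.pack(">4i", 10, 20, 30, 40)
strs = pack_strings(["a", "bcdef", "gh"])
print(int32_at(ints, 2), string_at(strs, 2))
```

Reaching the third integer takes one multiplication; reaching the third string takes two length reads first, and the cost grows with the index.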
I plan to use lazy evaluation for my implementations of DAP4, and
propose here the outline of a format for the on-the-wire data
packet that makes lazy operation fast and simple without, I
believe, interfering with eager evaluation.
Proposed solution
Since we have previously agreed on the use of multipart-mime, the incoming
data is presumed to be a sequence of variable length packets with a
known length (for each packet) and a unique id for each packet.
Under these assumptions, I propose the following format.
- The initial packet is of known computable length, aka "fixed
length" for short. That is, its size can be computed solely
knowing the DXD for the incoming data. This means that strings
and sequences are not represented inline, but instead are
represented by fixed-size "pointers" into other, following
packets that contain the sequence and/or string data.
- Each element in a string array in the initial packet is represented
by three pieces of fixed size info:
  (a) the unique id of the packet containing the contents of the string.
  (b) the offset in the packet defined in (a).
  (c) the length of the string in bytes (assuming utf-8 encoding).
- As an optimization, the string packet can be directly
appended to the fixed size initial packet, in which case, the
first item is not strictly necessary.
- Given a sequence object, either as a scalar or as an array of
sequences, each sequence is replaced by the following fixed size item:
  (a) the unique id of the packet containing the sequence records.
- Further, each record of the sequence packet is assumed to
be "fixed length" by applying the rules above. This means that
knowing the total size of the packet containing the sequence
records, it is possible to know the exact number of records in
the packet without actually having to walk the sequence packet to
count them.
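The format outlined above can be sketched in Python as follows. The field widths (a 4-byte packet id, an 8-byte offset and a 4-byte length) and all function names are assumptions made for the sake of the example; the proposal itself only requires that each item be of fixed size.

```python
import struct

# Assumed widths: packet id (4 bytes), offset (8 bytes), byte length (4 bytes).
STRING_REF = struct.Struct(">IQI")


def encode_string_array(strings, packet_id):
    """Return (fixed-size reference block, string packet body)."""
    refs, body = bytearray(), bytearray()
    for text in strings:
        data = text.encode("utf-8")
        refs += STRING_REF.pack(packet_id, len(body), len(data))
        body += data
    return bytes(refs), bytes(body)


def read_string(initial, packets, index):
    """Fetch one string without touching any of the others."""
    packet_id, offset, length = STRING_REF.unpack_from(
        initial, index * STRING_REF.size
    )
    return packets[packet_id][offset:offset + length].decode("utf-8")


def record_count(packet_size, record_size):
    """Number of fixed-length records in a sequence packet."""
    count, remainder = divmod(packet_size, record_size)
    if remainder:
        raise ValueError("packet size is not a multiple of the record size")
    return count


initial, body = encode_string_array(["alpha", "\u03b2", "gamma"], packet_id=1)
print(read_string(initial, {1: body}, 2), record_count(120, 12))
```

The reference block has a size that depends only on the number of elements, so the position of any element in the initial packet can be computed from the DXD alone.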
Rationale for the solution
The above representation makes lazy evaluation very simple, and a
given item in a packet can be reached in O(1) time. Even with the
case of nested sequences/vlens, the proper item can be reached in
O(log n) time where n is the depth of the nesting.
The cost is that a hash map is needed to map unique id's to
offsets in the file or heap memory.
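A minimal Python sketch of such a map follows. The framing used here (a 4-byte id and a 4-byte length in front of each packet) is purely an assumption so that the example is self-contained; the real framing would come from the multipart-mime container.

```python
import struct

# Assumed framing for this sketch only: packet id, payload length.
HEADER = struct.Struct(">II")


def frame(packets):
    """Concatenate (id, payload) pairs into one buffer."""
    out = bytearray()
    for packet_id, payload in packets:
        out += HEADER.pack(packet_id, len(payload)) + payload
    return bytes(out)


def build_index(buffer):
    """Map each packet id to (payload offset, payload length) in the buffer."""
    index = {}
    pos = 0
    while pos < len(buffer):
        packet_id, length = HEADER.unpack_from(buffer, pos)
        pos += HEADER.size
        index[packet_id] = (pos, length)
        pos += length
    return index


buffer = frame([(7, b"abc"), (9, b"hello")])
index = build_index(buffer)
offset, length = index[9]
print(buffer[offset:offset + length])
```

The index is built in one pass over the packet headers as the data arrives; after that, following a pointer from the initial packet is a single dictionary lookup.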
The lazy versus eager cases also apply on the server
side. Currently, for example, the opendap code on the thredds
server takes the underlying data source (.nc file for example),
converts it to DAP2 and annotates the DDS with the data. Then as
a second pass, the annotations are converted as needed and sent out
over the wire.
A lazy version would associate elements of the underlying source
with the DDS. Transfer of the data to the wire would then occur
directly from the original source to the wire format as needed.
As an aside, I have an (untested and unverified)
hypothesis that the proposed encoding will also simplify
the use of lazy evaluation on the server side.
Updates
- 2012-02-20
- The above encoding has as one consequence that
all embedded counts that currently exist in DAP2 are
superfluous. Ditto for the sequence record markers. It may still
be desirable to include the counts for purposes of error
checking, but they are not strictly necessary.