"Oh what a tangled web we weave when first we practice to build a type system."
Recently, I made the claim to James that the Sequence construct could serve to represent CDM/HDF5 vlen constructs.
His reply was as follows:
In my opinion we will want vlens to be in the data model, if not as a type, then as a feature of arrays - see the schema and my text on the Data Model page. The reason for that [is that] I have already tried using Sequence for vlen (in hdf4) and it was a failure. Not a failure from the technical POV but because neither server writers nor client writers nor client users ever 'got it.' I think one part of the problem for those people was that vlens in HDF4 do not support relational operators via the API while Sequences are supposed to. So the conceptual mismatch doomed the idea.
This is a very important observation, and has caused me to re-think how sequences and vlens should be handled in DAP4.
Vlens
Let us start by addressing the addition of vlens to the DAP4 data model. Some possible ways to insert vlens into the DAP4 data model include the following.
- In CDM, a vlen is marked by a "*" as a dimension name in the set of dimensions associated with a variable. The list of dimensions is allowed to have any number of occurrences of "*". [Aside: "unlimited" is also an option for CDM, but it should not be needed for DAP4 because, at the time of a request for data, the size of the unlimited dimension is known.]
- James has proposed something similar except that the "*" is
restricted to occurring as the last dimension.
- Another possibility is to create a new container object, call it Vlen, that (like Sequence) is inherently of variable length and is not dimensionable. [Aside: The term "vlen" is kind of odd. The term "list" would actually make more sense.]
If we were to choose option 2 (terminal * only), then we must address the question of translation between CDM and DAP4. Going from DAP4 to CDM is straightforward because having the last dimension be "*" is legal in CDM. Going from CDM to DAP4 requires the introduction of a number of Structure elements.
Consider the following CDM example (using a pseudo-syntax):
Int32 v[d1][*][d2][*][d3];
This would have to be represented in DAP4 something like this:
Structure v_1 {
  Structure v_2 {
    Int32 v[d3];
  }[d2][*];
}[d1][*];
In the lucky case that the last dimension is already a "*":
Int32 v[d1][*][d2][*];
then we have a simpler representation:
Structure v_1 {
  Int32 v[d2][*];
}[d1][*];
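The CDM-to-DAP4 rewriting under option 2 can be sketched as a small routine: cut the dimension list after every "*", give the last group to the variable itself, and wrap the variable in one Structure per remaining group. This is only an illustrative sketch; the function names (`split_at_vlen`, `fmt_dims`, `cdm_to_dap4`) and the `v_1`, `v_2` naming scheme are assumptions taken from the examples above, not part of any DAP4 or CDM API.

```python
def split_at_vlen(dims):
    """Split a dimension list into groups, closing a group after each "*"."""
    groups, current = [], []
    for d in dims:
        current.append(d)
        if d == "*":
            groups.append(current)
            current = []
    if current:
        # Trailing fixed-size dimensions that follow the last "*".
        groups.append(current)
    return groups


def fmt_dims(group):
    """Render a group of dimension names as [d1][*]..."""
    return "".join(f"[{d}]" for d in group)


def cdm_to_dap4(base_type, name, dims):
    """Return a pseudo-syntax DAP4 declaration in which "*" only
    ever appears as the last dimension of a variable or Structure."""
    groups = split_at_vlen(dims)
    inner = groups.pop() if groups else []  # dimensions kept on the variable
    lines = []
    # Open one wrapping Structure per remaining group, outermost first.
    for depth in range(len(groups)):
        lines.append("  " * depth + f"Structure {name}_{depth + 1} {{")
    lines.append("  " * len(groups) + f"{base_type} {name}{fmt_dims(inner)};")
    # Close the Structures innermost first, attaching each group's dimensions.
    for depth in reversed(range(len(groups))):
        lines.append("  " * depth + f"}}{fmt_dims(groups[depth])};")
    return "\n".join(lines)


print(cdm_to_dap4("Int32", "v", ["d1", "*", "d2", "*", "d3"]))
```

Calling it with the dimensions `["d1", "*", "d2", "*"]` yields the simpler single-Structure form, since the last group already ends in "*".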
Commentary
- As a personal matter, I would prefer to use the CDM
representation. Adding a new container type (Vlen), while
appealing semantically, only complicates the model more. James'
approach requires the use of additional Structure definitions
which, in my opinion, obscures the underlying semantics.
- One thing that I need to check is how this affects the proposed on-the-wire format.
Sequences
Nathan Potter noted the following.
At one time we considered using nested sequences as a representation of an SQL database, where essentially the keys that link the tables in the DB define the structure of some nested sequence thing. I think it's a useful idea, but there is no implementation of it to play with.
[Aside: I could swear that someone told me that they actually used a relational database as the back end to a sequence.]
And later Nathan Potter noted the following:
I mentioned (in the same email that contained the nested sequence comment quoted above) that we wrote and released a server that represents a single database table or view as a single sequence. This is quite different from the point that I was making in the section quoted above that there may be a use case for nested sequences. (which was in response to your question "Are there any other legitimate examples for using NESTED sequences.") [Ndp 12:08, 26 February 2012 (PST)]
It is this ability to have selection constraints applied that separates sequences from vlens. It is clear that in the absence of selection constraints, there is no essential difference between sequences and vlens. Further, in the few examples that I could find or that were sent to me, it seemed that nested sequences were being used as, in effect, vlens.
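As a rough illustration of that distinction, the sketch below models both constructs as a variable-length list of records; the only difference is that the sequence reader accepts a selection constraint. The names `read_vlen` and `read_sequence`, and the sample records, are hypothetical and do not correspond to any DAP4 or HDF API.

```python
def read_vlen(records):
    """A vlen is read whole: no selection is possible."""
    return list(records)


def read_sequence(records, constraint=None):
    """A sequence may be read through a selection constraint.
    With no constraint it behaves exactly like a vlen."""
    if constraint is None:
        return list(records)
    return [r for r in records if constraint(r)]


# Hypothetical records, for illustration only.
records = [
    {"depth": 5, "temp": 20.1},
    {"depth": 50, "temp": 12.4},
    {"depth": 500, "temp": 4.0},
]

same = read_sequence(records) == read_vlen(records)
deep = read_sequence(records, lambda r: r["depth"] > 10)
```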
So, we seem to have two very similar concepts (vlen and sequence), which complicates the DAP4 model. The question for me is:
Do we get rid of the Sequence concept, or at least define it as equivalent to the following?
Structure {...} [*]
My current belief is that we should keep sequences but with the following restrictions:
- Sequences can only occur as top-level, scalar variables within a group.
- Sequences may not be nested in any other container (i.e., other sequences or structures).
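The two restrictions above could be checked mechanically. The following is a minimal sketch over a hypothetical tree of declarations; the `Node` class and `sequence_violations` function are invented for illustration and are not part of any DAP4 implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str   # "Group", "Structure", "Sequence", or an atomic type name
    name: str
    dims: list = field(default_factory=list)
    children: list = field(default_factory=list)


def sequence_violations(node, parent_kind=None):
    """Collect every place where a Sequence breaks the proposed rules."""
    problems = []
    if node.kind == "Sequence":
        if parent_kind != "Group":
            # Rule: a sequence must sit directly inside a group.
            problems.append(f"{node.name}: parent is {parent_kind}, not a Group")
        if node.dims:
            # Rule: a sequence must be scalar (no dimensions of its own).
            problems.append(f"{node.name}: sequences must be scalar")
    for child in node.children:
        problems.extend(sequence_violations(child, node.kind))
    return problems


root = Node("Group", "/", children=[
    Node("Sequence", "obs", children=[Node("Int32", "t")]),
    Node("Structure", "s", children=[Node("Sequence", "bad")]),
])
violations = sequence_violations(root)
```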
As with vlens, the translation between DAP4 and CDM needs to be addressed.
- The conversion from DAP4 to CDM can be addressed using the rule above, namely that a sequence is, in CDM, represented as Structure {...} [*].
- Translation from CDM to DAP4 allows for the option of never using sequences, but always using vlens. An alternate translation might be to say that if you have a top-level CDM structure whose only dimension is a vlen, then translate that to a sequence (in effect inverting the DAP4->CDM translation).
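Both directions of this translation can be sketched in a few lines. The `Decl` record and the function names below are assumptions made for the example; the `use_sequences` flag stands in for the choice between "always use vlens" and the alternate, inverting translation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Decl:
    kind: str              # "Structure" or "Sequence"
    name: str
    dims: tuple = ()       # dimension names; "*" marks a vlen
    top_level: bool = True


def dap4_to_cdm(decl):
    """A DAP4 sequence becomes Structure {...} [*] in CDM."""
    if decl.kind == "Sequence":
        return replace(decl, kind="Structure", dims=("*",))
    return decl


def cdm_to_dap4(decl, use_sequences=True):
    """Optionally invert the rule: a top-level structure whose only
    dimension is a vlen becomes a sequence; otherwise keep the vlen."""
    is_candidate = (
        decl.kind == "Structure" and decl.top_level and decl.dims == ("*",)
    )
    if use_sequences and is_candidate:
        return replace(decl, kind="Sequence", dims=())
    return decl


obs = Decl("Sequence", "obs")
as_cdm = dap4_to_cdm(obs)
round_trip = cdm_to_dap4(as_cdm)
```

With `use_sequences=False` the CDM declaration passes through unchanged, which corresponds to the option of never using sequences.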