Re: orthogonality (was Re: New attempt)

Hi John,

Responses below.

John Caron wrote:

Hi Joe, thanks for your thoughts; some immediate comments follow. I will
also be thinking about these ideas and may have more to say later.


----- Original Message -----
From: "Joe Wielgosz" <joew@xxxxxxxxxxxxx>
To: "Benno Blumenthal" <benno@xxxxxxxxxxxxxxxx>; "John Caron"
<caron@xxxxxxxxxxxxxxxx>; <thredds@xxxxxxxxxxxxxxxx>
Sent: Tuesday, June 04, 2002 3:45 PM
Subject: orthogonality (was Re: New attempt)



Benno, John,

As I am currently immersed in Web-service-think (particularly WSDL,
which seems to basically be a more generalized version of what THREDDS
catalog is attempting for scientific data services) I would propose that
the principle of orthogonality might be a useful tool for deciding on
these issues.

For an XML design, orthogonality means whether or not a given tag or
attribute represents a distinct concept that is in no case expressible
using existing tags and attributes.

While a completely orthogonal tag set results in somewhat less succinct
documents than an approach which defines a number of "special case"
tags, the advantage is that it yields the maximum ratio of
expressiveness to schema complexity.

WSDL is an extreme example of this approach. Unlike the WSDL designers, we
may want to define special-case tags for very common cases to make the
actual documents less unwieldy.

The purpose of this message is not to suggest exactly which cases these
might be; rather, I am just suggesting we take a look at the current and
proposed DTD from this perspective.


I agree that orthogonality is a very important motivation. Related
motivations are simplicity (a small number of concepts) and concept
independence (the same meaning wherever a concept is used).

The other motivation I have had is:

    "Make common cases simple and human-readable"

For me, attributes are a bit easier to read than contained elements, so this
motivation orients me towards the use of attributes, all else being equal.


As far as I can tell, the trade-off is between readability and extensibility. Attributes are obviously more readable. However, they are less extensible, because the value of an attribute must always be plain text, whereas the content of a sub-element that starts as plain text can have markup added if this becomes desirable down the road.

"Building Web Services with Java" says:
"In general, whenever humans design XML documents, you will see more frequent use of attributes. This is true even in data-oriented applications. On the other hand, when XML documents are automatically 'designed' and generated by applications, you might see a more prevalent use of elements."
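
To make the extensibility point concrete, here is a minimal sketch in Python. The tag and attribute names (dataset, summary, em) are made up for illustration and are not taken from the THREDDS DTD. It shows the same text carried first as an attribute and then as a sub-element, and how only the sub-element form can later pick up markup while a reader that flattens the content keeps working.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag and attribute names, not taken from the THREDDS DTD.
as_attribute = ET.fromstring(
    '<dataset name="sst" summary="Monthly SST analysis"/>'
)
as_element = ET.fromstring(
    '<dataset name="sst"><summary>Monthly SST analysis</summary></dataset>'
)
# A later revision: the element form gains inline markup.
# An attribute value could not hold this without escaping it as text.
as_element_v2 = ET.fromstring(
    '<dataset name="sst">'
    '<summary>Monthly <em>SST</em> analysis</summary>'
    '</dataset>'
)


def summary_text(dataset):
    """Return the summary as plain text from either representation."""
    if "summary" in dataset.attrib:
        return dataset.get("summary")
    node = dataset.find("summary")
    if node is None:
        return None
    # itertext() flattens any nested markup back into plain text.
    return "".join(node.itertext())


for doc in (as_attribute, as_element, as_element_v2):
    print(summary_text(doc))
```

A client written against the element form keeps working after markup is added; a client written against the attribute form would need a schema change to get the same thing.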



This has also motivated factoring out common information, which reduces
repeated information. This has led to:
    1) allowing properties to be placed at collection elements which become
the defaults for contained datasets.
    2) factoring out server/service elements.
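
As a rough illustration of point 1, the sketch below shows what a client has to do when collection-level properties act as defaults for contained datasets. The attribute names (dataType, serviceName) and the single level of nesting are assumptions made for the example, not the actual catalog format.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragment; attribute names are illustrative only.
CATALOG = """
<collection name="models" dataType="grid" serviceName="dods">
  <dataset name="run1"/>
  <dataset name="run2" dataType="station"/>
</collection>
"""


def effective_properties(collection):
    """Map each dataset name to its attributes merged over the collection's.

    Only one level of nesting is handled in this sketch.
    """
    # Everything on the collection except its own name acts as a default.
    defaults = {k: v for k, v in collection.attrib.items() if k != "name"}
    result = {}
    for ds in collection.findall("dataset"):
        props = dict(defaults)
        props.update(ds.attrib)  # dataset values override the defaults
        result[ds.get("name")] = props
    return result


resolved = effective_properties(ET.fromstring(CATALOG))
for name, props in resolved.items():
    print(name, props)
```

The document gets shorter, but every reader now needs this merge step, which is the kind of trade-off being weighed here.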

These are more qualitative and aesthetic judgements, and are rather
subjective. None of these can be judged completely independently; you have
to look at the overall effect on readers, writers, APIs, tutorials, etc.


-------------------

A summary of the completely orthogonal concepts I believe we have
introduced thus far (not necessarily named the same as the tags that
currently express the concept):

service type - a named mechanism for accessing scientific data

access - a (named?) binding of a URI to a specific service type

metadata type - a named convention for description of scientific data

metadata - a (named?) binding of a text fragment to a specific metadata
type

metadata reference - a (named?) binding of a URI to a specific metadata
type

dataset - a named collection of access objects and metadata

collection - a named collection of datasets

collection reference - the URI of a THREDDS XML document containing a
collection
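
One way to check that these concepts really are distinct is to write them down as a data model. The following is only a sketch under my own naming assumptions: the class and field names are hypothetical, and the "(named?)" question is represented by an optional name field.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class ServiceType:
    """A named mechanism for accessing scientific data."""
    name: str


@dataclass(frozen=True)
class MetadataType:
    """A named convention for describing scientific data."""
    name: str


@dataclass
class Access:
    """Binding of a URI to a specific service type."""
    uri: str
    service_type: ServiceType
    name: str = ""  # optional, per the "(named?)" question


@dataclass
class Metadata:
    """Binding of a text fragment to a specific metadata type."""
    text: str
    metadata_type: MetadataType
    name: str = ""


@dataclass
class MetadataReference:
    """Binding of a URI to a specific metadata type."""
    uri: str
    metadata_type: MetadataType
    name: str = ""


@dataclass
class Dataset:
    """A named collection of access objects and metadata."""
    name: str
    accesses: List[Access] = field(default_factory=list)
    metadata: List[Union[Metadata, MetadataReference]] = field(default_factory=list)


@dataclass
class Collection:
    """A named collection of datasets."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class CollectionReference:
    """URI of a THREDDS XML document containing a collection."""
    uri: str


# Example values are invented for illustration.
dods = ServiceType("DODS")
run1 = Dataset("run1", accesses=[Access("http://example.org/dods/run1", dods)])
models = Collection("models", datasets=[run1])
print(models.datasets[0].accesses[0].service_type.name)
```

None of the eight classes can be built out of the others, which is the sense of "orthogonal" intended above.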

----------------

In contrast, the following concepts are not clearly orthogonal to me:

service path, server path, collection path, dataset path, suffix - since
they are only used in the context of the access object, they don't
really add meaning - any catalog using these attributes is equivalent to
one without them, which uses absolute URIs for all of its access objects

compound service / service list - this also doesn't strictly speaking
add meaning since services are only used in the context of access
objects - thus, one access with a compound service type is functionally
equivalent to n access objects with simple service types.

service subtype - unless the values for this attribute are given
standard meanings, this is equivalent to a named access object. Even if
the values do have standard meanings, there still seems to be some
overlap with metadata type.

catalog, server - if you factor out the path attribute, these are
equivalent to collections

documentation - equivalent to metadata with a human-readable metadata type

document - same as documentation, except with the connotation that it is
not critical to interpreting the dataset
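
The equivalence claimed for the path attributes and for compound services can be sketched as a simple expansion. This is an illustration only: the function name, the dictionary keys, and the rule of joining base, dataset path, and suffix by concatenation are all assumptions, not the catalog's actual resolution rules.

```python
def expand_access(services, dataset_path, suffix=""):
    """Expand one factored, compound access into flat access objects.

    services is a list of (service type, base URL) pairs. The result has
    one entry per simple service, each carrying an absolute URL.
    """
    flat = []
    for service_type, base in services:
        # Assumed joining rule: base + "/" + dataset path + suffix.
        url = base.rstrip("/") + "/" + dataset_path.lstrip("/") + suffix
        flat.append({"serviceType": service_type, "url": url})
    return flat


# Hypothetical compound service with two simple members.
compound = [
    ("DODS", "http://example.org/dods/"),
    ("FTP", "ftp://example.org/pub/"),
]
flat = expand_access(compound, "models/run1.nc")
for access in flat:
    print(access["serviceType"], access["url"])
```

If a catalog can always be rewritten this way without losing information, the factored attributes are a convenience for writers rather than new concepts.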


I think it's interesting and important what the ontology is, and I notice
some subtle differences between your conceptualization and mine (and also
between Benno's and mine). If we can converge on that a bit more, then some
of these other issues may be clarified. I think your description above is
good, but here are some different perspectives on some of it:

I would say the basic objects are datasets and services. Therefore, an
access element is a binding of a dataset to a service. The base, path, and
suffix are first attempts to specify the binding. Metadata are properties of
a dataset.


If these are the basic objects, then where does the actual URL reside?

As far as I can see, the only place where a URL has meaning is when a specific dataset is bound to a specific service. In WSDL this concept (a specific instance of a web service definition) is called a "port". In THREDDS it is currently expressed as an access tag.

I would suggest that an access object is really a very basic concept in THREDDS since without it, neither datasets nor services are very useful.


It's true that a catalog is just a collection. It's also a container for
service elements, but I don't really like that.


Maybe a catalog should be able to import an external list of service definitions, so that the collection/dataset/access list and the service defn's can be physically and conceptually separated when desired.

Even if we keep service
elements factored out, it seems better to allow collections to contain
service elements, so service elements can be near where they are used. This
will probably be useful for large synthetic catalogs.

I would be wary of putting too much effort into pretty formatting, especially when it adds complexity at the machine level. The primary use of this catalog format is going to be machine-to-machine communication. I very much doubt anyone is going to spend much time reading raw XML output from THREDDS servers except to debug them. And if they are, you might as well just write an XSLT doc that creates genuinely readable output.

OTOH, a catalog is very much a user-visible object in a way that perhaps an
arbitrary collection should not be. So while I am inclined to make a catalog
even more like a collection by putting service into collection, I am still
inclined to keep a separate top-level catalog element at the moment.



I've tried making documentation a metadata subtype. It felt slightly wrong,
and I think it's because documentation has to do with presentation, while
metadata is really properties of the dataset itself. Adding more of the
XLink "show" semantics to documentation seems to make it diverge more from
metadata. At the cost of adding an extra element, it seems easier to explain
with them separate.

A much harder question is the distinction between a dataset and a collection,
since a dataset is a collection of data. I have conceptualized it as
follows: a dataset is something that can be selected, and then it is
processed in a protocol-dependent way. A collection is a
protocol-independent mechanism for grouping datasets.


--------------------

One that I am not sure about is the "attribute" tag, since I am not
clear on how this is intended to be used. Is it for the THREDDS parser,
or passed directly to the user? Will there be standardized names and
values for attributes?


I'm open to suggestions on how it should be used. Right now, I'd say it is
made available to the client, to the protocol-aware dataset processing, and
optionally to the end-user. Therefore it could be the mechanism to make
standard extensions in the future without having to change the XML format.



A reminder, I am not trying to say specifically whether any of these
tags should be kept or dropped. I am merely suggesting that we might
want to focus on tags and attributes that represent orthogonal concepts,
and be a bit more choosy about the rest.

Also, I would suggest that any proposed extensions that *are* genuinely
orthogonal to the original tag set (although I'm not sure we've had any
thus far) be given special consideration, since by definition, there is
no workaround if they are not included.


agree, do you have anything in mind?


John, hope this is useful input.


very useful, thanks much


- Joe









--
Joe Wielgosz
joew@xxxxxxxxxxxxx / (707)826-2631
---------------------------------------------------
Center for Ocean-Land-Atmosphere Studies (COLA)
Institute for Global Environment and Society (IGES)
http://www.iges.org