February 1994
We intend the Internet Data Distribution (IDD) system to support an arbitrary
number (order 40) of simultaneous "DataFlows." Each DataFlow represents the
timely transport of information from a single source computer, designated the
"FlowSource," to one or many (order 300) data holdings, designated "LocalHoldings,"
at various points on the Internet where there are institutions with permission
to receive the data. Responsibilities for the IDD system are decentralized to
a significant extent: one organization is responsible for the FlowSource, one
organization is responsible for each LocalHolding. In addition, the IDD employs
a number (order 100) of data relays--each of which may or may not co-exist with
a LocalHolding--at institutions which are responsible for running and maintaining
them. For a limited number (order 15) of the DataFlows, the Unidata Program
Center (UPC) holds responsibility for coordinating the efforts of all parties
in order to achieve a degree of end-to-end integrity adequate to meet the education
and research needs of Unidata's core membership.
FlowProducts
Each DataFlow is to be dynamic. The aggregate body of information associated with
a DataFlow, designated the "GlobalHoldings," comprises a changing collection of
atomic datasets, called "FlowProducts." Changes in the GlobalHoldings are of three
types:
- New FlowProducts may be introduced at any time, but only at the FlowSource.
The organization responsible for the FlowSource determines the content
of each new FlowProduct, which may be an arbitrary bit string.
- Once introduced, a FlowProduct may move to and be held in one or many LocalHoldings,
and those organizations responsible for the LocalHoldings determine which
FlowProducts are acquired and held.
- FlowProducts may be purged from LocalHoldings, as determined by the responsible
organization. Once purged from all LocalHoldings, a FlowProduct no longer
exists in the GlobalHoldings and may not be retrieved via the IDD system.
Note 1: FlowProducts need not be files. However, as discussed in the section on
Local Filing and Processing of FlowProducts, the IDD must support the assembly
of files from raw and/or processed FlowProducts.
Note 2: FlowProducts are atomic only in the IDD context. In fact, most FlowProducts
will be disassembled, decoded, and/or reorganized prior to use in scientific
applications. And the IDD must allow such derived datasets to be treated as
FlowProducts as well; thus the aggregated GlobalHoldings from all DataFlows
will encompass redundant data in the form of decoded and reorganized datasets
as well as identical copies of certain FlowProducts in several LocalHoldings.
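The relationship between LocalHoldings and GlobalHoldings can be illustrated with a minimal sketch. The class and function names below (FlowProduct, LocalHolding, global_holdings) and the choice to index products by UniqueKey are illustrative assumptions, not part of any LDM interface. The sketch shows only that the GlobalHoldings are the union of what the LocalHoldings currently hold, so a FlowProduct disappears once every site has purged it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlowProduct:
    """An atomic dataset; the payload is an arbitrary bit string."""
    unique_key: str
    selection_key: str
    payload: bytes


@dataclass
class LocalHolding:
    """Products held at one site, indexed by UniqueKey (an assumed layout)."""
    name: str
    products: dict = field(default_factory=dict)

    def acquire(self, product: FlowProduct) -> None:
        self.products[product.unique_key] = product

    def purge(self, unique_key: str) -> None:
        # Purging a product that is not held is a harmless no-op.
        self.products.pop(unique_key, None)


def global_holdings(holdings):
    """UniqueKeys of every product still present in at least one LocalHolding."""
    keys = set()
    for holding in holdings:
        keys.update(holding.products)
    return keys


site_a, site_b = LocalHolding("site-a"), LocalHolding("site-b")
product = FlowProduct("k1", "surface KBOU", b"\x00\x01")
site_a.acquire(product)
site_b.acquire(product)
site_a.purge("k1")   # still retrievable: site-b holds a copy
site_b.purge("k1")   # now absent from the GlobalHoldings
```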
FlowSources
For purposes of this design, each FlowSource is assumed to be a sequential supplier
of FlowProducts with very little memory. More specifically, at any instant a FlowSource
will be expected only to supply the most recent FlowProduct and no others. However,
each FlowSource is expected to offer a limited degree of flow control, the amount
of which will be determined, by testing, to be as little as is necessary to keep
the loss of FlowProducts near zero. [If possible, we should quantify this to indicate,
for example, how fast a non-flow-controlled FlowSource can run or, alternatively,
how much buffering must be provided for a FlowSource with a given data rate.]
The data rate at which FlowProducts are introduced is assumed to be less than
200 kilobits per second at any single FlowSource.
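The buffering question raised above can be framed with simple arithmetic: a FlowSource that keeps producing while its downstream connection stalls must buffer roughly the data rate multiplied by the stall duration. The sketch below assumes a constant rate at the stated upper bound, takes a kilobit as 1000 bits, and uses hypothetical stall durations; the actual figures would have to come from testing.

```python
def buffer_bytes(rate_bits_per_s: float, stall_s: float) -> int:
    """Bytes of buffering needed to ride out a stall at a constant data rate."""
    return int(rate_bits_per_s * stall_s / 8)


# Upper bound on the FlowSource data rate stated above (kilo = 1000 is an assumption).
MAX_RATE_BITS_PER_S = 200_000

for stall_s in (1, 10, 60):  # hypothetical stall durations, in seconds
    print(stall_s, buffer_bytes(MAX_RATE_BITS_PER_S, stall_s))
```

Under these assumptions a 10-second stall calls for about 250,000 bytes of buffering, and a 60-second stall for about 1.5 million bytes.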
The IDD must support event-driven DataFlows, where the introduction of a FlowProduct
by a FlowSource causes the automatic distribution of that atom to all LocalHoldings
which have a predefined interest in the FlowProduct. This matter is discussed
further in connection with FlowSubscriptions and SubsetSelectors, described
below.
It may be desirable for the IDD to include software to facilitate building
a FlowSource which interfaces to a collection (perhaps a directory) of files
and which creates FlowProducts as follows: each time a new file is added to
the collection or a file in the collection is changed, this file is augmented
with SelectionKeys, UniqueKeys, and FlowCounters and introduced to the IDD as
a FlowProduct. (Note that if a file is modified, the corresponding FlowProduct
would have a different UniqueKey, even if its SelectionKey is unchanged.) This
capability could be useful for maintaining mirrored file systems.
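A directory-backed FlowSource of this kind might be sketched as follows. Everything here is an assumption made for illustration: the class name DirectoryFlowSource, the use of polling rather than file-system notification, the file name as SelectionKey, and a SHA-256 digest of the content as UniqueKey. With a content digest, a modified file receives a new UniqueKey while its SelectionKey stays the same; a file restored to earlier content would reuse its earlier UniqueKey.

```python
import hashlib
import os


class DirectoryFlowSource:
    """Polls a directory and emits a product record for each new or changed file."""

    def __init__(self, directory: str):
        self.directory = directory
        self.flow_counter = 0   # incremented once per product introduced
        self._seen = {}         # file name -> UniqueKey of the version last emitted

    def poll(self):
        """Return records for files added or modified since the previous poll."""
        products = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                payload = handle.read()
            # Assumption: the UniqueKey is a SHA-256 digest of the file content.
            unique_key = hashlib.sha256(payload).hexdigest()
            if self._seen.get(name) == unique_key:
                continue  # unchanged since it was last emitted
            self._seen[name] = unique_key
            self.flow_counter += 1
            products.append({
                "selection_key": name,
                "unique_key": unique_key,
                "flow_counter": self.flow_counter,
                "payload": payload,
            })
        return products
```

Calling poll() at a fixed interval would introduce each new or changed file exactly once per version. A production FlowSource would also need to avoid reading files that are still being written, which this sketch does not attempt.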
In general, all DataFlows are handled by computers running (components of)
version 5 or later of the Local Data Management (LDM) software provided and
maintained by the UPC. In special cases, however, other software systems may
need to interact with the IDD (to function as a FlowSource, for example),
so the LDM protocols must be designed and documented to facilitate such circumstances.
Subscriptions and Subset Selections
In conjunction with each and every FlowProduct, the FlowSource is responsible
for creating two keys as follows: 1) a "SelectionKey" (typically containing several
fields, such as time, location and type of observation) should characterize the
FlowProduct and allow the selection of subsets of the GlobalHoldings associated
with that DataFlow; 2) a "UniqueKey" must be chosen to distinguish this FlowProduct
from every other one that might exist in the GlobalHoldings. Ideally, the UniqueKey
might also serve to verify the accuracy of data delivery or to find identical
FlowProducts in distinct DataFlows.
For a given DataFlow, the organization responsible for each LocalHolding will
establish a "FlowSubscription" to the entirety or a subset of the GlobalHoldings
for that DataFlow. In the latter case, the subset will be defined by a "SubsetSelector"
which serves to discriminate among FlowProducts by boolean and regular-expression
operations on the fields of their SelectionKeys.
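One way to picture a SubsetSelector is as a predicate built from regular-expression tests on individual SelectionKey fields, combined with boolean operators. The sketch below assumes a SelectionKey is a mapping from field names to strings; the field names ("type", "station", "time"), the helper names, and the sample values are all hypothetical.

```python
import re


def field_matches(field: str, pattern: str):
    """Predicate: the named SelectionKey field fully matches a regular expression."""
    regex = re.compile(pattern)
    return lambda key: regex.fullmatch(key.get(field, "")) is not None


def all_of(*predicates):
    """Boolean AND of several predicates."""
    return lambda key: all(p(key) for p in predicates)


def any_of(*predicates):
    """Boolean OR of several predicates."""
    return lambda key: any(p(key) for p in predicates)


def negate(predicate):
    """Boolean NOT of a predicate."""
    return lambda key: not predicate(key)


# Hypothetical selector: surface observations from four-letter stations
# beginning with "K", excluding station KDEN.
selector = all_of(
    field_matches("type", r"surface"),
    field_matches("station", r"K[A-Z]{3}"),
    negate(field_matches("station", r"KDEN")),
)

products = [
    {"type": "surface", "station": "KBOU", "time": "1994-02-01T12:00"},
    {"type": "surface", "station": "KDEN", "time": "1994-02-01T12:00"},
    {"type": "upperair", "station": "KBOU", "time": "1994-02-01T12:00"},
]
selected = [key for key in products if selector(key)]  # only the KBOU surface report
```

A subscription to the entirety of a DataFlow corresponds to a selector that accepts every SelectionKey.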
For certain DataFlows, users may be required to obtain and maintain a license
or other formal proof of permission to acquire some or all FlowProducts from
the associated GlobalHoldings. The IDD must provide modestly secure means
to prevent the delivery of data to users who do not have appropriate permissions.
The human actions required to create or change FlowSubscriptions and SubsetSelectors
must be straightforward for personnel at the LocalHolding site to perform, and
they should require minimal expertise. The results should be immediate (order
1 minute). There should be reasonable protection against subscription changes
by unauthorized individuals.
Actions to Fulfill Subscriptions
In general, the establishment of a new or changed subscription will result in
several actions by the IDD system:
- The GlobalHoldings will be compared with the LocalHolding to determine
which FlowProducts must be transferred to the latter to fulfill
the FlowSubscription with respect to extant data which match the SubsetSelector.
- The process of fulfilling the FlowSubscription with extant data would begin,
perhaps after checks are performed to verify that the necessary transfers
are reasonable. The order of FlowProduct transfer generally will be from oldest
to newest, though it may be desirable, as a user option, to allow the order
to be reversed.
- Independent of the first two actions, all new FlowProducts to which the FlowSubscription
applies will be transferred to the LocalHolding soon after they are introduced
into the IDD system. The acceptable delay is of order 10 seconds. This action
need be reliable only to the extent practical, given that LocalHolding computer
failures and network congestion are bound to be unpredictable. More
specifically, if a FlowProduct cannot be transferred within a reasonable length
of time (order 10 minutes), it need not be delivered.
- Periodically (at user-specified intervals) or on problem detection (at
the LocalHolding computer), IDD actions (as in the first two items) must occur automatically
to reconcile discrepancies between the GlobalHoldings and the LocalHolding
which have arisen from the failed delivery of FlowProducts.
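The comparison and backfill steps in the list above amount to a set difference followed by an ordering. The following sketch assumes an index of the GlobalHoldings that records an introduction time and SelectionKey for each UniqueKey; that index, the function names, and the 600-second default (standing in for "order 10 minutes") are illustrative assumptions.

```python
def plan_backfill(global_index, local_keys, selector, newest_first=False):
    """List the UniqueKeys a LocalHolding lacks, in transfer order.

    global_index maps UniqueKey -> (introduction_time, selection_key).
    local_keys is the set of UniqueKeys already held locally.
    selector is a predicate on a SelectionKey (the SubsetSelector).
    """
    missing = [
        (introduced, unique_key)
        for unique_key, (introduced, selection_key) in global_index.items()
        if unique_key not in local_keys and selector(selection_key)
    ]
    # Oldest to newest by default; the user option reverses the order.
    missing.sort(reverse=newest_first)
    return [unique_key for _, unique_key in missing]


def should_still_deliver(introduced, now, give_up_after_s=600):
    """A new product older than the give-up window need not be delivered."""
    return now - introduced <= give_up_after_s
```

Because the plan is computed from the current state of both holdings, running it again after the transfers complete yields an empty list, which is the behavior wanted for periodic reconciliation.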
In general, the responsibility for the completeness of a LocalHolding lies with
the organization which manages it. Specifically, if a FlowSubscription is not
fulfilled for any reason, all corrective steps are to be initiated from the LocalHolding.
This suggests that the establishment or modification of a FlowSubscription should
be an idempotent operation. It also suggests that there must be mechanisms by
which recipients of data can discern, as quickly as possible, when a data delivery
interruption has occurred.
Regarding the last point, perhaps the FlowSource could generate an integer
"FlowCounter" which is incremented with each new product--the receiver could
then use jumps in sequence to identify interruptions. Alternatively, the receiver
might periodically check an inventory which is maintained at the FlowSource.
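The FlowCounter idea can be sketched as a scan for jumps in the received sequence. The function name find_gaps is hypothetical, and the sketch assumes that counters increase by one per product and normally arrive in order; late or duplicate arrivals are ignored rather than used to retract a reported gap, and a counter reset at the FlowSource is not handled.

```python
def find_gaps(counters):
    """Return (first_missing, last_missing) ranges implied by jumps in FlowCounter values."""
    gaps = []
    previous = None
    for counter in counters:
        if previous is not None and counter > previous + 1:
            gaps.append((previous + 1, counter - 1))
        # Track the highest counter seen so far; ignore late or duplicate values.
        if previous is None or counter > previous:
            previous = counter
    return gaps


print(find_gaps([1, 2, 3, 7, 8, 10]))  # [(4, 6), (9, 9)]
```

A receiver could treat any non-empty result as the trigger for the reconciliation actions described above. Note that a receiver with a SubsetSelector sees gaps by design unless the counter is kept per subscription or checked before selection, a point this sketch leaves open.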
Connectivity and Reliability
Internet connections to all LocalHoldings are assumed to have adequate capacity
to handle the FlowSubscriptions and FlowProduct selections made by the responsible
organization. In other words, the organization is responsible for obtaining adequate
connections or, alternatively, for limiting its selections. In cases where the
capacity is inadequate, the IDD design does not attempt to predict which FlowProducts
will not be received.
DataFlows in the IDD are of two general types: A) those upon which many Unidata
users depend and whose reliability is therefore of direct concern to the UPC, and
B) those which arise on an ad hoc basis among users of the LDM software. The
IDD design must facilitate the creation of new type B DataFlows so that they:
- do not have detrimental impact on the reliability of type A DataFlows;
- do not require recompilations or other major IDD system adaptations;
- can support some form of access control managed by the data provider without
Unidata intervention;
- can evolve to become type A DataFlows if community interest warrants.
For type A DataFlows, the IDD must provide means for monitoring centrally the
entire state of the DataFlow, including individual states of LocalHoldings and
discrepancies between LocalHoldings and GlobalHoldings (that is, discrepancies
which arise outside the intended effects of SubsetSelectors). Such monitoring
capabilities also should reflect the health of the underlying networks, to the
extent practical. It must be possible--from the UPC--to maintain current, quantitative
measures of how promptly and reliably the IDD system delivers data to all users.
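As one possible form for such quantitative measures, the sketch below summarizes promptness and reliability for a single LocalHolding from timestamps. The input layout (times in seconds keyed by UniqueKey), the function name delivery_stats, and the choice of median latency and delivered fraction as the metrics are assumptions made for illustration.

```python
from statistics import median


def delivery_stats(introduced, received, expected):
    """Summarize promptness and reliability for one LocalHolding.

    introduced: UniqueKey -> time introduced at the FlowSource (seconds)
    received:   UniqueKey -> time received at the LocalHolding (seconds)
    expected:   set of UniqueKeys the FlowSubscription selects
    """
    delivered = [key for key in expected if key in received]
    latencies = [received[key] - introduced[key] for key in delivered]
    return {
        "expected": len(expected),
        "delivered": len(delivered),
        "fraction_delivered": len(delivered) / len(expected) if expected else 1.0,
        "median_latency_s": median(latencies) if latencies else None,
        "missing": sorted(set(expected) - set(delivered)),
    }
```

Restricting the comparison to the expected set keeps the intended effects of a SubsetSelector from being counted as failures. Collecting these summaries centrally for every LocalHolding in a type A DataFlow would give the UPC a current view of the whole DataFlow.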
Local Filing and Processing of FlowProducts
The IDD is not required to support direct user access to LocalHoldings because,
in general, LocalHoldings may be accessed only through the IDD system. Therefore,
the IDD system must provide user-definable means to store FlowProducts in local
files or, alternatively, to pass them to processing routines, such as decoders.
The process for selecting FlowProducts to file and/or process should parallel,
as closely as possible, the previously described processes for establishing FlowSubscriptions
and defining SubsetSelectors. It must be possible to employ SelectionKeys and
UniqueKeys in the naming of files and in the processing routines.
This capability is intended to facilitate the joining of the IDD system with
more powerful systems for managing, browsing, and giving access to archived
data. In this way, the IDD system is relieved of all but the simplest forms
of retrospective data access, utilizing SelectionKeys and UniqueKeys. Content-based
selections of FlowProducts and subsetting within FlowProducts will not be supported
by the IDD system and, if needed, will be provided separately. Thus, GlobalHoldings
typically will encompass relatively short time periods (order 10 days), and
user needs for long time-series or other data sets that differ significantly
from the forms provided by the FlowSources, or for sophisticated subsetting,
selection, and assembly, will be met through systems that are fed via the IDD,
not by the IDD system per se.