Unidata IDD Functional Requirements Specification

Dave Fulker February 1994

We intend the Internet Data Distribution (IDD) system to support an arbitrary number (order 40) of simultaneous "DataFlows." Each DataFlow represents the timely transport of information from a single source computer, designated the "FlowSource," to one or many (order 300) data holdings, designated "LocalHoldings," at various points on the Internet where there are institutions with permission to receive the data. Responsibilities for the IDD system are decentralized to a significant extent: one organization is responsible for the FlowSource, one organization is responsible for each LocalHolding. In addition, the IDD employs a number (order 100) of data relays--each of which may or may not co-exist with a LocalHolding--at institutions which are responsible for running and maintaining them. For a limited number (order 15) of the DataFlows, the Unidata Program Center (UPC) holds responsibility for coordinating the efforts of all parties in order to achieve a degree of end-to-end integrity adequate to meet the education and research needs of Unidata's core membership.

FlowProducts

Each DataFlow is to be dynamic. The aggregate body of information associated with a DataFlow, designated the "GlobalHoldings," comprises a changing collection of atomic datasets, called "FlowProducts." Changes in the GlobalHoldings are of three types:
  1. New FlowProducts may be introduced at any time (only) at the FlowSource, and the organization responsible for the FlowSource determines the content of each new FlowProduct, which may be an arbitrary bit string.
  2. Once introduced, a FlowProduct may move to and be held in one or many LocalHoldings, and those organizations responsible for the LocalHoldings determine which FlowProducts are acquired and held.
  3. FlowProducts may be purged from LocalHoldings, as determined by the responsible organization. Once purged from all LocalHoldings, a FlowProduct no longer exists in the GlobalHoldings and may not be retrieved via the IDD system.

Note 1: FlowProducts need not be files. However, as discussed in the section on Local Filing and Processing of FlowProducts, the IDD must support the assembly of files from raw and/or processed FlowProducts.

Note 2: FlowProducts are atomic only in the IDD context. In fact, most FlowProducts will be disassembled, decoded, and/or reorganized prior to use in scientific applications. And the IDD must allow such derived datasets to be treated as FlowProducts as well; thus the aggregated GlobalHoldings from all DataFlows will encompass redundant data in the form of decoded and reorganized datasets as well as identical copies of certain FlowProducts in several LocalHoldings.

FlowSources

For purposes of this design, each FlowSource is assumed to be a sequential supplier of FlowProducts with very little memory. More specifically, at any instant a FlowSource will be expected only to supply the most recent FlowProduct and no others. However, each FlowSource is expected to offer a limited degree of flow control, the amount of which will be determined, by testing, to be as little as is necessary to keep the loss of FlowProducts near zero. [If possible, we should quantify this to indicate, for example, how fast a non-flow-controlled FlowSource can run or, alternatively, how much buffering must be provided for a FlowSource with a given data rate.]

The data rate at which FlowProducts are introduced is assumed to be less than 200 kilobits per second at any single FlowSource.
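As a rough illustration of the buffering question raised in the bracketed note above, the following sketch computes how many bytes would accumulate at the stated ceiling of 200 kilobits per second while a receiver is unreachable. The function name and the two outage lengths are assumptions chosen for illustration; they are not requirements of this specification.

```python
def buffer_bytes(rate_kbit_per_s, outage_s):
    """Bytes emitted by a FlowSource during an outage of the given length."""
    bits = rate_kbit_per_s * 1000 * outage_s  # 1 kilobit taken as 1000 bits
    return int(bits // 8)                     # bits -> whole bytes


# 200 kbit/s is the ceiling stated in this section.
# The outage lengths below are illustrative assumptions.
ten_seconds = buffer_bytes(200, 10)    # 250,000 bytes
ten_minutes = buffer_bytes(200, 600)   # 15,000,000 bytes
```

Under these assumptions, even a ten-minute interruption at the maximum rate corresponds to roughly 15 megabytes of queued FlowProducts.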

The IDD must support event-driven DataFlows, where the introduction of a FlowProduct by a FlowSource causes the automatic distribution of that atom to all LocalHoldings which have a predefined interest in the FlowProduct. This matter is discussed further in connection with FlowSubscriptions and SubsetSelectors, described below.

It may be desirable for the IDD to include software to facilitate building a FlowSource which interfaces to a collection (perhaps a directory) of files and which creates FlowProducts as follows: each time a new file is added to the collection or a file in the collection is changed, this file is augmented with SelectionKeys, UniqueKeys, and FlowCounters and introduced to the IDD as a FlowProduct. (Note that if a file is modified, the corresponding FlowProduct would have a different UniqueKey, even if its SelectionKey is unchanged.) This capability could be useful for maintaining mirrored file systems.
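A minimal sketch of such a file-driven FlowSource follows. It is not the LDM implementation. It assumes, purely for illustration, that the file name serves as the SelectionKey, that a SHA-256 digest of the file contents serves as the UniqueKey, and that the FlowCounter is a simple integer; the function name and the dictionary layout are likewise hypothetical.

```python
import hashlib
from pathlib import Path


def scan_for_products(directory, seen, counter):
    """Return (products, counter) for files added or changed since the last scan.

    `seen` maps each file path to the UniqueKey of the content last introduced;
    it is updated in place.
    """
    products = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Assumption: a content digest stands in for the UniqueKey.
        unique_key = hashlib.sha256(data).hexdigest()
        if seen.get(path) == unique_key:
            continue  # unchanged file: nothing new to introduce
        counter += 1
        seen[path] = unique_key
        products.append({
            "selection_key": path.name,  # assumption: file name as SelectionKey
            "unique_key": unique_key,
            "flow_counter": counter,
            "data": data,
        })
    return products, counter
```

Calling the function repeatedly with the same `seen` dictionary introduces a file once when it appears and again whenever its contents change; in the latter case the SelectionKey is the same but the UniqueKey differs, as noted above.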

In general, all DataFlows are handled by computers running (components of) version 5 or later of the Local Data Management (LDM) software provided and maintained by the UPC. In special cases, however, other software systems may need to interact with the IDD, perhaps to function as a FlowSource, for example, so the LDM protocols must be designed and documented to facilitate such circumstances.

Subscriptions and Subset Selections

In conjunction with each and every FlowProduct, the FlowSource is responsible for creating two keys as follows: 1) a "SelectionKey" (typically containing several fields, such as time, location and type of observation) should characterize the FlowProduct and allow the selection of subsets of the GlobalHoldings associated with that DataFlow; 2) a "UniqueKey" must be chosen to distinguish this FlowProduct from every other one that might exist in the GlobalHoldings. Ideally, the UniqueKey might also serve to verify the accuracy of data delivery or to find identical FlowProducts in distinct DataFlows.
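One way to picture the two keys is sketched below. The class and field names are hypothetical, and the use of a SHA-256 content digest as the UniqueKey is an assumption made here because a digest can serve both secondary purposes mentioned above; the specification itself does not prescribe how the UniqueKey is computed.

```python
import hashlib
from dataclasses import dataclass


def make_unique_key(data):
    """Assumed scheme: the UniqueKey is a digest of the FlowProduct's bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class FlowProduct:
    selection_key: dict   # hypothetical fields, e.g. time, location, type
    data: bytes           # arbitrary bit string
    unique_key: str = ""  # filled in by the FlowSource when left empty

    def __post_init__(self):
        if not self.unique_key:
            self.unique_key = make_unique_key(self.data)

    def is_intact(self):
        """True if the data still match the key assigned at the FlowSource."""
        return self.unique_key == make_unique_key(self.data)
```

With this scheme, a receiver can detect a corrupted delivery by calling `is_intact()`, and two products with identical bytes in different DataFlows receive the same key. The same property means that two distinct products with identical contents within one DataFlow would collide, so a real design might need to mix additional information into the key.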

For a given DataFlow, the organization responsible for each LocalHolding will establish a "FlowSubscription" to the entirety or a subset of the GlobalHoldings for that DataFlow. In the latter case, the subset will be defined by a "SubsetSelector" which serves to discriminate among FlowProducts by boolean and regular-expression operations on the fields of their SelectionKeys.
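The following sketch suggests how a SubsetSelector might combine regular-expression tests on individual SelectionKey fields with boolean operators. All function names and the field names in the example are assumptions for illustration; the specification does not define a selector syntax.

```python
import re


def field_matches(field, pattern):
    """Selector that applies a regular expression to one SelectionKey field."""
    regex = re.compile(pattern)
    return lambda key: regex.search(key.get(field, "")) is not None


def all_of(*selectors):
    """Boolean AND of several selectors."""
    return lambda key: all(s(key) for s in selectors)


def any_of(*selectors):
    """Boolean OR of several selectors."""
    return lambda key: any(s(key) for s in selectors)


def negate(selector):
    """Boolean NOT of a selector."""
    return lambda key: not selector(key)


# Hypothetical example: surface or upper-air reports, excluding stations
# whose identifier begins with "X".
wanted = all_of(
    any_of(field_matches("type", r"^surface$"),
           field_matches("type", r"^upper-air$")),
    negate(field_matches("location", r"^X")),
)
```

A SelectionKey is represented here as a dictionary of strings, so `wanted({"type": "surface", "location": "KDEN"})` evaluates the whole expression against one FlowProduct.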

For certain DataFlows, users may be required to obtain and maintain a license or other formal proof of permission to acquire some or all FlowProducts from the associated GlobalHoldings. The IDD must provide modestly secure means to prevent the delivery of data to users who do not have appropriate permissions.

The human actions required to create or change FlowSubscriptions and SubsetSelectors must be straightforward to perform by personnel at the LocalHolding site, and they should require minimal expertise. The results should be immediate (order 1 minute). There should be reasonable protection against subscription changes by unauthorized individuals.

Actions to Fulfill Subscriptions

In general, the establishment of a new or changed subscription will result in several actions by the IDD system:

  1. The GlobalHoldings will be compared with the LocalHolding to determine which FlowProducts would have to be transferred to the latter so as to fulfill the FlowSubscription in respect to extant data which match the SubsetSelector.
  2. The process of fulfilling the FlowSubscription with extant data would begin, perhaps after checks are performed to verify that the necessary transfers are reasonable. The order of FlowProduct transfer generally will be from oldest to newest, though it may be desirable, as a user option, to allow the order to be reversed.
  3. Independent of actions 1 and 2, all new FlowProducts to which the FlowSubscription applies will be transferred to the LocalHolding soon after they are introduced into the IDD system. The acceptable delay is of order 10 seconds. This action must be reliable only to the extent practical, given that LocalHolding computer failures and network congestion are bound to be unpredictable. More specifically, if a FlowProduct cannot be transferred within a reasonable length of time (order 10 minutes), it need not be delivered.
  4. Periodically (at user-specified intervals) or on problem detection (at the LocalHolding computer), IDD actions (as in items 1 and 2) must occur automatically to reconcile discrepancies between the GlobalHoldings and the LocalHolding which have arisen from the failed delivery of FlowProducts.

In general, the responsibility for the completeness of a LocalHolding lies with the organization which manages it. Specifically, if a FlowSubscription is not fulfilled for any reason, all corrective steps are to be initiated from the LocalHolding. This suggests that the establishment or modification of a FlowSubscription should be an idempotent operation. It also suggests that there must be mechanisms by which recipients of data can discern, as quickly as possible, when a data delivery interruption has occurred.
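Actions 1, 2, and 4 above amount to computing which extant FlowProducts match the SubsetSelector but are absent from the LocalHolding. The sketch below shows one way this comparison could work, under the assumptions that holdings are compared by UniqueKey and that each entry in the GlobalHoldings carries an introduction time; the function name and tuple layout are hypothetical.

```python
def missing_products(global_holdings, local_keys, selector, newest_first=False):
    """List the FlowProducts a LocalHolding still needs.

    global_holdings: iterable of (intro_time, unique_key, selection_key)
    local_keys:      set of UniqueKeys already present in the LocalHolding
    selector:        callable returning True for wanted SelectionKeys
    """
    wanted = [
        entry for entry in global_holdings
        if selector(entry[1 + 1]) and entry[1] not in local_keys
    ]
    # Default transfer order is oldest to newest; the user may reverse it.
    return sorted(wanted, key=lambda entry: entry[0], reverse=newest_first)
```

Because the result depends only on the current difference between the two holdings, repeating the operation after a successful transfer yields an empty list, which is the idempotent behavior suggested above.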

Regarding the last point, perhaps the FlowSource could generate an integer "FlowCounter" which is incremented with each new product--the receiver could then use jumps in sequence to identify interruptions. Alternatively, the receiver might periodically check an inventory which is maintained at the FlowSource.
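The FlowCounter idea can be sketched as follows, assuming that the counter increases by exactly one per product and that products arrive in order. The function name is hypothetical.

```python
def find_gaps(counters):
    """Return (first_missing, last_missing) pairs for each jump in sequence."""
    gaps = []
    previous = None
    for counter in counters:
        if previous is not None and counter > previous + 1:
            gaps.append((previous + 1, counter - 1))
        if previous is None or counter > previous:
            previous = counter  # ignore duplicates and late arrivals
    return gaps


# Received FlowCounters 1, 2, 3, 6, 7, 10: products 4-5 and 8-9 were missed.
gaps = find_gaps([1, 2, 3, 6, 7, 10])   # [(4, 5), (8, 9)]
```

A counter alone cannot reveal an interruption until the next product arrives, so a quiet DataFlow would still need something like the periodic inventory check mentioned above.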

Connectivity and Reliability

Internet connections to all LocalHoldings are assumed to have adequate capacity to handle the FlowSubscriptions and FlowProduct selections made by the responsible organization. In other words, the organization is responsible for obtaining adequate connections or, alternatively, for limiting its selections. In cases where the capacity is inadequate, the IDD design does not attempt to predict which FlowProducts will not be received.

DataFlows in the IDD are of two general types: A) those upon which many Unidata users depend and, therefore, reliability is of direct concern to the UPC, and B) those which arise on an ad hoc basis among users of the LDM software. The IDD design must facilitate the creation of new type B DataFlows so that they:

  1. do not have detrimental impact on the reliability of type A DataFlows;

  2. do not require recompilations or other major IDD system adaptations;

  3. can support some form of access control managed by the data provider without Unidata intervention;

  4. can evolve to become type A DataFlows if community interest warrants.

For type A DataFlows, the IDD must provide means for monitoring centrally the entire state of the DataFlow, including individual states of LocalHoldings and discrepancies between LocalHoldings and GlobalHoldings (that is, discrepancies which arise outside the intended effects of SubsetSelectors). Such monitoring capabilities also should reflect the health of the underlying networks, to the extent practical. It must be possible--from the UPC--to maintain current, quantitative measures of how promptly and reliably the IDD system delivers data to all users.

Local Filing and Processing of FlowProducts

The IDD is not required to support direct user access to LocalHoldings because, in general, LocalHoldings may be accessed only through the IDD system. Therefore, the IDD system must provide user-definable means to store FlowProducts in local files or, alternatively, to pass them to processing routines, such as decoders. The process for selecting FlowProducts to file and/or process should parallel, as closely as possible, the previously described processes for establishing FlowSubscriptions and defining SubsetSelectors. It must be possible to employ SelectionKeys and UniqueKeys in the naming of files and in the processing routines.
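A minimal sketch of such user-definable filing and processing follows. It assumes a list of (predicate, action) rules evaluated against each FlowProduct's SelectionKey, and a file-name template filled from the SelectionKey fields and the UniqueKey. The function names, the template syntax, and the field names are all hypothetical and are not the LDM's configuration format.

```python
def file_path_for(template, selection_key, unique_key):
    """Build a file name from SelectionKey fields and the UniqueKey."""
    return template.format(unique_key=unique_key, **selection_key)


def apply_rules(selection_key, unique_key, data, rules):
    """Run the action of every rule whose predicate accepts the SelectionKey."""
    results = []
    for predicate, action in rules:
        if predicate(selection_key):
            results.append(action(selection_key, unique_key, data))
    return results


# Hypothetical rule set: name a file for surface reports, and pass every
# product to a stand-in "decoder" that merely reports the data length.
rules = [
    (lambda key: key.get("type") == "surface",
     lambda key, uid, data: file_path_for("{type}/{time}_{unique_key}.dat", key, uid)),
    (lambda key: True,
     lambda key, uid, data: len(data)),
]
```

The predicates here play the same role as SubsetSelectors, which keeps the filing and processing rules parallel to the subscription mechanism, as this section requires.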

This capability is intended to facilitate the joining of the IDD system with more powerful systems for managing, browsing, and giving access to archived data. In this way, the IDD system is relieved of all but the simplest forms of retrospective data access, utilizing SelectionKeys and UniqueKeys. Content-based selections of FlowProducts and subsetting within FlowProducts will not be supported by the IDD system and, if needed, will be provided separately. Thus, GlobalHoldings typically will encompass relatively short time periods (order 10 days). User needs for long time-series, for other data sets that differ significantly from the forms provided by the FlowSources, or for sophisticated subsetting, selection, and assembly will be met through systems that are fed via the IDD, not by the IDD system per se.