IDD Contingency Ideas
Background
Since the NSFNet backbone was turned over to "commodity"
Internet service providers on April 30, 1995, the overall
performance of the network has been somewhat shaky.
Nevertheless, since the installation of LDM 5 at most relays
(starting in December 1995), we could, for the most part,
count on the majority of our relay sites getting
100% of the data.
However, since late February, 1996, the service in some
areas has degraded enough that it is beginning to show
in the long term IDD statistics as losses of at least
some data products at relay sites during the busy
afternoon hours.
Moreover, one Internet administrator estimates that the
traffic on the Internet is doubling every 4 months. With
that sort of increase in demand, service may degrade
further before the providers can increase the capacity
to the point where it can keep up with the demand.
Consequently, we have been having discussions at the
UPC and among some of the sites to come up with
ideas for what might be done if the reliability of
the current datastreams drops below an acceptable level.
Indeed, for some sites it has already dropped below
that level, and they are taking action on their own.
This writeup is intended to open a discussion as
to whether some coordinated action by the entire
community can be undertaken if it proves necessary.
Long Term Options
Some options may be useful for improving IDD reliability
in the long term, but they do not appear to be mature
enough to help us at the moment. These include
technology and infrastructure that is still in the research
or experimental stage:
- Reliable Multicast Protocol (RMP)
- Resource Reservation Protocol (RSVP)
- The very high speed Backbone Network Service (vBNS)
Impractical or Unacceptable Options
Several possibilities have been discussed, but have been
rejected or indefinitely tabled for various reasons. These
include:
- Returning to a satellite broadcast system
- Going to a centralized data distribution center
- Building our own Unidata community network
based on Frame Relay or dedicated lines
- Attempting to get NSF to reestablish something like
the NSFnet
Short Term Contingency Options
Of the ideas which have come under discussion, the two
that seem to be the most practical to undertake as
temporary, emergency measures have to do with reconfiguring
the LDM and IDD.
Increase product queue size to hold data longer
Perhaps the easiest solution is to increase the product queue size on the
relay nodes so that data remains available longer for downstream sites
that fall behind. If we know that latencies, on average, max out at some
value (say, 2 hours), we can increase the queue size to deal with it.
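To make the arithmetic concrete, here is a back-of-the-envelope sketch in
Python. The feed rate is an assumed, illustrative number, not a measured
value for any particular IDD stream:

```python
# Back-of-the-envelope queue sizing.  The feed rate below is an assumed,
# illustrative number, not a measured value for any particular IDD stream.

feed_rate_mb_per_hour = 50.0   # assumed aggregate rate of the requested feeds
max_latency_hours = 2.0        # longest delay we expect to have to ride out
safety_factor = 1.5            # headroom for bursts around synoptic times

queue_size_mb = feed_rate_mb_per_hour * max_latency_hours * safety_factor
print(f"Suggested product queue size: about {queue_size_mb:.0f} MB")
```

Under these assumptions a relay would need a queue of roughly 150MB, which
motivates the memory concern raised in the list below.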
This approach is not without its drawbacks:
- It is only effective if the latencies are cyclic. If a site
is always behind, a larger upstream queue will do nothing to
alleviate the problem.
- An increase in queue size will require a corresponding
increase in memory resource allocation, since we memory-map
the queue file. Given that queue sizes may need to exceed
100MB, the amount of RAM and swap needed may be beyond the
reach of many of the relay sites.
- There is another issue here; namely, are delays of more than an
hour acceptable to the community? We've discussed this before
as a philosophical issue and have not come to any firm conclusion.
For example, sites might be willing to delay the HDS if they
could get the station data in a timely fashion. As I recall, the
idea of prioritizing the data products/streams has come up (a rough
sketch of that idea follows this list), but that would significantly
complicate the LDM.
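For what it's worth, the following is a minimal sketch of what prioritized
relaying might look like. Per-feed prioritization is not an LDM feature;
the feed names and priority values are assumptions made purely for
illustration:

```python
import heapq

# Hypothetical per-feed priorities: lower number = relay first.  Per-feed
# prioritization is NOT an LDM feature; this only illustrates the idea.
FEED_PRIORITY = {
    "DDS": 0,   # station data: most time-critical
    "HDS": 1,   # model output: sites might tolerate delaying it
}

def relay_order(backlog):
    """Yield (feed, product) pairs from a backlog, most time-critical first.
    The sequence number keeps ordering stable within a priority level."""
    heap = [(FEED_PRIORITY.get(feed, 99), seq, feed, product)
            for seq, (feed, product) in enumerate(backlog)]
    heapq.heapify(heap)
    while heap:
        _, _, feed, product = heapq.heappop(heap)
        yield feed, product

# Example: station data jumps ahead of the model grids when we fall behind.
backlog = [("HDS", "ETA grid"), ("DDS", "surface obs"), ("HDS", "NGM grid")]
for feed, product in relay_order(backlog):
    print(feed, product)
```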
Cut down on the amount of data handled by the relay sites
In effect, by cutting down on the amount of data you ask for,
you are deciding ahead of time which products you will do
without, rather than losing products randomly during the
busy times of the day.
Some leaf nodes that have special difficulty getting data
reliably have already begun to experiment with this approach.
We'll attempt to keep track of their experiences.
Of course, leaf nodes can make these decisions on their own.
However, if it becomes necessary to take such action at
relay nodes, then careful coordination is needed, because the
decision at a relay affects all the sites downstream of it.
Cutting back on data at the relays could be done in
a number of ways. The simplest, of course, would be to come to
a community-wide decision as to which products are of lower
priority and simply cease injecting those products into the
IDD at the source. If there were general agreement
that certain of the NCEP model data are less important
than others, that would greatly simplify the process.
However, if there is not sufficient general agreement on
which products can be cut out, an alternative would be
to structure the FOS distribution topology around groups
of sites that agree on the set of products they want.
For example, one branch of the distribution might carry
only the DDS, for sites that are not interested in real-time
model forecasts. Another branch might want certain of the models,
but not all of them. If, on the other hand, certain of the
relay sites do have the requisite bandwidth, they might
still be able to relay the entire HDS stream to other
sites that can receive it reliably.
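As an illustration only, the sketch below shows one way such per-branch
product sets might be expressed. The branch names and regular-expression
patterns are hypothetical, and this is not LDM configuration syntax:

```python
import re

# Hypothetical per-branch policies: each distribution branch agrees on the
# feeds (and, within a feed, the product patterns) it is willing to carry.
# Branch names and patterns are illustrative only; this is not LDM
# configuration syntax.
BRANCH_POLICY = {
    "dds-only":  {"DDS": re.compile(r".*")},            # station data only
    "dds+eta":   {"DDS": re.compile(r".*"),
                  "HDS": re.compile(r"ETA")},            # only some of the models
    "full-feed": {"DDS": re.compile(r".*"),
                  "HDS": re.compile(r".*")},             # sites with the bandwidth
}

def wanted(branch, feed, product_id):
    """Return True if the given branch has agreed to carry this product."""
    pattern = BRANCH_POLICY[branch].get(feed)
    return pattern is not None and pattern.search(product_id) is not None

print(wanted("dds-only", "HDS", "ETA grid 212"))   # False: no model data
print(wanted("dds+eta",  "HDS", "ETA grid 212"))   # True
```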
There are probably many other options as well. The
purpose of this writeup is not to provide a definitive or
complete list of options, but to stimulate discussion among the
user community so as to give the UPC guidance in developing
a contingency plan for the IDD.
Other Options
In the slightly longer term, there are other possibilities
to consider.
One somewhat radically different idea that is under
discussion at the UPC would be to develop
a hybrid data-flow/demand-driven distributed file system. A
downstream system could register "standing orders" for certain
products which would be delivered and processed in the current ASAP
manner. Products not so registered could be obtained and processed
on an ad-hoc basis by accessing a seemingly local file system.
The accessible lifetime of a product would be decided by the data
provider. Of course this would require considerable
development effort.
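The following is a very rough conceptual sketch of such a hybrid system,
using an invented interface; nothing like it exists in the LDM today.
Registered feeds are pushed and processed as they arrive, while unregistered
products are fetched on demand, with the provider deciding how long a
product remains fetchable:

```python
# Purely conceptual sketch of a hybrid "standing order" / demand-driven data
# store.  Every name here is invented for illustration; no such interface
# exists in the LDM today.

class HybridDataStore:
    def __init__(self, fetch_remote):
        self.standing_orders = set()   # feeds delivered and processed ASAP
        self.local = {}                # products already held locally
        self.fetch_remote = fetch_remote

    def register(self, feed):
        """Place a standing order: products in this feed are pushed to us."""
        self.standing_orders.add(feed)

    def deliver(self, feed, product_id, data):
        """Upstream push path, used only for feeds with standing orders."""
        if feed in self.standing_orders:
            self.local[(feed, product_id)] = data

    def open(self, feed, product_id):
        """Demand-driven path: fetch from the provider if not held locally.
        How long a product remains fetchable is the provider's decision."""
        key = (feed, product_id)
        if key not in self.local:
            self.local[key] = self.fetch_remote(feed, product_id)
        return self.local[key]

# Example with a stand-in for the provider's archive:
store = HybridDataStore(fetch_remote=lambda feed, pid: b"...product bytes...")
store.register("DDS")                               # pushed, processed ASAP
store.deliver("DDS", "surface obs 1200Z", b"obs")   # arrives via the push path
grid = store.open("HDS", "ETA grid 212")            # not registered: on demand
```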
Another possible approach would involve the addition of
dynamic routing to configure the IDD, with each site doing
regular ldmpings of a set of upstream nodes and getting
its data from the source that seems to offer the best connectivity.
If properly configured, this would provide automatic load
balancing (so when things get bad, everybody shares in
the bad latencies) and would adapt to changes in network
connectivity and outages much more quickly than our
current manual system. This approach also needs more
development, and we would have to be careful to prevent
data loops and thrashing.
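A minimal sketch of the upstream-selection logic follows, assuming periodic
latency measurements such as ldmping could provide. The host names and the
switching threshold are assumptions for illustration, and loop prevention is
not addressed here:

```python
# Conceptual sketch of picking an upstream host from periodic latency
# measurements (such as ldmping could provide), with a simple hysteresis
# margin so a site does not flip-flop between hosts ("thrashing").  Host
# names and the threshold are assumptions; loop prevention is not handled.

SWITCH_MARGIN = 0.5   # seconds: only switch if the alternative is clearly better

def choose_upstream(current, latencies):
    """latencies maps host -> most recent round-trip time in seconds,
    or None if the host is unreachable.  Returns the host to feed from."""
    reachable = {h: t for h, t in latencies.items() if t is not None}
    if not reachable:
        return current                       # nothing reachable; stay put
    best = min(reachable, key=reachable.get)
    current_t = reachable.get(current)
    if current_t is None:
        return best                          # current upstream is down; fail over
    if reachable[best] + SWITCH_MARGIN < current_t:
        return best                          # clearly better; switch
    return current                           # not enough improvement; avoid thrashing

# Example: relay-b looks much better than the current upstream, so we switch.
measurements = {"relay-a.example.edu": 3.2,
                "relay-b.example.edu": 0.4,
                "relay-c.example.edu": None}
print(choose_upstream("relay-a.example.edu", measurements))
```

The switching margin is what keeps a site from bouncing between two hosts
whose latencies are nearly equal; picking its value would take some
experimentation.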
Jerry Guynes of Texas A&M proposed an idea that involves
establishing 12 top-tier sites which would receive data
via satellite feeds from Alden and then relay the data
via IDD to other sites within their Internet "clean regions".
We are currently looking into the questions of how the
"clean regions" are determined, how the system could be
reconfigured if the topology of the underlying Internet
changes and different top-tier sites are needed, and
how expensive it would be to set up and maintain such a
system. Even if we can't afford the satellite part of
such a system for 12 sites, the information as to how
to determine "clean regions" could be very helpful in
determining an optimum topology for the current IDD.
BUT PLEASE KEEP IN MIND THAT THIS IS STILL CONTINGENCY
PLANNING AT THIS POINT IN TIME.
This page was Webified by Ben Domenico <ben@unidata.ucar.edu>.
Questions or comments can be sent to <support@unidata.ucar.edu>.