IDD Contingency Ideas
Background
Since the NSFNet backbone was turned over to "commodity"
Internet service providers on April 30, 1995, the overall
performance of the network has been somewhat shaky.
Nevertheless, since the installation of LDM 5 at most relays
(starting in December 1995), we could, for the most part,
count on the majority of our relay sites getting
100% of the data.
However, since late February, 1996, the service in some
areas has degraded enough that it is beginning to show
in the long term IDD statistics as losses of at least
some data products at relay sites during the busy
afternoon hours.
Moreover, one Internet administrator estimates that the
traffic on the Internet is doubling every 4 months. With
that sort of increase in demand, service may degrade
further before the providers can increase the capacity
to the point where it can keep up with the demand.
Consequently, we have been having discussions at the
UPC and among some of the sites to come up with
ideas for what might be done if the reliability of
the current datastreams drops below an acceptable level.
Indeed, for some sites it has already dropped below
that level, and they are taking action on their own.
This writeup is intended to open a discussion as
to whether some coordinated action by the entire
community can be undertaken if it proves necessary.
Long Term Options
Some options may be useful for improving IDD reliability
in the long term, but they do not appear to be mature
enough to help us at the moment. These include
technology and infrastructure that is still in the research
or experimental stage:
- Reliable Multicast Protocol (RMP)
- Resource Reservation Protocol (RSVP)
- The very high speed Backbone Network Service (vBNS)
Impractical or Unacceptable Options
Several possibilities have been discussed, but have been
rejected or indefinitely tabled for various reasons. These
include:
- Returning to a satellite broadcast system
- Going to a centralized data distribution center
- Building our own Unidata community network
based on Frame Relay or dedicated lines
- Attempting to get NSF to reestablish something like
the NSFnet
Short Term Contingency Options
Of the ideas which have come under discussion, the two
that seem to be the most practical to undertake as
temporary, emergency measures have to do with reconfiguring
the LDM and IDD.
Increase product queue size to hold data longer
Perhaps the easiest solution is to increase the product queue size on the
relay nodes so that data remains available longer for downstream sites
that fall behind. If we know that latencies, on average, max out at some
value (say, 2 hours), we can increase the queue size to deal with it.
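To make the arithmetic concrete, here is a back-of-the-envelope sketch in
Python. The feed rate is an assumed, illustrative number, not a measured
value for any particular IDD stream:

```python
# Back-of-the-envelope queue sizing.  The feed rate below is an assumed,
# illustrative number, not a measured value for any particular IDD stream.

feed_rate_mb_per_hour = 50.0   # assumed aggregate rate of the requested feeds
max_latency_hours = 2.0        # longest delay we expect to have to ride out
safety_factor = 1.5            # headroom for bursts around synoptic times

queue_size_mb = feed_rate_mb_per_hour * max_latency_hours * safety_factor
print(f"Suggested product queue size: about {queue_size_mb:.0f} MB")
```

Under these assumptions a relay would need a queue of roughly 150MB, which
motivates the memory concern raised in the list below.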
This approach is not without its drawbacks:
- It is only effective if the latencies are cyclic. If a site
is always behind, a larger upstream queue will do nothing to
alleviate the problem.
- An increase in queue size will require a corresponding
increase in memory resource allocation, since we memory-map
the queue file. Given that queue sizes may need to exceed
100MB, the amount of RAM and swap needed may be beyond the
reach of many of the relay sites.
- There is another issue here; namely, are delays of more than an
hour acceptable to the community? We've discussed this before
as a philosophical issue and have not come to any firm conclusion.
For example, sites might be willing to delay the HDS if they
could get the station data in a timely fashion. As I recall, the
idea of prioritizing the data products/streams has come up (a rough
sketch of that idea follows this list), but that would significantly
complicate the LDM.
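For what it's worth, the following is a minimal sketch of what prioritized
relaying might look like. Per-feed prioritization is not an LDM feature;
the feed names and priority values are assumptions made purely for
illustration:

```python
import heapq

# Hypothetical per-feed priorities: lower number = relay first.  Per-feed
# prioritization is NOT an LDM feature; this only illustrates the idea.
FEED_PRIORITY = {
    "DDS": 0,   # station data: most time-critical
    "HDS": 1,   # model output: sites might tolerate delaying it
}

def relay_order(backlog):
    """Yield (feed, product) pairs from a backlog, most time-critical first.
    The sequence number keeps ordering stable within a priority level."""
    heap = [(FEED_PRIORITY.get(feed, 99), seq, feed, product)
            for seq, (feed, product) in enumerate(backlog)]
    heapq.heapify(heap)
    while heap:
        _, _, feed, product = heapq.heappop(heap)
        yield feed, product

# Example: station data jumps ahead of the model grids when we fall behind.
backlog = [("HDS", "ETA grid"), ("DDS", "surface obs"), ("HDS", "NGM grid")]
for feed, product in relay_order(backlog):
    print(feed, product)
```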
Cut down on the amount of data handled by the relay sites
In effect, by cutting down on the amount of data you ask for,
you are deciding ahead of time which products you will do
without, rather than losing products randomly during the
busy times of the day.
Some leaf nodes that have special difficulty getting data
reliably have already begun to experiment with this approach.
We'll attempt to keep track of their experiences.
Of course, leaf nodes can make these decisions on their own.
However, if it becomes necessary to take such action at
relay nodes, then careful coordination is needed, because the
decision at a relay affects all the sites downstream of it.
Cutting back on data at the relays could be done in
a number of ways. The simplest, of course, would be to come to
a community-wide decision as to which products are of lower
priority and simply cease injecting those products into the
IDD at the source. If there were general agreement
that certain of the NCEP model data are less important
than others, that would greatly simplify the process.
However, if there is not sufficient general agreement on
which products can be cut out, an alternative would be
to structure the FOS distribution topology around groups
of sites that agree on the set of products they want.
For example, one branch of the distribution might carry
only the DDS, for sites that are not interested in real-time
model forecasts. Another branch might want certain of the models,
but not all of them. If, on the other hand, certain of the
relay sites do have the requisite bandwidth, they might
still be able to relay the entire HDS stream to other
sites that can receive it reliably.
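As an illustration only, the sketch below shows one way such per-branch
product sets might be expressed. The branch names and regular-expression
patterns are hypothetical, and this is not LDM configuration syntax:

```python
import re

# Hypothetical per-branch policies: each distribution branch agrees on the
# feeds (and, within a feed, the product patterns) it is willing to carry.
# Branch names and patterns are illustrative only; this is not LDM
# configuration syntax.
BRANCH_POLICY = {
    "dds-only":  {"DDS": re.compile(r".*")},            # station data only
    "dds+eta":   {"DDS": re.compile(r".*"),
                  "HDS": re.compile(r"ETA")},            # only some of the models
    "full-feed": {"DDS": re.compile(r".*"),
                  "HDS": re.compile(r".*")},             # sites with the bandwidth
}

def wanted(branch, feed, product_id):
    """Return True if the given branch has agreed to carry this product."""
    pattern = BRANCH_POLICY[branch].get(feed)
    return pattern is not None and pattern.search(product_id) is not None

print(wanted("dds-only", "HDS", "ETA grid 212"))   # False: no model data
print(wanted("dds+eta",  "HDS", "ETA grid 212"))   # True
```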
There are probably many other options as well. The
purpose of this writeup is not to provide a definitive or
complete list of options, but to stimulate discussion among the
user community so as to give the UPC guidance in developing
a contingency plan for the IDD.
Other Options
In the slightly longer term, there are other possibilities
to consider.
One somewhat radically different idea that is under
discussion at the UPC would be to develop
a hybrid data-flow/demand-driven distributed file system. A
downstream system could register "standing orders" for certain
products which would be delivered and processed in the current ASAP
manner. Products not so registered could be obtained and processed
on an ad-hoc basis by accessing a seemingly local file system.
The accessible lifetime of a product would be decided by the data
provider. Of course this would require considerable
development effort.
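The following is a very rough conceptual sketch of such a hybrid system,
using an invented interface; nothing like it exists in the LDM today.
Registered feeds are pushed and processed as they arrive, while unregistered
products are fetched on demand, with the provider deciding how long a
product remains fetchable:

```python
# Purely conceptual sketch of a hybrid "standing order" / demand-driven data
# store.  Every name here is invented for illustration; no such interface
# exists in the LDM today.

class HybridDataStore:
    def __init__(self, fetch_remote):
        self.standing_orders = set()   # feeds delivered and processed ASAP
        self.local = {}                # products already held locally
        self.fetch_remote = fetch_remote

    def register(self, feed):
        """Place a standing order: products in this feed are pushed to us."""
        self.standing_orders.add(feed)

    def deliver(self, feed, product_id, data):
        """Upstream push path, used only for feeds with standing orders."""
        if feed in self.standing_orders:
            self.local[(feed, product_id)] = data

    def open(self, feed, product_id):
        """Demand-driven path: fetch from the provider if not held locally.
        How long a product remains fetchable is the provider's decision."""
        key = (feed, product_id)
        if key not in self.local:
            self.local[key] = self.fetch_remote(feed, product_id)
        return self.local[key]

# Example with a stand-in for the provider's archive:
store = HybridDataStore(fetch_remote=lambda feed, pid: b"...product bytes...")
store.register("DDS")                               # pushed, processed ASAP
store.deliver("DDS", "surface obs 1200Z", b"obs")   # arrives via the push path
grid = store.open("HDS", "ETA grid 212")            # not registered: on demand
```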
Another possible approach would involve the addition of
dynamic routing to configure the IDD, with each site doing
regular ldmpings of a set of upstream nodes and getting
its data from the source that seems to offer the best connectivity.
If properly configured, this would provide automatic load
balancing (so when things get bad, everybody shares in
the bad latencies) and would adapt to changes in network
connectivity and outages much more quickly than our
current manual system. This approach also needs more
development, and we would have to be careful to prevent
data loops and thrashing.
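A minimal sketch of the upstream-selection logic follows, assuming periodic
latency measurements such as ldmping could provide. The host names and the
switching threshold are assumptions for illustration, and loop prevention is
not addressed here:

```python
# Conceptual sketch of picking an upstream host from periodic latency
# measurements (such as ldmping could provide), with a simple hysteresis
# margin so a site does not flip-flop between hosts ("thrashing").  Host
# names and the threshold are assumptions; loop prevention is not handled.

SWITCH_MARGIN = 0.5   # seconds: only switch if the alternative is clearly better

def choose_upstream(current, latencies):
    """latencies maps host -> most recent round-trip time in seconds,
    or None if the host is unreachable.  Returns the host to feed from."""
    reachable = {h: t for h, t in latencies.items() if t is not None}
    if not reachable:
        return current                       # nothing reachable; stay put
    best = min(reachable, key=reachable.get)
    current_t = reachable.get(current)
    if current_t is None:
        return best                          # current upstream is down; fail over
    if reachable[best] + SWITCH_MARGIN < current_t:
        return best                          # clearly better; switch
    return current                           # not enough improvement; avoid thrashing

# Example: relay-b looks much better than the current upstream, so we switch.
measurements = {"relay-a.example.edu": 3.2,
                "relay-b.example.edu": 0.4,
                "relay-c.example.edu": None}
print(choose_upstream("relay-a.example.edu", measurements))
```

The switching margin is what keeps a site from bouncing between two hosts
whose latencies are nearly equal; picking its value would take some
experimentation.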
Jerry Guynes of Texas A&M proposed an idea that involves
establishing 12 top-tier sites which would receive data
via satellite feeds from Alden and then relay the data
via IDD to other sites within their Internet "clean regions".
We are currently looking into the questions of how the
"clean regions" are determined, how the system could be
reconfigured if the topology of the underlying Internet
changes and different top-tier sites are needed, and
how expensive it would be to set up and maintain such a
system. Even if we can't afford the satellite part of
such a system for 12 sites, the information as to how
to determine "clean regions" could be very helpful in
determining an optimum topology for the current IDD.
BUT PLEASE KEEP IN MIND THAT THIS IS STILL CONTINGENCY
PLANNING AT THIS POINT IN TIME.
This page was Webified by Ben Domenico <ben@unidata.ucar.edu>.
Questions or comments can be sent to <support@unidata.ucar.edu>.