
IDD Delays: Latency vs. Bandwidth (fwd)

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Thu, 16 Nov 2000 10:08:40 -0700
From: Russ Rew <address@hidden>
To: address@hidden,
     Tom McDermott <address@hidden>,
     Tim Doggett <address@hidden>
Subject: IDD Delays: Latency vs. Bandwidth

Hi,

On Wed, 15 Nov 2000, Tom McDermott wrote:

> On Wed, 15 Nov 2000, Jim Koermer wrote:
>
> > During these episodes, my upstream site(s) is usually quite good with
> > FOS latencies < 1 minute.
> 
> This would seem to point to either a bad network connection to your
> upstream host or, more likely (since I believe you said this occurs
> with both your primary and failover feeds), limited bandwidth at your
> site.

Thanks to Jim Koermer, Tom McDermott, and Tim Doggett for raising some
important issues and providing a clear analysis of some of the causes.
A few additional observations about bandwidth versus latency might be
useful for troubleshooting and configuring LDM sites for the IDD.

First, a clarification: what we're calling the "FOS data stream" in
this discussion comes from the NOAAPORT NWSTG channel.  When these
products are injected into the IDD, text products are currently
tagged with the "IDS|DDPLUS" feedtype and binary products with the
"HDS" feedtype.  There are many more of these products than there
were on the old IDS|DDPLUS|HDS feeds in the Family of Services.

The main point I want to make is that the number of products per time
interval may be more important than the volume of the products as a
cause of delays.  Another way of saying this is that the network
latency may be more important than the aggregate bandwidth for a
network connection in determining IDD delays.

Sending each product always requires at least one remote procedure
call (RPC, a round-trip network transaction) from the upstream to the
downstream site, so the rate at which even small products can be sent
from one host to another is limited by the maximum number of RPCs per
second over the network connection between the hosts.  The time for a
single RPC call is what ldmping measures, and this is close to the
time required to send a single small product.  So you can determine
the maximum number of products per second a downstream host can
accept from your LDM by taking the reciprocal of the ldmping time to
that host (ignoring the first few times ldmping reports, until it
settles down to a steady state).  Similarly, the reciprocal of the
ldmping
time from an upstream site is a good approximation for how many
products per second you can get from that site.
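
To make the arithmetic concrete, here is a small illustrative Python
sketch (not part of the LDM distribution) that turns a series of
ldmping elapsed times, such as the ones in the transcript below, into
an estimated ceiling on products per second:

 # Illustrative sketch: estimate the maximum products/second a host
 # can accept, as the reciprocal of its steady-state ldmping time.

 def max_product_rate(elapsed_times, skip=1):
     """Ignore the first few ldmping reports, then take the
     reciprocal of the mean steady-state RPC time."""
     steady = elapsed_times[skip:]
     mean_rpc = sum(steady) / len(steady)
     return 1.0 / mean_rpc

 # Elapsed times (seconds) from the mammatus.plymouth.edu transcript below.
 times = [1.204837, 0.241447, 0.222650, 0.228247, 0.212776, 0.204985]
 print("ceiling: %.1f products/second" % max_product_rate(times))
 # Prints a ceiling of about 4.5 products/second for these times.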

During some hours the rate for FOS products can be as high as 5.4
products/second, though the long-term average is about 3.1
products/second.

If we had been feeding FOS products to Jim Koermer's LDM during one of
the times when it was experiencing high latency, ldmping indicates it
would only have been able to handle about 5 products/second:

 test$ ldmping -h mammatus.plymouth.edu -i 1
 Nov 15 15:42:55      State    Elapsed Port   Remote_Host           rpc_stat
 Nov 15 15:42:56 RESPONDING   1.204837 4677   mammatus.plymouth.edu  
 Nov 15 15:42:57 RESPONDING   0.241447 4677   mammatus.plymouth.edu  
 Nov 15 15:42:58 RESPONDING   0.222650 4677   mammatus.plymouth.edu  
 Nov 15 15:42:59 RESPONDING   0.228247 4677   mammatus.plymouth.edu  
 Nov 15 15:43:01 RESPONDING   0.212776 4677   mammatus.plymouth.edu  
 Nov 15 15:43:02 RESPONDING   0.204985 4677   mammatus.plymouth.edu  
 ...

This shows that each RPC call takes about 0.2 seconds, so only 1/0.2,
or about 5 products per second, can be received.  Later in the same
afternoon, the ldmping times climbed even higher, to about 0.35
seconds per RPC call, at which point the host could only keep up with
about 3 products per second.

When the RPC rate is less than the rate at which products are injected
into the data stream, products back up at the upstream sender
process, until it ultimately gets a RECLASS (a message from the
downstream host indicating the offered products are too old) and jumps
to the start of the queue to send current data, dropping the
intervening products.
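
To see how quickly even a small rate shortfall snowballs, here is a
deliberately simplified Python sketch using the peak FOS rate above
and a 0.2-second RPC time; the LDM's actual queueing and RECLASS
mechanics are more involved:

 # Simplified sketch of backlog growth when the RPC-limited send rate
 # falls below the product injection rate.

 inject_rate = 5.4   # products/second injected (peak FOS rate above)
 send_rate = 5.0     # products/second accepted (1 / 0.2 s per RPC)

 backlog = 0.0
 for minute in range(1, 6):
     backlog += (inject_rate - send_rate) * 60   # products queued this minute
     latency = backlog / send_rate               # seconds behind real time
     print("after %d min: %3.0f products queued, %4.1f s behind"
           % (minute, backlog, latency))
 # The delay grows steadily until the downstream host sends a RECLASS
 # and the upstream skips ahead, dropping the intervening products.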

Other sites that don't see such high product latencies typically have
much smaller ldmping times; for example, here is Plymouth's upstream
site:

 test$ ldmping -h squall.atmos.uiuc.edu -i 1
 Nov 15 17:07:41      State    Elapsed Port   Remote_Host           rpc_stat
 Nov 15 17:07:41 RESPONDING   0.099968  388   squall.atmos.uiuc.edu  
 Nov 15 17:07:42 RESPONDING   0.030012  388   squall.atmos.uiuc.edu  
 Nov 15 17:07:43 RESPONDING   0.029179  388   squall.atmos.uiuc.edu  
 Nov 15 17:07:44 RESPONDING   0.029559  388   squall.atmos.uiuc.edu  
 Nov 15 17:07:45 RESPONDING   0.030265  388   squall.atmos.uiuc.edu  
 ...

which means an RPC call to this host takes about 0.03 seconds, so it
can accept about 33 products per second.

These example network latencies are measured from here rather than
from the upstream IDD host, but they are probably representative.  It
would be instructive for most sites to get an ldmping log from their
upstream site for a 24-hour period, to see how network latencies vary.
Using "ldmping -i 5 -h hostname" would give latencies every 5 seconds,
and a 24-hour log would take about 1 MB of disk.  Latencies vary
widely, so if the current latency is low but was high during the
previous hour, LDM products from an upstream host may still be
arriving late, because it takes time to catch up with a backlog of
products.
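
If you do collect such a log, a short script can summarize it.  The
Python sketch below is illustrative only; it assumes the log lines
look exactly like the ldmping transcripts above:

 # Illustrative sketch: summarize an ldmping log (lines of the form
 # "Nov 15 17:07:41 RESPONDING   0.099968  388   host") into per-hour
 # mean and maximum latencies.

 from collections import defaultdict

 def summarize(logfile):
     by_hour = defaultdict(list)
     with open(logfile) as f:
         for line in f:
             fields = line.split()
             if len(fields) >= 5 and fields[3] == "RESPONDING":
                 hour = fields[2].split(":")[0]        # e.g. "17"
                 by_hour[hour].append(float(fields[4]))
     for hour in sorted(by_hour):
         t = by_hour[hour]
         print("hour %s: mean %.3f s, max %.3f s, %d samples"
               % (hour, sum(t) / len(t), max(t), len(t)))

 summarize("ldmping.log")  # e.g. from: ldmping -i 5 -h hostname > ldmping.log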

Unfortunately, network latencies are not necessarily symmetric, so
running ldmping from a downstream host to an upstream LDM won't always
give a good approximation of the network latency in the other
direction.

This RPC latency as measured by ldmping may be the limiting factor for
many sites, rather than the volume of the data.  Here are some recent
maximum hourly product rates for common feed types:

 Feed type     prods/sec

 WSI           6.7 (all products, only distributed from WSI)
 NMC2          6.2 (CONDUIT model data, limited distribution)
 HDS           3.8
 IDS|DDPLUS    2.3 
 NNEXRAD       1.7 (NOAAPORT NEXRAD, available unencrypted in 2001)

Some of these rates can vary significantly at different times of the
day.  For example, the HDS rate varied from 0.3 to 3.8 products/second
during different hours of this period.  This means that summing the
rates above gives worst-case figures, since the peak rates for
different feeds may occur at different times.  For example, you might
think from the above that
the worst case for FOS is obtained by adding HDS and IDS|DDPLUS rates
to get 6.1 products/second, but the highest rate for the combined feed
we have seen recently is only 5.4 products/second.  Nevertheless, if
the ldmping time from your upstream site is greater than about 1/6.1
or 0.16 seconds, you might not be able to keep up with the FOS data,
even if that is the only data you are getting.  Over brief intervals
data products can come in at much higher rates; we have occasionally
seen over 180 products/second on motherlode.ucar.edu.
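
Putting the table together with the RPC ceiling, a quick illustrative
check (the 0.2-second ldmping time is just an example value) shows how
to flag feed combinations a connection cannot sustain; remember that
the summed peaks are a worst case:

 # Illustrative sketch: compare per-feed peak rates (from the table
 # above) against the RPC rate ceiling implied by an ldmping time.

 rpc_time = 0.2          # seconds per RPC, from ldmping (example value)
 ceiling = 1.0 / rpc_time

 peak_rates = {          # products/second, worst-case hourly peaks
     "HDS": 3.8,
     "IDS|DDPLUS": 2.3,
     "NNEXRAD": 1.7,
 }

 for feed, rate in peak_rates.items():
     status = "ok" if rate <= ceiling else "cannot keep up"
     print("%-11s %3.1f products/s: %s" % (feed, rate, status))

 total = sum(peak_rates.values())
 print("all feeds   %3.1f products/s vs. ceiling %.1f: %s"
       % (total, ceiling, "ok" if total <= ceiling else "cannot keep up"))
 # With a 0.2 s RPC time the ceiling is 5 products/second, so this
 # 7.8 products/second worst case could not be sustained.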

By comparison, the MCIDAS data stream sends a maximum of about 0.005
products/second, so it is not a factor in these latency calculations,
even though it contains large products.

So, what can a site do if its latency indicates it can't receive
products as fast as they are injected into the IDD?

First, you can try to determine the cause of the high latencies and
correct it, using ldmping as a measuring tool to evaluate proposed
improvements.

Second, you can request less data from upstream nodes.  Eliminating a
high-rate feed by not requesting that feedtype is the best way to do
this, but if you're relaying data to downstream sites, you can't
eliminate a feed that downstream LDMs need.  If you can get agreement
from your downstream sites, and from any sites that might fail over
to you, to eliminate a feedtype, that might help you and your
downstream sites.  As Tom McDermott pointed out, if you're a leaf
node you have the freedom to request just the subset of data you
need.  And you can use patterns within feedtypes in your ldmd.conf
configuration file to request a subset of the data products in a
feed, as in the example below.
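
For illustration only (the upstream hostname and the pattern here are
hypothetical, not recommendations), an ldmd.conf request line names a
feedtype, a regular expression matched against product identifiers,
and an upstream host:

 # Requesting the full feed from a hypothetical upstream host:
 #   request IDS|DDPLUS ".*" upstream.site.edu
 # Requesting only products whose identifiers match a pattern, which
 # cuts the product rate a leaf node must keep up with:
 request IDS|DDPLUS "^SA" upstream.site.edu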

Finally, I should point out that the rate for NNEXRAD shown above (an
additional 1.7 products/second) may increase as more products are
added to the space made available by compressing products after
January.  We're currently trying to evaluate the effect that the
imminent introduction of the NNEXRAD feed will have on relay sites
and the IDD.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu