
Re: 20010515: Slow Downstream Node Problem



>To: address@hidden
>From: Paul Hamer <address@hidden>
>Subject: Slow Downstream Node Problem
>Organization: NOAA/FSL
>Keywords: 200105152257.f4FMvRp04890

Hi Paul,

> We've been experiencing the "pq_del_oldest: conflict" message
> problem, and reading the web page you have describing it got
> me thinking that there must be a better solution for the
> slow or flaky network connection element.
> 
> The net result of the slow downstream feed is that the incoming
> data is delayed (possibly lost) due to the inability of the LDM
> to make space in the product queue by deleting the oldest data.
> It seems to me that the downstream side of things should
> basically be told to disconnect and reconnect at the latest
> (newest) end of your product queue. Better for a customer
> of your LDM to lose data than for you to, certainly for us anyway.

I agree.  I was surprised to discover that instead of jumping to the
newest end of the product queue, the downstream client just jumps
ahead one minute, so the problem quickly recurs.  Currently, though, I
think the downstream node must determine on its own that it has fallen
behind and jump ahead, rather than being told to do so by the upstream
node (is this right, Anne?).

In some cases the problem can be alleviated or eliminated by
increasing the size of the product queue to hold significantly more
than an hour's worth of data (or whatever time period the downstream
node is configured to request), since in that case the downstream node
would not be locking the oldest product in the queue but an hour-old
product somewhere in the middle of the queue.  When it got the
product, it would recognize it as too old, disconnect, and send a
RECLASS message asking for newer products, but products could still be
deleted to make room at the old end of the queue.

But I think a better solution would be for the downstream node to jump
to the newest end of the queue (or maybe halfway there, since the pq
library can access products quickly by time) instead of just a minute
ahead.  But this still leaves the possibility that the downstream node
gets stuck and keeps a lock on an hour-old product until it really is
the oldest product in the queue, causing the upstream node to lose
data.

> I started looking at the code and realised that this might not
> be easy to do. After all, how do you know which connection has
> obtained the resource lock? Then I thought that maybe you don't
> have to know. If you signal the "pq_del_oldest: conflict" to the
> process group, i.e., let everyone know you've seen EAGAIN, then
> in handling the signal each process checks the following:
> 
> 1. Does it have a lock? If not, continue.
> 2. If so, is the lock on the oldest queue member? If not, continue.
> 3. If it is the oldest, free the resource, reset the pq cursor, and
> disconnect from the peer.
> 
> I was thinking about trying to implement this but I don't have
> any time available, certainly not in the near future, so I was
> wondering if you've been considering this?

This seems like a good idea, and one we hadn't considered.  The way we
had planned to fix the problem instead was to just try to delete the
next oldest product if there is a lock on the oldest product:

  http://www.unidata.ucar.edu/staff/russ/tmp/pq_del_oldest.html

but I think your solution may be better.  We'll discuss this and see
if we can find the resources to try implementing your solution instead.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu