
Re: 20001030: More on queue problem



>To: Russ Rew <address@hidden>
>Cc: address@hidden, address@hidden,
>   address@hidden, address@hidden
>From: Tom McDermott <address@hidden>
>Subject: Re: 20001030: More on queue problem
>Keywords: 200010301740.e9UHeN411741 pq_del_oldest conflict

Tom,

> On Thu, 2 Nov 2000, Russ Rew wrote:
> 
> > We'll still be interested if you could note whether there is any pqact
> > delay when you get the "pq_del_oldest: conflict " message.  That would
> > be the first step to knowing whether it's pqact or something else
> > that's the culprit.
> 
> We just had another occurrence here at 1537Z.  I didn't notice until 5
> minutes later at 1542Z, when I had the pqact log go verbose.  Strangely
> enough, the pqact time stamps were 1 second _ahead_ of the ldmadmin watch
> time stamps. Pqmon reported oldest product 4016 seconds, so I kind of
> doubt that pqact could fall over 1 hour behind and then catch up
> completely in 5 minutes.  So I tentatively rule out pqact as being the
> cause for now.

Yes, that makes it fairly certain we can rule out pqact.

 ... 
> > Maybe on your system it's not pqact that has the lock on the oldest
> > product.  A sender process could fall behind the feed rates if the
> > network connection to the downstream site is flaky or congested.
> > Although we haven't seen that here, a slow sender would have the same
> > symptom of locking each product as it sends it, so the oldest product
> > would eventually be locked, causing a conflict with incoming products.
> 
> Now this is the first possibility that you've mentioned that seems as if
> it maybe could apply to our system.  One of our downstream sites, Moravian
> College, apparently has limited bandwidth.  We have RECLASS messages in our
> log continuously for them throughout the day except in the early hours of
> the morning after the 6Z models have come in.  Also lots of
> 'h_clnt_call:catwoman.cs.moravian.edu' messages; disconnects, 20 since
> midnight and the latest at 15:29:22Z.  But there is nothing new about
> this; it's been going on for a long time.
> 
> Something that might help us and Moravian perhaps even more would be for
> them to use the filter Unidata suggested that limited bandwidth sites use
> in their 'request' lines in ldmd.conf at the time we started getting
> NOAAport.  Their request to us for FSL2|UNIDATA is ".*".  (We employ a
> filter in our request to Cornell, but it's different from the Unidata
> filter, since we get some 213 and 212 grids, for example.)
> 
> Something else I'm thinking of trying is going back to 180MB queue size,
> since the problem first arose after I reduced it to 150MB.  I'm thinking
> that if I do this, then the end of the queue should be much older than the
> region that the Moravian feed process is trying to access (typically
> around 61 minutes old if we can believe the RECLASS messages).  Does this
> make any sense to you?

Yes, decreasing the size of your queue might mean that occasionally
you hold less than an hour's worth of data, so at that point Moravian
would still be interested in the less-than-one-hour-old products you
are sending and a conflict could happen.  If you make your queue big
enough so that it never dips below an hour's worth of data, then when
the process that sends products to Moravian gets an hour behind, the
Moravian site will send a RECLASS, causing the sender to jump to the
front of the queue and send new products again, and there will never
be a locking conflict.
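
As a rough sketch of that reasoning, here is a back-of-the-envelope
check in Python.  It is not part of the LDM: the function names are
made up for illustration, the one-hour cutoff is taken from the
RECLASS behavior described above, and the ingest rate in the usage
note is an invented number rather than a measurement from your system.

```python
# Hypothetical helper, not part of the LDM distribution.
# Assumption: a downstream site sends a RECLASS once the sender is
# about one hour behind, as described in the message above.
MAX_LATENCY_S = 3600


def queue_age_seconds(queue_mb, ingest_mb_per_hour):
    """Approximate age of the oldest product in a full queue.

    A full queue of queue_mb megabytes that fills at
    ingest_mb_per_hour holds roughly queue_mb / ingest_mb_per_hour
    hours of data.
    """
    return queue_mb / ingest_mb_per_hour * 3600.0


def conflict_possible(queue_mb, peak_ingest_mb_per_hour,
                      max_latency_s=MAX_LATENCY_S):
    """True if the queue can hold less than max_latency_s of data.

    In that case a slow sender may still be locking a product when
    it becomes the oldest one, so deleting it to make room for an
    incoming product could conflict.
    """
    return queue_age_seconds(queue_mb, peak_ingest_mb_per_hour) < max_latency_s
```

For example, with an assumed peak rate of 170 MB per hour,
conflict_possible(150, 170) is true while conflict_possible(180, 170)
is false, which matches the idea that the larger queue keeps its
oldest product more than an hour old.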

Thanks for helping resolve this.  We understand the problem better
now, and are creating a troubleshooting page to make it clearer what's
going on for others who may encounter this message in the future:

  http://www.unidata.ucar.edu/packages/ldm/troubleshooting/pq_del_oldest.html

Please let us know if you can suggest improvements to this explanation
as a result of your experience with the problem ...

--Russ