
Re: 20001030: More on queue problem



>To: address@hidden
>From: Tom McDermott <address@hidden>
>Subject: LDM: queue problem
>Organization: UCAR/Unidata
>Keywords: 200010301740.e9UHeN411741 pq_del_oldest conflict

Tom,

> > This turned out not to be a problem with either the upstream or
> > downstream queue, but a symptom that appears when pqact cannot keep up
> > with the product streams.

> Because pqact is holding locks on the portion of the queue it hasn't
> processed yet, and so the ingester processes are unable to delete the
> oldest entries to make room for new ones?

Not quite; pqact is holding a single lock on the region of the queue
containing the product it is currently processing.  pqact gets this lock
to make sure some other process doesn't delete or overwrite the
product while pqact is processing it.
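
Here's a minimal sketch of that kind of per-product region lock,
assuming POSIX record locking on the queue file (the actual locking
code in pq.c may work differently):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Lock or unlock the byte range [start, start+len) of the queue
     * file.  type is F_RDLCK, F_WRLCK, or F_UNLCK; F_SETLKW waits
     * until the lock can be granted. */
    static int
    lock_region(int fd, off_t start, off_t len, short type)
    {
        struct flock fl;
        fl.l_type   = type;
        fl.l_whence = SEEK_SET;
        fl.l_start  = start;
        fl.l_len    = len;
        return fcntl(fd, F_SETLKW, &fl);
    }

    /* pqact-style usage:
     *   lock_region(qfd, prod_off, prod_extent, F_RDLCK);
     *   ... process the product ...
     *   lock_region(qfd, prod_off, prod_extent, F_UNLCK);
     */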

When that locked product is also the oldest one in a full queue, the
receiver (ingester) process can't delete it to make room, so it can't
insert the new data product.  It might be better to modify the code
to delete the oldest non-locked product instead of giving up when the
oldest product is locked.  That would allow a new product to be
inserted even when the oldest product was locked, but it would also
delete products from the queue that pqact hasn't had a chance to
process yet.  Is it better to stop inserting products that can be
recovered later, or to delete, and hence never process, products that
have already been received?  I don't know which is preferable, but
the LDM currently does the former ...
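
If we ever tried the second policy, the change would look roughly
like this minimal sketch.  The product struct and linked list here
are made up for illustration (the real queue is a memory-mapped file
managed by pq.c); the idea is just to skip past locked products
instead of giving up:

    #include <stdlib.h>

    typedef struct product {
        struct product *newer;   /* next-newer product in the queue */
        int locked;              /* nonzero while a reader holds it */
        /* ... product data ... */
    } product;

    /* Delete the oldest product that no reader has locked.
     * Returns 0 if one was freed, -1 if every product is locked. */
    static int
    del_oldest_unlocked(product **oldest)
    {
        product **pp = oldest;

        while (*pp != NULL) {
            if (!(*pp)->locked) {
                product *victim = *pp;
                *pp = victim->newer;   /* unlink from the queue */
                free(victim);          /* reclaim space for new data */
                return 0;
            }
            pp = &(*pp)->newer;
        }
        return -1;                     /* no room: same conflict as today */
    }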

> I tried this [computing pqact delay] out just now while the 12Z AVN
> is coming in: the time stamps are identical.  So if a pqact slowdown
> really was the cause, it's not happening right now.

We saw the pqact delay hover between 0 and 30 seconds for a while,
then start climbing around 7:20 pm local time, for no obvious reason;
perhaps it had been only barely keeping up, and a scour script that
started at about the same time was enough to push it over the edge.
Even then it took several hours to fall far enough behind to get
close to the end of the queue, and sometimes it would run a little
faster than the incoming products, lowering the delay for a while.

I plan to change pqact to report this delay in its verbose logging;
right now it's a pain to monitor.
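
For reference, the delay is just the difference between the current
time and the time the product was inserted into the queue, something
like this sketch (the metadata struct and field name here are
assumptions, not the actual prod_info layout):

    #include <stdio.h>
    #include <time.h>

    typedef struct {
        time_t arrival;        /* when the product went into the queue */
        /* ... other product metadata ... */
    } prod_meta;

    /* Log how far pqact is running behind queue insertion, in seconds. */
    static void
    log_pqact_delay(const prod_meta *meta)
    {
        double delay = difftime(time(NULL), meta->arrival);
        printf("pqact delay: %.0f s behind queue insertion\n", delay);
    }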

> I don't think this [overloaded host] describes our situation.  Just now:
> 
> last pid: 15681;  load averages:  0.38,  0.36,  0.51    11:55:24
> 119 processes: 118 sleeping, 1 on cpu
> CPU states: 79.0% idle,  7.6% user, 11.2% kernel,  2.2% iowait,  0.0% swap
> Memory: 512M real, 17M free, 124M swap in use, 653M swap free
> 
>    PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>  29919 ldm        1  55    0  148M  123M sleep  326:02  1.74% pqact
>  29918 ldm        1  48    0  147M  108M sleep  211:59  1.11% pqbinstats
>  15676 ldm        1  56    0 1448K 1196K cpu/2    0:00  1.10% top
 ...

Maybe on your system it's not pqact that holds the lock on the oldest
product.  A sender process could fall behind the feed rate if the
network connection to the downstream site is flaky or congested.  We
haven't seen that here, but a slow sender locks each product as it
sends it, so it would eventually be holding the lock on the oldest
product, causing the same conflict with incoming products.

Unfortunately, I don't know of an easy way to determine which other
process has a lock on a region of the queue when a conflict is
detected.  There may be a system interface that returns the locking
process ID, but I don't know whether there's something we can use
that's portable to all the Unix systems the LDM runs on.  I'll try to
look into this ...
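
One candidate is fcntl() with F_GETLK, which fills in the process ID
of a conflicting lock holder.  Whether that is portable enough, and
whether it sees the locks the queue code actually uses, is exactly
the open question.  A minimal sketch:

    #include <stdio.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Ask the kernel which process, if any, holds a lock that would
     * conflict with a write lock on [start, start+len) of the file. */
    static void
    who_holds_lock(int fd, off_t start, off_t len)
    {
        struct flock fl;
        fl.l_type   = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = start;
        fl.l_len    = len;

        if (fcntl(fd, F_GETLK, &fl) == -1)
            perror("fcntl(F_GETLK)");
        else if (fl.l_type == F_UNLCK)
            printf("region is not locked\n");
        else
            printf("region locked by process %ld\n", (long) fl.l_pid);
    }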

 ...
> >  - Having not enough open-file and running-decoder slots for pqact.
> >    To avoid opening and closing lots of files and starting up new
> >    decoders for each product, pqact has a stack of open file
> >    descriptors for the most recently used files and pipes to decoders.
> >    The default number of these is MAXENTRIES, a macro defined in
> >    pqact/filel.c and currently set to 32.  You may need to increase
> >    this and recompile, if pqact appends to lots of files and sends
> >    data to lots of different decoders.
> 
> Possibly, since pqact does append to lots of files and sends data to lots
> of different decoders, but I imagine it does at most sites.  Is there a
> way to detect if there is a shortage of file descriptors for a given
> process? 

I was only thinking of the pqact process as possibly needing more
than the 32 file descriptors it now has.  You could probably detect
whether it needs more by looking at a verbose log of pqact and noting
the process IDs of the decoders it invokes.  If it often has to start
up a new instance of a decoder it was recently using (one that had to
be shut down because it aged off the open file descriptor list), then
it needs more descriptors.  The easiest way to tell would be to
double the MAXENTRIES parameter of pqact to 64 and see whether it
runs more efficiently, by monitoring the delay between queue
insertion and pqact processing.  But we don't even have any evidence
that pqact is the process that's falling behind on your system, so
this may make no difference, and it's currently hard to monitor the
delay ...
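
To make the MAXENTRIES discussion concrete, here is a minimal sketch
of the kind of most-recently-used descriptor cache involved; it is
illustrative only, not the actual code in pqact/filel.c:

    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define MAXENTRIES 32          /* the compile-time limit discussed above */

    typedef struct {
        char path[256];            /* file (or decoder pipe) this fd is for */
        int  fd;
        unsigned long last_used;   /* logical clock for LRU eviction */
    } entry;

    static entry cache[MAXENTRIES];
    static int nentries;
    static unsigned long now;

    /* Return an open descriptor for path, reusing a cached one if possible;
     * when the cache is full, close the least recently used entry.  Each
     * eviction of an entry that is needed again soon costs an extra close()
     * and open() (or decoder restart), which is the churn described above. */
    static int
    cached_open(const char *path)
    {
        int i, lru = 0;

        for (i = 0; i < nentries; i++)
            if (strcmp(cache[i].path, path) == 0) {
                cache[i].last_used = ++now;
                return cache[i].fd;            /* cache hit: no open() */
            }

        if (nentries < MAXENTRIES) {
            i = nentries++;
        } else {                               /* cache full: evict LRU */
            for (i = 1; i < MAXENTRIES; i++)
                if (cache[i].last_used < cache[lru].last_used)
                    lru = i;
            close(cache[lru].fd);
            i = lru;
        }
        cache[i].fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0664);
        strncpy(cache[i].path, path, sizeof(cache[i].path) - 1);
        cache[i].path[sizeof(cache[i].path) - 1] = '\0';
        cache[i].last_used = ++now;
        return cache[i].fd;
    }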

We'd still be interested to know whether there is any pqact delay
when you get the "pq_del_oldest: conflict" message.  That would be
the first step toward knowing whether it's pqact or something else
that's the culprit.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu