Re: 20001030: More on queue problem



>To: address@hidden
>From: Tom McDermott <address@hidden>
>Subject: LDM: queue problem
>Organization: UCAR/Unidata
>Keywords: 200010301740.e9UHeN411741

Tom,

On 30 Oct, you wrote:

> I'm getting a whole bunch of messages about 'pq_del_oldest: conflict' and
> 'comings: pqe_new: Resource temporarily unavailable' followed by
> disconnects from 'snow' which is the Cornell ldm server (see below for
> samples).  I'm not sure if this indicates a problem with the queue on our
> host or theirs.  I'm running ldm 5.1.2 on a sparc solaris 2.7 host.

and Anne replied:

> > We're developing a theory that we will be testing this afternoon. 
> ........
> > I'll let you know the results of our testing later today.
> 
> I'd be interested in hearing the results.

This turned out not to be a problem with either the upstream or
downstream queue, but a symptom that appears when pqact cannot keep up
with the product streams.

You can determine how far behind pqact is currently running as follows:
   
 - Run "ldmadmin watch" or "pqutil -w" for a while, capturing the
   output in a file (this tells when products are getting inserted
   into the product queue).  Here is an example of how to do this with
   pqutil in csh syntax: 
   
    % pqutil -w ~ldm/data/ldm.pq < /dev/null > & /tmp/pqutil.log &
   
 - Turn on verbose logging for pqact by finding its process ID and
   sending it a USR2 signal with "kill".  
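   
   For example, in csh on Solaris (the process ID 12345 is a
   placeholder; substitute the one "ps" reports for pqact):
   
    % ps -ef | grep pqact
    % kill -USR2 12345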
   
 - Find a product ID that appeared in the "ldmadmin watch" output and
   find the same product in the pqact verbose logging output.  Note
   that if pqact has fallen way behind, the same product ID may not
   appear in the pqact output for several hours, but each product
   should appear in exactly the same order as in the ldmadmin output,
   even if you don't do anything with the product in pqact.conf.
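   
   For example, if the product ID were "SAUS80 KWBC 301200" (a made-up
   ID), and pqact logs through syslog to ~ldm/logs/ldmd.log as in a
   typical setup, you could locate the product in both logs with:
   
    % grep 'SAUS80 KWBC 301200' /tmp/pqutil.log
    % grep 'SAUS80 KWBC 301200' ~ldm/logs/ldmd.log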
   
 - Take the difference between the time stamps for the product in the
   pqact log and the "ldmadmin watch" log; that's how far pqact is
   running behind.  If it's close to the age of the oldest product in
   the queue, as shown by the last column in the output of "pqmon",
   then you will be getting the 'pq_del_oldest: conflict' messages.
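   
   As a worked example: if a product showed up in the watch log at
   17:40:23 but in the pqact log at 18:05:23, pqact is running 25
   minutes behind; if "pqmon" then reports the oldest product in the
   queue as about 1500 seconds (25 minutes) old, the conflict
   messages are imminent.  To check the queue:
   
    % pqmon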

 - Remember to turn off verbose logging in pqact by sending it another
   couple of USR2 signals to cycle into debug logging and then default
   silent logging.

Several problems can cause pqact to fall behind, including:

 - Requesting pqact to write a lot of files.  For example, when we
   turned on filing of all WSI products, each in a separate file
   using the "FILE -close" action, pqact eventually fell further and
   further behind because of all the I/O activity; when it finally
   acted on the oldest product in the queue at the same time that
   space was needed for new products, the 'pq_del_oldest: conflict'
   messages resulted.  Note that when pqact writes a file in a deeply
   nested directory structure, it has to open and read each directory
   in the path to that file, so there's a lot more I/O going on than
   just writing a single file.  We have seen clusters of as many as
   200 products per second, though the average we're handling is only
   about 16 products per second.
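   
   As an illustration, a pqact.conf entry along these lines (the
   pattern and output path are hypothetical, and the whitespace
   between fields must be tabs) files each WSI product in its own
   file and closes it after every write:
   
    WSI     ^(.*)   FILE    -close  data/wsi/\1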
   
 - Running pqact on a system with slow disks, or writing lots of files
   on remotely mounted systems.  This appears to have been our
   problem, since we only started seeing the pq_del_oldest conflicts
   after our RAID disks went out and we temporarily replaced them with
   slower disks, then started filing all the WSI products.

 - Running pqact on an overloaded or under-configured system.  You can
   tell if this is happening with the normal commands for monitoring
   system load.  pqact can invoke lots of decoders in parallel, so a
   tool like "top" can let you see what kind of load this is putting
   on the system.  Some decoders take lots of memory, so it's possible
   that virtual memory thrashing could result from lots of decoders on
   a system without much memory.
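   
   For example, on Solaris you might watch paging and per-process
   activity with something like the following (assuming "top" is
   installed):
   
    % vmstat 5
    % top
   
   A consistently nonzero scan rate (the "sr" column in vmstat) is
   one sign of memory pressure.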

 - Having too many pqact rules in the pqact.conf file.  Each product
   ID is sequentially matched against every pattern in the pqact.conf
   file before the next product is retrieved from the queue, so having
   hundreds or thousands of pattern-action entries could slow down the
   processing of each product enough to make pqact fall behind the
   data feeds.  If this turns out to be a problem, we can provide a
   fix by adding some extra action syntax that means "this is the last
   thing I want to do with this product so skip all the remaining
   pattern-action rules and start over with the next product".

 - Not having enough open-file and running-decoder slots for pqact.
   To avoid opening and closing lots of files and starting up new
   decoders for each product, pqact has a stack of open file
   descriptors for the most recently used files and pipes to decoders.
   The default number of these is MAXENTRIES, a macro defined in
   pqact/filel.c and currently set to 32.  You may need to increase
   this and recompile, if pqact appends to lots of files and sends
   data to lots of different decoders.
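   
   A sketch of that change in pqact/filel.c (64 is an arbitrary
   example value; choose one to suit your workload), followed by
   recompiling and reinstalling pqact:
   
    /* pqact/filel.c: cache more open files and decoder pipes */
    #define MAXENTRIES 64   /* default is 32 */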

 - Having a product queue that's too small, so there is insufficient
   elasticity for pqact when it has too much to do.
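   
   For example, after raising the queue-size setting in your ldmadmin
   configuration (where that size is set depends on your LDM
   version), you could rebuild the queue with:
   
    % ldmadmin stop
    % ldmadmin delqueue
    % ldmadmin mkqueue
    % ldmadmin start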

I'd be interested to hear whether any of these apply in your situation.
I don't think there's much that can be done to make pqact more
efficient currently, since it uses fast algorithms for pattern
matching and tries to use file descriptors efficiently to avoid
unnecessary process creation or file opening/closing.  But if you
profile pqact and find a bottleneck that's slowing it down, we'd be
interested in hearing about it.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu