
Re: 20001030: More on queue problem (fwd)



I'm just forwarding this in case you're interested.  My earlier reply
indicated we thought we had figured out what was going on, but it
looks like the cause of the similar problem Tom McDermott is seeing is
still a mystery ...

--Russ

>To: address@hidden
>From: Tom McDermott <address@hidden>
>Subject: LDM: queue problem
>Organization: UCAR/Unidata
>Keywords: 200010301740.e9UHeN411741

------- Forwarded Message

Date:    Thu, 02 Nov 2000 13:03:27 -0500
From:    Tom McDermott <address@hidden>
To:      Russ Rew <address@hidden>
Subject: Re: 20001030: More on queue problem 

Russ, 

Thanks a lot for the detailed response.  My comments follow below.


On Thu, 2 Nov 2000, Russ Rew wrote:

> This turned out not to be a problem with either the upstream or
> downstream queue, but a symptom that appears when pqact cannot keep up
> with the product streams.

Because pqact is holding locks on the portion of the queue it hasn't yet
processed, leaving the ingester processes unable to delete the oldest
entries to make room for new ones?

The reason I mentioned the upstream queue (Cornell) in my original message
was that I had only seen these messages previously in Cornell's ldmd log
web page, never on our system.

> You can determine how far behind pqact is currently running by:

....... <instructions deleted> ......

>  - Take the difference between the time stamps for the product in the
>    pqact log and the "ldmadmin watch" log; that's how far pqact is
>    running behind.  If it's close to the age of the oldest product in
>    the queue, as shown by the last column in the output of "pqmon",
>    then you will be getting the "pq_del_oldest conflict:" messages.

I tried this out just now while the 12Z AVN is coming in:  the time stamps
are identical.  So if a pqact slowdown really was the cause, it's not
happening right now.
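
In outline, the check Russ describes amounts to this arithmetic (a rough
sketch; the timestamps and the pqmon age below are made-up placeholder
values, not real log entries):

    from datetime import datetime

    # Timestamps for the same product as logged by "ldmadmin watch" (when it
    # arrived in the queue) and by pqact (when pqact acted on it).  These are
    # placeholder values; read the real ones off the two logs.
    arrived  = datetime(2000, 11, 2, 12, 10, 5)    # from "ldmadmin watch"
    acted_on = datetime(2000, 11, 2, 12, 14, 35)   # from the pqact log

    lag = (acted_on - arrived).total_seconds()     # how far pqact is behind

    # Age of the oldest product in the queue: the last column of "pqmon"
    # output (taken here to be seconds).  Again a placeholder value.
    oldest_age = 5905

    print(f"pqact lag: {lag:.0f} s, oldest product in queue: {oldest_age} s")
    if lag >= 0.9 * oldest_age:
        # pqact is acting on products that are about to be reclaimed, so
        # "pq_del_oldest: conflict" messages are likely.
        print("pqact is running close to the tail of the queue")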

> Several problems can cause pqact to fall behind, including:
> 
>  - Requesting pqact to write a lot of files.  For example, when we
>    turned on filing of all WSI products, each in a separate file using
>    the "FILE -close" action, pqact fell further and further behind
>    with all the I/O activity.  When it finally acted on the oldest
>    product in the queue at the same time that space was needed for new
>    products, the 'pq_del_oldest: conflict' message resulted.  Note
>    that when pqact writes a file in a deeply nested directory
>    structure, it has to open and read each directory in the path to
>    that file, so there's much more I/O going on than just writing a
>    single file.  We have seen clusters of as many as 200 products per
>    second, though the average we're handling is only about 16 products
>    per second.

Possibly, but not likely.  I haven't changed any feed requests or
pattern-actions recently except to delete those dealing with DIFAX.
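
For what it's worth, the per-product open()/close() overhead Russ
describes is easy to see outside of LDM; the sketch below (plain Python,
nothing LDM-specific, sizes and counts made up) just contrasts
one-file-per-product with appending through a cached descriptor:

    import os, tempfile, time

    products = [b"x" * 1024] * 2000          # 2000 small fake "products"
    workdir = tempfile.mkdtemp()

    # Pattern 1: like "FILE -close" with one file per product -- a separate
    # open(), write() and close() for every product.
    t0 = time.time()
    for i, data in enumerate(products):
        with open(os.path.join(workdir, f"prod{i}"), "wb") as f:
            f.write(data)
    one_file_each = time.time() - t0

    # Pattern 2: like appending through a descriptor pqact keeps open -- one
    # open()/close() amortized over all the products.
    t0 = time.time()
    with open(os.path.join(workdir, "append"), "wb") as f:
        for data in products:
            f.write(data)
    appending = time.time() - t0

    print(f"one file per product: {one_file_each:.3f} s")
    print(f"append, cached fd:    {appending:.3f} s")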

>  - Running pqact on a system with slow disks, or writing lots of files
>    on remotely mounted systems.  This appears to have been our
>    problem, since we only started seeing the pq_del_oldest conflicts
>    after our RAID disks went out and we temporarily replaced them with
>    slower disks, then started filing all the WSI products.

Unlikely.  I covered this point in my email to Anne on 10/31 which you
should have received a copy of.

>  - Running pqact on an overloaded or under-configured system.  You can
>    tell if this is happening with the normal commands for monitoring
>    system load.  pqact can invoke lots of decoders in parallel, so a
>    tool like "top" can let you see what kind of load this is putting
>    on the system.  Some decoders take lots of memory, so it's possible
>    that virtual memory thrashing could result from lots of decoders on
>    a system without much memory.

I don't think this describes our situation.  Just now:

last pid: 15681;  load averages:  0.38,  0.36,  0.51                11:55:24
119 processes: 118 sleeping, 1 on cpu
CPU states: 79.0% idle,  7.6% user, 11.2% kernel,  2.2% iowait,  0.0% swap
Memory: 512M real, 17M free, 124M swap in use, 653M swap free

   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 29919 ldm        1  55    0  148M  123M sleep  326:02  1.74% pqact
 29918 ldm        1  48    0  147M  108M sleep  211:59  1.11% pqbinstats
 15676 ldm        1  56    0 1448K 1196K cpu/2    0:00  1.10% top
 12889 ldm        1  58    0 5552K 2292K sleep   94:17  0.91% dchrly
 14338 ldm        1  33    0 7316K 2136K sleep    4:20  0.71% dcgrib
 29923 ldm        1  58    0   11M 4120K sleep  109:23  0.68% pqact
 29920 ldm        1  58    0  155M  110M sleep   69:40  0.44% pqsurf
 29925 ldm        1  58    0  147M  115M sleep   48:16  0.30% rpc.ldmd
 29935 ldm        1  58    0  147M  113M sleep   63:39  0.25% rpc.ldmd
   205 root       4  38    0 2224K  880K sleep   37:34  0.23% ypserv
 29924 ldm        1  58    0  147M   89M sleep   56:36  0.22% rpc.ldmd
   479 root       1  58    0 1984K  508K sleep   22:01  0.15% fingerd
 15583 ldm        1  58    0 1068K  888K sleep    0:00  0.13% mailx
 15677 ldm        1  52    2 1660K 1100K sleep    0:00  0.09% ping
   311 root      21  54    0 3948K 3220K sleep   11:29  0.08% nscd


It is frequently the case that there are multiple decoders executing
simultaneously (e.g., when a number of different models are coming in at
the same time), but not enough to even come close to overloading the
system.  I've tried to write the pattern-actions to reduce the number of
simultaneous invocations of decoders as much as possible.
BTW, getting rid of pqexpire with 5.1.2 reduced system load considerably.

>  - Having too many pqact rules in the pqact.conf file.

I currently have 393 pattern-actions in pqact.conf.  I don't think that's
enough to cause a problem, and anyway this number hasn't changed
appreciably recently.
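
For scale, matching a product identifier against a few hundred
precompiled patterns is cheap compared with the per-product I/O; a toy
measurement (the patterns below are made-up stand-ins, not real
pqact.conf entries) looks like this:

    import re, time

    # Toy stand-ins for pqact.conf patterns and a product identifier; the
    # real patterns are whatever is in pqact.conf, these are just for scale.
    patterns = [re.compile(rf"^PROD{i:03d}\.") for i in range(393)]
    product_id = "PROD999.SOMETHING.ELSE"

    t0 = time.time()
    for _ in range(1000):                    # pretend 1000 products arrive
        for p in patterns:
            p.match(product_id)
    elapsed = time.time() - t0
    print(f"{elapsed * 1000:.1f} ms for 1000 products x 393 patterns")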

>  - Having not enough open-file and running-decoder slots for pqact.
>    To avoid opening and closing lots of files and starting up new
>    decoders for each product, pqact has a stack of open file
>    descriptors for the most recently used files and pipes to decoders.
>    The default number of these is MAXENTRIES, a macro defined in
>    pqact/filel.c and currently set to 32.  You may need to increase
>    this and recompile, if pqact appends to lots of files and sends
>    data to lots of different decoders.

Possibly, since pqact does append to lots of files and sends data to lots
of different decoders, but I imagine it does at most sites.  Is there a
way to detect if there is a shortage of file descriptors for a given
process? 
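
One crude check, assuming the host exposes a per-process fd directory
under /proc (Solaris and Linux both do; on Solaris, "pfiles <pid>"
reports similar information), is to count the entries for the pqact
process shown by top:

    import os

    # PID of the process to check, e.g. the pqact that top shows above
    # (29919 in that listing); adjust as needed.
    pid = 29919

    open_fds = os.listdir(f"/proc/{pid}/fd")
    print(f"process {pid} has {len(open_fds)} open descriptors")

    # If this number sits persistently at MAXENTRIES plus pqact's handful of
    # fixed descriptors (queue, log, and so on), the open-file/decoder cache
    # is probably full and raising MAXENTRIES may be worth a try.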

>  - Having a product queue that's too small, so there is insufficient
>    elasticity for pqact when it has too much to do.

I don't think so, although I did recently reduce the queue size from 180
to 150 MB, since pqmon was reporting ages of up to 21,000 seconds, which
I thought was excessive.  I checked it while running the pqutil/pqact
experiment and it was 5905, the lowest I've ever seen it, but I would
think that's still more than enough.

But I suspect this was the culprit in Cornell's case, since their queue is
too small.  They're still running ldm 5.0.x and they get tons of 'Deleting
oldest to make space' messages in their logs when the models come in.
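
As a rough check on the sizing question, assuming a more or less steady
ingest rate, the age of the oldest product that pqmon reports is
approximately the queue size divided by the average ingest rate, so the
numbers above imply something like this (a back-of-the-envelope sketch,
Python only for the arithmetic):

    # Rough relationship, assuming a more or less steady ingest rate:
    #     age_of_oldest ~= queue_size / ingest_rate
    # The values below are the ones quoted in this message.
    queue_mb = 150.0
    oldest_age_s = 5905.0                    # lowest pqmon age observed

    ingest_mb_per_s = queue_mb / oldest_age_s
    print(f"implied ingest rate: ~{ingest_mb_per_s * 1024:.0f} KB/s")

    for size_mb in (150.0, 180.0):           # current and previous queue sizes
        minutes = size_mb / ingest_mb_per_s / 60.0
        print(f"{size_mb:.0f} MB queue -> ~{minutes:.0f} minutes of elasticity")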
 
> I'd be interested if you think any of these apply in your situation.
> I don't think there's much that can be done to make pqact more
> efficient currently, since it uses fast algorithms for pattern
> matching and tries to use file descriptors efficiently to avoid
> unnecessary process creation or file opening/closing.  But if you
> profile pqact and find a bottleneck that's slowing it down, we'd be
> interested in hearing about it.

No, I don't think so either; pqact seems to be very efficient.

Really, the only thing that keeps coming to mind is the coincidence of
installing those DIFAX conversion scripts on Thursday PM and the first
instance popping up on Friday AM.  It's hard for me to believe they could
have bogged down the system enough to slow down pqact, but maybe they
did.  I have since moved the scripts to another host, so we can eliminate
this as a possibility in any future incident.

Unfortunately, I was not on the premises when these 2 incidents occurred,
but if it happens again and I am around, I can look at some of the things
you mention in your message and then get back to you with something more
concrete.

Thanks again.

Tom
------------------------------------------------------------------------------
Tom McDermott                           Email: address@hidden
System Administrator                    Phone: (716) 395-5718
Earth Sciences Dept.                    Fax: (716) 395-2416
SUNY College at Brockport


------- End of Forwarded Message