
Tom McDermott: Re: 20001214: LDM: out of per user processes



Hi, 

In case you're interested in the mystery of the process table filling
up on Tom McDermott's LDM host, there are a few more possible clues in
this note, but it hasn't been solved yet ...

--Russ

------- Forwarded Message

Date:    Wed, 20 Dec 2000 12:35:25 -0500
From:    Tom McDermott <address@hidden>
To:      Russ Rew <address@hidden>
Subject: Re: 20001214: LDM: out of per user processes

On Fri, 15 Dec 2000, Russ Rew wrote:

> In the above message, you wrote:
>
> > Dec 14 06:31:03 vortex unix: NOTICE: out of per-user processes for uid 214
> > Dec 14 06:32:25 vortex last message repeated 23 times
>   ...
> > Now uid 214 is the ldm, so it is the likely culprit.  This happened
> > once before several months ago.

I think I should have phrased this as 'ldm is a possible culprit'.

> I've been unable to identify any certain LDM-related cause for this,
> but I can offer a couple of theories that you can evaluate.
>
> The only time we have seen anything like this here was in August when
> an LDM host had its load average climb to 2000 (!), and we determined
> that this was caused by a different LDM host running on an HP-UX 11.0
> system hammering it with FEEDME requests.  We have never successfully
> gotten around the RPC library problems on the HP-UX 11.0 platform, so
> we distribute HP-UX 10.20 binaries for it and recommend people build
> the LDM using the HP-UX 10 compatibility mode for HP-UX 11 platforms.
>
> So one possibility is that some downstream site built the LDM for
> HP-UX 11 and then requested data from your site many times per second,
> causing an LDM sender process to be launched for each such request.
> The only sites we see feeding from your vortex host are
> blizzard.weather.brockport.edu and catwoman.cs.moravian.edu, but we
> don't have a record of whether either of these is an HP-UX platform.
> Do you happen to know?  We've just gotten a new HP-UX 11 platform in,
> so we hope to be able to fix or find a workaround for this problem in
> the near future.

Well, blizzard is another Sun (Ultra-10) running Solaris 7.  Catwoman at
Moravian College is a Sun Ultra 450 server running Solaris 8, so I think
we can definitely exclude the HP possibility.

> Another possible cause Anne had seen was upgrading to 5.1.2 without
> remaking the queue, but I have been unable to duplicate this problem
> here and can't understand how that could cause spawning any additional
> processes.  When I tried it here, the LDM just reported the problem
> and exited, as it is supposed to do:
>
>   Dec 14 20:09:06 rpc.ldmd[25256]: ldm.pq: Not a product queue
>   Dec 14 20:09:06 rpc.ldmd[25256]: pq_open failed: ldm.pq: Invalid argument
>   Dec 14 20:09:06 rpc.ldmd[25256]: Exiting
>   Dec 14 20:09:06 rpc.ldmd[25256]: Terminating process group

I definitely remade the queue when upgrading to 5.1.2 at the end of August
(actually even before that because I ran the beta release), and have
remade the queue quite a few times since in order to change its size (you
recall the pq_del_oldest problem), so I think we can positively eliminate
this possibility as well.

> A final possibility is that the problem is caused by some decoder or
> other program or script launched by pqact.  It is relatively easy to
> write a recursive shell script that quickly consumes the process table
> if there are no per-user limits set for a user who tries to debug and
> run such a script (I've done it!).  If you have other users on the
> machine, one of their programs could have spawned processes
> recursively or in a loop and used up all the process table entries, so
> when the LDM tried to spawn a decoder process, it hit the limit and
> produced the message.
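(As an aside, the per-user ceiling that syslog message refers to is the
Solaris maxuprc tunable.  A quick way to see where it currently sits,
sketched here on the assumption of a stock Solaris 7/8 box, is:

    # kernel limits, including "maximum processes per user id (v.v_maxup)"
    sysdef | grep -i process

    # see whether maxuprc was ever raised from the default in /etc/system
    grep maxuprc /etc/system

If maxuprc was never set explicitly, it defaults to only a few less
than max_nprocs, so a single runaway uid can come close to exhausting
the whole table anyway.)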

This last possibility, I think, is the most likely.  The only problem I
see with a different, non-ldm user executing a runaway program is that
the syslog message said 'out of per-user processes for uid 214',
implying that it was the user ldm which had run out of processes, not
necessarily the system as a whole.  I don't know whether, if a non-ldm
user had filled up the (global) process table, a different user (ldm)
trying to fork a new process would get that per-user message.  So it
would seem that focusing on the processes ldm spawns, such as decoders
or scripts, is the first line of attack.  I don't recall any recent
changes to these.  I last compiled the gempak programs on 7/28/00, but
there were almost no changes to the binaries.  The only other new
decoder is the pnga2area program for the new MCIDAS compression format,
but I have no reason to suspect it.  I will make a closer inspection of
pqact.conf when I get a chance, to make sure nothing has escaped my
memory on this point.
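When I do, what I will be looking for is any PIPE or EXEC action whose
decoder or script could hang or re-invoke itself, since each matching
product makes pqact fork (or keep open) another process.  Purely for
illustration, not copied from my file, the entries look roughly like
this (fields and the leading indent on action lines are tabs):

    # a PIPE action keeps a decoder attached to pqact; a wedged decoder
    # means one more stuck ldm process per pipe
    MCIDAS	^pnga2area
    	PIPE	-close	decoders/pnga2area

    # an EXEC action forks a fresh process for every matching product;
    # a script here that loops or calls itself is the classic table-filler
    IDS|DDPLUS	^SAUS
    	EXEC	util/some_script.sh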

> Here's a couple of suggestions that might help diagnose the problem.
> First, take ps snapshots (or use top) to see all the ldm processes
> running and try to account for each one from log file entries, to make
> sure there aren't any extra processes being created.  The "pgrep"
> command on Solaris 2.7 and later is useful for this, for example
>
>   pgrep -fl -u ldm
>
> shows all processes owned by user "ldm", and piping this into "wc -l"
> would give you a quick count of ldm processes, and would let you
> monitor if ldm processes were climbing slowly.  But this would be of
> no help if something triggers spawning a bunch of processes quickly.
> If that happens, it would help to have ps or pgrep output, but to
> catch it you might have to run a cron job that dumped the output of
> pgrep to a file every minute (overwriting the previous file or
> ping-ponging between two files), so that if your process table filled,
> you would have a record of what things looked like within the previous
> minute.

OK, I will try the top and ps file output suggestion.  I guess the crontab
entry would be all stars (* * * * *) to indicate it should be run every
minute.  But it may be a while before I get another occurrence.  It was
around 2 months between the 2 incidents.  Also, I will be upgrading to
gempak 5.6 sometime in the next couple of weeks, so that will change the
decoder environment.  And while I'm doing that, I'll be taking a closer
look at pqact.conf to see if there's anything amiss there.
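Concretely, what I have in mind is something like the sketch below; the
script name and paths are invented, and it assumes the crontab belongs
to the ldm user:

    # crontab entry: run the snapshot script once a minute
    * * * * * /usr/local/ldm/util/snap_procs.sh

with snap_procs.sh ping-ponging between two files so there is always a
snapshot from within the last minute or two:

    #!/bin/sh
    # dump a snapshot of all ldm processes, alternating between two
    # output files so the most recent minute is never overwritten
    DIR=/usr/local/ldm/logs
    if [ -f $DIR/procs.flip ]; then
        OUT=$DIR/procs.B
        rm -f $DIR/procs.flip
    else
        OUT=$DIR/procs.A
        touch $DIR/procs.flip
    fi
    date > $OUT
    pgrep -fl -u ldm >> $OUT
    echo "total ldm processes: `pgrep -u ldm | wc -l`" >> $OUT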

The only other things that come to mind are these.  A much earlier
version of the NCSA httpd server (v. 1.3.4, I think) used to, under
some conditions, start forking hundreds of copies of the daemon on our
system.  Upgrading to v. 1.5.2 about 4 years ago solved that, and I
have no special reason to suspect it now other than that it caused a
similar problem in the past.  I suppose I should have upgraded to the
Apache server before now.  The other possible culprit is a script (run
out of ldm's crontab) that downloads some radar graphics from an NWS
http server.  Under some conditions, which arise fairly frequently, the
script and its child processes don't exit and remain on the system.
That results in 3 additional processes for each hour it happens, but
they aren't forking further processes, which I think would be required
to actually fill up the process table.
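In the meantime I can at least keep an eye on (and clear out) those
stragglers by hand; the process name below is just a stand-in for
whatever the real script is called:

    # list any leftover radar-fetch processes owned by ldm
    pgrep -fl -u ldm get_radar

    # and, if they are genuinely wedged, clean them up
    pkill -u ldm -f get_radar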

Something I thought of too late: I'm running the syscheck script
distributed with the ldm.  Since I almost never look at its output, I
forgot last week that I was even running it, and by the time I
remembered on Monday the relevant output had been rotated out of
existence.  (I will keep 7 generations of that log from now on; a
sketch of the rotation I have in mind follows below.)  I'm not sure
that script records the sort of information that would be useful here
anyway.  I will also have to go back to my tapes and see what's in the
logs from the first incident in October, and I will let you know if I
find anything interesting.
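The rotation I have in mind is nothing fancy, just a keep-7-generations
shuffle run from cron; the path below is an assumption about where I
send syscheck's output:

    #!/bin/sh
    # rotate syscheck output, keeping 7 generations (path assumed)
    LOG=/usr/local/ldm/logs/syscheck.log
    i=6
    while [ $i -ge 1 ]; do
        next=`expr $i + 1`
        [ -f $LOG.$i ] && mv $LOG.$i $LOG.$next
        i=`expr $i - 1`
    done
    [ -f $LOG ] && cp $LOG $LOG.1 && : > $LOG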

> I'm still very interested in resolving whether this is a symptom of an
> LDM bug, so if you find out anything else, please let me know.
> Thanks!

Really, the ldm on sparc Solaris has been extremely stable, I might
even say awesomely stable, in recent years.  However, I guess no
program, no matter how stable, is entirely without bugs; the remaining
ones just become more obscure and are triggered only under infrequent
sequences of conditions.

Tom
------------------------------------------------------------------------------
Tom McDermott                           Email: address@hidden
System Administrator                    Phone: (716) 395-5718
Earth Sciences Dept.                    Fax: (716) 395-2416
SUNY College at Brockport



------- End of Forwarded Message