
Re: 20001214: LDM: out of per user processes



Hi Tom,

>From: Tom McDermott <address@hidden>
>Subject: LDM: out of per user processes
>Organization: SUNY Brockport
>Keywords: 200012141622.eBEGM4o06206 LDM processes

In the above message, you wrote:

> This morning when I came in (delayed several hours because of a
> snowstorm), no users were able to access our server.  The reason for
> this was clear from these messages in the system log:
>
> Dec 14 06:31:03 vortex unix: NOTICE: out of per-user processes for uid 214
> Dec 14 06:32:25 vortex last message repeated 23 times
  ...
> Now uid 214 is the ldm, so it is the likely culprit.  This happened
> once before several months ago.  At that time I recompiled ldm with
> just the '-O' option, since I suspected that a target option I
> originally used in compiling ldm might have been the cause.  But it
> appears I was wrong about that.
>
> Info: SparcStation 10/712MP, 512MB, Solaris 5.7, ldm 5.1.2 .
>
> I couldn't find anything on this in the ldm-support archives after a
> quick search, so I thought I'd ask if you had any ideas.  I suppose
> it could be a lot of things, pqact spawns tons of processes.

I've been unable to identify any certain LDM-related cause for this,
but I can offer a couple of theories that you can evaluate.

The only time we have seen anything like this here was in August when
an LDM host had its load average climb to 2000 (!), and we determined
that this was caused by a different LDM host running on an HP-UX 11.0
system hammering it with FEEDME requests.  We have never successfully
gotten around the RPC library problems on the HP-UX 11.0 platform, so
we distribute HP-UX 10.20 binaries for it and recommend people build
the LDM using the HP-UX 10 compatibility mode for HP-UX 11 platforms.

So one possibility is that some downstream site built the LDM for
HP-UX 11 and then requested data from your site many times per second,
causing an LDM sender process to be launched for each such request.
The only sites we see feeding from your vortex host are
blizzard.weather.brockport.edu and catwoman.cs.moravian.edu, but we
don't have a record of whether either of these is an HP-UX platform.
Do you happen to know?  We've just gotten a new HP-UX 11 platform in,
so we hope to be able to fix or find a workaround for this problem in
the near future.

Another possible cause, which Anne had seen, is upgrading to 5.1.2
without remaking the queue, but I have been unable to duplicate this
problem here and can't see how it could cause any additional processes
to be spawned.  When I tried it here, the LDM just reported the problem
and exited, as it is supposed to do:

  Dec 14 20:09:06 rpc.ldmd[25256]: ldm.pq: Not a product queue
  Dec 14 20:09:06 rpc.ldmd[25256]: pq_open failed: ldm.pq: Invalid argument
  Dec 14 20:09:06 rpc.ldmd[25256]: Exiting
  Dec 14 20:09:06 rpc.ldmd[25256]: Terminating process group

A final possibility is that the problem is caused by some decoder or
other program or script launched by pqact.  It is relatively easy to
write a recursive shell script that quickly consumes the process table
if there are no per-user limits set for a user who tries to debug and
run such a script (I've done it!).  If you have other users on the
machine, one of their programs could have spawned processes
recursively or in a loop and used up all the process table entries, so
when the LDM tried to spawn a decoder process, it hit the limit and
produced the message.

Here are a couple of suggestions that might help diagnose the problem.
First, take ps snapshots (or use top) to see all the ldm processes
running and try to account for each one from log file entries, to make
sure there aren't any extra processes being created.  The "pgrep"
command on Solaris 7 and later is useful for this.  For example,

  pgrep -fl -u ldm

shows all processes owned by user "ldm".  Piping this into "wc -l"
gives you a quick count of ldm processes and lets you monitor whether
that count is climbing slowly.  But this would be of no help if
something triggers spawning a bunch of processes quickly.
If that happens, it would help to have ps or pgrep output, but to
catch it you might have to run a cron job that dumped the output of
pgrep to a file every minute (overwriting the previous file or
ping-ponging between two files), so that if your process table filled,
you would have a record of what things looked like within the previous
minute.
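
A minimal sketch of such a cron-driven snapshot, assuming a POSIX shell
and a pgrep that accepts -f, -l and -u.  The function name
snapshot_procs, the file names procs.cur and procs.prev, and the default
directory /tmp/ldm-snapshots are hypothetical choices for illustration,
not anything shipped with the LDM:

```shell
#!/bin/sh
# Sketch only: names and paths below are illustrative assumptions.

# snapshot_procs USER DIR
# Record USER's current process list in DIR/procs.cur, keeping the
# previous snapshot as DIR/procs.prev.
snapshot_procs() {
    user=$1
    dir=$2
    mkdir -p "$dir" || return 1

    # Rotate: the last snapshot becomes the "previous" one.
    if [ -f "$dir/procs.cur" ]; then
        mv -f "$dir/procs.cur" "$dir/procs.prev"
    fi

    {
        date
        # Quick count of the user's processes (the "wc -l" idea above).
        echo "count: $(pgrep -u "$user" | wc -l)"
        # Full listing: PID plus command line for each process.
        pgrep -fl -u "$user"
    } > "$dir/procs.cur"
}

# Take a snapshot only when a user name is given, e.g. "snapshot.sh ldm".
if [ "$#" -ge 1 ]; then
    snapshot_procs "$1" "${2:-/tmp/ldm-snapshots}"
fi
```

Saved under some name such as snapshot.sh (hypothetical) and run once a
minute from cron with "ldm" as its argument, this keeps the two most
recent snapshots on disk.  It is probably safer to run it from the
crontab of a user other than ldm, since a job owned by ldm may itself be
unable to start once the per-user limit has been reached.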

The only other suggestion I can offer is to set a limit on the number
of processes that can be spawned by the ldm user, to make sure it
doesn't use up processes needed by other users or cause the system to
run out of process slots.  I had thought you could use the "ulimit"
command to set this limit, but having read the ulimit man page, I
don't see how to do it.  I'll send email to our sysadmins in case one
of them knows, and I'll pass on any useful answer I get.

I'm still very interested in resolving whether this is a symptom of an
LDM bug, so if you find out anything else, please let me know.  Thanks!

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                     http://www.unidata.ucar.edu