
20050405: McIDAS and xvfb display windows (cont.)



>From: "David B. Bukowski" <address@hidden>
>Organization: COD
>Keywords: 200504042331.j34NV1v2029887 McIDAS Xvfb

Hi David,

>Very odd Tom.  Basically since we stopped the MCIDAS processes from
>running (taking out of ldm and pqact) the server has NOT crashed/hung nor
>did any Xvfb windows die.  Nothing else has changed.

I would guess that the xvfb failures were related to system resource
starvation, and that at least part of that starvation was caused by
script-initiated McIDAS processes that did not release their shared
memory segments on exit.
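
For what it's worth, the standard System V IPC tools make it easy to
keep an eye on this; the loop below is just a sketch (the 60-second
interval is arbitrary):

# List the current System V shared memory segments and their owners;
# the "nattch" column shows how many processes are attached to each one.
ipcs -m

# Watch the segment count over time to see whether it keeps growing:
while true; do
  date
  ipcs -m | grep -c '^0x'
  sleep 60
done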

>On your point about
>the SHMEM segments.  Yes, I've noticed lots of SHMIDs, especially
>around the system hangs/crashes/xvfb deaths.  As of right now there are
>none, but when the MCIDAS processes are running, we have at least 5
>constantly (owned by ldm at that time), which I think may be the
>decoders.

Yes, there should be two or three depending on what McIDAS-XCD
decoding has been set up:

1 - segment created by the 'exec xcd_run MONITOR' entry in ldmd.conf
2 - segment created by IDS|DDPLUS decoding
3 - segment created by HRS decoding

On some OSes, these are actually created in pairs, so the numbers
could be doubled.  Why you have 5 segments when the LDM is running
is a bit of a mystery to me, but I would tend to believe that 
some were created by pqact.conf entry-initiated processing.
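
If you ever need to track down or clean up leftover segments by hand,
something like the following sketch may help.  The column layout is
what the Linux util-linux 'ipcs' prints, and the nattch == 0 test is
what keeps the loop from touching a segment that a running decoder is
still attached to:

# Show the creator (cpid) and last-attach (lpid) process IDs for each
# segment, which helps identify what created it:
ipcs -m -p

# Sketch: remove segments owned by 'ldm' that have no attached
# processes.  Assumed columns: key shmid owner perms bytes nattch status
ipcs -m | awk '$3 == "ldm" && $6 == 0 { print $2 }' | while read shmid; do
  echo "removing orphaned segment $shmid"
  ipcrm -m "$shmid"
done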

>but then when the system gets really hosed up I've had 2 pages
>of just shmid's.  So yes, I have seen the runaway shmid case.

Sounds like this may be the root of the problem.

>Just must be
>total coincidence then that the Xvfb's die when mcidas is running, because
>now they've been up for over 24 hours without running those processes.

Well, I would venture to guess that each Xvfb requires a substantial
fraction of system resources, and when those resources are unavailable,
the Xvfb(s) die.

>Any other insight would be greatly appreciated.  I hope Paul checks the
>scripts to make sure that things are being executed correctly and being
>properly cleaned up.

The only thing I can suggest is making sure that the scripts exit the
'mcenv' invocation that created the shared memory segment cleanly.  By
the way, this is akin to GEMPAK products needing to run GPEND to get
rid of the message queue created for their communications.
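
As a rough illustration only (the MCPATH value and the IMGCOPY
parameters below are placeholders to be adjusted for the local setup),
a cron-driven product script would normally do all of its McIDAS work
inside one 'mcenv' invocation and let that invocation exit normally:

#!/bin/sh
# Placeholder environment setup -- adjust for the local McIDAS install.
MCPATH=/home/mcidas/data:/home/mcidas/help
export MCPATH

# All McIDAS work happens inside a single mcenv invocation; when the
# here-document ends, mcenv exits and its shared memory segment should
# be released (much like running GPEND at the end of a GEMPAK script).
mcenv << EOF
imgcopy.k SOURCE/DATASET DEST/DATASET
exit
EOF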

>side note on system hangs, when the cpu load is up to like 200 on
>occasion.  I cannot end the ldm process with ldmadmin stop.  It seems to
>hang on trying to stop the mcidas decoders; even kill -9 doesn't kill
>them, they are in a D state in top/ps.  Even ipcrm won't get rid of most
>shmid's.  I have tried to do a safe reboot with the reboot command or a
>shutdown command, no success.  At these times I've found myself having to
>physically go to the machine (an hour-long drive from home at night) to
>hit the power button, which I hate doing but it's the only way to recover.
>On some of these occasions, I've noticed a kernel panic too.  So I'm
>thinking from your response that the kernel panic may possibly be from a
>runaway shmid case.

Situations where one cannot kill processes with a 'kill -9' are rare
and not limited to McIDAS.  We have seen instances under Solaris SPARC
where the same thing can happen for processes that are using a lot of
system resources.  I have always considered these situations to be more
a function of the OS than the package, since _no_ user-space application
should be allowed by the OS to interfere with system-level functions.

>FYI.  Is there anything from the sysctl settings that you see I may need
>to change?  I have allocated shmmax at 512 MB as per the install
>instructions for mcidas:
>kernel/threads-max = 102400
>kernel/cad_pid = 1
>kernel/sem = 250        32000   32      128
>kernel/msgmnb = 16384
>kernel/msgmni = 16
>kernel/msgmax = 8192
>kernel/shmmni = 4096
>kernel/shmall = 2097152
>kernel/shmmax = 536870912

The last 8 of these settings are identical to what I have set up on a
Fedora Core 1 Linux box here in my office EXCEPT that I did not
increase shared memory to 512 MB.  Since your problem appears to be
related to shared memory starvation, I would try decreasing the shmmax
value to something like 32 MB.  This is especially the case if you do
not try to run multiple interactive McIDAS sessions.
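
For example, 32 MB is 33554432 bytes, and on a 2.6 kernel the change
can be made on the fly:

# Apply the new limit immediately (no reboot needed):
sysctl -w kernel.shmmax=33554432

# To make it permanent, put this line in /etc/sysctl.conf:
#   kernel.shmmax = 33554432
# and then reload the settings:
sysctl -p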

>thanks for any help

I am curious whether you see problems when just running McIDAS-XCD
decoding (i.e., without kicking off scripts to do processing).  I could
imagine a case where the release of the shared memory is not working
on your system.  Perhaps Debian uses a slightly different call to
release shared memory (you are using Debian, aren't you)?
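
One way to check (just a sketch; the script name below is a placeholder
for whatever your cron jobs actually run) would be to snapshot the
segment table before and after one scripted run and compare:

# Snapshot the shared memory table before and after one scripted run:
ipcs -m > /tmp/shm.before
/home/mcidas/bin/run_product.sh
ipcs -m > /tmp/shm.after

# Any segment that shows up only in the "after" file was not released:
diff /tmp/shm.before /tmp/shm.after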

Cheers,

Tom
--
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web.  If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.

>From address@hidden  Tue Apr  5 13:58:28 2005

Ok, I'll drop the shmmax variable to 32 MB to see if that could be
something.  The system does have 6 GB of physical RAM and another 4 GB
of swap, running kernel 2.6.4.  I need to upgrade the kernel a bit too
since they are on something like 2.6.11.6 now.  I'll let you know what I
find over the next few days and will redirect the ipcs output to a text
file to post next time.  But yes, there are 3 pairs, 6 total, ownership
LDM.  Just a quick overview: LDM runs the processes that are executed
from pqact and the decoders, while MCIDAS runs cronjobs.  I will have to
see what the ownership is again next time the system goes all wacko.
Thanks for the tips, will keep you posted.
-dave  (yes debian)