
Re: troubles stopping ldm with ldmadmin on linux



"James D. Marco" wrote:

> Hi All,
>         Yes, I agree and it is not restricted to LDM. This is correct
> behavior, from a computer science standpoint. Several data-logger
> daemons I wrote do the same thing on HP, Sun, SGI, and Linux. These
> processes maintain open files over a long time monitoring/collecting
> 'bursty' data, but, are not continuously used; they receive data spaced
> over a 'long time' in terms of CPU utilization...more than one second. I
> always assumed this was caused by a combination of:
>                 Large files (as you mention)
>                 The process-owned buffered file IO (streams)
>                 The operating system disk buffering
>                 The OS swap: swaps a process to disk if not memory locked
>                 The hardware caching - CPU cache, Disk cache, Controller cache.
>
> The entire network of dependencies for all this is quite large.
>
> One other item that occurs to me, the 'Garbage Collection' mechanisms
> in modern OSs.  Large processes and process buffer utilization will
> leave large holes in real memory when killed. Assuming a threshold value
> for memory fragmentation, the OS will probably initiate a cleanup...
> which can be expensive in terms of CPU time.
>
> Usually, all this happens within 10-15 seconds, but 60-120 seconds is
> probably not unusual for a well tuned system.
>
> If delays are much longer, I would look elsewhere.  My first guess is
> that the hard drive is badly fragmented, overloaded, or in need of
> reformatting (low level & file system.) File system organization can be
> a problem. Locate swap space and large LDM Queues/Decoder outputs on
> separate drives, not just partitions on the same physical drive. Increase
> the amount of RAM. More....
>                                                         jdm

From this discussion it's clear that there are many OS-dependent issues
surrounding the termination of a process or group of processes.  The
memory-mapped queue used by the LDM is clearly a factor.

As alluded to, the LDM itself does a few things that might cause its processes
to hang around for a while.  Most of the rpc.ldmd processes you see are either
receiving processes that write to the queue, or sending processes that read from
the queue and send products downstream.  These processes have signal handling
routines that let them finish critical tasks, i.e., tasks that should not be
interrupted, such as writing to the queue, before they act on a signal.  A
receiving process might take some time to finish writing a large product to the
queue.  For a sending process, the time it takes to finish a critical task might
also depend on the state of the network, i.e., network congestion could slow it
down.  Add to this the need to finally write the memory-mapped file back to
disk, and the whole shutdown could take a while.
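
To illustrate the idea of finishing a critical task before acting on a signal,
here is a minimal sketch in Python.  It is not the LDM's code (the LDM is
written in C, and I have not reproduced its handlers); the class name
DeferredTermination and the write_product() stand-in for a queue write are
hypothetical.  The handler only records that SIGTERM arrived, and the main flow
checks that record once the critical work is done.

```python
import os
import signal


class DeferredTermination:
    """Note SIGTERM while a critical task runs; act on it afterwards."""

    def __init__(self):
        self.pending = False
        self.log = []

    def handler(self, signum, frame):
        # Only record the request; do not exit in the middle of the task.
        self.pending = True
        self.log.append("signal noted")

    def write_product(self, chunks):
        # Hypothetical critical task standing in for a write to the queue.
        for i in range(chunks):
            if i == 0:
                # Simulate a stop request arriving mid-write.
                os.kill(os.getpid(), signal.SIGTERM)
            self.log.append(f"chunk {i} written")

    def run(self):
        old = signal.signal(signal.SIGTERM, self.handler)
        try:
            self.write_product(3)
            if self.pending:
                # The critical task is complete, so it is now safe to stop.
                self.log.append("exiting cleanly")
        finally:
            if old is not None:
                signal.signal(signal.SIGTERM, old)
        return self.log
```

Calling DeferredTermination().run() returns a log in which all three chunks are
written even though the signal arrived during the first one.  The longer the
critical task, the longer the process appears to ignore the stop request.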

Doug's changes to the ldmadmin script that kill each child individually seem
fine.  However, in theory that should make no difference.  A SIGTERM (a
termination signal) should be handled by the receiving process in the same
manner regardless of which process sent it.  Still, I'm reminded of my favorite
quote: "In theory, theory and practice are the same.  In practice, they're not."
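
As a rough sketch of what "signal each child individually and wait" looks like,
the Python below starts a few stand-in children, sends each one SIGTERM, and
waits for it with a timeout.  This is not Doug's ldmadmin change (ldmadmin is a
separate script that I am not quoting here); the CHILD program and the
stop_children() helper are hypothetical, and the children are plain Python
processes rather than rpc.ldmd.

```python
import signal
import subprocess
import sys

# Hypothetical child: installs a SIGTERM handler, reports ready, then idles.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    sys.exit(0)  # clean exit once the signal is handled

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def stop_children(count=2, timeout=10.0):
    """Start `count` children, SIGTERM each one, and return their exit codes."""
    children = [
        subprocess.Popen([sys.executable, "-c", CHILD],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(count)
    ]
    for child in children:
        child.stdout.readline()  # wait until the handler is installed
    for child in children:
        child.send_signal(signal.SIGTERM)
    # wait() raises subprocess.TimeoutExpired if a child does not exit in time.
    codes = [child.wait(timeout=timeout) for child in children]
    for child in children:
        child.stdout.close()
    return codes
```

Each child exits with status 0 because its own handler ran; the handler does not
know or care which process sent the signal.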

It would be different if Doug were sending SIGKILL ('kill -9'), as that signal
cannot be caught and handled by the receiving processes.  I would not recommend
this if you care about maintaining the state of the queue.
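
The difference between the two signals can be checked directly.  In this small
Python sketch the helper name can_install_handler() is my own; it simply tries
to register a handler for a signal and reports whether the operating system
allowed it.

```python
import signal


def can_install_handler(signum):
    """Return True if a handler can be registered for `signum`."""
    try:
        old = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The OS refuses to let a process catch this signal.
        return False
    if old is not None:
        signal.signal(signum, old)  # restore the previous disposition
    return True
```

On Linux and other POSIX systems, can_install_handler(signal.SIGTERM) succeeds
while can_install_handler(signal.SIGKILL) does not, so a process killed with
'kill -9' gets no chance to finish a queue write.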

Another variable is how memory-mapped files are handled by the different OSs.
In the past we have had trouble with Linux's handling of memory-mapped files.  I
don't fully understand this problem (this was before I got here) - it seems to
have been improved with recent releases of the OS.  This may or may not be
relevant to this shutdown issue.
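
For readers unfamiliar with memory-mapped files, the sketch below shows the
general pattern in Python: map a file, modify it through memory, and flush the
changes back to disk.  It uses a small temporary file and a hypothetical helper
name, write_through_mapping(); it says nothing about how the LDM queue itself is
laid out or flushed.

```python
import mmap
import os
import tempfile


def write_through_mapping(payload: bytes, size: int = 4096) -> bytes:
    """Write `payload` via a memory mapping, flush it, and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # the file must be as large as the mapping
        with mmap.mmap(fd, size) as mapping:
            mapping[:len(payload)] = payload
            mapping.flush()  # ask the OS to write modified pages to the file
        with open(path, "rb") as f:
            return f.read(len(payload))
    finally:
        os.close(fd)
        os.remove(path)
```

With a large mapping, the flush at the end is where the operating system may
have a lot of modified pages to write out, which is one reason a shutdown can
take longer than expected.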

I have had anecdotal experience with my own Red Hat 6.1 Linux box.  I have
tried to kill groups of processes by killing the parent and found that the
children did not die in what I thought should be a timely manner.  I'm still not
sure what to think of that...

Anne

--
***************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************