
Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing



"Arthur A. Person" wrote:
> 
> >
> > When you say "it still thrashes", do you mean that products aren't being
> > received in a timely manner?  Right now products on ldm.meteo appear to
> > be arriving pretty quickly.  And, 'top' is showing a low load average,
> > the machine appears to be responsive, and there's a reasonable number of
> > rpc.ldmds...  Is this all with your 600MB queue?
> 
> By thrashing, I mean that the disk I/O light is mostly on and only
> occasionally blinks off, the system responds very slowly, and IDD
> reception lags at the reclass time limit, yet "top" shows only a few
> percent of CPU usage.  The IDD seems fine on ldm right now because I
> restarted it last night and remade the queue at 600MB.  This doesn't
> tell us
> anything about the cause, but I'm beginning to suspect that it has
> something to do with using a large queue.  I'm going to run it with the
> queue at 600MB until I leave for vacation next Friday... if it makes it
> that long without a problem, I'll conclude it's queue size related and we
> can resume working on this when we both get back from vacation.
> 
> I still have my WSI data coming in, so if I don't see problems in the next
> week, I'll probably assume the WSI RPCs are a symptom rather than a
> cause, although they should still shut down when a connection is lost.
> 

Art,

FYI, Charlie O'Brian at WSI agreed to feed our 7.1 machine temporarily
starting Monday.  I'll request the WSI data then, and try it with
various queue sizes.
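
For what it's worth, the queue-size theory is easy to sanity-check.
The product queue is a memory-mapped file, so a queue that can't stay
resident in RAM will page constantly: heavy disk I/O with little CPU
use, which matches the symptoms you describe.  Here's a rough Python
sketch (Linux-only, since it reads /proc/meminfo):

    # Sketch: compare candidate queue sizes against physical RAM.  A
    # memory-mapped queue much larger than RAM can't stay resident, so
    # the kernel pages it in and out continuously.

    def mem_total_bytes(meminfo="/proc/meminfo"):
        # Total physical memory, as reported by the Linux kernel.
        with open(meminfo) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024   # value is in kB
        raise RuntimeError("MemTotal not found")

    def report(queue_bytes):
        ram = mem_total_bytes()
        print("queue %4d MB = %3.0f%% of %d MB RAM"
              % (queue_bytes // 2**20, 100.0 * queue_bytes / ram,
                 ram // 2**20))

    if __name__ == "__main__":
        report(600 * 2**20)   # the 600MB queue that behaves
        report(2 * 2**30)     # the 2GB queue under suspicion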

Also, he said: 

> Unless there is a problem (i.e., internet congestion, system crash,
> client LDM stopping, etc.), our program should never have to reconnect.
> Our processes check every 5 minutes to make sure the client is
> connected.  I noticed that we did a lot of restarting through 5Z this
> morning.  I would hazard a guess that they are fine now.
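
Just to illustrate the scheme he's describing, here's a minimal sketch
of such a five-minute watchdog.  The hostname is a placeholder, and the
plain TCP probe merely stands in for whatever check WSI's software
actually performs:

    import socket
    import time

    DOWNSTREAM = ("ldm.example.edu", 388)  # placeholder host; 388 is the LDM port
    INTERVAL = 5 * 60                      # seconds between checks

    def reachable(addr, timeout=10):
        # True if a TCP connection to the downstream host succeeds.
        try:
            socket.create_connection(addr, timeout=timeout).close()
            return True
        except OSError:
            return False

    def watchdog():
        while True:
            if not reachable(DOWNSTREAM):
                print("downstream unreachable; would reconnect here")
            time.sleep(INTERVAL)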

Yesterday, from the piece of the log I ftp'ed from your site, there were
155 connections in about 12 hours.  (And only 106 disconnects, as I
recall.)  Could connectivity be a factor?  And yet, I'm assuming you had
no similar problems when you were using navier, is that right?
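
If you want to repeat that count on a newer log, a rough sketch along
these lines would do it.  The patterns are guesses at the relevant
ldmd.log message text, which varies by LDM version, so adjust them to
match what's actually in your log:

    import re
    import sys

    CONNECT = re.compile(r"[Cc]onnection from")
    DISCONNECT = re.compile(r"[Cc]onnection (reset|closed|lost)")

    def tally(path):
        # Count connection starts and connection losses in one log file.
        connects = disconnects = 0
        with open(path) as log:
            for line in log:
                if CONNECT.search(line):
                    connects += 1
                elif DISCONNECT.search(line):
                    disconnects += 1
        return connects, disconnects

    if __name__ == "__main__":
        c, d = tally(sys.argv[1])
        print("%d connections, %d disconnects" % (c, d))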

You could try going back to the 2GB queue and see if the problem
returns...

Anne
-- 
****************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************