
Re: Thelma Down?



On Tue, 13 Jun 2000, Jason J. Levit wrote:

> 
> > Yes, thelma went down at 5:23 this morning. The crash was caused by a known
> > problem on SGI machines: if the LDM queue grows while pqexpire is
> > running, the queue becomes corrupt. I have remade the queue
> > and restarted the LDM. I'll recalculate what the queue size should be
> > now that new products are coming over NOAAport, and implement the new
> > size later today. UPC is in the process of replacing thelma's hardware, and
> > we have a new version of the LDM software that doesn't have this problem.
> > It will soon be installed on thelma, which should eliminate the problem.
> > 
> > Thanks for the patience,
> > Robb...
> 
>   Hi Robb,
> 
>   I've been having severe problems with LDM crashing on our Origin 200
> machine, and this might explain it!  LDM will literally die every few
> minutes from time to time when incoming traffic gets high.  Let me see
> if this scenario sounds familiar: LDM dies for no apparent reason, the
> log file just says "interrupt" for all the processes, and a huge core
> file is dumped.  Was that the behavior you were seeing?
> 
Jason,

It sounds like this could be your problem. Here are the log entries from
thelma around the time of the crash:

Jun 13 05:23:37 5Q:thelma nport(feed)[9491]: RECLASS: 20000613042337.506 TS_ENDT {{WMO,  ".*"}}
Jun 13 05:23:43 5Q:thelma rpc.ldmd[6873]: child 6846 terminated by signal 11
Jun 13 05:23:43 5Q:thelma rpc.ldmd[6873]: Killing (SIGINT) process group
Jun 13 05:23:43 5Q:thelma rpc.ldmd[6873]: Interrupt
Jun 13 05:23:43 5Q:thelma nport(feed)[9491]: Interrupt
Jun 13 05:23:43 5Q:thelma snow(feed)[10378]: Interrupt
Jun 13 05:23:43 5Q:thelma iita(feed)[10370]: Interrupt
Jun 13 05:23:43 5Q:thelma ofour(feed)[9459]: Interrupt  
Jun 13 05:23:45 5Q:thelma unidata[6899]: Exiting
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: Exiting
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: > Up since: 20000610153554.787
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: > Queue usage (bytes):285161608
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: >          (nregions):   29879
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: > nbytes recycle:   3984792280 ( 63311.904 kb/hr)
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: > nprods deleted:       656340 ( 10678.457 per hour)
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: > First deleted: 20000610143555.011
Jun 13 05:23:45 5Q:thelma pqexpire[6851]: > Last  deleted: 20000613040345.174
Jun 13 05:23:45 5Q:thelma ldm[6882]: Interrupt
Jun 13 05:23:45 5Q:thelma ldm[6882]: Exiting
Jun 13 05:23:45 5Q:thelma rpc.ldmd[6873]: Terminating process group 
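
The line to look for in the log above is the "terminated by signal" entry
that comes just before all the "Interrupt" messages. Signal 11 is normally a
segmentation fault, which would fit with a core file being dumped. As a rough
sketch (not part of the LDM distribution), something like the following
Python could pull those entries out of a log. The function name is made up
for illustration, and the message wording is taken only from the excerpt
above, so other LDM versions may log it differently.

```python
import re

# Wording copied from the thelma log excerpt; treating it as the general
# format is an assumption. \s+ tolerates entries wrapped across lines.
SIGNAL_RE = re.compile(r"child (\d+) terminated by signal\s+(\d+)")

def find_signal_deaths(log_text):
    """Return a list of (child_pid, signal_number) pairs found in log_text."""
    return [(int(pid), int(sig)) for pid, sig in SIGNAL_RE.findall(log_text)]

if __name__ == "__main__":
    sample = (
        "Jun 13 05:23:43 5Q:thelma rpc.ldmd[6873]: child 6846 terminated by signal\n"
        "11\n"
        "Jun 13 05:23:43 5Q:thelma rpc.ldmd[6873]: Killing (SIGINT) process group\n"
    )
    for pid, sig in find_signal_deaths(sample):
        print(f"child {pid} died on signal {sig}")
```

If this turns up signal 11 entries right before the "Interrupt" lines in your
own ldmd.log, the behavior matches what we saw on thelma.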


>   How did you calculate the appropriate queue size?  I suppose I could
> just keep increasing it until the problem doesn't exist anymore...
>  

The queue size depends on the feeds you are receiving. For example, thelma
receives NOAAport, McIDAS and FSL2, and its queue size in bin/ldmadmin is
set to:

$pq_size = 250000000; 


I would take the peak data rates of the feeds, add them together, and add
10% to get the queue size.  You should also check the ldmd.log files for
messages similar to:

Growing data by <size>

If you see these messages then the queue is too small.
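
As a rough illustration of both steps, here is a small Python sketch. It is
only a sketch under stated assumptions: the function names, feed names and
peak figures are made up, and each peak figure is assumed to be the number
of bytes that feed delivers over the period you want the queue to hold. The
log check simply looks for the "Growing data by" text quoted above.

```python
def estimate_queue_size(peak_bytes_by_feed, headroom_percent=10):
    """Sum the per-feed peaks and add headroom (10% by default)."""
    total = sum(peak_bytes_by_feed.values())
    # Integer arithmetic keeps the result an exact byte count.
    return total + total * headroom_percent // 100

def count_growth_messages(log_text):
    """Count log lines containing the 'Growing data by' message."""
    return sum(1 for line in log_text.splitlines() if "Growing data by" in line)

if __name__ == "__main__":
    # Hypothetical peaks in bytes; substitute figures measured on your feeds.
    peaks = {"FEED_A": 150_000_000, "FEED_B": 50_000_000}
    size = estimate_queue_size(peaks)
    # Printed in the same form as the setting in bin/ldmadmin.
    print(f"$pq_size = {size};")

    # Adjust the path to wherever your ldmd.log lives.
    try:
        with open("ldmd.log") as log:
            print("growth messages:", count_growth_messages(log.read()))
    except FileNotFoundError:
        print("ldmd.log not found in the current directory")
```

A nonzero count of growth messages after resizing would suggest the peak
figures were underestimated.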

Robb...

>   Jason
> 
> -- 
> ----------------------------------------------------------------------------
> Jason J. Levit, N9MLA                       Research Scientist,
> address@hidden                  Center for Analysis and Prediction of Storms
> Room 1014                                  University of Oklahoma
> 405/325-3503                               http://www.caps.ou.edu/
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================