
Re: LDM on SGI



On Mon, 19 Jun 2000, Jim Cowie wrote:

> Robb Kambic wrote:
> > 
> > Hiya,
> > 
> > Yes, thelma was down at 5:23 this morning. The crash was caused by a known
> > problem on SGI machines: if the LDM queue grows while pqexpire is
> > running, the queue becomes corrupt. I have remade the queue
> > and restarted the LDM. I'll recalculate what the queue size should be
> > now that new products are coming over NOAAPORT, and I'll implement
> > the new queue size later today. UPC is in the process of replacing thelma's
> > hardware, and we have a new version of the LDM software that doesn't have
> > this problem. It will soon be installed on thelma.
> > 
> > Thanks for the patience,
> > Robb...
> > 
> 
> Hey Robb,
> 
> When this happens, does the machine actually crash or does the
> queue just get corrupted? 

Jim,

The LDM exits; the machine doesn't crash.  If there are no "Growing data by
..." lines in the ldmd.log file, then this is not the problem.
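
A minimal sketch of that log check in Python, assuming the log is a plain-text
file and that the affected lines contain the literal text "Growing data by" as
quoted above. The function names and the file path are hypothetical, and the
rest of the log line format is not shown here, so the match is on that
substring only.

```python
from pathlib import Path

# Substring quoted in this message; the remainder of the log line is not
# reproduced here, so only this fragment is matched.
MARKER = "Growing data by"


def find_growing_lines(log_text):
    """Return (line_number, line) pairs for log lines that contain MARKER."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if MARKER in line
    ]


def check_log(path):
    """Read a log file and return its matching lines (the path is an assumption)."""
    text = Path(path).read_text(errors="replace")
    return find_growing_lines(text)
```

Calling check_log("ldmd.log") from the directory that holds the log returns an
empty list when no such lines are present, which per the above would point away
from the queue-growth problem.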


> We recently started using the LDM (5.0.9)
> on an SGI machine for the first time (IRIX 6.2), and the machine has
> crashed a couple of times in the last few weeks, apparently with random
> memory errors. I don't see anything in the ldm logs indicating the queue
> was growing, though. Our queue size is 400MB. The only thing running on
> this machine is the LDM; we just want it to act as a relay (no pqact
> running either).
> 
> Here are a couple of lines from the two most recent crash dump analyses.
> Each one reports a problem with a different SIMM, which makes me think
> it is not really a hardware error but something that is using that
> chunk of memory at the time of the crash. Thanks for any help.
> 
We also had this problem. It was solved by reseating all the SIMMs in the
machine.  A couple of them needed it.

Robb...

> 
> TIME OF CRASH:
>     960020629 Sat Jun  3 02:23:49 2000
> 
> PANIC STRING:
>     <0>PANIC: IRIX Killed due to Memory Error in SIMM  S11
> 
> ......
> 
> 
> TIME OF CRASH:
>     961100111 Thu Jun 15 14:15:11 2000
> 
> PANIC STRING:
>     <0>PANIC: IRIX Killed due to Memory Error in SIMM  S7
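
When comparing crash times against ldmd.log entries, it can help to confirm
that the epoch value and the printed date in each report agree. A small sketch,
assuming the printed times are US Mountain Daylight Time (UTC-6); that zone is
an assumption and is not stated in the crash reports, and crash_time is a
hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumed zone for the printed crash times: Mountain Daylight Time, UTC-6.
MDT = timezone(timedelta(hours=-6), "MDT")


def crash_time(epoch_seconds, tz=MDT):
    """Format an epoch timestamp in the same style as the crash report."""
    return datetime.fromtimestamp(epoch_seconds, tz).ctime()


# The two epoch values from the crash dump analyses above.
for epoch in (960020629, 961100111):
    print(epoch, crash_time(epoch))
```

Under that assumption both epoch values convert to the dates printed in the
reports, so either form can be used when searching the logs.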
> 
> 
> 
> -- 
> Jim Cowie
> WITI
> Lifeminders, Inc.
> 
> 303-497-8584 (office)
> 720-231-7948 (cell)
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================