
Re: 20000725: 20000725: LDM dying



Gilbert Sebenste wrote:

> Hi Steve,
>
> > Gilbert,
> > In the logs you sent, the assertion was thrown by pqexpire. Is this always
> > the case, or does it sometimes come from pqact or ldmd etc. Does the error
> > generally occur during the same time of day- or do you run any type of disk
> > defragmentation that might move part of the memory mapped data queue?
>
> Hmmm. Well, it does occur any time of the day or night, at random. And no,
> we don't run a defrag program. I thought it might be because it's only a
> PI 200 MHZ machine (too slow of a bus or something).
>
> > I'm throwing out things to check. Maybe Anne has more experience with Linux.
> >
> > Steve Chiswell
>
> Thanks!
>
> *******************************************************************************
> Gilbert Sebenste                                                     ********
> Internet: address@hidden    (My opinions only!)                     ******
> Staff Meteorologist, Northern Illinois University                      ****
> E-mail: address@hidden                                 ***
> web: http://weather.admin.niu.edu                                      **
> Work phone: 815-753-5492                                                *
> *******************************************************************************

Hi Gilbert,

And I too say, "Hmmmm...."

I also think that the queue is becoming corrupted.  I wouldn't think that
system load per se would be the cause.  More information would help:

What version of the LDM are you running?

Tell us a little bit about your hardware: How much memory do you have?  What
kind of interface is there to your hard drive?  IDE?  SCSI?

Please gather more info from other crashes.  That is, save some more
instances of the log entries that occur at the time of a crash.  Also, take a
look at the system logs at the time of the crashes - perhaps something is
appearing there that coincides with the crashes.
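
If it helps with lining up the two sets of logs, here is a rough Python
sketch of pulling out the entries that fall within a few minutes of a crash.
It assumes each line starts with a syslog-style "Mon DD HH:MM:SS" stamp (your
logs may differ), and the function names and sample lines are made up purely
for illustration:

```python
from datetime import datetime, timedelta

def parse_stamp(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' stamp; return None if there is none."""
    try:
        # Syslog-style stamps carry no year, so the caller supplies one.
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def lines_near(log_lines, crash_time, window_minutes=5):
    """Return the log lines stamped within window_minutes of crash_time."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in log_lines:
        stamp = parse_stamp(line, crash_time.year)
        if stamp is not None and abs(stamp - crash_time) <= window:
            selected.append(line)
    return selected

# Placeholder entries, not real log output.
SAMPLE_LOG = [
    "Jul 25 03:10:00 host svc[1]: placeholder entry A",
    "Jul 25 03:14:07 host svc[1]: placeholder entry B",
    "Jul 25 04:00:00 host svc[1]: placeholder entry C",
    "continuation line without a timestamp",
]
CRASH = datetime(2000, 7, 25, 3, 13, 0)

for entry in lines_near(SAMPLE_LOG, CRASH):
    print(entry)
```

Running the same function over the LDM log and the system log with the same
crash time would give two short excerpts that are easy to compare side by
side.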

I'm wondering if perhaps there is some system service running that might
contribute, somehow, to the corruption.  Or, maybe there's a bad block on
your disk - if so, something would probably appear in the system logs.
(Chiz just showed me a disk test you can run: /etc/format.  You need root
privileges for this.)

Otherwise, I assume you're aware that an improper LDM shutdown could corrupt
the queue.  For example, stopping the LDM and then rebooting the machine
right away, or otherwise killing LDM processes before they have been allowed
to finish gracefully, could cause queue corruption.  As has been discussed on
ldm-users, sometimes it takes some minutes for all LDM processes to die.

Also, the queue should be in a local directory.  On my Linux box, if I make a
queue that is on a remotely mounted directory I get an error right away.  But
it seems possible that, depending on the OS, an error due to a remote mount
might occur at seemingly random times, say when a mount was changed or a
connection was unavailable.
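
One way to double check this on Linux is to look up which mounted filesystem
the queue file lives on.  Below is a minimal Python sketch, assuming a mount
table in the /proc/mounts layout (device, mount point, type, ...).  The list
of "remote" filesystem types, the function names, and the sample paths are
all my own assumptions for illustration, not anything from the LDM:

```python
# Filesystem types treated as remote here; an assumed, non-exhaustive list.
REMOTE_FS_TYPES = {"nfs", "nfs4", "cifs", "smbfs", "afs", "coda"}

def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point holding path."""
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare on a trailing slash so "/data" does not match "/database".
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best

def is_remote(path, mounts_text):
    """True if path sits on a filesystem type listed in REMOTE_FS_TYPES."""
    mount = find_mount(path, mounts_text)
    return mount is not None and mount[1] in REMOTE_FS_TYPES

# Made-up mount table and queue paths, for illustration only.
SAMPLE = """/dev/hda1 / ext2 rw 0 0
server:/export/data /data nfs rw 0 0
"""

print(is_remote("/data/ldm/ldm.pq", SAMPLE))
print(is_remote("/usr/local/ldm/data/ldm.pq", SAMPLE))
```

On a real machine you would pass in the contents of /proc/mounts and the
absolute path of your own queue file in place of the sample values.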

Anne



--
***************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************