
Re: [awipsldm] Re: awipsldm digest: December 02, 1999



On 3 Dec 1999, Ken Waters wrote:

> 
>      Thanks for the reply, Robb.
>      
>      I've done your suggestions and will keep an eye on the system.
>      
>      A couple of items:
>      
>      - I don't think disk space is a problem.  I checked it and didn't see 
>      any full or near-full directories.  I set up a temporary cron job, 
>      though, to run through the night for a day or two and will look at its 
>      output just to be sure.
>      
>      - It happened again yesterday afternoon and the log file shows the 
>      same symptom...a kill signal #11 sent to the program right after a 
>      connection was reset by peer.  Interestingly, it was from the same IP 
>      both times it was disconnected.  The pertinent lines from the log file 
>      are: 
>      
>      (note I made two modifications to this log text...(1) for security I 
>      replaced the actual IP with "(IP)", and (2) I deleted all the 
>      Interrupt messages from the different clients)
>      
>      Dec 02 22:15:01 ls1-ehu pqexpire[8873]: > Recycled   2545.879 kb/hr (  
>       162.787 prods per hour)
>      Dec 02 22:16:27 ls1-ehu (IP)[8877]: Connection reset by peer
>      Dec 02 22:16:57 ls1-ehu (IP)[8877]: run_requester: 19991202221506.017 
>      TS_ENDT {{ANY,  ".*"}}
>      Dec 02 22:17:03 ls1-ehu rpc.ldmd[8872]: child 8877 terminated by 
>      signal 11
>      
Ken,

This is reaching, but I have some sites that run into problems when they
interact with sites running a different version of the LDM: they lose data,
have trouble connecting, etc. It might be worthwhile to have all the sites
running the same version of the LDM.  I only bring this up because
run_requester seems to be the process that brings down the whole LDM.
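
To see which connection is involved each time, a small script can pair every
"terminated by signal" line with whatever identity last logged under that
pid. This is only a sketch: it assumes each log entry sits on a single line
in the layout quoted above (month, day, time, local host, ident[pid]:,
message), and the function name crashed_children is made up for illustration.

```shell
#!/bin/sh
# crashed_children LOGFILE
# Prints "<child pid> <signal> <identity>" for every child that rpc.ldmd
# reports as terminated by a signal. Assumes field 5 is "ident[pid]:".
crashed_children() {
    awk '
        # Remember the latest non-rpc.ldmd identity seen for each pid,
        # e.g. "(IP)[8877]:" gives who["8877"] = "(IP)".
        match($5, /\[[0-9]+\]:$/) {
            pid = substr($5, RSTART + 1, RLENGTH - 3)
            ident = substr($5, 1, RSTART - 1)
            if (ident != "rpc.ldmd") who[pid] = ident
        }
        # "child 8877 terminated by signal 11": the signal is the last field.
        /child [0-9]+ terminated by signal/ {
            for (i = 6; i <= NF; i++)
                if ($i == "child") { cpid = $(i + 1); sig = $NF }
            print cpid, sig, (cpid in who ? who[cpid] : "unknown")
        }
    ' "$1"
}
```

Running it over a few days of logs would show whether the same upstream
host turns up in front of every signal 11.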

>      ..[a series of Interrupt messages from each of the connected sites]...
>      
>      Dec 02 22:17:03 ls1-ehu pqact[8874]: Interrupt
>      Dec 02 22:17:03 ls1-ehu pqexpire[8873]: Interrupt
>      Dec 02 22:17:03 ls1-ehu pqexpire[8873]: > First deleted: 
>      19991201092529.238
>      Dec 02 22:17:03 ls1-ehu rpc.ldmd[8872]: Interrupt
>      Dec 02 22:17:06 ls1-ehu rpc.ldmd[8872]: Terminating process group
>      
>      Is it not advisable to set a cron job to recycle (stop-start) the ldm 
>      on a regular basis?  I realize it's not the best solution, but at 
>      least it will keep the system running through the night.

Another test would be to eliminate some of the requests to different sites
to find which one is the culprit, then have that site update its LDM
version as suggested above.
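
One way to take requests out one at a time is sketched below. It assumes
the requests are ordinary ldmd.conf lines of the form
request <feedtype> <pattern> <host>, with the host as the last field. The
function names and host names are placeholders, not anything from your
configuration.

```shell
#!/bin/sh
# list_requests CONF: show the active (uncommented) request lines, numbered.
list_requests() {
    grep -n '^[[:space:]]*request[[:space:]]' "$1"
}

# disable_request CONF HOST: print CONF to stdout with every request line
# whose last field is HOST commented out. The original file is not changed.
disable_request() {
    awk -v host="$2" '
        $1 == "request" && $NF == host { print "#" $0; next }
        { print }
    ' "$1"
}
```

Write the output to a new file, compare it with the original, and swap it in
only once it looks right. The LDM would need to be restarted to pick up the
change.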

>      
>      I also will always start the ldm in verbose mode from now on...at 
>      least until the problem is solved.
>      
I forgot to warn you that verbose logging in the LDM can make the logs
huge and cause disk space problems. 
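
Since you already have a temporary cron job watching disk space, it could
also watch the log size while verbose mode is on. A rough sketch, assuming a
POSIX shell; the function names are invented here, and the log path and
size limit are yours to choose:

```shell
#!/bin/sh
# log_kb FILE: size of FILE in kilobytes, rounded up.
log_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# log_too_big FILE MAX_KB: succeeds (exit 0) when FILE is larger than MAX_KB.
log_too_big() {
    [ "$(log_kb "$1")" -gt "$2" ]
}

# disk_pct DIR: percent in use on the filesystem that holds DIR.
disk_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A cron script could call log_too_big on ldmd.log and send mail when it
succeeds, and record disk_pct for the log directory on each run.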


Robb...
>      Thanks for your help.
>      
>      Ken
> 
> 
> ______________________________ Reply Separator 
> _________________________________
> Subject: [awipsldm] Re: awipsldm digest: December 02, 1999
> Author:  address@hidden at EXTERNAL
> Date:    12/3/99 10:39 AM
> 
> 
> AwipsLDM List
> On Fri, 3 Dec 1999, The AWIPS LDM list digest wrote:
>      
> > Digest for AwipsLDM
> > The AWIPS LDM list Digest for Thursday, December 02, 1999. 
> >
> > 1. Frequent Unexpected Stoppages of LDM 
> >
> > ---------------------------------------------------------------------- 
> >
> > Subject: Frequent Unexpected Stoppages of LDM 
> > From: "Ken Waters" <address@hidden>
> > Date: Thu, 2 Dec 1999 14:52:38
> > X-Message-Number: 1
> >
> > We are having a problem with LDM stopping every few days or so on our LDAD. 
> >  What are some of the diagnostics I can do to determine why this keeps
> > happening?
> >
> > I looked at our logs and found something like: 
> >
> > Dec 01 10:29:39 ls1-ehu rpc.ldmd[16243]: child 16248 terminated by signal 
> > 11
> >
> > Followed by a series of "Interrupt"s and then finally an "Exiting" message 
> > from rpc.ldmd.
> >
> > Does anyone have any ideas what might be causing these stoppages?  I looked 
> > at the Site Managers Guide and didn't find any clues.  Thanks.
>      
> Ken,
>      
> There are many things that could cause the LDM to just stop. Here's a 
> couple of things I would check:
>      
> - scan the ldmd.log for error messages 
> - remake the LDM queue
> - make sure your disks don't periodically fill to capacity, check scour 
> - some other program interfering with the LDM, i.e. hogging the CPU
> - put the LDM into verbose logging by running, in the LDM home dir:
>      
> % kill -USR2 `cat ldmd.pid`
>      
> - scan the ldmd.log for error messages  again
>      
> Read the ldmd man page for details about rotating LDM verbosity
>      
> Robb...
>      
> >
> > Ken Waters
> > SRH
> >
> >
> >
> > ---
> >
> > END OF DIGEST
> >
> > ---
> > You are currently subscribed to awipsldm as: address@hidden
> > To unsubscribe send a blank email to address@hidden
> >
>      
> ===============================================================================
>  
> Robb Kambic                          Unidata Program Center
> Software Engineer III                  Univ. Corp for Atmospheric Research
> address@hidden             WWW: http://www.unidata.ucar.edu/ 
> ===============================================================================
>      
>      
>      
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================