
awipsldm digest: December 13, 1999 (fwd)




===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Tue, 14 Dec 1999 00:00:05 -0500
From: The AWIPS LDM list digest <address@hidden>
To: awipsldm digest recipients <address@hidden>
Subject: awipsldm digest: December 13, 1999

Digest for AwipsLDM
The AWIPS LDM list Digest for Monday, December 13, 1999.

1. Re: awipsldm digest: December 02, 1999

----------------------------------------------------------------------

Subject: Re: awipsldm digest: December 02, 1999
From: Ken Waters <address@hidden>
Date: 13 Dec 1999 12:31:04 -0500
X-Message-Number: 1


     Just to follow up on our frequent LDM stoppages...
     
     We still periodically receive the "signal 11" terminations of the 
     program.  These events seem to be tied to the connection to the same 
     site each time.  Although we have not determined the cause of the 
     terminations, we have at least installed a script (courtesy of Western 
     Region SSD) which checks to see if the LDM is running and if not, 
     restarts it.  At least that keeps us in business.  
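
A watchdog of the kind described above might look roughly like the following Python sketch. It is not the Western Region SSD script, which is not reproduced in this thread. The process name `rpc.ldmd` comes from the log excerpt further down; the `ps -e -o comm=` invocation, the `ldmadmin start` restart command, and the function names are assumptions to adapt to the local installation.

```python
import subprocess

LDM_PROCESS = "rpc.ldmd"             # server name as it appears in the log excerpt
RESTART_CMD = ["ldmadmin", "start"]  # assumed restart command; adjust for your site


def ldm_is_running(ps_output, name=LDM_PROCESS):
    """Return True if any line of `ps -e -o comm=` output names the process."""
    for line in ps_output.splitlines():
        command = line.strip()
        # Some systems print a full path, so compare only the last component.
        if command.rsplit("/", 1)[-1] == name:
            return True
    return False


def check_and_restart():
    """Restart the LDM if no server process is found; return True if restarted."""
    ps = subprocess.run(["ps", "-e", "-o", "comm="],
                        capture_output=True, text=True, check=True)
    if ldm_is_running(ps.stdout):
        return False
    subprocess.run(RESTART_CMD, check=True)
    return True
```

Calling `check_and_restart()` from cron every few minutes would approximate the behavior described. It only masks the crash, so the underlying cause still needs to be found.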
     
     I still would like to know what is causing these problems.  I have 
     eliminated the possibility that it is related to a disk filling up or 
     to conflict with CPU usage.  I did check with the site on the other 
     end and it turns out they are running version 5.0.6.  Based on my 
     conversation, they will try installing 5.0.8 (the version we are 
     running) to see if that makes a difference.
     
     Ken Waters
     Southern Region SSD


______________________________ Reply Separator _________________________________
Subject: [awipsldm] Re: awipsldm digest: December 02, 1999
Author:  address@hidden at EXTERNAL
Date:    12/3/1999 2:26 PM


AwipsLDM List
On 3 Dec 1999, Ken Waters wrote:
     
>
>      Thanks for the reply, Robb.
>
>      I've done your suggestions and will keep an eye on the system. 
>
>      A couple of items:
>
>      - I don't think disk space is a problem.  I checked it and didn't see 
>      any full or near-full directories.  I set up a temporary cron job,
>      though, to run through the night for a day or two and will look at its 
>      output just to be sure.
>
>      - It happened again yesterday afternoon and the log file shows the 
>      same symptom...a kill signal #11 sent to the program right after a
>      connection was reset by peer.  Interestingly, it was from the same IP 
>      both times it was disconnected.  The pertinent lines from the log file 
>      are:
>
>      (note I made two modifications to this log text...(1) for security I 
>      replaced the actual IP with "(IP)", and (2) I deleted all the
>      Interrupt messages from the different clients) 
>
>      Dec 02 22:15:01 ls1-ehu pqexpire[8873]: > Recycled   2545.879 kb/hr ( 162.787 prods per hour)
>      Dec 02 22:16:27 ls1-ehu (IP)[8877]: Connection reset by peer
>      Dec 02 22:16:57 ls1-ehu (IP)[8877]: run_requester: 19991202221506.017 TS_ENDT {{ANY,  ".*"}}
>      Dec 02 22:17:03 ls1-ehu rpc.ldmd[8872]: child 8877 terminated by signal 11
>
Ken,
     
This is a reach, but I have some sites that have problems when they 
interact with sites running a different version of the LDM: they lose data, 
have trouble connecting, etc. It might be worthwhile to have all the sites 
running the same version of the LDM.  I only bring this up because the 
run_requester seems to be the process bringing down the whole LDM.
     
>      ..[a series of Interrupt messages from each of the connected sites]... 
>
>      Dec 02 22:17:03 ls1-ehu pqact[8874]: Interrupt
>      Dec 02 22:17:03 ls1-ehu pqexpire[8873]: Interrupt
>      Dec 02 22:17:03 ls1-ehu pqexpire[8873]: > First deleted: 19991201092529.238
>      Dec 02 22:17:03 ls1-ehu rpc.ldmd[8872]: Interrupt
>      Dec 02 22:17:06 ls1-ehu rpc.ldmd[8872]: Terminating process group 
>
>      Is it not advisable to set a cron job to recycle (stop-start) the ldm 
>      on a regular basis?  I realize it's not the best solution, but at
>      least it will keep the system running through the night.
     
Another test would be to eliminate some of the requests to different sites 
to find which one is the culprit, then have that site update its LDM 
version as suggested above.
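
Before removing requests one at a time, the existing logs can be tallied to see which connection the crashing child belonged to. The Python sketch below assumes the real log file has one entry per line in the format shown in the excerpt (the lines quoted in this message were wrapped in transit), and that a child process logs under its peer's name with its own PID in brackets. The function name `crashes_by_peer` is made up for illustration.

```python
import re
from collections import Counter

# Example entry: "Dec 02 22:16:27 ls1-ehu (IP)[8877]: Connection reset by peer"
ENTRY = re.compile(
    r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ "
    r"(?P<ident>\S+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)
CHILD_DIED = re.compile(r"child (?P<pid>\d+) terminated by signal (?P<sig>\d+)")


def crashes_by_peer(lines):
    """Count child terminations, keyed by (log identifier of the child, signal)."""
    ident_for_pid = {}
    crashes = Counter()
    for line in lines:
        entry = ENTRY.match(line.strip())
        if entry is None:
            continue  # skip anything that is not a recognizable log entry
        # Remember which identifier most recently logged under this PID.
        ident_for_pid[entry["pid"]] = entry["ident"]
        died = CHILD_DIED.search(entry["msg"])
        if died:
            peer = ident_for_pid.get(died["pid"], "unknown")
            crashes[(peer, int(died["sig"]))] += 1
    return crashes
```

Passing the open log file to `crashes_by_peer()` and printing `most_common()` on the result would show whether the signal 11 terminations really cluster on one peer.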
     
>
>      I also will always start the ldm in verbose mode from now on...at 
>      least until the problem is solved.
>
I forgot to warn you that verbose logging in the LDM can make the logs 
huge and cause disk space problems.
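
One way to keep an eye on this while verbose logging is on is a small check run from cron, along the lines of the Python sketch below. The function names and the thresholds passed to them are illustrative assumptions, and the log path has to be supplied by the caller since none is given in this thread.

```python
import os
import shutil


def disk_fraction_used(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_log(log_path, max_log_bytes, max_disk_fraction):
    """Return a list of warning strings; an empty list means nothing to report."""
    warnings = []
    size = os.path.getsize(log_path)
    if size > max_log_bytes:
        warnings.append(f"{log_path} is {size} bytes (limit {max_log_bytes})")
    # Check the filesystem that holds the log, not the root filesystem.
    used = disk_fraction_used(os.path.dirname(os.path.abspath(log_path)))
    if used > max_disk_fraction:
        warnings.append(
            f"filesystem is {used:.0%} full (limit {max_disk_fraction:.0%})"
        )
    return warnings
```

Printing whatever `check_log()` returns is enough for cron to mail the warnings to the job's owner on most systems.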
     
     
Robb...
>      Thanks for your help.
>
>      Ken
>
>
> ______________________________ Reply Separator _________________________________
> Subject: [awipsldm] Re: awipsldm digest: December 02, 1999 
> Author:  address@hidden at EXTERNAL
> Date:    12/3/99 10:39 AM
>
>
> AwipsLDM List
> On Fri, 3 Dec 1999, The AWIPS LDM list digest wrote: 
>
> > Digest for AwipsLDM
> > The AWIPS LDM list Digest for Thursday, December 02, 1999. 
> >
> > 1. Frequent Unexpected Stoppages of LDM



---

END OF DIGEST
