
[LDM #XDI-772585]: Problems hanging our server possibly related to LDM?



Greg,

> We've been having problems with a Linux server we are running. The
> server seems to intermittently get into a hung state where processes
> seem to freeze and cannot be killed even by root. This problem has
> surfaced in the last few weeks but we have been unable to find the root
> cause of it. It appears to be related to the LDM in that if LDM is
> stopped the problem seems to immediately clear up. There are
> unfortunately no messages produced in the system log during the freeze
> state to help diagnose the problem. Our systems folks have been working
> on the issue for some time and I had an e-mail conversation with Mike
> Schmidt today. Here's my last e-mail to him which he suggested I forward
> to you:
> 
> Mike,
> >    Thanks again for the suggestions. I passed them along to Ted. He
> > mentioned that we've run memtest with no complaints, and they've checked
> > the seating of the memory boards as well, although Ted may check that again.
> > Ted also mentions that when the LDM tries to connect to wsihcsn, it
> > spends 5 minutes trying to connect, then times out and tries again. This
> > happens 4 times in succession, so there is a 20-minute period where our
> > machine (tsunami.eol.ucar.edu or ingest.eol.ucar.edu) is trying to
> > connect, and this seems to correlate well with the period of time the
> > machine hangs.
> 
> ldmd.log:
> 
> Oct 11 13:47:55 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
> 20071011124755.819 TS_ENDT {{WSI,  "^NOW/MASTER"}}
> Oct 11 13:54:13 ingest wsihcsn[9164] ERROR: Terminating due to LDM
> failure; Couldn't connect to LDM on wsihcsn.unidata.ucar.edu using
> either port 388 or portmapper; : RPC: Remote system error - Connection
> timed out
> Oct 11 13:54:54 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
> 20071011125454.078 TS_ENDT {{WSI,  "^NOW/MASTER"}}
> Oct 11 14:01:12 ingest wsihcsn[9164] ERROR: Terminating due to LDM
> failure; Couldn't connect to LDM on wsihcsn.unidata.ucar.edu using
> either port 388 or portmapper; : RPC: Remote system error - Connection
> timed out
> Oct 11 14:01:53 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
> 20071011130153.584 TS_ENDT {{WSI,  "^NOW/MASTER"}}
> Oct 11 14:08:11 ingest wsihcsn[9164] ERROR: Terminating due to LDM
> failure; Couldn't connect to LDM on wsihcsn.unidata.ucar.edu using
> either port 388 or portmapper; : RPC: Remote system error - Connection
> timed out
> Oct 11 14:08:54 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
> 20071011130854.137 TS_ENDT {{WSI,  "^NOW/MASTER"}}
> Oct 11 14:15:12 ingest wsihcsn[9164] ERROR: Terminating due to LDM
> failure; Couldn't connect to LDM on wsihcsn.unidata.ucar.edu using
> either port 388 or portmapper; : RPC: Remote system error - Connection
> timed out
> Oct 11 14:15:47 ingest rpc.ldmd[9155] NOTE: SIGTERM
> 
> The SIGTERM is us stopping ldm. Can you provide any clues that might
> help us pinpoint the source of the problem here? Is this connection
> problem purely coincidental? After the failure today, I commented out
> the feed request in our ldmd.conf file for the WSI Feed. We also
> upgraded our ldm installation from 6.4.4

The LDM shouldn't be able to hang the computer: it doesn't have the
necessary authority at the time in question, it isn't doing anything
that should hang the computer, and the O/S should prevent anything the
LDM does from hanging it. I therefore suspect a problem with the
computer or the O/S. See, for example,
<https://bugzilla.redhat.com/show_bug.cgi?id=166292>.
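
One way to check whether the connection attempts really coincide with
the hangs is to measure them from the log. The following is a minimal
Python sketch, not part of the LDM. It assumes the syslog-style lines
quoted above (with the wrapped continuation lines dropped) and that an
attempt runs from a "desired product-class" NOTE to the next ERROR. The
names LOG, LINE and attempt_durations are made up for illustration.

```python
import re
from datetime import datetime

# First line of each entry from the quoted ldmd.log excerpt.
LOG = """\
Oct 11 13:47:55 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
Oct 11 13:54:13 ingest wsihcsn[9164] ERROR: Terminating due to LDM
Oct 11 13:54:54 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
Oct 11 14:01:12 ingest wsihcsn[9164] ERROR: Terminating due to LDM
Oct 11 14:01:53 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
Oct 11 14:08:11 ingest wsihcsn[9164] ERROR: Terminating due to LDM
Oct 11 14:08:54 ingest wsihcsn[9164] NOTE: LDM-6 desired product-class:
Oct 11 14:15:12 ingest wsihcsn[9164] ERROR: Terminating due to LDM
"""

# Month, day, time, host, process[pid], then the message level.
LINE = re.compile(r"^\w{3} +\d+ (\d\d:\d\d:\d\d) \S+ \S+\[\d+\] (NOTE|ERROR):")


def attempt_durations(text):
    """Return the seconds between each NOTE and the ERROR that follows it.

    Only the time of day is parsed, so this assumes the excerpt does not
    cross midnight.
    """
    durations = []
    started = None
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # wrapped continuation line or unrelated entry
        when = datetime.strptime(match.group(1), "%H:%M:%S")
        if match.group(2) == "NOTE":
            started = when
        elif started is not None:
            durations.append(int((when - started).total_seconds()))
            started = None
    return durations


durations = attempt_durations(LOG)
print(durations)
```

For the excerpt above this gives 378 seconds (a little over six minutes)
per attempt, so the four attempts span roughly 27 minutes rather than
20. That is the window to compare against the observed hang.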

I also recommend running "strace" on the LDM processes to discover what
they are doing when the hang occurs.
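
As a rough illustration of the strace suggestion, the Python sketch
below builds an strace command line that attaches to a running process.
The helper name strace_command and the output path are placeholders, and
the PID is simply the one shown for the wsihcsn process in the quoted
log. The flags used (-f, -tt, -T, -o, -p) are standard strace options,
but which ones are most useful here is an assumption.

```python
import shlex


def strace_command(pid, outfile):
    """Build an strace command line that attaches to a running process."""
    return [
        "strace",
        "-f",            # also trace child processes created by fork
        "-tt",           # prefix each line with wall-clock time (microseconds)
        "-T",            # show the time spent in each system call
        "-o", outfile,   # write the trace to a file instead of stderr
        "-p", str(pid),  # attach to an already-running process
    ]


# 9164 is the PID of the wsihcsn downstream process in the quoted log;
# the output path is a placeholder.
cmd = strace_command(9164, "/tmp/ldm-9164.strace")
print(shlex.join(cmd))
```

The printed command would need to be run as root or as the user that
owns the LDM processes. The wall-clock timestamps from -tt can then be
lined up against the entries in ldmd.log.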

> Thanks,
> Greg
> --
> 
> ~~N~A~T~I~O~N~A~L~~C~E~N~T~E~R~~F~O~R~~A~T~M~O~S~.~~R~E~S~E~A~R~C~H
> Greg Stossmeister                      e-mail: address@hidden
> NCAR/EOL                               phone: (303)497-8692
> P.O. Box 3000                          web: http://www.eol.ucar.edu
> Boulder, CO 80307-3000
> ~~~~~~~~E~A~R~T~H~~~O~B~S~E~R~V~I~N~G~~~L~A~B~O~R~A~T~O~R~Y~~~~~~~~

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: XDI-772585
Department: Support LDM
Priority: Normal
Status: Closed