
[LDM #OWN-737372]: Re: Re-occurring problem with LDM..



Matthew,

> I wanted to get in touch with everyone about a problem we seem to be
> having with the ldm on ice.ssec.wisc.edu as a part of the Antarctic-
> IDD. Below is the tail output of the log file that shows all is fine
> until 8:39:42 UTC, when something goes south.  Now, yes, I am using
> ldm version 6.4.2.4 which isn't new, but I have a second system
> running this feed too (from ice data get pulled over to
> iceberg.ssec.wisc.edu) and it is not having any problems, nor
> suffering this problem, despite having the same feed of data for the
> most part.  This is at least the second time this has happened (last
> time on the 1st of May).
> 
> What is odd is that running ldmadmin watch gives no clue of problems
> (data seems to get in to the queue just fine)...and it seems to me
> that ice.ssec.wisc.edu is just fine feeding my secondary system -
> iceberg.ssec.wisc.edu (as ldmadmin watch on iceberg.ssec.wisc.edu
> matches ice.ssec.wisc.edu with the most recent data!) while I'll bet
> everyone right now is not able to ldmping or connect to get anything
> from ice.ssec.wisc.edu outside SSEC. Can anyone verify if they are
> getting anything recent (after 8:30 UTC) from iceberg.ssec.wisc.edu?
> 
> Of course, a restart will solve this problem, but I'd like to know
> why this is happening before I restart the system.  Any suggestions
> welcome.
> 
> If I don't hear from anyone in a few hours, I'll restart
> ice.ssec.wisc.edu so that the Antarctic-IDD can keep on doing its
> thing...and perhaps just upgrading to the latest LDM is worth trying
> to fix this...
> 
> Thanks all!
> 
> Matthew
> 
> 
> 
> May 22 08:30:02 ice sc030ws079(feed)[14970] NOTE: Starting Up (6.4.2.4/6): 20070522083001.530 TS_ENDT {{EXP,  "USAP.(NZCM|SSCC).*"}}, Alternate
> May 22 08:30:02 ice sc030ws079(feed)[14970] NOTE: topo: sc030ws079.chs.spawar.navy.mil {{EXP, (.*)}}
> May 22 08:31:09 ice sc030ws079(feed)[14970] ERROR: feed or notify failure; COMINGSOON: RPC: Unable to receive; errno = Connection reset by peer
> May 22 08:31:09 ice rpc.ldmd[4364] NOTE: child 14970 exited with status 6
> May 22 08:31:10 ice sc030ws079(feed)[16030] NOTE: Starting Up (6.4.2.4/6): 20070522083109.443 TS_ENDT {{EXP,  "USAP.(NZCM|SSCC).*"}}, Primary
> May 22 08:31:10 ice sc030ws079(feed)[16030] NOTE: topo: sc030ws079.chs.spawar.navy.mil {{EXP, (.*)}}
> May 22 08:39:42 ice rpc.ldmd[4364] NOTE: child 683 terminated by signal 7
> May 22 08:39:42 ice rpc.ldmd[4364] NOTE: Killing (SIGINT) process group
...

The problem is that process 683 was terminated by signal 7
(which is SIGBUS on my system).  The LDM system then shut itself
down because it doesn't know what else to do when one of its child
processes terminates due to faulty code.
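
Signal numbers are not the same on every platform, so it may be worth
confirming what signal 7 means on ice itself.  A minimal sketch, assuming
a Python 3 interpreter is available there (the helper name signal_name is
just for illustration):

```python
import signal

def signal_name(number):
    """Return this platform's name for a signal number, or None if unknown."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a named signal on this platform.
        return None

# Signal 7 is the one reported in the log; the name printed depends on
# the operating system this runs on.
print(signal_name(7))
```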

The question is, what was process 683?  I suggest grep-ing through
your log files to determine that.  Let me know.
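
If it helps, the search can be sketched roughly as follows.  This assumes
the log lines carry the process ID in square brackets after the program
name, as in your excerpt; the function names are illustrative and the
path to the log file is something you would supply:

```python
import re

def lines_for_pid(lines, pid):
    """Return the log lines whose tag contains [pid]."""
    # Match "[683]" exactly so that, e.g., "[6830]" is not picked up.
    pattern = re.compile(r"\[%d\]" % pid)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

def lines_for_pid_in_file(path, pid):
    """Apply lines_for_pid to a log file (the path is supplied by the caller)."""
    with open(path, errors="replace") as log:
        return lines_for_pid(log, pid)

# Two lines taken from the excerpt above (rejoined where the mail wrapped them).
sample = [
    "May 22 08:31:10 ice sc030ws079(feed)[16030] NOTE: topo: "
    "sc030ws079.chs.spawar.navy.mil {{EXP, (.*)}}",
    "May 22 08:39:42 ice rpc.ldmd[4364] NOTE: child 683 terminated by signal 7",
]

# Process 16030 logged the first line; it identifies itself as a feeder.
for line in lines_for_pid(sample, 16030):
    print(line)
```

Note that the "child 683 terminated" line is logged by rpc.ldmd[4364], not
by process 683 itself, so the lines of interest are earlier ones tagged
[683], most likely a "Starting Up" entry showing what that process was.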

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: OWN-737372
Department: Support LDM
Priority: Normal
Status: On Hold