19990526: problems with waldo (cont.)



>From: alan anderson <address@hidden>
>Organization: St. Cloud State
>Keywords: 199905252106.PAA12911 Solaris x86 fsck

Alan,

>Followed the directions you gave regarding use of fsck.  Got several lines
>regarding problems with directories and files, total no. was less than 10.
>Rebooted and system came back up.

Sounds good so far.

>I logged in as ldm, then tried to start ldm.  Output from ldmd.log file is
>listed below.
>
>Lines from ldmd.log after trying to start ldm on 5/26/99
>
>/usr/local/ldm/logs% tail -50 ldmd.log
>May 26 14:41:33 waldo rpc.ldmd[472]: Starting Up (built: Sep 22 1998 08:17:10)
>May 26 14:41:33 waldo hobbes[477]: run_requester: Starting Up:
>hobbes.stcloudstate.edu
>May 26 14:41:33 waldo pqexpire[473]: Starting Up
>May 26 14:41:34 waldo pqbinstats[475]: Starting Up (472)
>May 26 14:41:34 waldo hobbes[477]: run_requester: 19990526134133.322
>TS_ENDT {{FSL2|MCIDAS|IDS|DDPLUS,  ".*"}}
>May 26 14:41:34 waldo udp.ldmd[478]: Starting Up
>May 26 14:41:34 waldo pqact[476]: Starting Up
>May 26 14:41:34 waldo pqexpire[473]: assertion "*binp != OFF_NONE" failed:
>file "pq.c", line 673
>May 26 14:41:34 waldo hobbes[477]: FEEDME(hobbes.stcloudstate.edu): OK
>May 26 14:41:35 waldo localhost[489]: Connection from localhost
>May 26 14:41:35 waldo localhost[489]: Connection reset by peer
>May 26 14:41:35 waldo localhost[489]: Exiting
>May 26 14:41:35 waldo rpc.ldmd[472]: child 473 terminated by signal 6
>May 26 14:41:35 waldo rpc.ldmd[472]: Killing (SIGINT) process group
>May 26 14:41:35 waldo pqbinstats[475]: Interrupt
>May 26 14:41:35 waldo pqbinstats[475]: Exiting
>May 26 14:41:35 waldo pqact[476]: Interrupt
>May 26 14:41:35 waldo pqact[476]: Exiting
>May 26 14:41:35 waldo hobbes[477]: Interrupt
>May 26 14:41:35 waldo hobbes[477]: Exiting
>May 26 14:41:35 waldo udp.ldmd[478]: Interrupt
>May 26 14:41:35 waldo udp.ldmd[478]: Exiting
>May 26 14:41:35 waldo rpc.ldmd[472]: Interrupt
>May 26 14:41:35 waldo rpc.ldmd[472]: Exiting
>May 26 14:41:35 waldo rpc.ldmd[472]: Terminating process group
>May 26 14:41:35 waldo rpc.ldmd[472]: child 474 terminated by signal 2
>/usr/local/ldm/logs% 
>
>I don't have a clue as to what the problem might be with pqexpire, so will
>turn to you.

Typically, errors like the pqexpire assertion failure in your log
("pq.c", line 673) are caused by a corrupt LDM queue.  The solution at
this point is:

<login as ldm>
<make sure that the LDM is not running>
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start
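
If you ever want to script the steps above, here is a minimal sketch in
Python.  It only wraps the three ldmadmin commands already listed; the
function name remake_queue and the dry_run flag are my own invention for
illustration, and it assumes you are logged in as ldm, that ldmadmin is
on your PATH, and that you have already made sure the LDM is not running.

```python
import subprocess

# The three commands from the steps above, in the order they must run.
QUEUE_RESET_STEPS = [
    ["ldmadmin", "delqueue"],
    ["ldmadmin", "mkqueue"],
    ["ldmadmin", "start"],
]


def remake_queue(dry_run=True):
    """Print each step and, unless dry_run is set, execute it.

    Hypothetical helper, not part of the LDM distribution.  With
    check=True a failing step raises CalledProcessError, so later
    steps are never attempted after a failure.
    """
    done = []
    for cmd in QUEUE_RESET_STEPS:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
        done.append(cmd)
    return done
```

Calling remake_queue() with the default dry_run=True just prints the
commands, so you can look them over first; remake_queue(dry_run=False)
actually runs them.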

I did this on waldo for you just to make sure that there wasn't some
other problem.  The LDM started running smoothly and is now decoding
data as fast as it can.  That burst is expected: when you delete and
remake the queue and then restart the LDM, you implicitly ask your
upstream feeder for an hour's worth of data.  All of the data that it
has will come gushing towards you, and the various decoders will have
to work hard to catch up.

As I sit here watching things on waldo, I feel confident that your last
problem was the corrupted queue, and remaking it fixed the problem.
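
If the problem comes back, one quick check is to look in ldmd.log for the
same signature you sent me: an assertion failure followed by a child
terminated by signal 6 (SIGABRT, which is what a failed assertion
raises).  The sketch below is only an illustration based on the log
lines in your message; the function name and the regular expressions
are my assumptions, not part of the LDM, and other log formats may need
different patterns.

```python
import re

# Patterns modelled on the ldmd.log lines quoted in this message.
ASSERT_RE = re.compile(r'(\w[\w.]*)\[(\d+)\]: assertion "(.*)" failed')
SIGNAL_RE = re.compile(r'child (\d+) terminated by signal (\d+)')

SIGABRT = 6  # signal number reported when an assertion aborts a process


def find_queue_trouble(lines):
    """Return (assertion_failures, aborted_pids) found in the log lines.

    assertion_failures is a list of (program, pid, expression) tuples;
    aborted_pids lists children that rpc.ldmd saw die from signal 6.
    """
    failures = []
    aborted = []
    for line in lines:
        m = ASSERT_RE.search(line)
        if m:
            failures.append((m.group(1), int(m.group(2)), m.group(3)))
        m = SIGNAL_RE.search(line)
        if m and int(m.group(2)) == SIGABRT:
            aborted.append(int(m.group(1)))
    return failures, aborted
```

To use it, pass it the lines of the log, for example
find_queue_trouble(open("ldmd.log")).  A pid that shows up in both
results is a good hint that the queue needs to be remade again.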

Tom