
Re: 20010202: ldm gremlins loose (cont.)



Hi Jennie,

Tom and I have been working on this together.  I assume you saw my earlier
message about what happened - ldmfail detected that navier was down and
attempted to fail over.  But it wasn't able to restart the ldm at that
time, probably because not all the processes were dead yet and the
restart detected that.  ldmfail isn't smart enough to respond to that
situation.  (Maintenance item #1: we could at least put a sleep in there
so that the processes have a better chance of dying.)
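
Just to make that concrete, here is a minimal sketch of what such a fix
might look like inside a failover script.  The ldmadmin commands are real
LDM utilities, but the surrounding logic and the 30-second grace period
are assumptions for illustration only:

    #!/bin/sh
    # Hypothetical fragment: stop the ldm, then give the process group
    # a chance to die before attempting the restart.
    ldmadmin stop
    sleep 30    # assumed grace period so the children can exit
    if ldmadmin start; then
        echo "restarted on failover feed"
    else
        echo "restart failed; will retry on the next pass"
    fi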

Anyway, when ldmfail ran again 20 minutes later, it detected that the ldm
was down, restarted it to feed from the failover (because it was that conf
file that was in place due to the previous invocation of ldmfail), then
promptly detected that the primary feed was up, stopped the ldm, swapped
the conf files, and restarted it using the primary feed.  Hence the log
file (ldmd.log.1?) that only covered about 5 seconds of time and was
feeding from sunset.  The brief restart seems silly, but that's what's in
the code.  (Maintenance item #2: detecting when the ldm is down and
detecting when to fail over might be better implemented as two separate
processes.)
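
As a sketch of maintenance item #2, the two checks could run on
independent schedules, say from cron, so that "restart a dead ldm" never
races against "decide whether to fail over".  The script names and paths
below are hypothetical:

    # Hypothetical crontab entries separating the two concerns.
    # Every 10 minutes: is the local ldm running?  If not, restart it
    # with whichever conf file is currently in place.
    0,10,20,30,40,50 * * * * /usr/local/ldm/util/check_local_ldm
    # Every 20 minutes: is the primary feed reachable?  Swap conf files
    # and bounce the ldm only when the answer changes.
    5,25,45 * * * * /usr/local/ldm/util/check_primary_feed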

So, something is wrong with the environment when ldmfail runs.  We are
hypothesizing that the path was not constructed correctly.  In particular,
the '~' in the '~/util' at the front of the path might not have been
expanded properly.  This, coupled with the 'ldm-mcidas<Version>/bin' being
in the path, would have caused it to not find xcd_run and to invoke the
wrong batch.k.  So, we changed the path, replacing '~' with ${HOME}, and
took out the old ldm-mcidas path.
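
Two facts make the hypothesis plausible.  First, a literal '~' embedded
in a PATH string is not expanded when the shell searches for commands
(and the old Bourne /bin/sh does no tilde expansion at all), so a
'~/util' component is effectively a nonexistent directory.  Second, exit
status 127 is exactly the shell's "command not found" status, which ties
the 'child ... exited with status 127' entries in ldmd.log to the
'/bin/sh: batch.k: not found' errors.  A sketch of the change, with the
directory list an assumption rather than windfall's actual settings
('<Version>' stands for whatever ldm-mcidas release was installed):

    # Before: sh treats '~' literally, so ~/util is never searched, and
    # the stale ldm-mcidas directory supplies the wrong batch.k.
    PATH=~/util:/usr/local/ldm/ldm-mcidas<Version>/bin:/usr/local/ldm/bin:/bin:/usr/bin
    # After: ${HOME} is expanded, stale directory removed.
    PATH=${HOME}/util:/usr/local/ldm/bin:/bin:/usr/bin
    export PATH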

Of course, this doesn't explain why ldmfail presumably worked for the ldma
account.  If I had endless time I would explore this mystery...

So, the current configuration seems to be working so far, but it needs
ldmfail to run to test it properly.  I dinked around with ldmfail to try
to get it to fail over, but have not yet succeeded at that.  I'll have to
get back to it later.
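
For when I get back to it: the manual check I would start with is just
ldmping against each upstream (the same utility that shows up in
windfall's logs), and then running ldmfail by hand with the primary
request pointed at an unreachable host, to force the failover branch.
A rough sketch, with the "down" host name purely hypothetical:

    # See what ldmfail sees for each feed site.
    ldmping navier.meteo.psu.edu
    ldmping sunset.meteor.wisc.edu
    # To force a failover test: temporarily aim the primary request at
    # a host that cannot answer (hypothetical name), run ldmfail by
    # hand, and watch the log.
    ldmping no-such-host.example.edu
    tail -f /usr/local/ldm/logs/ldmd.log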

I'll check windfall over the weekend.

Anne

***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************

On Fri, 2 Feb 2001, Unidata Support wrote:

> 
> ------- Forwarded Message
> 
> >Subject: ldm gremlins loose
> >Organization: UVa
> >Keywords: 200102012128.f11LSdX24812 LDM McIDAS ROUTE PP
> 
> Anne et al.,
> 
> Still problems popping up with the ldm on windfall.  The 
> logs seem to imply some switching of servers (from navier to
> sunset and back?), the Mcidas data kept flowing, but then
> around 14:50 UTC, the batch.k not found error cropped up again
> in the Mcidas log (after it had run smoothly since last night).
> 
> Feb 02 14:16:58 pnga2area[17505]: output file pathname:
> data/mcidasd/AREA0179
> Feb 02 14:16:59 pnga2area[17505]: unPNG::   154361    613456
> 3.9742
> Feb 02 14:16:59 pnga2area[17505]: Exiting
> Feb 02 14:17:53 pnga2area[17585]: Starting Up
> Feb 02 14:17:53 pnga2area[17585]: output file pathname:
> data/mcidasd/AREA0129
> Feb 02 14:17:54 pnga2area[17585]: unPNG::  1407965   2439056
> 1.7323
> Feb 02 14:17:54 pnga2area[17585]: Exiting
> Feb 02 14:50:18 pnga2area[18237]: Starting Up
> Feb 02 14:50:18 pnga2area[18237]: output file pathname:
> data/mcidasd/AREA1110
> Feb 02 14:50:18 pnga2area[18237]: unPNG::    79487    242800
> 3.0546
> Feb 02 14:50:18 pnga2area[18237]: Exiting
> /bin/sh: batch.k: not found
> Feb 02 14:50:20 pnga2area[18315]: Starting Up
> Feb 02 14:50:20 pnga2area[18315]: output file pathname:
> data/mcidasd/AREA0167
> Feb 02 14:50:21 pnga2area[18315]: unPNG::   303200    607776
> 2.0045
> Feb 02 14:50:21 pnga2area[18315]: Exiting
> Feb 02 14:50:21 pnga2area[18330]: Starting Up
> Feb 02 14:50:21 pnga2area[18330]: output file pathname:
> data/mcidasd/AREA1132
> Feb 02 14:50:21 pnga2area[18330]: unPNG::    80787    242720
> 3.0044
> Feb 02 14:50:21 pnga2area[18330]: Exiting
> /bin/sh: batch.k: not found
> 
> The ldmd.log is again full of 'child pid exited with status 127'
> messages....pretty aggravating, huh?
> 
> Your help is much appreciated.
> 
> Jennie
> 
> >From address@hidden Fri Feb  2 09:51:19 2001
> >Subject: ldm log contents
> 
> Just thought I would paste some of the logs here, in case they
> continue to cycle, and this gets lost.  It would appear that
> things ran from 22:46 last night until 14:07 this morning, when
> for some reason, I got a disconnect from navier, and then
> an RPC: Program not registered error (a problem on their end?).
> This is from the ldmd.log.2 file:
> 
> Feb 01 22:46:39 windfall.evsc.Virginia.EDU rpc.ldmd[11574]:
> Starting Up (built: Aug 24 2000 16:14:30)
> Feb 01 22:46:40 windfall.evsc.Virginia.EDU navier[11578]:
> run_requester: Starting Up: navier.meteo.psu.edu
> Feb 01 22:46:40 windfall.evsc.Virginia.EDU navier[11578]:
> run_requester: 20010201224535.619 TS_ENDT {{HDS|DDPLUS,
> ".*"},{MCIDAS,  "^pnga2area Q[01]"}}
> Feb 01 22:46:40 windfall.evsc.Virginia.EDU pqbinstats[11577]:
> Starting Up (11574)
> Feb 01 22:46:40 windfall.evsc.Virginia.EDU pqact[11576]: Starting
> Up
> Feb 01 22:46:40 windfall.evsc.Virginia.EDU navier[11578]:
> FEEDME(navier.meteo.psu.edu): OK
> Feb 01 22:46:42 windfall.evsc.Virginia.EDU localhost[11622]:
> Connection from localhost
> Feb 01 22:46:42 windfall.evsc.Virginia.EDU localhost[11622]:
> Connection reset by peer
> Feb 01 22:46:42 windfall.evsc.Virginia.EDU localhost[11622]:
> Exiting
> Feb 02 14:07:33 windfall.evsc.Virginia.EDU pqact[11576]:
> pbuf_flush 5: time elapsed   2.238267
> Feb 02 14:18:13 windfall.evsc.Virginia.EDU navier[11578]:
> Connection reset by peer
> Feb 02 14:18:13 windfall.evsc.Virginia.EDU navier[11578]:
> Disconnect
> Feb 02 14:18:43 windfall.evsc.Virginia.EDU navier[11578]:
> run_requester: 20010202141758.951 TS_ENDT {{HDS|DDPLUS,
> ".*"},{MCIDAS,  "^pnga2area Q[01]"}}
> Feb 02 14:18:43 windfall.evsc.Virginia.EDU navier[11578]:
> FEEDME(navier.meteo.psu.edu): RPC: Program not registered
> Feb 02 14:30:00 windfall.evsc.Virginia.EDU ldmping[17644]:
> SVC_UNAVAIL   0.095947    0   navier.meteo.psu.edu  RPC: Program
> not registered
> Feb 02 14:30:03 windfall.evsc.Virginia.EDU rpc.ldmd[11574]:
> Exiting
> Feb 02 14:30:03 windfall.evsc.Virginia.EDU rpc.ldmd[11574]:
> Terminating process group
> Feb 02 14:30:03 windfall.evsc.Virginia.EDU pqact[11576]: Exiting
> Feb 02 14:30:03 windfall.evsc.Virginia.EDU pqbinstats[11577]:
> Exiting
> Feb 02 14:30:33 windfall.evsc.Virginia.EDU navier[11578]: Exiting
> 
> 
> But then it seems the ldm tried to fail over to sunset, and this
> corresponds with the time that the batch.k not found errors started
> cropping up in the mcidas.log file.
> Furthermore, it seems this never got any data flowing.
>  
> windfall: /usr/local/ldm/logs $ more ldmd.log.1
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU rpc.ldmd[17835]: Starting Up 
> (built: Aug 24 2000 16:14:30)
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU sunset[17840]: run_requester: 
> Starting Up: sunset.meteor.wisc.edu
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU sunset[17840]: run_requester: 
> 20010202141758.951 TS_ENDT {{HDS|DDPLUS,  ".*"},
> {MCIDAS,  "^pnga2area Q[01]"}}
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU pqact[17837]: Starting Up
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU pqact[17837]: child 17841 exited 
> with status 127
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU pqbinstats[17839]: Starting Up 
> (17835)
> Feb 02 14:50:01 windfall.evsc.Virginia.EDU pqact[17837]: child 17842 exited 
> with status 127
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU rpc.ldmd[17835]: child 17836 
> exited with status 127
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU localhost[17852]: Connection from 
> localhost
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU localhost[17852]: Connection reset 
> by peer
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU localhost[17852]: Exiting
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU sunset[17840]: 
> FEEDME(sunset.meteor.wisc.edu): OK
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU rpc.ldmd[17835]: Exiting
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU rpc.ldmd[17835]: Terminating 
> process group
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU pqbinstats[17839]: Exiting
> Feb 02 14:50:03 windfall.evsc.Virginia.EDU pqact[17837]: Exiting
> Feb 02 14:50:04 windfall.evsc.Virginia.EDU sunset[17840]: Exiting
> 
> Six seconds later, after trying to feed from meteor, it switches
> back to navier??  The most recent log shows all the errors in child 
> processes exiting, but this is again requesting data from navier,
> and the mcidas log already has the errors on batch.k being recorded.
> 
> 
> 
> windfall: /usr/local/ldm/logs $ more ldmd.log
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU rpc.ldmd[17891]:
> Starting Up (built: Aug 24 2000 16:14:30)
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: Starting Up
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqbinstats[17896]:
> Starting Up (17891)
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: child
> 17898 exited with status 127
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU navier[17900]:
> run_requester: Starting Up: navier.meteo.psu.edu
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU navier[17900]:
> run_requester: 20010202141758.951 TS_ENDT {{HDS|DDPLUS,  ".*"},
> {MCIDAS,  "^pnga2area Q[01]"}}
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: child
> 17895 exited with status 127
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU navier[17900]:
> FEEDME(navier.meteo.psu.edu): OK
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pbuf_flush (4) write: Broken pipe
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pipe_dbufput: xcd_runHRS write error
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pipe_prodput: trying again
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pbuf_flush (4) write: Broken pipe
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pipe_dbufput: xcd_runHRS write error
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: child
> 17901 exited with status 127
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: child
> 17903 exited with status 127
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pbuf_flush (4) write: Broken pipe
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pipe_dbufput: xcd_runHRS write error
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pipe_prodput: trying again
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pbuf_flush (4) write: Broken pipe
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]:
> pipe_dbufput: xcd_runHRS write error
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: child
> 17907 exited with status 127
> Feb 02 14:50:06 windfall.evsc.Virginia.EDU pqact[17894]: child
> 17905 exited with status 127
> Feb 02 14:50:07 windfall.evsc.Virginia.EDU pqact[17894]:
> pbuf_flush (4) write: B
> 
> Hope this might help.  Was afraid that if the system failed
> over any more, these logs might get lost.....
> 
> Jennie
> 
> 
> 
> ------- End of Forwarded Message
> 
>