
19991123: ldm problems after reboot at St. Cloud State



>From: alan anderson <address@hidden>
>Organization: St. Cloud State
>Keywords: 199911232244.PAA17363 LDM 

Alan,

>We are having a problem with our ldm machine.  I checked it today
>and found it in a state that looked to me like it had been shutdown
>and rebooted; a ps indicated no  ldm processes running.
>The ldm does not start automatically upon a reboot.

Does not start, or did not start?

>I did an ldmadmin stop just to be sure, and then rebooted.  System
>came back up with no problems or messages.
>
>Tried to start the ldm, but the start did not confirm; I recalled that
>queue often is corrupted, so deleted the queue, then a mkqueue, then
>restarted ldm, which was confirmed.

Did you make sure to become the user 'ldm' before trying to restart
the LDM?  This is important: the LDM should never be run as 'root'.
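
One way to guard against this is a small check that runs before any
start command.  What follows is only a sketch: the function name
require_user is made up for illustration and is not part of the LDM,
and the 'ldmadmin start' line is left commented out.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM): succeed only if the
# current user matches the expected account, 'ldm' by default.
require_user() {
    expected="${1:-ldm}"
    current="$(id -un)"
    if [ "$current" != "$expected" ]; then
        echo "refusing to run: user is '$current', expected '$expected'" >&2
        return 1
    fi
    return 0
}

# Usage, commented out so the sketch has no side effects:
# require_user ldm && ldmadmin start
```

With this in place, an accidental start from a root shell stops with
a message instead of creating root-owned files.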

>Log files (excerpt below) show that something is still wrong.  My shallow
>memory about what this could be leaves me blank, so I am writing to you.

No problem.

>waldo is the place where ldm lives; I think you already know the rest 
>if you want to look around.  Otherwise, could I have some instructions?

I decided to login and do some snooping.  More below.

>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: pipe_prodput: trying again
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: child 5357 exited with status 127
>Nov 23 22:27:58 waldo pqact[1514]: child 5355 exited with status 127
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: pipe_prodput: trying again
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: child 5361 exited with status 127
>Nov 23 22:27:58 waldo pqact[1514]: child 5359 exited with status 127

The repeated start and failure of 'xcd_run DDS' is telling us that the
process that xcd_run is running (ingetext.k in this case) is exiting
without reading from the LDM.  This is a big hint that the LDM was
most likely not started by the user 'ldm', since things were working
correctly before.
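
As a side note on the "exited with status 127" lines: 127 is the
status a POSIX shell returns when it cannot find the command it was
asked to run.  That would be consistent with the environment of the
wrong user not having the right directories in its PATH, but that
reading is an assumption on my part, not something the log proves.
A minimal sketch of where the number comes from (status_of and the
command name are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: run a command string through sh, discard its
# output, and print only its exit status.
status_of() {
    sh -c "$1" >/dev/null 2>&1
    echo $?
}

# A command name that does not exist anywhere in PATH.
status_of "no_such_command_for_demo"    # prints 127
```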

I took a quick look around and found that you must have started the
LDM as 'root', since a number of files were owned by root:

/usr/local/ldm% ls -al
total 10980
drwxr-xr-x  16 ldm      data        1024 Nov 23 22:07 ./
-rw-rw-r--   1 root     other          5 Nov 23 22:07 ldmd.pid

waldo# ls -al
total 3854
drwxr-xr-x   2 ldm      other       1024 Nov 23 22:48 .
drwxr-xr-x   7 ldm      data         512 Nov 23 22:06 ..
-rw-rw-r--   1 ldm      data         446 Nov 21 19:03 1999112118.stats
-rw-rw-r--   1 ldm      data         446 Nov 21 20:03 1999112119.stats
-rw-rw-r--   1 ldm      data         446 Nov 21 21:04 1999112120.stats
-rw-rw-r--   1 ldm      data         446 Nov 21 22:09 1999112121.stats
-rw-rw-r--   1 ldm      data         446 Nov 21 23:03 1999112122.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 00:03 1999112123.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 01:03 1999112200.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 02:04 1999112201.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 03:05 1999112202.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 04:08 1999112203.stats
-rw-rw-r--   1 ldm      data         110 Nov 22 04:45 1999112204.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 06:27 1999112205.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 07:03 1999112206.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 08:22 1999112207.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 09:03 1999112208.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 10:03 1999112209.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 11:03 1999112210.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 12:03 1999112211.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 13:03 1999112212.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 14:03 1999112213.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 15:59 1999112214.stats
-rw-rw-r--   1 ldm      data         446 Nov 22 16:28 1999112215.stats
-rw-rw-r--   1 root     other        449 Nov 23 21:59 1999112320.stats
-rw-rw-r--   1 root     other        559 Nov 23 22:49 1999112321.stats
-rw-rw-r--   1 root     other        559 Nov 23 23:07 1999112322.stats
-rw-rw-r--   1 ldm      data         301 Apr 23  1999 f.log
-rw-r--r--   1 ldm      data        1146 Nov 23 22:35 ldmbinstats.upc
-rw-rw-r--   1 root     other    1587076 Nov 23 23:07 ldmd.log
-rw-rw-r--   1 root     other      92333 Nov 23 21:59 ldmd.log.1
-rw-r--r--   1 ldm      data        3938 Nov 23 21:49 ldmd.log.2
-rw-r--r--   1 ldm      data       87387 Nov 22 16:27 ldmd.log.3
-rw-r--r--   1 ldm      data      142790 Nov 21 23:58 ldmd.log.4
-rw-r--r--   1 ldm      data           0 Mar 29  1999 ldmfail
-rw-rw-r--   1 ldm      data        3591 Apr 23  1999 netcheck.log

I corrected this by becoming 'root' and changing the ownership of all
files owned by 'root' in the ~ldm directory tree.  This included
~ldm/data/ldmd.pq, the LDM product queue:

waldo# chown ldm *
waldo# chgrp data *
waldo# cd ~ldm/logs
waldo# chown ldm *
waldo# chgrp data *
waldo# cd ~ldm/data
waldo# chown ldm *
waldo# chgrp data *
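
The commands above only touch the directories I visited.  To check
the whole tree in one pass, something like the following sketch could
be used.  It relies on the standard 'find' primary '-user'; the
function name list_foreign_files is made up for illustration, and the
example call is left commented out.

```shell
#!/bin/sh
# Hypothetical helper: list everything under a directory that is NOT
# owned by the given user (a user name or a numeric user ID).
list_foreign_files() {
    dir="$1"
    owner="$2"
    find "$dir" ! -user "$owner" -print
}

# Usage, commented out so the sketch has no side effects:
# list_foreign_files ~ldm ldm
```

An empty listing means nothing under the directory is owned by
another user.  Anything that does show up still has to be fixed with
chown and chgrp as 'root'.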

Next, I tried starting the LDM as 'ldm', but I couldn't since the
hidden LDM lock file in /tmp was still owned by 'root'.  So, I became
root again and stopped the LDM:

su -
<password>
exec csh
setenv PATH ~ldm/bin:$PATH
ldmadmin stop
exit

After this I was back to being the user 'ldm'.  For good measure, I did
an 'ldmadmin stop' and then started the LDM:

ldmadmin stop
ldmadmin start
ldmadmin tail
/usr/local/ldm/logs% ldmadmin tail
Nov 23 23:17:04 waldo chinook[17562]: run_requester: 19991123222238.441 TS_ENDT {{FSL2|MCIDAS|IDS|DDPLUS,  ".*"}}
Nov 23 23:17:04 waldo chinook[17562]: FEEDME(chinook.unl.edu): OK
Nov 23 23:17:05 waldo udp.ldmd[17566]: Starting Up
Nov 23 23:17:06 waldo localhost[17590]: Connection from localhost
Nov 23 23:17:06 waldo localhost[17590]: Connection reset by peer
Nov 23 23:17:06 waldo localhost[17590]: Exiting
Nov 23 23:17:45 waldo proftomd[17596]: Starting up
Nov 23 23:17:46 waldo proftomd[17596]: Making /var/data/mcidas/MDXX0097; may take some time...
Nov 23 23:17:49 waldo proftomd[17596]: Decoding 1999327.2212 data into /var/data/mcidas/MDXX0097
Nov 23 23:17:49 waldo proftomd[17596]: Exiting
Nov 23 23:21:00 waldo lwtoa3[17606]: PRODUCT CODE=UX          99327       223019
Nov 23 23:21:00 waldo lwtoa3[17606]:  Done -- AREA= 109
Nov 23 23:21:06 waldo pqact[17558]: pbuf_flush (6) write: Broken pipe
Nov 23 23:21:06 waldo pqact[17558]: pbuf_flush 6: time elapsed   5.351715
Nov 23 23:21:06 waldo pqact[17558]: pipe_dbufput: -closelwtoa3-d/var/data/mcidas write error
Nov 23 23:21:06 waldo pqact[17558]: pipe_prodput: trying again
Nov 23 23:21:06 waldo lwtoa3[17622]: PRODUCT CODE=UX          99327       223019
Nov 23 23:21:06 waldo lwtoa3[17622]:  Done -- AREA= 100
Nov 23 23:21:10 waldo pqact[17558]: pbuf_flush (6) write: Broken pipe
Nov 23 23:21:10 waldo pqact[17558]: pbuf_flush 6: time elapsed   4.002119
Nov 23 23:21:10 waldo pqact[17558]: pipe_dbufput: -closelwtoa3-d/var/data/mcidas write error
Nov 23 23:22:04 waldo pqexpire[17555]: > Recycled  27588.838 kb/hr (  3567.163 prods per hour)
Nov 23 23:22:39 waldo lwtoa3[17642]: PRODUCT CODE=UA          99327       223134
Nov 23 23:22:41 waldo lwtoa3[17642]:  Done -- AREA= 167
Nov 23 23:27:04 waldo pqexpire[17555]: > Recycled  19132.159 kb/hr (  2987.324 prods per hour)
 
The 'pbuf_flush (6) write: Broken pipe' error seemed to be telling me
that an lwtoa3 process was failing to write to an AREA file in
/var/data/mcidas, but I looked there and all of the files are owned by
'ldm'.  The success of AREA0167 further told me that things seemed to
be working correctly, so I decided to let things run and see what
happens.

Please let me know if you see problems with ldm-mcidas or XCD data
decoding.

Tom

>From address@hidden  Wed Nov 24 13:05:53 1999

Hi Tom

Just a short note to acknowledge your fix on waldo.  I was 
not aware of the problems created by having root perform any
system maintenance on the ldm.  

Our system seems to be working fine again.  

Thanks 

alan