[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: 20000918: Linux rpc.ldmd problem



On Mon, 18 Sep 2000, Unidata Support wrote:

> 
> ------- Forwarded Message
> 
> >To: address@hidden
> >From: Jason Burks <address@hidden>
> >Subject: Linux rpc.ldmd problem
> >Organization: NOAA/NWS/WR
> >Keywords: 200009182301.e8IN1Lb21079
> 
> Hello,
> I have read through the  emails in the support archives about having
> problems with rpc.ldmd not going away after a ldmadmin stop on linux
> machines and was wondering if there was a resolution that I missed.  The
> problem I am having is described below:
> This machine is fresh and I just loaded Redhat 6.2 on it. We loaded ldm
> 5.0.8 and started it working. Ldm ran for a while and then it jsut
> stopped receiving products, so we did an ldmadmin stop. This worked
> except it left a process running. I tried to start LDM back up but
> because that process was still there LDM would not start. I tried
> killing the process with -9 and nothing would kill it. So I had to
> reboot the machine. Again a day later the machine had a similar problem.
> I know the other support emails people said they went away after 8
> minutes. I just let one run for a while and after 2 hours did not go
> away. I thought maybe because the machine was a busy machine this was
> the problem. I tried this on another Redhat 6.2 machine and I got it to
> fail by starting LDM and then stopping ldm a couple of minutes later.
> When I stopped it it left the process hanging around.  This machine
> basically does nothing so it is not very busy. The process left over is
> owned by init so the parent process has died and the child did not die
> with it. I upgraded to LDM 5.1.2 to see if I could overcome this problem
> through upgrading and the same problems exist. Is there something I am
> missing. Was there a solution to this problem? I have log files and
> messages log files if you want them. I will also add when we were at
> Redhat 6.0 we did not see these same problems, however we cannot go back
> to Redhat 6.0 because of hardware support problems with our RAID. Let me
> know what you think.
> Thanks in advance.
> Jason
> 
> Here are some snippets fo log files: Bad things started happening around
> 21:47
> 
> Messages:
> Sep 18 21:36:01 ted pqbinstats[4332]: Starting Up (4331)
> Sep 18 21:36:01 ted pqact[4333]: Starting Up
> Sep 18 21:36:01 ted pqbinstats[4332]: Exiting

Jason, 

It appears to a couple things wrong.

I believe that pqbinstats doesn't have write permission in the log file,
causing it to exit.



> Sep 18 21:36:01 ted galileo[4335]: run_requester: 20000918212000.686
> TS_ENDT {{EXP,  ".*(raws.all)"}}
> Sep 18 21:36:01 ted voyager[4336]: run_requester: 20000918211927.335
> TS_ENDT {{FSL5,  ".*(mnet.ldad)"}}
> Sep 18 21:36:01 ted ls1-vhw[4334]: FEEDME(ls1-vhw.wrh.noaa.gov): OK
> Sep 18 21:36:01 ted galileo[4335]: FEEDME(galileo.wrh.noaa.gov.): OK
> Sep 18 21:36:01 ted voyager[4336]: FEEDME(voyager.wrh.noaa.gov.): OK
> Sep 18 21:36:03 ted rpc.ldmd[4331]: child 4332 exited with status 4
> Sep 18 21:36:03 ted ted[4720]: Connection from ted.wrh.noaa.gov
> Sep 18 21:36:03 ted ted[4720]: Connection reset by peer
> Sep 18 21:36:03 ted ted[4720]: Exiting
> Sep 18 21:36:31 ted syslogd 1.3-3: restart.

I would check to make sure that the syslog.conf file is set up correctly.
I'm seeing messages from non-ldm processes.  This shouldn't cause a crash
but it makes the log files confusing to read.  I believe the setup is in
the preinstall instructions.


> Sep 18 21:47:16 ted pqact[4333]: child 7739 exited with status 1
> Sep 18 21:47:19 ted rpc.ldmd[4331]: Exiting
> Sep 18 21:47:19 ted rpc.ldmd[4331]: Terminating process group
> Sep 18 21:47:19 ted galileo[4335]: Exiting
> Sep 18 21:47:19 ted pqact[4333]: child 7765 terminated by signal 15
> Sep 18 21:47:19 ted pqact[4333]: child 7764 terminated by signal 15
> Sep 18 21:47:19 ted pqact[4333]: child 7763 terminated by signal 15

Looks like pqact doesn't have permission to write so the entries are
failing or it could be an pqact entry that is syntacally correct but is
not well formed, ie there are more \n then sets of parens in the entry. I
would maybe use a scaled down pqact.conf at first.  


> Sep 18 21:47:19 ted voyager[4336]: Exiting
> Sep 18 21:47:19 ted pqact[4333]: child 7742 terminated by signal 15
> Sep 18 21:47:19 ted pqact[4333]: child 7734 terminated by signal 15
> Sep 18 21:47:19 ted pqact[4333]: child 7741 terminated by signal 15
> Sep 18 21:47:19 ted kernel: VFS: Close: file count is 0
> Sep 18 21:47:19 ted pqact[4333]: child 7733 terminated by signal 15
> Sep 18 21:47:19 ted ls1-vhw[4334]: Exiting
> Sep 18 21:47:19 ted pqact[4333]: child 7731 terminated by signal 15
> Sep 18 21:47:19 ted pqact[4333]: Exiting
> Sep 18 21:47:19 ted kernel: VFS: Close: file count is 0
> Sep 18 21:47:32 ted syslogd 1.3-3: restart.
> Sep 18 21:47:32 ted rpc.ldmd[8205]: Starting Up (built: Aug 31 2000
> 11:48:33)
> Sep 18 21:47:32 ted ls1-vhw[8208]: run_requester: Starting Up:
> ls1-vhw.wrh.noaa.gov
> Sep 18 21:47:32 ted galileo[8209]: run_requester: Starting Up:
> galileo.wrh.noaa.gov.
> Sep 18 21:47:32 ted ls1-vhw[8208]: run_requester: 20000918214629.182
> TS_ENDT {{AFOS,  ".*"}}
> Sep 18 21:47:32 ted voyager[8210]: run_requester: Starting Up:
> voyager.wrh.noaa.gov.
> Sep 18 21:47:32 ted pqbinstats[8206]: Starting Up (8205)
> Sep 18 21:47:32 ted pqact[8207]: Starting Up
> Sep 18 21:47:32 ted pqbinstats[8206]: Exiting
> Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer
> dereference at virtual address 00000008
> Sep 18 21:47:32 ted kernel: current->tss.cr3 = 346dc000, %cr3 = 346dc000
> 
> Sep 18 21:47:32 ted kernel: *pde = 00000000
> Sep 18 21:47:32 ted kernel: Oops: 0000
> Sep 18 21:47:32 ted kernel: CPU:    0
> Sep 18 21:47:32 ted kernel: EIP:    0010:[locks_remove_flock+14/148]
> Sep 18 21:47:32 ted kernel: EFLAGS: 00010296
> Sep 18 21:47:32 ted kernel: eax: 00000000   ebx: eeee0ca0   ecx:
> eeee0ca0   edx: 00000000
> Sep 18 21:47:32 ted kernel: esi: 00000006   edi: f8e72cbc   ebp:
> bffff664   esp: f0851f10
> Sep 18 21:47:32 ted kernel: ds: 0018   es: 0018   ss: 0018
> Sep 18 21:47:32 ted kernel: Process rpc.ldmd (pid: 8209, process nr:
> 143, stackpage=f0851000)
> Sep 18 21:47:32 ted kernel: Stack: f8e72cbc bffff664 00001020 00000000
> bffff674 c0143cc0 eeee0ca0 f8797ed0
> Sep 18 21:47:32 ted kernel:        c0129a0a 00000286 00000002 f98f1da0
> 40016000 f9301ba0 f98f1ddc eeee0ca0
> Sep 18 21:47:32 ted kernel:        c011f5e0 fbf92680 f98f1da0 f98f1da0
> 00000286 c012ae29 eeee0ca0 ffffffea
> Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>]
> [ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>]
> [unmap_fixup+116/348]
> Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72]
> [sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>]
> [sys_munmap+61/100] [system_call+52/56]
> Sep 18 21:47:32 ted kernel:        [<00000001>] [<00000006>]
> [<00001020>] [<00000037>] [<0000002b>] [<0000002b>] [<00000037>]
> [<00000023>]
> Sep 18 21:47:32 ted kernel:        [<00000296>] [<0000002b>]
> Sep 18 21:47:32 ted kernel: Code: 8b 40 08 89 44 24 14 83 c0 74 89 44 24
> 10 8b 4c 24 14 8b 6c
> Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer
> dereference at virtual address 00000008
> Sep 18 21:47:32 ted kernel: current->tss.cr3 = 00101000, %cr3 = 00101000
> 
> Sep 18 21:47:32 ted kernel: *pde = 00000000
> Sep 18 21:47:32 ted kernel: Oops: 0000
> Sep 18 21:47:32 ted kernel: CPU:    0
> Sep 18 21:47:32 ted kernel: EIP:    0010:[filp_close+56/92]
> Sep 18 21:47:32 ted kernel: EFLAGS: 00010246
> Sep 18 21:47:33 ted kernel: eax: 00000000   ebx: eeee0ca0   ecx:
> f8e72ba0   edx: eeee0ca0
> Sep 18 21:47:33 ted kernel: esi: 00000000   edi: 00000000   ebp:
> 00000001   esp: f0851e6c
> Sep 18 21:47:33 ted kernel: ds: 0018   es: 0018   ss: 0018
> Sep 18 21:47:33 ted kernel: Process rpc.ldmd (pid: 8209, process nr:
> 143, stackpage=f0851000)
> Sep 18 21:47:33 ted kernel: Stack: f8e72ba0 c0119d55 eeee0ca0 f8e72ba0
> f0851ed4 f0850000 00000008 08048000
> Sep 18 21:47:33 ted kernel:        f8e72ba0 f0850000 c0109753 0000000b
> 00000000 c01109e4 c01ee3ce f0851ed4
> Sep 18 21:47:33 ted kernel:        00000000 f0850000 00000006 f8e72cbc
> bffff664 00000000 f937a358 c010936d
> Sep 18 21:47:33 ted kernel: Call Trace: [do_exit+361/744] [<00000008>]
> [die+83/84] [<0000000b>] [<00000000>] [do_page_fault+708/908]
> [lk_lockmsg+6254/6528]
> Sep 18 21:47:33 ted kernel:        [<00000000>] [<00000006>]
> [<00000000>] [error_code+45/52] [<00000000>] [<00000000>] [<00000006>]
> [<00000000>]
> Sep 18 21:47:33 ted kernel:        [sys_uselib+160/200] [<00000018>]
> [locks_remove_flock+14/148] [<00000010>] [<00010296>] [<00000006>]
> [<00001020>] [<00000000>]
> Sep 18 21:47:33 ted kernel:        [ext2_release_file+20/28]
> [__fput+62/72] [<00000286>] [<00000002>] [unmap_fixup+116/348]
> [<00000286>] [fput+17/72] [sys_fcntl+1031/1064]
> Sep 18 21:47:33 ted kernel:        [<00001020>] [<00001020>]
> [<00002000>] [sys_munmap+61/100] [system_call+52/56] [<00000001>]
> [<00000006>] [<00001020>]
> Sep 18 21:47:33 ted kernel:        [<00000037>] [<0000002b>]
> [<0000002b>] [<00000037>] [<00000023>] [<00000296>] [<0000002b>]
> Sep 18 21:47:33 ted kernel: Code: 83 7f 08 00 74 0e 8b 44 24 14 50 53 e8
> 6b c0 00 00 83 c4 08
> Sep 18 21:47:32 ted ls1-vhw[8208]: FEEDME(ls1-vhw.wrh.noaa.gov): OK
> Sep 18 21:47:33 ted ls1-vhw[8208]: fcntl F_WRLCK failed for rgn (0
> SEEK_SET, 4096) 22: Invalid argument

This is major error, looks like your LDM queue is corrupt.  Rebuild the
LDM queue.  For linux systems it helps to have enough physical ram to cove
the size of the LDM queue.  You might want to decrease the size of the
queue as a test.

Here's how I would debug the problem. I would comment out all the entries
in the etc/ldmd.conf file except the allow ones. I would use the trial and
error method by starting/stopping the LDM, checking if all the processes
stop.  I would also check the logs/ldmd.log file for any error messages.
Then I would uncomment another entry in the ldmd.conf file.  You could
also run the ldm from the command line with  the messages coming to the
console, ie

% rpc.ldmd -xvl - etc/ldmd.conf

from the LDM home directory.


Robb...


> Sep 18 21:47:33 ted ls1-vhw[8208]: hereis: pq_insert failed: Invalid
> argument: d729e28e3e89b85b47bc53b8c3914239     1279
> 20000918214629.182    AFOS 000  SLCPRCWV
> Sep 18 21:47:33 ted ls1-vhw[8208]: Connection reset by peer
> Sep 18 21:47:33 ted ls1-vhw[8208]: Disconnect
> Sep 18 21:47:34 ted rpc.ldmd[8205]: child 8209 terminated by signal 11
> Sep 18 21:47:34 ted rpc.ldmd[8205]: Killing (SIGINT) process group
> Sep 18 21:47:34 ted rpc.ldmd[8205]: Interrupt
> Sep 18 21:47:34 ted pqact[8207]: Interrupt
> Sep 18 21:47:34 ted ls1-vhw[8208]: Interrupt
> Sep 18 21:47:34 ted rpc.ldmd[8205]: Exiting
> Sep 18 21:47:34 ted pqact[8207]: Exiting
> Sep 18 21:47:34 ted ted[8217]: Connection from ted.wrh.noaa.gov
> Sep 18 21:47:34 ted ls1-vhw[8208]: Exiting
> Sep 18 21:47:34 ted ted[8217]: Interrupt
> Sep 18 21:47:34 ted ted[8217]: Exiting
> Sep 18 21:47:34 ted kernel: VFS: Close: file count is 0
> Sep 18 21:47:34 ted kernel: VFS: Close: file count is 0
> 
> 
> ldmd.log from a later time
> Sep 18 22:03:33 ted rpc.ldmd[8868]: Starting Up (built: Aug 31 2000
> 11:48:33)
> Sep 18 22:03:33 ted ls1-vhw[8871]: run_requester: Starting Up:
> ls1-vhw.wrh.noaa.gov
> Sep 18 22:03:33 ted ls1-vhw[8871]: run_requester: 20000918214629.182
> TS_ENDT {{AFOS,  ".*"}}
> Sep 18 22:03:33 ted galileo[8872]: run_requester: Starting Up:
> galileo.wrh.noaa.gov.
> Sep 18 22:03:33 ted voyager[8873]: run_requester: Starting Up:
> voyager.wrh.noaa.gov.
> Sep 18 22:03:33 ted pqbinstats[8869]: Starting Up (8868)
> Sep 18 22:03:33 ted pqact[8870]: Starting Up
> Sep 18 22:03:33 ted ls1-vhw[8871]: FEEDME(ls1-vhw.wrh.noaa.gov): OK
> Sep 18 22:03:33 ted ls1-vhw[8871]: fcntl F_WRLCK failed for rgn (0
> SEEK_SET, 4096) 22: Invalid argument
> Sep 18 22:03:33 ted ls1-vhw[8871]: hereis: pq_insert failed: Invalid
> argument: d729e28e3e89b85b47bc53b8c3914239     1279
> 20000918214629.182    AFOS 000  SLCPRCWV
> Sep 18 22:03:33 ted ls1-vhw[8871]: Connection reset by peer
> Sep 18 22:03:33 ted ls1-vhw[8871]: Disconnect
> Sep 18 22:03:35 ted rpc.ldmd[8868]: child 8873 terminated by signal 11
> Sep 18 22:03:35 ted rpc.ldmd[8868]: Killing (SIGINT) process group
> Sep 18 22:03:35 ted rpc.ldmd[8868]: Interrupt
> Sep 18 22:03:35 ted pqbinstats[8869]: Interrupt
> Sep 18 22:03:35 ted pqact[8870]: Interrupt
> Sep 18 22:03:35 ted pqbinstats[8869]: Exiting
> Sep 18 22:03:35 ted ls1-vhw[8871]: Interrupt
> Sep 18 22:03:35 ted pqact[8870]: Exiting
> Sep 18 22:03:35 ted ted[8880]: Connection from ted.wrh.noaa.gov
> Sep 18 22:03:35 ted ls1-vhw[8871]: Exiting
> Sep 18 22:03:35 ted rpc.ldmd[8868]: Exiting
> Sep 18 22:03:35 ted ted[8880]: Interrupt
> Sep 18 22:03:35 ted rpc.ldmd[8868]: Terminating process group
> Sep 18 22:03:35 ted rpc.ldmd[8868]: child 8880 terminated by signal 11
> Sep 18 22:03:36 ted rpc.ldmd[8868]: Killing (SIGINT) process group
> 
> 
> ------- End of Forwarded Message
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================