
Re: [Fwd: 20000918: Linux rpc.ldmd problem]



Randy, et al

I'm not a Linux expert, so I'm cc'ing our sys admin, who knows much more
about it.  Maybe he can shed some light on the problem, which appears to
be in the management of the LDM queue (a memory-mapped file). Some other
IDD sites use Linux; maybe they can add some input too. My gut feeling
is that either your system's configuration for memory-mapped files needs
to be changed, or the LDM's interaction with the queue does. So maybe
you could do some research in that direction. Is it possible to run the
LDM on only one of the four processors?
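
As a starting point for that research, here is a minimal sketch of the
general pattern involved: a process takes an fcntl() lock on a file, maps
it into memory, writes to it, then unmaps and closes it. This is not the
LDM's actual code. The function name, file name, and queue size are made
up for illustration, and it is written in Python only to keep it short.
The same system calls (fcntl, munmap, close) show up in the call trace
quoted below, so a small test like this, run in a loop, might help show
whether the kernel misbehaves without the LDM in the picture.

```python
import fcntl
import mmap
import os
import tempfile

QUEUE_SIZE = 4096  # hypothetical size; a real product queue is far larger


def touch_queue(path, payload=b"product"):
    """Lock, map, write to, unmap, and unlock a file-backed region."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, QUEUE_SIZE)        # size the file before mapping it
        fcntl.lockf(fd, fcntl.LOCK_EX)      # exclusive fcntl() lock, whole file
        region = mmap.mmap(fd, QUEUE_SIZE)  # shared read/write mapping
        try:
            region[:len(payload)] = payload
            region.flush()                  # push the change to the file
        finally:
            region.close()                  # unmap the region
        fcntl.lockf(fd, fcntl.LOCK_UN)      # release the lock
    finally:
        os.close(fd)                        # closing also drops leftover locks
    with open(path, "rb") as f:
        return f.read(len(payload))


if __name__ == "__main__":
    # Assumption: a throwaway file stands in for the real queue.
    with tempfile.TemporaryDirectory() as d:
        print(touch_queue(os.path.join(d, "test.pq")))
```

If a loop around touch_queue() against an existing file triggers the
same oops, that would point at the kernel rather than at the LDM.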

Robb...


On Fri, 22 Sep 2000, Randy Weatherly wrote:

> Robb,
> 
> Thanks for your response to Jason's email.  We are still having problems and I
> would like to run a few more things past you.
> 
> We have two Redhat Linux 6.2 SMP machines.  Both running 2.2.14-6.1.1smp
> (if that matters).  Both machines are 4 processor Dell servers, Pentium
> III chips and both have hardware based raid.  Both exhibit the same
> behaviour although one of them is more problematic.
> 
> I've been spending most of my time on ted.  It is the more problematic of
> the two.  maul runs, but will die occasionally.  Here's what I've found.
> 
> I've tried both binaries and source, 5.0.8 and 5.1.2.  All behave the
> same.  So I don't think it is a problem with LDM, but rather something
> between ldm and this version of Linux on this type of machine.
> 
> I am able to get ldm to stop and start pretty well if I stop it, delete
> the queue, make the queue, and start it.  But if I stop it, then try and
> start (after making sure no processes are left running), then I get the
> kernel errors in the system log file:
> 
>  Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
>  Sep 18 21:47:32 ted kernel: current->tss.cr3 = 346dc000, %cr3 = 346dc000
>  Sep 18 21:47:32 ted kernel: *pde = 00000000
>  Sep 18 21:47:32 ted kernel: Oops: 0000
>  Sep 18 21:47:32 ted kernel: CPU:    0
>  Sep 18 21:47:32 ted kernel: EIP:    0010:[locks_remove_flock+14/148]
>  Sep 18 21:47:32 ted kernel: EFLAGS: 00010296
>  Sep 18 21:47:32 ted kernel: eax: 00000000   ebx: eeee0ca0   ecx: eeee0ca0   edx: 00000000
>  Sep 18 21:47:32 ted kernel: esi: 00000006   edi: f8e72cbc   ebp: bffff664   esp: f0851f10
>  Sep 18 21:47:32 ted kernel: ds: 0018   es: 0018   ss: 0018
>  Sep 18 21:47:32 ted kernel: Process rpc.ldmd (pid: 8209, process nr: 143, stackpage=f0851000)
>  Sep 18 21:47:32 ted kernel: Stack: f8e72cbc bffff664 00001020 00000000 bffff674 c0143cc0 eeee0ca0 f8797ed0
>  Sep 18 21:47:32 ted kernel:        c0129a0a 00000286 00000002 f98f1da0 40016000 f9301ba0 f98f1ddc eeee0ca0
>  Sep 18 21:47:32 ted kernel:        c011f5e0 fbf92680 f98f1da0 f98f1da0 00000286 c012ae29 eeee0ca0 ffffffea
>  Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>] [ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>] [unmap_fixup+116/348]
>  Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72] [sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>] [sys_munmap+61/100] [system_call+52/56]
>  Sep 18 21:47:32 ted kernel:        [<00000001>] [<00000006>] [<00001020>] [<00000037>] [<0000002b>] [<0000002b>] [<00000037>] [<00000023>]
>  Sep 18 21:47:32 ted kernel:        [<00000296>] [<0000002b>]
>  Sep 18 21:47:32 ted kernel: Code: 8b 40 08 89 44 24 14 83 c0 74 89 44 24 10 8b 4c 24 14 8b 6c
> 
> Pretty ugly stuff.  Since you mentioned that you thought this was a
> corrupt queue, and since if I delete and remake the queue it works ok,
> that seems like the right track.  On our other machine, maul, I see those
> error messages in the system log file as well.  But it doesn't fail very
> often.  I think in the last month, it has failed only a couple of times.
> 
> Any ideas?  I'm sure this isn't your normal everyday stuff, but I thought
> if maybe you knew of others that had problems with SMP machines or the
> SMP version of the kernel it might help.
> 
> Thanks in advance
> 
> Randy Weatherly
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================