
Re: [Fwd: 20000918: Linux rpc.ldmd problem]



Since this is obviously an issue in the kernel, the first thing I would do
is update it.  Red Hat 6.2 has a "patch" which upgrades the kernel to 2.2.16
(be sure to read all the directions, and update your initrd and LILO if
you're using it).  It looks like you are using the default kernel, but if
you need to compile it for some reason, you might as well download and
compile 2.2.17.  You may also consider looking for the latest versions of
drivers (like the one for your RAID card).  Those would be my first steps,
and if they didn't solve the problem, I'd get more creative.
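
Before and after the upgrade it may help to confirm which kernel each
machine is actually running.  The following is only a rough sketch of such
a check, not part of any Red Hat or LDM tooling: the function names
parse_kernel_version and is_older_than are made up for illustration, and
the (2, 2, 16) threshold is simply the version mentioned above.

```python
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor, patch) from a string such as '2.2.14-6.1.1smp'.

    Hypothetical helper: only the leading three numeric fields are used;
    vendor suffixes like '-6.1.1smp' are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_older_than(release, minimum=(2, 2, 16)):
    """True if the release string is older than the minimum version tuple."""
    return parse_kernel_version(release) < minimum


if __name__ == "__main__":
    # platform.release() reports the running kernel, like `uname -r`.
    running = platform.release()
    try:
        verdict = "older than 2.2.16" if is_older_than(running) else "2.2.16 or newer"
    except ValueError:
        verdict = "not in a form this sketch understands"
    print("running kernel %s is %s" % (running, verdict))
```

Running it on each machine, once before and once after the update, gives a
quick record of what was actually booted.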

-sandy


Robb Kambic wrote:

> Randy, et al
>
> I'm not an expert with Linux, so I'm cc'ing our sys admin, who knows much
> more about Linux.  Maybe he can shed some light on the problem, which
> appears to be in the management of the LDM queue (a memory-mapped file).
> There are some other IDD sites that use Linux; maybe they can also add
> some input.  My gut opinion is that either your system configuration for
> memory-mapped files or the LDM's interaction with the queue needs to be
> changed.  So maybe you could do some research in that direction.  Is it
> possible to run the LDM on only one of the four processors?
>
> Robb...
>
> On Fri, 22 Sep 2000, Randy Weatherly wrote:
>
> > Robb,
> >
> > Thanks for your response to Jason's email.  We are still having problems
> > and I would like to run a few more things past you.
> >
> > We have two Redhat Linux 6.2 SMP machines.  Both running 2.2.14-6.1.1smp
> > (if that matters).  Both machines are 4 processor Dell servers, Pentium
> > III chips and both have hardware based raid.  Both exhibit the same
> > behaviour although one of them is more problematic.
> >
> > I've been spending most of my time on ted.  It is the more problematic
> > of the two.  maul runs, but will die occasionally.  Here's what I've
> > found.
> >
> > I've tried both binaries and source, 5.0.8 and 5.1.2.  All behave the
> > same.  So I don't think it is a problem with LDM, but rather something
> > between ldm and this version of Linux on this type of machine.
> >
> > I am able to get ldm to stop and start pretty well if I stop it, delete
> > the queue, make the queue, and start it.  But if I stop it, then try and
> > start (after making sure no processes are left running), then I get the
> > kernel errors in the system log file:
> >
> >  Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer
> > dereference at virtual address 00000008
> >  Sep 18 21:47:32 ted kernel: current->tss.cr3 = 346dc000, %cr3 = 346dc000
> >
> >  Sep 18 21:47:32 ted kernel: *pde = 00000000
> >  Sep 18 21:47:32 ted kernel: Oops: 0000
> >  Sep 18 21:47:32 ted kernel: CPU:    0
> >  Sep 18 21:47:32 ted kernel: EIP:    0010:[locks_remove_flock+14/148]
> > Sep 18 21:47:32 ted kernel: EFLAGS: 00010296
> >  Sep 18 21:47:32 ted kernel: eax: 00000000   ebx: eeee0ca0   ecx:
> >  eeee0ca0   edx: 00000000
> >  Sep 18 21:47:32 ted kernel: esi: 00000006   edi: f8e72cbc   ebp:
> >  bffff664   esp: f0851f10
> >  Sep 18 21:47:32 ted kernel: ds: 0018   es: 0018   ss: 0018
> >  Sep 18 21:47:32 ted kernel: Process rpc.ldmd (pid: 8209, process nr:
> >  143, stackpage=f0851000)
> >  Sep 18 21:47:32 ted kernel: Stack: f8e72cbc bffff664 00001020 00000000
> >  bffff674 c0143cc0 eeee0ca0 f8797ed0
> >  Sep 18 21:47:32 ted kernel:        c0129a0a 00000286 00000002 f98f1da0
> >  40016000 f9301ba0 f98f1ddc eeee0ca0
> >  Sep 18 21:47:32 ted kernel:        c011f5e0 fbf92680 f98f1da0 f98f1da0
> >  00000286 c012ae29 eeee0ca0 ffffffea
> >  Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>]
> >  [ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>]
> >  [unmap_fixup+116/348]
> >  Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72]
> >  [sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>]
> >  [sys_munmap+61/100] [system_call+52/56]
> >  Sep 18 21:47:32 ted kernel:        [<00000001>] [<00000006>]
> >  [<00001020>] [<00000037>] [<0000002b>] [<0000002b>] [<00000037>]
> >  [<00000023>]
> >  Sep 18 21:47:32 ted kernel:        [<00000296>] [<0000002b>]
> >  Sep 18 21:47:32 ted kernel: Code: 8b 40 08 89 44 24 14 83 c0 74 89 44 24
> >  10 8b 4c 24 14 8b 6c
> >
> > Pretty ugly stuff.  Since you mentioned that you thought this was a
> > corrupt queue, and since if I delete and remake the queue it works ok,
> > that seems like the right track.  On our other machine, maul, I see those
> > error messages in the system log file as well.  But it doesn't fail very
> > often.  I think in the last month, it has failed only a couple of times.
> >
> > Any ideas?  I'm sure this isn't your normal everyday stuff, but I thought
> > if maybe you knew of others that had problems with SMP machines or the
> > SMP version of the kernel it might help.
> >
> > Thanks in advance
> >
> > Randy Weatherly
> >
>
> ===============================================================================
> Robb Kambic                                Unidata Program Center
> Software Engineer III                      Univ. Corp for Atmospheric Research
> address@hidden                   WWW: http://www.unidata.ucar.edu/
> ===============================================================================
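
The oops quoted above is hard to read because the resolved function names
are mixed in with raw addresses.  As a rough aid, the sketch below pulls
out only the named symbols.  It assumes the bracketed name+offset/size form
seen in this log, with offset and size treated as decimal; the sample text
is copied from the message with the quote markers removed, and the function
names are made up for illustration.

```python
import re

# Sample lines copied from the oops quoted in this thread (quote markers removed).
OOPS = """\
Sep 18 21:47:32 ted kernel: EIP:    0010:[locks_remove_flock+14/148]
Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>]
[ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>]
[unmap_fixup+116/348]
Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72]
[sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>]
[sys_munmap+61/100] [system_call+52/56]
"""

# Matches entries such as [fput+17/72]; raw entries like [<00000286>] are skipped.
SYMBOL = re.compile(r"\[([A-Za-z_]\w*)\+(\d+)/(\d+)\]")

# Matches the EIP line, e.g. "EIP:    0010:[locks_remove_flock+14/148]".
EIP = re.compile(r"EIP:\s+[0-9a-fA-F]{4}:\[([A-Za-z_]\w*)\+\d+/\d+\]")


def resolved_symbols(text):
    """Return (name, offset, size) for every resolved symbol, in log order."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in SYMBOL.finditer(text)]


def faulting_function(text):
    """Return the function named on the EIP line, or None if there is none."""
    match = EIP.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print("faulting function:", faulting_function(OOPS))
    for name, offset, size in resolved_symbols(OOPS):
        print("  %s (+%d of %d bytes)" % (name, offset, size))
```

For this log the EIP line names locks_remove_flock, and the resolved entries
in the trace include sys_munmap, sys_fcntl, fput and ext2_release_file.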