
Re: [Fwd: 20000918: Linux rpc.ldmd problem] (fwd)




===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Wed, 27 Sep 2000 08:31:39 -0600
From: Randy Weatherly <address@hidden>
To: Sandy Whitesel <address@hidden>
     support-ldm <address@hidden>
Subject: Re: [Fwd: 20000918: Linux rpc.ldmd problem]

Sandy,
Thanks for the response.

I did upgrade to 2.2.16-4, and I also made sure I had the latest drivers and
even firmware for the hardware.  I think it is strictly a kernel issue, since I
have a second Dell quad-processor machine that has different RAID hardware and
it exhibits the same behaviour.  I just replied to Robb's message about
dropping back to an old kernel and getting it to work.
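
One way to probe the kernel theory without LDM in the loop is a small standalone program that repeats the kind of sequence the oops trace quoted below points at: a lock on a memory-mapped file followed by an unmap (the trace lists sys_fcntl, sys_munmap, fput and locks_remove_flock). The following is only a minimal sketch under that assumption. It is not LDM's code, and the function name `exercise_mapped_file`, the file name `test.pq` and the 4096-byte size are made up for illustration.

```python
import fcntl
import mmap
import os
import tempfile


def exercise_mapped_file(path, size=4096):
    """Lock, map, write, unlock and unmap a file; return the bytes written.

    Hypothetical stand-in for one start/stop cycle against a queue file.
    """
    # Open (or create) the file and make sure it is large enough to map.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)
        # Take an exclusive fcntl-style lock on the whole file.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        # Shared read/write mapping (the default on Unix).
        mm = mmap.mmap(fd, size)
        mm[0:4] = b"test"
        mm.flush()
        written = bytes(mm[0:4])
        fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        # Close our descriptor; an existing mapping still references the file.
        os.close(fd)
    # Tearing down the mapping is the step that corresponds to sys_munmap.
    mm.close()
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        queue = os.path.join(tmp, "test.pq")  # hypothetical file name
        # Two passes over the same file mimic a stop followed by a start
        # without deleting and remaking the file in between.
        for attempt in range(2):
            print(attempt, exercise_mapped_file(queue))
```

If a loop like this also produced the oops on the affected kernels, that would point away from LDM itself; if it ran cleanly, the trigger is presumably something more specific to how the queue is used.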

I thought about progressing up through the kernels until I find the one that it
breaks on, and then posting a message on Bugzilla to see if anyone knows what
changed.
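
The search for the first failing kernel can be organized as a bisection instead of a one-by-one walk, which cuts down the number of reboots. The sketch below swaps in that technique. It assumes the kernels are ordered oldest to newest, that the oldest is known good and the newest known bad, and that once the problem appears it stays. The names `first_bad_kernel` and `is_broken` and the placeholder labels "A" through "F" are hypothetical; in practice `is_broken` would mean booting that kernel and trying the LDM stop/start by hand.

```python
def first_bad_kernel(versions, is_broken):
    """Return the first entry of `versions` for which is_broken() is true.

    Assumes versions[0] is good, versions[-1] is bad, and every version
    after the first bad one is also bad.
    """
    good, bad = 0, len(versions) - 1  # indices of a known-good / known-bad kernel
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_broken(versions[mid]):
            bad = mid    # the break happened at mid or earlier
        else:
            good = mid   # the break happened after mid
    return versions[bad]


if __name__ == "__main__":
    # Placeholder labels, not real kernel versions.
    kernels = ["A", "B", "C", "D", "E", "F"]
    # Stub check for demonstration only: pretend everything from "D" on fails.
    print(first_bad_kernel(kernels, lambda k: k >= "D"))
```

With six candidate kernels this needs at most three test boots between the known-good and known-bad ends.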

But for now it works with the old kernel, and other than security fixes I'm not
sure what value the newer kernels have for me.

Randy


Sandy Whitesel wrote:

> since this is obviously an issue in the kernel, the first thing i would do
> is update it.  redhat-6.2 has a "patch" which upgrades the kernel to 2.2.16
> (be sure to read all the directions, and update your initrd and lilo if
> you're using it).  It looks like you are using the default kernel, but if
> you need to compile it for some reason, you might as well download and
> compile 2.2.17.  you may consider looking for the latest versions of drivers
> (like for your raid card) also.  those would be my first steps, and if that
> didn't solve the problem, i'd get more creative.
>
> -sandy
>
> Robb Kambic wrote:
>
> > Randy, et al
> >
> > I'm not an expert with Linux, so I'm cc'ing our sys admin, who knows much
> > more about Linux.  Maybe he can shed some light on the problem, which
> > appears to be in the management of the LDM queue (a memory-mapped file).
> > There are some other IDD sites that use Linux; maybe they can also add
> > some input.  My gut opinion is that either your system configuration for
> > memory-mapped files or the LDM's interaction with the queue needs to be
> > changed.  So maybe you could do some research in that direction.  Is it
> > possible to run the LDM on only one of the four processors?
> >
> > Robb...
> >
> > On Fri, 22 Sep 2000, Randy Weatherly wrote:
> >
> > > Robb,
> > >
> > > Thanks for your response to Jason's email.  We are still having problems
> > > and I would like to run a few more things past you.
> > >
> > > We have two Redhat Linux 6.2 SMP machines, both running 2.2.14-6.1.1smp
> > > (if that matters).  Both machines are 4-processor Dell servers with
> > > Pentium III chips, and both have hardware-based RAID.  Both exhibit the
> > > same behaviour, although one of them is more problematic.
> > >
> > > I've been spending most of my time on ted, the more problematic of the
> > > two.  maul runs, but will die occasionally.  Here's what I've found.
> > >
> > > I've tried both binaries and source, 5.0.8 and 5.1.2.  All behave the
> > > same.  So I don't think it is a problem with LDM, but rather something
> > > between ldm and this version of Linux on this type of machine.
> > >
> > > I am able to get ldm to stop and start pretty well if I stop it, delete
> > > the queue, make the queue, and start it.  But if I stop it and then try to
> > > start it (after making sure no processes are left running), I get these
> > > kernel errors in the system log file:
> > >
> > >  Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
> > >  Sep 18 21:47:32 ted kernel: current->tss.cr3 = 346dc000, %cr3 = 346dc000
> > >  Sep 18 21:47:32 ted kernel: *pde = 00000000
> > >  Sep 18 21:47:32 ted kernel: Oops: 0000
> > >  Sep 18 21:47:32 ted kernel: CPU:    0
> > >  Sep 18 21:47:32 ted kernel: EIP:    0010:[locks_remove_flock+14/148]
> > >  Sep 18 21:47:32 ted kernel: EFLAGS: 00010296
> > >  Sep 18 21:47:32 ted kernel: eax: 00000000   ebx: eeee0ca0   ecx: eeee0ca0   edx: 00000000
> > >  Sep 18 21:47:32 ted kernel: esi: 00000006   edi: f8e72cbc   ebp: bffff664   esp: f0851f10
> > >  Sep 18 21:47:32 ted kernel: ds: 0018   es: 0018   ss: 0018
> > >  Sep 18 21:47:32 ted kernel: Process rpc.ldmd (pid: 8209, process nr: 143, stackpage=f0851000)
> > >  Sep 18 21:47:32 ted kernel: Stack: f8e72cbc bffff664 00001020 00000000 bffff674 c0143cc0 eeee0ca0 f8797ed0
> > >  Sep 18 21:47:32 ted kernel:        c0129a0a 00000286 00000002 f98f1da0 40016000 f9301ba0 f98f1ddc eeee0ca0
> > >  Sep 18 21:47:32 ted kernel:        c011f5e0 fbf92680 f98f1da0 f98f1da0 00000286 c012ae29 eeee0ca0 ffffffea
> > >  Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>] [ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>] [unmap_fixup+116/348]
> > >  Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72] [sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>] [sys_munmap+61/100] [system_call+52/56]
> > >  Sep 18 21:47:32 ted kernel:        [<00000001>] [<00000006>] [<00001020>] [<00000037>] [<0000002b>] [<0000002b>] [<00000037>] [<00000023>]
> > >  Sep 18 21:47:32 ted kernel:        [<00000296>] [<0000002b>]
> > >  Sep 18 21:47:32 ted kernel: Code: 8b 40 08 89 44 24 14 83 c0 74 89 44 24 10 8b 4c 24 14 8b 6c
> > >
> > > Pretty ugly stuff.  Since you mentioned that you thought this was a
> > > corrupt queue, and since it works ok if I delete and remake the queue,
> > > that seems like the right track.  On our other machine, maul, I see those
> > > error messages in the system log file as well, but it doesn't fail very
> > > often.  I think in the last month it has failed only a couple of times.
> > >
> > > Any ideas?  I'm sure this isn't your normal everyday stuff, but I thought
> > > that if you knew of others who had problems with SMP machines or the SMP
> > > version of the kernel, it might help.
> > >
> > > Thanks in advance
> > >
> > > Randy Weatherly
> > >
> >
> > ===============================================================================
> > Robb Kambic                                Unidata Program Center
> > > Software Engineer III                      Univ. Corp for Atmospheric Research
> > address@hidden                   WWW: http://www.unidata.ucar.edu/
> > ===============================================================================

--
Randy Weatherly             AWIPS/Computer Systems Analyst
National Weather Service
Salt Lake City UT
address@hidden    801-524-5120 x284