
Re: [Fwd: 20000918: Linux rpc.ldmd problem]



Well, I took it one more step, forward this time.  I couldn't find 2.2.17, but
instead got 2.2.16-22.  It works!!

I haven't let it run very long yet, but so far I can stop and start it at will.
With the kernels that didn't work, I couldn't stop it once it was started
without problems.  So I am confident.  I'll run it for a while and see how it
goes.
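
For anyone who wants to repeat the stop/start check without typing it by hand,
here is a minimal sketch.  It assumes `ldmadmin` is on the PATH of the LDM user
and returns a non-zero exit status when a stop or start fails.  The function
name `restart_cycles` and the injectable `run` parameter are made up for
illustration.  The queue is deliberately not deleted or remade between cycles,
since that was the case that failed on the bad kernels.

```python
import subprocess


def restart_cycles(cycles, run=None):
    """Stop and start the LDM `cycles` times without recreating the queue.

    Returns the number of full stop/start cycles completed before a
    command reported failure (equal to `cycles` if none failed).
    """
    if run is None:
        # Assumption: `ldmadmin` is on PATH and its exit status is
        # non-zero when the requested action fails.
        def run(args):
            return subprocess.run(args).returncode

    for done in range(cycles):
        for action in ("stop", "start"):
            if run(["ldmadmin", action]) != 0:
                return done
    return cycles
```

Calling `restart_cycles(10)` as the LDM user would attempt ten cycles.  A
passing run says nothing about the kernel log, so the system log is still worth
checking for new oops messages afterwards.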

Randy

Randy Weatherly wrote:

> Sandy,
> Thanks for the response.
>
> I did upgrade to 2.2.16-4.  And also made sure I had the latest drivers and
> even firmware for the hardware.  I think it is strictly a kernel issue since
> I have a second Dell quad processor that has different raid hardware and it
> exhibits the same behaviour.  I just replied to Robb's message about dropping
> back to an old kernel and getting it to work.
>
> I thought about progressing up through the kernels until I find the one that
> it breaks on and then posting a message on Bugzilla to see if anyone knows
> what changed.
>
> But for now it works with the old kernel, and other than security fixes I'm
> not sure what value the newer kernels have for me.
>
> Randy
>
> Sandy Whitesel wrote:
>
> > since this is obviously an issue in the kernel, the first thing i would do
> > is update it.  redhat-6.2 has a "patch" which upgrades the kernel to 2.2.16
> > (be sure to read all the directions, and update your initrd and lilo if
> > you're using it).  It looks like you are using the default kernel, but if
> > you need to compile it for some reason, you might as well download and
> > compile 2.2.17.  you may consider looking for the latest versions of
> > drivers (like for your raid card) also.  those would be my first steps, and
> > if that didn't solve the problem, i'd get more creative.
> >
> > -sandy
> >
> > Robb Kambic wrote:
> >
> > > Randy, et al
> > >
> > > I'm not an expert with Linux so I'm cc'ing our sys admin, who knows much
> > > more about Linux.  Maybe he can shed some light on the problem, which
> > > appears to be in the management of the LDM queue (a memory-mapped file).
> > > There are some other IDD sites that use Linux; maybe they can also add
> > > some input. My gut opinion is that either your system configuration for
> > > memory-mapped files or the LDM's interaction with the queue needs to be
> > > changed. So maybe you could do some research in that direction. Is it
> > > possible to run the LDM only on one of the four processors?
> > >
> > > Robb...
> > >
> > > On Fri, 22 Sep 2000, Randy Weatherly wrote:
> > >
> > > > Robb,
> > > >
> > > > Thanks for your response to Jason's email.  We are still having problems
> > > > and I would like to run a few more things past you.
> > > >
> > > > We have two Redhat Linux 6.2 SMP machines.  Both are running
> > > > 2.2.14-6.1.1smp (if that matters).  Both machines are 4 processor Dell
> > > > servers, Pentium III chips and both have hardware based raid.  Both
> > > > exhibit the same behaviour although one of them is more problematic.
> > > >
> > > > I've been spending most of my time on ted.  It is the more problematic of
> > > > the two.  maul runs, but will die occasionally.  Here's what I've found.
> > > >
> > > > I've tried both binaries and source, 5.0.8 and 5.1.2.  All behave the
> > > > same.  So I don't think it is a problem with LDM, but rather something
> > > > between ldm and this version of Linux on this type of machine.
> > > >
> > > > I am able to get ldm to stop and start pretty well if I stop it, delete
> > > > the queue, make the queue, and start it.  But if I stop it, then try and
> > > > start (after making sure no processes are left running), then I get the
> > > > kernel errors in the system log file:
> > > >
> > > >  Sep 18 21:47:32 ted kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000008
> > > >  Sep 18 21:47:32 ted kernel: current->tss.cr3 = 346dc000, %cr3 = 346dc000
> > > >  Sep 18 21:47:32 ted kernel: *pde = 00000000
> > > >  Sep 18 21:47:32 ted kernel: Oops: 0000
> > > >  Sep 18 21:47:32 ted kernel: CPU:    0
> > > >  Sep 18 21:47:32 ted kernel: EIP:    0010:[locks_remove_flock+14/148]
> > > >  Sep 18 21:47:32 ted kernel: EFLAGS: 00010296
> > > >  Sep 18 21:47:32 ted kernel: eax: 00000000   ebx: eeee0ca0   ecx: eeee0ca0   edx: 00000000
> > > >  Sep 18 21:47:32 ted kernel: esi: 00000006   edi: f8e72cbc   ebp: bffff664   esp: f0851f10
> > > >  Sep 18 21:47:32 ted kernel: ds: 0018   es: 0018   ss: 0018
> > > >  Sep 18 21:47:32 ted kernel: Process rpc.ldmd (pid: 8209, process nr: 143, stackpage=f0851000)
> > > >  Sep 18 21:47:32 ted kernel: Stack: f8e72cbc bffff664 00001020 00000000 bffff674 c0143cc0 eeee0ca0 f8797ed0
> > > >  Sep 18 21:47:32 ted kernel:        c0129a0a 00000286 00000002 f98f1da0 40016000 f9301ba0 f98f1ddc eeee0ca0
> > > >  Sep 18 21:47:32 ted kernel:        c011f5e0 fbf92680 f98f1da0 f98f1da0 00000286 c012ae29 eeee0ca0 ffffffea
> > > >  Sep 18 21:47:32 ted kernel: Call Trace: [<00001020>] [<00000000>] [ext2_release_file+20/28] [__fput+62/72] [<00000286>] [<00000002>] [unmap_fixup+116/348]
> > > >  Sep 18 21:47:32 ted kernel:        [<00000286>] [fput+17/72] [sys_fcntl+1031/1064] [<00001020>] [<00001020>] [<00002000>] [sys_munmap+61/100] [system_call+52/56]
> > > >  Sep 18 21:47:32 ted kernel:        [<00000001>] [<00000006>] [<00001020>] [<00000037>] [<0000002b>] [<0000002b>] [<00000037>] [<00000023>]
> > > >  Sep 18 21:47:32 ted kernel:        [<00000296>] [<0000002b>]
> > > >  Sep 18 21:47:32 ted kernel: Code: 8b 40 08 89 44 24 14 83 c0 74 89 44 24 10 8b 4c 24 14 8b 6c
> > > >
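
When comparing oops reports between machines such as ted and maul, it can help
to pull out just the resolved symbol names.  A minimal sketch follows, assuming
the log uses the `[function+offset/size]` notation seen in the report quoted
here.  The names `SYMBOL`, `oops_symbols` and `SAMPLE` are made up for
illustration, and `SAMPLE` is a short excerpt of that report.

```python
import re

# Matches resolved symbols such as [locks_remove_flock+14/148];
# unresolved entries such as [<00001020>] are skipped.
SYMBOL = re.compile(r"\[([A-Za-z_]\w*)\+(\d+)/(\d+)\]")


def oops_symbols(log_text):
    """Return (function, offset, size) tuples in the order they appear."""
    return [(name, int(offset), int(size))
            for name, offset, size in SYMBOL.findall(log_text)]


# Excerpt from the oops in this thread.
SAMPLE = """\
EIP:    0010:[locks_remove_flock+14/148]
Call Trace: [<00001020>] [<00000000>] [ext2_release_file+20/28] [__fput+62/72]
"""

symbols = oops_symbols(SAMPLE)
faulting_function = symbols[0][0]   # the EIP line comes first in the excerpt
```

In the excerpt the first entry is the function on the EIP line,
locks_remove_flock, and the remaining entries come from the call trace.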
> > > > Pretty ugly stuff.  Since you mentioned that you thought this was a
> > > > corrupt queue, and since if I delete and remake the queue it works ok,
> > > > that seems like the right track.  On our other machine, maul, I see those
> > > > error messages in the system log file as well.  But it doesn't fail very
> > > > often.  I think in the last month, it has failed only a couple of times.
> > > >
> > > > Any ideas?  I'm sure this isn't your normal everyday stuff, but I thought
> > > > if maybe you knew of others that had problems with SMP machines or the
> > > > SMP version of the kernel it might help.
> > > >
> > > > Thanks in advance
> > > >
> > > > Randy Weatherly
> > > >
> > >
> > > ===============================================================================
> > > Robb Kambic                                Unidata Program Center
> > > Software Engineer III                      Univ. Corp for Atmospheric Research
> > > address@hidden                   WWW: http://www.unidata.ucar.edu/
> > > ===============================================================================
>
> --
> Randy Weatherly             AWIPS/Computer Systems Analyst
> National Weather Service
> Salt Lake City UT
> address@hidden    801-524-5120 x284

--
Randy Weatherly             AWIPS/Computer Systems Analyst
National Weather Service
Salt Lake City UT
address@hidden    801-524-5120 x284