
20030409: ldm-6.0.10 issues under irix 6.5



Pete,

>Date: Wed, 09 Apr 2003 14:02:15 -0500
>From: Pete Pokrandt <address@hidden>
>Organization: University of Wisconsin
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20030409: ldm-6.0.10 issues under irix 6.5 

The above message contained the following:

> Answers to your questions are below. Hopefully some clues.
> Take a look through. I'll test 6.0.10 with the same queue
> size that works for 6.0.2 in the meantime, to see if that
> matters.

I doubt that queue size is important.

> Process 1710 was an rpc.ldmd feeding from f5:
> 
> sunset 8% grep 1710 ldmd.log.3
> Apr 08 21:33:14 5Q:sunset f5[1710]: Starting Up(6.0.10): f5.aos.wisc.edu: 
> TS_ZERO TS_ENDT {{HDS,  "(^[^S|^S[^D]|^SD..[^59]|^SD..9[^7])"}}
> Apr 08 21:33:14 5Q:sunset f5[1710]: Desired product class: 20030408203314.458 
> TS_ENDT {{HDS,  "(^[^S|^S[^D]|^SD..[^59]|^SD..9[^7])"}}
> Apr 08 21:33:14 5Q:sunset f5[1710]: Connecting to upstream LDM using protocol 
> version 6...
> Apr 08 21:33:14 5Q:sunset f5[1710]: Upstream LDM is willing to feed
> Apr 08 21:43:02 5Q:sunset rpc.ldmd[1697]: child 1710 terminated by signal 9

Then we'd better figure this out.

>  >Signal 9 is SIGKILL, which cannot be caught or ignored by a process and
>  >is actually handled by the operating system on behalf of the "receiving"
>  >process.  Because this signal isn't used by the LDM package, the only
>  >way a process of the LDM package could be "sent" this signal is by an
>  >outside source.
>  >
>  >Who or what "sent" the SIGKILL to process 1710?
> 
> From /var/adm/SYSLOG: [didn't think to look there yesterday]
> 
> Apr  8 16:42:56 1A:sunset unix: |$(0x6dd)ALERT: Process [rpc.ldmd] 1710 
> generated trap, but has signal 10 held or ignored
> Apr  8 16:42:56 6A:sunset unix:         epc 0x1001a75c ra 0x1001a754 badvaddr 
> 0x846796dc
> Apr  8 16:42:56 6A:sunset unix: Process has been killed to prevent infinite 
> loop

Interesting. Based on my knowledge of the LDM and our IRIX 6.5
sigaction(2) manual-page, I suspect that a SIGBUS (signal 10 on IRIX)
is being generated inside a critical section of the "pq" module while
that signal is blocked.  This might explain the lack of a core file.
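
As a rough illustration of that suspicion (a sketch only, not the
actual pq.c code): the function names below are made up, and I'm
assuming the blocked set starts out full via sigfillset().  It shows
that a mask built as "everything except SIGSEGV" leaves SIGBUS blocked,
and that also deleting SIGBUS from the set leaves it deliverable.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for entering a critical section: blocks every
 * signal except SIGSEGV and, if unblock_sigbus is non-zero, SIGBUS.
 * The previous mask is stored in *saved.  Returns 0 on success.
 */
static int enter_critical(int unblock_sigbus, sigset_t *saved)
{
    sigset_t set;

    (void) sigfillset(&set);        /* assumption: start from a full set */
    (void) sigdelset(&set, SIGSEGV);
    if (unblock_sigbus)
        (void) sigdelset(&set, SIGBUS);
    return sigprocmask(SIG_BLOCK, &set, saved);
}

/* Restores the mask saved by enter_critical().  Returns 0 on success. */
static int leave_critical(const sigset_t *saved)
{
    return sigprocmask(SIG_SETMASK, saved, NULL);
}

/* Returns 1 if signo is currently blocked, 0 if not, -1 on error. */
static int is_blocked(int signo)
{
    sigset_t cur;

    /* A NULL new set leaves the mask unchanged and just reports it. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) < 0)
        return -1;
    return sigismember(&cur, signo);
}
```

After enter_critical(0, &saved), is_blocked(SIGBUS) reports 1; after
enter_critical(1, &saved) it reports 0, provided SIGBUS wasn't already
blocked by the caller.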

I'm unable to duplicate this problem on our IRIX 6.5 system so, if
you're willing, I'd like your help in the analysis. Specifically:

    1.  Go into the "pq" subdirectory of the LDM 6.0.10 source tree.

    2.  Edit the file "pq.c".  Search for SIGSEGV.  Change

            (void) sigdelset(&set, SIGSEGV);
            if(sigprocmask(SIG_BLOCK, &set, &pq->sav_set) < 0)

        to

            (void) sigdelset(&set, SIGSEGV);
            (void) sigdelset(&set, SIGBUS);
            if(sigprocmask(SIG_BLOCK, &set, &pq->sav_set) < 0)

        (i.e., add SIGBUS to the set of signals that will NOT be blocked).
        This should allow the SIGBUS to generate a core file.

    3.  Rebuild and reinstall LDM 6.0.10 with debugging enabled.  In general
        this will mean the following:

        A.  Go to the top-level source directory.

        B.  Execute the command "make distclean".

        C.  Ensure that the environment variable CFLAGS contains the string
            "-g".

        D.  Execute the configure script.

        E.  Execute the command "make".

        F.  Execute the command "make install".

        G.  Execute the command "make install_setuids" as root.

    4.  Ensure that you can generate a core file (e.g., check that
        "ulimit -c" doesn't report 0).

    5.  Stop LDM 6.0.2.

    6.  Start LDM 6.0.10.

    7.  When it fails, revert to LDM 6.0.2.

    8.  Use a debugger (e.g., dbx(1)) on the LDM 6.0.10 program rpc.ldmd
        and the just-generated core file to determine exactly where
        the error occurred (the core file should be in the LDM user's
        home-directory).
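
For step 4, the same check can be made from a program if that's more
convenient.  This is a minimal sketch using the standard
getrlimit(2)/setrlimit(2) calls; the helper names are hypothetical and
nothing like this exists in the LDM source.  Note that a limit raised
this way applies only to the calling process and its children.

```c
#define _XOPEN_SOURCE 700
#include <sys/resource.h>

/*
 * Returns 1 if the soft core-file size limit is non-zero, 0 if it is
 * zero (no core file would be written), -1 on error.
 */
static int core_files_enabled(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) < 0)
        return -1;
    return lim.rlim_cur != 0;
}

/*
 * Raises the soft core-file size limit to the hard limit, which an
 * unprivileged process is always allowed to do.  Returns 0 on success.
 */
static int raise_core_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) < 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If core_files_enabled() still returns 0 after raise_core_limit(), the
hard limit itself is zero and has to be raised by root or in the
system configuration.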

If you have any questions about any of this, please contact me.

Regards,
Steve Emmerson