
20040831 rpc.ldmd signal 11s



Hi Art,

> To: address@hidden
> From: "Arthur A. Person" <address@hidden>
> Subject: rpc.ldmd signal 11's
> Organization: Penn State University
> Keywords: 200408311254.i7VCsh8E018445

The above message contained the following:

> I've seen two cases (on two separate systems) where an rpc.ldmd process 
> has died with signal 11 killing LDM data collection.  Both are running LDM 
> V6.0.15 on RedHat EL 3 update 2 kernel 2.4.21-15.0.4.ELsmp and fully 
> patched.

This is new.  I'll see if I can reproduce that behavior here.

> The process did not core dump.

The LDM user must tell the operating system to allow core dumps.
This is usually done via the command

    ulimit -c unlimited

Before executing the "ldmadmin start" command, verify that core-dumps are
allowed via the command

    ulimit -c

and use the previous command if they're not (an output of 0 means that
core dumps are disabled).
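
As a rough sketch, assuming a Bourne-compatible shell such as bash, the
check and the fix could be combined as follows.  The function name
"enable_core_dumps" is hypothetical and not part of the LDM:

```shell
# Hypothetical helper (not part of the LDM): allow core dumps in this
# shell if they are currently disabled, then print the resulting limit.
enable_core_dumps() {
    if [ "$(ulimit -c)" = "0" ]; then
        # Raising the limit fails if the hard limit forbids it and the
        # caller is not the superuser.
        ulimit -c unlimited || return 1
    fi
    ulimit -c
}
```

The limit applies to the current shell and the processes it starts, so
something like "enable_core_dumps && ldmadmin start" would have to be
run from the same shell.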

> Here's an excerpt of the ldmd.log files for the most recent:
> 
> Aug 26 13:58:49 ls2 rpc.ldmd[3634]: Starting Up (version: 6.0.15; built: 
> Jul 14 2004 15:25:10)
> 
> Aug 26 13:58:49 ls2 pqact[3637]: Starting Up
> Aug 26 13:58:49 ls2 pqact[3638]: Starting Up
> Aug 26 13:58:49 ls2 pqact[3639]: Starting Up
> Aug 26 13:58:49 ls2 pqbinstats[3635]: Starting Up (3634)
> Aug 26 13:58:49 ls2 pqact[3641]: Starting Up
> Aug 26 13:58:49 ls2 pqact[3640]: Starting Up
> Aug 26 13:58:49 ls2 ldm[3645]: Starting Up(6.0.15): ldm.meteo.psu.edu: 
> TS_ZERO TS_ENDT {{ANY,
>   ".*"}}
> Aug 26 13:58:49 ls2 ldm[3645]: Desired product class: 20040826135844.784 
> TS_ENDT {{ANY,  ".*"}
> }
> Aug 26 13:58:49 ls2 pqsurf[3643]: Starting Up (3634)
> Aug 26 13:58:49 ls2 rtstats[3644]: Starting Up (3634)
> Aug 26 13:58:49 ls2 pqact[3646]: Starting Up
> Aug 26 13:58:50 ls2 ldm[3645]: Connected to upstream LDM-6
> Aug 26 13:58:51 ls2 ldm[3645]: Upstream LDM is willing to feed
> Aug 26 14:00:06 ls2 pnga2area[4203]: Starting Up
> Aug 26 14:00:06 ls2 pnga2area[4203]: unPNG::   115626    309200  2.6741
> Aug 26 14:00:06 ls2 pnga2area[4203]: Exiting
> Aug 26 14:00:50 ls2 pnga2area[4780]: Starting Up
> Aug 26 14:00:50 ls2 pnga2area[4780]: unPNG::   856353   4506096  5.2620
> Aug 26 14:00:50 ls2 pnga2area[4780]: Exiting
> Aug 26 14:00:52 ls2 pnga2area[4819]: Starting Up
> Aug 26 14:00:52 ls2 pnga2area[4819]: unPNG::  1067122   4506096  4.2227
> Aug 26 14:00:52 ls2 pnga2area[4819]: Exiting
>     .
>     .
>     .
> Aug 28 05:33:04 ls2 pnga2area[30968]: Starting Up
> Aug 28 05:33:04 ls2 pnga2area[30968]: unPNG::    90094    242720  2.6941
> Aug 28 05:33:04 ls2 pnga2area[30968]: Exiting
> Aug 28 05:34:03 ls2 pnga2area[31478]: Starting Up
> Aug 28 05:34:03 ls2 pnga2area[31478]: unPNG::    74544    242720  3.2561
> Aug 28 05:34:03 ls2 pnga2area[31478]: Exiting
> Aug 28 05:35:17 ls2 rpc.ldmd[3634]: child 3645 terminated by signal 11
> Aug 28 05:35:17 ls2 rpc.ldmd[3634]: Killing (SIGINT) process group
> Aug 28 05:35:17 ls2 pqact[3637]: Interrupt
> Aug 28 05:35:17 ls2 pqbinstats[3635]: Interrupt
> Aug 28 05:35:17 ls2 pqact[3637]: Exiting
> Aug 28 05:35:17 ls2 pqact[3638]: Interrupt
> Aug 28 05:35:17 ls2 pqact[3638]: Exiting
> Aug 28 05:35:17 ls2 pqact[3639]: Interrupt
> Aug 28 05:35:17 ls2 rtstats[3644]: Interrupt
> Aug 28 05:35:17 ls2 rpc.ldmd[3634]: SIGINT
> Aug 28 05:35:17 ls2 pqact[3646]: Interrupt
> Aug 28 05:35:17 ls2 pqsurf[3643]: Interrupt
> Aug 28 05:35:17 ls2 pqact[3641]: Interrupt
> Aug 28 05:35:17 ls2 pqact[3639]: Exiting
> Aug 28 05:35:17 ls2 pqact[3646]: Exiting
> Aug 28 05:35:17 ls2 pqact[3640]: Interrupt
> Aug 28 05:35:17 ls2 pqact[3640]: Exiting
> Aug 28 05:35:17 ls2 pqbinstats[3635]: Exiting
> Aug 28 05:35:17 ls2 rtstats[3644]: Exiting
> Aug 28 05:35:17 ls2 pqsurf[3643]: Exiting
> Aug 28 05:35:17 ls2 pqact[3641]: Exiting
> Aug 28 05:35:17 ls2 pqsurf[3643]:   Queue usage (bytes):10682240
> Aug 28 05:35:17 ls2 pqsurf[3643]:            (nregions):   54034
> Aug 28 05:35:17 ls2 pqsurf[3643]: Number of products 86836
> Aug 28 05:35:17 ls2 pqsurf[3643]: Number of observations 381591
> Aug 28 05:35:17 ls2 pqsurf[3643]: Number of dups 51657
> Aug 28 05:35:17 ls2 rpc.ldmd[3634]: Terminating process group
> 
> Any ideas what might be causing this, and/or what I might do to capture 
> more/better information to track it down?

Unless the LDM server is built with the "-g" (debugging) option, the
core-dump will be of limited utility.  If you don't mind, doing the
following would help greatly:

    1.  Go to the top-level source-directory.

    2.  Execute the command "make distclean".

    3.  Set the environment variables CFLAGS and CPPFLAGS to "-g" 
        and "-DNDEBUG", respectively (without the quotes).

    4.  Execute the following commands in order:
    
            make
            ldmadmin stop

    5.  Become the superuser.

    6.  Execute the following commands in order:
    
            make server/install_setuids
            ulimit -c unlimited
            ldmadmin start

    7.  Cross your fingers.  :-)
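
For step 3, a minimal sketch assuming a Bourne-compatible shell (sh,
bash, ksh) might look like the following; a csh-style shell would use
"setenv" instead:

```shell
# Step 3: build with debugging symbols (-g) and define NDEBUG.
CFLAGS=-g
CPPFLAGS=-DNDEBUG
# Export both so that the "make" in step 4 sees them.
export CFLAGS CPPFLAGS
```

The variables only last for the current shell session, so the "make"
command needs to be run from that same shell.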

Your help in this would be greatly appreciated.

>                                  Thanks.
> 
>                                    Art.
> 
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email:  address@hidden, phone:  814-863-1563

Regards,
Steve Emmerson

> NOTE: All email exchanges with Unidata User Support are recorded in the
> Unidata inquiry tracking system and then made publicly available
> through the web.  If you do not want to have your interactions made
> available in this way, you must let us know in each email you send to us.