
20050420: LDM scour issue and computer lock-up



Gabriel,

> To: address@hidden
> From: "Gabriel Langbauer" <address@hidden>
> Subject: LDM - 2 separate problems (on 2 machines): scour issue and computer 
> lock up
> Organization: UCAR/Unidata
> Keywords: 200504191903.j3JJ3u4n010376

The above message contained the following:

> Institution: Ohio State University
> Package Version: 6.1.0
> Operating System: RedHat enterprise linux v. 3
> Hardware Information: dell xeon processor
> Inquiry: Hello,
> 
> Here at Ohio State we've started to experience a problem where our
> system completely locks up occasionally.  When this occurs, LDM crashes
> and gempak no longer gets called from the crontab.  I've attached a log
> file from one of the days in question (both a full and a short
> version).

The complete logfile is full of entries like this:

    Apr 19 13:44:36 twister eldm[29563]: Desired product class: 20050419124436.926 TS_ENDT {{PCWS,  "^FSL\.NetCDF\.ACARS\.QC\."}}
    Apr 19 13:44:36 twister eldm[29563]: ERROR: requester6.c:455; ldm_clnt.c:286: nullproc_6 failure to eldm.fsl.noaa.gov; ldm_clnt.c:142: RPC: Unable to receive; errno = Connection reset by peer

which means that the LDM on host Twister is unable to receive
data-products of feedtype PCWS from host eldm.fsl.noaa.gov because that
system keeps closing the connection.  Either Twister's LDM should stop
requesting that data (which would unclutter the logfile) or you should
contact eldm.fsl.noaa.gov's LDM administrator to correct the problem.
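
To gauge how often each upstream host resets the connection, the
logfile can be tallied along these lines.  This is only a sketch: the
function name count_resets is made up for illustration, and it assumes
that each log entry occupies a single line in the actual logfile.

```shell
# Count "Connection reset by peer" errors per upstream host.
# Usage: count_resets LOGFILE   (count_resets is an illustrative name)
count_resets() {
    grep 'Connection reset by peer' "$1" \
        | sed -n 's/.*failure to \([^;]*\);.*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by the host name, with the most
frequent offender first.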

> The short version has been egrep'd for the following
> expressions:
> 
> pnga2area
> eldm\[29563\]: Desired
> eldm\[29563\]: ERROR
> 
> I have also attached the short version of the previous days log file.
> 
> Other log files show that the crontab stopped calling programs sometime
> between 9:19 and 9:49

The last entry in that time period:

    Apr 19 09:32:22 twister pqact[29562]: pbuf_flush 9: time elapsed 4.275043 

indicates that the LDM system was still running at 09:32:22.
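
One way to pull the last entry in a time window out of a logfile is
sketched below.  The function name last_entry_between is hypothetical,
and the sketch assumes syslog-style lines (month, day, zero-padded
HH:MM:SS) that all come from the same day.

```shell
# Print the last logfile entry whose timestamp lies in [START, END].
# Usage: last_entry_between LOGFILE START END   (times as HH:MM:SS)
last_entry_between() {
    awk -v start="$2" -v end="$3" '
        # Field 3 is the HH:MM:SS timestamp; zero-padded times compare
        # correctly as strings.
        $3 >= start && $3 <= end { last = $0 }
        END { if (last != "") print last }
    ' "$1"
}
```

For example, "last_entry_between ldmd.log 09:19:00 09:49:00" would
cover the window mentioned above (the logfile name is a placeholder).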

I interpret the logfile entries as indicating that the LDM system
crashed, leaving the writer-counter of the product-queue at 12.
Sometime between 18:24:45 and 18:25:10 the writer-counter was forced to
zero (no doubt using "pqcheck -F").

When a computer crashes, I recommend deleting and recreating the
product-queue ("ldmadmin delqueue && ldmadmin mkqueue -f").  Otherwise,
the product-queue might be corrupt and you'll never know until something
inexplicable happens.
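
A minimal sketch of that recovery sequence follows, assuming the LDM is
not running (as is the case after a crash).  The LDMADMIN variable and
the function name recreate_queue are illustrative only, and the final
"ldmadmin start" is an added convenience rather than part of the
recommendation above.

```shell
# LDMADMIN is a hypothetical override so the sequence can be dry-run,
# e.g. with LDMADMIN=echo.
LDMADMIN=${LDMADMIN:-ldmadmin}

# Delete and recreate the product-queue, then restart the LDM.
# Each step runs only if the previous one succeeded.
recreate_queue() {
    $LDMADMIN delqueue &&
    $LDMADMIN mkqueue -f &&
    $LDMADMIN start
}
```

Run recreate_queue as the LDM user.  Data-products that were in the old
queue are lost, so the LDM will request them again from upstream only
to the extent that its request settings allow.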

> Separately, I've built another ldm on another machine (PC running Fedora
> Core 2) to host the AIDD for the Byrd Center.  However, my scour script
> isn't working.  I get the following error when I run scour:  Couldn't
> discover meaning of '-mtime' argument of find(1)

Interesting.  Unfortunately, the meaning of the "-mtime" option of the
find(1) utility isn't specified exactly by the UNIX standard and it
actually behaves differently on different systems. scour(1), therefore,
attempts to discover the exact meaning of the "-mtime" option before
using it.  Apparently, on your system, it can't.

Please execute the following sh(1) script as the LDM user on the system
in question to ascertain the problem:

    dayOffsetName=scour_$$
    cd /tmp
    # Create the test file that the find(1) commands below look for.
    touch $dayOffsetName
    if find . \! -name . -prune -mtime 0 -name $dayOffsetName \
                | grep $dayOffsetName >/dev/null; then
        echo DAY_OFFSET=1
    elif find . \! -name . -prune -mtime 1 -name $dayOffsetName \
                | grep $dayOffsetName >/dev/null; then
        echo DAY_OFFSET=0
    else
        echo "Couldn't discover meaning of '-mtime' argument of find(1)"
        rm -f $dayOffsetName
        exit 1
    fi
    # Remove the test file.
    rm -f $dayOffsetName

> Any help on these issues would be greatly appreciated.

Regards,
Steve Emmerson

> NOTE: All email exchanges with Unidata User Support are recorded in the
> Unidata inquiry tracking system and then made publicly available
> through the web.  If you do not want to have your interactions made
> available in this way, you must let us know in each email you send to us.