[TIGGE #EVX-684652]: ldm crash



Manuel,

> We have installed version 6.4.6 and have been running for more than a
> month. We have created a script to monitor whether LDM is running or
> not. From time to time, ldm is not running.

How does your script determine that the LDM is not running?  A single 
ldmping(1) can report that the LDM is not running when, in fact, it is.  This 
happens if the LDM is very busy and unable to service the ldmping(1) request 
before it times out.
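
One way to guard against such false alarms is to treat the LDM as down only 
after several consecutive probe failures.  The following is a minimal sketch; 
the function name ldm_is_up and its arguments are illustrative and are not part 
of the LDM package.

```shell
#!/bin/sh
# ldm_is_up ATTEMPTS DELAY COMMAND [ARG...]
# Run COMMAND up to ATTEMPTS times, pausing DELAY seconds between tries.
# Returns 0 as soon as one try succeeds and 1 only if every try fails.
# (Hypothetical helper for a monitoring script; not an LDM utility.)
ldm_is_up() {
    attempts=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$attempts" ]; do
        # Discard the probe's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        try=$((try + 1))
        # Pause before the next try, but not after the last one.
        if [ "$try" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A monitoring script might then call something like 
"ldm_is_up 3 10 ldmping -i 0 localhost", on the assumption that "-i 0" makes 
ldmping(1) exit after a single attempt; please check that against the 
ldmping(1) manual page for your version.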

> Also, we have several
> occasions when we are unable to start LDM, since the queue seems corrupt:
> 
> ldm@tigge-ldm:~> ldmadmin clean
> 
> ldm@tigge-ldm:~> ldmadmin start
> The writer-counter of the product-queue is not zero.  Either
> a process has the product-queue open for writing or the queue
> might be corrupt.  Terminate the process and recheck or use
> pqcat -l- -s -q /usr/local/ldm/data/ldm.pq && pqcheck -F -q
> /usr/local/ldm/data/ldm.pq
> to validate the queue and set the writer-counter to zero.
> Dec 4 14:42:49 UTC tigge-ldm.ecmwf.int : LDM not started
> 
> ldm@tigge-ldm:~> pqcat -l- -s -q /usr/local/ldm/data/ldm.pq && pqcheck
> -F -q /usr/local/ldm/data/ldm.pq
> Dec 04 14:43:29 pqcat NOTE: Starting Up (28241)
> Dec 04 14:43:44 pqcat ERROR: pqcat queueSanityCheck: Product count
> doesn't match
> Dec 04 14:43:44 pqcat ERROR: products tallied: 50745   Value in queue: 50746
> Dec 04 14:43:44 pqcat NOTE: Exiting
> Dec 04 14:43:44 pqcat NOTE: Number of products 50745

The error messages from "ldmadmin start" and pqcat(1) do indicate that the 
product-queue is corrupt.  Your only recourse is to remake the queue.
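
With the LDM stopped, remaking the queue might look like the sketch below.  It 
assumes that your ldmadmin(1) provides the "delqueue" and "mkqueue" commands; 
the remake_queue wrapper and the LDMADMIN variable are illustrative only.  Any 
products still in the old queue are discarded.

```shell
#!/bin/sh
# Path to ldmadmin; overridable so the sequence can be rehearsed with a stub.
# (LDMADMIN is a hypothetical variable introduced for this sketch.)
LDMADMIN=${LDMADMIN:-ldmadmin}

# Delete the corrupt product-queue, create an empty one, and start the LDM.
# Each step runs only if the previous one succeeded.
remake_queue() {
    "$LDMADMIN" delqueue &&
    "$LDMADMIN" mkqueue &&
    "$LDMADMIN" start
}
```

Running the script with LDMADMIN=echo first prints the three commands without 
touching the queue, which is a cheap way to check the sequence.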

What is causing the product-queue to become corrupt?  This should occur only if 
a process that has the product-queue open for writing terminates abnormally, 
for example by crashing or by receiving a SIGKILL.  Is your monitoring script 
sending a SIGKILL to the LDM system as a result of an ldmping(1) failure?
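
If the monitoring script does need to stop a process, "ldmadmin stop" is the 
normal route for the LDM itself.  For anything else that might hold the queue 
open, a sketch along the following lines sends SIGTERM, waits, and deliberately 
never escalates to SIGKILL.  The function name stop_gracefully is illustrative.

```shell
#!/bin/sh
# stop_gracefully PID TIMEOUT
# Send SIGTERM to PID and wait up to about TIMEOUT seconds for it to exit.
# Returns 0 if the process is gone, 1 if it is still running after the wait.
# SIGKILL is never sent, so a writer gets the chance to release the queue.
# (Hypothetical helper; not an LDM utility.)
stop_gracefully() {
    pid=$1
    timeout=$2
    # kill fails if the process no longer exists (or is not ours to signal).
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    # "kill -0" sends no signal; it only tests whether PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A return value of 1 is a cue to investigate why the process is not exiting 
rather than to reach for "kill -9".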

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: EVX-684652
Department: Support IDD TIGGE
Priority: Normal
Status: On Hold