
[TIGGE #EVX-684652]: ldm crash



Manuel,

> We execute the following lines to check if LDM is running...
> 
> kill -s 0 $(cat ldmd.pid) && exit 0
> 
> ps -elf | mail -s "TIGGE: ldm is not running on $HOSTNAME"
> address@hidden address@hidden
> 
> killall -9 rpc.ldmd pqact ldmping rtstats send || true
> ldmadmin clean
> pqcat -l- -s -q /usr/local/ldm/data/ldm.pq && pqcheck -F -q
> /usr/local/ldm/data/ldm.pq
> ldmadmin start
> 
> LDM has just crashed again. This is the output of the script that
> contains the above lines, which has just executed:
> 
> ++ cat ldmd.pid
> + kill -s 0 14972
> -bash: line 6: kill: (14972) - No such process
> + ps -elf
> + mail -s 'TIGGE: ldm is not running on tigge-ldm' address@hidden
> address@hidden
> + killall -9 rpc.ldmd pqact ldmping rtstats send
> ldmping: no process killed
> + true
...
> 0 D ldm 610 16806 0 76 0 - 1004497 sync_p 17:10 ? 00:00:00 pqinsert -v -l /usr/local/ldm/logs/ldmd.log -p z_tigge_c_rjtd_20061202120000_glob_prod_pf_pl_0060_014_0300_v.grib:14747
...
1 S ldm      14982     1  1  76   0 - 1005582 -    15:51 ?        00:01:34 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14983     1  3  75   0 - 1005582 -    15:51 ?        00:02:34 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14984     1  3  76   0 - 1005582 -    15:51 ?        00:02:36 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14985     1  0  75   0 - 1005649 -    15:51 ?        00:00:14 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14986     1  5  75   0 - 1005581 -    15:51 ?        00:04:40 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14987     1  0  75   0 - 1005649 -    15:51 ?        00:00:13 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14988     1  5  76   0 - 1005581 -    15:51 ?        00:04:24 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14989     1  0  75   0 - 1005649 -    15:51 ?        00:00:14 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14990     1  5  75   0 - 1005582 -    15:51 ?        00:04:26 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14991     1  0  75   0 - 1005649 -    15:51 ?        00:00:15 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14992     1  5  76   0 - 1005582 -    15:51 ?        00:04:23 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14993     1  0  75   0 - 1005648 -    15:51 ?        00:00:15 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14994     1  5  75   0 - 1005582 -    15:51 ?        00:04:30 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14995     1  0  75   0 - 1005648 -    15:51 ?        00:00:13 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14996     1  5  75   0 - 1005582 -    15:51 ?        00:04:24 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm      14997     1  0  75   0 - 1005648 -    15:51 ?        00:00:14 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
...

It appears that the "kill -s 0" sent to the top-level LDM server is
returning an unsuccessful status, indicating that the LDM server isn't
running, when that is not the case.  The subsequent "killall -9" abruptly
terminates all LDM processes, including those that have the product-queue
open for writing (e.g., the pqinsert(1) process and, probably, some of the
"rpc.ldmd" processes).  This is why the product-queue is getting corrupted.

I do not know why the "kill -s 0" indicates that the LDM server
isn't running.  You might check your documentation on that command.
It's also possible that an "ldmadmin stop" was executed just before
the script but that not all the processes had terminated.

To fix this, either the "kill -s 0" check needs to be made reliable
(for example, by executing it several times before concluding that the
server is down), or the "killall" command should be replaced with an
"ldmadmin stop" so that processes that have the product-queue open for
writing have a chance to close it.  If the pqinsert(1) process was
started by an EXEC entry in the LDM configuration-file (ldmd.conf), then
it will receive a SIGINT and terminate gracefully.
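
As an illustration of executing the check several times, here is a
minimal sketch.  The function name "ldm_is_running", the three attempts,
and the one-second delay are my own assumptions and not anything the LDM
package provides; the PID-file path is whatever your script already uses.
Note also that "kill -0" fails when the caller isn't permitted to signal
the process, so the check should run as the same user as the LDM.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM): succeed if the process named
# in the PID file answers "kill -0" on any of several attempts.
#   $1 = PID file, $2 = number of attempts (default 3),
#   $3 = seconds to sleep between attempts (default 1)
ldm_is_running() {
    pidfile=$1
    attempts=${2:-3}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Re-read the file on every pass in case the server was restarted
        # between attempts and wrote a new PID.
        if [ -r "$pidfile" ]; then
            pid=$(cat "$pidfile")
            if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
                return 0
            fi
        fi
        i=$((i + 1))
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

The first line of your script could then become something like
"ldm_is_running ldmd.pid && exit 0".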

In any case, a "killall -9" probably should not be executed.  Try
an "ldmadmin stop" instead.
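
If you want the script to confirm that the server has actually gone away
after an "ldmadmin stop" and before it touches the product-queue,
something along these lines might do.  This is only a sketch: the
function name "wait_for_exit" and the timeout values are assumptions of
mine, and I haven't run it against your installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM): poll a process with "kill -0"
# until it disappears.
#   $1 = process ID, $2 = maximum seconds to wait (default 30)
# Returns 0 once the process is gone, 1 if it is still there at the limit.
wait_for_exit() {
    pid=$1
    timeout=${2:-30}
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use in the watchdog script (the 60-second limit is arbitrary):
#   pid=$(cat ldmd.pid)
#   ldmadmin stop
#   wait_for_exit "$pid" 60 || exit 1   # give up rather than "killall -9"
```

Giving up and mailing a warning when the wait times out leaves the
product-queue alone, which is the point of avoiding the "killall -9".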

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: EVX-684652
Department: Support IDD TIGGE
Priority: Normal
Status: Closed