
[LDM #FBV-178390]: LDM not talking



Karen,

> Interesting problem. If you could give me any ideas on what might have
> gone wrong here I would greatly appreciate it. I hope this all makes
> sense.....
> 
> Machine A(pluto) is sending data to Machine B, it also gets data from
> Machine B.
> 
> Machine B(dontpanic) receives data from Machine A and sends it to
> Machine C, and vice-versa.
> 
> Machine C(kyodai) gets data from Machine B and also sends to Machine B.
> 
> So basically Machine B is a middle-man for A and C (which do not
> interact directly -- different networks with Machine B in a DMZ).
> 
> Today Machine C created a new file which was inserted into its queue and
> then sent to Machine B.  However, when Machine B tried to send the file
> on to Machine A, it got an error
> 
> Log snippet from Machine B -- 20287 is the PID of the rpc.ldmd for
> Machine A :
> 
> Apr 26 16:38:03 dontpanic 172.16.20.32[20287] ERROR: pqe_new(): zero
> product size
> Apr 26 16:38:03 dontpanic 172.16.20.32[20287] ERROR: pqe_new() failed:
> Invalid argument: d41d8cd98f00b204e9800998ecf8427e        0
> 20070426163803.011     EXP 000
> wdssii/KTLX_RVP.20070426.163650.vcp32.2.dN5.nc.gz
> Apr 26 16:38:03 dontpanic rpc.ldmd[20251] NOTE: child 20287 exited with
> status 10

The above indicates that the downstream LDM on host Dontpanic received
a zero-length data-product from the upstream LDM on host 172.16.20.32.
Due to a bug in the code, this caused the downstream LDM to terminate.
This would only happen if the data transfer was occurring in ALTERNATE
mode.
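
One detail in the log supports the zero-length reading: the signature
d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input. The
short Python sketch below checks that. It assumes the product signature
was computed as an MD5 checksum of the product's data (which I believe
is the pqinsert default, but which may not hold if the product was
inserted some other way); the function names are only illustrative and
are not part of the LDM.

```python
import hashlib

# Signature field copied from the pqe_new() error in the log above.
LOGGED_SIGNATURE = "d41d8cd98f00b204e9800998ecf8427e"


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def is_empty_signature(signature: str) -> bool:
    """True if `signature` equals the MD5 digest of zero bytes of data.

    Assumption: the signature is an MD5 checksum of the product data.
    """
    return signature.strip().lower() == md5_hex(b"")


if __name__ == "__main__":
    print(is_empty_signature(LOGGED_SIGNATURE))
```

If the check holds, the product was most likely already empty when its
signature was computed, rather than truncated in transit.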

I've fixed the code and it will be in the next release (6.6.4).

Why was a zero-length data-product generated?

> However, the data originated on Machine C, so if there was a problem
> with it I'm not sure how it got into the queue successfully on Machine
> C, or how it got sent successfully to Machine B.
> 
> Log snippet from Machine A shows:
> 
> Apr 26 16:38:03 pluto dontpanic(feed)[7356] ERROR: feed or notify
> failure; COMINGSOON: RPC: Remote system error
> Apr 26 16:38:03 pluto rpc.ldmd[7351] NOTE: child 7356 exited with status 6

The above indicates that the upstream LDM on host Pluto couldn't
send a data-product to the downstream LDM on host Dontpanic.
Consequently, it terminated.  This happened at the same time that
the downstream LDM on host Dontpanic that received the
zero-length data-product terminated.  Was the zero-length
data-product sent from host Pluto?

> Thereafter no data was processed between Machines A and B.  Machine C
> was idle as it doesn't have anything to do unless it gets the data from
> A (via B) first.
> 
> Restarting LDM on Machines A and C had no effect (not surprising), but
> restarting LDM on Machine B solved the problem.
> 
> --
> Beware programmers who carry screwdrivers.
> 
> -------------------------------------------
> address@hidden
> 
> Phone:  405-325-6982
> Cell: 405-834-8559
> SAIC/Systems Analyst
> National Severe Storms Laboratory

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: FBV-178390
Department: Support LDM
Priority: Normal
Status: On Hold