
[TIGGE #CUA-629523]: Re: dataportal not receiving data from tigge-ldm.ecmwf.int



Manuel,

> This is the output of "lsof | egrep 'PID|unidata'" on tigge-portal:
> COMMAND    PID USER FD  TYPE  DEVICE SIZE NODE NAME
> rpc.ldmd 29317 ldm  0u  IPv4 1017103      TCP  *:unidata-ldm (LISTEN)
> rpc.ldmd 29317 ldm  1u  IPv4 1168838      TCP  tigge-portal.ecmwf.int:unidata-ldm->tigge-ldm.ecmwf.int:45328 (CLOSE_WAIT)
> rpc.ldmd 29321 ldm  0u  IPv4 1017103      TCP  *:unidata-ldm (LISTEN)
> rpc.ldmd 29321 ldm  4u  IPv4 2421682      TCP  tigge-portal.ecmwf.int:48653->tigge-ldm.ecmwf.int:unidata-ldm (ESTABLISHED)
> rpc.ldmd 29322 ldm  0u  IPv4 1017103      TCP  *:unidata-ldm (LISTEN)
> rpc.ldmd 29322 ldm  3u  IPv4 3064475      TCP  tigge-portal.ecmwf.int:55991->tigge-ldm.ecmwf.int:unidata-ldm (SYN_SENT)
> rpc.ldmd 29322 ldm  4u  IPv4 2421860      TCP  tigge-portal.ecmwf.int:48659->tigge-ldm.ecmwf.int:unidata-ldm (ESTABLISHED)
> rpc.ldmd 29323 ldm  0u  IPv4 1017103      TCP  *:unidata-ldm (LISTEN)
> rpc.ldmd 29323 ldm  3u  IPv4 3064477      TCP  tigge-portal.ecmwf.int:55992->tigge-ldm.ecmwf.int:unidata-ldm (ESTABLISHED)
> rpc.ldmd 29325 ldm  0u  IPv4 1017103      TCP  *:unidata-ldm (LISTEN)
> rpc.ldmd 29325 ldm  3u  IPv4 3064474      TCP  tigge-portal.ecmwf.int:55990->tigge-ldm.ecmwf.int:unidata-ldm (SYN_SENT)
> rpc.ldmd 29326 ldm  0u  IPv4 1017103      TCP  *:unidata-ldm (LISTEN)
> rpc.ldmd 29326 ldm  4u  IPv4 2421808      TCP  tigge-portal.ecmwf.int:48657->tigge-ldm.ecmwf.int:unidata-ldm (ESTABLISHED)
> 
> All rpc.ldmd processes listen on port 388. This is because when a process
> calls fork(2), the child inherits the open file descriptors of the parent
> process. This is normal behaviour.

One of the very first things a child LDM process does is close the listening 
socket (see "server/ldmd.c"; search for "fork()"). Therefore, in my opinion, you 
should never see what you saw unless something is very wrong.
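
The fork-then-close pattern described above can be illustrated with a minimal Python sketch. It is not the code in "server/ldmd.c": it binds to an ephemeral loopback port rather than 388, and the pipe exists only so the child can report back whether its inherited copy of the listening socket is really gone. The function name is hypothetical.

```python
import os
import socket


def child_closes_listener() -> tuple[bool, bool]:
    """Fork a child that closes its inherited copy of a listening socket.

    Returns (child_closed, parent_listening): whether the child no longer
    held the descriptor after close(), and whether the parent's copy is
    still open afterwards. Illustrative sketch only (Unix-only, uses fork).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port, not the LDM's 388
    listener.listen()
    fd = listener.fileno()
    r, w = os.pipe()

    pid = os.fork()
    if pid == 0:  # child process
        status = b"1"  # "still open" unless proven otherwise
        try:
            os.close(r)
            listener.close()  # drop the inherited listening socket first
            try:
                os.fstat(fd)  # succeeds only if the descriptor is still valid
            except OSError:
                status = b"0"  # descriptor is gone, as intended
            os.write(w, status)
        finally:
            os._exit(0)  # never fall through into the parent's code

    # Parent process: keeps its own copy of the listening socket.
    os.close(w)
    reply = os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)
    try:
        os.fstat(listener.fileno())
        parent_listening = True
    except OSError:
        parent_listening = False
    listener.close()
    return reply == b"0", parent_listening
```

Because only the parent keeps the listening descriptor, lsof(8) would normally show a single process in the LISTEN state for that socket.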

Also, the ps(1) output you sent showed multiple top-level LDM servers. While 
not impossible, this also shouldn't happen.
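
One way to pick out top-level servers mechanically is to treat an rpc.ldmd process as top-level when its parent is not itself an rpc.ldmd. The sketch below assumes the process list comes from "ps -eo pid,ppid,comm"; the helper names and the sample PIDs are hypothetical, not taken from the tigge hosts.

```python
from typing import Iterable, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    comm: str


def parse_ps(text: str) -> list[Proc]:
    """Parse `ps -eo pid,ppid,comm` output, skipping the header line."""
    procs = []
    for line in text.splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) == 3:
            procs.append(Proc(int(fields[0]), int(fields[1]), fields[2].strip()))
    return procs


def top_level_servers(procs: Iterable[Proc], name: str = "rpc.ldmd") -> list[int]:
    """Return PIDs of `name` processes whose parent is not also `name`."""
    items = list(procs)
    server_pids = {p.pid for p in items if p.comm == name}
    return sorted(p.pid for p in items
                  if p.comm == name and p.ppid not in server_pids)


# Hypothetical sample: two independent servers (100 and 200), each with children.
SAMPLE = """  PID  PPID COMMAND
  100     1 rpc.ldmd
  101   100 rpc.ldmd
  102   100 pqact
  200     1 rpc.ldmd
  201   200 rpc.ldmd
"""
```

For the sample, top_level_servers(parse_ps(SAMPLE)) yields [100, 200]; a healthy host would yield a single PID.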

The netstat(1) utility on one of our Linux systems has a "-p" option that 
prints the PID of the process that owns each socket. Can you use it to check 
whether there really are multiple LDM listeners?
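
Assuming the Linux net-tools layout for "netstat -tlnp" (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name), a small parser can list the listening sockets on port 388. netstat prints one line per socket, so more than one line for port 388 would point to more than one independent listener. The function name and the sample text are hypothetical; without "-n" the port appears as "unidata-ldm" and should be passed as such.

```python
def listeners_on_port(netstat_output: str, port: str = "388") -> list[tuple[str, str]]:
    """Return (local_address, "PID/program") for TCP sockets in LISTEN on `port`.

    Assumes the column layout of Linux `netstat -tlnp`; header lines and
    non-TCP lines are skipped.
    """
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local, state, owner = fields[3], fields[5], fields[6]
        # rsplit handles both "0.0.0.0:388" and IPv6 forms such as ":::388"
        if state == "LISTEN" and local.rsplit(":", 1)[-1] == port:
            found.append((local, owner))
    return found


# Hypothetical sample output, for illustration only.
SAMPLE = (
    "Active Internet connections (only servers)\n"
    "Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name\n"
    "tcp        0      0 0.0.0.0:388             0.0.0.0:*               LISTEN      100/rpc.ldmd\n"
    "tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      50/sshd\n"
)
```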

> I suppose it is the one with the lowest PID. I have been digging through the
> logs, and this is the extract from the log file from when I last started the LDM:
> 
> Apr 11 08:46:04 tigge-ldm rpc.ldmd[31899] NOTE: Starting Up (version: 6.4.5.1; built: Jan 23 2006 22:38:02)
> Apr 11 08:46:04 tigge-ldm rpc.ldmd[31899] NOTE: Using local address 0.0.0.0:388
> Apr 11 08:46:04 tigge-ldm pqact[31903] NOTE: Starting Up
> Apr 11 08:46:04 tigge-ldm rtstats[31904] NOTE: Starting Up (31899)
> Apr 11 08:46:04 tigge-ldm tigge-portal[31907] NOTE: Starting Up(6.4.5.1): tigge-portal.ecmwf.int:388 20060411074604.938 TS_ENDT {{ANY,  "\.missing$"}}
> Apr 11 08:46:04 tigge-ldm dataportal[31906] NOTE: Starting Up(6.4.5.1): dataportal.ucar.edu:388 20060411074604.938 TS_ENDT {{ANY,  "\.missing$"}}
> Apr 11 08:46:04 tigge-ldm pqact[31903] INFO: Successfully read configuration-file "etc/tigge_pqact.conf"
> Apr 11 08:46:05 tigge-ldm pqact[31903] INFO: TS_ZERO TS_ENDT {{ANY,  "missing"}}
> Apr 11 08:46:05 tigge-ldm pqact[31903] INFO:        0 20060411084605.347 ANY 000  _BEGIN_
> Apr 11 08:46:05 tigge-ldm dataportal[31906] INFO: No matching data-product in product-queue
> Apr 11 08:46:05 tigge-ldm tigge-portal[31907] INFO: No matching data-product in product-queue
> Apr 11 08:46:05 tigge-ldm dataportal[31906] NOTE: LDM-6 desired product-class: 20060411074605.349 TS_ENDT {{ANY,  "\.missing$"}}
> Apr 11 08:46:05 tigge-ldm tigge-portal[31907] NOTE: LDM-6 desired product-class: 20060411074605.349 TS_ENDT {{ANY,  "\.missing$"}}
> Apr 11 08:46:05 tigge-ldm dataportal[31906] INFO: Connected to upstream LDM-6 on host dataportal.ucar.edu using port 388
> Apr 11 08:46:05 tigge-ldm dataportal[31906] NOTE: Upstream LDM-6 on dataportal.ucar.edu is willing to be a primary feeder
> pqinsert INFO:  9205744 20060411084605.849     EXP 000 z_tigge_c_ecmf_20060410120000.manifest
> Apr 11 08:46:06 tigge-ldm rpc.ldmd[31899] INFO: RPC buffer sizes for dataportal.ucar.edu: send=16384; recv=87380
> Apr 11 08:46:06 tigge-ldm dataportal[31913] INFO: Connection from dataportal.ucar.edu
> pqinsert INFO:   428963 20060411084606.439     EXP 000 z_tigge_c_ecmf_20060410120000_0001_pf_pl_0090_002_0600_u.grib:88065
> pqinsert INFO:   428963 20060411084606.484     EXP 000 z_tigge_c_ecmf_20060410120000_0001_pf_pl_0090_002_0600_v.grib:88066
> 
> 
> After all this information, what do you want me to do? Do you still
> want me to go ahead with:
> ldmadmin stop
> kill remaining
> ldmadmin clean
> pqcheck -v
> check everything is gone
> ldmadmin start

Try using netstat(1) to check whether there are multiple listeners. Then, stop 
everything, restart, and see whether you get multiple top-level LDMs again.
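
For the "check everything is gone" step between "ldmadmin stop" and "ldmadmin start", a minimal Linux-only sketch can scan /proc/<pid>/comm for leftover LDM programs. The program-name set is an assumption drawn from the log excerpt above (rpc.ldmd, pqact, rtstats), and the function name is hypothetical; note that the kernel truncates comm to 15 characters.

```python
import os

# Assumed set of LDM program names, taken from the log excerpt; adjust as needed.
LDM_PROGRAMS = {"rpc.ldmd", "pqact", "rtstats"}


def running_pids(names: set[str], proc_root: str = "/proc") -> list[int]:
    """Return sorted PIDs whose /proc/<pid>/comm is in `names` (Linux only)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:  # process exited meanwhile, or is not readable
            continue
        if comm in names:
            pids.append(int(entry))
    return sorted(pids)
```

An empty result from running_pids(LDM_PROGRAMS) after "ldmadmin stop" would suggest that no LDM processes survived; any PIDs it returns are candidates for the "kill remaining" step.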

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: CUA-629523
Department: Support IDD TIGGE
Priority: Normal
Status: On Hold