
20031013: data stops ingesting from SDI



David,

> Date: 13 Oct 2003 12:15:19 -0500
> From: David Larson <address@hidden>
> To: address@hidden
> Subject: data stops ingesting from SDI

The above message contained the following:

> Sorry to email you directly like this, but I thought I'd run something
> by you that seems kinda low-level.  But before I get to that ...
> 
> Every day or so, data stops arriving to our LDM from SSEC/SDI.
> 
> Post mortem analysis goes something like this:
> 
> Hmmm, the data in our system is old!  Let's check out LDM:
> 
>       [ldm@decoder3 ~]$ ldmadmin watch
>       (Type ^D or ^C when finished)
>       ^C
>       [ldm@decoder3 ~]$
> 
> After a minute or so (plenty of time), no data arrived so I killed it.
> Well, what is the pqing doing?
> 
>       [ldm@decoder3 ~/etc]$ strace -p 14206
>       write(0, "\0", 1 <unfinished ...>
>       [ldm@decoder3 ~/etc]$
> 
> Hmmmm, that's odd, it is stuck on a write to fd#0.  What file is it
> trying to write a null byte to?  Let's see what the file descriptor
> points to:
> 
>       [ldm@decoder3 ~/etc]$ /usr/sbin/lsof | grep pqing | grep 0u
>       pqing  14206  ldm  0u  IPv4  2087720  TCP decoder3.digitalcyclone.com:39519->noaaport.digitalcyclone.com:1501 (ESTABLISHED)
> 
> Yikes!  That is the TCP connection to our SDI box!  And the other end of
> that connection is just a "cat" sending data from a FIFO to the TCP
> connection ...

That's a problem.

> Why is that happening?  I looked at pqing.c and saw some VOODOO that
> seemed pretty reasonable, and also gave me a clue as to why this is
> somewhat unpredictable (it only does the write when the select
> times out).
> 
> It makes perfect sense to try to write to the socket and induce a
> connection reset by peer,

Yup.

> but unfortunately, I believe that if the other
> end of the connection doesn't eat these up, the buffers on both sides of
> the TCP connection will eventually fill up and cause the pqing to block.

Yup.
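
To make the failure mode concrete, here is a minimal sketch (not code from
pqing.c) of what happens when one end keeps writing and the other end never
reads.  It assumes a local socket pair as a stand-in for the real TCP
connection, and the name bytes_until_full is made up for illustration.  The
writer is put in non-blocking mode so that the full buffer shows up as an
error; a blocking writer, which is what the strace output suggests pqing is,
would simply hang in write() at that point.

```python
import socket


def bytes_until_full(chunk: bytes = b"\0" * 4096) -> int:
    """Write to a peer that never reads.

    Returns the number of bytes the kernel accepted before it refused
    to buffer any more.
    """
    # socketpair() is an assumption standing in for the pqing <-> SDI link.
    writer, idle_peer = socket.socketpair()
    # Non-blocking: a blocking socket would hang here instead of raising.
    writer.setblocking(False)
    sent = 0
    try:
        while True:
            try:
                sent += writer.send(chunk)
            except BlockingIOError:
                # Send and receive buffers are both full.
                return sent
    finally:
        writer.close()
        idle_peer.close()


if __name__ == "__main__":
    print(f"buffers full after {bytes_until_full()} bytes")
```

The exact byte count depends on the platform's socket buffer sizes.  With one
NUL byte per select timeout the real connection takes far longer to fill,
which would fit the "every day or so" symptom.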

> Actually, I kinda like the VOODOO and I'd really like to enhance the SDI
> side so that it discards anything that arrives (it only needs to do a
> non-blocking read every now and then to empty the buffer).  Otherwise, I
> might settle for TCP KEEPALIVE (not as good).
> 
> Thoughts?

The voodoo in pqing.c is there for a reason: it's the minimal-effort way
for the pqing process to determine if the upstream feeder is still
alive.  Upstream feeders MUST eat the NUL bytes that the pqing process
sends their way.  (The TCP_KEEPALIVE option is unusable because not all
TCP layers support it, and those that do tend to have a two-hour
timeout.)
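
As a rough illustration of "eating" the NUL bytes on the feeder side, the
sketch below polls the socket with a zero timeout and discards whatever is
pending.  It is only an assumption about how a feeder could do this, not how
the SDI software works; the function name drain_incoming and the socket pair
used in the demo are hypothetical.

```python
import select
import socket


def drain_incoming(sock: socket.socket, bufsize: int = 4096) -> int:
    """Discard any bytes pending on sock without blocking.

    Returns the number of bytes discarded.
    """
    discarded = 0
    while True:
        # Zero timeout: just ask whether anything is readable right now.
        readable, _, _ = select.select([sock], [], [], 0)
        if not readable:
            return discarded
        data = sock.recv(bufsize)
        if not data:
            # Empty read means the peer closed the connection.
            return discarded
        discarded += len(data)


if __name__ == "__main__":
    # Hypothetical stand-ins for the pqing end and the feeder end.
    pqing_end, feeder_end = socket.socketpair()
    pqing_end.sendall(b"\0" * 3)
    print("discarded", drain_incoming(feeder_end), "bytes")
    pqing_end.close()
    feeder_end.close()
```

A feeder would call something like this every now and then between sends.
Using select() rather than switching the socket to non-blocking mode leaves
the feeder's own writes unaffected.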

An easy alternative would be to comment out the code in pqing.c that
sends the NUL byte.  This risks the pqing process never learning
that the upstream process died.  Since this can occur, it will occur.

> Thanks,
> Dave

Regards,
Steve Emmerson
LDM Developer