[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

20030924: LDM-6 connection to time out and be reset



Tom,

> To: <address@hidden>
> From: Tom McDermott <address@hidden>
> Subject: LDM-6 Possible Problem
> Organization: UCAR/Unidata
> Keywords: 200309241848.h8OImak1025105

The above message contained the following:

> We have been having a number of network problems here at SUNY Brockport
> which may have exposed a possible problem with the LDM-6 software.
> 
> Under certain conditions, there may be a loss of communications between
> the downstream and upstream servers such that the downstream believes that
> the connection between them is good but that the upstream has no data to
> send, when in fact the upstream has plenty of data to send, but the
> downstream does not recognize this.   This situation can persist for
> many minutes or even hours.  The only remedy for this has been to
> restart the ldm server.  In other words, the lack of any
> data does not cause the connection to time out and be reset.

The situation you describe is definitely a problem and I thank you for
bringing it to my attention.

It is unlikely, however, that the downstream LDM is merely waiting.
Code in the downstream LDM causes it make a connection to the top-level
LDM on the upstream host and inquire as to whether or not its upstream 
LDM is still alive.  This happens every 60 seconds if the downstream LDM
hasn't received a data-product in that time.

> I have observed this happening on numerous occasions with all of our feeds
> over the past month.  Here is an example from yesterday.  On the HDS feed,
> a product was last received at 1401Z.  Then this sequence from the log:
> 
> Sep 23 14:02:18 vortex atm[26947]: ERROR: requester6.c:206: Connection to 
> upstream LDM closed
> Sep 23 14:02:48 vortex atm[26947]: Desired product class: 20030923140025.915 
> TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
> Sep 23 14:02:51 vortex atm[26947]: Connected to upstream LDM-6
> Sep 23 14:02:52 vortex atm[26947]: Upstream LDM is willing to feed
> Sep 23 14:04:00 vortex atm[26947]: ERROR: requester6.c:206: Connection to 
> upstream LDM closed
> Sep 23 14:04:30 vortex atm[26947]: Desired product class: 20030923140026.038 
> TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
> Sep 23 14:04:30 vortex atm[26947]: Connected to upstream LDM-6
> Sep 23 14:04:30 vortex atm[26947]: Upstream LDM is willing to feed
> Sep 23 14:06:39 vortex atm[26947]: ERROR: requester6.c:206: Connection to 
> upstream LDM closed
> Sep 23 14:07:09 vortex atm[26947]: Desired product class: 20030923140112.512 
> TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
> Sep 23 14:07:09 vortex atm[26947]: Connected to upstream LDM-6
> Sep 23 14:07:09 vortex atm[26947]: Upstream LDM is willing to feed
> 
> After this last message, there were no furthur messages regarding the HDS
> feed or HDS products received until 1456 when I recognized the situation
> and restarted the server.

The corresponding log-entries on the upstream host (atm.geo.nsf.gov) are

Sep 23 14:02:28 atm.geo.nsf.gov vortex(feed)[8145]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 23 14:02:28 atm.geo.nsf.gov vortex(feed)[8145]: up6.c:396: Product send 
failure: I/O error
Sep 23 14:02:52 atm.geo.nsf.gov vortex(feed)[17715]: up6.c:331: Starting 
Up(6.0.14/6): 20030923140025.915 TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
Sep 23 14:02:52 atm.geo.nsf.gov vortex(feed)[17715]: topo:  
vortex.esc.brockport.edu HDS
Sep 23 14:04:00 atm.geo.nsf.gov vortex(feed)[17715]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 23 14:04:00 atm.geo.nsf.gov vortex(feed)[17715]: up6.c:396: Product send 
failure: I/O error
Sep 23 14:04:30 atm.geo.nsf.gov vortex(feed)[18484]: up6.c:331: Starting 
Up(6.0.14/6): 20030923140026.038 TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
Sep 23 14:04:30 atm.geo.nsf.gov vortex(feed)[18484]: topo:  
vortex.esc.brockport.edu HDS
Sep 23 14:06:39 atm.geo.nsf.gov vortex(feed)[18484]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 23 14:06:39 atm.geo.nsf.gov vortex(feed)[18484]: up6.c:396: Product send 
failure: I/O error
Sep 23 14:07:09 atm.geo.nsf.gov vortex(feed)[20960]: up6.c:331: Starting 
Up(6.0.14/6): 20030923140112.512 TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
Sep 23 14:07:09 atm.geo.nsf.gov vortex(feed)[20960]: topo:  
vortex.esc.brockport.edu HDS
Sep 23 14:56:49 atm.geo.nsf.gov vortex(feed)[20960]: up6.c:288: nullproc_6() 
failure to vortex.esc.brockport.edu: RPC: Unable to receive; errno = Connection 
reset by peer

Other than the "Broken pipe" message, there's no indication as to why
the connection was broken.  I've seen this type of behavior before.
The reason for the disconnection has always lain outside the LDM system
itself (e.g., router changes, firewall changes, misbehaving firewalls,
broken cables).

The last entry above is particularly interesting because it indicates
that the upstream LDM on Atm was perfectly happy from 14:07:09 UTC until
14:56:49 UTC when it couldn't flush the connection.  Upstream LDM-s send
a NULLPROC message down the connection every 30 seconds if they haven't
sent a data-product in that time -- this prevents the downstream LDM-s
from establishing connections to their upstream hosts more often than
necessary for the ISALIVE inquiry that I mentioned earlier.

If you didn't receive any data-products from Atm matching the criteria

    20030923140112.512 TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}

between 14:07:09 UTC and 14:56:49 UTC then Atm probably didn't have any
to send you.

> Here is an example from today.  The most recent IDS|DDPLUS product was
> received at 1727 today.  Then:
> 
> Sep 24 17:27:55 vortex atm[24309]: ERROR: requester6.c:206: Connection to 
> upstream LDM closed
> Sep 24 17:28:25 vortex atm[24309]: Desired product class: 20030924171035.238 
> TS_ENDT {{DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS,  ".*"}}
> Sep 24 17:28:25 vortex atm[24309]: Connected to upstream LDM-6
> Sep 24 17:28:25 vortex atm[24309]: Upstream LDM is willing to feed
> Sep 24 17:40:18 vortex atm[24308]: ERROR: requester6.c:206: Connection to 
> upstream LDM closed
> Sep 24 17:40:48 vortex atm[24308]: Desired product class: 20030924154411.333 
> TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
> Sep 24 17:40:51 vortex atm[24308]: Connected to upstream LDM-6
> Sep 24 17:40:51 vortex atm[24308]: Upstream LDM is willing to feed
> Sep 24 17:48:41 vortex atm[24309]: ERROR: requester6.c:206: Connection to 
> upstream LDM closed
> Sep 24 17:49:11 vortex atm[24309]: Desired product class: 20030924171035.238 
> TS_ENDT {{DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS,  ".*"}}
> Sep 24 17:49:11 vortex atm[24309]: Connected to upstream LDM-6
> Sep 24 17:49:11 vortex atm[24309]: Upstream LDM is willing to feed
> 
> No more messages appear in the log regarding this feed and up to now
> (1838Z), no IDS|DDPLUS products have been received.

The corresponding log-entries on Atm are

Sep 24 17:25:41 atm.geo.nsf.gov vortex(feed)[29743]: up6.c:331: Starting 
Up(6.0.14/6): 20030924170902.070 TS_ENDT 
{{DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS,  ".*"}}
Sep 24 17:25:41 atm.geo.nsf.gov vortex(feed)[29743]: topo:  
vortex.esc.brockport.edu DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS
Sep 24 17:27:55 atm.geo.nsf.gov vortex(feed)[29743]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 24 17:27:55 atm.geo.nsf.gov vortex(feed)[29743]: up6.c:396: Product send 
failure: I/O error
Sep 24 17:28:25 atm.geo.nsf.gov vortex(feed)[481]: up6.c:331: Starting 
Up(6.0.14/6): 20030924171035.238 TS_ENDT 
{{DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS,  ".*"}}
Sep 24 17:28:25 atm.geo.nsf.gov vortex(feed)[481]: topo:  
vortex.esc.brockport.edu DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS
Sep 24 17:40:18 atm.geo.nsf.gov vortex(feed)[16542]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 24 17:40:18 atm.geo.nsf.gov vortex(feed)[16542]: up6.c:396: Product send 
failure: I/O error
Sep 24 17:40:51 atm.geo.nsf.gov vortex(feed)[9494]: up6.c:331: Starting 
Up(6.0.14/6): 20030924154411.333 TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
Sep 24 17:40:51 atm.geo.nsf.gov vortex(feed)[9494]: topo:  
vortex.esc.brockport.edu HDS
Sep 24 17:43:07 atm.geo.nsf.gov vortex(feed)[9494]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 24 17:43:07 atm.geo.nsf.gov vortex(feed)[9494]: up6.c:396: Product send 
failure: I/O error
Sep 24 17:43:38 atm.geo.nsf.gov vortex(feed)[10229]: up6.c:331: Starting 
Up(6.0.14/6): 20030924155158.370 TS_ENDT {{HDS,  "(^[A-OQ-X])|(^[YZ].[HIQR])"}}
Sep 24 17:43:38 atm.geo.nsf.gov vortex(feed)[10229]: topo:  
vortex.esc.brockport.edu HDS
Sep 24 17:48:41 atm.geo.nsf.gov vortex(feed)[481]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 24 17:48:41 atm.geo.nsf.gov vortex(feed)[481]: up6.c:396: Product send 
failure: I/O error
Sep 24 17:49:11 atm.geo.nsf.gov vortex(feed)[13572]: up6.c:331: Starting 
Up(6.0.14/6): 20030924171035.238 TS_ENDT 
{{DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS,  ".*"}}
Sep 24 17:49:11 atm.geo.nsf.gov vortex(feed)[13572]: topo:  
vortex.esc.brockport.edu DIFAX|FNEXRAD|FSL2|UNIWISC|IDS|DDPLUS
Sep 24 17:57:22 atm.geo.nsf.gov vortex(feed)[10229]: up6.c:168: HEREIS: RPC: 
Unable to send; errno = Broken pipe
Sep 24 17:57:22 atm.geo.nsf.gov vortex(feed)[10229]: up6.c:396: Product send 
failure: I/O error

It appears from this that something outside the LDM systems on Atm and
Vortex is destroying the connections between the two.

> From these 2 examples it appears that the condition that triggers this
> problem seems to occur shortly after a new connection is established, but
> I can't be sure that this is always the case.
> 
> I want to emphasize that I am aware that it is our network problems that
> are causing this, and not the ldm itself.  My concern is how the ldm
> handles this situation.

As you can see from your log-entries, the downstream LDM closes and
reconnects very soon after the upstream LDM has gone away because the
connection has been somehow destroyed.

> As I recall, with the ldm version 5, if there was
> a problem with the connection, it would eventually be recognized and it
> would time out and the connection reset.  (I believe after 12 minutes of
> inactivtity.)  I suppose the ldm-6 should behave in the same fashion, but
> this does not appear to be the case.

LDM 6's reconnection strategy is a little different than LDM 5's -- but
it should be no less robust and considerably more efficient.

> I hope you will excuse some vagueness in my presentation of this problem.

No problem.  I'm glad to get this report.

What we have to do now is figure out why the connections between Vortex
and Atm are being destroyed.  They're occurring a such a high frequency
that it's a wonder you get any data at all.

What are the "network problems" of which you spoke?

> Tom
> -----------------------------------------------------------------------------
> Tom McDermott                         Email: address@hidden
> Systems Administrator                 Phone: (585) 395-5718
> Earth Sciences Dept.                  Fax: (585) 395-2416
> SUNY College at Brockport

Regards,
Steve Emmerson