
20030919: atm.geo.nsf.gov is not responding



>From: Bill Noon <address@hidden>
>Organization: Cornell
>Keywords: 200309191249.h8JCnrk1029145 IDD

Bill,

>Folks -- It looks like atm.geo.nsf.gov is no longer servicing LDM 
>requests.  This started at just after 0Z today.

I just logged onto atm and see that its LDM is running, is feeding
58 downstream connections, and has been running continuously for the
past 7 days.

Looking through the LDM log file, I see a failure to
idd.nrcc.cornell.edu at 00:10:21 Z:

Sep 19 00:10:21 atm.geo.nsf.gov idd(feed)[24347]: up6.c:288: nullproc_6() failure to idd.nrcc.cornell.edu: RPC: Timed out

Interpretation of this error is that an attempt by idd.cornell to flush
the connection to atm failed due to an RPC timeout.  This might be due
to network congestion or some other network problem.  Logs on atm show
that the LDM there was running throughout this period and was servicing
numerous downstream connections.

I do see from the ancillary stats we log on atm that there was some
sort of a network glitch right at the time you stopped feeding:

Date/time      1, 5, 15 min load   connections  age  mem   swap wait rtstats
 ...
20030919.0009   0.15  0.30  0.51   55   7  62   8211 1070M 356M    1   1
20030919.0010   0.17  0.28  0.49   55   7  62   8272 1066M 361M    0   1
20030919.0011   2.64  1.11  0.78   51   7  58   8331 1078M 344M   25   0
20030919.0012   1.79  1.16  0.82   53   7  60   8392 1081M 350M   35   2
20030919.0013   0.78  0.99  0.78   54   7  61   8451 1086M 353M    2   2
20030919.0014   0.47  0.86  0.75   56   7  63   8470 1087M 359M    5   1
20030919.0015   0.38  0.77  0.72   55   7  62   8531 1086M 360M    1   1
 ...

The log messages at .0011 show that a number of connections went into
some sort of a WAIT state (second column from right) at the same time
that the number of established downstream connections (5th column from
left) dipped from 55 to 51.  The LDM apparently recovered from this
rapidly, since the number of WAITing connections dropped back to 1
within 5 minutes and the number of active connections went back up to
what it had been.
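
For what it is worth, the reading of the stats above can be reproduced
with a short script.  This is only a sketch: the rows are copied from
the excerpt, the column positions follow the description above, and the
WAIT threshold of 10 is an arbitrary value picked for illustration, not
anything the LDM or our logging defines.

```python
# Rows copied from the stats excerpt in this message.
ROWS = """\
20030919.0009   0.15  0.30  0.51   55   7  62   8211 1070M 356M    1   1
20030919.0010   0.17  0.28  0.49   55   7  62   8272 1066M 361M    0   1
20030919.0011   2.64  1.11  0.78   51   7  58   8331 1078M 344M   25   0
20030919.0012   1.79  1.16  0.82   53   7  60   8392 1081M 350M   35   2
20030919.0013   0.78  0.99  0.78   54   7  61   8451 1086M 353M    2   2
20030919.0014   0.47  0.86  0.75   56   7  63   8470 1087M 359M    5   1
20030919.0015   0.38  0.77  0.72   55   7  62   8531 1086M 360M    1   1
"""

WAIT_THRESHOLD = 10  # hypothetical cutoff, chosen only for this example


def parse_row(line):
    """Split one stats row into the fields discussed in the text."""
    f = line.split()
    return {
        "time": f[0],               # date/time stamp
        "load1": float(f[1]),       # 1-minute load average
        "downstream": int(f[4]),    # 5th column from left
        "wait": int(f[-2]),         # second column from right
    }


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Minutes in which an unusual number of connections sat in a WAIT state.
flagged = [r["time"] for r in rows if r["wait"] >= WAIT_THRESHOLD]

# Lowest number of established downstream connections in the window.
low = min(rows, key=lambda r: r["downstream"])

print("minutes with WAIT >= %d: %s" % (WAIT_THRESHOLD, ", ".join(flagged)))
print("fewest downstream connections: %d at %s" % (low["downstream"], low["time"]))
```

On the rows shown, only the .0011 and .0012 entries exceed the cutoff,
and the downstream count bottoms out at .0011.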

Please first try an ldmping to atm and then a notifyme so we can
try to understand why you are not able to connect.
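
If it helps, the two checks can be scripted.  The sketch below only
builds and prints the command lines; it does not run anything.  The
flags are written from memory of the LDM utilities (-h for the remote
host, -v and -l for verbose logging, -f for the feedtype, -o for the
time offset in seconds), so please check them against the man pages on
your system.  The 3600 second offset is just an example value.

```python
ATM = "atm.geo.nsf.gov"


def ldmping_cmd(host):
    """Command line to check that the LDM on `host` answers at all."""
    return ["ldmping", "-h", host]


def notifyme_cmd(host, feedtype="ANY", offset_s=3600):
    """Command line to ask `host` which products it would send us.

    feedtype and offset_s are illustrative defaults (assumptions).
    """
    return [
        "notifyme",
        "-v",            # verbose
        "-l", "-",       # log to the terminal instead of the LDM log file
        "-h", host,      # upstream host to query
        "-f", feedtype,  # feedtype to ask about
        "-o", str(offset_s),  # start this many seconds in the past
    ]


# Run the ping first; only move on to notifyme if the ping gets a reply.
for cmd in (ldmping_cmd(ATM), notifyme_cmd(ATM)):
    print(" ".join(cmd))
```

Running both as the LDM user on idd.nrcc, and again against thelma, would
show whether the failure is specific to atm or common to every upstream.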

>I tried to fail idd.nrcc.cornell.edu over to motherlode or thelma and 
>they don't have idd.nrcc.cornell.edu authorized.

I just checked, and you are allowed on thelma (all *.edu sites are).  I
verified the general allow by logging on to a machine at the University
of Virginia and a machine at the University of North Carolina at
Asheville, neither of which is listed explicitly in thelma's allow
list.  From each I was able to do a notifyme to thelma, which shows
that I could feed from thelma to them.  Given that you are allowed on
both atm and thelma and they are both up and servicing other requests,
I have to believe that the problem is somehow localized to the Cornell
I2 connection.

I also verified that you are allowed on unidata2.ssec.wisc.edu, and it
has the full complement of data.  Try failing over to it.

>I have pointed 
>snow.nrcc.cornell.edu to motherlode and it is getting fed but I pay for 
>every MB of data transfered to that machine.

unidata2.ssec is on I2, so you shouldn't have the pay-per-byte issue
there as long as idd.nrcc can connect over I2.  If it can't due to some
sort of problem with your I2 connection, I would say that you have a
good case for not paying for the traffic over your commodity internet
link.  By the way, snow should be able to connect to atm, and atm has a
lot more data in its LDM queue than motherlode.

>Can you authorize idd.nrcc.cornell.edu on some backup machines for IDD 
>traffic?

It already was on atm, thelma, and unidata2.ssec, three of the top-level
IDD relays in the nation.  I just added you to the LSU IDD top-level
relay, seistan.srcc.lsu.edu.  I allowed any machine from the
nrcc.cornell.edu domain:

allow   ANY-WSI-NLDN    ^[a-z].*\.nrcc\.cornell\.edu$

so idd.nrcc and snow.nrcc should be able to connect there. seistan
has about 5600 seconds of data in its queue at the moment.
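
As a sanity check on the host pattern in that allow line, here is a
small sketch.  It assumes the dot before "edu" is meant to be escaped
(cornell\.edu), and it uses Python's re module, which is not the POSIX
regex library the LDM uses but behaves the same way for a pattern this
simple.  The last two host names are made up for the test.

```python
import re

# Host pattern from the allow line, with the dot before "edu" escaped
# (an assumption about the intended pattern).
ALLOW = re.compile(r"^[a-z].*\.nrcc\.cornell\.edu$")

hosts = [
    "idd.nrcc.cornell.edu",
    "snow.nrcc.cornell.edu",
    "idd.cornell.edu",                   # hypothetical: outside nrcc
    "idd.nrcc.cornell.edu.example.com",  # hypothetical: wrong domain
]

# True where the host name would be covered by the allow entry.
matches = {h: bool(ALLOW.search(h)) for h in hosts}

for host, ok in matches.items():
    print("%-36s %s" % (host, "allowed" if ok else "not matched"))
```

Both idd.nrcc and snow.nrcc match; the two made-up names do not.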

>Also, is there a machine that has more than a couple of hours of data 
>buffered in its queues?

The machines with the largest queues are atm (4 GB) and thelma (6 GB).
The age of the oldest product on atm is about 10800 seconds at the
moment, so connecting to it would give you the most data.  thelma has
12000 seconds.  unidata2 has a 2 GB queue, and the age of the oldest
product in its queue is over 5500 seconds at the moment.
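
To put those ages in terms of your "couple of hours" question, the
sketch below converts the figures quoted in this message to hours and
lists the relays whose oldest product is more than two hours old.  The
numbers are the approximate values given above and change continuously,
so treat the result as a snapshot only.

```python
# Age of the oldest product in each queue, in seconds, as quoted in
# this message (approximate values at the time of writing).
OLDEST_AGE_S = {
    "atm": 10800,
    "thelma": 12000,
    "unidata2": 5500,
    "seistan": 5600,
}

TWO_HOURS_S = 2 * 3600

# Relays sorted from oldest to newest queue contents.
by_age = sorted(OLDEST_AGE_S, key=OLDEST_AGE_S.get, reverse=True)

# Relays holding more than two hours of data.
over_two_hours = [h for h in by_age if OLDEST_AGE_S[h] > TWO_HOURS_S]

for host in by_age:
    print("%-9s %6d s  = %.2f h" % (host, OLDEST_AGE_S[host], OLDEST_AGE_S[host] / 3600.0))
print("more than two hours: " + ", ".join(over_two_hours))
```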

>Thanks -- Bill Noon

Please keep us informed about your progress in getting idd connected
back to atm over I2.

>Northeast Regional Climate Center
>Cornell University

Tom Yoksas