Re: Data outage



Charles O'Brien wrote:
> 
> Anne,
> 
> To recap...
> Over the past weeks, we at WSI have been making some DNS changes.
> However, it only affected our "other" network (not the one on
> the LDM Data feed).  However, for some reason there were a handful
> of clients that could not get our data.  So, for those clients I
> had them add a few more allow/accept lines in their ldmd.conf
> (add sysu1.wsicorp.com, 198.115.158.1 as well as sysu1.uni.wsicorp.com).
> For the other clients, it worked.  Not with Purdue.
> 
> We did not drop their account.  It is currently deactivated because
> the errors and nullprocs going on were killing our system.
> 
> At some point we got LDMPING to work.  Then because I thought
> ANVIL was having DNS issues, I had Eric put my addresses in his
> /etc/hosts file.  This did not fix it.  It actually made it worse.
> 
> Traceroute/nslookup works fine:
> 
> traceroute anvil.eas.purdue.edu
> traceroute to anvil.eas.purdue.edu (128.210.168.99), 30 hops max, 40 byte packets
>  1  rt-wsi-bbn (198.115.158.249)  3 ms  2 ms  2 ms
>  2  s3-0-0-22.cambridge1-cr20.bbnplanet.net (4.1.134.229)  6 ms  5 ms  5 ms
>  3  p2-1.cambridge1-nbr1.bbnplanet.net (4.0.1.153)  5 ms  5 ms  5 ms
>  4  p3-0.cambridge1-nbr2.bbnplanet.net (4.0.5.18)  9 ms  5 ms  5 ms
>  5  p4-0.bstnma1-br1.bbnplanet.net (4.0.5.157)  6 ms  6 ms  6 ms
>  6  p9-0.nycmny1-nbr2.bbnplanet.net (4.24.6.50)  12 ms  12 ms  12 ms
>  7  p1-0.nycmny1-br2.bbnplanet.net (4.24.10.86)  12 ms  12 ms  12 ms
>  8  p4-0.nycmny1-br1.bbnplanet.net (4.24.6.225)  12 ms  12 ms  12 ms
>  9  p1-0.nycmny1-ba1.bbnplanet.net (4.24.6.230)  12 ms  12 ms  12 ms
> 10  a1-0.xnycmny4-uunet.bbnplanet.net (4.0.6.142)  12 ms  14 ms  24 ms
> 11  0.at-6-0-0.XL2.NYC9.ALTER.NET (152.63.18.226)  13 ms  18 ms  14 ms
> 12  0.so-7-0-0.XR1.NYC9.ALTER.NET (152.63.23.138)  12 ms  12 ms  12 ms
> 13  0.so-3-0-0.TR1.NYC9.ALTER.NET (152.63.22.98)  12 ms  12 ms  13 ms
> 14  125.at-5-0-0.TR1.CHI2.ALTER.NET (152.63.1.45)  43 ms  43 ms  43 ms
> 15  197.at-5-0-0.XR1.CHI4.ALTER.NET (152.63.65.49)  44 ms  44 ms  45 ms
> 16  195.ATM11-0-0.GW1.IND1.ALTER.NET (146.188.208.169)  48 ms  47 ms  52 ms
> 17  157.130.101.106 (157.130.101.106)  54 ms  70 ms  75 ms
> 18  cisco2-242.tcom.purdue.edu (128.210.242.7)  108 ms  73 ms  78 ms
> 19  anvil.eas.purdue.edu (128.210.168.99)  90 ms  81 ms  88 ms
> 
> nslookup anvil.eas.purdue.edu
> Server:         127.0.0.1
> Address:        127.0.0.1#53
> 
> Non-authorative answer:
> Name:   anvil.eas.purdue.edu
> Address: 128.210.168.99
> 
> nslookup 128.210.168.99
> Server:         127.0.0.1
> Address:        127.0.0.1#53
> 
> Non-authorative answer:
> 99.168.210.128.in-addr.arpa     name = anvil.eas.purdue.edu.
> 
> Authoritative answers can be found from:
> 210.128.in-addr.arpa    nameserver = ns2.purdue.edu.
> 210.128.in-addr.arpa    nameserver = pendragon.cs.purdue.edu.
> 210.128.in-addr.arpa    nameserver = harbor.ecn.purdue.edu.
> 210.128.in-addr.arpa    nameserver = ns.purdue.edu.
> ns.purdue.edu   internet address = 128.210.11.5
> ns2.purdue.edu  internet address = 128.210.11.57
> pendragon.cs.purdue.edu internet address = 128.10.2.5
> harbor.ecn.purdue.edu   internet address = 128.46.154.76
> 
> ldmping -h anvil.eas.purdue.edu. -l - -v
> Mar 19 21:03:06      State    Elapsed Port   Remote_Host           rpc_stat
> Mar 19 21:03:07  ADDRESSED   0.200509    0   anvil.eas.purdue.edu.  RPC: Unable to receive; errno = Connection reset by peer
> Mar 19 21:03:32 SVC_UNAVAIL   0.239751    0   anvil.eas.purdue.edu.  RPC: Unable to receive; errno = Connection reset by peer
> Mar 19 21:03:57 SVC_UNAVAIL   0.291509    0   anvil.eas.purdue.edu.  RPC: Unable to receive; errno = Connection reset by peer
> 
> Eric, for grins, could you reboot ANVIL?  That could be all it needs.
> 
> Charlie
> 
>   ============================================================================
>   Charles O'Brien                                         WSI Corporation
>   Software Engineer/Meteorologist                         4 Federal Street
>   EMAIL: address@hidden                                  Billerica, MA  01821
>   PHONE: (978) 670-5152                                   FAX: (978) 670-5100
>   ============================================================================

Thanks for the info, Charlie.  That was helpful.  We did a bit of
testing here using rpcinfo.  In particular, we ran 'rpcinfo -T tcp
anvil.eas.purdue.edu 300029 5' from both a Unidata host and a
non-Unidata host.  (This uses TCP to make an RPC nullproc call to
program 300029, version 5, on anvil, i.e., the LDM.)
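
For anyone curious what that nullproc call looks like on the wire, here
is a minimal Python sketch that builds the request by hand.  It assumes
the standard ONC RPC version 2 call layout with AUTH_NONE credentials
and TCP record marking; the function name build_null_call and the
constant names are made up for illustration and are not part of rpcinfo
or the LDM.

```python
import struct

LDM_PROGRAM = 300029   # RPC program number from the rpcinfo command above
LDM_VERSION = 5        # RPC program version from the rpcinfo command above


def build_null_call(xid, prog=LDM_PROGRAM, vers=LDM_VERSION):
    """Return one record-marked ONC RPC call to procedure 0 (the nullproc)."""
    body = struct.pack(
        ">10I",
        xid,    # transaction id, chosen by the caller
        0,      # msg_type: 0 = CALL
        2,      # RPC protocol version, always 2
        prog,   # program number
        vers,   # program version
        0,      # procedure 0 is the null procedure
        0, 0,   # credentials: AUTH_NONE, zero-length body
        0, 0,   # verifier: AUTH_NONE, zero-length body
    )
    # TCP record marking: the high bit flags the last fragment and the
    # low 31 bits carry the fragment length.
    marker = struct.pack(">I", 0x80000000 | len(body))
    return marker + body


msg = build_null_call(xid=1)
print(len(msg), msg[:4].hex())
```

A server that accepts the caller answers with a short reply carrying
the same transaction id.  A "Connection reset by peer" like the one in
Charlie's ldmping output would show up when the client tries to read
that reply.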

From this we were able to confirm that we can execute an LDM
nullproc from Unidata (recall that Unidata hosts should have 'allow's on
all LDM sites), but not from the non-Unidata machine.  From the
non-Unidata machine, the results were like the ones Charlie is getting.

This points to two possibilities: a wrong address on the part of WSI, or
a problem with the 'allow' line on anvil.

At WSI: Charlie, regarding the changes to your DNS, could you have the
wrong Purdue address in your /etc/hosts file?

At Purdue:  Eric, are you sure your allow line is correct?  Perhaps it
would be useful to broaden your allow to "*.wsicorp.com".
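
As a rough illustration of how a broadened pattern would behave, the
Python sketch below checks the two host names and the address from
Charlie's message against a regular expression meaning "any host under
wsicorp.com".  This is only an approximation: it assumes the allow
entry is matched as a regular expression against whatever name the
server sees for the connecting host, and it uses Python's re module
rather than the LDM's own matcher.  The pattern and the helper name
allowed are hypothetical.

```python
import re

# Hypothetical pattern: wsicorp.com itself or any host beneath it,
# which is the intent of the "*.wsicorp.com" shorthand.
WSI_PATTERN = re.compile(r"^(.*\.)?wsicorp\.com$", re.IGNORECASE)


def allowed(peer_name, pattern=WSI_PATTERN):
    """Return True if the peer's name, as the server sees it, matches."""
    return pattern.search(peer_name) is not None


# Names and address taken from Charlie's message.
for peer in ("sysu1.wsicorp.com", "sysu1.uni.wsicorp.com", "198.115.158.1"):
    print(peer, allowed(peer))
```

Both host names match, but the bare address does not.  So if the
reverse lookup of the connecting address fails on anvil and only the
dotted-quad form is available, a name-based allow would not cover it.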

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************