
[IDD #JLJ-308670]: NEXRAD Level II outage



Hi Jamie,

This is a semi-quick follow-up to the reply that Steve just sent you.

Steve mused:
> This might indicate a problem with our Linux Virtual Server (LVS) 
> implementation
> (idd.unidata.ucar.edu is actually a cluster of computers served by LVS).

I, for one, do not think that the situation you experienced has anything
to do with the LVS that directs the idd.unidata.ucar.edu cluster, but
that discussion is still ongoing here at Unidata.

We just worked through a situation at LSU that was very similar to
(if not exactly the same as) what you experienced today.  The LSU situation
was diagnosed and fixed during a conference call that we had last Friday;
the participants in the call were two of us here in the UPC, one person
from LSU/SRCC, one LSU networking admin, and a representative from
Juniper Networks (the company whose edge router and firewall are used
at LSU).  The write-up we received this afternoon about their problem
is as follows:

> Technical Details of Issue:
> All traffic at LSU is subject to an IDP to block p2p. Since January
> 18th, the IDP was turned off because it was causing high CPU
> utilization, slowing the entire campus traffic. Unfortunately, there was
> a rule that still would point to an inactive IDP. The way this works,
> the flows that match this IDP policy will be asked to redirect to the
> IDP module. They will go into a queue waiting for IDP inspection.
> Because the IDP module is not enabled, they are momentarily stuck there
> waiting for timeout. All this waiting will begin to clog up the
> buffer-queue, which eventually triggers the log message we saw on the
> SRX (i.e. Feb 22 14:03:24  csc-118-l7-srx5800-edgefrwl fpc1 Cobar: XLR1
> flow_held_mbuf 500, raise above 500, 1000th time. ). When the queue gets
> full, packets are dropped. Most connections will not see a problem
> because TCP will recover and start a new connection. For other
> applications, this may look like a DoS due to the constant creation of
> new connections, e.g. Unidata.
> 
> LSU future plans:
> * Feature request to Juniper: if IDP is turned off, send a warning
>   message or error trigger that a rule is still pointing to the IDP even
>   though it is off.
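
To make the failure mode described above a bit more concrete, here is a
rough, purely illustrative sketch (ours, not Juniper code or configuration)
of what the write-up amounts to: flows matching the stale rule are parked
in a bounded queue waiting for an inspection module that never runs, the
queue eventually fills, and anything arriving after that is dropped:

  # Illustrative sketch only; the limit of 500 is taken from the
  # flow_held_mbuf message quoted above, everything else is made up.
  from collections import deque

  QUEUE_LIMIT = 500      # threshold mentioned in the SRX log message
  IDP_ENABLED = False    # the module the stale rule still points at

  held = deque()
  dropped = 0

  for flow in range(2000):              # new flows (e.g., LDM reconnects)
      if IDP_ENABLED:
          continue                      # inspected and released immediately
      if len(held) < QUEUE_LIMIT:
          held.append(flow)             # parked, waiting for a timeout
      else:
          dropped += 1                  # queue full: the flow is dropped

  print(f"held={len(held)} dropped={dropped}")

Each dropped flow prompts the downstream application (here, the LDM) to
retry, which is why the firewall sees a constant stream of new connections
that can resemble a DoS.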

Questions:

- the first question I have for you is whether there have been any recent
  modifications to firewalls or any "packet shaping" systems at LL or
  MIT?

- the second is whether you are using Juniper network equipment?

We would like to propose that you change your ~ldm/etc/ldmd.conf REQUEST
for NEXRAD2 data to point at the specific idd.unidata.ucar.edu cluster
node where the problem was experienced today, uni19.unidata.ucar.edu.
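
To make that concrete, the edit would be to the NEXRAD2 REQUEST line in
~ldm/etc/ldmd.conf on llwxldm1, i.e., changing something along the lines of

  REQUEST NEXRAD2 ".*" idd.unidata.ucar.edu

to

  REQUEST NEXRAD2 ".*" uni19.unidata.ucar.edu

(the ".*" pattern above is just a placeholder; keep whatever pattern your
current REQUEST line uses), followed by an LDM restart (e.g., 'ldmadmin
restart') so the new REQUEST takes effect.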

NB:

It only makes sense to make this change if/when repeated connection attempts
get denied by idd.unidata.ucar.edu.  The reason for this is as follows:

- since the outage you experienced today was transitory (i.e., data began
  flowing again at 21:30:05Z with no change here at the UPC or by you
  with your LDM configuration), it may be difficult, if not impossible,
  to determine the cause of the problem unless the problem occurs again.

  For instance, if maintenance was being performed on a router or "packet
  shaper" at either LL or MIT and the existing LDM connection passed through
  that device, then the problem may not return now that the work is complete.
  The situation at LSU was much easier (but still very hard) to diagnose
  because we could make the connection fail at any time we chose (we meaning
  the LSU folks or UPC staff, since we were granted login access to the LSU
  LDM machines).

re:
> > We just experienced a full outage of all our NEXRAD Level II data that we
> > pull from Unidata via LDM. We're now trying to determine whether the
> > problem was at our end or the Unidata end.

Even if there was a problem at LL/MIT, our having upgraded the cluster nodes
to a new version of the LDM in which duplicate REQUESTs are rejected would
magnify its effects.  With the previous LDM versions that were running on the
idd.unidata.ucar.edu cluster nodes, duplicate REQUESTs were allowed, so a
transient situation like yours might never have been noticed by you or by us,
as long as the new connections did not push the total number of connections
above the maximum we impose on each cluster node (256).
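
To spell out why that matters: under the old behavior a retry from the same
downstream was simply accepted alongside the wedged connection, so data kept
flowing; under the new behavior the retry is refused while the original
connection is still held, so an outage like yours becomes visible.  A rough
sketch of the difference (ours, not LDM source code):

  # Rough illustrative sketch only; names and structure are ours, not LDM's.
  active = {}   # (downstream_host, feedtype) -> list of connection stand-ins

  def handle_request(host, feedtype, reject_duplicates):
      key = (host, feedtype)
      if key in active and reject_duplicates:
          return "rejected"        # new behavior: duplicate REQUEST refused
      active.setdefault(key, []).append(object())   # old: duplicate allowed
      return "accepted"

  # A second, identical REQUEST arriving while the first is still held:
  print(handle_request("llwxldm1", "NEXRAD2", reject_duplicates=True))  # accepted
  print(handle_request("llwxldm1", "NEXRAD2", reject_duplicates=True))  # rejected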

Question:

- are you OK with leaving the LDM REQUEST on llwxldm1 as is, and only
  changing the REQUEST if you experience the service denial again?

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: JLJ-308670
Department: Support LDM
Priority: Normal
Status: Closed