
Re: 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov



Tom,

Sorry, I was off setting up a new iMac for one of our grad students.

I'm pretty certain that the reason I am losing data is that the LDM queue on idd-agg.aos.wisc.edu (where most of my IDD feeds first come in before being passed along to idd1.aos and idd2.aos) is too small once the lags get over ~1000 seconds. The product queue on idd-agg is currently 73 GB. I think that machine has 96 GB of RAM, and the product queue is on a RAM disk (/dev/shm).

My guess is that all of the data is making it into the queue, but with a lag of > 1000 seconds it's not getting processed out of the queue.
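
For a rough sense of scale, here is a back-of-the-envelope sketch in Python of how long a product might survive in a full queue. It assumes the queue size quoted above means 73 gigabytes, that the queue evicts its oldest products once it is full, and a made-up peak ingest rate of 50 MB/s (the real rate would have to be measured). Subtracting the lag from that window simply mirrors the guess above; it is not a description of how LDM actually ages products.

```python
# Back-of-the-envelope queue residency estimate (all rates are hypothetical).


def residency_seconds(queue_bytes: float, ingest_bytes_per_s: float) -> float:
    """Approximate time a product stays in a full queue before being evicted."""
    if ingest_bytes_per_s <= 0:
        raise ValueError("ingest rate must be positive")
    return queue_bytes / ingest_bytes_per_s


def headroom_seconds(queue_bytes: float, ingest_bytes_per_s: float, lag_s: float) -> float:
    """Residency window left after subtracting the upstream lag (a simplification)."""
    return residency_seconds(queue_bytes, ingest_bytes_per_s) - lag_s


QUEUE_BYTES = 73e9        # queue size quoted above, read as 73 gigabytes
ASSUMED_PEAK_RATE = 50e6  # hypothetical: 50 MB/s during a GFS burst, not measured
LAG_SECONDS = 1000.0      # lag threshold quoted above

if __name__ == "__main__":
    print(f"residency: {residency_seconds(QUEUE_BYTES, ASSUMED_PEAK_RATE):.0f} s")
    print(f"headroom:  {headroom_seconds(QUEUE_BYTES, ASSUMED_PEAK_RATE, LAG_SECONDS):.0f} s")
```

With these made-up numbers the window comes out to about 1460 seconds, which leaves roughly 460 seconds after a 1000-second lag. A higher real ingest rate would shrink that margin quickly.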

Pete


-----
Pete Pokrandt - Systems Programmer
UW-Madison Dept of Atmospheric and Oceanic Sciences
608-262-3086  - address@hidden



From: Tom Yoksas <address@hidden>
Sent: Monday, August 30, 2021 1:22 PM
To: Pete Pokrandt <address@hidden>; Anne Myckow - NOAA Federal <address@hidden>
Cc: address@hidden <address@hidden>; Tyle, Kevin R <address@hidden>; address@hidden <address@hidden>
Subject: 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov
 
Hi Pete and Kevin,

I am not CCing the NCEP folks on this note...

When you say "are still losing data" what exactly do you mean:

- you are not receiving data that should be in the feed?

- something else, like the residency time of data in your LDM queue(s)
   not being long enough for it to be processed out of the
   queue (e.g., FILEd, distributed to downstreams, etc.)?

One of the reasons I am asking is we have had support inquiries from
two sites that were not receiving all of the CONDUIT data that they
believed that they should be receiving.  In one case, the problem was
tracked down to a network problem outside of the department in question,
and in the other case, the problem appears to be partly related to the
machine doing the ingest.

Given the inquiries that we have received, I need to know exactly what
your comment means.

Thanks in advance...

Cheers,

Tom

On 8/30/21 10:58 AM, Pete Pokrandt wrote:
> Monday morning update - the large lags from vm-lnx-conduit2 are still
> there, and we are still losing data.
>
> FYI
> Pete
>
>
>
> -----
> Pete Pokrandt - Systems Programmer
> UW-Madison Dept of Atmospheric and Oceanic Sciences
> 608-262-3086  - address@hidden
>
>
> ------------------------------------------------------------------------
> *From:* Anne Myckow - NOAA Federal <address@hidden>
> *Sent:* Friday, August 27, 2021 9:46 AM
> *To:* Pete Pokrandt <address@hidden>
> *Cc:* Tyle, Kevin R <address@hidden>; address@hidden
> <address@hidden>; address@hidden
> <address@hidden>; address@hidden
> <address@hidden>
> *Subject:* Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov
> Thanks Pete. We are engaging our networking folks on this issue now.
>
> However, be warned that we are having a major internet outage at our
> Boulder data center. We are moving more apps over to College Park, so
> you will most likely see latency on both conduit systems today, until at
> least 22Z. Hopefully our networking folks can actually find a bottleneck
> this time around but just wanted to let you know. Will keep you posted.
>
> Thanks,
> Anne
>
> On Thu, Aug 26, 2021 at 5:30 PM Pete Pokrandt <address@hidden
> <mailto:address@hidden>> wrote:
>
>     Anne,
>
>     It's hard to say. To my eyes, it looks like the latency problem was
>     not solved by a reboot or by moving the server to a different part
>     of your infrastructure. The graphs show that there are still large
>     latencies from vm-lnx-conduit2, but maybe not quite as bad as
>     before? I did still lose some of the 00 UTC 26 GFS run, so the
>     problem definitely is not resolved.
>
>     Unidata folks, any ideas on things they can try to figure out what's
>     going on here? Is their internal network just saturated to the point
>     where it can't keep up? Or something about the vm itself that might
>     cause that?
>
>     Pete
>
>     -----
>     Pete Pokrandt - Systems Programmer
>     UW-Madison Dept of Atmospheric and Oceanic Sciences
>     608-262-3086  - address@hidden <mailto:address@hidden>
>
>
>     ------------------------------------------------------------------------
>     *From:* Anne Myckow - NOAA Federal <address@hidden
>     <mailto:address@hidden>>
>     *Sent:* Wednesday, August 25, 2021 2:46 PM
>     *To:* Tyle, Kevin R <address@hidden <mailto:address@hidden>>
>     *Cc:* Pete Pokrandt <address@hidden
>     <mailto:address@hidden>>; address@hidden
>     <mailto:address@hidden> <address@hidden
>     <mailto:address@hidden>>; address@hidden
>     <mailto:address@hidden>
>     <address@hidden
>     <mailto:address@hidden>>;
>     address@hidden
>     <mailto:address@hidden>
>     <address@hidden
>     <mailto:address@hidden>>
>     *Subject:* Re: High CONDUIT latencies from
>     vm-lnx-conduit2.ncep.noaa.gov <http://vm-lnx-conduit2.ncep.noaa.gov>
>     We have moved vm-lnx-conduit2 to a less busy area within our
>     infrastructure. Is the feed from conduit1 still good? And please let
>     us know what conduit2 looks like.
>
>     Thanks,
>     Anne
>
>     On Wed, Aug 25, 2021 at 9:18 AM Anne Myckow - NOAA Federal
>     <address@hidden <mailto:address@hidden>> wrote:
>
>         Also, I'd like to know if there are any of you all that are
>         *not* experiencing latency. Please let me know if you are in
>         that camp.
>
>         Thanks so much,
>         Anne
>
>         On Wed, Aug 25, 2021 at 9:04 AM Anne Myckow - NOAA Federal
>         <address@hidden <mailto:address@hidden>> wrote:
>
>             Morning,
>
>             I don't see the crazy latency from that one cycle yesterday
>             but it still looks pretty bad to me - do you concur?
>
>             Thanks,
>             Anne
>
>             On Tue, Aug 24, 2021 at 4:03 PM Anne Myckow - NOAA Federal
>             <address@hidden <mailto:address@hidden>> wrote:
>
>                 Hi everyone,
>
>                 We've tried rebooting the systems, I checked your graph
>                 and it looks like we won't know for a few cycles if it's
>                 better - can you let us know if you see something before
>                 we check it tomorrow morning?
>
>                 Thanks,
>                 Anne
>
>                 On Tue, Aug 24, 2021 at 1:59 PM Tyle, Kevin R
>                 <address@hidden <mailto:address@hidden>> wrote:
>
>                     Hi all,
>
>                     I can state that our GFS grib file reception via LDM
>                     has been extremely spotty, particularly for the
>                     F48-F192 forecast hour periods, for several weeks
>                     now. We feed from Pete’s LDM at UW-MSN so this is
>                     consistent with what Pete has been seeing.
>
>                     It would be really nice if NCEP’s CONDUIT feed can
>                     return to the level of consistent service that we in
>                     the community had been accustomed to for many years.
>
>                     Cheers,
>
>                     Kevin
>
>                     _____________________________________________________
>
>                     Kevin Tyle, M.S.; Manager of Departmental Computing
>                     NSF XSEDE Campus Champion
>                     Dept. of Atmospheric & Environmental Sciences
>                     UAlbany ETEC Bldg – Harriman Campus
>                     1220 Washington Avenue, Room 419
>                     Albany, NY 12222
>                     address@hidden <mailto:address@hidden> |
>                     518-442-4578 | @nywxguy | he/him/his
>
>                     _____________________________________________________
>
>
>                     *From:* conduit <address@hidden
>                     <mailto:address@hidden>> *On
>                     Behalf Of *Pete Pokrandt via conduit
>                     *Sent:* Tuesday, August 24, 2021 1:26 PM
>                     *To:* Anne Myckow - NOAA Federal
>                     <address@hidden <mailto:address@hidden>>
>                     *Cc:* address@hidden
>                     <mailto:address@hidden>;
>                     address@hidden
>                     <mailto:address@hidden>;
>                     address@hidden
>                     <mailto:address@hidden>;
>                     address@hidden
>                     <mailto:address@hidden>
>                     *Subject:* Re: [conduit] High CONDUIT latencies from
>                     vm-lnx-conduit2.ncep.noaa.gov
>                     <http://vm-lnx-conduit2.ncep.noaa.gov>
>
>                     Dear Anne and all,
>
>                     Just a note to let you know we are still
>                     experiencing the high latencies. In fact, today the
>                     latencies from both vm-lnx-conduit1 and
>                     vm-lnx-conduit2 are high.
>
>                     Pete
>
>                     https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd-agg.aos.wisc.edu
>
>                     https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+conduit.unidata.ucar.edu
>
>                     -----
>                     Pete Pokrandt - Systems Programmer
>                     UW-Madison Dept of Atmospheric and Oceanic Sciences
>                     608-262-3086  - address@hidden
>                     <mailto:address@hidden>
>
>                     ------------------------------------------------------------------------
>
>                     *From:*Anne Myckow - NOAA Federal
>                     <address@hidden <mailto:address@hidden>>
>                     *Sent:* Friday, August 20, 2021 12:14 PM
>                     *To:* Pete Pokrandt <address@hidden
>                     <mailto:address@hidden>>
>                     *Cc:* address@hidden
>                     <mailto:address@hidden>
>                     <address@hidden
>                     <mailto:address@hidden>>;
>                     address@hidden
>                     <mailto:address@hidden>
>                     <address@hidden
>                     <mailto:address@hidden>>;
>                     address@hidden
>                     <mailto:address@hidden>
>                     <address@hidden
>                     <mailto:address@hidden>>;
>                     address@hidden
>                     <mailto:address@hidden>
>                     <address@hidden
>                     <mailto:address@hidden>>
>                     *Subject:* Re: High CONDUIT latencies from
>                     vm-lnx-conduit2.ncep.noaa.gov
>                     <http://vm-lnx-conduit2.ncep.noaa.gov>
>
>                     Pete,
>
>                     conduit.ncep.noaa.gov <http://conduit.ncep.noaa.gov>
>                     is a load-balanced DNS name that points to both the
>                     conduit1 and conduit2 servers on the backend. I'm
>                     going to see if we can push you all off of conduit2
>                     for now; hopefully those of you connected to conduit2
>                     will see a brief interruption and then connect to
>                     conduit1 automatically.
>
>                     More to come.
>
>                     Anne
>
>                     On Fri, Aug 20, 2021 at 1:12 PM Pete Pokrandt
>                     <address@hidden <mailto:address@hidden>>
>                     wrote:
>
>                         It looks like conduit.ncep.noaa.gov
>                         <http://conduit.ncep.noaa.gov> is pulling data
>                         from both vm-lnx-conduit1 and vm-lnx-conduit2.
>                         conduit1 seems ok; it's just conduit2 that is
>                         showing the large lags.
>
>                         I don't know how things are set up exactly, but
>                         it might work to have conduit.ncep.noaa.gov
>                         <http://conduit.ncep.noaa.gov> only request
>                         CONDUIT data from vm-lnx-conduit1 until the
>                         problem with feeding from conduit2 is resolved?
>
>                         Unidata folks, any suggestions from your end?
>
>                         Thanks, we do appreciate all your work on our
>                         behalf!
>
>                         Pete
>
>                         -----
>                         Pete Pokrandt - Systems Programmer
>                         UW-Madison Dept of Atmospheric and Oceanic Sciences
>                         608-262-3086  - address@hidden
>                         <mailto:address@hidden>
>
>                         ------------------------------------------------------------------------
>
>                         *From:*Anne Myckow - NOAA Federal
>                         <address@hidden <mailto:address@hidden>>
>                         *Sent:* Friday, August 20, 2021 12:07 PM
>                         *To:* Pete Pokrandt <address@hidden
>                         <mailto:address@hidden>>
>                         *Cc:* address@hidden
>                         <mailto:address@hidden>
>                         <address@hidden
>                         <mailto:address@hidden>>;
>                         address@hidden
>                         <mailto:address@hidden>
>                         <address@hidden
>                         <mailto:address@hidden>>;
>                         address@hidden
>                         <mailto:address@hidden>
>                         <address@hidden
>                         <mailto:address@hidden>>;
>                         address@hidden
>                         <mailto:address@hidden>
>                         <address@hidden
>                         <mailto:address@hidden>>
>                         *Subject:* Re: High CONDUIT latencies from
>                         vm-lnx-conduit2.ncep.noaa.gov
>                         <http://vm-lnx-conduit2.ncep.noaa.gov>
>
>                         Hi Pete,
>
>                         We have a lot of systems and applications
>                         running out of College Park right now, which I
>                         think is part of it. But I will have someone
>                         take a look at conduit2 today, see if maybe we
>                         need to try and move your connections to
>                         conduit1 instead.
>
>                         Thanks,
>
>                         Anne
>
>                         On Fri, Aug 20, 2021 at 12:54 PM Pete Pokrandt
>                         <address@hidden <mailto:address@hidden>>
>                         wrote:
>
>                             Dear Anne, Dustin and all,
>
>                             Did you see this? We are still experiencing
>                             high latencies of 800-1000 seconds on our
>                             CONDUIT feeds during the times when the GFS
>                             comes through, and they appear to be coming
>                             from the host
>
>                             vm-lnx-conduit2.ncep.noaa.gov
>                             <http://vm-lnx-conduit2.ncep.noaa.gov>
>
>                             Here are the most recent lags. Any ideas?
>
>                             Thanks,
>
>                             Pete
>
>                             https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd-agg.aos.wisc.edu
>
>                             https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+conduit.unidata.ucar.edu
>
>                             -----
>                             Pete Pokrandt - Systems Programmer
>                             UW-Madison Dept of Atmospheric and Oceanic
>                             Sciences
>                             608-262-3086  - address@hidden
>                             <mailto:address@hidden>
>
>                             ------------------------------------------------------------------------
>
>                             *From:*Pete Pokrandt
>                             *Sent:* Wednesday, August 18, 2021 3:02 PM
>                             *To:* address@hidden
>                             <mailto:address@hidden>
>                             <address@hidden
>                             <mailto:address@hidden>>;
>                             address@hidden
>                             <mailto:address@hidden>
>                             <address@hidden
>                             <mailto:address@hidden>>;
>                             address@hidden
>                             <mailto:address@hidden>
>                             <address@hidden
>                             <mailto:address@hidden>>
>                             *Cc:* address@hidden
>                             <mailto:address@hidden>
>                             <address@hidden
>                             <mailto:address@hidden>>;
>                             address@hidden
>                             <mailto:address@hidden>
>                             <address@hidden
>                             <mailto:address@hidden>>
>                             *Subject:* High CONDUIT latencies from
>                             vm-lnx-conduit2.ncep.noaa.gov
>                             <http://vm-lnx-conduit2.ncep.noaa.gov>
>
>                             Dear Anne, Dustin and all,
>
>                             Recently we have noticed fairly high
>                             latencies on the CONDUIT ldm data feed
>                             originating from the machine
>                             vm-lnx-conduit2.ncep.noaa.gov
>                             <http://vm-lnx-conduit2.ncep.noaa.gov>. The
>                             feed originating from
>                             vm-lnx-conduit1.ncep.noaa.gov
>                             <http://vm-lnx-conduit1.ncep.noaa.gov> does
>                             not have the high latencies. Unidata and
>                             other top level feeds are seeing similar
>                             high latencies from
>                             vm-lnx-conduit2.ncep.noaa.gov
>                             <http://vm-lnx-conduit2.ncep.noaa.gov>.
>
>                             Here are some graphs showing the latencies
>                             that I'm seeing:
>
>                             From
>                             https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd-agg.aos.wisc.edu
>                             (latencies for CONDUIT data arriving at our
>                             UW-Madison AOS ingest machine)
>
>                             From
>                             https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/siteindex?conduit.unidata.ucar.edu
>                             (latencies at Unidata)
>
>                             At least here at UW-Madison, these latencies
>                             are causing us to lose some data during the
>                             large GFS/GEFS periods.
>
>                             Any idea what might be causing this?
>
>                             Pete
>
>                             -----
>                             Pete Pokrandt - Systems Programmer
>                             UW-Madison Dept of Atmospheric and Oceanic
>                             Sciences
>                             608-262-3086  - address@hidden
>                             <mailto:address@hidden>
>
>
>                         --
>                         Anne Myckow
>                         Dataflow Team Lead
>                         NWS/NCEP/NCO
>
>
>                     --
>                     Anne Myckow
>                     Dataflow Team Lead
>                     NWS/NCEP/NCO
>
>
>
>                 --
>                 Anne Myckow
>                 Dataflow Team Lead
>                 NWS/NCEP/NCO
>
>
>
>             --
>             Anne Myckow
>             Dataflow Team Lead
>             NWS/NCEP/NCO
>
>
>
>         --
>         Anne Myckow
>         Dataflow Team Lead
>         NWS/NCEP/NCO
>
>
>
>     --
>     Anne Myckow
>     Dataflow Team Lead
>     NWS/NCEP/NCO
>
>
>
> --
> Anne Myckow
> Dataflow Team Lead
> NWS/NCEP/NCO

--
+----------------------------------------------------------------------+
* Tom Yoksas                                      UCAR Unidata Program *
* (303) 497-8642 (last resort)                           P.O. Box 3000 *
* address@hidden                                    Boulder, CO 80307 *
* Unidata WWW Service                     http://www.unidata.ucar.edu/ *
+----------------------------------------------------------------------+