
Re: Top level CONDUIT relay



Justin Cooke wrote:
> Chi,
>
> Was the change made to both ldm1 and ldm2?

Yes.


>
> Justin
>
> Chi.Y.Kang wrote:
>> Yes, I made the change to the LDM servers to test the shared memory
>> configuration.
>>
>> # SHMMAX: maximum size of a single shared memory segment (4 GB)
>> kernel.shmmax = 4294967296
>> # SHMMNI: maximum number of shared memory segments
>> kernel.shmmni = 4096
>> # SHMALL: total shared memory, in PAGE_SIZE units (see getconf PAGE_SIZE)
>> kernel.shmall = 2097152
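
A minimal sketch of applying and verifying those settings, assuming they
were added to /etc/sysctl.conf (these are standard Linux commands, nothing
LDM-specific):

    # load the new values from /etc/sysctl.conf without a reboot
    sysctl -p

    # confirm the kernel picked them up
    sysctl kernel.shmmax kernel.shmmni kernel.shmall

    # show the SysV IPC limits as the kernel now reports them
    ipcs -l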
>>
>> However, this doesn't explain the performance relief, because the LDM
>> doesn't seem to be using shared memory; at least nothing is listed in
>> the table below.  Mr. Cano thought LDM might be using it.
>>
>> ldm1:~$ ipcs -a
>>
>> ------ Shared Memory Segments --------
>> key        shmid      owner      perms      bytes      nattch     status
>> 0x00000000 0          root       600        3976       4          dest
>> ------ Semaphore Arrays --------
>> key        semid      owner      perms      nsems   
>> ------ Message Queues --------
>> key        msqid      owner      perms      used-bytes   messages  
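
One way to check whether the LDM processes are attached to any System V
shared memory at all is sketched below. Note that the LDM product queue is
normally a memory-mapped file (often named ldm.pq; the path is
site-specific) rather than a SysV segment, so an empty ipcs table is not
surprising:

    # list shared memory segments along with creator / last-attach PIDs
    ipcs -m -p

    # see what the running ldmd has mapped; the product queue should show
    # up as an ordinary file mapping
    pmap $(pgrep -x ldmd | head -1) | grep -i ldm.pq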
>>
>> Justin Cooke wrote:
>>  
>>> Chi,
>>>
>>> Has anything at all changed on ldm1 since yesterday? Starting at 04Z
>>> the feed on node6 improved dramatically, and all other subscribers to
>>> ldm1 also noticed improved performance.
>>>
>>> Justin
>>>
>>> Steve Chiswell wrote:
>>>    
>>>> Justin,
>>>>
>>>> I noticed that the feeds from ldm1 dropped as you said. Do you know
>>>> if anything changed related to that machine?
>>>>
>>>> I can add daffy back to ldm1 and see if things maintain their
>>>> performance, but I will wait to find out whether any changes were
>>>> made. Since ldm2 is still lagging, it seems like this is not a
>>>> network-wide issue.
>>>>
>>>> Steve
>>>>
>>>> On Thu, 21 Jun 2007, Justin Cooke wrote:
>>>>
>>>>> Steve,
>>>>>
>>>>> Looking at the graphs, it appears that transfers improved greatly
>>>>> after 04Z today. I did a netstat on ldm1 and I still see where atm
>>>>> and flood are subscribing to it, same as yesterday.
>>>>>
>>>>> Looking at the latency graphs you provide, though, it looks like
>>>>> those subscribing to ldm2 are still seeing delays.
>>>>>
>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>
>>>>> Justin
>>>>>
>>>>> Steve Chiswell wrote:
>>>>>           
>>>>>> Justin,
>>>>>>
>>>>>> I am receiving the stats from node6:
>>>>>> Latency:
>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+node6.woc.noaa.gov
>>>>>>
>>>>>>
>>>>>> Volume:
>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+node6.woc.noaa.gov
>>>>>>
>>>>>> The latency there to ldm1 is climbing on the initial connection, and
>>>>>> will start off by catching up on the last hour's worth of data in the
>>>>>> upstream queue. After that, we can see what the latency is doing.
>>>>>>
>>>>>> Steve
>>>>>>
>>>>>> On Wed, 2007-06-20 at 12:43 -0400, Justin Cooke wrote:
>>>>>>
>>>>>>               
>>>>>>> Steve and Chi,
>>>>>>>
>>>>>>> I tried to ping rtstats.unidata.ucar.edu but was unable to.
>>>>>>>
>>>>>>> Chi, would you be able to set up a static route from node6 to
>>>>>>> rtstats.unidata.ucar.edu like Steve mentions?
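
A sketch of the kind of static route being discussed, with a placeholder
gateway; the real next hop (and whether a host or network route is the
right choice) would have to come from the NCEP network folks:

    # route traffic for the rtstats host through a specific gateway
    route add -host rtstats.unidata.ucar.edu gw <border-gateway-ip>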
>>>>>>>
>>>>>>> I actually am unable to connect to ncepldm.woc.noaa.gov either.
>>>>>>> However, I did set up a feed to "ldm1" and am receiving CONDUIT
>>>>>>> data currently.
>>>>>>>
>>>>>>> Steve, how tough would it be to do the pqact step you mention and
>>>>>>> to get the stats reports from those files if Chi is unable to get
>>>>>>> the static route going?
>>>>>>>
>>>>>>> Thanks for all the help,
>>>>>>>
>>>>>>> Justin
>>>>>>>
>>>>>>> On Jun 20, 2007, at 12:16 PM, Steve Chiswell wrote:
>>>>>>>
>>>>>>>> Justin,
>>>>>>>>
>>>>>>>> Is that box capable of sending stats to our
>>>>>>>> rtstats.unidata.ucar.edu host? E.g., is it allowed to connect
>>>>>>>> outside your domain?
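
A quick way to test that from the box itself is a plain TCP connection to
the LDM port (assuming port 388 is what the rtstats report needs to reach;
a refused or hanging connection would suggest the border is blocking it):

    telnet rtstats.unidata.ucar.edu 388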
>>>>>>>>
>>>>>>>> The LDM won't need to run pqact to test out the throughput and
>>>>>>>> network, but will need these ldmd.conf lines:
>>>>>>>>
>>>>>>>> EXEC    "rtstats -h rtstats.unidata.ucar.edu"
>>>>>>>> request CONDUIT ".*" ncepldm.woc.noaa.gov
>>>>>>>>
>>>>>>>> The pqact EXEC action can be commented out. The request line will
>>>>>>>> start the feed to ncepldm, which flood.atmos.uiuc.edu is pointing
>>>>>>>> to and which is showing high latency. If you are able to feed from
>>>>>>>> ncepldm without the latency that outside hosts are showing, then
>>>>>>>> that would isolate the problem further to the border between your
>>>>>>>> network and the outside. If you do show similar latency, then it
>>>>>>>> would either be the LDM configuration itself, or the local router
>>>>>>>> that the machines are on.
>>>>>>>>
>>>>>>>> If you are able to send rtstats out to us, then we can monitor the
>>>>>>>> stats on our web pages. Your network might require a static route
>>>>>>>> to be added for sending that outside your domain (that would be
>>>>>>>> something your networking folks would know). rtstats sends a small
>>>>>>>> text report about every 60 seconds, so it is not a lot of traffic.
>>>>>>>>
>>>>>>>> If you can't configure your host to send rtstats, then we could
>>>>>>>> create a pqact.conf action to file the .status reports and
>>>>>>>> calculate the latency from those.
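
For reference, a rough sketch of what such a pqact.conf action could look
like; the product-ID pattern is only a guess and the output path is made
up, so both would need to be checked (for example against notifyme output)
before use:

    # Append CONDUIT ".status" products to a local file so their creation
    # times can later be compared against receipt times.
    # pqact.conf fields must be separated by tabs.
    CONDUIT	\.status
    	FILE	-flush	data/conduit/status_products.log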
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Steve
>>>>>>>>
>>>>>>>> On Wed, 2007-06-20 at 12:03 -0400, Justin Cooke wrote:
>>>>>>>>
>>>>>>>>                       
>>>>>>>>> Steve,
>>>>>>>>>
>>>>>>>>> If you provide us a pqact.conf, I can have the box Chi set up
>>>>>>>>> feed off of ldm1 and see how its latencies are.
>>>>>>>>>
>>>>>>>>> Justin
>>>>>>>>> On Jun 20, 2007, at 11:36 AM, Steve Chiswell wrote:
>>>>>>>>>
>>>>>>>>>> Justin,
>>>>>>>>>>
>>>>>>>>>> Since the change at 13Z dropping daffy.unidata.ucar.edu out of
>>>>>>>>>> the top level nodes, the ldm2 feed to NSF is showing little/no
>>>>>>>>>> latency at all. The ldm1 feed to NSF, which is connected using
>>>>>>>>>> the alternate LDM mode, is only delivering the .status messages
>>>>>>>>>> it creates, since all the other products are duplicates of
>>>>>>>>>> products already being received from ldm2, and that feed is
>>>>>>>>>> showing high latency:
>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>>>>>
>>>>>>>>>> This configuration is getting data out to the community at the
>>>>>>>>>> moment. The downside here is that it puts a single point of
>>>>>>>>>> failure at NSF in getting the data to Unidata, but I'll monitor
>>>>>>>>>> that end.
>>>>>>>>>>
>>>>>>>>>> It seems that ldm1 is either slow, or it is showing network
>>>>>>>>>> limitations (since flood.atmos.uiuc.edu is feeding from ncepldm,
>>>>>>>>>> which is apparently pointing to ldm1, there is load on ldm1
>>>>>>>>>> besides the NSF feed). ldm2 is feeding both NSF and
>>>>>>>>>> idd.aos.wisc.edu (and Wisc looks good since 13Z as well), so it
>>>>>>>>>> is able to handle the throughput to 2 downstreams, but adding
>>>>>>>>>> daffy as the 3rd seems to cross some threshold in the volume of
>>>>>>>>>> what can be sent out.
>>>>>>>>>>
>>>>>>>>>> Steve
>>>>>>>>>>
>>>>>>>>>> On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:
>>>>>>>>>>
>>>>>>>>>>                               
>>>>>>>>>>> Thanks Steve,
>>>>>>>>>>>
>>>>>>>>>>> Chi has set up a box on the LAN for us to run LDM on, and I am
>>>>>>>>>>> beginning to get things running on there.
>>>>>>>>>>>
>>>>>>>>>>> Have you seen any improvement since dropping daffy?
>>>>>>>>>>>
>>>>>>>>>>> Justin
>>>>>>>>>>>
>>>>>>>>>>> On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Justin,
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, this does appear to be the case. I will drop daffy from
>>>>>>>>>>>> feeding directly and instead move it to feed from NSF. That
>>>>>>>>>>>> will remove one of the top level relays of data having to go
>>>>>>>>>>>> out of NCEP, and we can see if the other nodes show an
>>>>>>>>>>>> improvement.
>>>>>>>>>>>>
>>>>>>>>>>>> Steve
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 20 Jun 2007, Justin Cooke wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Steve,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you see a slowdown to ldm2 after Pete and the other sites
>>>>>>>>>>>>> began making connections?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Chi, considering Steve saw a good connection to ldm1 before
>>>>>>>>>>>>> the other sites connected, doesn't that point toward a
>>>>>>>>>>>>> network issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> All of our queue processing on the diskserver has been running
>>>>>>>>>>>>> without any problems, so I don't believe anything on that
>>>>>>>>>>>>> system would be impacting ldm1/ldm2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I set up the test LDM server for the NCEP folks to test the
>>>>>>>>>>>>>> local pull from the LDM servers.  That should give us some
>>>>>>>>>>>>>> information on whether this is a network or system related
>>>>>>>>>>>>>> issue.  We'll handle that tomorrow.  I am a little bit
>>>>>>>>>>>>>> concerned that the slowdown all occurred at the same time as
>>>>>>>>>>>>>> the ldm1 crash last week.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, can NCEP check if there are any bad dbnet queues on
>>>>>>>>>>>>>> the backend servers?  Just to verify.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                                               
>>>>>>>>>>>>>>> Thanks Justin,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also had a typo in my message: ldm1 is running slower than
>>>>>>>>>>>>>>> ldm2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now if the feed to ldm2 all of a sudden slows down when Pete
>>>>>>>>>>>>>>> and other sites add a request to it, it would really signal
>>>>>>>>>>>>>>> some sort of total bandwidth limitation on the I2
>>>>>>>>>>>>>>> connection. It seemed a little coincidental that we had a
>>>>>>>>>>>>>>> short period of good connectivity to ldm1 after which it
>>>>>>>>>>>>>>> slowed way down.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>> On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                                                   
>>>>>>>>>>>>>>>> I just realized the issue. When I disabled the "pqact"
>>>>>>>>>>>>>>>> process on ldm2 earlier today, it caused our monitor script
>>>>>>>>>>>>>>>> (in cron, every 5 min) to kill the LDM and restart it. I
>>>>>>>>>>>>>>>> have removed the check for pqact in that monitor... things
>>>>>>>>>>>>>>>> should be a bit better now.
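
For reference, a minimal sketch of a cron check along those lines (the
user name and use of pgrep/ldmadmin are assumptions); it keys on the
top-level ldmd rather than on optional children such as pqact:

    #!/bin/sh
    # restart the LDM only if the ldmd server itself is not running;
    # a missing pqact is not treated as a failure
    if ! pgrep -x ldmd > /dev/null 2>&1; then
        su - ldm -c "ldmadmin restart"
    fi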
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                                                       
>>>>>>>>>>>>>>>>> Huh, I thought you guys were on the system.  Let me take
>>>>>>>>>>>>>>>>> a look at ldm2 and see what is going on.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Justin Cooke wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Pete and David,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I changed the CONDUIT request lines at NSF and Unidata
>>>>>>>>>>>>>>>>>>>> to request data from ldm1.woc.noaa.gov rather than
>>>>>>>>>>>>>>>>>>>> ncepldm.woc.noaa.gov after seeing lots of
>>>>>>>>>>>>>>>>>>>> disconnect/reconnects to the ncepldm virtual name.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The LDM appears to have caught up here as an interim
>>>>>>>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Still don't know the cause of the problem.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I know NCEP was stopping and starting the LDM service on
>>>>>>>>>>>>>>>>>>> the ldm2 box, where the VIP address is pointed at this
>>>>>>>>>>>>>>>>>>> time.  How is the current connection to ldm1?  Is the
>>>>>>>>>>>>>>>>>>> speed of the CONDUIT feed acceptable?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Chi, NCEP has not restarted the LDM on ldm2 at all today.
>>>>>>>>>>>>>>>>>> But looking at the logs it appears to be dying and
>>>>>>>>>>>>>>>>>> getting restarted by cron.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I will watch and see if I see anything.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> Chi Y. Kang
>>>>>>>>>>>>>> Contractor
>>>>>>>>>>>>>> Principal Engineer
>>>>>>>>>>>>>> Phone: 301-713-3333 x201
>>>>>>>>>>>>>> Cell: 240-338-1059
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                                                 
>>>>>>>>>> -- 
>>>>>>>>>> Steve Chiswell <address@hidden>
>>>>>>>>>> Unidata
>>>>>>>>>>
>>>>>>>>>>                                 
>>>>>>>> -- 
>>>>>>>> Steve Chiswell <address@hidden>
>>>>>>>> Unidata
>>>>>>>>
>>>>>>>>                         
>>


-- 
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059