
Re: Top level CONDUIT relay



Here is the current configuration. sysrq is there to give us console
"reset / debug" access to the box.

# increase the amount of memory associated with input and output socket buffers:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# recommended to increase this for 1000 BT or higher
net.core.netdev_max_backlog = 2500

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1

# Sys RQ
kernel.sysrq = 1

# LDM Tuning Setting by Chi
# Setting SHMMAX Parameter 4 GB
kernel.shmmax = 4294967296
kernel.shmmni = 4096
# shmall is counted in pages of PAGE_SIZE (getconf PAGE_SIZE)
kernel.shmall = 2097152
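
For reference, a minimal sketch of applying and verifying the above without a
reboot, assuming these settings live in /etc/sysctl.conf (the exact file
location can vary by distribution):

# reload kernel parameters from /etc/sysctl.conf
sysctl -p

# confirm the values the kernel is actually using
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl kernel.shmmax kernel.shmall

# with kernel.sysrq = 1, console debug/reset actions are available;
# e.g. an emergency reboot can be forced with (use with care):
echo b > /proc/sysrq-trigger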


Mike Schmidt wrote:
> Chi,
>
> Ideally, all of the LDM connections will stay in the ESTABLISHED state.
> When connections are in TIME_WAIT, they are in the process of closing
> down and disconnecting.  If connections are continually in TIME_WAIT,
> that's usually an indication of an underlying problem.
>
> With the volume of data and distance (latency) of the network connections
> you have between Illinois, Wisconsin, and Unidata, you will want to have
> done some TCP stack tuning.  Here are values we use on our cluster nodes:
>
> net.core.wmem_max = 8388608
> net.core.rmem_max = 8388608
> net.ipv4.tcp_wmem = 4096 2000000 8388608
> net.ipv4.tcp_rmem = 4096 524288 8388608
> net.ipv4.tcp_adv_win_scale = 7
> net.ipv4.tcp_moderate_rcvbuf = 1
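
(As a sanity check on maximums like these: the buffer ceiling should cover the
path's bandwidth-delay product. For example, a 1 Gbit/s path with a ~67 ms
round-trip time needs about 10^9/8 bytes/s x 0.067 s, or roughly 8.4 MB,
which is in line with the 8388608-byte maximums above; the 16 MB values in
our configuration leave headroom for longer round-trip times.)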
>
> Let me know if you have questions.
>
> mike
>
> On Jun 21,  9:11am, Steve Chiswell wrote:
>   
>> Subject: Re: Top level CONDUIT relay
>>
>> Chi,
>>
>> LDM memory maps the product queue. It does not use shared memory. I would
>> guess that the shared memory segment you see in use is by the operating
>> system or window manager.
>>
>> The parameter that can be tuned to improve LDM performance is the tcp stack
>> size.
>>
>> The netstat listing that you showed with several LDM connections in
>> TIME_WAIT may mean something to Steve Emmerson and/or Mike Schmidt, so
>> I'll see if they have any input as well as suggestions on TCP tuning.
>>
>> Steve
>>
>>
>>
>> On Thu, 21 Jun 2007, Chi.Y.Kang wrote:
>>
>>     
>>> Yes, I made the change to the LDM servers to test the shared memory
>>> configuration.
>>>
>>> # Setting SHMMAX Parameter 4 GB
>>> kernel.shmmax = 4294967296
>>> # getconf PAGE_SIZE
>>> kernel.shmmni = 4096
>>> kernel.shmall = 2097152
>>>
>>> However, this doesn't explain the performance relief, because... ldm
>>> doesn't seem to be using shared memory, or at least it isn't listed in
>>> the table. Mr. Cano thought LDM might be using this.
>>>
>>> ldm1:~$ ipcs -a
>>>
>>> ------ Shared Memory Segments --------
>>> key        shmid      owner      perms      bytes      nattch     status
>>> 0x00000000 0          root       600        3976       4          dest
>>>
>>> ------ Semaphore Arrays --------
>>> key        semid      owner      perms      nsems
>>>
>>> ------ Message Queues --------
>>> key        msqid      owner      perms      used-bytes   messages
>>>
>>>
>>> Justin Cooke wrote:
>>>       
>>>> Chi,
>>>>
>>>> Has anything at all changed on ldm1 since yesterday? Starting at 04Z
>>>> the feed on node6 improved dramatically, and all other subscribers to
>>>> ldm1 also noticed improved performance.
>>>>
>>>> Justin
>>>>
>>>> Steve Chiswell wrote:
>>>>         
>>>>> Justin,
>>>>>
>>>>> I noticed that the feeds from ldm1 dropped as you said. Do you know
>>>>> if anything
>>>>> changed related to that machine?
>>>>>
>>>>> I can add daffy back to ldm1 and see if things maintain their
>>>>> performance, but will wait to find out whether any changes were made.
>>>>> Since ldm2 is still lagging, it seems this is not a network-wide
>>>>> issue.
>>>>>
>>>>> Steve
>>>>>
>>>>> On Thu, 21 Jun 2007, Justin Cooke wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> Steve,
>>>>>>
>>>>>> Looking at the graphs it appears that transfers improved greatly after
>>>>>> 04Z today. I did a netstat on ldm1 and I still see that atm and flood
>>>>>> are subscribing to it, same as yesterday.
>>>>>>
>>>>>> Although looking at the latency graphs you provide, it looks like
>>>>>> those subscribing to ldm2 are still seeing delays:
>>>>>>
>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>
>>>>>> Justin
>>>>>>
>>>>>> Steve Chiswell wrote:
>>>>>>
>>>>>>             
>>>>>>> Justin,
>>>>>>>
>>>>>>> I am receiving the stats from node6:
>>>>>>> Latency:
>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+node6.woc.noaa.gov
>>>>>>>
>>>>>>> Volume:
>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+node6.woc.noaa.gov
>>>>>>>
>>>>>>> The latency there to ldm1 is climbing on the initial connection, and
>>>>>>> will start off by catching up on the last hour's worth of data in the
>>>>>>> upstream queue. After that, we can see what the latency is doing.
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>> On Wed, 2007-06-20 at 12:43 -0400, Justin Cooke wrote:
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> Steve and Chi,
>>>>>>>>
>>>>>>>> I tried to ping rtstats.unidata.ucar.edu but was unable to.
>>>>>>>>
>>>>>>>> Chi, would you be able to set up a static route from node6 to
>>>>>>>> rtstats.unidata.ucar.edu like Steve mentions?
>>>>>>>>
>>>>>>>> I actually am unable to connect to ncepldm.woc.noaa.gov either.
>>>>>>>> However, I did set up a feed to "ldm1" and am receiving CONDUIT
>>>>>>>> data currently.
>>>>>>>>
>>>>>>>> Steve, how tough would it be to do the pqact step you mention, and
>>>>>>>> to get the stats reports from those, if Chi is unable to get the
>>>>>>>> static route going?
>>>>>>>>
>>>>>>>> Thanks for all the help,
>>>>>>>>
>>>>>>>> Justin
>>>>>>>>
>>>>>>>> On Jun 20, 2007, at 12:16 PM, Steve Chiswell wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> Justin,
>>>>>>>>>
>>>>>>>>> Is that box capable of sending stats to our rtstats.unidata.ucar.edu
>>>>>>>>> host?
>>>>>>>>> E.g., is it allowed to connect outside your domain?
>>>>>>>>>
>>>>>>>>> The ldm won't need to run pqact to test out the throughput and
>>>>>>>>> network, but will need ldmd.conf lines:
>>>>>>>>>
>>>>>>>>> EXEC    "rtstats -h rtstats.unidata.ucar.edu"
>>>>>>>>> request CONDUIT ".*" ncepldm.woc.noaa.gov
>>>>>>>>>
>>>>>>>>> The pqact EXEC action can be commented out. The request line will
>>>>>>>>> start the feed from ncepldm, which flood.atmos.uiuc.edu is pointing
>>>>>>>>> to and showing high latency. If you are able to feed from ncepldm
>>>>>>>>> without the latency that outside hosts are showing, then it would
>>>>>>>>> isolate the problem further to the border of your network to the
>>>>>>>>> outside. If you do show similar latency, then it would either be
>>>>>>>>> the LDM configuration itself, or the local router that the machines
>>>>>>>>> are on.
>>>>>>>>>
>>>>>>>>> If you are able to send rtstats out to us, then we can monitor
>>>>>>>>> stats on
>>>>>>>>> our web pages.
>>>>>>>>> Your network might require a static route be added for sending
>>>>>>>>> that outside your domain (that would be something your networking
>>>>>>>>> folks would know). The rtstats process sends a small text report
>>>>>>>>> about every 60 seconds, so it is not a lot of traffic.
>>>>>>>>>
>>>>>>>>> If you can't configure your host to send rtstats, then we could
>>>>>>>>> create a pqact.conf action to file the .status reports and
>>>>>>>>> calculate the latency from those.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 2007-06-20 at 12:03 -0400, Justin Cooke wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>>> Steve,
>>>>>>>>>>
>>>>>>>>>> If you provide us a pqact.conf, I can have the box Chi set up
>>>>>>>>>> feed off of ldm1 and see how its latencies are.
>>>>>>>>>>
>>>>>>>>>> Justin
>>>>>>>>>> On Jun 20, 2007, at 11:36 AM, Steve Chiswell wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> Justin,
>>>>>>>>>>>
>>>>>>>>>>> Since the change at 13Z dropping daffy.unidata.ucar.edu out of
>>>>>>>>>>> the top level nodes, the ldm2 feed to NSF is showing little/no
>>>>>>>>>>> latency at all. The ldm1 feed to NSF, which is connected using
>>>>>>>>>>> the alternate LDM mode, is only delivering the .status messages
>>>>>>>>>>> it creates, as all the other products are duplicates of products
>>>>>>>>>>> already being received from LDM2, and that is showing high
>>>>>>>>>>> latency:
>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>>>>>>
>>>>>>>>>>> This configuration is getting data out to the community at the
>>>>>>>>>>> moment.
>>>>>>>>>>> The downside here is that it puts a single point of failure at
>>>>>>>>>>> NSF in
>>>>>>>>>>> getting the data to Unidata, but
>>>>>>>>>>> I'll monitor that end.
>>>>>>>>>>>
>>>>>>>>>>> It seems that ldm1 is either slow, or it is showing network
>>>>>>>>>>> limitations (since flood.atmos.uiuc.edu is feeding from ncepldm,
>>>>>>>>>>> which is apparently pointing to ldm1, there is load on ldm1
>>>>>>>>>>> besides the NSF feed). LDM2 is feeding both NSF and
>>>>>>>>>>> idd.aos.wisc.edu (and Wisc looks good since 13Z as well), so it
>>>>>>>>>>> is able to handle the throughput to 2 downstreams, but adding
>>>>>>>>>>> daffy as the 3rd seems to cross some point in volume of what can
>>>>>>>>>>> be sent out.
>>>>>>>>>>>
>>>>>>>>>>> Steve
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>>>>> Thanks Steve,
>>>>>>>>>>>>
>>>>>>>>>>>> Chi has set up a box on the LAN for us to run LDM on; I am
>>>>>>>>>>>> beginning to get things running on there.
>>>>>>>>>>>>
>>>>>>>>>>>> Have you seen any improvement since dropping daffy?
>>>>>>>>>>>>
>>>>>>>>>>>> Justin
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                         
>>>>>>>>>>>>> Justin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, this does appear to be the case. I will drop daffy from
>>>>>>>>>>>>> feeding
>>>>>>>>>>>>> directly and instead move it to feed from NSF. That will
>>>>>>>>>>>>> remove one
>>>>>>>>>>>>> of the top level relays of data having to go out of NCEP and
>>>>>>>>>>>>> we can see if the other nodes show an improvement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 20 Jun 2007, Justin Cooke wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                           
>>>>>>>>>>>>>> Steve,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did you see a slowdown to ldm2 after Pete and the other sites
>>>>>>>>>>>>>> began
>>>>>>>>>>>>>> making connections?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Chi, considering Steve saw a good connection to ldm1 before
>>>>>>>>>>>>>> the other sites connected, doesn't that point toward a
>>>>>>>>>>>>>> network issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All of our queue processing on the diskserver has been
>>>>>>>>>>>>>> running without any problems, so I don't believe anything on
>>>>>>>>>>>>>> that system would be impacting ldm1/ldm2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>> I set up the test LDM server for the NCEP folks to test the
>>>>>>>>>>>>>>> local pull from the LDM servers. That should tell us whether
>>>>>>>>>>>>>>> this is a network or system related issue. We'll handle that
>>>>>>>>>>>>>>> tomorrow. I am a little bit concerned that the slowdown
>>>>>>>>>>>>>>> occurred at the same time as the ldm1 crash last week.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, can NCEP check if there are any bad dbnet queues on
>>>>>>>>>>>>>>> the backend servers? Just to verify.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                               
>>>>>>>>>>>>>>>> Thanks Justin,
>>>>>>>>>>>>>>>> I also had a typo in my message: ldm1 is running slower
>>>>>>>>>>>>>>>> than ldm2. Now if the feed to ldm2 all of a sudden slows
>>>>>>>>>>>>>>>> down when Pete and other sites add a request to it, it
>>>>>>>>>>>>>>>> would really signal some sort of total bandwidth limitation
>>>>>>>>>>>>>>>> on the I2 connection. Seemed a little coincidental that we
>>>>>>>>>>>>>>>> had a short period of good connectivity to ldm1 after which
>>>>>>>>>>>>>>>> it slowed way down.
>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>> On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                                 
>>>>>>>>>>>>>>>>> I just realized the issue. When I disabled the "pqact"
>>>>>>>>>>>>>>>>> process on ldm2 earlier today it caused our monitor script
>>>>>>>>>>>>>>>>> (in cron, every 5 min) to kill the LDM and restart it. I
>>>>>>>>>>>>>>>>> have removed the check for the pqact in that monitor...
>>>>>>>>>>>>>>>>> things should be a bit better now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>                                   
>>>>>>>>>>>>>>>>>> Huh, I thought you guys were on the system. Let me take
>>>>>>>>>>>>>>>>>> a look on ldm2 and see what is going on.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Justin Cooke wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>                                     
>>>>>>>>>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>                                       
>>>>>>>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>                                         
>>>>>>>>>>>>>>>>>>>>> Pete and David,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I changed the CONDUIT request lines at NSF and Unidata
>>>>>>>>>>>>>>>>>>>>> to request data from ldm1.woc.noaa.gov rather than
>>>>>>>>>>>>>>>>>>>>> ncepldm.woc.noaa.gov after seeing lots of
>>>>>>>>>>>>>>>>>>>>> disconnect/reconnects to the ncepldm virtual name.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The LDM appears to have caught up here as an interim
>>>>>>>>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Still don't know the cause of the problem.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>                                           
>>>>>>>>>>>>>>>>>>>> I know NCEP was stopping and starting the LDM service
>>>>>>>>>>>>>>>>>>>> on the ldm2 box, where the VIP address is pointed at
>>>>>>>>>>>>>>>>>>>> this time. How is the current connection to ldm1? Is
>>>>>>>>>>>>>>>>>>>> the speed of the CONDUIT feed acceptable?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>                                         
>>>>>>>>>>>>>>>>>>> Chi, NCEP has not restarted the LDM on ldm2 at all
>>>>>>>>>>>>>>>>>>> today. But looking at the logs, it appears to be dying
>>>>>>>>>>>>>>>>>>> and getting restarted by cron.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I will watch and see if I see anything.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>                                       
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Chi Y. Kang
>>>>>>>>>>>>>>> Contractor
>>>>>>>>>>>>>>> Principal Engineer
>>>>>>>>>>>>>>> Phone: 301-713-3333 x201
>>>>>>>>>>>>>>> Cell: 240-338-1059
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                               
>>>>>>>>>>> --
>>>>>>>>>>> Steve Chiswell <address@hidden>
>>>>>>>>>>> Unidata
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>> --
>>>>>>>>> Steve Chiswell <address@hidden>
>>>>>>>>> Unidata
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                   
>>> --
>>> Chi Y. Kang
>>> Contractor
>>> Principal Engineer
>>> Phone: 301-713-3333 x201
>>> Cell: 240-338-1059
>>>
>>>       
>> -- End of excerpt from Steve Chiswell
>>     
>
>
>   


-- 
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059