
20050722: LDM Server Farm question



>From: Bob Lipschutz <address@hidden>
>Organization: NOAA/FSL
>Keywords: 200507221824.j6MIOTjo024590 IDD cluster

Hi Bob,

I apologize for not being responsive to emails, but we are currently
hosting an LDM workshop.

Cheers,

Tom Yoksas

Bob Lipschutz said:
>After much testing and discussion, we think we have an idea of
>what we've been seeing.  First, we discovered that our Load
>Balancer was not the problem at all, and that the condition
>happened when our clients connected directly to one of the server
>hosts.  Instead, the issue appears to be that the servers are in
>our DMZ, while our desktop clients' iptables are set to block all
>incoming DMZ connections. We believe that's why we see in our 
>client logs messages like:
>
>  requester6.c:105: No IS_ALIVE reply from upstream LDM 
>
>and messages in the server logs like:
>  
>  up6.c:287: nullproc_6() failure to xxx.fsl.noaa.gov: RPC: Timed out
>
>indicating that the server is unable to deliver reply messages to
>the downstream system.  Meanwhile, of course, the data channels
>are established and working normally. (BTW, we're running ldm 6.3
>on these Linux RHEL4 machines).  
>
>The important thing to note is that, after running like this for a
>while, the server process seems to get into a hung state in which it
>no longer accepts new connection requests.  It also doesn't seem to
>respond to kill signals (ldmadmin stop) in this state.  So, it appears
>that we have a repeatable scenario in which a client system with
>particular firewall settings can adversely affect a server and
>cause it to stop working normally.  This would seem to be a very
>undesirable feature. Previously, we had thought that the
>downstream system would not need to adjust its firewall/iptables
>configuration to allow port 388 in order to make requests, but
>that does not seem correct, given our findings.
>
>   Bob Lipschutz
>
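The finding above suggests that a downstream LDM host does need to
accept incoming TCP connections on port 388 from its upstream LDM
servers, so that the upstream's IS_ALIVE check can be answered.  A
minimal iptables sketch along those lines (the 192.0.2.x addresses are
placeholders for the actual server-farm hosts, and the rules should be
adapted to whatever rule set is already in place):

  # allow traffic on connections the downstream itself initiated
  iptables -A INPUT -p tcp --sport 388 \
           -m state --state ESTABLISHED,RELATED -j ACCEPT

  # allow the upstream LDM's IS_ALIVE checks, which arrive as new
  # inbound TCP connections to port 388 on the downstream host
  # (placeholder addresses; substitute the real server-farm hosts)
  iptables -A INPUT -p tcp -s 192.0.2.10 --dport 388 -j ACCEPT
  iptables -A INPUT -p tcp -s 192.0.2.11 --dport 388 -j ACCEPT
  iptables -A INPUT -p tcp -s 192.0.2.12 --dport 388 -j ACCEPT

With an inbound port-388 allowance like this in place, the "No
IS_ALIVE reply" and nullproc_6 RPC timeouts described above should no
longer occur for those clients.
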
>> 
>> 
>> Hi.  Based on discussions we had with you folks this spring, we
>> in the FSL ITS group are now implementing a 3-host LDM farm
>> arrangement for external distribution of MADIS and other data
>> sets.  We have a configuration that is working after a fashion,
>> but it also seems to be having some trouble.  We're hoping you
>> might have some advice.
>> 
>> The main difference between what you described for your IDD
>> servers and our implementation is that we are trying to use a
>> Foundry switch for the front end load balancing, instead of the
>> software implementation you have.
>> 
>> Unfortunately, in tests with this switch/farm arrangement,
>> we're now seeing some very bad behavior.  Here's the scenario:
>> 
>> For my testing, I'm running clients on 4 desktops, with multiple
>> request lines on each, for a total of roughly two dozen
>> connections to the three servers.
>> 
>> Initially, the switch seems to do the right thing in handing off
>> connection requests to the available (running) LDM servers,
>> distributing the connections evenly.  Data transfers to the
>> clients look normal.  Shutting down a server results, as expected,
>> in clients reconnecting on another.
>> 
>> After running for a while, though, it appears that the LDM child
>> processes on the server side die, forcing the clients connected to
>> that host to reconnect.  In the client logs, we see entries like:
>> 
>>   ERROR: requester6.c:233: Upstream LDM died 
>> 
>> It appears that the clients do successfully reconnect for a
>> while.  However, eventually, the LDM server gets into a bad
>> state, such that no clients are able to connect thereafter.  At
>> that point, netstat shows the incoming connection requests in
>> "SYN_RECV" state, and the logs report timeouts on the requests.
>> ldmping also reports timeouts, even if run on the localhost.
>> Killing and restarting ldm on the server only clears the problem
>> for a while.  We actually have seen this "hung server" behavior
>> occasionally on other (non-farm) systems, but the farm seems
>> to exacerbate the condition.
>> 
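
When the server gets into that state, a few shell-level checks make
the hang visible.  A rough sketch, assuming the LDM runs as user "ldm"
and that the server process is rpc.ldmd (as in LDM 6.x):

  # connections stuck in the TCP handshake (look for SYN_RECV) on the LDM port
  netstat -tan | grep ':388'
  # does the local LDM still answer RPC requests at all?
  ldmping localhost
  # try a normal shutdown first; on a truly hung server this may not return
  ldmadmin stop
  # as a last resort, find any leftover server processes and kill them by hand
  ps -u ldm | grep rpc.ldmd
  kill -9 <pid-of-hung-rpc.ldmd>

A server showing port-388 connections piling up in SYN_RECV while
ldmping to localhost times out matches the symptoms described above.
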
>> We've tested adjusting the session timeout period on the switch,
>> and can see that it seems to affect how quickly the problem
>> starts.
>> 
>> Interestingly (unfortunately!), the switch doesn't detect that
>> the LDM server is not properly handling connections (it's doing
>> a layer 3 health check).  So, the switch continues to direct the
>> client that has been timing out to that server.  Evidently, port 
>> 388 appears to be active as far as the switch is concerned, which 
>> seems to be all that the switch cares about.  
>> 
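
Since a layer 3 check only verifies that the host (or the TCP port)
responds, the switch cannot tell that rpc.ldmd has stopped servicing
requests.  One possible workaround is an application-level probe that
actually exercises the LDM RPC interface.  A sketch, assuming
ldmping's -t (timeout) and -i (interval, 0 meaning a single attempt)
options behave as described in ldmping(1) and that ldmping exits
nonzero on failure; the script name is illustrative:

  #!/bin/sh
  # ldm_health.sh host -- exit 0 if the LDM on "host" answers an RPC,
  # nonzero otherwise (hypothetical helper; flags assumed, see ldmping(1))
  HOST=${1:?usage: ldm_health.sh host}
  ldmping -t 10 -i 0 "$HOST" >/dev/null 2>&1

The exit status of ldmping becomes the script's exit status, so a
probe like this could be run from the switch (if it supports external
health checks) or from a cron job that pulls an unresponsive real
server out of rotation.
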
>> One other test I did was to make requests directly to one of the
>> farm servers, rather than through the virtual address.  In that
>> case, I believe the connections had no problem staying alive.
>> 
>> So, the bottom line appears to be that our LDM farm arrangement is
>> aggravating a condition that may already exist in the LDM.  We're
>> wondering if you've seen the same behavior and
>> how you've set your session timeout to work in your cluster.
>> 
>> Thanks for your assistance!
>> 
>>    Bob Lipschutz and Chris MacDermaid
>> 
--
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web.  If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.