Stateful failover requires that the two units communicate with each other whenever a session is established or terminated. The protocol and exact semantics will vary among the products, but the standby unit must keep track of the entire session table, as maintained in the active unit, and keep it up to date on a continuous basis. When the active unit fails, the standby unit must know the load on each server, have an accurate copy of the entire session table, and be able to maintain session persistence as necessary.
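As a minimal sketch of the idea above, the following Python model mirrors every session add and delete from the active unit to the standby. The names `SessionKey`, `LoadBalancer`, `open_session`, `close_session`, and `apply_update` are hypothetical and do not correspond to any product's protocol; a real implementation would send these updates over a dedicated link rather than by direct method call.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class SessionKey:
    """Identifies one client session on a VIP (illustrative fields only)."""
    client_ip: str
    client_port: int
    vip: str
    vip_port: int


@dataclass
class LoadBalancer:
    name: str
    sessions: Dict[SessionKey, str] = field(default_factory=dict)  # key -> real server
    peer: Optional["LoadBalancer"] = None

    def open_session(self, key: SessionKey, server: str) -> None:
        # Record the session locally, then tell the peer about it.
        self.sessions[key] = server
        if self.peer is not None:
            self.peer.apply_update("add", key, server)

    def close_session(self, key: SessionKey) -> None:
        # Remove the session locally, then tell the peer to remove it too.
        self.sessions.pop(key, None)
        if self.peer is not None:
            self.peer.apply_update("del", key, None)

    def apply_update(self, op: str, key: SessionKey, server: Optional[str]) -> None:
        # Called on the standby unit for each change made on the active unit.
        if op == "add":
            self.sessions[key] = server
        else:
            self.sessions.pop(key, None)

    def server_load(self) -> Dict[str, int]:
        # The per-server load can be derived from the session table itself.
        load: Dict[str, int] = {}
        for server in self.sessions.values():
            load[server] = load.get(server, 0) + 1
        return load
```

Because the standby applies every update, it can compute the load on each server from its own copy of the table at the moment it takes over.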
Providing stateful failover is very complicated when the load balancers are performing delayed binding for URL, cookie, or SSL session ID based switching. Because the sequence or ACK numbers are modified in each request and reply packet, the standby unit must be updated after each packet for the correct sequence and ACK number count to ensure correct stateful failover. This can create a lot of overhead. In order to provide stateful failover for SSL session ID based switching, the standby unit must be updated with the SSL session ID table whenever there is a change to the table. When the standby unit takes over, it must be able to associate an SSL session ID with the correct server to ensure persistence.
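To illustrate the kind of per-connection state involved, here is a hedged sketch of sequence-number rewriting under delayed binding. It assumes the load balancer answers the client's SYN with its own initial sequence number, later opens a connection to the real server (which picks a different one), and does not change payload lengths, so the rewrite reduces to a fixed offset modulo 2^32. The class `DelayedBinding` and its field names are hypothetical; what a given product actually tracks and replicates may differ.

```python
from dataclasses import dataclass

MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


@dataclass
class DelayedBinding:
    lb_isn: int      # initial sequence number the load balancer sent to the client
    server_isn: int  # initial sequence number the real server chose later

    @property
    def offset(self) -> int:
        # Difference between the two sequence spaces, modulo 2^32.
        return (self.lb_isn - self.server_isn) % MOD

    def server_to_client_seq(self, seq: int) -> int:
        # Rewrite a sequence number in a server reply before forwarding to the client.
        return (seq + self.offset) % MOD

    def client_to_server_ack(self, ack: int) -> int:
        # Rewrite the ACK number in a client packet before forwarding to the server.
        return (ack - self.offset) % MOD
```

Under these assumptions, a standby unit that lacks the binding state for a connection cannot rewrite packets consistently after taking over, which is why this state must be replicated along with the session entry.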
Stateful failover is a great high-availability feature because it not only allows us to recover from the failure of a load balancer, but also causes no interruption to any of the active connections. Stateful failover matters more for some applications than for others. In general, it provides more benefit for applications that use long-lived connections. For example, streaming-video connections stay open for as long as it takes us to watch the video stream. HTTP connections, by contrast, are typically very short lived, because the browser makes one or more HTTP requests in one TCP connection and then closes the connection. Depending on the product, you may be able to enable stateful failover only for specific applications on the VIP, as opposed to all applications, in order to use the load-balancer resources efficiently.
Stateful failover can affect both the performance of the load balancer and the network design. The load balancers must communicate with each other to synchronize session-table updates, and this is additional work for them. A load-balancing product may restrict how far apart the two units can be located. It’s a good idea to connect the load balancers over the shortest path possible, to ensure minimal latency for any communication between them. In active-active configurations, stateful failover also affects session-table capacity and utilization, because each unit must hold its peer's sessions as well as its own. If load balancer 1 has 50,000 active sessions and load balancer 2 has 100,000 active sessions, each load balancer will use 150,000 session-table entries to track all of the active sessions when performing stateful failover.
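The session-table arithmetic can be checked with a few lines of Python. The function names `entries_per_unit` and `fits` and the capacity figures used with them are illustrative assumptions, not values from any product.

```python
def entries_per_unit(active_sessions):
    """With stateful failover, every unit tracks the sessions of all units.

    active_sessions maps a load-balancer name to its own active session count.
    """
    return sum(active_sessions.values())


def fits(active_sessions, table_capacity):
    """Return True if each unit's session table can hold the combined load."""
    return entries_per_unit(active_sessions) <= table_capacity


sessions = {"lb1": 50_000, "lb2": 100_000}
print(entries_per_unit(sessions))  # 150000 entries on each unit
```

A unit sized only for its own 100,000 sessions would therefore be too small once stateful failover is enabled in this example.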
So far, we have discussed only the case in which each load balancer has one VIP, but, in fact, we may have multiple VIPs on each load balancer. The load balancers may negotiate which VIP is active on which load balancer, or require the network administrator to configure this. Since each VIP represents a certain amount of load on the load balancer, it’s important to configure this properly to distribute load evenly among the load balancers. Further, depending on the network design and topology, it may make sense for certain VIPs to be served by one load balancer rather than the other. For example, if load balancer 1 loses connectivity to the real servers for VIP10, it’s better to selectively fail over only VIP10 to load balancer 2.
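The selective failover in the VIP10 example can be sketched as a small planning function. This is an illustration only: `plan_vip_owners` and its inputs are hypothetical, and real products decide ownership through their own negotiation or configuration mechanisms.

```python
def plan_vip_owners(owners, reachable):
    """Move only the VIPs whose current owner cannot reach their real servers.

    owners:    dict mapping VIP name -> load balancer currently active for it
    reachable: dict mapping load balancer -> set of VIPs whose real servers
               that load balancer can currently reach
    """
    new_owners = {}
    for vip, lb in owners.items():
        if vip in reachable.get(lb, set()):
            # The current owner still reaches the servers: leave the VIP alone.
            new_owners[vip] = lb
            continue
        # Pick another load balancer that can reach the servers, if there is one.
        candidates = sorted(
            other for other, vips in reachable.items()
            if other != lb and vip in vips
        )
        new_owners[vip] = candidates[0] if candidates else lb
    return new_owners


owners = {"VIP10": "lb1", "VIP20": "lb1", "VIP30": "lb2"}
reachable = {"lb1": {"VIP20"}, "lb2": {"VIP10", "VIP20", "VIP30"}}
print(plan_vip_owners(owners, reachable))
```

Here only VIP10 moves to load balancer 2; VIP20 stays on load balancer 1, so its sessions are not disturbed.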
When a load balancer fails, the other unit takes over immediately. But, what if the failed load balancer is repaired and comes back? When using stateful failover, it will take some time for the recovered unit to synchronize all session information from scratch. The unit can be considered fully recovered only after the synchronization is complete. When not using stateful failover, moving VIPs from one load balancer to another causes disruption by terminating all sessions. It’s nice to have the recovered load balancer take over all VIPs it previously owned because this provides better load-balancer scalability. But the administrator may want to control when this happens to avoid the disruption of losing all active sessions for those VIPs.
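The recovery rules in this paragraph can be summarized as a small decision function. The name `may_fail_back` and its parameters are hypothetical; the sketch simply encodes the two cases described above and is not any vendor's failback logic.

```python
def may_fail_back(stateful, sync_complete, admin_approved=False):
    """Decide whether a recovered unit should take back the VIPs it used to own.

    stateful:       True if stateful failover is in use
    sync_complete:  True once the recovered unit has the full session table
    admin_approved: True once the administrator allows the move
    """
    if stateful:
        # Sessions survive the move, but only after the recovered unit
        # has finished synchronizing all session information.
        return sync_complete
    # Without stateful failover the move drops all sessions on those VIPs,
    # so wait until the administrator chooses the moment.
    return admin_approved
```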
High-Availability Design Options
In this section, let’s go through the evolution of high-availability network designs and consider the benefits and issues for each design.
Let’s start with a simple design of one router and one load balancer with directly attached servers, as shown in Figures 4.1 and 4.2. To tolerate load-balancer failure, we introduce two load balancers in an active-standby configuration, as shown in Figure 4.8. For simplicity, we split the servers between the two load balancers in this approach. The VIP is bound to all the real servers. If the standby unit blocks all traffic, the servers connected to it will not be available to the active unit for load balancing, so only half of the servers can be used. Further, if a load balancer fails, we also lose all the servers connected to that load balancer. This is the biggest limitation of this design.