Dealing with Reconnection Storms: Two Strategies
It is a big challenge to roll out a new release without causing a Reconnection Storm when dealing with thousands of active WebSocket connections across a group of servers.
Reconnection Storm
A reconnection storm is a situation where a server experiences a sudden, overwhelming surge of incoming connection requests caused by clients reconnecting. It can be triggered when a large number of clients try to reconnect simultaneously, for example after a network outage or a new release.
Problem Statement
WebSockets provide a persistent connection between the client and server, enabling real-time communication. These connections are often distributed across multiple servers (typically load-balanced). When a new release is rolled out, all existing WebSocket connections must reconnect to servers running the new version. If not managed correctly, this can lead to a reconnection storm, overwhelming the system and degrading the user experience.
Draining Strategies
📣 Funnel (Graceful Connection Closure): A methodical way to gracefully close connections at the application level. Clients are disconnected in controlled batches, say groups of 1,000 clients every 3 seconds. This gradual disconnection prevents a sudden surge of reconnections and gives clients a smoother migration to the new server release. You may have heard different names for a similar pattern.
🐾 Staggered Deployment: Very similar to the Funnel, but instead of implementing the throttling inside the server, we limit the number of connections being drained at any one time to the number of connections held by a single instance. New servers are rolled out gradually rather than all at once. By introducing a small percentage of the servers into production at a time and closely monitoring their performance, you can catch issues early without impacting all clients simultaneously. Each server is replaced with the new version, and after a certain amount of time has passed, we continue with the next one. For instance, say we have 4 servers, each serving 50K active connections on average. When we replace servers one at a time, only 50K clients have to reconnect at once. Doing it gradually ensures there won't be a sudden surge of activity.
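To put numbers on the staggered approach, here is a minimal sketch in Python. It reuses the 4 servers and 50K connections from the example; the server names, the `plan_staggered_rollout` helper, and the 300-second pause between replacements are assumptions made for illustration, not part of any real deployment tool.

```python
from dataclasses import dataclass


@dataclass
class RolloutStep:
    server: str
    start_at_s: int            # seconds since the rollout began
    reconnecting_clients: int  # clients this server sheds during the step


def plan_staggered_rollout(connections_per_server: dict[str, int],
                           pause_s: int) -> list[RolloutStep]:
    """Replace servers one at a time, waiting pause_s between replacements."""
    steps = []
    for index, (server, connections) in enumerate(connections_per_server.items()):
        steps.append(RolloutStep(server, index * pause_s, connections))
    return steps


# Numbers from the example: 4 servers, about 50K connections each.
fleet = {f"server-{n}": 50_000 for n in range(1, 5)}
plan = plan_staggered_rollout(fleet, pause_s=300)  # the 300 s pause is an assumption

for step in plan:
    print(f"t={step.start_at_s:>4}s  replace {step.server}: "
          f"{step.reconnecting_clients} clients reconnect")

peak = max(step.reconnecting_clients for step in plan)
total = sum(fleet.values())
print(f"peak reconnect wave: {peak} (vs {total} if every server were replaced at once)")
```

The point of the plan is the last line: the largest wave is one server's worth of clients rather than the whole fleet's.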
Choosing a draining strategy depends on many factors, such as client or connection behavior patterns and your deployment/release strategy. In this post, we will explore the Funnel strategy in depth and briefly share an overview of Staggered Deployment.
Funnel
The Funnel (Graceful Connection Closure) involves gradually off-boarding clients in a controlled manner to prevent a sudden drain of active connections from the currently running servers. This method ensures clients reconnect to the new release in manageable batches, maintaining system stability and performance. I tested the Funnel strategy at the application level and wrote some code for it.
Let’s imagine we want to roll out a new release and deploy new servers. In the scenario below, each 🐝 represents a WebSocket connection.
If we just start draining the current servers, we put a lot of load on the new servers because all those connections will immediately start reconnecting. Those bees are impatiently looking for a new hive.
What if we take a different approach? Instead of releasing the bees into the wild in chaos mode, we could place them in a box with a pre-defined capacity and then release them after a certain amount of time. Draining based on time, in addition to a capped limit, is a good idea because it lets us release the bees even if the box has not reached capacity.
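The "box" idea can be sketched as a small batcher that releases its contents when either the capacity or the time limit is hit, whichever comes first. This is only an illustration: the `ReleaseBox` class, its method names, and the numbers used are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Optional


class ReleaseBox:
    """Collects items and releases them when the box is full or has waited long enough."""

    def __init__(self, capacity: int, max_wait_s: float) -> None:
        self.capacity = capacity
        self.max_wait_s = max_wait_s
        self._items: list[str] = []
        self._opened_at: Optional[float] = None  # time the first item was boxed

    def add(self, item: str, now: float) -> Optional[list[str]]:
        """Box an item; return the released batch if this fills the box."""
        if not self._items:
            self._opened_at = now
        self._items.append(item)
        return self._release() if len(self._items) >= self.capacity else None

    def poll(self, now: float) -> Optional[list[str]]:
        """Return the batch if the time limit has passed, even if the box is not full."""
        if self._items and now - self._opened_at >= self.max_wait_s:
            return self._release()
        return None

    def _release(self) -> list[str]:
        batch, self._items, self._opened_at = self._items, [], None
        return batch


box = ReleaseBox(capacity=3, max_wait_s=5.0)
box.add("bee-1", now=0.0)
box.add("bee-2", now=1.0)
full_batch = box.add("bee-3", now=2.0)  # capacity reached, box is released
box.add("bee-4", now=3.0)
too_early = box.poll(now=4.0)           # only 1 s since bee-4 was boxed
timed_batch = box.poll(now=8.0)         # 5 s have passed, release a partial box
```

Here `full_batch` is released because the box filled up, while `timed_batch` holds a single bee that is released on the timer alone.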
In reality, there is no guarantee that a connection ends up on the same server unless a feature like sticky sessions is used. However, that doesn’t change the idea of funneling the 🐝 connections while shutting down the servers. Now, let’s talk about numbers and not bees!
First, I want to show you how it works in a regular application without having a graceful shutdown and draining strategy to disconnect clients. I’ve developed a very simple WebSocket server that accepts connections and a client script that opens 100 connections to the server. It exposes Prometheus metrics so we can gather some numbers.
I started a server instance and connected 100 clients; a few moments later, I released a new instance, and the old running server started disconnecting the connected clients. As you can see in the chart below, in less than 8 seconds, all connections were disconnected from the old instance and reconnected to the new instance! If you run any business logic or database queries when accepting a connection, this comes at a huge cost, and you will probably see spikes on your charts during that short window because the load is neither distributed nor chunked.
I’ve modified the server, added a graceful shutdown mechanism, and implemented logic that closes 10 connections every 3 seconds once the server receives a shutdown signal. Though 10 connections every 3 seconds is too slow for a real-world use case, I intentionally wanted wider gaps between batches.
It took roughly 28 seconds to close those 100 open connections, and the new server instance was accepting the reconnecting clients at the same time.
Note that the old server was not allowed to accept new connections anymore.
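The batch-closing logic described above might look roughly like the following sketch. It is not the code used in the experiment: `FakeConnection` is a stand-in for a real WebSocket connection, `drain_in_batches` is a hypothetical helper, and the interval is shortened so the example finishes quickly. In a real server, something like this would be triggered by the shutdown signal handler.

```python
import asyncio


class FakeConnection:
    """Stand-in for a WebSocket connection; only records that it was closed."""

    def __init__(self, conn_id: int) -> None:
        self.conn_id = conn_id
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def drain_in_batches(connections, batch_size: int, interval_s: float) -> int:
    """Close connections batch_size at a time, pausing interval_s between batches.

    Returns the number of batches that were closed.
    """
    pending = list(connections)
    batches = 0
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        await asyncio.gather(*(conn.close() for conn in batch))
        batches += 1
        if pending:  # no need to wait after the final batch
            await asyncio.sleep(interval_s)
    return batches


async def main() -> None:
    connections = [FakeConnection(i) for i in range(100)]
    # The post uses 10 connections every 3 seconds; the interval is shortened
    # here (an assumption) so the sketch runs in a fraction of a second.
    batches = await drain_in_batches(connections, batch_size=10, interval_s=0.01)
    closed = sum(conn.closed for conn in connections)
    print(f"closed {closed} connections in {batches} batches")


asyncio.run(main())
```

With the settings from the experiment (100 connections, batches of 10, 3 seconds apart), a loop like this pauses 9 times, or about 27 seconds in total, which is in line with the roughly 28 seconds observed.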
As you can see, we could intentionally slow down the draining process to prevent servers from becoming overwhelmed with a storm of (re)connections. It's crucial to consider the type of client, audience, whether the delay is acceptable, and rollout policies, as not every application can benefit from funneling. For example, your client might not try to reconnect immediately, or you might have another mechanism in the middle that can handle such a situation.
Staggered Deployment
I'd like to briefly share an example of this strategy and ask you to take your time to think about how to proceed and combine it with the Funnel strategy:
Common Considerations
When we try to drain and accept new connections, it is crucial to consider specific points applicable to both strategies:
- Before shutting down or updating an instance, put it into a `draining` state where it stops accepting new connections. Existing WebSocket connections are allowed to continue running until they close naturally or after a grace period.
- Modify your load balancer, service, or reverse proxy settings to route new connections only to the updated instances.
- Use a blue-green deployment strategy where all new connections are routed to the “green” (new) environment while the “blue” (old) environment is phased out.
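The first two points can be pictured with a toy model: an instance that refuses new connections once it is draining, and a router that only sends new connections to instances that still report healthy. Everything here is hypothetical (the `Instance` class, the `route` function, and the use of 200/503 as readiness codes); real load balancers and reverse proxies have their own configuration for this.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    SERVING = "serving"
    DRAINING = "draining"


class Instance:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.SERVING
        self.connections: set[str] = set()

    def health_status(self) -> int:
        """HTTP-style readiness code a load balancer could poll."""
        return 200 if self.state is State.SERVING else 503

    def accept(self, client_id: str) -> bool:
        """Accept a new connection unless the instance is draining."""
        if self.state is State.DRAINING:
            return False
        self.connections.add(client_id)
        return True

    def start_draining(self) -> None:
        # Existing connections stay open; only new ones are refused.
        self.state = State.DRAINING


def route(instances: list[Instance], client_id: str) -> Optional[str]:
    """Send a new connection to the first instance that reports ready."""
    for instance in instances:
        if instance.health_status() == 200 and instance.accept(client_id):
            return instance.name
    return None


blue, green = Instance("blue"), Instance("green")
first = route([blue, green], "client-a")   # blue is still serving
blue.start_draining()                      # old instance enters draining state
second = route([blue, green], "client-b")  # new connections now go to green
```

After `start_draining()`, `client-a` remains connected to the old instance while `client-b` lands on the new one, which is the behavior the draining state is meant to give you.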
Conclusion
I highly recommend not relying solely on a single strategy (Funnel) for draining connections but rather trying to combine it with other strategies, such as limiting the number of connections per instance, staggered deployment, etc.
I used WebSocket as the example, but these strategies are not specific to WebSocket servers. You get the idea!
Thanks for reading ❤️