Lesson learned from an NLB connection timeout | by Jian Li | Expedia Group Technology

How we fixed an issue identified by internal users of our API

Image by James Harrison on Unsplash

We recently received a ticket from one of our customers stating that they kept getting timeouts when trying to connect to our service through the API Gateway. This turned out to be a tricky issue, but our duty engineer was able to identify the root cause and implement a fix. This article explains the issue and the mitigation. We believe a similar issue can occur with other services that use a Network Load Balancer behind an API Gateway, so we would like to share our experience and save others the same investigation.

The story begins with a ticket created by one of our customers, claiming that frequent timeouts occurred when calling our API. The following conceptual diagram illustrates how our API is exposed.

Client calls go through the API Gateway and NLB to reach our service in ECS

As shown above, a customer-initiated API call is first sent to the API Gateway before reaching the AWS Network Load Balancer and then our service in ECS. The API Gateway is managed by the platform team and handles authentication and authorization for us. The AWS Network Load Balancer distributes traffic to our ECS task. Note that the actual routing path is more complicated; this illustration is simplified to focus only on the problem.

Once the ticket was acknowledged, we checked the service status and metrics in ECS. The status checks looked fine, the service was returning HTTP 200 responses, and everything seemed to be working normally. Our customer had also filed a ticket with the API Gateway team, so we had a brief sync with them. The API Gateway logs showed that it was timing out waiting on our service and returning a 502 to clients. We started wondering: was the NLB messing things up?

Luckily, another team that ran into a similar issue had already written a blog post about it [1]: the NLB silently closed idle TCP connections, causing timeouts on their Tomcat server. Our monitoring indicated that the same thing was happening to our service. Every API call timeout seemed to coincide with at least one “load balancer reset count” and one “client reset count”. These two values correspond to the AWS Load Balancer metrics TCP_ELB_Reset_Count and TCP_Client_Reset_Count, which record the number of RST packets sent by the load balancer (LB) and by the LB client, which in our case is the API Gateway.

TCP_Client_Reset_Count
----------------------
The total number of reset (RST) packets sent from a client to a target. These resets are generated by the client and forwarded by the load balancer.
Reporting criteria: Always reported.
Statistics: The most useful statistic is Sum.
Dimensions: LoadBalancer | AvailabilityZone, LoadBalancer

TCP_ELB_Reset_Count
-------------------
The total number of reset (RST) packets generated by the load balancer.
Reporting criteria: Always reported.
Statistics: The most useful statistic is Sum.
Dimensions: LoadBalancer | AvailabilityZone, LoadBalancer

Excerpts from AWS documentation of Network Load Balancer metrics, see https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-cloudwatch-metrics.html
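
As an aside, these reset metrics can also be pulled programmatically if you want to correlate them with timeout reports. The sketch below is only an illustration using the AWS SDK for Java v2, not something from our codebase, and the LoadBalancer dimension value is a made-up placeholder.

import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class NlbResetCountCheck {
    public static void main(String[] args) {
        // Sum TCP_ELB_Reset_Count over the last hour in 5-minute buckets.
        // "net/my-nlb/1234567890abcdef" is a placeholder for the real LoadBalancer dimension value.
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            Instant now = Instant.now();
            GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/NetworkELB")
                    .metricName("TCP_ELB_Reset_Count")
                    .dimensions(Dimension.builder()
                            .name("LoadBalancer")
                            .value("net/my-nlb/1234567890abcdef")
                            .build())
                    .startTime(now.minus(Duration.ofHours(1)))
                    .endTime(now)
                    .period(300)
                    .statistics(Statistic.SUM)
                    .build();
            cloudWatch.getMetricStatistics(request).datapoints()
                    .forEach(dp -> System.out.println(dp.timestamp() + " resets=" + dp.sum()));
        }
    }
}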

At first glance we thought it might be an NLB bug, but after checking the official NLB documentation [3], we realized it was designed to behave this way:

For each TCP request that a client makes through a Network Load Balancer, the state of that connection is tracked. If no data is sent over the connection by the client or target for longer than the idle timeout, the connection is closed. If a client or target sends data after the idle timeout expires, it receives a TCP RST packet to indicate that the connection is no longer valid.

Elastic Load Balancing sets the idle timeout value for TCP streams to 350 seconds. You cannot change this value. Clients or targets can use TCP keepalive packets to reset the idle timeout.

So whenever a TCP connection was idle for 350 seconds, the NLB silently closed it without notifying either side, and only sent an RST when data was later sent on the closed connection. It finally became clear why our customers were seeing timeouts! The sequence of events unfolds as follows.

  1. NLB silently closes the connection when it reaches the idle timeout threshold.
  2. The client continues to call the API without knowing that the NLB has closed the connection.
  3. The API Gateway sends traffic to a closed connection.
  4. NLB returns the RST packet to tell the API Gateway to use a new connection.
  5. The API Gateway times out on the closed connection and returns 502 Bad Gateway to the client.
  6. Meanwhile, the downstream ECS thinks everything is fine.

Sequence of events after the connection is silently closed by the NLB
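
To make the sequence above concrete, here is a rough sketch of the failure mode at the raw socket level. It is purely illustrative: the hostname and port are hypothetical placeholders for a plaintext listener reachable through the NLB, and it assumes no keepalives are sent while the connection sits idle.

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class IdleTimeoutRepro {
    public static void main(String[] args) throws Exception {
        // "nlb.example.internal:80" is a placeholder for a plaintext listener behind the NLB.
        try (Socket socket = new Socket("nlb.example.internal", 80)) {
            // Steps 1-2: the connection sits idle past the NLB's 350-second limit,
            // so the NLB silently drops its state for this flow.
            Thread.sleep(360_000L);

            // Steps 3-5: the next request travels over the dead connection; the NLB answers
            // with an RST, which typically surfaces here as a "Connection reset" error.
            OutputStream out = socket.getOutputStream();
            out.write("GET /health HTTP/1.1\r\nHost: nlb.example.internal\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            int firstByte = socket.getInputStream().read();
            System.out.println("Unexpectedly got a response byte: " + firstByte);
        } catch (IOException e) {
            System.out.println("Call after the idle timeout failed: " + e.getMessage());
        }
    }
}

Any RST the NLB generates this way is what shows up in the TCP_ELB_Reset_Count metric discussed above.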

Once we identified the root cause, the best and easiest solution would have been to decrease the TCP keepalive interval [4] to a value below the NLB idle timeout of 350 seconds, so that the connection is refreshed before the NLB silently closes it. Unfortunately, our service resides in an AWS shared account, which means we don’t have permission to modify the keepalive settings at the operating system level.
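
For completeness, this is roughly what the keepalive idea looks like when it can only be expressed per socket rather than through OS settings. It is a sketch of the concept, not something we shipped; it assumes Java 11+ on Linux, where jdk.net.ExtendedSocketOptions exposes the keepalive timers, and the values are placeholders chosen to stay below the 350-second idle timeout.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class KeepaliveSketch {
    // Illustration only: enable keepalive probes on an accepted connection so the NLB's
    // idle timer is reset well before it reaches 350 seconds. The values are placeholders.
    static void configureKeepalive(Socket socket) throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);   // send TCP keepalive probes
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 300);    // first probe after 300s idle
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // then re-probe every 60s
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            Socket connection = server.accept();
            configureKeepalive(connection);
            // ... handle the connection as usual ...
        }
    }
}

Either way, the principle is the same as the OS-level route: make sure some traffic or probe crosses the NLB before the 350-second timer expires.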

Therefore, we adopted the alternative solution proposed by the team that had encountered the same problem before: instead of letting the NLB close the connection, we instruct our service in ECS to proactively close the TCP connection after a smaller idle timeout, so that the API Gateway stops using the original connection.

Now, if you’re using a Tomcat server, which Spring Boot uses by default, you’re in luck, because there is a handy “server.connection-timeout” configuration property for exactly this use case. All you have to do is set this property to a lower value, and your problem is solved. For us, however, this is where the struggle started. Our service is built on Reactor Netty, which Spring Boot uses by default for reactive (WebFlux) applications, and Reactor Netty does not offer a similar property out of the box. We ended up using IdleStateHandler [5] to implement the idle timeout ourselves, which works like this:

Triggers an IdleStateEvent when a Channel has not performed read, write, or both operations for a while.

We use a WebServerFactoryCustomizer to customize the connection channel by adding a ChannelInitializer, which adds the IdleStateHandler to the channel pipeline. The code looks like this:

factory.addServerCustomizers(http -> http.tcpConfiguration(
        tcp -> tcp.bootstrap(bs -> bs.childHandler(new ChannelInitializer<>() {
            @Override
            protected void initChannel(Channel channel) throws Exception {
                channel.pipeline().addLast(
                        // Close the channel once it has seen no reads and no writes for
                        // `timeout` milliseconds, so that our service, not the NLB,
                        // terminates idle connections.
                        new IdleStateHandler(0, 0, timeout, TimeUnit.MILLISECONDS) {
                            private final AtomicBoolean closed = new AtomicBoolean();

                            @Override
                            protected void channelIdle(ChannelHandlerContext ctx, IdleStateEvent evt) {
                                if (closed.compareAndSet(false, true)) {
                                    ctx.close();
                                }
                            }
                        });
            }
        }))));
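
For context, here is roughly how the snippet above can be wired into the WebServerFactoryCustomizer mentioned earlier; the configuration class and bean name are illustrative rather than taken from our codebase.

import org.springframework.boot.web.embedded.netty.NettyReactiveWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class IdleTimeoutConfig {

    // Spring Boot applies this customizer to the Reactor Netty server it creates,
    // which is where the snippet above plugs in.
    @Bean
    public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> idleTimeoutCustomizer() {
        return factory -> factory.addServerCustomizers(http -> http.tcpConfiguration(
                // ... the ChannelInitializer / IdleStateHandler customization shown above ...
                tcp -> tcp
        ));
    }
}

The AtomicBoolean guard in the handler simply ensures the channel is closed only once, even if several idle events fire.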

Things were finally back on track! We ran local tests, pushed the code, and deployed it to the test environment. Then a new problem appeared: the ChannelInitializer ended up overwriting the SSL configuration! The channel pipeline is expected to include an SSL handler at the front, but because we overrode initChannel, the SSL setup defined in the Spring Boot properties file was no longer applied to the pipeline. What should we do? Well, let’s add the SSL handler back and see if that fixes the problem.

// Rebuild the SSL context from the keystore and put the SSL handler at the front of the pipeline.
char[] keyStorePassword = keyPassword.toCharArray();
KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
keyStore.load(new FileInputStream(ksFile), keyStorePassword);
SslContextBuilder builder = SslContextBuilder.forServer(
        (PrivateKey) keyStore.getKey(keyAlias, keyStorePassword),
        new X509Certificate[]{(X509Certificate) keyStore.getCertificateChain(keyAlias)[0]});
channel.pipeline().addFirst(builder.build().newHandler(channel.alloc()));

// Then add the idle timeout handler, exactly as before.
channel.pipeline().addLast(
        new IdleStateHandler(0, 0, timeout, TimeUnit.MILLISECONDS) {
            private final AtomicBoolean closed = new AtomicBoolean();

            @Override
            protected void channelIdle(ChannelHandlerContext ctx, IdleStateEvent evt) {
                if (closed.compareAndSet(false, true)) {
                    ctx.close();
                }
            }
        });

Finally, things worked! We tested the fix in the test environment by making hundreds of API calls and saw no more timeouts. The load balancer reset count also dropped to zero. The NLB mystery was solved at last! The following figure shows the number of NLB resets before and after the fix.

NLB reset count before and after our fix

When we use a Layer 4 load balancer, we need to pay special attention to the different timeouts at each layer; focusing only on Layer 7 configurations can cause unexpected errors. The NLB reset count metrics are easy to overlook when we concentrate on the health of our own infrastructure. In fact, the problem explained in this article would not occur if the NLB were replaced by a more traditional load balancer such as the Classic Load Balancer (ELB) or the Application Load Balancer (ALB). But making that change would mean giving up the many benefits that the NLB brings, such as better performance than ELB/ALB, support for containerized applications, and more, as detailed in the AWS documentation [6]. This illustrates how we often have to weigh the trade-offs between different technology options and make the best judgment call while building complex systems on an unreliable, distributed cloud.
