Unlocking Resilience and Transient-fault-handling in your C# Code
In an ideal world, every operation we execute, every API we call, and every database we query would work flawlessly. Unfortunately, we live in a world where network outages, server overloads, and unexpected exceptions are common realities. To maintain robust, resilient applications, we must anticipate these mishaps. Enter retry logic.
Hope for the best, prepare for the worst.
What is Retry Logic
Retry logic is a programming pattern that helps an application recover gracefully from transient failures. It does so by repeating a failed operation a certain number of times before finally giving up and throwing an error. This simple yet powerful mechanism can be the difference between a temporary hiccup and a full-blown application failure.
Imagine this scenario: Your application is trying to fetch data from an API. Suddenly, due to a network glitch, the API is temporarily unavailable. Without retry logic, your application might immediately crash or enter an erroneous state. But with retry logic implemented, your application will instead try to fetch the data again after a short interval. If the API is still unavailable, it might wait a bit longer and then try again, repeating the process until it either succeeds or reaches a defined limit.
In C#, this logic can be implemented within a block of code using error handling and loop statements, or more effectively with the help of external libraries designed for this purpose. It is typically applied to operations that have transient errors, that is, errors that may be resolved upon subsequent attempts.
Example:
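    int retryCount = 3;
    while (retryCount > 0)
    {
        try
        {
            PerformOperation(); // the operation that may fail transiently
            retryCount = 0;     // success: leave the loop
        }
        catch (Exception)
        {
            retryCount--;
            if (retryCount <= 0) throw; // out of attempts: rethrow the last exception
        }
    }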
In this simple example, an operation is attempted three times before finally throwing the exception if it continues to fail.
Why Retry Logic
In the era of distributed systems and cloud services, network failures, timeouts, and resource contention are not uncommon. These transient failures can cause an operation to fail temporarily, but a subsequent retry may succeed. Without retry logic, your application may report a failure when a simple retry could have resolved the issue.
Whether it’s a complex enterprise application, a website, or a simple microservice, the odds are high that you’ll run into a situation where an operation doesn’t go as planned. That’s where the concept of retry logic comes into play.
A simple example could be a C# application that relies on a database connection to retrieve data. If the database service is temporarily unavailable due to network latency or a short-lived outage, the application might immediately throw an error. However, if we employ retry logic, the application could wait and retry the connection, allowing for temporary issues to resolve, and thereby avoid a potential application failure.
Another scenario could be an API call. APIs often have rate limits, and excessive calls may result in temporary blocking of the service. If the API call in your C# code doesn’t succeed the first time, retry logic can be used to make another call after a delay, thereby effectively handling rate limit issues.
As we delve into the realm of distributed systems and cloud computing, the need for retry logic becomes even more significant. A microservice might be temporarily unresponsive. A cloud resource might be momentarily unavailable. Transient network issues can cause operations to fail. Retry logic in your C# code ensures your application remains resilient and reliable in the face of these transient failures.
It’s also worth noting that the necessity of retry logic isn’t limited to network-related errors. It’s a valuable strategy for any operation that may fail at first but succeed on a subsequent attempt. This includes operations like file handling, where a file may initially be locked or inaccessible, or multi-threaded operations, where resources may be temporarily unavailable due to concurrent access.
Implementing retry logic enhances the resilience of your application, allowing it to gracefully handle temporary issues and offer a more robust and reliable service to users. It’s like teaching your code to get back on its feet, even after stumbling on an unexpected obstacle.
But remember, with great power comes great responsibility. While retry logic can be a lifesaver, incorrect use can lead to a spiral of repeated failures, consuming resources and time. Hence, understanding when and how to implement this logic effectively is crucial.
When to Implement Retry Logic
Having established the necessity of retry logic in C#, it’s essential to know when to implement this powerful strategy in your code. Implementing retry logic isn’t about plastering it all over your codebase, but rather it’s about deploying it judiciously where it matters the most.
Retry logic is best implemented in situations where transient errors or temporary conditions can cause an operation to fail. The keyword here is ‘transient’. If an operation is likely to fail repeatedly due to an unresolvable issue, retrying it multiple times will simply waste resources and compound the problem. On the other hand, if there’s a good chance that the issue will be resolved shortly (like a temporary network glitch, a brief service disruption, or a momentary resource unavailability), then retry logic is your best friend.
Here are some optimal scenarios for implementing retry logic:
- Network Operations: Network connections aren’t always stable, and temporary network issues can cause operations to fail. In cases where your C# code is making a network request, implementing retry logic can help navigate these transient network errors.
- Database Operations: If your application interfaces with a database, retry logic can handle temporary connection issues, lock contention, or momentary unavailability.
- API Calls: APIs often have usage limits, and you might temporarily get blocked if you exceed them. Retry logic can handle this by retrying after a delay.
- Cloud Services: Cloud services may occasionally suffer from short disruptions. For example, during deployments or updates, services might briefly become unavailable. If your C# application interfaces with cloud services, implementing retry logic ensures that these transient disruptions don’t result in application failure.
- File Operations: File access can fail due to temporary conditions like the file being locked by another process. Retry logic can help your code wait and retry, allowing for the condition to be resolved.
Remember, while retry logic can mitigate the impact of transient issues, it’s not a magic bullet. Not all errors should be retried, and not all retries will eventually succeed. It’s crucial to understand the nature of the operations you’re working with, the errors they may throw, and the feasibility of a successful retry.
Here are some situations where retry logic might not be beneficial or could even worsen the problem:
- Non-transient faults: If the problem isn’t going to resolve itself after a short period (e.g., a SQL syntax error or a missing file), retrying the operation will only waste resources and delay error handling.
- Long operations: If an operation naturally takes a long time to complete, implementing retry logic could result in a significantly longer wait time, especially if the operation keeps failing.
- High-frequency operations: For operations that happen at a very high frequency, retrying on failure might overwhelm the system, leading to more failures.
- Uncertain outcome operations: If an operation might have succeeded despite throwing an error (e.g., an API might have processed a request but failed to send a success response), retrying could result in unintended consequences such as double-charging in a payment system.
No hope, no retry.
Components of Retry Logic
Retry logic is not just a single ‘try again’ command. It’s a system comprising several interconnected components, each playing a crucial role in ensuring the logic works efficiently and effectively. Here are the key components:
- Retry policy: This is the set of rules determining when to retry an operation. It usually includes the maximum number of retry attempts and the conditions under which retries should occur. For example, we may decide to retry only on certain types of exceptions, like network-related ones, and avoid retrying on others, like those related to business logic.
- Delay strategy: This defines the wait period between retries. A common practice is to use an exponential backoff delay strategy, where the wait time doubles with each subsequent retry. This helps to avoid overwhelming a struggling system with continuous retry attempts.
- Fallback mechanism: A fallback mechanism is what the application does if all retry attempts fail. This could be returning a default value, showing an error message to the user, logging the error for later analysis, or even triggering a circuit breaker if you’re in a distributed system.
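The three components above can be sketched as parameters of a single helper. This is a minimal illustration, not a production implementation: the names RetrySketch, shouldRetry, delayFor, and fallback are hypothetical and chosen only for this example.

```csharp
using System;
using System.Threading.Tasks;

public static class RetrySketch
{
    // shouldRetry + maxAttempts = retry policy
    // delayFor                  = delay strategy (attempt is 1-based)
    // fallback                  = fallback mechanism once all attempts are used up
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        int maxAttempts,
        Func<Exception, bool> shouldRetry,
        Func<int, TimeSpan> delayFor,
        Func<T> fallback)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (shouldRetry(ex))
            {
                // Exceptions rejected by shouldRetry are not caught and propagate immediately.
                if (attempt == maxAttempts)
                {
                    break; // out of attempts: fall through to the fallback
                }
                await Task.Delay(delayFor(attempt));
            }
        }
        return fallback();
    }
}
```

A caller might pass `ex => ex is TimeoutException` as the policy, `attempt => TimeSpan.FromSeconds(attempt)` as the delay, and `() => null` or a cached value as the fallback.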
Delay Strategies for Retry Logic
There’s more than one way to skin a cat, and that holds true for implementing retry logic as well. In fact, the strategy you adopt for your retry logic can greatly affect how well it mitigates transient issues and contributes to the resilience of your C# application.
- Simple Retry: This is the most basic strategy and involves simply retrying an operation a fixed number of times when it fails.
- Exponential Backoff: As we saw in the previous section, this strategy introduces a delay between retries, and that delay increases exponentially after each failed retry. This is particularly useful in scenarios where repeated, rapid retries could compound the problem rather than resolve it.
- Incremental Backoff: Similar to exponential backoff, this strategy introduces a delay between retries, but the delay increases linearly, rather than exponentially. This can be useful when you want to avoid the potentially long waits of exponential backoff, but still want some delay between retries.
- Randomized Exponential Backoff: This strategy adds a degree of randomness to the exponential backoff delay, helping to avoid a scenario where multiple instances of an application all retry at the same time, causing a stampede effect.
- Circuit Breaker: This advanced strategy involves ‘opening’ a circuit breaker when repeated retries fail, which stops all further attempts for a specified period. This can help to avoid overwhelming a failing service with repeated retries.
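The first four strategies differ only in how the wait before each retry is computed. The sketch below shows one way to express them as small functions. The class name Delays and its method names are hypothetical, and the "full jitter" variant shown is just one of several ways to randomize a backoff.

```csharp
using System;

public static class Delays
{
    // attempt is 1-based: attempt = 1 is the wait before the first retry.

    // Simple retry with a constant pause between attempts.
    public static TimeSpan Fixed(TimeSpan baseDelay, int attempt) => baseDelay;

    // Incremental (linear) backoff: 1x, 2x, 3x, ...
    public static TimeSpan Incremental(TimeSpan baseDelay, int attempt) =>
        TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * attempt);

    // Exponential backoff: 1x, 2x, 4x, 8x, ...
    public static TimeSpan Exponential(TimeSpan baseDelay, int attempt) =>
        TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));

    // Randomized exponential backoff: a random wait between zero and the exponential delay.
    public static TimeSpan ExponentialWithJitter(TimeSpan baseDelay, int attempt, Random random) =>
        TimeSpan.FromMilliseconds(random.NextDouble() * Exponential(baseDelay, attempt).TotalMilliseconds);
}
```

With a base delay of one second, the incremental strategy waits 1, 2, 3, and 4 seconds before the first four retries, while the exponential strategy waits 1, 2, 4, and 8 seconds.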
The Circuit Breaker strategy is a design pattern that’s often used in microservices architecture to prevent an application from continually attempting to execute an operation that’s likely to fail.
Here’s a simple overview of how it works:
- Closed State: The Circuit Breaker starts in a closed state. In this state, requests to a remote service or resource are allowed to go through. The Circuit Breaker monitors the requests for failures (exceptions or defined error conditions).
- Open State: If the number of failures breaches a specified threshold within a given period, the Circuit Breaker trips, and it goes into an open state. In this state, any requests to the service are automatically blocked for a certain period (the “reset timeout”), and an error is returned immediately without any network call. This allows the failing service some time to recover and prevents the application from being choked by continuous failing requests.
- Half-Open State: After the reset timeout has elapsed, the Circuit Breaker goes into a half-open state. In this state, it allows a limited number of test requests to pass through. If these requests succeed, the Circuit Breaker assumes the problem with the service is fixed and goes back to the closed state. If the test requests fail, the Circuit Breaker returns to the open state and blocks requests for another timeout period.
The Circuit Breaker pattern can help to make an application more resilient and prevent it from getting stuck trying to perform an operation that’s likely to fail, thus improving its overall stability and functionality.
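The three states can be sketched as a small state machine. This is a deliberately simplified illustration under stated assumptions: the type SimpleCircuitBreaker is hypothetical, it is not thread-safe, it counts consecutive failures rather than failures within a time window, and it allows a single trial call in the half-open state. A library such as Polly handles these details more carefully.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

public sealed class SimpleCircuitBreaker
{
    private readonly int failureThreshold;
    private readonly TimeSpan resetTimeout;
    private readonly Func<DateTime> clock; // injectable so the timeout can be simulated
    private int failureCount;
    private DateTime openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan resetTimeout, Func<DateTime> clock = null)
    {
        this.failureThreshold = failureThreshold;
        this.resetTimeout = resetTimeout;
        this.clock = clock ?? (() => DateTime.UtcNow);
    }

    public T Execute<T>(Func<T> operation)
    {
        if (State == CircuitState.Open)
        {
            if (clock() - openedAt < resetTimeout)
            {
                // Fail fast without calling the operation.
                throw new InvalidOperationException("Circuit is open; call rejected.");
            }
            State = CircuitState.HalfOpen; // reset timeout elapsed: allow one trial call
        }

        try
        {
            T result = operation();
            failureCount = 0;
            State = CircuitState.Closed; // success closes the circuit (or keeps it closed)
            return result;
        }
        catch
        {
            failureCount++;
            if (State == CircuitState.HalfOpen || failureCount >= failureThreshold)
            {
                State = CircuitState.Open; // trip, or re-trip after a failed trial call
                openedAt = clock();
            }
            throw;
        }
    }
}
```

Injecting the clock is a design choice made here so that the reset timeout can be exercised in a test without actually waiting.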
How to Implement Retry Logic
At its core, retry logic revolves around the concept of repeating an operation if it fails due to certain types of exceptions. Here’s the fundamental structure of how it can be achieved in C#:
    int retryCount = 3;
    while (retryCount > 0)
    {
        try
        {
            PerformOperation();
            retryCount = 0;
        }
        catch (ExceptionType1)
        {
            retryCount--;
            if (retryCount <= 0)
            {
                throw;
            }
            Thread.Sleep(2000);
        }
    }
In this basic structure, the PerformOperation() method is invoked inside a try block. If it throws an exception of ExceptionType1, the code decreases the retry count and waits for a set duration before attempting the operation again. This continues until the operation succeeds or the retry count hits zero.
You can keep improving the retry logic. For instance, rather than using a fixed delay between retries, you might want to use an exponential backoff strategy where the delay increases after each retry. However, if we were to include this structure every time we make an API call, perform a database operation, or interact with a cloud service, our program would become messy and difficult to maintain. There must be a more efficient approach to implementing retry logic.
Let’s look at an example.
Consider the following scenario: We have a RESTful Web API deployed in the cloud, and we now require its integration within our client application. To achieve this, we introduce an HTTP client helper class, which serves as a demonstration for utilizing the API. Please note that the helper class provided here is solely for illustrative purposes.
    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;
    using Newtonsoft.Json;

    public class ApiHelper
    {
        private readonly HttpClient _httpClient;

        public ApiHelper(string apiServer)
        {
            _httpClient = new HttpClient
            {
                BaseAddress = new Uri(apiServer)
            };
            _httpClient.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
        }

        // Sends the request and returns the body, or null for a non-success status code.
        private async Task<string> RequestString(HttpRequestMessage requestMessage)
        {
            var response = await _httpClient.SendAsync(requestMessage);
            if (response.IsSuccessStatusCode)
            {
                return await response.Content.ReadAsStringAsync();
            }
            else
            {
                return null;
            }
        }

        // Sends the request and deserializes the JSON body into T.
        private async Task<T> RequestObject<T>(HttpRequestMessage requestMessage)
        {
            var result = await RequestString(requestMessage);
            if (result == null)
            {
                return default(T);
            }
            return JsonConvert.DeserializeObject<T>(result);
        }

        // Builds the request, serializing the optional model as a JSON body.
        private HttpRequestMessage CreateRequestMessage(HttpMethod httpMethod, string url, object model = null)
        {
            var requestMessage = new HttpRequestMessage(httpMethod, url);
            if (model != null)
            {
                JsonSerializerSettings settings = new JsonSerializerSettings
                {
                    NullValueHandling = NullValueHandling.Ignore
                };
                requestMessage.Content = new StringContent(JsonConvert.SerializeObject(model, settings), Encoding.UTF8, "application/json");
            }
            return requestMessage;
        }

        public async Task<T> GetAsync<T>(string url)
        {
            var requestMessage = CreateRequestMessage(HttpMethod.Get, url);
            return await RequestObject<T>(requestMessage);
        }

        public async Task<T> PostAsync<T>(string url, object model)
        {
            var requestMessage = CreateRequestMessage(HttpMethod.Post, url, model);
            return await RequestObject<T>(requestMessage);
        }

        public async Task<T> PatchAsync<T>(string url, object model)
        {
            var requestMessage = CreateRequestMessage(new HttpMethod("PATCH"), url, model);
            return await RequestObject<T>(requestMessage);
        }

        public async Task<T> PutAsync<T>(string url, object model)
        {
            var requestMessage = CreateRequestMessage(HttpMethod.Put, url, model);
            return await RequestObject<T>(requestMessage);
        }

        public async Task<T> DeleteAsync<T>(string url)
        {
            var requestMessage = CreateRequestMessage(HttpMethod.Delete, url);
            return await RequestObject<T>(requestMessage);
        }
    }
The first question that arises when implementing retry logic is: Where should we place the retry logic?
Implementing retry logic in the correct place is crucial for building resilient and reliable systems.
For the given example, the ideal place to incorporate the retry logic would be within the RequestObject<T> method. This method is responsible for sending the HTTP request, handling the response, and deserializing it into the specified object type.
To implement the retry logic, you can modify the RequestObject<T> method as follows:
    private async Task<T> RequestObject<T>(HttpRequestMessage requestMessage)
    {
        int maxRetries = 3;
        int currentAttempt = 0;
        Exception lastException = null;
        while (currentAttempt < maxRetries)
        {
            try
            {
                // Caveat: HttpClient refuses to send the same HttpRequestMessage instance twice,
                // so a second attempt with this message throws InvalidOperationException.
                var result = await RequestString(requestMessage);
                if (result == null)
                {
                    return default(T);
                }
                return JsonConvert.DeserializeObject<T>(result);
            }
            catch (Exception ex)
            {
                lastException = ex;
                currentAttempt++;
            }
        }
        throw lastException;
    }
We will address the specific issues with the current implementation later on. We still want to pose the question: Is that the optimal location for implementing the retry logic?
Given that HttpClient supports the pipeline pattern, it is more advantageous to implement the retry logic within a custom HttpMessageHandler. This approach offers several benefits and is considered a better practice. By implementing the retry logic in an HttpMessageHandler, you can take advantage of the HttpClient pipeline and achieve the following:
- Separation of Concerns: Placing the retry logic in a dedicated HttpMessageHandler promotes a clear separation of concerns. Each handler in the pipeline focuses on a specific aspect of the request/response flow, making the code more modular and maintainable.
- Reusability: With the retry logic encapsulated within a custom HttpMessageHandler, you can reuse the handler across different HttpClient instances or even in other parts of your application. This promotes code reuse and reduces duplication.
- Centralized Configuration: Implementing the retry logic within a single HttpMessageHandler allows for centralized configuration. You can easily adjust the retry behavior, such as the maximum number of retries, delay between retries, or any other specific retry policies, without modifying the individual request methods.
- Flexibility: By utilizing a custom HttpMessageHandler, you have the flexibility to implement complex retry policies tailored to your specific requirements. You can incorporate exponential backoff strategies, customize retry conditions based on response status codes or exception types, and handle transient failures in a fine-grained manner.
- Compatibility: Since HttpClient follows the pipeline pattern, you can seamlessly integrate the retry logic with other handlers in the pipeline. This allows you to combine the retry logic with authentication, logging, or other custom handlers, creating a comprehensive and extensible HTTP processing pipeline.
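To make the pipeline idea concrete before adding retries, here is a minimal sketch of two chained handlers. Both types and the X-Correlation-Id header name are hypothetical and exist only for this illustration: CorrelationIdHandler decorates the request and passes it on, and StubHandler answers locally so that no network call is made.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical pipeline stage: adds a header, then hands the request to the next handler.
public class CorrelationIdHandler : DelegatingHandler
{
    public CorrelationIdHandler(HttpMessageHandler innerHandler) : base(innerHandler) { }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.Headers.Add("X-Correlation-Id", Guid.NewGuid().ToString());
        return base.SendAsync(request, cancellationToken); // continue down the pipeline
    }
}

// Hypothetical terminal handler: replies with 200 OK instead of using the network.
public class StubHandler : HttpMessageHandler
{
    public bool SawCorrelationId { get; private set; }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Record whether the outer handler ran before this one.
        SawCorrelationId = request.Headers.Contains("X-Correlation-Id");
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK) { RequestMessage = request });
    }
}
```

Constructing the client as `new HttpClient(new CorrelationIdHandler(new StubHandler()))` makes every request flow through the outer handler first and the inner handler last. A retry handler slots into such a chain in the same way.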
Here’s an example of how you can implement a RetryHandler to incorporate retry logic:
    public class RetryHandler : DelegatingHandler
    {
        private readonly int maxRetries;
        private readonly TimeSpan delayBetweenRetries;

        public RetryHandler(HttpMessageHandler innerHandler, int maxRetries, TimeSpan delayBetweenRetries)
            : base(innerHandler)
        {
            this.maxRetries = maxRetries;
            this.delayBetweenRetries = delayBetweenRetries;
        }

        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            int retries = 0;
            while (true)
            {
                try
                {
                    var response = await base.SendAsync(request, cancellationToken);
                    if (response.IsSuccessStatusCode || retries >= maxRetries)
                    {
                        return response;
                    }
                    response.Dispose(); // release the failed response before retrying
                }
                catch (Exception)
                {
                    if (retries >= maxRetries)
                    {
                        throw;
                    }
                }
                await Task.Delay(delayBetweenRetries, cancellationToken);
                retries++;
            }
        }
    }
We can modify the constructor of our helper class to utilize the retry handler:
public ApiHelper(string apiServer)
{
var httpClientHandler = new HttpClientHandler();
var retryHandler = new RetryHandler(httpClientHandler, maxRetries: 3, delayBetweenRetries: TimeSpan.FromSeconds(1));
_httpClient = new HttpClient(retryHandler)
{
BaseAddress = new Uri(apiServer)
};
_httpClient.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));
}
Returning to the actual implementation of the retry logic, there are two significant issues with the current approach.
The first issue relates to the rules that determine when to retry an operation. Presently, the retry logic is applied to every failed status code and exception encountered during the operation. However, it is important to recognize that not all failed HTTP status codes and exceptions are transient or warrant retry attempts. For example, HTTP status codes such as 500, 400, 404, or 401 typically indicate non-transient errors. Retrying such status codes is unlikely to yield a different outcome.
To address this issue, it is necessary to refine the retry conditions and determine which status codes should trigger retries. This can be achieved by customizing the retry logic based on specific requirements and considering factors such as known transient errors, the nature of the API being called, and the likelihood of a successful outcome from retries. As for exceptions, this example treats them as non-transient and does not retry them.
Here’s an updated example of the RetryHandler that incorporates more refined retry conditions:
    public class RetryHandler : DelegatingHandler
    {
        private readonly int maxRetries;
        private readonly TimeSpan delayBetweenRetries;

        public RetryHandler(HttpMessageHandler innerHandler, int maxRetries, TimeSpan delayBetweenRetries)
            : base(innerHandler)
        {
            this.maxRetries = maxRetries;
            this.delayBetweenRetries = delayBetweenRetries;
        }

        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            int retries = 0;
            while (true)
            {
                // Exceptions are not caught here: they are treated as non-transient
                // and propagate to the caller instead of being swallowed.
                HttpResponseMessage response = await base.SendAsync(request, cancellationToken);
                bool shouldRetry = !response.IsSuccessStatusCode && IsTransientStatusCode(response.StatusCode);
                if (!shouldRetry || retries >= maxRetries)
                {
                    return response;
                }
                response.Dispose(); // release the failed response before retrying
                await Task.Delay(delayBetweenRetries, cancellationToken);
                retries++;
            }
        }

        private static bool IsTransientStatusCode(HttpStatusCode statusCode)
        {
            switch (statusCode)
            {
                case HttpStatusCode.RequestTimeout:
                case HttpStatusCode.TooManyRequests:
                case HttpStatusCode.ServiceUnavailable:
                case HttpStatusCode.GatewayTimeout:
                case HttpStatusCode.BadGateway:
                case HttpStatusCode.InsufficientStorage:
                    return true;
                default:
                    return false;
            }
        }
    }
The second concern with the previous implementation is the manual implementation of the retry delay strategy. Attempting to implement various delay strategies, such as exponential backoff, incremental backoff, or randomized exponential backoff, can lead to increased complexity and challenges. Fortunately, there are excellent libraries available that excel in handling such scenarios more effectively. One such library is Polly, which offers a comprehensive set of features for resilience and transient fault handling in .NET applications.
Polly provides a wide range of retry policies and built-in delay strategies that are highly configurable and customizable. For example, you can easily implement exponential backoff or incremental backoff strategies using Polly, without the need for manual implementation.
GitHub - App-vNext/Polly
Here is the updated RetryHandler class, which incorporates the exponential backoff strategy using the Polly library:
    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;
    using Polly;
    using Polly.Retry;

    public class RetryHandler : DelegatingHandler
    {
        private readonly AsyncRetryPolicy<HttpResponseMessage> retryPolicy;

        public RetryHandler(HttpMessageHandler innerHandler, int maxRetries, TimeSpan initialDelay)
            : base(innerHandler)
        {
            this.retryPolicy = Policy
                .HandleResult<HttpResponseMessage>(response => !response.IsSuccessStatusCode
                    && IsTransientStatusCode(response.StatusCode))
                .WaitAndRetryAsync(
                    maxRetries,
                    // Exponential backoff: initialDelay, then 2x, 4x, 8x, ...
                    retryAttempt => TimeSpan.FromMilliseconds(initialDelay.TotalMilliseconds * Math.Pow(2, retryAttempt - 1)),
                    (outcome, delay, retryCount, context) =>
                    {
                        // Optional hook: log outcome.Result.StatusCode, delay, and retryCount here.
                    });
        }

        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            // Pass the token through Polly so cancellation also interrupts the waits between retries.
            return await retryPolicy.ExecuteAsync(ct => base.SendAsync(request, ct), cancellationToken);
        }

        private static bool IsTransientStatusCode(HttpStatusCode statusCode)
        {
            switch (statusCode)
            {
                case HttpStatusCode.RequestTimeout:
                case HttpStatusCode.TooManyRequests:
                case HttpStatusCode.ServiceUnavailable:
                case HttpStatusCode.GatewayTimeout:
                case HttpStatusCode.BadGateway:
                case HttpStatusCode.InsufficientStorage:
                    return true;
                default:
                    return false;
            }
        }
    }
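To add the Circuit Breaker pattern to the RetryHandler using Polly, you can modify the policy by combining the WaitAndRetryAsync and CircuitBreakerAsync policies.
    public RetryHandler(HttpMessageHandler innerHandler, int maxRetries, TimeSpan initialDelay, int circuitBreakerThreshold, TimeSpan circuitBreakerDuration)
        : base(innerHandler)
    {
        var waitAndRetryPolicy = Policy
            .HandleResult<HttpResponseMessage>(response => !response.IsSuccessStatusCode
                && IsTransientStatusCode(response.StatusCode))
            // Retry network-level failures only; catching every Exception would also retry cancellations.
            .Or<HttpRequestException>()
            .WaitAndRetryAsync(
                maxRetries,
                retryAttempt => TimeSpan.FromMilliseconds(initialDelay.TotalMilliseconds * Math.Pow(2, retryAttempt - 1))
            );

        var circuitBreakerPolicy = Policy
            .HandleResult<HttpResponseMessage>(response => !response.IsSuccessStatusCode
                && IsTransientStatusCode(response.StatusCode))
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: circuitBreakerThreshold,
                durationOfBreak: circuitBreakerDuration,
                onBreak: (outcome, breakDelay) =>
                {
                    // Optional hook: log that the circuit has opened.
                },
                onReset: () =>
                {
                    // Optional hook: log that the circuit has closed again.
                }
            );

        // The circuit breaker is the outer policy, so it records a failure only after the retries are exhausted.
        // The retryPolicy field must be declared as IAsyncPolicy<HttpResponseMessage> to hold the wrapped policy.
        this.retryPolicy = circuitBreakerPolicy.WrapAsync(waitAndRetryPolicy);
    }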
The preceding example demonstrates our thought process for implementing a retry logic for RESTful HTTP client API calls.
When it comes to other scenarios, such as database operations, adding retry logic using the Polly library is straightforward and can be broken down into two steps: defining the policy and executing the function with the defined policy.
First, you define the retry policy using the Polly library, specifying the desired retry conditions and behavior. This includes setting the number of retries, any specific exception types to handle, and optional custom actions to perform on each retry attempt.
Next, you apply the defined retry policy to the function or method where you want to incorporate the retry logic. By executing the function within the retry policy, Polly will automatically handle retry attempts based on the specified policy.
Here’s an example of how you can add retry logic to the SqlConnection.Open method, using an IsTransient helper method to decide which errors are worth retrying:
    // maxRetries, delayBetweenRetries, and connectionString are assumed to be defined elsewhere.
    var retryPolicy = Policy
        .Handle<SqlException>(ex => IsTransient(ex))
        .WaitAndRetry(maxRetries, _ => delayBetweenRetries);

    retryPolicy.Execute(() =>
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            // Use the open connection here.
        }
    });

    public bool IsTransient(Exception ex)
    {
        // 11001 is the network-level "host not found" error, used here as a single example;
        // extend the check with the error numbers that are transient in your environment.
        return ex is SqlException sqlException && sqlException.Number == 11001;
    }
SqlConnection does not support the pipeline pattern, and because it is a sealed class, you cannot derive from it to override the Open method. To keep the retry logic out of the calling code, you can instead define an extension method that wraps Open in the retry policy. This gives you one place to customize the behavior according to your requirements.
Here’s an example:
    using Polly;
    using System;
    using System.Data.SqlClient;

    public static class SqlConnectionRetryExtensions
    {
        // Opens the connection, retrying transient failures with a fixed delay between attempts.
        public static void OpenWithRetry(this SqlConnection connection, int maxRetries, TimeSpan delayBetweenRetries)
        {
            var retryPolicy = Policy
                .Handle<SqlException>(ex => IsTransient(ex))
                .WaitAndRetry(maxRetries, _ => delayBetweenRetries);

            retryPolicy.Execute(() =>
            {
                connection.Open();
            });
        }

        private static bool IsTransient(SqlException ex)
        {
            // 11001 ("host not found") is used as a single example of a transient error number.
            return ex.Number == 11001;
        }
    }
Final thought
Through the demonstrations of HttpClient and SqlConnection, we have gained insights into implementing retry logic in our applications. Understanding the rationale behind retry logic and common approaches is crucial for effectively incorporating it into our codebase.
Fortunately, popular libraries such as SqlClient and the AWS SDK for .NET (used for S3, among other services) already include retry logic, removing the need for manual implementation in those scenarios. Recent versions of Microsoft.Data.SqlClient, for instance, include configurable retry capabilities.
Configurable retry logic in SqlClient - ADO.NET Provider for SQL Server
However, comprehending the why, when, and how of implementing retry logic empowers us to easily add it to our applications wherever needed. This article aims to provide that understanding, enabling us to incorporate retry logic seamlessly into our codebase, even for scenarios where it is not readily available in existing libraries.