How to Build Resilient Applications: Patterns and Best Practices
The hype around building large applications and distributed systems may be over. However, resilient services are needed more than ever.
Let's dive deep into the topic of resilient applications. But first, some basics:
Understanding resilient applications
What are resilient applications? Here is a one-sentence definition:
Resilient applications are designed to withstand failures, adapt to changing conditions, and continue providing essential functionality despite disruptions.
This means that these applications are built with fault tolerance, graceful degradation, and self-healing mechanisms to ensure high availability and reliability.
Why Do We Need Resilience?
Resilience is critical due to increasing complexity, distributed systems, and reliance on third-party services. Without resilience, even minor failures can cascade into major system outages, leading to downtime, data loss, and poor user experience. Businesses demand high availability, making resilience a non-negotiable aspect of application design. The alternative is a horror scenario for developers: hunting bugs on a Sunday morning because of a small error such as a faulty regex.
But what are the main techniques?
Key Techniques for Application Resilience
Resilience strategies vary depending on application architecture. Let's compare monolithic applications with distributed architectures built on bus systems, highlighting Azure services and practical implementation examples (Azure offers good services for building resilient applications).
1. Resilience in Monolithic Applications
Monolithic applications run as a single unit, where failure in one component can impact the entire system.
To implement these patterns, I use established libraries such as Polly (.NET). I have learned to love this project because it integrates seamlessly into projects, even ones that have already shipped.
Back to the techniques. They include:
Circuit Breaker Pattern
The circuit breaker prevents cascading failures by detecting slow or failing dependencies and halting requests until recovery.
Example in .NET (Polly Library):
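// Break the circuit after 3 consecutive HttpRequestExceptions and keep it open for 30 seconds.
var policy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(3, TimeSpan.FromSeconds(30));

// While the circuit is open, ExecuteAsync throws a BrokenCircuitException instead of calling the API.
await policy.ExecuteAsync(async () =>
{
    await httpClient.GetAsync("https://api.example.com/data");
});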
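To show what a circuit breaker does internally, here is a minimal TypeScript sketch of the state machine behind the pattern. It is not how Polly is implemented. The class name CircuitBreaker, its constructor parameters, and the injectable clock are assumptions made for illustration only.

```typescript
// Hypothetical sketch of a circuit breaker; all names are illustrative.
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly breakDurationMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    breakDurationMs: number,
    now: () => number = () => Date.now(), // injectable clock, handy for tests
  ) {
    this.failureThreshold = failureThreshold;
    this.breakDurationMs = breakDurationMs;
    this.now = now;
  }

  getState(): CircuitState {
    return this.state;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.breakDurationMs) {
        // Fail fast: the dependency is not called while the circuit is open.
        throw new Error("Circuit is open");
      }
      this.state = "half-open"; // break elapsed: let the next call through as a trial
    }
    try {
      const result = await operation();
      this.failures = 0; // any success resets the failure count and closes the circuit
      this.state = "closed";
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

With new CircuitBreaker(3, 30000), the third consecutive failure opens the circuit, and calls made during the next 30 seconds are rejected without reaching the dependency.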
Retry Mechanism
The other part is the retry mechanism, which handles transient failures by reattempting an operation before giving up. It can be combined with the circuit breaker pattern.
Example in .NET (Polly Library):
// Retry up to 3 times after an HttpRequestException (4 attempts in total).
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .RetryAsync(3);

await retryPolicy.ExecuteAsync(async () =>
{
    await httpClient.GetAsync("https://api.example.com/data");
});
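Retrying immediately can put extra load on a dependency that is already struggling, so retries are often combined with a growing delay. The following TypeScript sketch illustrates that idea. The function name retryWithBackoff and its parameters are assumptions for illustration, and it retries on every error rather than on one specific exception type.

```typescript
// Hypothetical helper: retries a failing async operation with exponential backoff.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries: number,
  baseDelayMs: number,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // retries exhausted: surface the last error
      const delayMs = baseDelayMs * 2 ** attempt; // baseDelayMs, then 2x, 4x, ...
      await new Promise<void>((resolve) => setTimeout(() => resolve(), delayMs));
    }
  }
}
```

Calling retryWithBackoff(op, 3, 200) makes at most four attempts and waits 200 ms, 400 ms, and 800 ms between them.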
Fallback Strategy
Providing a default response when a service is unavailable ensures continuity.
Example:
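var fallbackPolicy = Policy<string>
    .Handle<HttpRequestException>()
    .FallbackAsync("Fallback Response");

// Returns the response body, or "Fallback Response" if the request throws an HttpRequestException.
var body = await fallbackPolicy.ExecuteAsync(() => httpClient.GetStringAsync("https://api.example.com/data"));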
2. Resilience in Distributed Architectures with Bus Systems
In distributed environments, applications communicate over message buses (e.g., Azure Service Bus). Ensuring resilience involves additional strategies.
Event-Driven Architecture
Using message queues decouples components, allowing services to function independently.
Example in TypeScript with Azure Service Bus:
import { ServiceBusClient } from "@azure/service-bus";
const client = new ServiceBusClient("<connection-string>");
const sender = client.createSender("my-queue");
await sender.sendMessages({ body: "Hello, world!" });
await sender.close();
await client.close();
Idempotent Message Processing
To avoid duplicate processing, services should handle retries idempotently. As standard, every bus system delivers a message ID with each message. Track this ID so that a redelivered message is not executed a second time, or design a rollback scenario so that multiple executions are harmless. Either way, the handler must be idempotent 😄.
Example:
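function processMessage(messageId: string, data: any) {
  if (isAlreadyProcessed(messageId)) return; // duplicate delivery: skip
  // Process data
  // Record the ID only after processing succeeded, so a failed attempt can be retried.
  storeAsProcessed(messageId);
}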
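A slightly fuller version of the same idea might look like the sketch below. It uses an in-memory Set as a stand-in for a durable store, which is an assumption for illustration: a real service would keep processed IDs in a database or cache shared by all instances. The names BusMessage and handleOnce are hypothetical too.

```typescript
// Hypothetical message shape; real bus SDKs expose a similar message ID.
interface BusMessage {
  messageId: string;
  body: unknown;
}

// In-memory stand-in for a durable store of processed message IDs.
const processedIds = new Set<string>();

// Runs the handler at most once per message ID. Returns false for duplicates.
async function handleOnce(
  message: BusMessage,
  handler: (body: unknown) => Promise<void>,
): Promise<boolean> {
  if (processedIds.has(message.messageId)) {
    return false; // already processed: ignore the redelivery
  }
  // If the handler throws, the ID is not recorded and the bus can redeliver.
  await handler(message.body);
  processedIds.add(message.messageId);
  return true;
}
```

This sketch does not guard against two concurrent deliveries of the same message; closing that gap would need an atomic check-and-insert in the store.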
Azure Functions with Durable Entities
Durable Functions (blog post) provide stateful workflows for fault-tolerant execution.
Example in .NET:
[FunctionName("DurableFunctionExample")]
public static async Task<string> Run([OrchestrationTrigger] IDurableOrchestrationContext context)
{
var result = await context.CallActivityAsync<string>("ActivityFunction", "input");
return result;
}
Different Resilience Patterns
The patterns above target the service communication itself and protect against DoS attacks or other runaway services. For business processes, and for keeping services in a healthy state, you need other patterns, which I describe here.
Bulkhead Pattern
This pattern isolates different components or services to prevent failures from affecting the entire system. This avoids a total outage due to the domino effect of failing services. Polly has a nice function for this.
Example in .NET:
var bulkheadPolicy = Policy.BulkheadAsync(10, 20);
The policy limits the number of concurrent operations to 10. If more than 10 concurrent tasks are requested, the extra tasks are queued, up to a maximum of 20, and each one starts as soon as a slot in the concurrent task pool becomes free. When the maximum of 20 queued tasks is exceeded, any new request is rejected with a BulkheadRejectedException.
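The queueing behavior described above can be sketched in TypeScript as follows. This is an illustration of the idea, not Polly's implementation; the class name Bulkhead and its parameters maxParallel and maxQueued are assumptions.

```typescript
// Hypothetical bulkhead: limits parallel executions and queues a bounded number of extra calls.
class Bulkhead {
  private running = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxParallel: number;
  private readonly maxQueued: number;

  constructor(maxParallel: number, maxQueued: number) {
    this.maxParallel = maxParallel;
    this.maxQueued = maxQueued;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.running >= this.maxParallel) {
      if (this.waiting.length >= this.maxQueued) {
        throw new Error("Bulkhead full: request rejected");
      }
      // Wait until a finishing operation hands its slot over to this call.
      await new Promise<void>((resolve) => {
        this.waiting.push(() => resolve());
      });
    } else {
      this.running += 1; // take a free slot
    }
    try {
      return await operation();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot directly to the oldest queued call
      } else {
        this.running -= 1; // nobody is waiting: free the slot
      }
    }
  }
}
```

new Bulkhead(10, 20) mirrors the numbers from the .NET example: 10 operations run in parallel, 20 more may wait, and further calls are rejected.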
Timeout Pattern
The timeout pattern limits the time a service spends waiting for a response to avoid indefinite hangs. This ensures that the service answers quickly, in the best case with an HTTP 200. It is typically applied to requests that may consume a large amount of resources: setting a timeout value keeps them from overloading the server(s).
Example in .NET:
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(2));
This example sets the timeout to 2 seconds.
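In TypeScript, a comparable effect can be sketched with Promise.race. The helper name withTimeout is an assumption for illustration. Note that this sketch only stops waiting; it does not cancel the underlying operation, which keeps running in the background.

```typescript
// Hypothetical helper: rejects if the operation does not settle within timeoutMs.
async function withTimeout<T>(
  operation: () => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Operation timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([operation(), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer when the operation wins
  }
}
```

withTimeout(() => fetchData(), 2000) would correspond to the 2-second limit above, where fetchData is a placeholder for any async call.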
Saga Pattern
The saga pattern is used in distributed systems to manage long-running transactions by breaking them into smaller, compensable transactions. I like to think of it as a simple workflow definition for the business process. It doesn't matter whether the code is executed by a message bus handler or called remotely. The main advantage of this pattern is that you can react quickly to faults and redesign the business process.
Example in TypeScript:
async function sagaTransaction() {
try {
await stepOne();
await stepTwo();
} catch (error) {
await compensateStepOne();
}
}
In this example, the process executes two steps. If an error occurs, the changes made by step one are rolled back.
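With more than two steps, it helps to remember which steps actually completed and to compensate only those, in reverse order. The following sketch shows one way to do that. The SagaStep interface and the runSaga function are assumptions for illustration, not part of any library.

```typescript
// Hypothetical saga step: an action plus the compensation that undoes it.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs the steps in order. On failure, compensates the completed steps in reverse order.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch (error) {
      // The failed step itself is not compensated; only steps that finished are undone.
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      throw error; // surface the original failure to the caller
    }
  }
}
```

The sketch assumes that compensations themselves succeed; a production saga would also need to persist its progress and retry failed compensations.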
Strangler Fig Pattern
Gradually replaces parts of a legacy system with new functionality without disrupting operations.
Example in .NET:
if (useNewService)
{
NewService.ProcessRequest(request);
}
else
{
LegacyService.ProcessRequest(request);
}
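In practice, the switch is often made per route rather than with a single flag, so that functionality can move over piece by piece. Here is a minimal TypeScript sketch of that routing decision. The prefixes, the PathHandler type, and the function names are hypothetical.

```typescript
// Hypothetical handler type: takes a request path and returns a response body.
type PathHandler = (path: string) => string;

// Hypothetical list of route prefixes that have already been migrated to the new service.
const migratedPrefixes: string[] = ["/orders", "/invoices"];

function isMigrated(path: string): boolean {
  // Match the prefix itself or anything below it, but not "/ordersarchive".
  return migratedPrefixes.some(
    (prefix) => path === prefix || path.indexOf(prefix + "/") === 0,
  );
}

// Sends migrated routes to the new service and everything else to the legacy system.
function routeRequest(
  path: string,
  legacyHandler: PathHandler,
  newHandler: PathHandler,
): string {
  return isMigrated(path) ? newHandler(path) : legacyHandler(path);
}
```

Migrating another part of the legacy system then only means adding its prefix to the list.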
Shadow Traffic Pattern
This pattern is mainly used for A/B deployments, so that the new node does not miss any production data. It simply routes a copy of production traffic to a new system without impacting users. The same approach is also used for mirroring production data into the test environment.
Example in TypeScript:
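async function handleRequest(request) {
  const response = await sendToProduction(request);
  // Mirror the request without awaiting it, so shadow latency or errors never reach the user.
  sendToShadowSystem(request).catch((error) => console.warn("Shadow request failed:", error));
  return response;
}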
This is a very simple example; you can also use an API management layer that sends the data to the second system as well.
Conclusion
Resilience is an essential aspect of modern application design. Whether building monolithic or distributed applications, techniques such as circuit breakers, retries, event-driven architectures, and Azure services like Service Bus and Durable Functions enhance reliability. By implementing these best practices and resilience patterns, organizations can ensure high availability, fault tolerance, and a seamless user experience.