In a distributed system, failures are not the exception but the norm: networks drop, services get saturated, and dependencies become slow. Resilience is the ability of a system to keep working, even if in a degraded fashion, when its dependencies fail. Without resilience patterns, a single slow service can trigger a cascading failure that brings down the entire platform.

In this lesson we will study the fundamental patterns (timeouts, retries with backoff, circuit breaker, bulkhead, and fallback) and see how to implement them in Java with the Resilience4j library.

Contents

  1. Why we need resilience: the cascading failure
  2. Timeouts
  3. Retries with backoff and jitter
  4. Circuit Breaker and its states
  5. Bulkhead
  6. Fallback (graceful degradation)
  7. Combining patterns
  8. Common mistakes and tips
  9. Exercises
  10. Conclusion

  1. Why we need resilience: the cascading failure

Imagine that service A calls service B and B becomes slow. If A has no timeout, its threads stay waiting. When all threads are exhausted, A stops responding. Whoever calls A also hangs, and so on. A local failure turns into a global outage.

graph LR
    C[Client] --> A[Service A]
    A --> B[Slow Service B]
    B -. blocks threads .-> A
    A -. runs out of threads .-> C
    C -. total error .-> X[Cascading failure]

Resilience patterns break this chain: they limit the wait time, isolate resources, and cut off the flow toward sick dependencies.
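
The thread exhaustion described above can be reproduced with nothing but the JDK. The following is a minimal sketch, not production code: the class name `ThreadStarvationDemo`, the pool size, and the wait time are all assumptions chosen for illustration. A `CountDownLatch` stands in for the slow service B.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ThreadStarvationDemo {

    /**
     * Fills a pool of poolSize threads with calls that block on a slow
     * dependency, then checks whether an unrelated fast request is served
     * within waitMillis. Returns true if the fast request starved.
     */
    static boolean fastRequestStarves(int poolSize, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch slowDependency = new CountDownLatch(1); // does not answer until released
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    try {
                        slowDependency.await(); // no timeout: the thread is stuck
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Future<String> fast = pool.submit(() -> "ok"); // queued behind the stuck threads
            try {
                fast.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true; // no free thread picked the request up
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        } finally {
            slowDependency.countDown(); // release the stuck threads
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast request starved: " + fastRequestStarves(2, 200));
    }
}
```

The fast request never does any slow work itself; it starves only because every thread is waiting on B. That is the mechanism the patterns in this lesson are designed to interrupt.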

  2. Timeouts

A timeout defines the maximum time we wait for a response before giving up. It is the first line of defense: without a timeout, any slow dependency drags you down.

// Explicit timeout: if the call takes longer than 2 seconds, it fails fast
TimeLimiterConfig config = TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(2))
        .build();
TimeLimiter timeLimiter = TimeLimiter.of(config);

The rule: every network call must have a timeout. It is better to fail fast and in a controlled way than to hang indefinitely. A good timeout is usually based on the observed 99th percentile of latency, with some margin.
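
As a rough illustration of that rule, the sketch below derives a timeout from latency samples and then enforces it with the JDK's `CompletableFuture.orTimeout` (Java 9+). It is not Resilience4j code; `TimeoutSketch`, `suggestTimeout`, `callWithTimeout`, and the margin value are hypothetical names and numbers, and the nearest-rank percentile is just one of several reasonable definitions.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class TimeoutSketch {

    /** Nearest-rank 99th percentile of the samples, multiplied by a safety margin. */
    static Duration suggestTimeout(long[] latenciesMillis, double margin) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("at least one latency sample is required");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (99 * sorted.length + 99) / 100; // ceil(0.99 * n), 1-based
        long p99 = sorted[rank - 1];
        return Duration.ofMillis(Math.round(p99 * margin));
    }

    /**
     * Runs the call asynchronously and stops waiting after the timeout,
     * returning onTimeout instead. The underlying task is not cancelled;
     * only the caller is released.
     */
    static <T> T callWithTimeout(Supplier<T> call, Duration timeout, T onTimeout) {
        try {
            return CompletableFuture.supplyAsync(call)
                    .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                    .join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return onTimeout; // fail fast instead of hanging
            }
            throw e;
        }
    }
}
```

With 100 samples of 1 ms to 100 ms and a margin of 2.0, `suggestTimeout` picks the 99th sample (99 ms) and doubles it. How much margin is appropriate depends on the dependency and should be checked against real measurements.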

  3. Retries with backoff and jitter

Many failures are transient (a lost packet, a brief load spike). In those cases, retrying makes sense. But retrying badly makes things worse.

  • Naive retry: retrying immediately and many times can further saturate a service that is already struggling.
  • Exponential backoff: waiting longer and longer between retries (1s, 2s, 4s...) gives it time to recover.
  • Jitter: adding randomness to the wait prevents all clients from retrying at the same time (the "thundering herd" effect).

// Retries with exponential backoff and jitter using Resilience4j
RetryConfig config = RetryConfig.custom()
        .maxAttempts(3)                        // 1 attempt + 2 retries
        .intervalFunction(
            IntervalFunction.ofExponentialRandomBackoff(
                Duration.ofMillis(500),        // initial wait
                2.0,                           // factor: 500ms, 1s, 2s...
                0.5))                          // 50% jitter
        .retryOnException(e -> e instanceof IOException) // transient failures only
        .build();
Retry retry = Retry.of("customers", config);

Key aspects:

  • maxAttempts(3): at most 3 attempts in total.
  • ofExponentialRandomBackoff: each retry waits longer, with randomness.
  • retryOnException: we only retry transient errors. Never retry non-idempotent operations without protection, or you could duplicate effects (such as a charge).
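
To make the numbers concrete, here is a small plain-JDK sketch of how a wait could be computed from an initial interval, a multiplier, and a jitter factor. It mirrors the parameters used above (500 ms, factor 2.0, 50% jitter), but it is an illustration under our own assumptions and not necessarily the exact formula Resilience4j applies internally; `BackoffSchedule` and `delayMillis` are hypothetical names.

```java
import java.util.Random;

final class BackoffSchedule {

    /**
     * Wait before retry number `retry` (1-based):
     * base = initial * multiplier^(retry - 1), then spread uniformly within
     * [base * (1 - jitter), base * (1 + jitter)].
     */
    static long delayMillis(int retry, long initialMillis, double multiplier,
                            double jitter, Random random) {
        double base = initialMillis * Math.pow(multiplier, retry - 1);
        double low = base * (1 - jitter);
        double high = base * (1 + jitter);
        return Math.round(low + random.nextDouble() * (high - low));
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry " + retry + ": wait "
                    + delayMillis(retry, 500, 2.0, 0.5, random) + " ms");
        }
    }
}
```

With these parameters the first retry waits somewhere between 250 and 750 ms and the second between 500 and 1500 ms. Two clients that failed at the same instant are therefore unlikely to come back at the same instant.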

  4. Circuit Breaker and its states

The circuit breaker is the flagship pattern. It works like the circuit breaker in your home's electrical panel: if it detects too many failures, it "opens the circuit" and stops sending requests to the sick dependency for a while, giving it room to recover and avoiding wasting resources on calls doomed to fail.

It has three states:

| State | Behavior | Transition |
|---|---|---|
| Closed | Calls pass through normally; failures are counted. | If the failure rate exceeds the threshold → Open. |
| Open | Calls are rejected instantly (fail-fast). | After a wait time → Half-Open. |
| Half-Open | Lets a few test calls through. | If they succeed → Closed; if they fail → Open. |

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate > threshold
    Open --> HalfOpen: wait time elapses
    HalfOpen --> Closed: test calls OK
    HalfOpen --> Open: test calls fail

Implementation with Resilience4j:

// Circuit breaker: opens if at least 50% of the last 10 calls fail
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                        // 50% failure threshold
        .slidingWindowSize(10)                           // window of 10 calls
        .waitDurationInOpenState(Duration.ofSeconds(10)) // stays open for 10s
        .permittedNumberOfCallsInHalfOpenState(3)        // 3 test calls
        .build();
CircuitBreaker breaker = CircuitBreaker.of("customers", config);

// We decorate the call with the breaker
Supplier<Customer> call = CircuitBreaker
        .decorateSupplier(breaker, () -> customerClient.get(id));

When the circuit is open, calls fail instantly without touching the network, which protects both the caller (it doesn't exhaust threads) and the sick service (it doesn't receive more load).
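
To see the state machine without the library, the following is a deliberately simplified sketch in plain Java. It is not how Resilience4j is implemented: it evaluates failures in fixed batches instead of a sliding window, it is not thread-safe, and the class `SimpleCircuitBreaker` with its injected clock is a hypothetical construction for teaching purposes.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Teaching sketch of the three-state machine; not thread-safe. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // calls evaluated per batch while closed
    private final double failureRateThreshold; // e.g. 0.5 for 50%
    private final long openMillis;             // how long the circuit stays open
    private final int halfOpenCalls;           // test calls needed to close again
    private final LongSupplier clock;          // injected so tests can control time

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long openMillis, int halfOpenCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    State state() {
        return state;
    }

    <T> T call(Supplier<T> supplier) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                // Fail fast: the dependency is not touched at all
                throw new IllegalStateException("circuit open: call rejected");
            }
            transitionTo(State.HALF_OPEN); // wait time elapsed: allow test calls
        }
        try {
            T result = supplier.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    private void record(boolean failed) {
        calls++;
        if (failed) {
            failures++;
        }
        if (state == State.HALF_OPEN) {
            if (failed) {
                transitionTo(State.OPEN);   // a test call failed: open again
            } else if (calls >= halfOpenCalls) {
                transitionTo(State.CLOSED); // enough test calls succeeded
            }
        } else if (calls >= windowSize) {
            if ((double) failures / calls >= failureRateThreshold) {
                transitionTo(State.OPEN);
            } else {
                calls = 0;                  // start a new batch
                failures = 0;
            }
        }
    }

    private void transitionTo(State next) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
    }
}
```

A breaker roughly equivalent to the configuration above would be `new SimpleCircuitBreaker(10, 0.5, 10_000, 3, System::currentTimeMillis)`. In real code the library version is preferable, since it handles concurrency, sliding windows, slow-call detection, and metrics.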

  5. Bulkhead

The name comes from the bulkheads of a ship: watertight compartments that keep the whole vessel from sinking if one of them floods. In software, the bulkhead isolates resources (threads, connections) per dependency, so that a saturated dependency does not consume all of the service's resources.

// Bulkhead: at most 10 concurrent calls to the customers service
BulkheadConfig config = BulkheadConfig.custom()
        .maxConcurrentCalls(10)                  // maximum 10 in parallel
        .maxWaitDuration(Duration.ofMillis(100)) // maximum wait to enter
        .build();
Bulkhead bulkhead = Bulkhead.of("customers", config);

Without a bulkhead, a slow customers service could tie up every thread of the calling service (say, a pool of 200) and starve calls to other healthy services. With the bulkhead above, at most 10 threads can be stuck on customers at any time; the rest keep serving other dependencies.

| Bulkhead type | Mechanism |
|---|---|
| Semaphore | Limits the number of concurrent calls (lightweight). |
| Thread pool | Assigns a dedicated thread pool per dependency. |
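
The semaphore variant is simple enough to sketch with `java.util.concurrent.Semaphore`. This is an illustration of the mechanism under our own assumptions, not the Resilience4j implementation; `SemaphoreBulkhead` and its exception choice are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the call only if a permit is obtained within the maximum wait. */
    <T> T call(Supplier<T> supplier) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            // The compartment is full: reject instead of queueing without limit
            throw new IllegalStateException("bulkhead full: call rejected");
        }
        try {
            return supplier.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

The calling thread still does the work here; the semaphore only caps how many threads may be inside the call at once. A thread-pool bulkhead goes further by handing the work to a separate pool, at the cost of extra threads and context switches.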

  6. Fallback (graceful degradation)

When a call fails (due to timeout, open circuit, or full bulkhead), a fallback provides an alternative response instead of propagating the error. It is the difference between "the entire system is down" and "this part works in degraded mode".

// Fallback: if we cannot get the customer, we return minimal cached data
public Customer getCustomerResilient(String id) {
    try {
        return decorateWithBreakerAndRetry(id);
    } catch (Exception e) {
        // Graceful degradation: partial response instead of a total error
        return localCache.getOrDefault(id, Customer.unknown(id));
    }
}

The fallback must offer something useful: cached data, a reasonable default value, or a clear message. What it must never do is hide a critical error without logging it.
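
One way to make "never hide the error" hard to forget is to route every fallback through a helper that always logs and counts. The sketch below uses only `java.util.logging` and an `AtomicLong` as a stand-in for a real metrics counter; `LoggingFallback` and `withFallback` are hypothetical names, not part of Resilience4j.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoggingFallback {
    private static final Logger LOG = Logger.getLogger(LoggingFallback.class.getName());

    /** Stand-in for a real metric (for example, a counter in a metrics registry). */
    static final AtomicLong FALLBACK_COUNT = new AtomicLong();

    /** Tries the primary call; on failure, records it and returns the fallback value. */
    static <T> T withFallback(String operation, Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            FALLBACK_COUNT.incrementAndGet();
            LOG.log(Level.WARNING, "Fallback used for " + operation, e);
            return fallback.get();
        }
    }
}
```

A call such as `withFallback("getCustomer", () -> client.get(id), () -> Customer.unknown(id))` would then degrade gracefully while leaving a trace in both logs and metrics (`client` and `Customer` being the lesson's illustrative types).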

  7. Combining patterns

The patterns complement each other and are usually applied together, in a logical order:

// Typical composition, innermost first: bulkhead -> timelimiter -> circuit breaker -> retry -> fallback
// withTimeLimiter is offered on the asynchronous (CompletionStage) decorator chain,
// so the blocking call is wrapped in a CompletableFuture here.
Supplier<CompletionStage<Customer>> decorated = Decorators
        .ofCompletionStage(() -> CompletableFuture.supplyAsync(() -> customerClient.get(id)))
        .withBulkhead(bulkhead)                  // 1. limits concurrency
        .withTimeLimiter(timeLimiter, scheduler) // 2. cuts off if it takes too long
        .withCircuitBreaker(breaker)             // 3. cuts off if the dependency is sick
        .withRetry(retry, scheduler)             // 4. retries transient failures
        .withFallback(List.of(Exception.class),
                      e -> Customer.unknown(id)) // 5. degradation
        .decorate();

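Order matters. Each decorator wraps the ones added before it, so the bulkhead and the timeout sit closest to the call and protect resources; the circuit breaker cuts off the flow; the retry re-runs the breaker-protected call on transient failures; and the outermost fallback turns any remaining failure into a degraded response. Resilience4j lets you compose them declaratively.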
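
The wrapping order is easy to get backwards, so here is a tiny plain-Java sketch that makes it visible. It does not use Resilience4j; `DecorationOrder` is a hypothetical helper in which each "layer" only records its name when a call enters it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class DecorationOrder {

    /**
     * Wraps the call in one layer per name, in list order, so every layer
     * encloses the layers that were applied before it.
     */
    static Supplier<String> decorate(Supplier<String> call,
                                     List<String> layerNames,
                                     List<String> trace) {
        Supplier<String> decorated = call;
        for (String name : layerNames) {
            Supplier<String> inner = decorated;
            decorated = () -> {
                trace.add(name); // record the moment the call enters this layer
                return inner.get();
            };
        }
        return decorated;
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        Supplier<String> decorated = decorate(
                () -> { trace.add("call"); return "customer"; },
                List.of("bulkhead", "timeLimiter", "circuitBreaker", "retry", "fallback"),
                trace);
        decorated.get();
        System.out.println(trace);
    }
}
```

The trace starts with the fallback and ends with the bulkhead and the call itself: the last decorator added is the first one a request passes through. Reordering the list is a cheap way to reason about questions such as "should a retry count as one breaker call or several?" before touching the real configuration.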

Common Mistakes and Tips

  • Not setting timeouts: a missing timeout is one of the most common causes of cascading failures. Put a timeout on every network call.
  • Retrying non-idempotent operations: a retry of a charge can duplicate it. Ensure idempotency before retrying.
  • Retries without backoff or jitter: they further saturate the sick service and cause the thundering herd effect.
  • Poorly calibrated circuit breaker thresholds: a breaker that is too sensitive opens on harmless blips; one that is too tolerant does not protect. Tune them with real data.
  • Fallbacks that hide errors: a fallback must record (log and metric) the failure, not bury it silently.
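
For the idempotency point, a common approach is to attach an idempotency key to each logical operation so that a retried request is recognized and not executed twice. The following in-memory sketch only illustrates the idea; `IdempotentCharges` is hypothetical, and a real system would persist the keys in a durable store shared by all instances.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotentCharges {
    private final Map<String, String> receiptsByKey = new ConcurrentHashMap<>();
    private final AtomicInteger chargesExecuted = new AtomicInteger();

    /** Executes the charge once per idempotency key; repeats return the stored receipt. */
    String charge(String idempotencyKey, long amountCents) {
        return receiptsByKey.computeIfAbsent(idempotencyKey, key -> {
            int n = chargesExecuted.incrementAndGet(); // the side effect happens here, once
            return "receipt-" + n + "-" + amountCents;
        });
    }

    int chargesExecuted() {
        return chargesExecuted.get();
    }
}
```

The client generates the key once per logical charge and reuses it on every retry, so a retry after a lost response returns the original receipt instead of charging again.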

Exercises

  1. Describe the three states of a circuit breaker and the conditions that trigger each transition.
  2. Explain why retries should use exponential backoff with jitter instead of retrying immediately. What problem does jitter avoid?
  3. Design, in Java pseudocode with Resilience4j, a call to the "policies service" protected with a 2s timeout, a circuit breaker (50% threshold), and a fallback that returns an empty list.

Solutions

  1. Closed: calls pass through and failures are counted; if the failure rate exceeds the threshold, it moves to Open. Open: rejects calls instantly; after the wait time, it moves to Half-Open. Half-Open: lets a few test calls through; if they succeed it returns to Closed, if they fail it returns to Open.

  2. Exponential backoff gives the sick service time to recover instead of bombarding it. Jitter (randomness) avoids the thundering herd effect: all clients, which failed at the same time, retrying at exactly the same moment and saturating the service again in synchronized waves.

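  3. One possible solution:

CircuitBreaker breaker = CircuitBreaker.of("policies",
        CircuitBreakerConfig.custom().failureRateThreshold(50).build());
TimeLimiter limiter = TimeLimiter.of(
        TimeLimiterConfig.custom().timeoutDuration(Duration.ofSeconds(2)).build());

// withTimeLimiter is offered on the asynchronous (CompletionStage) decorator chain.
// The time limiter goes inside the breaker so that timeouts count as failures.
Supplier<CompletionStage<List<Policy>>> decorated = Decorators
        .ofCompletionStage(() -> CompletableFuture.supplyAsync(() -> policyClient.list(customerId)))
        .withTimeLimiter(limiter, scheduler)
        .withCircuitBreaker(breaker)
        .withFallback(List.of(Exception.class), e -> Collections.emptyList())
        .decorate();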

Conclusion

We have seen that resilience is built by combining patterns: timeouts avoid endless waits, retries with backoff and jitter overcome transient failures, the circuit breaker cuts off the flow toward sick dependencies, the bulkhead isolates resources, and the fallback guarantees a degraded response. Together, they break cascading failures and keep the system standing.

These patterns handle availability failures, but a deeper challenge remains: when data is spread and replicated, can we have consistency, availability, and partition tolerance all at once? The next lesson, The CAP Theorem and Data Consistency, gives us the theoretical framework to understand these inevitable trade-offs.

© Copyright 2026. All rights reserved