High availability (HA) is the property of a system that keeps providing service even when some of its components fail. In the real world, hardware breaks, networks go down, and software has bugs: the question is not whether something will fail, but when. A well-designed system treats failure as normal and is prepared to survive it. In this lesson you will learn to measure availability with SLA, SLO, and SLI, understand what the "nines" of availability mean, and see the techniques that let a system withstand failures: redundancy, automatic failover, graceful degradation, and health checks.

SLA, SLO, and SLI: the language of availability
The nines of availability
Redundancy: eliminating single points of failure
Failover: automatic switchover on failure
Graceful degradation
Health checks: detecting that something is wrong
Common mistakes and tips
Exercises

SLA, SLO, and SLI: the language of availability

To talk about availability rigorously, we need three concepts that are often confused:

Term	What it is	Who sets it	Example
SLI (Indicator)	The metric you measure	Technical team	% of requests with a response < 300 ms
SLO (Objective)	The internal target for the SLI	Team/Product	"99.9% of requests OK per month"
SLA (Agreement)	The contractual commitment to the customer	Business/Legal	"99.5% or we refund your money"

The relationship is hierarchical: the SLI is the raw measurement, the SLO is the target you set internally, and the SLA is the formal promise to the customer (with penalties if breached). As a matter of prudence, the SLA is always set below the SLO: if your internal target is 99.9% but you promise 99.5%, you have a safety margin.

From this comes a key concept: the error budget. If your SLO is 99.9% monthly, you "allow yourself" 0.1% of failures. That 0.1% is a budget you can spend on deploying risky changes. If you exhaust it, you freeze deployments and focus on stability.

Monthly error budget with SLO = 99.9%
  Total time in the month:  30 days = 43,200 minutes
  Required availability:    99.9%
  Allowed downtime:         0.1% of 43,200 = 43.2 minutes/month

That is, with an SLO of 99.9% you can be down for at most about 43 minutes in a 30-day month without breaching your target.
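The same arithmetic can be sketched in a few lines of Java. This is a minimal illustration, not a monitoring tool: the class and method names are invented for this lesson, and the 30-minute outage in `main` is an assumed figure.

```java
// Sketch: error budget for a given SLO (all names are illustrative).
final class ErrorBudget {

    /** Minutes of downtime allowed by an SLO (in percent) over a period of `days` days. */
    static double allowedDowntimeMinutes(double sloPercent, int days) {
        double totalMinutes = days * 24 * 60;
        return totalMinutes * (100.0 - sloPercent) / 100.0;
    }

    /** Budget left after `downtimeSoFar` minutes of outage; a negative value means the SLO is breached. */
    static double remainingMinutes(double sloPercent, int days, double downtimeSoFar) {
        return allowedDowntimeMinutes(sloPercent, days) - downtimeSoFar;
    }

    public static void main(String[] args) {
        double budget = allowedDowntimeMinutes(99.9, 30);
        double left = remainingMinutes(99.9, 30, 30.0);   // assumed: one 30-minute outage so far
        System.out.printf("Budget: %.1f min, remaining: %.1f min%n", budget, left);
        System.out.println(left > 0 ? "Deployments allowed" : "Freeze deployments");
    }
}
```

A team could run a check like this against its measured downtime to decide whether risky deployments are still affordable this month.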

The nines of availability

Availability is expressed as the percentage of time in service, and colloquially it is counted in "nines". Each additional nine drastically reduces the allowed downtime and multiplies the cost.

Availability	"Nines"	Downtime per year	Downtime per month	Downtime per week
90%	one nine	36.5 days	72 hours	16.8 hours
99%	two nines	3.65 days	7.2 hours	1.68 hours
99.9%	three nines	8.76 hours	43.2 min	10.1 min
99.99%	four nines	52.6 min	4.32 min	1.01 min
99.999%	five nines	5.26 min	25.9 sec	6.05 sec

Five nines (99.999%) is legendary in telecommunications, but it demands an enormous investment: full redundancy, multiple data centers, complete automation. The architectural question is not "how many nines can I achieve?" but "how many nines does the business really need and is willing to pay for?". Going from 99.9% to 99.99% can multiply the cost tenfold to save under eight hours of downtime a year (from 8.76 hours to 52.6 minutes).
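The figures in the table follow from one multiplication: the period length times the unavailable fraction. A small sketch, assuming a 365-day year (the class and method names are hypothetical):

```java
import java.time.Duration;

// Sketch: allowed downtime for a given availability level (illustrative names).
final class Nines {

    /** Allowed downtime over `period` for an availability expressed in percent. */
    static Duration allowedDowntime(double availabilityPercent, Duration period) {
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return Duration.ofMillis(Math.round(period.toMillis() * unavailableFraction));
    }

    public static void main(String[] args) {
        Duration year = Duration.ofDays(365);   // assumption: non-leap year
        double[] levels = {90.0, 99.0, 99.9, 99.99, 99.999};
        for (double level : levels) {
            Duration d = allowedDowntime(level, year);
            System.out.println(level + "% -> " + d.toMinutes() + " whole minutes of downtime per year");
        }
    }
}
```

Passing `Duration.ofDays(30)` or `Duration.ofDays(7)` instead of a year gives the monthly and weekly columns.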

Redundancy: eliminating single points of failure

A single point of failure (SPOF) is any component whose failure brings down the entire system. The foundation of high availability is to eliminate SPOFs through redundancy: having more than one copy of every critical component.

There are two redundancy models:

Model	Description	Advantage	Drawback
Active-Passive	One active replica and another on standby	Simple, no conflicts	The passive replica is idle
Active-Active	All replicas serve traffic	Uses all resources	More complex (synchronization)

graph TD
    LB[Balancer] --> A[Node A - active]
    LB -.standby.-> B[Node B - passive]
    A --> DBp[(Primary DB)]
    DBp -.replication.-> DBs[(Replica DB)]

In this active-passive scheme, node B only comes into play if A goes down, and the replica database receives a continuous copy of the data so it can replace the primary. The practical rule: identify each component and ask yourself "what happens if this goes down?". If the answer is "everything goes down", that component is a SPOF and needs a redundant copy.
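To get a feel for why redundancy pays off, it helps to put numbers on it. The sketch below uses the standard idealized model: replicas fail independently, and switching between them is instantaneous. Both are assumptions that real systems only approximate (shared racks, power, or providers break independence), so treat the results as an upper bound. The names are invented for illustration.

```java
// Sketch: idealized availability math for redundant and chained components.
final class RedundancyMath {

    /** Replicas in parallel: the system is down only if every replica is down at once. */
    static double parallel(double singleAvailability, int replicas) {
        return 1.0 - Math.pow(1.0 - singleAvailability, replicas);
    }

    /** Components in series: every one of them must be up for the system to be up. */
    static double serial(double... availabilities) {
        double result = 1.0;
        for (double a : availabilities) {
            result *= a;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two independent 99% nodes behave like a 99.99% node in this model.
        System.out.println("Two 99% replicas: " + parallel(0.99, 2));
        // Chaining two 99.9% components lowers the total below either one.
        System.out.println("Two 99.9% components in series: " + serial(0.999, 0.999));
    }
}
```

The serial case is the reason a single unreplicated component, such as the balancer in the diagram above, caps the availability of everything behind it.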

Failover: automatic switchover on failure

Having redundancy is worthless if no one activates it when needed. Failover is the process of detecting a component's failure and automatically switching over to its replica. Two metrics define its quality:

RTO (Recovery Time Objective): the maximum time we accept to be without service before it is recovered. Seconds? Minutes?
RPO (Recovery Point Objective): the maximum amount of data we can afford to lose, measured in time. The last transaction? The last 5 minutes?

        last safe copy            Failure             service recovered
              |                      |                        |
   -----------+----------------------X------------------------+-----------
              <-------- RPO --------><---------- RTO --------->
                   (data lost)          (time without service)

An RPO of zero means "lose no data" (requires synchronous replication, which is slower). An RTO of seconds requires fully automated failover.
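After a real incident or a drill, both metrics can be checked from three timestamps: the last replicated write, the failure, and the recovery. The following sketch shows the comparison; the class name, the timestamps, and the objectives in `main` are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: comparing an incident against its RPO and RTO (illustrative names and values).
final class RecoveryCheck {

    /** Data-loss window: time between the last replicated write and the failure. */
    static Duration dataLossWindow(Instant lastReplicated, Instant failure) {
        return Duration.between(lastReplicated, failure);
    }

    /** Outage length: time between the failure and the moment service is restored. */
    static Duration outage(Instant failure, Instant recovered) {
        return Duration.between(failure, recovered);
    }

    /** True when an observed duration stays within its objective. */
    static boolean meets(Duration observed, Duration objective) {
        return observed.compareTo(objective) <= 0;
    }

    public static void main(String[] args) {
        Instant lastReplicated = Instant.parse("2024-01-01T10:00:00Z");
        Instant failure = Instant.parse("2024-01-01T10:03:00Z");
        Instant recovered = Instant.parse("2024-01-01T10:05:30Z");

        Duration loss = dataLossWindow(lastReplicated, failure);
        Duration down = outage(failure, recovered);
        System.out.println("Data-loss window " + loss + ", within a 5-minute RPO: " + meets(loss, Duration.ofMinutes(5)));
        System.out.println("Outage " + down + ", within a 2-minute RTO: " + meets(down, Duration.ofMinutes(2)));
    }
}
```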

# Example: failover managed by health checks in a balancer (HAProxy)
backend web_servers
  option httpchk GET /health        # Check health with GET /health
  default-server inter 2s fall 3 rise 2
  server web1 10.0.0.1:8080 check   # Actively checked
  server web2 10.0.0.2:8080 check backup   # Only used if web1 goes down

Explanation: option httpchk GET /health tells HAProxy to call each server's /health endpoint periodically. inter 2s sets the check interval to 2 seconds; fall 3 marks a server as down after 3 consecutive failed checks; rise 2 brings it back after 2 consecutive successes. With these values a dead web1 is detected after roughly 6 seconds (3 checks at 2-second intervals). Because web2 is flagged as backup, it receives traffic only while web1 is down: this is automatic active-passive failover.
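The fall/rise logic is easy to model as a small state machine. The sketch below is a simplified model of the behavior described above, not HAProxy's actual implementation; the class and its methods are hypothetical.

```java
// Sketch: a fall/rise health tracker in the spirit of the balancer settings above.
final class HealthTracker {
    private final int fall;      // consecutive failures before the server is marked down
    private final int rise;      // consecutive successes before it is marked up again
    private boolean up = true;
    private int streak = 0;      // consecutive results that contradict the current state

    HealthTracker(int fall, int rise) {
        this.fall = fall;
        this.rise = rise;
    }

    /** Records one check result and returns the resulting state (true = up). */
    boolean record(boolean checkPassed) {
        if (checkPassed == up) {
            streak = 0;                       // result agrees with the current state
        } else if (++streak >= (up ? fall : rise)) {
            up = !up;                         // enough contradicting results in a row
            streak = 0;
        }
        return up;
    }

    /** Routes to the primary while it is up, otherwise to the backup. */
    String route(String primary, String backup) {
        return up ? primary : backup;
    }

    public static void main(String[] args) {
        HealthTracker web1 = new HealthTracker(3, 2);
        boolean[] checks = {false, false, false, true, true};
        for (boolean passed : checks) {
            web1.record(passed);
            System.out.println("check passed=" + passed + " -> traffic goes to " + web1.route("web1", "web2"));
        }
    }
}
```

Requiring several results in a row before changing state is a deliberate choice: a single lost packet should not trigger a failover, and a single lucky response should not bring a sick server back.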

Graceful degradation

Sometimes you cannot keep the entire service running, but you can keep part of it. Graceful degradation means continuing to offer the essential functionality even when secondary functions are lost, rather than going down completely. A service "running at half throttle" is preferable to an error screen.

Example: if a store's recommendations service goes down, the website must keep selling, simply without showing recommendations.

// Degradation pattern with a fallback value
public List<Product> getRecommendations(String userId) {
    try {
        // We try to call the recommendations service
        return recommendationsService.recommend(userId);
    } catch (ServiceUnavailableException e) {
        // If it fails, we do NOT break the page: we return the best sellers.
        log.warn("Recommendations unavailable, using fallback", e);
        return catalog.bestSellers();
    }
}

Here, if the recommendations service does not respond, we catch the exception and return an alternative list (the best sellers) instead of propagating the error to the user. The page keeps working, just with a slightly reduced experience.

This pattern is often combined with the Circuit Breaker: if a service fails repeatedly, the circuit breaker "opens" and stops calling it for a while, returning the fallback directly. This avoids overloading a service that is already struggling and speeds up the response.
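A circuit breaker can be sketched in a few lines to show the idea. This version is deliberately minimal and makes several assumptions: it is not thread-safe, it treats any RuntimeException as a failure, and the current time is passed in explicitly so the behavior is easy to follow. All names are invented; production code would more likely rely on an existing library such as Resilience4j.

```java
import java.util.function.Supplier;

// Sketch: minimal circuit breaker with a fallback (illustrative, not thread-safe).
final class CircuitBreaker<T> {
    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long openMillis;        // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs `call` unless the circuit is open; returns the fallback on failure or open circuit. */
    T call(Supplier<T> call, Supplier<T> fallback, long nowMillis) {
        boolean open = openedAt >= 0 && nowMillis - openedAt < openMillis;
        if (open) {
            return fallback.get();        // fail fast without touching the failing service
        }
        try {
            T result = call.get();
            consecutiveFailures = 0;      // success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = nowMillis;     // open (or re-open) the circuit
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreaker<String> breaker = new CircuitBreaker<>(2, 1000);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        Supplier<String> fallback = () -> "best sellers";
        System.out.println(breaker.call(failing, fallback, 0));      // real call fails, fallback used
        System.out.println(breaker.call(failing, fallback, 10));     // second failure opens the circuit
        System.out.println(breaker.call(failing, fallback, 20));     // circuit open: service not called
        System.out.println(breaker.call(() -> "personalized list", fallback, 2000)); // retried after the open period
    }
}
```

Once the open period has elapsed, the next call is let through as a trial: if it succeeds the circuit closes, and if it fails the circuit opens again immediately.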

Health checks: detecting that something is wrong

You cannot react to a failure you do not detect. Health checks are endpoints that report the state of an instance. Two essential types are distinguished in orchestrators like Kubernetes:

Type	Question it answers	What happens if it fails
Liveness	Is the process alive?	The container is restarted
Readiness	Is it ready to receive traffic?	Traffic stops being sent to it (without restarting)

The distinction is crucial: an instance can be alive but not ready (for example, while starting up or reconnecting to the database). In that case we do not want to restart it (liveness passes), but we do want to stop sending it requests (readiness fails).

# Health probes in a Kubernetes Pod
livenessProbe:
  httpGet:
    path: /health/live      # Is the process still alive?
    port: 8080
  initialDelaySeconds: 10    # Wait 10s after startup before the 1st probe
  periodSeconds: 5           # Check every 5 seconds
readinessProbe:
  httpGet:
    path: /health/ready     # Can it handle requests?
    port: 8080
  periodSeconds: 5

Line by line: the livenessProbe calls /health/live; if it keeps failing (failureThreshold consecutive failures, 3 by default), Kubernetes restarts the container. initialDelaySeconds: 10 avoids premature restarts while the app starts up. The readinessProbe calls /health/ready; while it fails, Kubernetes removes that Pod from the Service endpoints, so it receives no traffic, but leaves it running so it can recover. A good /health/ready should check the critical dependencies (database, queues) and not merely return "OK" every time.
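On the application side, the logic behind the two endpoints might look like the sketch below. It is framework-agnostic and only computes the HTTP status codes; wiring it to real /health/live and /health/ready routes is left out. The class, its methods, and the dependency names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Sketch: status codes for liveness and readiness endpoints (illustrative names).
final class HealthEndpoints {
    private final Map<String, BooleanSupplier> dependencies = new LinkedHashMap<>();

    /** Registers a critical dependency together with a check that tells whether it is reachable. */
    void addDependency(String name, BooleanSupplier check) {
        dependencies.put(name, check);
    }

    /** Liveness: trivial and fast; it never touches external dependencies. */
    int liveness() {
        return 200;
    }

    /** Readiness: 200 only when every critical dependency answers, otherwise 503. */
    int readiness() {
        for (BooleanSupplier check : dependencies.values()) {
            boolean ok;
            try {
                ok = check.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;               // a check that throws counts as a failed dependency
            }
            if (!ok) {
                return 503;
            }
        }
        return 200;
    }

    public static void main(String[] args) {
        boolean[] databaseUp = {false};   // simulated dependency state
        HealthEndpoints health = new HealthEndpoints();
        health.addDependency("database", () -> databaseUp[0]);

        System.out.println("live=" + health.liveness() + " ready=" + health.readiness());
        databaseUp[0] = true;             // the database finishes starting up
        System.out.println("live=" + health.liveness() + " ready=" + health.readiness());
    }
}
```

Keeping the dependency checks out of `liveness()` is the point of the design: a slow database should take the Pod out of rotation, not trigger a restart.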

Common mistakes and tips

Confusing SLA, SLO, and SLI. Remember: SLI is what you measure, SLO is what you aim for, SLA is what you promise. The SLA always goes below the SLO.
Chasing unnecessary nines. Each additional nine costs much more. Match availability to the real business need.
Redundancy that shares a hidden SPOF. Two nodes in the same rack, with the same power supply or the same cloud provider, share a point of failure. True redundancy distributes the risks.
Trivial health check. A /health that always returns 200 is useless: it must actually check the critical dependencies.
Never testing failover. A failover mechanism that is not rehearsed periodically will probably not work on the real day. Practice with drills (Chaos Engineering).
Tip: assume everything will fail and design for it. Reliability is built, not hoped for.

Exercises

Availability calculation. Your system was down for 4 hours in a year. What percentage availability did you provide, and roughly how many "nines" does that equal?
Health check design. A service restarts continuously in a loop. On investigation, you see that its livenessProbe points to an endpoint that checks the connection to a slow database that takes a while to start up. What is wrong and how would you fix it?
Continuity strategy. Define a reasonable RTO and RPO for (a) a corporate blog and (b) a banking payment system, justifying the difference.

Solutions

There are 8,760 hours in a year. Availability = (8,760 - 4) / 8,760 = 8,756 / 8,760 = 99.954%. That exceeds three nines (99.9% allows 8.76 h of downtime) but does not reach four nines (99.99% allows only 52.6 min). It is therefore between three and four nines.
The error is that the liveness probe checks a slow external dependency. While the database is starting up, the probe fails, Kubernetes concludes that the process is dead and restarts it, and the cycle repeats. Solution: the liveness probe should only check that the process responds (something trivial and fast); checking the database belongs to the readiness probe, which removes the Pod from traffic without restarting it. It is also advisable to adjust initialDelaySeconds or, for slow-starting applications, to use a startupProbe, which holds back the other probes until the application has started.
(a) Corporate blog: an RTO of hours and an RPO of hours (even a day) are acceptable; losing the last article or being down for a while is not critical, low cost prevails. (b) Banking payments: an RTO of seconds/minutes and an RPO of zero; you cannot lose any transaction or stop operating, which justifies synchronous replication and automatic failover even if expensive. The difference reflects the business impact of each outage.

Conclusion

You have learned to measure availability (SLI/SLO/SLA and the nines), to understand its rising cost, and to apply the techniques that make a system resilient: redundancy to eliminate SPOFs, failover to switch over automatically, graceful degradation to avoid going down entirely, and health checks to detect problems in time. High availability is, in essence, assuming that failure is inevitable and designing so that the user barely notices it.

An available but insecure system is a vulnerable system. In the next lesson, Security by Design and Authentication/Authorization, we will see how to protect the application against attacks from the very first moment of the design.

High Availability and Fault Tolerance
