High availability (HA) is the property of a system that keeps providing service even when some of its components fail. In the real world, hardware breaks, networks go down, and software has bugs: the question is not whether something will fail, but when. A well-designed system treats failure as normal and is built to survive it. In this lesson you will learn to measure availability with SLAs, SLOs, and SLIs, understand what the "nines" of availability mean, and see the techniques that let a system withstand failures: redundancy, automatic failover, graceful degradation, and health checks.
Contents
- SLA, SLO, and SLI: the language of availability
- The nines of availability
- Redundancy: eliminating single points of failure
- Failover: automatic switchover on failure
- Graceful degradation
- Health checks: detecting that something is wrong
- Common mistakes and tips
- Exercises
SLA, SLO, and SLI: the language of availability
To talk about availability rigorously, we need three concepts that are often confused:
| Term | What it is | Who sets it | Example |
|---|---|---|---|
| SLI (Indicator) | The metric you measure | Technical team | % of requests with a response < 300 ms |
| SLO (Objective) | The internal target for the SLI | Team/Product | "99.9% of requests OK per month" |
| SLA (Agreement) | The contractual commitment to the customer | Business/Legal | "99.5% or we refund your money" |
The relationship is hierarchical: the SLI is the raw measurement, the SLO is the target you set internally, and the SLA is the formal promise to the customer (with penalties if breached). As a matter of prudence, the SLA is always set below the SLO: if your internal target is 99.9% but you promise 99.5%, you have a safety margin.
From this comes a key concept: the error budget. If your SLO is 99.9% monthly, you "allow yourself" 0.1% of failures. That 0.1% is a budget you can spend on deploying risky changes. If you exhaust it, you freeze deployments and focus on stability.

    Monthly error budget with SLO = 99.9%
    Total time in the month: 30 days = 43,200 minutes
    Required availability:   99.9%
    Allowed downtime:        0.1% of 43,200 = 43.2 minutes/month

That is, with an SLO of 99.9% you can be down for at most 43 minutes a month without breaching your target.
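The same arithmetic can be sketched in Java. This is only an illustration under simple assumptions: the SLO is time-based, the window is a fixed number of days, and the class and method names (`ErrorBudget`, `allowedDowntimeMinutes`, and so on) are invented for this example rather than taken from any library.

```java
/** Error-budget arithmetic for a time-based SLO. All names here are illustrative. */
final class ErrorBudget {

    /** Minutes of downtime the SLO tolerates over a window of the given length. */
    static double allowedDowntimeMinutes(double sloPercent, int windowDays) {
        if (sloPercent <= 0.0 || sloPercent >= 100.0) {
            throw new IllegalArgumentException("SLO must be strictly between 0 and 100");
        }
        double windowMinutes = windowDays * 24.0 * 60.0;
        return windowMinutes * (100.0 - sloPercent) / 100.0;
    }

    /** Share of the budget still unspent: 1.0 = untouched, 0.0 or less = exhausted. */
    static double remainingBudgetFraction(double sloPercent, int windowDays, double downtimeMinutes) {
        double budget = allowedDowntimeMinutes(sloPercent, windowDays);
        return (budget - downtimeMinutes) / budget;
    }

    /** Policy described in the text: once the budget is spent, stop risky deployments. */
    static boolean shouldFreezeDeployments(double remainingFraction) {
        return remainingFraction <= 0.0;
    }

    public static void main(String[] args) {
        double budget = allowedDowntimeMinutes(99.9, 30);
        double remaining = remainingBudgetFraction(99.9, 30, 30.0); // 30 min of outages so far
        System.out.printf("Monthly budget: %.1f min%n", budget);
        System.out.printf("Remaining: %.0f%%, freeze deployments: %b%n",
                remaining * 100, shouldFreezeDeployments(remaining));
    }
}
```

Tracking the remaining fraction rather than a simple pass/fail flag lets a team see how fast the budget is being consumed before it runs out.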
The nines of availability
Availability is expressed as the percentage of time in service, and colloquially it is counted in "nines". Each additional nine drastically reduces the allowed downtime and multiplies the cost.
| Availability | "Nines" | Downtime per year | Downtime per month | Downtime per week |
|---|---|---|---|---|
| 90% | one nine | 36.5 days | 72 hours | 16.8 hours |
| 99% | two nines | 3.65 days | 7.2 hours | 1.68 hours |
| 99.9% | three nines | 8.76 hours | 43.2 min | 10.1 min |
| 99.99% | four nines | 52.6 min | 4.32 min | 1.01 min |
| 99.999% | five nines | 5.26 min | 25.9 sec | 6.05 sec |
Five nines (99.999%) is legendary in telecommunications, but it demands an enormous investment: full redundancy, multiple data centers, complete automation. The architectural question is not "how many nines can I achieve?" but "how many nines does the business really need and is willing to pay for?". Going from 99.9% to 99.99% can multiply the cost tenfold, all to save roughly eight hours of downtime a year (8.76 hours versus 52.6 minutes).
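The rows of the table follow from a single formula: allowed downtime = period × (1 − availability). A minimal Java sketch, assuming a 365-day year as the table does; the class name `Nines` and its methods are hypothetical.

```java
import java.time.Duration;

/** Reproduces the downtime figures of the nines table. Names are illustrative. */
final class Nines {

    /** Availability in percent for a given number of nines: 1 -> 90, 3 -> 99.9. */
    static double availabilityPercent(int nines) {
        return 100.0 * (1.0 - Math.pow(10, -nines));
    }

    /** Downtime tolerated over the period at the given availability percentage. */
    static Duration allowedDowntime(double availabilityPercent, Duration period) {
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return Duration.ofMillis(Math.round(period.toMillis() * unavailableFraction));
    }

    public static void main(String[] args) {
        Duration year = Duration.ofDays(365);
        for (int nines = 1; nines <= 5; nines++) {
            double availability = availabilityPercent(nines);
            double minutes = allowedDowntime(availability, year).toMillis() / 60_000.0;
            System.out.printf("%d nines (%.3f%%): %.2f minutes per year%n",
                    nines, availability, minutes);
        }
    }
}
```

Each extra nine divides the tolerated downtime by ten, which is why the last rows of the table leave no room for manual intervention.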
Redundancy: eliminating single points of failure
A single point of failure (SPOF) is any component whose failure brings down the entire system. The foundation of high availability is to eliminate SPOFs through redundancy: having more than one copy of every critical component.
There are two redundancy models:
| Model | Description | Advantage | Drawback |
|---|---|---|---|
| Active-Passive | One active replica and another on standby | Simple, no conflicts | The passive replica is idle |
| Active-Active | All replicas serve traffic | Uses all resources | More complex (synchronization) |

    graph TD
        LB[Balancer] --> A[Node A - active]
        LB -.standby.-> B[Node B - passive]
        A --> DBp[(Primary DB)]
        DBp -.replication.-> DBs[(Replica DB)]

In this active-passive scheme, node B only comes into play if A goes down, and the replica database receives a continuous copy of the data so it can replace the primary. The practical rule: identify each component and ask yourself "what happens if this goes down?". If the answer is "everything goes down", you have a SPOF that you must replicate.
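Why redundancy pays off can be shown with basic probability. The Java sketch below rests on a strong assumption, stated loudly: failures are independent. Under that assumption, components in series multiply their availabilities, and a redundant group fails only when every replica fails at once. The class name `CombinedAvailability` and the sample figures are hypothetical.

```java
/** Availability of composed components, ASSUMING independent failures. */
final class CombinedAvailability {

    /** Components in series: all of them must work, so availabilities multiply. */
    static double series(double... availabilities) {
        double result = 1.0;
        for (double availability : availabilities) {
            result *= availability;
        }
        return result;
    }

    /** Redundant replicas: the group is down only if every replica is down. */
    static double redundant(double availability, int replicas) {
        return 1.0 - Math.pow(1.0 - availability, replicas);
    }

    public static void main(String[] args) {
        double node = 0.99; // hypothetical availability of a single node
        System.out.printf("one node:            %.6f%n", node);
        System.out.printf("two redundant nodes: %.6f%n", redundant(node, 2));
        System.out.printf("balancer + nodes + database in series: %.6f%n",
                series(0.999, redundant(node, 2), 0.999));
    }
}
```

Two 99% nodes give about 99.99% as a group, but only while the independence assumption holds. The series calculation also shows that a single unreplicated component caps the availability of the whole chain.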
Failover: automatic switchover on failure
Having redundancy is worthless if no one activates it when needed. Failover is the process of detecting a component's failure and automatically switching over to its replica. Two metrics define its quality:
- RTO (Recovery Time Objective): the maximum time we accept for the service to be restored after a failure. Seconds? Minutes?
- RPO (Recovery Point Objective): how much data we can afford to lose, measured in time. The last transaction? The last 5 minutes?

                                   Failure
                                      |
    ----[ data safe ]-----------------X-----[ service down ]--[ service recovered ]----
                     <----- RPO -----> <-------- RTO -------->
                        (data lost)    (time without service)

An RPO of zero means "lose no data" (requires synchronous replication, which is slower). An RTO of seconds requires fully automated failover.
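To make the two windows in the timeline concrete, the following Java sketch checks a single incident against its objectives. Everything in it is hypothetical: the timestamps, the 5-minute RPO, the 1-minute RTO, and the class name `RecoveryObjectives`.

```java
import java.time.Duration;
import java.time.Instant;

/** Compares one incident against RPO/RTO targets. Timestamps and targets are hypothetical. */
final class RecoveryObjectives {

    /** Data at risk: everything written after the last replicated write and before the failure. */
    static Duration dataLossWindow(Instant lastReplicatedWrite, Instant failure) {
        return Duration.between(lastReplicatedWrite, failure);
    }

    /** Time without service: from the failure until the service is recovered. */
    static Duration outage(Instant failure, Instant recovered) {
        return Duration.between(failure, recovered);
    }

    /** An objective is met when the measured duration does not exceed it. */
    static boolean meets(Duration measured, Duration objective) {
        return measured.compareTo(objective) <= 0;
    }

    public static void main(String[] args) {
        Instant lastReplicatedWrite = Instant.parse("2024-01-01T10:00:00Z");
        Instant failure = Instant.parse("2024-01-01T10:03:00Z");
        Instant recovered = Instant.parse("2024-01-01T10:04:30Z");

        Duration rpo = Duration.ofMinutes(5);
        Duration rto = Duration.ofMinutes(1);

        System.out.println("RPO met: " + meets(dataLossWindow(lastReplicatedWrite, failure), rpo));
        System.out.println("RTO met: " + meets(outage(failure, recovered), rto));
    }
}
```

In this made-up incident the data loss stays within the RPO while the 90-second outage misses the RTO, which illustrates that the two objectives are measured and met independently.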

    # Example: failover managed by health checks in a balancer (HAProxy)
    backend web_servers
        option httpchk GET /health               # Check health with GET /health
        default-server inter 2s fall 3 rise 2
        server web1 10.0.0.1:8080 check          # Actively checked
        server web2 10.0.0.2:8080 check backup   # Only used if web1 goes down

Explanation: `option httpchk GET /health` tells HAProxy to call each server's `/health` endpoint periodically. `inter 2s` runs the check every 2 seconds; `fall 3` marks the server as down after 3 consecutive failures; `rise 2` brings it back after 2 consecutive successes. Because `web2` is marked `backup`, it only receives traffic when `web1` is down: this is automatic active-passive failover.
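The `fall` and `rise` thresholds behave like a small state machine. The Java sketch below is a simplified model of that idea, not HAProxy's actual implementation; the class name `HealthTracker` and its methods are hypothetical.

```java
/** Tracks one server's state from health-check results, using fall/rise thresholds. */
final class HealthTracker {
    private final int fall;   // consecutive failures needed to mark the server down
    private final int rise;   // consecutive successes needed to mark it up again
    private boolean up = true;
    private int streak = 0;   // consecutive results that contradict the current state

    HealthTracker(int fall, int rise) {
        this.fall = fall;
        this.rise = rise;
    }

    /** Records one check result and returns whether the server is now considered up. */
    boolean record(boolean checkPassed) {
        if (checkPassed == up) {
            streak = 0;                 // result agrees with the current state: reset
        } else if (++streak >= (up ? fall : rise)) {
            up = !up;                   // threshold reached: flip the state
            streak = 0;
        }
        return up;
    }

    boolean isUp() {
        return up;
    }

    public static void main(String[] args) {
        HealthTracker web1 = new HealthTracker(3, 2); // fall 3, rise 2
        boolean[] checks = {true, false, false, false, true, true};
        for (boolean passed : checks) {
            boolean nowUp = web1.record(passed);
            System.out.println((passed ? "pass" : "fail") + " -> " + (nowUp ? "UP" : "DOWN"));
        }
    }
}
```

Requiring several consecutive results avoids flapping on a single lost check, at the price of slower detection: with a 2-second interval and `fall 3`, a dead server is noticed only after roughly three check intervals, and that delay counts toward the RTO.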
Graceful degradation
Sometimes you cannot keep the entire service running, but you can keep part of it. Graceful degradation means continuing to offer the essential functionality even when secondary functions are lost, rather than going down completely. A service "running at half throttle" is preferable to an error screen.
Example: if a store's recommendations service goes down, the website must keep selling, simply without showing recommendations.

    // Degradation pattern with a fallback value
    public List<Product> getRecommendations(String userId) {
        try {
            // We try to call the recommendations service
            return recommendationsService.recommend(userId);
        } catch (ServiceUnavailableException e) {
            // If it fails, we do NOT break the page: we return the best sellers.
            log.warn("Recommendations unavailable, using fallback", e);
            return catalog.bestSellers();
        }
    }

Here, if the recommendations service does not respond, we catch the exception and return an alternative list (the best sellers) instead of propagating the error to the user. The page keeps working, just with a slightly reduced experience.
This pattern is often combined with the Circuit Breaker: if a service fails repeatedly, the circuit breaker "opens" and stops calling it for a while, returning the fallback directly. This avoids overloading an already sick service and speeds up the response.
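A minimal circuit breaker might look like the Java sketch below. It is a teaching sketch under stated assumptions, not a production implementation: the class `SimpleCircuitBreaker` is hypothetical, only unchecked exceptions count as failures, and the whole call is synchronized for simplicity, which serializes callers. Real projects usually rely on a dedicated resilience library instead.

```java
import java.util.function.Supplier;

/** Minimal circuit breaker: opens after N consecutive failures, retries after a cool-down. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;              // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the call unless the circuit is open; returns the fallback on failure or when open. */
    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        long now = System.currentTimeMillis();
        if (openedAt >= 0 && now - openedAt < openMillis) {
            return fallback.get();           // open: do not touch the sick service
        }
        try {
            T result = remoteCall.get();     // closed, or cool-down elapsed: try the call
            consecutiveFailures = 0;
            openedAt = -1;                   // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = now;              // too many failures: open (or re-open)
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 30_000);
        Supplier<String> sickService = () -> { throw new IllegalStateException("service down"); };
        for (int i = 1; i <= 3; i++) {
            String result = breaker.call(sickService, () -> "best sellers");
            System.out.println("call " + i + ": " + result + ", open=" + breaker.isOpen());
        }
    }
}
```

Applied to the earlier example, the recommendations call would be passed as the first supplier and `catalog.bestSellers()` as the fallback, assuming `ServiceUnavailableException` is an unchecked exception.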
Health checks: detecting that something is wrong
You cannot react to a failure you do not detect. Health checks are endpoints that report the state of an instance. Two essential types are distinguished in orchestrators like Kubernetes:
| Type | Question it answers | What happens if it fails |
|---|---|---|
| Liveness | Is the process alive? | The container is restarted |
| Readiness | Is it ready to receive traffic? | Traffic stops being sent to it (without restarting) |
The distinction is crucial: an instance can be alive but not ready (for example, while starting up or reconnecting to the database). In that case we do not want to restart it (liveness passes), but we do want to stop sending it requests (readiness fails).

    # Health probes in a Kubernetes Pod
    livenessProbe:
      httpGet:
        path: /health/live       # Is the process still alive?
        port: 8080
      initialDelaySeconds: 10    # Wait 10s after startup before the 1st probe
      periodSeconds: 5           # Check every 5 seconds
    readinessProbe:
      httpGet:
        path: /health/ready      # Can it handle requests?
        port: 8080
      periodSeconds: 5

Line by line: the `livenessProbe` calls `/health/live`; if it fails, Kubernetes restarts the container. The `initialDelaySeconds: 10` avoids premature restarts while the app starts up. The `readinessProbe` calls `/health/ready`; if it fails, Kubernetes removes that Pod from balancing but leaves it alive so it can recover. A good `/health/ready` should check the critical dependencies (database, queues) and not merely return "OK" every time.
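On the application side, the two endpoints could be served as in the Java sketch below, which uses the JDK's built-in `com.sun.net.httpserver` package. It is an illustration only: the class `HealthEndpoints` is hypothetical, and the dependency checks are passed in as plain `BooleanSupplier` values that a real service would replace with actual database or queue pings.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

/** Sketch of liveness and readiness endpoints on the JDK's built-in HTTP server. */
final class HealthEndpoints {

    /** Liveness stays trivial: if this code runs, the process is alive. */
    static int livenessStatus() {
        return 200;
    }

    /** Readiness is 200 only if every critical dependency check passes, otherwise 503. */
    static int readinessStatus(BooleanSupplier... dependencyChecks) {
        for (BooleanSupplier check : dependencyChecks) {
            try {
                if (!check.getAsBoolean()) {
                    return 503;
                }
            } catch (RuntimeException e) {
                return 503;              // a check that throws counts as a failed dependency
            }
        }
        return 200;
    }

    /** Starts the server; the caller is responsible for stopping it. */
    static HttpServer start(int port, BooleanSupplier... dependencyChecks) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health/live", exchange -> respond(exchange, livenessStatus()));
        server.createContext("/health/ready",
                exchange -> respond(exchange, readinessStatus(dependencyChecks)));
        server.start();
        return server;
    }

    private static void respond(HttpExchange exchange, int status) throws IOException {
        byte[] body = (status == 200 ? "OK" : "UNAVAILABLE").getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    public static void main(String[] args) {
        // Hypothetical checks; real ones would ping the database, the queue, and so on.
        BooleanSupplier databaseReachable = () -> true;
        BooleanSupplier queueReachable = () -> false;
        System.out.println("live:  " + livenessStatus());
        System.out.println("ready: " + readinessStatus(databaseReachable, queueReachable));
    }
}
```

Calling `HealthEndpoints.start(8080, databaseReachable, queueReachable)` would expose both paths on the port used in the probe configuration. Keeping liveness free of dependency checks means a database outage takes the Pod out of rotation without restarting it.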
Common Mistakes and Tips
- Confusing SLA, SLO, and SLI. Remember: SLI is what you measure, SLO is what you aim for, SLA is what you promise. The SLA always goes below the SLO.
- Chasing unnecessary nines. Each additional nine costs much more. Match availability to the real business need.
- Redundancy that shares a hidden SPOF. Two nodes in the same rack, with the same power supply or the same cloud provider, share a point of failure. True redundancy distributes the risks.
- Trivial health check. A `/health` that always returns 200 is useless: it must actually check the critical dependencies.
- Never testing failover. A failover mechanism that is not rehearsed periodically will probably not work on the real day. Practice with drills (Chaos Engineering).
- Tip: assume everything will fail and design for it. Reliability is built, not hoped for.
Exercises
- Availability calculation. Your system was down for 4 hours in a year. What percentage availability did you provide, and roughly how many "nines" does that equal?
- Health check design. A service restarts continuously in a loop. On investigation, you see that its `livenessProbe` points to an endpoint that checks the connection to a slow database that takes a while to start up. What is wrong and how would you fix it?
- Continuity strategy. Define a reasonable RTO and RPO for (a) a corporate blog and (b) a banking payment system, justifying the difference.
Solutions
- There are 8,760 hours in a year. Availability = (8,760 - 4) / 8,760 = 8,756 / 8,760 = 99.954%. That exceeds three nines (99.9% allows 8.76 h of downtime) but does not reach four nines (99.99% allows only 52.6 min). It is therefore between three and four nines.
- The error is that the liveness probe checks a slow external dependency. While the database is starting up, the probe fails, Kubernetes thinks the process is dead and restarts it, entering a loop. Solution: the liveness probe should only check that the process responds (something trivial and fast); checking the database belongs to the readiness probe, which removes the Pod from traffic without restarting it. It is also advisable to adjust `initialDelaySeconds` or, for slow-starting applications, to add a `startupProbe`, which holds back the other probes until startup has succeeded.
- (a) Corporate blog: an RTO of hours and an RPO of hours (even a day) are acceptable; losing the last article or being down for a while is not critical, so low cost prevails. (b) Banking payments: an RTO of seconds to minutes and an RPO of zero; you cannot lose any transaction or stop operating, which justifies synchronous replication and automatic failover even if they are expensive. The difference reflects the business impact of each outage.
Conclusion
You have learned to measure availability (SLI/SLO/SLA and the nines), to understand its rising cost, and to apply the techniques that make a system resilient: redundancy to eliminate SPOFs, failover to switch over automatically, graceful degradation to avoid going down entirely, and health checks to detect problems in time. High availability is, in essence, assuming that failure is inevitable and designing so that the user barely notices it.
An available but insecure system is a vulnerable system. In the next lesson, Security by Design and Authentication/Authorization, we will see how to protect the application against attacks from the very first moment of the design.
