Scalability is a system's ability to handle an increase in load (more users, more requests, more data) without degrading its performance to an unacceptable degree. It is one of the most important quality attributes in application architecture: a system that works perfectly with 100 users can collapse with 100,000 if it was not designed to grow. In this lesson you will learn the two fundamental scaling strategies, how to manage state so you can scale, what load balancers and their algorithms are, how to split data through sharding, and finally the mathematical laws (Amdahl's Law and the Universal Scalability Law) that place real limits on how much we can scale.
Contents
- Vertical scaling (scale up) vs. horizontal scaling (scale out)
- The problem of state and sessions
- Load balancers and distribution algorithms
- Sharding: data partitioning
- The theoretical limits: Amdahl's Law and the Universal Scalability Law
- Common mistakes and tips
- Exercises
- Vertical scaling (scale up) vs. horizontal scaling (scale out)
There are two basic ways to give a system more capacity:
- Vertical scaling (scale up): adding more resources to a single machine (more CPU, more RAM, faster disks). It is like trading your car for a more powerful one.
- Horizontal scaling (scale out): adding more machines that work in parallel. It is like having a fleet of cars instead of just one.
The following table compares both strategies on the aspects that matter most to the architect:
| Aspect | Vertical (scale up) | Horizontal (scale out) |
|---|---|---|
| How you grow | A bigger machine | More machines |
| Maximum limit | Physical (most powerful hardware available) | Practically unlimited |
| Cost | Grows exponentially at the end | Grows linearly |
| Single point of failure | Yes (a single machine) | No (natural redundancy) |
| Complexity | Low (no code changes) | High (distributed state, network) |
| Downtime when scaling | Usually requires a restart | Zero (add nodes hot) |
| Ideal for | Relational databases, small/medium loads | Stateless web services, microservices |
In practice, modern systems combine both: you scale vertically up to a reasonable cost/benefit point, and from there you scale horizontally.
# Example of horizontal scaling in Kubernetes: # we go from 3 to 10 replicas of a service with a single command kubectl scale deployment my-service --replicas=10
This command asks Kubernetes to keep 10 identical copies (replicas) of the my-service container running. The orchestrator takes care of starting 7 new instances and distributing them across the available nodes. There is no need to touch the code: this is pure horizontal scaling.
- The problem of state and sessions
To be able to scale horizontally, nodes must be interchangeable: any request must be serviceable on any instance. The great enemy of this is the state stored in a node's local memory.
- Stateful service: stores user data in its own memory (for example, the HTTP session). If the next request goes to another node, that node knows nothing about the user.
- Stateless service: stores nothing of its own between requests; all the necessary information travels in the request or lives in a shared store (database, distributed cache).
The golden rule is: design stateless services and externalize the state.
// BAD: session in the node's local memory (stateful)
@RestController
public class CartController {
// This Map lives in the memory of THIS specific node.
// If the balancer sends the next request to another node, the cart "disappears".
private final Map<String, Cart> carts = new ConcurrentHashMap<>();
}The problem with the previous example is that the shopping cart only exists on the node that created it. Let's look at the correct version, which externalizes the state to a shared store (Redis):
// GOOD: state externalized to Redis (stateless from the node's point of view)
@RestController
public class CartController {
private final RedisTemplate<String, Cart> redis;
public CartController(RedisTemplate<String, Cart> redis) {
this.redis = redis;
}
@GetMapping("/cart/{id}")
public Cart view(@PathVariable String id) {
// The cart is read from Redis, accessible from ANY node.
return redis.opsForValue().get("cart:" + id);
}
}Here any node can service the request because the cart lives in Redis, outside the process's memory. This makes it possible to add or remove nodes without losing user data.
When state on the node cannot be avoided, an intermediate solution is session affinity (sticky sessions): the balancer always sends the same user to the same node. It works, but it breaks interchangeability and complicates scaling and fault tolerance (if the node goes down, the session is lost).
- Load balancers and distribution algorithms
A load balancer is the component that receives all incoming requests and distributes them among the available nodes. It is the piece that makes horizontal scaling possible from the client's point of view: the client sees a single address, but behind it there are many machines.
The most common distribution algorithms are:
| Algorithm | How it distributes | When to use it |
|---|---|---|
| Round Robin | In turns, one after another | Homogeneous nodes and similar requests |
| Weighted Round Robin | In turns, but with weights | Nodes of differing power |
| Least Connections | To the node with the fewest active connections | Requests of variable duration |
| IP Hash | Based on a hash of the source IP | To maintain session affinity |
| Least Response Time | To the node that responds fastest | Optimize latency |
Let's look at a real balancing configuration with Nginx:
# We define a group of backend servers
upstream backend {
least_conn; # Algorithm: fewest active connections
server 10.0.0.1:8080 weight=2; # More powerful node: weight 2
server 10.0.0.2:8080; # Weight 1 by default
server 10.0.0.3:8080;
}
server {
listen 80;
location / {
proxy_pass http://backend; # Forwards requests to the "backend" group
}
}Line by line: upstream backend declares the set of nodes. least_conn indicates that Nginx will send each request to the node with the fewest open connections at that moment. The weight=2 on the first server makes it receive roughly twice the traffic because it is more powerful. Finally, proxy_pass http://backend connects the incoming traffic to that group.
It is worth distinguishing the network layer at which the balancer operates:
- Layer 4 (L4, transport): distributes by IP and port, without inspecting the content. Very fast.
- Layer 7 (L7, application): understands HTTP and can route by URL, headers, or cookies. Smarter, somewhat slower.
- Sharding: data partitioning
Scaling the application servers is relatively easy because they are stateless. The database is much harder because it has state by definition. When a single database cannot handle the entire volume, you turn to sharding: splitting the data into fragments (shards), each on a different server.
The key is choosing a partition key (shard key) that distributes the data in a balanced way:
Users partitioned by hash(user_id) % 3: hash(user_id) % 3 == 0 -> Shard 0 (server A) hash(user_id) % 3 == 1 -> Shard 1 (server B) hash(user_id) % 3 == 2 -> Shard 2 (server C)
Each user lives on a single shard, determined by their user_id. This way, a user's queries only hit one server.
Common sharding strategies:
| Strategy | Description | Risk |
|---|---|---|
| By range | By key intervals (A-M, N-Z) | Hot spots if the key is not uniform |
| By hash | Hash of the key modulo N | Re-partitioning when N changes is costly |
| By directory | A lookup table decides the shard | Single point of failure in the directory |
| Geographic | By the user's region | Complies with data regulations (e.g. GDPR) |
The major drawback of sharding is queries that span several shards (for example, "give me all the orders from the last month for all users"): they require querying every server and combining results, which is slow and complex. That is why sharding is only applied when strictly necessary.
graph TD
C[Client] --> R[Sharding Router]
R -->|even user_id| S0[(Shard 0)]
R -->|odd user_id| S1[(Shard 1)]This diagram shows how an intermediate router decides, based on the user's key, which shard to direct each operation to. The client is unaware that the data is distributed.
- The theoretical limits: Amdahl's Law and the Universal Scalability Law
You cannot scale infinitely. There are mathematical laws that prove it.
Amdahl's Law
Amdahl's Law states that the speedup of a system when adding processors is limited by the fraction of the work that cannot be parallelized (the sequential part).
Speedup(N) = 1 / ( (1 - P) + P/N ) P = fraction of the work that CAN be parallelized N = number of processors/nodes
If 10% of the work is sequential (P = 0.9), even with infinite nodes the maximum speedup is 1 / (1 - 0.9) = 10x. In other words: the sequential part sets an absolute ceiling. That is why reducing that sequential fraction (locks, critical sections, global transactions) is so valuable.
Universal Scalability Law (USL)
Amdahl's Law is optimistic because it ignores the cost of coordinating the nodes with each other. Neil Gunther's Universal Scalability Law (USL) adds that cost and predicts something more realistic and sometimes surprising: beyond a certain point, adding more nodes worsens performance.
C(N) = N / ( 1 + α(N-1) + βN(N-1) ) α (alpha) = contention cost (waiting for shared resources) β (beta) = coherence cost (synchronizing state between nodes)
The βN(N-1) term grows quadratically: each new node has to coordinate with all the others. When that coherence cost (β) dominates, the performance curve rises, reaches a maximum, and then falls. The practical lesson is clear: minimize communication and synchronization between nodes if you want to scale for real.
Common Mistakes and Tips
- Storing state in the node's local memory. This is mistake number one. Always externalize sessions to Redis or similar before scaling horizontally.
- Overusing sticky sessions. It patches the symptom but not the cause; it hinders balanced distribution and fault tolerance.
- Scaling the application but forgetting the database. 50 web nodes are useless if they all hit a single saturated database. Identify the true bottleneck.
- Applying sharding too early. Sharding adds enormous complexity. First try read replicas, caches, and indexes.
- Ignoring the USL. More nodes is not always more performance. Always measure; do not assume linear scaling.
- Tip: measure before scaling. The bottleneck is usually where you least expect it.
Exercises
-
Appropriate scaling. You have a stateless REST API that receives unpredictable traffic spikes and a tight infrastructure budget. Would you scale vertically or horizontally, and why?
-
Session diagnosis. A user reports that, at random, their shopping cart "empties" after adding products. The application runs on 4 nodes behind a Round Robin balancer. What is the most likely cause and how would you fix it?
-
Calculation with Amdahl. A process has 25% sequential work (P = 0.75). What is the maximum theoretical speedup, even with infinite nodes available?
Solutions
-
Horizontal. Being stateless, it scales horizontally effortlessly. Horizontal scaling with autoscaling lets you add nodes only during spikes and remove them afterward, optimizing cost compared to a huge machine that is always on. It also avoids the single point of failure.
-
The cause is session state stored in each node's local memory. With Round Robin, consecutive requests from the same user go to different nodes that do not know about their cart. Correct solution: externalize the session to a shared store (Redis). Temporary patch solution: enable sticky sessions (IP Hash) to force the user to always return to the same node.
-
Applying Amdahl with N tending to infinity:
Speedup = 1 / (1 - P) = 1 / (1 - 0.75) = 1 / 0.25 = 4x. Even with infinite nodes, you will never go more than 4 times faster.
Conclusion
You have learned that scaling means choosing between growing up (vertical) or out (horizontal), that horizontal scaling requires removing state from the nodes, that the balancer and its algorithms distribute the load, that sharding splits the data when the database can no longer cope, and that Amdahl's Law and the USL place real limits on how much we can grow. A scalable system is not improvised: it is designed stateless, measured constantly, and coordination is minimized.
But scaling is not enough: the system must also keep working when something fails. In the next lesson, High Availability and Fault Tolerance, we will see how to measure and guarantee that the service stays available even when individual components go down.
Application Architecture Course
Module 1: Fundamentals of Application Architecture
- What Is Application Architecture?
- The Role of the Software Architect
- Quality Attributes and Non-Functional Requirements
- Architectural Decisions and Trade-offs
- Architecture Documentation: Views and the C4 Model
Module 2: Design Principles and Tactics
- Coupling, Cohesion and Separation of Concerns
- SOLID Principles Applied to Architecture
- DRY, KISS, YAGNI and Other Design Principles
- Architectural Tactics for Quality Attributes
- Managing Technical Debt
Module 3: Architectural Styles and Patterns
- Monolithic Architecture
- Layered Architecture (N-Tier)
- Client-Server Architecture
- Hexagonal Architecture (Ports and Adapters)
- Clean and Onion Architecture
Module 4: Distributed Architectures and Microservices
- Introduction to Distributed Systems
- Microservices Architecture
- Service Decomposition and Bounded Contexts
- API Gateway, Service Discovery and Inter-Service Communication
- Resilience Patterns: Circuit Breaker, Retry and Bulkhead
- The CAP Theorem and Data Consistency
Module 5: Event-Driven Architectures and Messaging
- Fundamentals of Event-Driven Architecture
- Asynchronous Messaging: Queues and Brokers
- Event Patterns: Event Sourcing and CQRS
- Managing Distributed Transactions: The Saga Pattern
- Real-Time Data Streaming
Module 6: Domain-Driven Design (DDD)
- Core DDD Concepts
- Strategic Design: Bounded Contexts and Ubiquitous Language
- Tactical Design: Entities, Aggregates and Repositories
- Context Mapping
Module 7: Data and Persistence
- Persistence Strategies: SQL vs NoSQL
- Data Access Patterns: Repository, Unit of Work and DAO
- Database per Service and Distributed Data Management
- Caching and Invalidation Strategies
Module 8: Cloud Architecture and Deployment
- Cloud Computing Fundamentals (IaaS, PaaS, SaaS)
- Containers and Orchestration with Docker and Kubernetes
- Serverless Architecture
- Cloud-Native Design Patterns
- Infrastructure as Code (IaC)
Module 9: Quality, Security and Observability
- Scalability: Horizontal vs Vertical and Load Balancing
- High Availability and Fault Tolerance
- Security by Design and Authentication/Authorization
- Observability: Logging, Metrics and Tracing
- Performance and Load Testing
