Scalability is a system's ability to handle an increase in load (more users, more requests, more data) without degrading its performance to an unacceptable degree. It is one of the most important quality attributes in application architecture: a system that works perfectly with 100 users can collapse with 100,000 if it was not designed to grow. In this lesson you will learn the two fundamental scaling strategies, how to manage state so you can scale, what load balancers and their algorithms are, how to split data through sharding, and finally the mathematical laws (Amdahl's Law and the Universal Scalability Law) that place real limits on how much we can scale.

Contents

  1. Vertical scaling (scale up) vs. horizontal scaling (scale out)
  2. The problem of state and sessions
  3. Load balancers and distribution algorithms
  4. Sharding: data partitioning
  5. The theoretical limits: Amdahl's Law and the Universal Scalability Law
  6. Common mistakes and tips
  7. Exercises

  1. Vertical scaling (scale up) vs. horizontal scaling (scale out)

There are two basic ways to give a system more capacity:

  • Vertical scaling (scale up): adding more resources to a single machine (more CPU, more RAM, faster disks). It is like trading your car for a more powerful one.
  • Horizontal scaling (scale out): adding more machines that work in parallel. It is like having a fleet of cars instead of just one.

The following table compares both strategies on the aspects that matter most to the architect:

Aspect Vertical (scale up) Horizontal (scale out)
How you grow A bigger machine More machines
Maximum limit Physical (most powerful hardware available) Practically unlimited
Cost Grows exponentially at the end Grows linearly
Single point of failure Yes (a single machine) No (natural redundancy)
Complexity Low (no code changes) High (distributed state, network)
Downtime when scaling Usually requires a restart Zero (add nodes hot)
Ideal for Relational databases, small/medium loads Stateless web services, microservices

In practice, modern systems combine both: you scale vertically up to a reasonable cost/benefit point, and from there you scale horizontally.

# Example of horizontal scaling in Kubernetes:
# we go from 3 to 10 replicas of a service with a single command
kubectl scale deployment my-service --replicas=10

This command asks Kubernetes to keep 10 identical copies (replicas) of the my-service container running. The orchestrator takes care of starting 7 new instances and distributing them across the available nodes. There is no need to touch the code: this is pure horizontal scaling.

  1. The problem of state and sessions

To be able to scale horizontally, nodes must be interchangeable: any request must be serviceable on any instance. The great enemy of this is the state stored in a node's local memory.

  • Stateful service: stores user data in its own memory (for example, the HTTP session). If the next request goes to another node, that node knows nothing about the user.
  • Stateless service: stores nothing of its own between requests; all the necessary information travels in the request or lives in a shared store (database, distributed cache).

The golden rule is: design stateless services and externalize the state.

// BAD: session in the node's local memory (stateful)
@RestController
public class CartController {
    // This Map lives in the memory of THIS specific node.
    // If the balancer sends the next request to another node, the cart "disappears".
    private final Map<String, Cart> carts = new ConcurrentHashMap<>();
}

The problem with the previous example is that the shopping cart only exists on the node that created it. Let's look at the correct version, which externalizes the state to a shared store (Redis):

// GOOD: state externalized to Redis (stateless from the node's point of view)
@RestController
public class CartController {

    private final RedisTemplate<String, Cart> redis;

    public CartController(RedisTemplate<String, Cart> redis) {
        this.redis = redis;
    }

    @GetMapping("/cart/{id}")
    public Cart view(@PathVariable String id) {
        // The cart is read from Redis, accessible from ANY node.
        return redis.opsForValue().get("cart:" + id);
    }
}

Here any node can service the request because the cart lives in Redis, outside the process's memory. This makes it possible to add or remove nodes without losing user data.

When state on the node cannot be avoided, an intermediate solution is session affinity (sticky sessions): the balancer always sends the same user to the same node. It works, but it breaks interchangeability and complicates scaling and fault tolerance (if the node goes down, the session is lost).

  1. Load balancers and distribution algorithms

A load balancer is the component that receives all incoming requests and distributes them among the available nodes. It is the piece that makes horizontal scaling possible from the client's point of view: the client sees a single address, but behind it there are many machines.

The most common distribution algorithms are:

Algorithm How it distributes When to use it
Round Robin In turns, one after another Homogeneous nodes and similar requests
Weighted Round Robin In turns, but with weights Nodes of differing power
Least Connections To the node with the fewest active connections Requests of variable duration
IP Hash Based on a hash of the source IP To maintain session affinity
Least Response Time To the node that responds fastest Optimize latency

Let's look at a real balancing configuration with Nginx:

# We define a group of backend servers
upstream backend {
    least_conn;                      # Algorithm: fewest active connections
    server 10.0.0.1:8080 weight=2;   # More powerful node: weight 2
    server 10.0.0.2:8080;            # Weight 1 by default
    server 10.0.0.3:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;   # Forwards requests to the "backend" group
    }
}

Line by line: upstream backend declares the set of nodes. least_conn indicates that Nginx will send each request to the node with the fewest open connections at that moment. The weight=2 on the first server makes it receive roughly twice the traffic because it is more powerful. Finally, proxy_pass http://backend connects the incoming traffic to that group.

It is worth distinguishing the network layer at which the balancer operates:

  • Layer 4 (L4, transport): distributes by IP and port, without inspecting the content. Very fast.
  • Layer 7 (L7, application): understands HTTP and can route by URL, headers, or cookies. Smarter, somewhat slower.

  1. Sharding: data partitioning

Scaling the application servers is relatively easy because they are stateless. The database is much harder because it has state by definition. When a single database cannot handle the entire volume, you turn to sharding: splitting the data into fragments (shards), each on a different server.

The key is choosing a partition key (shard key) that distributes the data in a balanced way:

Users partitioned by hash(user_id) % 3:

  hash(user_id) % 3 == 0  ->  Shard 0  (server A)
  hash(user_id) % 3 == 1  ->  Shard 1  (server B)
  hash(user_id) % 3 == 2  ->  Shard 2  (server C)

Each user lives on a single shard, determined by their user_id. This way, a user's queries only hit one server.

Common sharding strategies:

Strategy Description Risk
By range By key intervals (A-M, N-Z) Hot spots if the key is not uniform
By hash Hash of the key modulo N Re-partitioning when N changes is costly
By directory A lookup table decides the shard Single point of failure in the directory
Geographic By the user's region Complies with data regulations (e.g. GDPR)

The major drawback of sharding is queries that span several shards (for example, "give me all the orders from the last month for all users"): they require querying every server and combining results, which is slow and complex. That is why sharding is only applied when strictly necessary.

graph TD
    C[Client] --> R[Sharding Router]
    R -->|even user_id| S0[(Shard 0)]
    R -->|odd user_id| S1[(Shard 1)]

This diagram shows how an intermediate router decides, based on the user's key, which shard to direct each operation to. The client is unaware that the data is distributed.

  1. The theoretical limits: Amdahl's Law and the Universal Scalability Law

You cannot scale infinitely. There are mathematical laws that prove it.

Amdahl's Law

Amdahl's Law states that the speedup of a system when adding processors is limited by the fraction of the work that cannot be parallelized (the sequential part).

Speedup(N) = 1 / ( (1 - P) + P/N )

  P = fraction of the work that CAN be parallelized
  N = number of processors/nodes

If 10% of the work is sequential (P = 0.9), even with infinite nodes the maximum speedup is 1 / (1 - 0.9) = 10x. In other words: the sequential part sets an absolute ceiling. That is why reducing that sequential fraction (locks, critical sections, global transactions) is so valuable.

Universal Scalability Law (USL)

Amdahl's Law is optimistic because it ignores the cost of coordinating the nodes with each other. Neil Gunther's Universal Scalability Law (USL) adds that cost and predicts something more realistic and sometimes surprising: beyond a certain point, adding more nodes worsens performance.

C(N) = N / ( 1 + α(N-1) + βN(N-1) )

  α (alpha) = contention cost (waiting for shared resources)
  β (beta) = coherence cost (synchronizing state between nodes)

The βN(N-1) term grows quadratically: each new node has to coordinate with all the others. When that coherence cost (β) dominates, the performance curve rises, reaches a maximum, and then falls. The practical lesson is clear: minimize communication and synchronization between nodes if you want to scale for real.

Common Mistakes and Tips

  • Storing state in the node's local memory. This is mistake number one. Always externalize sessions to Redis or similar before scaling horizontally.
  • Overusing sticky sessions. It patches the symptom but not the cause; it hinders balanced distribution and fault tolerance.
  • Scaling the application but forgetting the database. 50 web nodes are useless if they all hit a single saturated database. Identify the true bottleneck.
  • Applying sharding too early. Sharding adds enormous complexity. First try read replicas, caches, and indexes.
  • Ignoring the USL. More nodes is not always more performance. Always measure; do not assume linear scaling.
  • Tip: measure before scaling. The bottleneck is usually where you least expect it.

Exercises

  1. Appropriate scaling. You have a stateless REST API that receives unpredictable traffic spikes and a tight infrastructure budget. Would you scale vertically or horizontally, and why?

  2. Session diagnosis. A user reports that, at random, their shopping cart "empties" after adding products. The application runs on 4 nodes behind a Round Robin balancer. What is the most likely cause and how would you fix it?

  3. Calculation with Amdahl. A process has 25% sequential work (P = 0.75). What is the maximum theoretical speedup, even with infinite nodes available?

Solutions

  1. Horizontal. Being stateless, it scales horizontally effortlessly. Horizontal scaling with autoscaling lets you add nodes only during spikes and remove them afterward, optimizing cost compared to a huge machine that is always on. It also avoids the single point of failure.

  2. The cause is session state stored in each node's local memory. With Round Robin, consecutive requests from the same user go to different nodes that do not know about their cart. Correct solution: externalize the session to a shared store (Redis). Temporary patch solution: enable sticky sessions (IP Hash) to force the user to always return to the same node.

  3. Applying Amdahl with N tending to infinity: Speedup = 1 / (1 - P) = 1 / (1 - 0.75) = 1 / 0.25 = 4x. Even with infinite nodes, you will never go more than 4 times faster.

Conclusion

You have learned that scaling means choosing between growing up (vertical) or out (horizontal), that horizontal scaling requires removing state from the nodes, that the balancer and its algorithms distribute the load, that sharding splits the data when the database can no longer cope, and that Amdahl's Law and the USL place real limits on how much we can grow. A scalable system is not improvised: it is designed stateless, measured constantly, and coordination is minimized.

But scaling is not enough: the system must also keep working when something fails. In the next lesson, High Availability and Fault Tolerance, we will see how to measure and guarantee that the service stays available even when individual components go down.

Application Architecture Course

Module 1: Fundamentals of Application Architecture

Module 2: Design Principles and Tactics

Module 3: Architectural Styles and Patterns

Module 4: Distributed Architectures and Microservices

Module 5: Event-Driven Architectures and Messaging

Module 6: Domain-Driven Design (DDD)

Module 7: Data and Persistence

Module 8: Cloud Architecture and Deployment

Module 9: Quality, Security and Observability

Module 10: Evolution, Governance and Case Studies

© Copyright 2026. All rights reserved