A distributed system is a collection of independent computers that, from the user's point of view, behave as a single coherent system. In today's world, virtually any application of any significant scale (online banking, an e-commerce platform, or an insurance policy management system) is a distributed system. Understanding its fundamentals is essential before tackling microservices, because almost all the hard problems in modern architectures arise from the distributed nature of the system, not from microservices themselves.

In this lesson we will establish what "distributed" means, dismantle the famous fallacies of distributed computing, and analyze the three major challenges that every architect must internalize: latency, partial failures, and consistency.

Contents

  1. What a distributed system is
  2. Why we distribute: benefits and motivations
  3. The eight fallacies of distributed computing
  4. Challenge 1: network latency
  5. Challenge 2: partial failures
  6. Challenge 3: data consistency
  7. Communication models
  8. Common mistakes and tips
  9. Exercises
  10. Conclusion

  1. What a distributed system is

A classic definition, attributed to Andrew Tanenbaum, states:

A distributed system is a collection of independent computers that appear to their users as a single coherent system.

The three key ideas are:

  • Independent: each node has its own memory, its own clock, and can fail separately.
  • Coordinated: the nodes cooperate by exchanging messages over a network.
  • Transparency: ideally, the user is not aware that there are multiple machines.

A fundamental characteristic, formulated by Leslie Lamport, sums it up with irony:

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

graph LR
    C[Client] --> GW[API Gateway]
    GW --> S1[Policies Service]
    GW --> S2[Customers Service]
    S1 --> DB1[(Policies DB)]
    S2 --> DB2[(Customers DB)]
    S1 -.messages.-> S2

In this diagram, each node (gateway, services, databases) is an independent process that communicates over the network. The solid arrows represent direct calls and the dashed arrow represents asynchronous communication between services.

  1. Why we distribute: benefits and motivations

Distributing a system adds complexity, so it is only worthwhile when we obtain real benefits:

Motivation Description
Scalability Add more machines to handle more load (horizontal scaling).
Availability If one replica goes down, another keeps serving.
Fault tolerance The system stays operational despite partial failures.
Geographic latency Bring data closer to the user (CDN, regions).
Isolation Teams and modules evolve independently.

The golden rule: don't distribute because it's trendy. Distribute when a single server is no longer enough or when the organization needs autonomy between teams.

  1. The eight fallacies of distributed computing

In 1994, Peter Deutsch and other engineers at Sun Microsystems listed the mistaken assumptions that developers repeatedly make when building distributed systems. Each of them is false, and forgetting this leads to subtle bugs and production outages.

# Fallacy Reality
1 The network is reliable Packets get lost, connections drop.
2 Latency is zero Every network hop costs milliseconds or more.
3 Bandwidth is infinite There are limits; transferring large data saturates the network.
4 The network is secure There are attacks, interceptions, and impersonations.
5 Topology doesn't change Servers appear and disappear continuously.
6 There is a single administrator Multiple teams, providers, and configurations.
7 Transport cost is zero Serializing and moving data consumes CPU and money.
8 The network is homogeneous Different protocols, versions, and hardware coexist.

The following Java example shows code that falls into fallacies 1 and 2 (the network is reliable and latency is zero):

// BAD: assumes the call always works and is instantaneous
public Customer getCustomer(String id) {
    // No timeout: if the remote service does not respond,
    // this thread stays blocked indefinitely.
    return customerRestClient.get("/customers/" + id);
}

Here there is no timeout, no retries, and no error handling. If the remote service is slow or fails, the thread hangs. The correct version requires explicit timeouts and failure handling, something we will cover in the resilience lesson.

  1. Challenge 1: network latency

Latency is the time it takes for a message to travel from one node to another. Unlike a local method call (nanoseconds), a network call costs orders of magnitude more. A useful mental reference (approximate numbers from Jeff Dean):

Operation Approximate time
L1 cache access 0.5 ns
RAM access 100 ns
SSD read 150,000 ns (0.15 ms)
Round trip within the same data center 500,000 ns (0.5 ms)
Round trip between continents 150,000,000 ns (150 ms)

The practical consequence is that making many small network calls is very expensive. This antipattern is known as "network N+1":

// BAD: one network call per order (N+1 problem)
List<Order> orders = orderService.list();            // 1 call
for (Order o : orders) {
    Customer c = customerService.get(o.getCustomerId()); // N calls
    o.setCustomer(c);
}

If there are 100 orders, this amounts to 101 network calls. The solution is batching, requesting all the customers at once:

// GOOD: two calls in total, regardless of the number of orders
List<Order> orders = orderService.list();
Set<String> ids = orders.stream()
        .map(Order::getCustomerId)
        .collect(Collectors.toSet());
Map<String, Customer> customers = customerService.getMany(ids); // 1 call
orders.forEach(o -> o.setCustomer(customers.get(o.getCustomerId())));

We reduce from 101 to 2 calls: the performance difference is enormous.

  1. Challenge 2: partial failures

In a monolithic system, either everything works or everything fails. In a distributed one, a new and dangerous state appears: the partial failure, in which some components work and others do not.

The hardest case is uncertainty: when we send a request and receive no response, we don't know whether:

  • the request never reached the destination,
  • it arrived and was processed but the response was lost,
  • the destination is still processing it slowly.
sequenceDiagram
    participant A as Service A
    participant B as Service B
    A->>B: Charge 100 EUR
    Note over B: Charges correctly
    B--xA: Lost response (timeout)
    Note over A: Retry? Risk of charging twice

This forces us to design idempotent operations: an operation is idempotent if executing it several times produces the same result as executing it just once. That way a retry is safe.

// Idempotent operation via idempotency key
public void charge(String idempotencyKey, BigDecimal amount) {
    // If we already processed this key, we don't charge again
    if (chargeRepository.exists(idempotencyKey)) {
        return; // already done, safe retry
    }
    processPayment(amount);
    chargeRepository.save(idempotencyKey);
}

The idempotencyKey (a unique identifier that the client generates and resends on each retry) allows detecting duplicates. Without it, a retry could charge twice.

  1. Challenge 3: data consistency

When data is replicated or spread across nodes, keeping it coherent is difficult. We distinguish two basic models:

Model Description Example use
Strong consistency Every read returns the most recently written value. Bank balance, critical stock control.
Eventual consistency Replicas converge "over time"; stale reads are possible. "Like" counter, product catalog.

Strong consistency is convenient for the programmer but expensive in performance and availability. Eventual consistency scales better but forces us to tolerate temporarily outdated data. In the lesson on the CAP Theorem we will dig deeper into why you cannot have it all at once.

  1. Communication models

There are two major ways to communicate nodes, which we will study in detail later:

  • Synchronous (request/response): the caller waits for the response (REST, gRPC). Simple but couples availability: if B is down, A is affected.
  • Asynchronous (messaging/events): the caller publishes a message and continues (queues, events). Decouples, but introduces complexity and eventual consistency.
# Conceptual example of an event published on a message bus
event: PolicyCreated
version: 1
data:
  policyId: "POL-00123"
  line: "home"
  effectiveDate: "2026-06-30"

This event describes something that happened ("a policy was created"). Other services can subscribe and react without the publisher knowing who they are, which reduces coupling.

Common Mistakes and Tips

  • Treating network calls as local calls: there will always be failures and latency. Put timeouts on all remote calls.
  • Forgetting idempotency: if you are going to retry, make sure the operation is safe to repeat.
  • Distributing prematurely: a well-built monolith is usually a better starting point than ten poorly defined microservices.
  • Ignoring observability: without correlated logs, metrics, and traces, debugging a distributed system is nearly impossible. Use a correlation identifier that travels in every request.
  • Trusting clocks: the clocks of different machines are not perfectly synchronized; don't use timestamps to order critical events without an appropriate mechanism.

Exercises

  1. List the eight fallacies of distributed computing and, for each one, indicate a practical consequence of ignoring it in your design.
  2. You have a payment service that receives requests over the network. Explain why a simple retry can cause a double charge and design, in Java pseudocode, an idempotent solution.
  3. Estimate how many times slower a call between continents (150 ms) is compared to a RAM access (100 ns). Reason about what implication this has for API design.

Solutions

  1. For example: The network is reliable → if you don't handle errors, a dropped connection leaves the operation in an inconsistent state. Latency is zero → if you design "chatty" APIs with many round trips, performance degrades. (The rest are completed analogously using the table in section 3.)

  2. A retry causes a double charge because, if the first request was indeed processed but its response was lost, the client believes it failed and resends. Solution:

public void charge(String idempotencyKey, BigDecimal amount) {
    if (chargeRepository.exists(idempotencyKey)) {
        return; // already charged, we ignore the duplicate
    }
    processPayment(amount);
    chargeRepository.save(idempotencyKey);
}

The key is to record the idempotency key atomically together with the charge.

  1. 150 ms = 150,000,000 ns; 150,000,000 / 100 = 1,500,000 times slower. Implication: the number of remote calls must be minimized (batching, caches, data aggregation) instead of designing APIs that require many round trips.

Conclusion

We have seen that a distributed system is a set of independent nodes that cooperate over a network and appear as a single one. The eight fallacies remind us that the network is not reliable, nor instantaneous, nor secure, nor free. The three major challenges (latency, partial failures, and consistency) shape all the architectural decisions we will make from now on.

With these fundamentals clear, in the next lesson, Microservices Architecture, we will see how a specific architectural style leverages (and at the same time suffers from) the distributed nature to build scalable and maintainable systems.

Application Architecture Course

Module 1: Fundamentals of Application Architecture

Module 2: Design Principles and Tactics

Module 3: Architectural Styles and Patterns

Module 4: Distributed Architectures and Microservices

Module 5: Event-Driven Architectures and Messaging

Module 6: Domain-Driven Design (DDD)

Module 7: Data and Persistence

Module 8: Cloud Architecture and Deployment

Module 9: Quality, Security and Observability

Module 10: Evolution, Governance and Case Studies

© Copyright 2026. All rights reserved