A distributed system is a collection of independent computers that, from the user's point of view, behave as a single coherent system. In today's world, virtually any application of any significant scale (online banking, an e-commerce platform, or an insurance policy management system) is a distributed system. Understanding its fundamentals is essential before tackling microservices, because almost all the hard problems in modern architectures arise from the distributed nature of the system, not from microservices themselves.
In this lesson we will establish what "distributed" means, dismantle the famous fallacies of distributed computing, and analyze the three major challenges that every architect must internalize: latency, partial failures, and consistency.
Contents
- What a distributed system is
- Why we distribute: benefits and motivations
- The eight fallacies of distributed computing
- Challenge 1: network latency
- Challenge 2: partial failures
- Challenge 3: data consistency
- Communication models
- Common mistakes and tips
- Exercises
- Conclusion
- What a distributed system is
A classic definition, attributed to Andrew Tanenbaum, states:
A distributed system is a collection of independent computers that appear to their users as a single coherent system.
The three key ideas are:
- Independent: each node has its own memory, its own clock, and can fail separately.
- Coordinated: the nodes cooperate by exchanging messages over a network.
- Transparency: ideally, the user is not aware that there are multiple machines.
A fundamental characteristic, formulated by Leslie Lamport, sums it up with irony:
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
graph LR
C[Client] --> GW[API Gateway]
GW --> S1[Policies Service]
GW --> S2[Customers Service]
S1 --> DB1[(Policies DB)]
S2 --> DB2[(Customers DB)]
S1 -.messages.-> S2In this diagram, each node (gateway, services, databases) is an independent process that communicates over the network. The solid arrows represent direct calls and the dashed arrow represents asynchronous communication between services.
- Why we distribute: benefits and motivations
Distributing a system adds complexity, so it is only worthwhile when we obtain real benefits:
| Motivation | Description |
|---|---|
| Scalability | Add more machines to handle more load (horizontal scaling). |
| Availability | If one replica goes down, another keeps serving. |
| Fault tolerance | The system stays operational despite partial failures. |
| Geographic latency | Bring data closer to the user (CDN, regions). |
| Isolation | Teams and modules evolve independently. |
The golden rule: don't distribute because it's trendy. Distribute when a single server is no longer enough or when the organization needs autonomy between teams.
- The eight fallacies of distributed computing
In 1994, Peter Deutsch and other engineers at Sun Microsystems listed the mistaken assumptions that developers repeatedly make when building distributed systems. Each of them is false, and forgetting this leads to subtle bugs and production outages.
| # | Fallacy | Reality |
|---|---|---|
| 1 | The network is reliable | Packets get lost, connections drop. |
| 2 | Latency is zero | Every network hop costs milliseconds or more. |
| 3 | Bandwidth is infinite | There are limits; transferring large data saturates the network. |
| 4 | The network is secure | There are attacks, interceptions, and impersonations. |
| 5 | Topology doesn't change | Servers appear and disappear continuously. |
| 6 | There is a single administrator | Multiple teams, providers, and configurations. |
| 7 | Transport cost is zero | Serializing and moving data consumes CPU and money. |
| 8 | The network is homogeneous | Different protocols, versions, and hardware coexist. |
The following Java example shows code that falls into fallacies 1 and 2 (the network is reliable and latency is zero):
// BAD: assumes the call always works and is instantaneous
public Customer getCustomer(String id) {
// No timeout: if the remote service does not respond,
// this thread stays blocked indefinitely.
return customerRestClient.get("/customers/" + id);
}Here there is no timeout, no retries, and no error handling. If the remote service is slow or fails, the thread hangs. The correct version requires explicit timeouts and failure handling, something we will cover in the resilience lesson.
- Challenge 1: network latency
Latency is the time it takes for a message to travel from one node to another. Unlike a local method call (nanoseconds), a network call costs orders of magnitude more. A useful mental reference (approximate numbers from Jeff Dean):
| Operation | Approximate time |
|---|---|
| L1 cache access | 0.5 ns |
| RAM access | 100 ns |
| SSD read | 150,000 ns (0.15 ms) |
| Round trip within the same data center | 500,000 ns (0.5 ms) |
| Round trip between continents | 150,000,000 ns (150 ms) |
The practical consequence is that making many small network calls is very expensive. This antipattern is known as "network N+1":
// BAD: one network call per order (N+1 problem)
List<Order> orders = orderService.list(); // 1 call
for (Order o : orders) {
Customer c = customerService.get(o.getCustomerId()); // N calls
o.setCustomer(c);
}If there are 100 orders, this amounts to 101 network calls. The solution is batching, requesting all the customers at once:
// GOOD: two calls in total, regardless of the number of orders
List<Order> orders = orderService.list();
Set<String> ids = orders.stream()
.map(Order::getCustomerId)
.collect(Collectors.toSet());
Map<String, Customer> customers = customerService.getMany(ids); // 1 call
orders.forEach(o -> o.setCustomer(customers.get(o.getCustomerId())));We reduce from 101 to 2 calls: the performance difference is enormous.
- Challenge 2: partial failures
In a monolithic system, either everything works or everything fails. In a distributed one, a new and dangerous state appears: the partial failure, in which some components work and others do not.
The hardest case is uncertainty: when we send a request and receive no response, we don't know whether:
- the request never reached the destination,
- it arrived and was processed but the response was lost,
- the destination is still processing it slowly.
sequenceDiagram
participant A as Service A
participant B as Service B
A->>B: Charge 100 EUR
Note over B: Charges correctly
B--xA: Lost response (timeout)
Note over A: Retry? Risk of charging twiceThis forces us to design idempotent operations: an operation is idempotent if executing it several times produces the same result as executing it just once. That way a retry is safe.
// Idempotent operation via idempotency key
public void charge(String idempotencyKey, BigDecimal amount) {
// If we already processed this key, we don't charge again
if (chargeRepository.exists(idempotencyKey)) {
return; // already done, safe retry
}
processPayment(amount);
chargeRepository.save(idempotencyKey);
}The idempotencyKey (a unique identifier that the client generates and resends on each retry) allows detecting duplicates. Without it, a retry could charge twice.
- Challenge 3: data consistency
When data is replicated or spread across nodes, keeping it coherent is difficult. We distinguish two basic models:
| Model | Description | Example use |
|---|---|---|
| Strong consistency | Every read returns the most recently written value. | Bank balance, critical stock control. |
| Eventual consistency | Replicas converge "over time"; stale reads are possible. | "Like" counter, product catalog. |
Strong consistency is convenient for the programmer but expensive in performance and availability. Eventual consistency scales better but forces us to tolerate temporarily outdated data. In the lesson on the CAP Theorem we will dig deeper into why you cannot have it all at once.
- Communication models
There are two major ways to communicate nodes, which we will study in detail later:
- Synchronous (request/response): the caller waits for the response (REST, gRPC). Simple but couples availability: if B is down, A is affected.
- Asynchronous (messaging/events): the caller publishes a message and continues (queues, events). Decouples, but introduces complexity and eventual consistency.
# Conceptual example of an event published on a message bus event: PolicyCreated version: 1 data: policyId: "POL-00123" line: "home" effectiveDate: "2026-06-30"
This event describes something that happened ("a policy was created"). Other services can subscribe and react without the publisher knowing who they are, which reduces coupling.
Common Mistakes and Tips
- Treating network calls as local calls: there will always be failures and latency. Put timeouts on all remote calls.
- Forgetting idempotency: if you are going to retry, make sure the operation is safe to repeat.
- Distributing prematurely: a well-built monolith is usually a better starting point than ten poorly defined microservices.
- Ignoring observability: without correlated logs, metrics, and traces, debugging a distributed system is nearly impossible. Use a correlation identifier that travels in every request.
- Trusting clocks: the clocks of different machines are not perfectly synchronized; don't use timestamps to order critical events without an appropriate mechanism.
Exercises
- List the eight fallacies of distributed computing and, for each one, indicate a practical consequence of ignoring it in your design.
- You have a payment service that receives requests over the network. Explain why a simple retry can cause a double charge and design, in Java pseudocode, an idempotent solution.
- Estimate how many times slower a call between continents (150 ms) is compared to a RAM access (100 ns). Reason about what implication this has for API design.
Solutions
-
For example: The network is reliable → if you don't handle errors, a dropped connection leaves the operation in an inconsistent state. Latency is zero → if you design "chatty" APIs with many round trips, performance degrades. (The rest are completed analogously using the table in section 3.)
-
A retry causes a double charge because, if the first request was indeed processed but its response was lost, the client believes it failed and resends. Solution:
public void charge(String idempotencyKey, BigDecimal amount) {
if (chargeRepository.exists(idempotencyKey)) {
return; // already charged, we ignore the duplicate
}
processPayment(amount);
chargeRepository.save(idempotencyKey);
}The key is to record the idempotency key atomically together with the charge.
- 150 ms = 150,000,000 ns; 150,000,000 / 100 = 1,500,000 times slower. Implication: the number of remote calls must be minimized (batching, caches, data aggregation) instead of designing APIs that require many round trips.
Conclusion
We have seen that a distributed system is a set of independent nodes that cooperate over a network and appear as a single one. The eight fallacies remind us that the network is not reliable, nor instantaneous, nor secure, nor free. The three major challenges (latency, partial failures, and consistency) shape all the architectural decisions we will make from now on.
With these fundamentals clear, in the next lesson, Microservices Architecture, we will see how a specific architectural style leverages (and at the same time suffers from) the distributed nature to build scalable and maintainable systems.
Application Architecture Course
Module 1: Fundamentals of Application Architecture
- What Is Application Architecture?
- The Role of the Software Architect
- Quality Attributes and Non-Functional Requirements
- Architectural Decisions and Trade-offs
- Architecture Documentation: Views and the C4 Model
Module 2: Design Principles and Tactics
- Coupling, Cohesion and Separation of Concerns
- SOLID Principles Applied to Architecture
- DRY, KISS, YAGNI and Other Design Principles
- Architectural Tactics for Quality Attributes
- Managing Technical Debt
Module 3: Architectural Styles and Patterns
- Monolithic Architecture
- Layered Architecture (N-Tier)
- Client-Server Architecture
- Hexagonal Architecture (Ports and Adapters)
- Clean and Onion Architecture
Module 4: Distributed Architectures and Microservices
- Introduction to Distributed Systems
- Microservices Architecture
- Service Decomposition and Bounded Contexts
- API Gateway, Service Discovery and Inter-Service Communication
- Resilience Patterns: Circuit Breaker, Retry and Bulkhead
- The CAP Theorem and Data Consistency
Module 5: Event-Driven Architectures and Messaging
- Fundamentals of Event-Driven Architecture
- Asynchronous Messaging: Queues and Brokers
- Event Patterns: Event Sourcing and CQRS
- Managing Distributed Transactions: The Saga Pattern
- Real-Time Data Streaming
Module 6: Domain-Driven Design (DDD)
- Core DDD Concepts
- Strategic Design: Bounded Contexts and Ubiquitous Language
- Tactical Design: Entities, Aggregates and Repositories
- Context Mapping
Module 7: Data and Persistence
- Persistence Strategies: SQL vs NoSQL
- Data Access Patterns: Repository, Unit of Work and DAO
- Database per Service and Distributed Data Management
- Caching and Invalidation Strategies
Module 8: Cloud Architecture and Deployment
- Cloud Computing Fundamentals (IaaS, PaaS, SaaS)
- Containers and Orchestration with Docker and Kubernetes
- Serverless Architecture
- Cloud-Native Design Patterns
- Infrastructure as Code (IaC)
Module 9: Quality, Security and Observability
- Scalability: Horizontal vs Vertical and Load Balancing
- High Availability and Fault Tolerance
- Security by Design and Authentication/Authorization
- Observability: Logging, Metrics and Tracing
- Performance and Load Testing
