In a simple monolithic system it was enough to check a log file when something failed. But in modern architectures (microservices, ephemeral containers, hundreds of instances) a single request can pass through ten different services. How do you know what actually happened? The answer is observability: the ability to understand the internal state of a system from the signals it emits to the outside. It is not just about monitoring (knowing whether something is wrong), but about being able to ask the system why it is wrong, even for problems you had not anticipated. In this lesson you will master the three pillars of observability (logs, metrics, and traces), the formats and methodologies that make them useful (structured logs, RED, USE), distributed tracing with correlation, and the standard that unifies it all: OpenTelemetry.
Contents
- The three pillars of observability
- Structured logging
- Metrics: the RED and USE methodologies
- Distributed traces and correlation
- OpenTelemetry: the unifying standard
- Common mistakes and tips
- Exercises
The three pillars of observability
Observability rests on three complementary types of signals (telemetry signals):
| Pillar | What it answers | Nature | Example |
|---|---|---|---|
| Logs | What exactly happened? | Discrete events with detail | "Error charging order 123: card declined" |
| Metrics | How much / how well is it going? | Numbers aggregated over time | "p95 latency = 240 ms; 1,200 req/s" |
| Traces | Where did the request go? | The path of a request between services | "API -> Orders -> Payment -> Bank (380 ms total)" |
They are complementary: metrics warn you that something is wrong (latency has risen), traces tell you where (which service is slow), and logs tell you why (what specific error occurred). Used together, they let you go from "the system is slow" to "the exact cause" in minutes.
Structured logging
Traditional logging writes free text, easy for a human to read but very hard for a machine to process:
How do you search for "all errors for order 123" among millions of lines like this? Impossible to do reliably. The solution is structured logging: emitting logs as data (usually JSON) with well-defined fields.
{
"timestamp": "2026-06-30T10:15:32Z",
"level": "ERROR",
"service": "payments",
"message": "Payment declined",
"trace_id": "abc123",
"order_id": 123,
"user_id": "ana",
"reason": "card_declined"
}
Each field is queryable: now you can filter by order_id = 123, group by reason, or follow the trace_id to link with the complete trace (we will see this later). This turns logs into data that can be analyzed at scale.
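To make "each field is queryable" concrete, here is a minimal sketch in plain Java. The LogEvent record, the LogQueries class, and the sample data are hypothetical; in practice a log backend would run equivalent queries over the parsed JSON.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory model of parsed structured log events.
record LogEvent(String level, String service, int orderId, String reason) {}

final class LogQueries {
    // Filter: every event that belongs to one order.
    static List<LogEvent> byOrder(List<LogEvent> events, int orderId) {
        return events.stream()
                .filter(e -> e.orderId() == orderId)
                .collect(Collectors.toList());
    }

    // Group: number of ERROR events per reason.
    static Map<String, Long> errorsByReason(List<LogEvent> events) {
        return events.stream()
                .filter(e -> e.level().equals("ERROR"))
                .collect(Collectors.groupingBy(LogEvent::reason, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<LogEvent> events = List.of(
                new LogEvent("ERROR", "payments", 123, "card_declined"),
                new LogEvent("ERROR", "payments", 456, "card_declined"),
                new LogEvent("ERROR", "payments", 789, "timeout"),
                new LogEvent("INFO", "payments", 123, "none"));
        System.out.println(byOrder(events, 123));
        System.out.println(errorsByReason(events));
    }
}
```

Neither query needs text parsing or regular expressions, which is the practical difference from free-text logs.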
// Structured logging with SLF4J + MDC (Mapped Diagnostic Context)
// Assumption: kv(...) is StructuredArguments.kv from logstash-logback-encoder
MDC.put("trace_id", traceId); // Adds context to ALL logs of this thread
MDC.put("order_id", "123");
try {
    log.error("Payment declined", kv("reason", "card_declined"));
} finally {
    MDC.clear(); // Always clear, even on exceptions, so reused threads do not leak context
}
The MDC (Mapped Diagnostic Context) lets you "attach" context fields (such as the trace_id) that will be included automatically in every log line emitted in that thread, without having to pass them manually in each call. It is essential to clear it (MDC.clear()) at the end, ideally in a finally block: thread pools reuse threads, so leftover fields would otherwise carry over to other requests.
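Conceptually, an MDC is a per-thread map whose entries are merged into every log line. The sketch below imitates that idea with a ThreadLocal. MiniMdc is a hypothetical teaching class, not SLF4J's implementation, and its JSON formatting is naive (no escaping).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical teaching class: a per-thread context map, in the spirit of an MDC.
final class MiniMdc {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.remove();
    }

    // Builds one JSON log line: level and message first, then every context field.
    // Naive formatting: assumes values contain no quotes or backslashes.
    static String format(String level, String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("level", level);
        fields.put("message", message);
        fields.putAll(CONTEXT.get());
        return fields.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        put("trace_id", "abc123");
        put("order_id", "123");
        try {
            System.out.println(format("ERROR", "Payment declined"));
        } finally {
            clear(); // the next request on this thread starts with an empty context
        }
        System.out.println(format("INFO", "Next request"));
    }
}
```

The second line printed carries no trace_id or order_id, which is exactly what clearing the context is meant to guarantee.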
Logging best practices:
- Use levels correctly (DEBUG, INFO, WARN, ERROR) and do not flood with irrelevant logs.
- Never log sensitive data (passwords, cards, personal data). This is a legal obligation as well as good practice.
- Always include correlation identifiers (trace_id).
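As an illustration of the "never log sensitive data" practice, the following sketch masks anything that looks like a card number before the message reaches the logger. LogRedactor and its pattern are hypothetical and deliberately rough; a real system would more likely avoid passing such values to the logger at all.

```java
import java.util.regex.Pattern;

// Hypothetical helper: masks values that look like card numbers before logging.
final class LogRedactor {
    // 13 to 19 digits, optionally separated by single spaces or hyphens (a rough heuristic).
    private static final Pattern CARD = Pattern.compile("\\b\\d(?:[ -]?\\d){12,18}\\b");

    // Replaces each match with "****" plus its last four digits.
    static String redact(String message) {
        return CARD.matcher(message).replaceAll(match -> {
            String digits = match.group().replaceAll("[ -]", "");
            return "****" + digits.substring(digits.length() - 4);
        });
    }

    public static void main(String[] args) {
        System.out.println(redact("Charge failed for card 4111 1111 1111 1111, order 123"));
    }
}
```

Short numbers such as the order id are left untouched because they do not reach the 13-digit minimum.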
Metrics: the RED and USE methodologies
Metrics are numeric values measured over time: counters (total requests), gauges (memory used), and histograms (latency distribution). Since you can measure thousands of things, two methodologies help you choose what to measure.
RED focuses on the service from the user's point of view:
| RED metric | Meaning | Question |
|---|---|---|
| Rate | Requests per second | How much traffic am I receiving? |
| Errors | Error rate | How many requests fail? |
| Duration | Latency (distribution) | How long do they take? |
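The three RED values can be derived from raw request data. This is a minimal sketch assuming a hypothetical RequestSample record and a known observation window; real systems usually compute these from counters and histograms rather than from individual samples.

```java
import java.util.List;

// Hypothetical sample of one handled request: HTTP status and latency in milliseconds.
record RequestSample(int status, double durationMs) {}

final class RedSummary {
    // Rate: requests per second over the observation window.
    static double rate(List<RequestSample> samples, double windowSeconds) {
        return samples.size() / windowSeconds;
    }

    // Errors: fraction of requests that ended with a 5xx status.
    static double errorRatio(List<RequestSample> samples) {
        long errors = samples.stream().filter(s -> s.status() >= 500).count();
        return samples.isEmpty() ? 0.0 : (double) errors / samples.size();
    }

    // Duration: nearest-rank percentile (p in (0, 100]); assumes a non-empty list.
    static double percentile(List<RequestSample> samples, double p) {
        double[] sorted = samples.stream()
                .mapToDouble(RequestSample::durationMs)
                .sorted()
                .toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        List<RequestSample> samples = List.of(
                new RequestSample(200, 80), new RequestSample(200, 120),
                new RequestSample(500, 900), new RequestSample(200, 95));
        System.out.println("rate (req/s): " + rate(samples, 2.0));
        System.out.println("error ratio:  " + errorRatio(samples));
        System.out.println("p95 (ms):     " + percentile(samples, 95));
    }
}
```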
USE focuses on the resources (CPU, memory, disk, network):
| USE metric | Meaning | Question |
|---|---|---|
| Utilization | Percentage in use | How busy is it? |
| Saturation | Work queued waiting | Is there more demand than it can handle? |
| Errors | Errors of the resource | Is the resource failing? |
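For resources, a rough USE reading can be taken from inside a JVM with standard library calls. This sketch covers only Utilization (of the heap) and Saturation (system load per CPU); the UseSnapshot class is hypothetical, and resource Errors are omitted because the standard library does not expose them in a comparable way.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class UseSnapshot {
    // Utilization: fraction of the maximum heap currently in use (0.0 to 1.0).
    static double heapUtilization() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    // Saturation: 1-minute system load average per CPU; above 1.0 suggests work is queuing.
    // Negative on platforms where the load average is not available.
    static double cpuSaturation() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage() / os.getAvailableProcessors();
    }

    public static void main(String[] args) {
        System.out.printf("heap utilization: %.2f%n", heapUtilization());
        System.out.printf("cpu saturation:   %.2f%n", cpuSaturation());
    }
}
```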
Practical rule: use RED for services (what the user sees) and USE for resources (the underlying infrastructure). Together they give you a complete picture.
# Prometheus metrics exposition format
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{service="payments",status="200"} 48210
http_requests_total{service="payments",status="500"} 37
# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.3"} 47100
http_request_duration_seconds_bucket{le="1.0"} 48190
Explanation: the first block is a counter of requests, labeled by service and status code; it lets you calculate the Rate and Errors of RED. The second is a histogram whose buckets are cumulative: each one counts how many requests finished at or under its latency threshold (le = less than or equal); with it you estimate percentiles such as the p95 (Duration of RED). The labels ({service=...}) allow you to filter and group the metrics.
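To show how a percentile comes out of cumulative buckets, here is a sketch that interpolates linearly inside the bucket containing the target rank, in the spirit of Prometheus's histogram_quantile (not a copy of it). The HistogramQuantile class is hypothetical, and the +Inf bucket value of 48247 is an assumption: it is simply the sum of the two counters shown above.

```java
final class HistogramQuantile {
    // Estimates a quantile (q in (0, 1]) from cumulative buckets.
    // upperBounds[i] is the "le" threshold; cumulativeCounts[i] counts requests <= that threshold.
    // Assumes a non-empty histogram whose last bucket holds the total count.
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        double rank = q * total;
        double lowerBound = 0.0;
        long lowerCount = 0;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                if (Double.isInfinite(upperBounds[i])) {
                    return lowerBound; // cannot interpolate into the +Inf bucket
                }
                double fraction = (rank - lowerCount) / (cumulativeCounts[i] - lowerCount);
                return lowerBound + (upperBounds[i] - lowerBound) * fraction;
            }
            lowerBound = upperBounds[i];
            lowerCount = cumulativeCounts[i];
        }
        return lowerBound;
    }

    public static void main(String[] args) {
        double[] le = {0.3, 1.0, Double.POSITIVE_INFINITY};
        long[] counts = {47100, 48190, 48247}; // the +Inf total is an assumption
        System.out.printf("p95 ~ %.3f s%n", estimate(0.95, le, counts));
        System.out.printf("p99 ~ %.3f s%n", estimate(0.99, le, counts));
        // Errors of RED from the counter: failed requests over all requests.
        System.out.printf("error ratio ~ %.5f%n", 37.0 / (48210 + 37));
    }
}
```

With these numbers the p95 lands inside the first bucket (roughly 0.29 s). The estimate is only as precise as the bucket boundaries, which is why bucket thresholds should be chosen around the latencies you care about.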
Distributed traces and correlation
When a request crosses several services, how do you follow its complete trail? With distributed traces. A trace represents the complete journey of a request; it is composed of spans, where each span is an operation (a call to a service, a query to the database).
The key mechanism is context propagation: the first service generates a unique trace_id and passes it to each service it calls, usually in HTTP headers. This way, all the logs and spans of that request share the same trace_id and can be correlated.
graph TD
A["Span: API Gateway (trace_id=abc, 400ms)"] --> B["Span: Orders Service (120ms)"]
A --> C["Span: Payment Service (250ms)"]
C --> D["Span: Call to the Bank (230ms)"]
This tree of spans shows at a glance where the time goes: of the 400 ms total, the Payment service consumes 250 ms, and almost all of that (230 ms) is spent waiting for the Bank. Without traces, you would only know that "the request took 400 ms"; with them, you know exactly where the bottleneck is.
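The reasoning about the span tree can be automated: subtract the time spent in child spans from each span's duration (its "self time") and pick the largest. The sketch below uses the same numbers as the diagram; SpanNode and Bottleneck are hypothetical types, and the calculation assumes child spans run sequentially rather than in parallel.

```java
import java.util.List;

// Hypothetical span model: a name, its total duration, and its child spans.
record SpanNode(String name, long durationMs, List<SpanNode> children) {
    // Self time = own duration minus the time spent in direct children.
    // Assumes children run one after another, not in parallel.
    long selfTimeMs() {
        long inChildren = children.stream().mapToLong(SpanNode::durationMs).sum();
        return durationMs - inChildren;
    }
}

final class Bottleneck {
    // Walks the tree and returns the span with the largest self time.
    static SpanNode slowest(SpanNode root) {
        SpanNode best = root;
        for (SpanNode child : root.children()) {
            SpanNode candidate = slowest(child);
            if (candidate.selfTimeMs() > best.selfTimeMs()) {
                best = candidate;
            }
        }
        return best;
    }

    // The trace from the diagram above.
    static SpanNode exampleTrace() {
        SpanNode bank = new SpanNode("Call to the Bank", 230, List.of());
        SpanNode payment = new SpanNode("Payment Service", 250, List.of(bank));
        SpanNode orders = new SpanNode("Orders Service", 120, List.of());
        return new SpanNode("API Gateway", 400, List.of(orders, payment));
    }

    public static void main(String[] args) {
        SpanNode slowest = slowest(exampleTrace());
        System.out.println(slowest.name() + " (" + slowest.selfTimeMs() + " ms self time)");
    }
}
```

Here the Payment service has only 20 ms of self time, so the span to investigate is the call to the Bank.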
# Trace context propagation via the standard W3C Trace Context header
curl https://api.mycompany.com/orders \
  -H "traceparent: 00-abc123def456...-0011223344556677-01"
# version-trace_id-----------------span_id---------flags
The traceparent header (W3C standard) carries the trace_id (identifies the whole trace) and the span_id (identifies the parent span). Each service reads this header, creates its own child span, and forwards it to the next service, keeping the chain linked.
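The read, create child, forward cycle can be sketched in a few lines of Java. This is an illustrative parser for the version 00 layout of traceparent (2, 32, 16, and 2 lowercase hex digits), not a complete implementation of the specification; the TraceParent record and PropagationDemo class are hypothetical, and the header value is a sample in that layout.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parsed W3C traceparent header: version-traceId-parentId-flags.
record TraceParent(String version, String traceId, String parentId, String flags) {
    private static final Pattern FORMAT =
            Pattern.compile("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // Header to send downstream: same trace_id, but this service's span becomes the parent.
    TraceParent childWith(String newSpanId) {
        return new TraceParent(version, traceId, newSpanId, flags);
    }

    String toHeader() {
        return String.join("-", version, traceId, parentId, flags);
    }

    // Random 16-hex-digit span id (an all-zero id would be invalid, but is vanishingly unlikely).
    static String newSpanId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }
}

final class PropagationDemo {
    public static void main(String[] args) {
        TraceParent incoming =
                TraceParent.parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        TraceParent outgoing = incoming.childWith(TraceParent.newSpanId());
        System.out.println("incoming: " + incoming.toHeader());
        System.out.println("outgoing: " + outgoing.toHeader());
    }
}
```

Only the third field changes between the incoming and outgoing headers, which is what keeps every span of the request attached to the same trace.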
OpenTelemetry: the unifying standard
Historically, each observability tool had its own format and instrumentation, locking you into a vendor. OpenTelemetry (OTel) is the open standard (from the CNCF) that unifies the generation of the three signals (logs, metrics, and traces) under a single API and a single protocol (OTLP). You instrument your code once and can send the data to any backend (Prometheus, Jaeger, Grafana, etc.).
graph LR
APP[Your application + OTel SDK] -->|OTLP| COL[OpenTelemetry Collector]
COL --> M[Metrics: Prometheus]
COL --> T[Traces: Jaeger]
COL --> L[Logs: Loki]
The central piece is the Collector: the application, instrumented with the OTel SDK, sends all its telemetry to the Collector via the OTLP protocol, and the Collector processes it and forwards it to each specialized backend. The great advantage: your code neither knows nor cares which backend is behind it; you can change them without touching the application.
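The decoupling described here can be modeled with a toy example. ToyCollector and Signal below are hypothetical types that only imitate the routing idea (one entry point, backends registered per signal); they are not the OpenTelemetry Collector's API or configuration.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

enum Signal { METRICS, TRACES, LOGS }

// Toy stand-in for a collector: receives telemetry and fans it out per signal type.
final class ToyCollector {
    private final Map<Signal, List<Consumer<String>>> exporters = new EnumMap<>(Signal.class);

    // Backends are registered on the collector, never in the application.
    void addExporter(Signal signal, Consumer<String> backend) {
        exporters.computeIfAbsent(signal, s -> new ArrayList<>()).add(backend);
    }

    // The application only ever calls this method; it never sees the backends.
    void receive(Signal signal, String payload) {
        exporters.getOrDefault(signal, List.of()).forEach(b -> b.accept(payload));
    }

    public static void main(String[] args) {
        ToyCollector collector = new ToyCollector();
        collector.addExporter(Signal.TRACES, p -> System.out.println("traces backend <- " + p));
        collector.addExporter(Signal.LOGS, p -> System.out.println("logs backend   <- " + p));

        collector.receive(Signal.TRACES, "span process-payment");
        collector.receive(Signal.LOGS, "Payment declined");
        collector.receive(Signal.METRICS, "http_requests_total"); // no backend registered: dropped
    }
}
```

Swapping a backend means changing an addExporter call on the collector side; the code that calls receive stays the same.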
// Create a span manually with the OpenTelemetry API
Span span = tracer.spanBuilder("process-payment").startSpan();
try (Scope scope = span.makeCurrent()) {
    span.setAttribute("order_id", 123); // Queryable attributes on the span
    processPayment(); // The actual work
} catch (Exception e) {
    span.recordException(e); // Records the error in the trace
    span.setStatus(StatusCode.ERROR); // Marks the span itself as failed
    throw e;
} finally {
    span.end(); // ALWAYS close the span
}
This code creates a span called process-payment that measures the time of the operation. makeCurrent() makes it "active" so that nested calls hang off it automatically. If there is an error, recordException attaches it to the trace and setStatus(StatusCode.ERROR) marks the span as failed, so you will see the failure in the span. The span.end() in the finally block guarantees that the span is closed and its duration recorded no matter what.
Common Mistakes and Tips
- Confusing monitoring with observability. Monitoring is watching known metrics; observability is being able to investigate problems you did not anticipate. You need both.
- Free-text logs. Impossible to analyze at scale. Use structured logging (JSON) from the start.
- Not propagating the trace_id. Without correlation, each service's logs are disconnected islands. Always propagate the trace context.
- Logging sensitive data. It is a security and compliance failure. Filter passwords, cards, and personal data before logging.
- Too many high-cardinality metrics. Labels with unlimited values (like user_id on every metric) cause storage cost to explode. Use bounded cardinality.
- Tracing 100% of traffic without sampling. In high-volume systems this generates an enormous cost. Apply intelligent sampling.
- Tip: instrument thinking about the questions you will want to ask when something fails at 3 in the morning.
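One simple form of the sampling mentioned in the list above is a deterministic decision derived from the trace_id, so that every service keeps or drops the same traces. The sketch below is an assumption-laden illustration (the TraceSampler class and the use of the last 8 hex digits are choices made here), not the algorithm of any particular SDK.

```java
final class TraceSampler {
    // Deterministic head sampling: the decision depends only on the trace_id,
    // so every service that sees the same trace makes the same choice.
    // Interprets the last 8 hex digits of the id as a number in [0, 2^32)
    // and keeps the trace when that number falls below ratio * 2^32.
    static boolean shouldSample(String traceIdHex, double ratio) {
        String tail = traceIdHex.substring(traceIdHex.length() - 8);
        long value = Long.parseLong(tail, 16);
        return value < ratio * 4294967296.0; // 2^32
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("keep at 10%: " + shouldSample(traceId, 0.10));
        System.out.println("keep at 5%:  " + shouldSample(traceId, 0.05));
    }
}
```

Because the decision is a pure function of the trace_id, a trace is never half recorded: either all services keep their spans for it or none do. More selective strategies, such as always keeping traces that contain errors, need the decision to be made after the trace completes.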
Exercises
- Choosing the right pillar. For each situation, indicate which pillar (logs, metrics, or traces) you would use first: (a) a dashboard shows that the error rate has risen to 5%; (b) you want to know exactly why order 987 failed; (c) a request takes 3 seconds and you want to know which service is slowing it down.
- Improve a log. Rewrite this free-text log as a structured JSON log with fields useful for diagnosis:
"ERROR: failed to connect to the database after 3 attempts"
- RED vs. USE. Classify each metric according to the most appropriate methodology: (a) requests per second of the API; (b) CPU utilization percentage of the server; (c) disk queue length; (d) rate of HTTP 500 responses.
Solutions
- (a) Metrics (they alert you that something is wrong: the error rate). (b) Logs (they give you the exact detail of that specific order's failure). (c) Traces (they show you the request's path and which span/service consumes the time). The natural investigation flow is usually metric -> trace -> log.
- Example of a structured log:
{
  "timestamp": "2026-06-30T10:20:00Z",
  "level": "ERROR",
  "service": "orders",
  "message": "Database connection failure",
  "component": "db_pool",
  "retries": 3,
  "db_host": "db-primary",
  "trace_id": "abc123"
}
Now you can filter by component, count retries, or link with the trace via trace_id.
- (a) RED (Rate, a service metric). (b) USE (Utilization of a resource). (c) USE (Saturation: work queued on the disk). (d) RED (Errors, a service metric). In short: a and d measure the service (RED); b and c measure resources (USE).
Conclusion
You have learned that observability rests on three complementary pillars: logs (what happened), metrics (how much / how well), and traces (where it went). You have seen how structured logging makes logs analyzable, how RED and USE guide you on what to measure, how correlation via trace_id links all the signals of a request, and how OpenTelemetry standardizes all the instrumentation. An observable system is one you can interrogate to understand any problem, anticipated or not.
The metrics we have learned to collect (latency, throughput, percentiles) are precisely the ones we will measure when subjecting the system to load. In the last lesson of the module, Performance and Load Testing, we will see how to measure and validate that the system withstands what the business needs.
