In a simple monolith, checking a log file was enough when something failed. In modern architectures (microservices, ephemeral containers, hundreds of instances), a single request can pass through ten different services. How do you know what actually happened? The answer is observability: the ability to understand a system's internal state from the signals it emits. It is not just monitoring (knowing whether something is wrong) but being able to ask the system why it is wrong, even for problems you had not anticipated.

In this lesson you will master the three pillars of observability (logs, metrics, and traces), the formats and methodologies that make them useful (structured logs, RED, USE), distributed tracing with correlation, and the standard that unifies it all: OpenTelemetry.

Contents

  1. The three pillars of observability
  2. Structured logging
  3. Metrics: the RED and USE methodologies
  4. Distributed traces and correlation
  5. OpenTelemetry: the unifying standard
  6. Common mistakes and tips
  7. Exercises

  1. The three pillars of observability

Observability rests on three complementary types of signals (telemetry signals):

Pillar  | What it answers                  | Nature                               | Example
--------|----------------------------------|--------------------------------------|------------------------------------------------
Logs    | What exactly happened?           | Discrete events with detail          | "Error charging order 123: card declined"
Metrics | How much / how well is it going? | Numbers aggregated over time         | "p95 latency = 240 ms; 1,200 req/s"
Traces  | Where did the request go?        | The path of a request among services | "API -> Orders -> Payment -> Bank (380 ms total)"

They are complementary: metrics warn you that something is wrong (latency has risen), traces tell you where (which service is slow), and logs tell you why (what specific error occurred). Used together, they let you go from "the system is slow" to "the exact cause" in minutes.

  2. Structured logging

Traditional logging writes free text, easy for a human to read but very hard for a machine to process:

2026-06-30 10:15:32 ERROR User ana could not pay for order 123 (card declined)

How do you search for "all errors for order 123" among millions of lines like this? Only with fragile text patterns that break as soon as someone rewords the message. The solution is structured logging: emitting logs as data (usually JSON) with well-defined fields.

{
  "timestamp": "2026-06-30T10:15:32Z",
  "level": "ERROR",
  "service": "payments",
  "message": "Payment declined",
  "trace_id": "abc123",
  "order_id": 123,
  "user_id": "ana",
  "reason": "card_declined"
}

Each field is queryable: now you can filter by order_id = 123, group by reason, or follow the trace_id to link with the complete trace (we will see this later). This turns logs into data that can be analyzed at scale.
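To make "each field is queryable" concrete, here is a minimal sketch that treats already-parsed log events as maps and runs the two queries mentioned above. The class name `LogQueries`, its methods, and the sample events are hypothetical; in practice a log backend does this work for you, so read it as an illustration of the idea rather than as how any particular tool is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class LogQueries {
    // Keep only the events whose field equals the wanted value.
    static List<Map<String, Object>> filterBy(
            List<Map<String, Object>> events, String field, Object wanted) {
        return events.stream()
                .filter(e -> wanted.equals(e.get(field)))
                .collect(Collectors.toList());
    }

    // Count events per value of a field; events lacking the field go under "unknown".
    static Map<String, Long> countBy(List<Map<String, Object>> events, String field) {
        return events.stream()
                .collect(Collectors.groupingBy(
                        e -> String.valueOf(e.getOrDefault(field, "unknown")),
                        Collectors.counting()));
    }

    // Hypothetical sample data, as if parsed from three JSON log lines.
    static List<Map<String, Object>> sampleEvents() {
        return List.of(
                Map.of("level", "ERROR", "order_id", 123, "reason", "card_declined"),
                Map.of("level", "ERROR", "order_id", 456, "reason", "timeout"),
                Map.of("level", "ERROR", "order_id", 123, "reason", "card_declined"));
    }

    public static void main(String[] args) {
        List<Map<String, Object>> events = sampleEvents();
        System.out.println(filterBy(events, "order_id", 123)); // all errors for order 123
        System.out.println(countBy(events, "reason"));         // errors grouped by reason
    }
}
```

Neither query inspects the human-readable message, which is why rewording a message cannot break them.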

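// Structured logging with SLF4J + MDC (Mapped Diagnostic Context)
// kv() is provided by the logstash-logback-encoder library
import static net.logstash.logback.argument.StructuredArguments.kv;

MDC.put("trace_id", traceId);     // Adds context to ALL logs of this thread
MDC.put("order_id", "123");
try {
    log.error("Payment declined", kv("reason", "card_declined"));
} finally {
    MDC.clear();                  // Always clear, even on exceptions, so the next request on this thread starts clean
}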

The MDC (Mapped Diagnostic Context) lets you "attach" context fields (such as the trace_id) that are included automatically in every log line emitted on that thread, without passing them manually in each call. Servers usually reuse threads from a pool, so it is essential to clear the MDC (MDC.clear(), ideally in a finally block) when the request ends; otherwise its fields leak into the next request handled by the same thread.
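The leak is easy to reproduce without any logging library. The sketch below uses a plain `ThreadLocal` map as a hypothetical stand-in for the MDC (it is not SLF4J's implementation) and a single-thread pool so that two "requests" are guaranteed to share a thread. `ContextLeakDemo` and its methods are illustrative names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextLeakDemo {
    // Hypothetical stand-in for the MDC: one mutable map per thread.
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    // Runs one "request" on the pool and returns the order_id that was
    // already present in the context when the request started.
    static String handle(ExecutorService pool, String orderId, boolean clearWhenDone)
            throws Exception {
        return pool.submit(() -> {
            String inherited = CONTEXT.get().get("order_id"); // left over from before?
            CONTEXT.get().put("order_id", orderId);
            try {
                return inherited;
            } finally {
                if (clearWhenDone) {
                    CONTEXT.get().clear();
                }
            }
        }).get();
    }

    // Handles order 123, then order 456, and reports what the second request inherited.
    static String leakedValue(boolean clearWhenDone) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            handle(pool, "123", clearWhenDone);
            return handle(pool, "456", clearWhenDone);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + leakedValue(false));
        System.out.println("with clear:    " + leakedValue(true));
    }
}
```

Without the clear, the second request starts with the first request's order_id already in its context, so any log line it emits before setting its own fields would be attributed to the wrong order.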

Logging best practices:

  • Use levels correctly (DEBUG, INFO, WARN, ERROR) and do not flood with irrelevant logs.
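  • Never log sensitive data (passwords, card numbers, personal data). Beyond good practice, this is often a legal or contractual obligation (for example, data-protection law for personal data and PCI DSS for card data).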
  • Always include correlation identifiers (trace_id).
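
One way to enforce the "never log sensitive data" rule is to pass every set of log fields through a redaction step before it reaches the logger. The sketch below is a minimal deny-list version; `LogRedactor`, the key names in `SENSITIVE_KEYS`, and the `[REDACTED]` marker are assumptions for illustration, and a real list would come from your own data-classification policy.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class LogRedactor {
    // Hypothetical deny-list of field names that must never reach the logs.
    static final Set<String> SENSITIVE_KEYS =
            Set.of("password", "card_number", "cvv", "email");

    // Returns a copy of the fields with sensitive values masked; the input is not modified.
    static Map<String, Object> redact(Map<String, Object> fields) {
        Map<String, Object> safe = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            boolean sensitive =
                    SENSITIVE_KEYS.contains(entry.getKey().toLowerCase(Locale.ROOT));
            safe.put(entry.getKey(), sensitive ? "[REDACTED]" : entry.getValue());
        }
        return safe;
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("order_id", 123);
        fields.put("card_number", "4111111111111111");
        System.out.println(redact(fields));
    }
}
```

A deny-list only catches the field names you thought of; it does not detect sensitive values hidden inside free-text messages, which is one more reason to keep messages short and put data in named fields.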

  3. Metrics: the RED and USE methodologies

Metrics are numeric values measured over time: counters (total requests), gauges (memory used), and histograms (latency distribution). Since you can measure thousands of things, two methodologies help you choose what to measure.

RED focuses on the service from the user's point of view:

RED metric | Meaning                | Question
-----------|------------------------|---------------------------------
Rate       | Requests per second    | How much traffic am I receiving?
Errors     | Error rate             | How many requests fail?
Duration   | Latency (distribution) | How long do they take?
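
As a rough illustration of the three RED numbers, the sketch below accumulates them in process for one observation window. `RedRecorder` is a hypothetical class, not part of any metrics library; real systems export raw counters and histograms and let the backend compute these values.

```java
final class RedRecorder {
    private long requests;
    private long errors;
    private long totalDurationMs;

    // Call once per finished request.
    synchronized void record(long durationMs, boolean failed) {
        requests++;
        if (failed) {
            errors++;
        }
        totalDurationMs += durationMs;
    }

    // Rate: requests per second over the observation window.
    synchronized double ratePerSecond(double windowSeconds) {
        return requests / windowSeconds;
    }

    // Errors: fraction of requests that failed (0.0 to 1.0).
    synchronized double errorRatio() {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    // Duration: the mean only; it hides slow outliers, which is why
    // the table above asks for a distribution.
    synchronized double meanDurationMs() {
        return requests == 0 ? 0.0 : (double) totalDurationMs / requests;
    }

    public static void main(String[] args) {
        RedRecorder red = new RedRecorder();
        red.record(100, false);
        red.record(300, true);
        System.out.println(red.ratePerSecond(1.0) + " req/s, "
                + red.errorRatio() + " error ratio, "
                + red.meanDurationMs() + " ms mean");
    }
}
```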

USE focuses on the resources (CPU, memory, disk, network):

USE metric  | Meaning                | Question
------------|------------------------|----------------------------------------
Utilization | Percentage in use      | How busy is it?
Saturation  | Work queued waiting    | Is there more demand than it can handle?
Errors      | Errors of the resource | Is the resource failing?
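
USE applies to any resource with finite capacity, not only hardware. As a sketch, the example below treats a JDK thread pool as the resource: utilization is the share of busy workers and saturation is the length of the waiting queue. `PoolUse` and `snapshotUnderLoad` are hypothetical names, and the blocking tasks exist only to hold the pool in a stable state while it is measured.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolUse {
    // Utilization: fraction of worker threads currently busy (0.0 to 1.0).
    static double utilization(ThreadPoolExecutor pool) {
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }

    // Saturation: tasks accepted but still waiting for a free worker.
    static int saturation(ThreadPoolExecutor pool) {
        return pool.getQueue().size();
    }

    // Submits blocking tasks to a fixed pool and returns {utilization, saturation}.
    static double[] snapshotUnderLoad(int workers, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(Math.min(workers, tasks));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // hold the worker so the measurement is stable
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // wait until every available worker is busy
            return new double[] { utilization(pool), saturation(pool) };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] use = snapshotUnderLoad(2, 5);
        System.out.println("utilization=" + use[0] + ", queued=" + use[1]);
    }
}
```

The third letter, Errors, would correspond here to tasks the pool rejects, which this sketch avoids by using an unbounded queue.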

Practical rule: use RED for services (what the user sees) and USE for resources (the underlying infrastructure). Together they give you a complete picture.

# Prometheus metrics exposition format
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{service="payments",status="200"} 48210
http_requests_total{service="payments",status="500"} 37

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.3"} 47100
http_request_duration_seconds_bucket{le="1.0"} 48190

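Explanation: the first block is a counter of requests, labeled by service and status code; from it you derive the Rate (how fast the counter grows) and the Errors (the share of requests with an error status) of RED. The second is a histogram. Each bucket counts the requests that took at most its threshold (le = less than or equal to), so buckets are cumulative: the 47,100 requests under 0.3 s are also included in the 48,190 under 1.0 s. From these buckets you estimate percentiles such as the p95 (Duration of RED). The sample is abbreviated: a complete histogram also exposes a le="+Inf" bucket plus _sum and _count series. The labels ({service=...}) allow you to filter and group the metrics.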
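
To show how a percentile comes out of cumulative buckets, here is a minimal sketch that finds the bucket containing the requested rank and interpolates linearly inside it. This is similar in spirit to what metric backends do, but it is not any backend's actual code. `HistogramQuantile` is a hypothetical name, and the total of 48,247 observations used in `main` is an assumption (the sum of the two request counters in the sample), since the sample omits the +Inf bucket.

```java
final class HistogramQuantile {
    // upperBounds[i] is the "le" threshold of bucket i, ascending; the last may be +Infinity.
    // cumulativeCounts[i] is the number of observations <= upperBounds[i].
    static double quantile(double q, double[] upperBounds, long[] cumulativeCounts) {
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations, nothing to estimate
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                if (Double.isInfinite(upperBounds[i])) {
                    // Rank falls beyond the last finite bucket: report that bound.
                    return i == 0 ? Double.NaN : upperBounds[i - 1];
                }
                double lower = i == 0 ? 0.0 : upperBounds[i - 1];
                long below = i == 0 ? 0 : cumulativeCounts[i - 1];
                long inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) {
                    return lower;
                }
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        return Double.NaN;
    }

    public static void main(String[] args) {
        double[] bounds = { 0.3, 1.0, Double.POSITIVE_INFINITY };
        long[] counts = { 47100, 48190, 48247 }; // last value is an assumed total
        System.out.println("p95 estimate (s): " + quantile(0.95, bounds, counts));
        System.out.println("p99 estimate (s): " + quantile(0.99, bounds, counts));
    }
}
```

The result is only as precise as the bucket boundaries: with a single bucket covering 0 to 0.3 s, every percentile that falls inside it is a guess within that range, so choose thresholds close to the latencies you care about.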

  4. Distributed traces and correlation

When a request crosses several services, how do you follow its complete trail? With distributed traces. A trace represents the complete journey of a request; it is composed of spans, where each span is an operation (a call to a service, a query to the database).

The key mechanism is context propagation: the first service generates a unique trace_id and passes it to each service it calls, usually in HTTP headers. This way, all the logs and spans of that request share the same trace_id and can be correlated.

graph TD
    A["Span: API Gateway (trace_id=abc, 400ms)"] --> B["Span: Orders Service (120ms)"]
    A --> C["Span: Payment Service (250ms)"]
    C --> D["Span: Call to the Bank (230ms)"]

This tree of spans shows at a glance where the time goes: of the 400 ms total, the Payment service consumes 250 ms, and almost all of that (230 ms) is spent waiting for the Bank. Without traces, you would only know that "the request took 400 ms"; with them, you know exactly where the bottleneck is.
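
The same reading of the tree can be done mechanically by computing each span's self time, meaning its duration minus the time spent in its children. The sketch below rebuilds the example trace with a hypothetical `SpanNode` class and assumes that child spans run one after another; with parallel children, subtracting their durations would overstate the time spent in them.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanNode {
    final String name;
    final long durationMs;
    final List<SpanNode> children = new ArrayList<>();

    SpanNode(String name, long durationMs) {
        this.name = name;
        this.durationMs = durationMs;
    }

    SpanNode child(SpanNode span) {
        children.add(span);
        return this;
    }

    // Time spent in this span itself, assuming its children run sequentially.
    long selfTimeMs() {
        long inChildren = 0;
        for (SpanNode c : children) {
            inChildren += c.durationMs;
        }
        return Math.max(0, durationMs - inChildren);
    }

    // The span with the largest self time in this subtree.
    SpanNode bottleneck() {
        SpanNode worst = this;
        for (SpanNode c : children) {
            SpanNode candidate = c.bottleneck();
            if (candidate.selfTimeMs() > worst.selfTimeMs()) {
                worst = candidate;
            }
        }
        return worst;
    }

    // The trace from the diagram above.
    static SpanNode exampleTrace() {
        SpanNode bank = new SpanNode("Call to the Bank", 230);
        SpanNode payment = new SpanNode("Payment Service", 250).child(bank);
        SpanNode orders = new SpanNode("Orders Service", 120);
        return new SpanNode("API Gateway", 400).child(orders).child(payment);
    }

    public static void main(String[] args) {
        SpanNode worst = exampleTrace().bottleneck();
        System.out.println(worst.name + ": " + worst.selfTimeMs() + " ms of self time");
    }
}
```

Self time separates "this service is slow" from "this service is waiting on something slow": the Payment service lasts 250 ms, yet only 20 ms of that is its own work.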

# Trace context propagation via the standard W3C Trace Context header
curl https://api.mycompany.com/orders \
  -H "traceparent: 00-abc123def456...-0011223344556677-01"
# Format: version - trace_id (32 hex chars) - parent span_id (16 hex chars) - flags (01 = sampled)

The traceparent header (W3C Trace Context standard) carries the trace_id, which identifies the whole trace, and the span_id of the caller, which becomes the parent of the next span. Each service reads the header, creates its own child span, and forwards the header to the next service with the same trace_id but its own span_id, keeping the chain linked.
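
The read-then-forward step can be sketched in a few lines. The example below parses a version 00 traceparent value and builds the header a service would send downstream. `TraceParent` is a hypothetical helper written for this lesson, strict about the version 00 layout only; in real code the OpenTelemetry SDK's propagators do this for you. The header value in `main` is sample data.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {
    // version(2 hex) - trace_id(32 hex) - parent span_id(16 hex) - flags(2 hex)
    private static final Pattern FORMAT = Pattern.compile(
            "([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String version;
    final String traceId;
    final String parentId;
    final String flags;

    private TraceParent(String version, String traceId, String parentId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        if (m.group(2).matches("0+") || m.group(3).matches("0+")) {
            throw new IllegalArgumentException("All-zero ids are not valid");
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // True when the lowest flag bit (sampled) is set.
    boolean sampled() {
        return (Integer.parseInt(flags, 16) & 0x01) != 0;
    }

    // Header to send downstream: same trace_id, this service's span as the new parent.
    String childHeader(String ownSpanId) {
        return version + "-" + traceId + "-" + ownSpanId + "-" + flags;
    }

    // Random non-zero 64-bit span id as 16 lowercase hex characters.
    static String newSpanId() {
        long value;
        do {
            value = ThreadLocalRandom.current().nextLong();
        } while (value == 0);
        return String.format("%016x", value);
    }

    public static void main(String[] args) {
        TraceParent incoming =
                parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("trace_id: " + incoming.traceId);
        System.out.println("forward:  " + incoming.childHeader(newSpanId()));
    }
}
```

Only the third field changes from hop to hop, which is exactly what lets a backend rebuild the span tree: every span knows its trace and its parent.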

  5. OpenTelemetry: the unifying standard

Historically, each observability tool had its own format and instrumentation, locking you into a vendor. OpenTelemetry (OTel) is the open standard (from the CNCF) that unifies the generation of the three signals (logs, metrics, and traces) under a single API and a single protocol (OTLP). You instrument your code once and can send the data to any backend (Prometheus, Jaeger, Grafana, etc.).

graph LR
    APP[Your application + OTel SDK] -->|OTLP| COL[OpenTelemetry Collector]
    COL --> M[Metrics: Prometheus]
    COL --> T[Traces: Jaeger]
    COL --> L[Logs: Loki]

The central piece is the Collector: the application, instrumented with the OTel SDK, sends all its telemetry to the Collector via the OTLP protocol, and the Collector processes it and forwards it to each specialized backend. The great advantage: your code neither knows nor cares which backend is behind it; you can change them without touching the application.
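
The decoupling can be modelled in a few lines. The sketch below is a toy stand-in for the Collector's routing role, not its real API or configuration format: `ToyCollector`, `Signal`, and the string payloads are all hypothetical. The point it illustrates is that the application only ever calls `receive`, while the backends are wired up separately.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class ToyCollector {
    enum Signal { LOGS, METRICS, TRACES }

    // One list of backends per signal type, configured outside the application.
    private final Map<Signal, List<Consumer<String>>> pipelines =
            new EnumMap<>(Signal.class);

    // Configuration side: attach a backend to a signal.
    ToyCollector export(Signal signal, Consumer<String> backend) {
        pipelines.computeIfAbsent(signal, s -> new ArrayList<>()).add(backend);
        return this;
    }

    // Application side: hand over telemetry without naming any backend.
    // Returns how many backends received the payload.
    int receive(Signal signal, String payload) {
        List<Consumer<String>> backends = pipelines.getOrDefault(signal, List.of());
        backends.forEach(backend -> backend.accept(payload));
        return backends.size();
    }

    public static void main(String[] args) {
        List<String> traceBackend = new ArrayList<>();
        ToyCollector collector = new ToyCollector()
                .export(Signal.TRACES, traceBackend::add)
                .export(Signal.LOGS, line -> System.out.println("log backend: " + line));

        collector.receive(Signal.TRACES, "span process-payment");
        collector.receive(Signal.LOGS, "Payment declined");
        System.out.println("trace backend holds: " + traceBackend);
    }
}
```

Replacing a backend means changing the `export` wiring, which in the real Collector lives in its configuration; the `receive` calls in the application stay exactly the same.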

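// Create a span manually with the OpenTelemetry API
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

Span span = tracer.spanBuilder("process-payment").startSpan();
try (Scope scope = span.makeCurrent()) {
    span.setAttribute("order_id", 123);   // Queryable attributes on the span
    processPayment();                      // The actual work
} catch (Exception e) {
    span.recordException(e);               // Attaches the exception to the span as an event
    span.setStatus(StatusCode.ERROR);      // Marks the span as failed; recordException alone does not
    throw e;
} finally {
    span.end();                            // ALWAYS close the span
}

This code creates a span called process-payment that measures the duration of the operation. makeCurrent() makes it the "active" span so that spans created by nested calls hang off it automatically; the try-with-resources block closes that scope when the work finishes. If there is an error, recordException attaches the exception to the span and setStatus(StatusCode.ERROR) marks the span as failed, so the failure stands out in the trace. The span.end() in the finally block guarantees that the span is closed and its duration recorded no matter what.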

  6. Common mistakes and tips

  • Confusing monitoring with observability. Monitoring is watching known metrics; observability is being able to investigate problems you did not anticipate. You need both.
  • Free-text logs. Impossible to analyze at scale. Use structured logging (JSON) from the start.
  • Not propagating the trace_id. Without correlation, each service's logs are disconnected islands. Always propagate the trace context.
  • Logging sensitive data. It is a security and compliance failure. Filter passwords, cards, and personal data before logging.
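  • Too many high-cardinality metrics. Every distinct combination of label values creates a separate time series, so labels with unbounded values (like user_id on every metric) make storage and query cost explode. Keep label cardinality bounded and put per-user detail in logs or traces instead.
  • Tracing 100% of traffic without sampling. In high-volume systems this generates an enormous cost. Sample instead: for example, keep a fixed percentage of traces, or decide once the trace is complete and keep mainly those with errors or high latency.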
  • Tip: instrument thinking about the questions you will want to ask when something fails at 3 in the morning.
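
On the sampling point, the simplest strategy is a fixed percentage decided from the trace_id itself, so that every service reaches the same decision for the same request without coordinating. The sketch below is one minimal way to do that; `TraceSampler` is a hypothetical name, it assumes trace ids are hex strings of at least 16 characters, and it is not the algorithm of any particular SDK.

```java
final class TraceSampler {
    // Deterministic decision: the same trace_id and ratio always give the same answer.
    static boolean shouldSample(String traceId, double ratio) {
        if (ratio >= 1.0) {
            return true;  // keep everything
        }
        if (ratio <= 0.0) {
            return false; // keep nothing
        }
        // Interpret the last 16 hex characters as a number and drop the sign bit,
        // which gives a value between 0 and Long.MAX_VALUE.
        String tail = traceId.substring(traceId.length() - 16);
        long value = Long.parseUnsignedLong(tail, 16) & Long.MAX_VALUE;
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return value < threshold;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736"; // sample id
        System.out.println("keep at 10%?  " + shouldSample(traceId, 0.10));
        System.out.println("keep at 100%? " + shouldSample(traceId, 1.0));
    }
}
```

Because the decision is made before the request finishes, this approach cannot favor failed or slow requests; keeping those preferentially requires deciding after the trace is complete, at the cost of buffering spans somewhere first.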

  7. Exercises

  1. Choosing the right pillar. For each situation, indicate which pillar (logs, metrics, or traces) you would use first: (a) a dashboard shows that the error rate has risen to 5%; (b) you want to know exactly why order 987 failed; (c) a request takes 3 seconds and you want to know which service is slowing it down.

  2. Improve a log. Rewrite this free-text log as a structured JSON log with fields useful for diagnosis: "ERROR: failed to connect to the database after 3 attempts"

  3. RED vs. USE. Classify each metric according to the most appropriate methodology: (a) requests per second of the API; (b) CPU utilization percentage of the server; (c) disk queue length; (d) rate of HTTP 500 responses.

Solutions

  1. (a) Metrics (they alert you that something is wrong: the error rate). (b) Logs (they give you the exact detail of that specific order's failure). (c) Traces (they show you the request's path and which span/service consumes the time). The natural investigation flow is usually metric -> trace -> log.

  2. Example of a structured log:

    {
      "timestamp": "2026-06-30T10:20:00Z",
      "level": "ERROR",
      "service": "orders",
      "message": "Database connection failure",
      "component": "db_pool",
      "retries": 3,
      "db_host": "db-primary",
      "trace_id": "abc123"
    }
    

    Now you can filter by component, count retries, or link with the trace via trace_id.

  3. (a) RED (Rate, a service metric). (b) USE (Utilization of a resource). (c) USE (Saturation: work queued on the disk). (d) RED (Errors, a service metric). In short: a and d measure the service (RED); b and c measure resources (USE).

Conclusion

You have learned that observability rests on three complementary pillars: logs (what happened), metrics (how much / how well), and traces (where it went). You have seen how structured logs make them analyzable, how RED and USE guide you on what to measure, how correlation via trace_id links all the signals of a request, and how OpenTelemetry standardizes all the instrumentation. An observable system is one you can interrogate to understand any problem, anticipated or not.

The metrics we have learned to collect (latency, throughput, percentiles) are precisely the ones we will measure when subjecting the system to load. In the last lesson of the module, Performance and Load Testing, we will see how to measure and validate that the system withstands what the business needs.

Application Architecture Course


© Copyright 2026. All rights reserved