A system can work perfectly on your laptop with a single user and yet collapse on launch day when thousands of people arrive at once. Performance is not guessed: it is measured. And the only way to know how a system will behave under real load before it happens is to subject it to controlled load testing. In this lesson you will learn to distinguish and measure correctly the two fundamental performance quantities (latency and throughput), why averages lie and you must use percentiles (p95, p99), the different types of test depending on what you want to discover (load, stress, soak, spike), how to write a real test with k6 and JMeter, and how to locate the bottlenecks that limit your system.

Contents

  1. Latency vs. Throughput
  2. Why averages lie: p95/p99 percentiles
  3. Types of performance test
  4. Practical example with k6 and JMeter
  5. Identifying bottlenecks
  6. Common mistakes and tips
  7. Exercises

  1. Latency vs. Throughput

These are the two basic performance metrics, and it is important not to confuse them:

  • Latency: the time it takes for one request to complete. It is measured in milliseconds. It answers "how fast?".
  • Throughput: the number of requests the system processes per unit of time. It is measured in requests per second (req/s). It answers "how many per second?".

                    Latency                      Throughput
  Measures          Time of one request          Requests per second
  Unit              ms (milliseconds)            req/s (requests/sec)
  Question          How fast does it respond?    How much volume can it handle?
  Who notices it    The individual user          The system as a whole

The highway analogy makes it clear: latency is how long one car takes to travel the highway; throughput is how many cars pass per minute. The two are distinct quantities: a highway can have high throughput (many lanes) and still have high latency (it is very long). They are linked under load, though: when a system saturates, throughput plateaus while latency skyrockets, because requests start to queue.
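
The saturation behaviour described above can be illustrated with a small idealized model. This is a sketch, not a measurement: it assumes a closed loop in which each user sends a request, waits for the response, pauses for a think time, and repeats, and it assumes the server has a hard throughput ceiling. The function name closedLoopEstimate and all the numbers are hypothetical; the relationship used is Little's Law (users in the loop = throughput × time per cycle).

```javascript
// Idealized closed-loop model (illustrative assumption, not measured data).
// users:          concurrent users, each with one request in flight at most
// serviceTimeS:   response time of the unloaded server, in seconds
// thinkTimeS:     pause between a response and the user's next request
// maxThroughput:  hypothetical ceiling of the server, in req/s
function closedLoopEstimate(users, serviceTimeS, thinkTimeS, maxThroughput) {
  // Without saturation, each user completes one cycle every (service + think) seconds.
  const demand = users / (serviceTimeS + thinkTimeS);
  // The server cannot deliver more than its ceiling.
  const throughput = Math.min(demand, maxThroughput);
  // Little's Law: users = throughput * (latency + thinkTime), solved for latency.
  const latencyS = users / throughput - thinkTimeS;
  return { throughput, latencyS };
}

for (const users of [50, 200, 500, 1000]) {
  const { throughput, latencyS } = closedLoopEstimate(users, 0.1, 1, 200);
  console.log(`${users} users: ${throughput.toFixed(1)} req/s, ${(latencyS * 1000).toFixed(0)} ms`);
}
```

Under these assumptions, throughput stops growing once the ceiling of 200 req/s is reached, and every additional user only adds waiting time: with 1,000 users the estimated latency is 4 s, against roughly 100 ms with 50 users.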

  2. Why averages lie: p95/p99 percentiles

Imagine you measure latency and get an average of around 100 ms. Sounds good... but the average hides the bad cases. If out of every 100 requests, 95 take 50 ms and 5 take 1,500 ms, the average is only about 120 ms, yet 1 in 20 users suffers a terrible experience.

The solution is percentiles:

Percentile        Meaning
p50 (median)      Half of the requests take less than this
p95               95% take less; only the worst 5% are slower
p99               99% take less; reflects the worst 1%
p99.9             For very demanding services; the "long tail"

Latencies of 20 requests (ms), sorted:
  40 42 45 48 50 50 52 55 58 60 62 65 70 75 80 90 110 250 800 1500

  Mean (average):  ~180 ms   <- misleading, the two outliers (800 and 1500) inflate it
  p50 (median):     ~61 ms   <- the typical experience
  p95:             ~800 ms   <- 1 in 20 suffers this or worse
  p99:            ~1500 ms   <- the real worst case
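
The figures for these 20 samples can be reproduced with a few lines of JavaScript. This is a minimal sketch: the helper names are illustrative, and the percentile function uses the nearest-rank definition, which is only one of several in use. Tools that interpolate between samples may report slightly different values for the same data.

```javascript
// The 20 sample latencies from the example above, in milliseconds.
const latencies = [40, 42, 45, 48, 50, 50, 52, 55, 58, 60,
                   62, 65, 70, 75, 80, 90, 110, 250, 800, 1500];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Conventional median: the middle value, or the average of the two middle values.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

console.log(mean(latencies));           // 180.1
console.log(median(latencies));         // 61
console.log(percentile(latencies, 95)); // 800
console.log(percentile(latencies, 99)); // 1500
```

The mean lands well above what 16 of the 20 requests actually experienced, which is why it says little about either the typical case or the worst one.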

The professional rule: define your service level objectives (SLOs) in percentiles, not in averages. A good objective has the form "the p95 must be below 300 ms". High percentiles (p99, p99.9) matter especially because in systems with many dependencies, one user request triggers many internal ones, and it only takes one falling into the slow tail to ruin the experience. For example, if a page needs 100 independent internal calls, about 63% of page loads will include at least one call from the slowest 1%.
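
The fan-out effect can be quantified with a short calculation. The sketch below assumes the internal calls are statistically independent, which real systems only approximate; the function name chanceOfHittingTail is illustrative.

```javascript
// Probability that a user request touches the slow tail at least once,
// assuming `calls` independent internal calls, each with probability
// `tailFraction` of being slow (0.01 means "slower than the p99").
function chanceOfHittingTail(calls, tailFraction) {
  return 1 - Math.pow(1 - tailFraction, calls);
}

for (const calls of [1, 10, 50, 100]) {
  const pct = (chanceOfHittingTail(calls, 0.01) * 100).toFixed(1);
  console.log(`${calls} internal calls: ${pct}% of user requests hit the p99 tail`);
}
```

Even with only 10 internal calls, almost 10% of user requests include a call from the slowest 1%, so the p99 of a dependency quickly becomes a common experience for its callers.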

  3. Types of performance test

Not all tests seek the same thing. According to how the load is applied, we distinguish:

Type                What it does                      What it reveals
Load                Expected sustained load           Do I meet my SLOs under normal conditions?
Stress              Raise the load until it breaks    What is my limit? How does it fail?
Soak (endurance)    Moderate load over hours/days     Are there memory leaks or slow degradation?
Spike               Sudden, abrupt rise               Does it withstand a sudden peak (flash sale, viral)?

graph LR
    L["Load: steady load"]
    S["Stress: rises until it breaks"]
    K["Soak: steady, many hours"]
    P["Spike: sudden peak"]

Each type responds to a different business risk:

  • The load test validates that the system meets the SLOs with the expected traffic.
  • The stress test reveals the breaking point and, crucially, how it breaks (does it degrade gracefully or collapse?).
  • The soak test detects problems that only appear over time: memory leaks, connections that are not closed, disks that fill up with logs.
  • The spike test simulates events like a viral campaign or a flash sale, checking whether autoscaling reacts in time.
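
The four load shapes can be expressed as k6 stages arrays. The following is a sketch only: every duration and target is a hypothetical placeholder that would need to be sized from real traffic data, and the helper peakUsers exists purely for illustration.

```javascript
// Illustrative k6 `stages` arrays, one per test type. Every duration and
// target below is a hypothetical placeholder, not a recommendation.
const profiles = {
  // Load: ramp to the expected traffic and hold it.
  load: [
    { duration: '5m', target: 100 },
    { duration: '30m', target: 100 },
    { duration: '5m', target: 0 },
  ],
  // Stress: keep stepping up until the SLOs are breached or the system breaks.
  stress: [
    { duration: '5m', target: 100 },
    { duration: '5m', target: 200 },
    { duration: '5m', target: 400 },
    { duration: '5m', target: 800 },
    { duration: '5m', target: 0 },
  ],
  // Soak: moderate load held for many hours.
  soak: [
    { duration: '5m', target: 80 },
    { duration: '12h', target: 80 },
    { duration: '5m', target: 0 },
  ],
  // Spike: jump from a low baseline to a peak within seconds.
  spike: [
    { duration: '2m', target: 20 },
    { duration: '10s', target: 1000 },
    { duration: '3m', target: 1000 },
    { duration: '10s', target: 20 },
    { duration: '2m', target: 0 },
  ],
};

// Highest number of virtual users a profile reaches.
function peakUsers(stages) {
  return Math.max(...stages.map((stage) => stage.target));
}

for (const [name, stages] of Object.entries(profiles)) {
  console.log(`${name}: ${stages.length} stages, peak ${peakUsers(stages)} VUs`);
}
```

In a k6 script, one of these arrays would be assigned to the stages option, for example `export const options = { stages: profiles.spike };`.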

  4. Practical example with k6 and JMeter

k6 is a modern load-testing tool in which scenarios are written as JavaScript code, ideal for integrating into CI/CD.

import http from 'k6/http';
import { check, sleep } from 'k6';

// We define the load profile: ramp up, hold, and ramp down virtual users
export const options = {
  stages: [
    { duration: '30s', target: 50 },   // Ramp from 0 to 50 users in 30s
    { duration: '1m',  target: 50 },   // Hold 50 users for 1 minute
    { duration: '20s', target: 0 },    // Ramp down from 50 to 0 users in 20s
  ],
  thresholds: {
    http_req_duration: ['p(95)<300'],  // CRITERION: the p95 must be < 300 ms
    http_req_failed:   ['rate<0.01'],  // and fewer than 1% errors
  },
};

export default function () {
  const res = http.get('https://api.mycompany.com/products');
  check(res, { 'status 200': (r) => r.status === 200 });  // Verify OK response
  sleep(1);   // 1s pause simulating the user's "reading time"
}

Detailed explanation: the stages block defines a load profile in three phases (ramp-up, plateau, and ramp-down), where target is the simultaneous virtual users (VUs). The thresholds are the heart of the test: they are the pass/fail criteria; if the p95 exceeds 300 ms or errors go above 1%, k6 marks the test as failed (very useful for a CI/CD pipeline to block a slow deployment). The default function is what each virtual user runs in a loop: it requests the product list, verifies with check that it returns 200, and pauses for 1 second simulating human behavior.

# Run the load test with k6
k6 run load-test.js
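
In a pipeline it can be useful to stop a failing run early instead of waiting for the whole profile to finish. The sketch below is a variant of the thresholds shown earlier that uses k6's long-form threshold syntax (threshold, abortOnFail, delayAbortEval). The 10-second delay is an arbitrary illustrative value, and the object is declared with a plain const here so the snippet stands alone; in a k6 script it would be exported as `export const options`.

```javascript
// Variant of the earlier options: abort the run once the p95 criterion fails.
const options = {
  stages: [
    { duration: '30s', target: 50 },
    { duration: '1m', target: 50 },
    { duration: '20s', target: 0 },
  ],
  thresholds: {
    http_req_duration: [
      // Long-form threshold: stop the test early when it is crossed.
      // delayAbortEval postpones the check so the ramp-up can produce samples first.
      { threshold: 'p(95)<300', abortOnFail: true, delayAbortEval: '10s' },
    ],
    // Short-form threshold: evaluated, but does not abort the run.
    http_req_failed: ['rate<0.01'],
  },
};

console.log(JSON.stringify(options.thresholds, null, 2));
```

The trade-off is that an aborted run yields less data for diagnosis, so this form suits gatekeeping in CI/CD more than exploratory stress tests.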

JMeter, from Apache, is the classic tool, based on a graphical interface and test plans in XML. It is very powerful and mature, with support for many protocols. Quick comparison:

                        k6                       JMeter
Test definition         Code (JavaScript)        GUI / XML
Learning curve          Gentle for developers    Steeper
CI/CD integration       Excellent (native)       Possible, more laborious
Resource consumption    Low and efficient        Higher (JVM/thread-based)
Protocols               Mostly HTTP/web          Many (JDBC, FTP, JMS...)

The choice depends on the context: k6 shines in development and automation teams; JMeter in complex, multi-protocol scenarios or when a visual tool is preferred.

  5. Identifying bottlenecks

A bottleneck is the resource that limits the performance of the entire system: no matter how much you improve the rest, the system will not go faster than its slowest component. The methodology consists of applying load and observing which resource saturates first.

Common suspects, roughly in the order they tend to show up:

Bottleneck            Typical symptom                               Possible solution
Database              DB CPU at 100%, slow queries                  Indexes, cache, read replicas
Connection pool       Requests waiting for a free connection        Increase the pool, release connections sooner
Application CPU       CPU at 100% with little traffic               Optimize code, scale horizontally
Memory                Frequent garbage collection, swapping         More RAM, fix memory leaks
Network / I/O         High latency without saturating CPU           Compress, reduce calls, CDN
Locks / contention    Throughput does not rise when adding users    Reduce critical sections (remember the Universal Scalability Law, USL)

The correct process is iterative: you find the bottleneck, resolve it, measure again... and another different bottleneck appears (because now the limit is elsewhere). Optimizing is uncovering successive limits. And an essential warning: do not optimize without measuring first. Intuition about where the problem is usually gets it wrong; the data from a load test does not.
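
One pass of this measure-then-identify loop can be sketched in code. The example assumes a stepped load test has already produced one row per load level; the numbers are made up for illustration, and the field names (appCpu, dbCpu) and the function findSaturation are hypothetical.

```javascript
// Hypothetical results of a stepped load test (made-up numbers for illustration):
// one row per load step, with observed throughput and resource utilization (0 to 1).
const steps = [
  { users: 50,  throughput: 48,  appCpu: 0.20, dbCpu: 0.35 },
  { users: 100, throughput: 95,  appCpu: 0.38, dbCpu: 0.68 },
  { users: 200, throughput: 140, appCpu: 0.52, dbCpu: 0.97 },
  { users: 400, throughput: 143, appCpu: 0.53, dbCpu: 0.99 },
];

// Returns the first step at which adding users no longer adds meaningful
// throughput (relative gain below minGain), together with the resources
// whose utilization is at or above saturationLevel at that step.
function findSaturation(rows, minGain = 0.05, saturationLevel = 0.9) {
  for (let i = 1; i < rows.length; i++) {
    const gain = (rows[i].throughput - rows[i - 1].throughput) / rows[i - 1].throughput;
    if (gain < minGain) {
      const saturated = Object.keys(rows[i]).filter(
        (key) => key !== 'users' && key !== 'throughput' && rows[i][key] >= saturationLevel
      );
      return { atUsers: rows[i].users, saturated };
    }
  }
  return null; // throughput was still scaling at the last step
}

console.log(findSaturation(steps)); // atUsers: 400, saturated: ['dbCpu']
```

With this sample data the throughput plateau coincides with a saturated database CPU while the application CPU is barely half used, so the database would be the first thing to fix before measuring again.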

  6. Common mistakes and tips

  • Looking only at the average. It hides the bad cases. Always measure p95 and p99.
  • Confusing latency with throughput. They are different things; a system that answers each request quickly can still have low throughput, and vice versa.
  • Testing against an environment different from production. If the test environment has half the resources or toy data, the results cannot be extrapolated. Use a realistic environment.
  • Not including realistic wait times. Users who hammer without pauses (sleep) generate an unrealistic load. Model human behavior.
  • Optimizing blindly. Guessing the bottleneck wastes effort. Measure, identify, and only then act.
  • Forgetting soak tests. Many failures (memory leaks) only appear after hours. Short tests are not enough.
  • Tip: integrate a load test into your CI/CD with thresholds in percentiles, so that performance is an automatic quality criterion and not a surprise in production.

  7. Exercises

  1. Interpreting percentiles. A service has an average latency of 80 ms, p50 of 70 ms, p95 of 600 ms, and p99 of 2,000 ms. Is it healthy? What worries you and what would you investigate?

  2. Choosing the type of test. For each objective, indicate the type of test: (a) knowing from how many users on the system stops meeting the SLO; (b) checking whether there is a memory leak that brings down the service after 12 hours; (c) validating that the system survives the traffic peak of a product launch.

  3. Defining thresholds. The business requires that "99% of requests respond in less than half a second and that fewer than 0.5% fail". Write the corresponding k6 thresholds block.

Solutions

  1. It is not healthy, despite the good average and median. The p50 of 70 ms indicates that the typical experience is good, but the p95 of 600 ms and, above all, the p99 of 2,000 ms reveal a "long tail": 5% of requests take 600 ms or more and 1% take 2 seconds or more. I would investigate what those slow requests have in common: unindexed database queries? garbage collection pauses? calls to a slow external service? Distributed traces (previous lesson) are the ideal tool to locate it.

  2. (a) Stress test (raise the load until you find the point where the SLO is breached / it breaks). (b) Soak/endurance test (sustained load over many hours to detect slow degradation). (c) Spike test (sudden, abrupt rise in traffic).

  3. Thresholds block:

    thresholds: {
      http_req_duration: ['p(99)<500'],   // p99 below 500 ms
      http_req_failed:   ['rate<0.005'],  // fewer than 0.5% errors
    }
    

Conclusion

You have learned that performance is measured, not assumed: that latency (time per request) and throughput (volume per second) are different quantities, that the p95/p99 percentiles tell the truth the average hides, that each type of test (load, stress, soak, spike) responds to a different risk, how to write real tests with k6 and JMeter, and how to locate bottlenecks iteratively and based on data. Validating performance before production is the difference between a successful launch and a public collapse.

With this lesson you close Module 9, Quality, Security, and Observability. You have covered scalability, high availability, security by design, observability, and performance: the five attributes that separate an application that "works on my machine" from a system that is robust, secure, and ready for the real world. These concepts are not final phases, but criteria that must guide every architecture decision from day one.

Application Architecture Course

© Copyright 2026. All rights reserved