A system can work perfectly on your laptop with a single user and still collapse on launch day when thousands of people arrive at once. Performance is not guessed; it is measured. And the only way to know how a system will behave under real load before it happens is to subject it to controlled load testing. In this lesson you will learn to distinguish and correctly measure the two fundamental performance quantities (latency and throughput), why averages lie and you must use percentiles (p95, p99), which type of test to run depending on what you want to discover (load, stress, soak, spike), how to write a real test with k6 and how it compares with JMeter, and how to locate the bottlenecks that limit your system.
Contents
- Latency vs. Throughput
- Why averages lie: p95/p99 percentiles
- Types of performance test
- Practical example with k6 and JMeter
- Identifying bottlenecks
- Common mistakes and tips
- Exercises
Latency vs. Throughput
These are the two basic performance metrics, and it is important not to confuse them:
- Latency: the time it takes for one request to complete. It is measured in milliseconds. It answers "how fast?".
- Throughput: the number of requests the system processes per unit of time. It is measured in requests per second (req/s). It answers "how much volume?".
| | Latency | Throughput |
|---|---|---|
| Measures | Time of one request | Requests per second |
| Unit | ms (milliseconds) | req/s (requests/sec) |
| Question | How fast does it respond? | How much volume can it handle? |
| Who notices it | The individual user | The system as a whole |
The highway analogy makes it clear: latency is how long it takes one car to travel it; throughput is how many cars pass per minute. And they are independent: a highway can have high throughput (many lanes) but high latency (it is very long). In fact, when a system saturates, throughput stagnates while latency skyrockets, because requests start to queue.
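The saturation effect can be reproduced with a toy simulation. The sketch below assumes a hypothetical single server that needs a fixed 10 ms per request (a capacity of 100 req/s) and requests arriving at evenly spaced intervals; real systems have variable service times and several workers, so treat it as an illustration rather than a model of any particular system.

```javascript
// Hypothetical single-server queue: every request needs `serviceMs` of server time
// (capacity = 1000 / serviceMs req/s) and requests arrive at evenly spaced intervals.
function simulateQueue(offeredRps, serviceMs, requests) {
  const intervalMs = 1000 / offeredRps;
  let serverFreeAt = 0; // time at which the server finishes its current request
  let totalLatencyMs = 0;
  let lastFinishMs = 0;
  for (let i = 0; i < requests; i++) {
    const arrival = i * intervalMs;
    const start = Math.max(arrival, serverFreeAt); // wait in the queue if the server is busy
    const finish = start + serviceMs;
    serverFreeAt = finish;
    totalLatencyMs += finish - arrival; // latency = queueing time + service time
    lastFinishMs = finish;
  }
  return {
    throughputRps: requests / (lastFinishMs / 1000),
    meanLatencyMs: totalLatencyMs / requests,
  };
}

// 10 ms per request => capacity of 100 req/s
for (const offered of [50, 100, 200]) {
  const { throughputRps, meanLatencyMs } = simulateQueue(offered, 10, 1000);
  console.log(
    `offered ${offered} req/s -> throughput ${throughputRps.toFixed(0)} req/s, ` +
      `mean latency ${meanLatencyMs.toFixed(0)} ms`
  );
}
```

Under these assumptions, offering 50 or 100 req/s keeps latency at the 10 ms service time. Offering 200 req/s leaves throughput stuck at 100 req/s and pushes mean latency to about 2.5 s, and that figure keeps growing the longer the overload lasts, because the queue never drains.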
Why averages lie: p95/p99 percentiles
Imagine you measure latency and get an average of around 100 ms. Sounds good... but the average hides the bad cases. If out of every 100 requests 95 take 50 ms and 5 take 1,500 ms, the average is 122.5 ms (0.95 × 50 + 0.05 × 1,500), which still looks acceptable, yet 1 in every 20 requests suffers a terrible experience.
The solution is percentiles:
| Percentile | Meaning |
|---|---|
| p50 (median) | Half of the requests take less than this |
| p95 | 95% take less; the worst 5% is left out |
| p99 | 99% take less; reflects the worst 1% |
| p99.9 | 99.9% take less; used by very demanding services to watch the "long tail" |
```
Latencies of 20 requests (ms), sorted:
40 42 45 48 50 50 52 55 58 60 62 65 70 75 80 90 110 250 800 1500

Mean (average): ~180 ms   <- misleading: the two slowest values (800 and 1500) inflate it
p50 (median):   ~61 ms    <- the typical experience
p95:            ~800 ms   <- 1 in 20 requests is this slow or slower
p99:            ~1500 ms  <- the real worst case
```
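The figures above can be reproduced in a few lines. This sketch uses the nearest-rank definition of a percentile, which is only one of several. Load-testing tools may interpolate between samples and report slightly different values. With nearest rank, the p50 of these 20 samples is 60 ms. The conventional median of an even-sized sample averages the two middle values (60 and 62), which gives the 61 ms quoted above.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, not lexicographic
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

const latenciesMs = [40, 42, 45, 48, 50, 50, 52, 55, 58, 60,
                     62, 65, 70, 75, 80, 90, 110, 250, 800, 1500];

console.log('mean', mean(latenciesMs));           // 180.1
console.log('p50 ', percentile(latenciesMs, 50)); // 60
console.log('p95 ', percentile(latenciesMs, 95)); // 800
console.log('p99 ', percentile(latenciesMs, 99)); // 1500
```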
The professional rule: define your objectives (SLOs) in percentiles, not in averages. A good objective looks like "the p95 must be below 300 ms". High percentiles (p99, p99.9) matter especially in systems with many dependencies. There, one user request triggers many internal ones, and it only takes one of them falling into the slow tail to ruin the experience.
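A quick calculation shows why the tail matters in fan-out architectures. For simplicity, assume that a user request calls N backends and that each call independently has a 1% chance of landing in its slow tail (beyond its p99). The probability that the user request is slowed by at least one call is then 1 - 0.99^N.

```javascript
// Probability that a user request fanning out to `calls` backend calls hits at least
// one call in the slow tail, assuming calls are independent and each has probability
// `tailFraction` of being slow (0.01 = slower than that backend's p99).
function probAtLeastOneSlow(calls, tailFraction = 0.01) {
  return 1 - Math.pow(1 - tailFraction, calls);
}

for (const calls of [1, 10, 100]) {
  const pct = (probAtLeastOneSlow(calls) * 100).toFixed(1);
  console.log(`${calls} backend calls -> ${pct}% of user requests hit the slow tail`);
}
```

Under these assumptions, one call gives a 1% chance, 10 calls roughly 9.6%, and 100 calls about 63%. In other words, each backend's p99 becomes what most users actually experience.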
Types of performance test
Not all tests look for the same thing. Depending on how the load is applied, we distinguish:
| Type | What it does | What it reveals |
|---|---|---|
| Load | Expected sustained load | Do I meet my SLOs under normal conditions? |
| Stress | Raise the load until it breaks | What is my limit? How does it fail? |
| Soak (endurance) | Moderate load over hours/days | Are there memory leaks or slow degradation? |
| Spike | Sudden, abrupt rise | Does it withstand a sudden peak (flash sale, viral)? |
```mermaid
graph LR
    L["Load: steady load"]
    S["Stress: rises until it breaks"]
    K["Soak: steady, many hours"]
    P["Spike: sudden peak"]
```
Each type responds to a different business risk:
- The load test validates that the system meets the SLOs with the expected traffic.
- The stress test reveals the breaking point and, crucially, how it breaks (does it degrade gracefully or collapse?).
- The soak test detects problems that only appear over time: memory leaks, connections that are not closed, disks that fill up with logs.
- The spike test simulates events like a viral campaign or a flash sale, checking whether autoscaling reacts in time.
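In k6 these four types differ mainly in the shape of the stages array. The profiles below use illustrative, hypothetical values (user counts and durations should come from your own traffic data), and targetAt is a hypothetical helper that interpolates linearly between stage targets to show how many virtual users each profile asks for at a given second. The duration parser in this sketch only handles single-unit strings such as '30s', '2m' or '8h'.

```javascript
// Hypothetical k6-style stage profiles for the four test types (illustrative values only).
const profiles = {
  load: [
    { duration: '2m', target: 100 },   // ramp up to the expected traffic
    { duration: '10m', target: 100 },  // hold it and check the SLOs
    { duration: '2m', target: 0 },
  ],
  stress: [
    { duration: '2m', target: 100 },
    { duration: '2m', target: 200 },
    { duration: '2m', target: 400 },
    { duration: '2m', target: 800 },   // keep raising until the SLOs break
    { duration: '2m', target: 0 },
  ],
  soak: [
    { duration: '5m', target: 100 },
    { duration: '8h', target: 100 },   // moderate load held for hours
    { duration: '5m', target: 0 },
  ],
  spike: [
    { duration: '1m', target: 50 },
    { duration: '10s', target: 1000 }, // sudden jump
    { duration: '2m', target: 1000 },
    { duration: '10s', target: 50 },   // sudden drop: does the system recover?
    { duration: '2m', target: 50 },
  ],
};

const UNIT_SECONDS = { s: 1, m: 60, h: 3600 };

// Parses single-unit durations such as '30s', '2m' or '8h' into seconds.
function toSeconds(duration) {
  const match = /^(\d+)(s|m|h)$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  return Number(match[1]) * UNIT_SECONDS[match[2]];
}

// Hypothetical helper: VU target at second t, interpolating linearly within each stage.
function targetAt(stages, t) {
  let stageStart = 0;
  let previousTarget = 0; // the test starts from 0 VUs
  for (const stage of stages) {
    const length = toSeconds(stage.duration);
    if (t <= stageStart + length) {
      const progress = (t - stageStart) / length;
      return previousTarget + (stage.target - previousTarget) * progress;
    }
    stageStart += length;
    previousTarget = stage.target;
  }
  return previousTarget; // after the last stage
}

for (const [name, stages] of Object.entries(profiles)) {
  const total = stages.reduce((sum, s) => sum + toSeconds(s.duration), 0);
  console.log(`${name}: ${total} s in total, ${targetAt(stages, total / 2)} VUs at the midpoint`);
}
```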
Practical example with k6 and JMeter
k6 is a modern load-testing tool in which scenarios are written as JavaScript code, ideal for integrating into CI/CD.
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

// Load profile: ramp up, hold, and ramp down the virtual users (VUs)
export const options = {
  stages: [
    { duration: '30s', target: 50 }, // Ramp from 0 to 50 VUs in 30 s
    { duration: '1m', target: 50 },  // Hold 50 VUs for 1 minute
    { duration: '20s', target: 0 },  // Ramp down to 0 VUs
  ],
  thresholds: {
    http_req_duration: ['p(95)<300'], // PASS/FAIL criterion: p95 must be < 300 ms
    http_req_failed: ['rate<0.01'],   // ...and fewer than 1% of requests may fail
  },
};

export default function () {
  const res = http.get('https://api.mycompany.com/products');
  check(res, { 'status 200': (r) => r.status === 200 }); // Verify the response is OK
  sleep(1); // 1 s pause simulating the user's "reading time"
}
```
Detailed explanation: the stages block defines a load profile in three phases (ramp-up, plateau, and ramp-down), where target is the number of concurrent virtual users (VUs). The thresholds are the heart of the test: they are the pass/fail criteria. If the p95 exceeds 300 ms or the error rate goes above 1%, k6 marks the test as failed and exits with a non-zero code, which lets a CI/CD pipeline block a slow deployment. The default function is what each virtual user runs in a loop: it requests the product list, verifies with check that the status is 200, and pauses for 1 second to simulate human behavior.
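The stages and the sleep(1) also determine how much traffic the script actually generates. A closed-model rule from queueing theory, the interactive response time law X = N / (R + Z), relates throughput X to the number of virtual users N, the mean response time R and the think time Z. The sketch below applies it to the plateau of the script above, assuming one request per iteration as in that script; the response times are assumed values, not measurements.

```javascript
// Interactive response time law for a closed system: X = N / (R + Z)
//   N = concurrent virtual users, R = mean response time (s), Z = think time (s)
function expectedThroughput(vus, responseTimeSec, thinkTimeSec) {
  return vus / (responseTimeSec + thinkTimeSec);
}

// Plateau of the k6 script: 50 VUs and sleep(1). Response times are assumed values.
for (const responseTimeSec of [0.05, 0.2, 1.0]) {
  const rps = expectedThroughput(50, responseTimeSec, 1);
  console.log(`R = ${responseTimeSec * 1000} ms -> about ${rps.toFixed(1)} req/s`);
}
```

At an assumed 200 ms response time, 50 VUs with a 1 s pause produce only about 42 req/s, and if the API slows to 1 s they drop to 25 req/s. With a fixed number of VUs, a slower system therefore receives less load, so size the VUs (or the pause) from the request rate you actually want to test.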
Apache JMeter is the classic tool: it is based on a graphical interface, and its test plans are stored as XML. It is very powerful and mature, with support for many protocols. A quick comparison:
| | k6 | JMeter |
|---|---|---|
| Test definition | Code (JavaScript) | GUI / XML |
| Learning curve | Gentle for developers | Steeper |
| CI/CD integration | Excellent (native) | Possible, more laborious |
| Resource consumption | Low and efficient | Higher (JVM/thread-based) |
| Protocols | Mostly HTTP/web | Many (JDBC, FTP, JMS...) |
The choice depends on the context: k6 shines in development and automation teams; JMeter in complex, multi-protocol scenarios or when a visual tool is preferred.
Identifying bottlenecks
A bottleneck is the resource that limits the performance of the entire system: no matter how much you improve the rest, the system will not go faster than its slowest component. The methodology consists of applying load and observing which resource saturates first.
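The "slowest component" idea has a simple quantitative form in operational analysis, the bottleneck law. If each request keeps resource i busy for D_i ms, and no resource can be busy for more than 1000 ms per second, then the system cannot exceed 1000 / max(D_i) requests per second. The sketch uses hypothetical service demands and treats each resource as a single server, with no parallelism inside it.

```javascript
// Bottleneck law: X_max = 1 / D_max, where D is the busy time each request
// costs a resource. Demands are hypothetical values in ms per request.
function findBottleneck(demandsMs) {
  let bottleneck = null;
  let maxDemand = 0;
  for (const [resource, ms] of Object.entries(demandsMs)) {
    if (ms > maxDemand) {
      bottleneck = resource;
      maxDemand = ms;
    }
  }
  // Each resource can be busy at most 1000 ms per second, so the busiest
  // resource caps the whole system at 1000 / maxDemand requests per second.
  return { bottleneck, maxRps: 1000 / maxDemand };
}

const before = findBottleneck({ appCpu: 5, database: 8, network: 2 });
const after = findBottleneck({ appCpu: 5, database: 3, network: 2 }); // e.g. after adding an index
console.log(before); // { bottleneck: 'database', maxRps: 125 }
console.log(after);  // { bottleneck: 'appCpu', maxRps: 200 }
```

In this hypothetical case the database caps the system at 125 req/s. Once its demand drops to 3 ms, the limit moves to the application CPU at 200 req/s, which is exactly why removing bottlenecks is an iterative process.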
Common suspects, in order of frequency:
| Bottleneck | Typical symptom | Possible solution |
|---|---|---|
| Database | DB CPU at 100%, slow queries | Indexes, cache, read replicas |
| Connection pool | Requests waiting for a free connection | Increase the pool, release connections sooner |
| Application CPU | CPU at 100% with little traffic | Optimize code, scale horizontally |
| Memory | Frequent garbage collection, swapping | More RAM, fix memory leaks |
| Network / I/O | High latency without saturating CPU | Compress, reduce calls, CDN |
| Locks / contention | Throughput does not rise when adding users | Reduce critical sections (remember the USL) |
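The last row refers to the Universal Scalability Law (USL). It models relative capacity as C(N) = N / (1 + α(N - 1) + βN(N - 1)), where α captures contention (serialized work) and β captures the coherency cost of coordination between workers. The coefficients below are hypothetical; in practice they are fitted from load-test measurements at several concurrency levels.

```javascript
// Universal Scalability Law: relative capacity C(N) at concurrency N.
//   alpha = contention (serialized work), beta = coherency (crosstalk) cost
function uslCapacity(n, alpha, beta) {
  return n / (1 + alpha * (n - 1) + beta * n * (n - 1));
}

// Concurrency at which capacity peaks: N* = sqrt((1 - alpha) / beta)
function uslPeak(alpha, beta) {
  return Math.sqrt((1 - alpha) / beta);
}

const alpha = 0.05; // hypothetical coefficients, normally fitted from load-test data
const beta = 0.001;
for (const n of [1, 10, 31, 100]) {
  console.log(`N = ${n}: capacity ${uslCapacity(n, alpha, beta).toFixed(2)}x a single user`);
}
console.log(`peak near N = ${uslPeak(alpha, beta).toFixed(1)}`);
```

With these values, capacity peaks near N ≈ 31 and then falls, so at 100 concurrent users the system does less work than at 31. That is the signature of contention: throughput stops rising, or even drops, when you add users.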
The correct process is iterative: you find the bottleneck, resolve it, measure again... and a different bottleneck appears (because now the limit is elsewhere). Optimizing means uncovering successive limits. And an essential warning: do not optimize without measuring first. Intuition about where the problem lies is often wrong; the data from a load test is not.
Common Mistakes and Tips
- Looking only at the average. It hides the bad cases. Always measure p95 and p99.
- Confusing latency with throughput. They are different things: a system with low latency per request can have low throughput, and vice versa.
- Testing against an environment different from production. If the test environment has half the resources or toy data, the results are not extrapolable. Use a realistic environment.
- Not including realistic wait times. Virtual users that hammer the system without pauses (sleep) generate an unrealistic load. Model human behavior.
- Optimizing blindly. Guessing the bottleneck wastes effort. Measure, identify, and only then act.
- Forgetting soak tests. Many failures (memory leaks) only appear after hours. Short tests are not enough.
- Tip: integrate a load test into your CI/CD with thresholds in percentiles, so that performance is an automatic quality criterion and not a surprise in production.
Exercises
- Interpreting percentiles. A service has an average latency of 80 ms, p50 of 70 ms, p95 of 600 ms, and p99 of 2,000 ms. Is it healthy? What worries you and what would you investigate?
- Choosing the type of test. For each objective, indicate the type of test: (a) finding the number of users at which the system stops meeting the SLO; (b) checking whether there is a memory leak that brings down the service after 12 hours; (c) validating that the system survives the traffic peak of a product launch.
- Defining thresholds. The business requires that "99% of requests respond in less than half a second and that fewer than 0.5% fail". Write the corresponding k6 thresholds block.
Solutions
- It is not healthy, despite the good average and median. The p50 of 70 ms indicates that the typical experience is good, but the p95 of 600 ms and, above all, the p99 of 2,000 ms reveal a "long tail": 5% of requests take 600 ms or more, and 1% take 2 seconds or more. I would investigate what those slow requests have in common: unindexed database queries? garbage collection pauses? calls to a slow external service? Distributed traces (previous lesson) are the ideal tool to locate it.
- (a) Stress test (raise the load until you find the point where the SLO is breached or the system breaks). (b) Soak/endurance test (sustained load over many hours to detect slow degradation). (c) Spike test (sudden, abrupt rise in traffic).
- Thresholds block:

```javascript
thresholds: {
  http_req_duration: ['p(99)<500'], // p99 below 500 ms
  http_req_failed: ['rate<0.005'],  // fewer than 0.5% errors
}
```
Conclusion
You have learned that performance is measured, not assumed. Latency (time per request) and throughput (volume per second) are different quantities. The p95/p99 percentiles tell the truth that the average hides, and each type of test (load, stress, soak, spike) responds to a different risk. A real test takes a few lines of k6, with JMeter as the classic multi-protocol alternative, and bottlenecks are located iteratively, based on data. Validating performance before production is the difference between a successful launch and a public collapse.
With this lesson you complete Module 9: Quality, Security and Observability. You have covered scalability, high availability, security by design, observability, and performance: the five attributes that separate an application that "works on my machine" from a system that is robust, secure, and ready for the real world. These concepts are not phases to add at the end, but criteria that must guide every architecture decision from day one.
