The Project | About Us | Contribute | Donations | License

HOME

Throughout the module we have treated events as discrete units: an order is created, a payment is confirmed. But there is a class of problems in which events arrive continuously and at high speed —user clicks, IoT sensor readings, card transactions, log lines— and the value lies in processing them as they flow, without waiting to accumulate them. This is real-time data streaming: treating data as an infinite (unbounded) flow over which we apply transformations, aggregations, and pattern detection within milliseconds.

Streaming is the foundation of use cases such as instant fraud detection, live dashboards, real-time recommendations, or systems monitoring. In this lesson you will understand the fundamental difference between batch processing and stream processing, the role of temporal windows, and how Kafka Streams lets us implement all of this with relatively simple code.

Batch vs Streaming: the fundamental difference
Stream processing concepts
Windows: grouping time
Kafka Streams: processing on top of Kafka
Example: window-based fraud detection
Real-world use cases
Common mistakes and tips
Exercises and solutions
Conclusion

Batch vs Streaming: the fundamental difference

Batch processing operates on a finite and complete dataset (for example, "all of yesterday's sales") at a scheduled moment. Stream processing operates on an infinite flow of events as they arrive.

Aspect	Batch	Streaming
Data	Finite, bounded	Infinite, continuous
Latency	High (minutes/hours)	Low (ms/seconds)
When it processes	At a scheduled moment	As the events arrive
View of the data	Complete	Partial (what has been seen so far)
Example	Monthly payroll, daily report	Fraud detection, live alerts
Typical tools	Spark batch, nightly ETL	Kafka Streams, Flink, Spark Streaming

The key phrase: in batch you have all the data before starting; in streaming you never have all of it, because one more can always arrive. This forces you to reason differently, especially about time.

Stream processing concepts

Before programming, let's set the vocabulary:

Stream: an infinite sequence of events ordered (approximately) in time.
Stateless transformations: they process each event independently. Examples: filter (discard), map (transform).
Stateful transformations: they need to remember information from previous events. Examples: count, sum, aggregate by key. They require a state store.
Event time vs processing time:
- Event time: when the event occurred (timestamp in the message itself).
- Processing time: when the system processes it.
- They can differ greatly (an event from a phone without coverage can arrive minutes late). Serious processing is based on event time.
Late data: events that arrive later than expected. The system must decide whether to incorporate or discard them.

Windows: grouping time

Since the flow is infinite, we cannot "sum everything". Instead, we group the events into temporal windows and compute over each window. There are three main types:

flowchart TB
    subgraph Tumbling["Tumbling (fixed, non-overlapping)"]
        direction LR
        T1["[0-5min]"] --- T2["[5-10min]"] --- T3["[10-15min]"]
    end
    subgraph Hopping["Hopping/Sliding (overlapping)"]
        direction LR
        H1["[0-5min]"] --- H2["[2-7min]"] --- H3["[4-9min]"]
    end
    subgraph Session["Session (by activity)"]
        direction LR
        S1["activity...gap...new session"]
    end

Window type	Description	Use case
Tumbling (fixed)	Fixed intervals without overlap. Each event falls into a single window	"Sales every 5 minutes"
Hopping / Sliding	Fixed intervals that overlap (advance by hops)	"Moving average of the last 5 min, updated every minute"
Session	Closes after a period of inactivity (gap)	"A user's activity until they stop interacting"

Example: to detect fraud we look for "more than 3 payments in 1 minute from the same card". That is a stateful aggregation over a 1-minute window grouped by card.

Kafka Streams: processing on top of Kafka

Kafka Streams is a Java library that consumes from Kafka topics, processes the data (filtering, aggregation, windows, joins), and writes the results to other topics. It does not need a separate cluster: it is code you deploy as a normal application and that scales by adding instances.

Concepts of its API:

KStream: represents an event flow (each record is an independent fact).
KTable: represents a state table (each key has its latest value; new records update it).
State store: a local store (backed in Kafka) where stateful aggregations are kept, fault-tolerant.

# Minimal configuration of a Kafka Streams app
application.id: fraud-detector          # identifies the app and its state stores
bootstrap.servers: localhost:9092       # Kafka cluster
default.key.serde: ...StringSerde       # how to serialize keys
processing.guarantee: exactly_once_v2   # processing guarantee

application.id is critical: it identifies the application and names its internal topics and state stores. Two instances with the same application.id form a group and share the work.
processing.guarantee: exactly_once_v2 enables exactly-once processing within Kafka (between topics), relying on transactions.

Example: window-based fraud detection

Let's detect cards with more than 3 payments within a 1-minute window:

StreamsBuilder builder = new StreamsBuilder();

// 1. We read the payments flow from the input topic
KStream<String, Payment> payments = builder.stream("payments");

payments
    // 2. We re-group by card (the key becomes the card)
    .groupBy((key, payment) -> payment.cardId(),
             Grouped.with(Serdes.String(), paymentSerde))
    // 3. We define a 1-minute TUMBLING window
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    // 4. We count the payments per card within each window
    .count(Materialized.as("payment-count-per-card"))
    // 5. We convert the result table back into a flow
    .toStream()
    // 6. We keep only those that exceed the threshold
    .filter((cardWindow, count) -> count > 3)
    // 7. We emit an alert to the output topic
    .map((cardWindow, count) ->
        KeyValue.pair(cardWindow.key(),
                      new FraudAlert(cardWindow.key(), count)))
    .to("fraud-alerts", Produced.with(Serdes.String(), alertSerde));

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

Step-by-step explanation:

builder.stream("payments") creates a KStream that reads each payment from the input topic.
groupBy re-groups by cardId: now the events of the same card share a key and are processed together. It is the basis of any stateful aggregation.
windowedBy(TimeWindows.ofSize... 1 minute) defines tumbling windows of one minute. withNoGrace means we do not wait for late data.
count(...) keeps, in a fault-tolerant state store, how many payments each card has in the current window.
toStream() converts the KTable of counts into a change stream so we can keep processing.
filter(... count > 3) lets through only the card-window combinations that exceed the fraud threshold.
map builds a FraudAlert and to("fraud-alerts") publishes it to the output topic, where another service can block the card or notify the customer.

The result is a fraud detector that reacts in real time, without nightly batches.

Real-world use cases

Fraud detection: anomalous transaction patterns instantly (like the example).
Monitoring and alerting: detect service outages or error spikes in live logs.
Real-time dashboards: sales per minute, users active right now.
Recommendations: adjust suggestions based on the user's current browsing.
IoT and telemetry: aggregate readings from thousands of sensors and detect thresholds (temperature, vibration).
Event enrichment: join a stream of orders with a customer table to add data on the fly.

Common Mistakes and Tips

Confusing event time with processing time. If you aggregate by processing time, a delayed event falls into the wrong window and the results are incorrect. Base the windows on event time whenever the order matters.
Ignoring late data. Explicitly decide your policy: grace period, discard, or reprocess. Not deciding is deciding badly.
Forgetting that state grows. Stateful aggregations consume memory/disk. Use windows with bounded retention and clean up stale state.
Expecting "exactly-once" end to end. Kafka Streams guarantees it between Kafka topics, but when writing to external systems you again need idempotency (lesson 05-02).
Using streaming when a batch would have sufficed. If a latency of hours is acceptable, a batch is simpler and cheaper. Streaming adds operational complexity.
Tip: start with stateless transformations (filter, map); introduce state and windows only when the use case requires it.

Exercises

For each case, indicate whether you would use batch or streaming and why: (a) monthly closing accounting report; (b) alert when a sensor exceeds 80 °C; (c) nightly calculation of commissions; (d) counter of currently connected users.
You want to compute a "moving average of temperature over the last 10 minutes, updated every 2 minutes". What type of window would you use and why?
Explain in your own words the difference between KStream and KTable and give an example of data suitable for each one.

Solutions

(a) Batch: finite data for the month, no urgency. (b) Streaming: requires an immediate reaction. (c) Batch: scheduled nightly process over complete data. (d) Streaming: live metric that changes continuously.
A hopping/sliding window of size 10 minutes with a hop of 2 minutes. It is sliding because the windows overlap: every 2 minutes a new result is emitted that covers the last 10 minutes, which is exactly what a moving average requires.
A KStream models a flow of independent facts (each record is an event that adds to the previous ones), for example each payment or each click. A KTable models a state per key where each new record updates the previous value of that key, for example the current balance of each account or the latest price of each product.

Conclusion

Data streaming treats events as an infinite flow and processes them as they arrive, achieving latencies of milliseconds versus the hours of batch. We learned to reason about time (event time vs processing time), to group events into windows (tumbling, sliding, and session), and to implement stateful aggregations in Kafka Streams through a real fraud detector. We also saw when streaming is worthwhile and when a simpler batch is enough.

With this lesson we close Module 5: Event-Driven Architectures and Messaging. You have traveled the complete path: from the fundamentals of events and their asynchronous messaging, through storage with Event Sourcing and CQRS, the coordination of distributed transactions with the Saga pattern, to real-time stream processing. These patterns constitute the backbone of modern distributed systems and prepare you to design scalable, resilient, and reactive architectures.

Real-Time Data Streaming

Contents

Batch vs Streaming: the fundamental difference

Stream processing concepts

Windows: grouping time

Kafka Streams: processing on top of Kafka

Example: window-based fraud detection

Real-world use cases

Common Mistakes and Tips

Exercises

Solutions

Conclusion

Application Architecture Course

Module 1: Fundamentals of Application Architecture

Module 2: Design Principles and Tactics

Module 3: Architectural Styles and Patterns

Module 4: Distributed Architectures and Microservices

Module 5: Event-Driven Architectures and Messaging

Module 6: Domain-Driven Design (DDD)

Module 7: Data and Persistence

Module 8: Cloud Architecture and Deployment

Module 9: Quality, Security and Observability

Module 10: Evolution, Governance and Case Studies