Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows you to express your streaming computations the same way you would express a batch computation on static data. This module will cover the key concepts, practical examples, and exercises to help you understand and implement Structured Streaming in Apache Spark.

Key Concepts

  1. Streaming Data Sources:

    • Kafka (a minimal Kafka read is sketched after this list)
    • File sources (e.g., CSV, JSON, Parquet)
    • Socket sources
  2. Streaming DataFrames and Datasets:

    • Unified API for batch and streaming data
    • Schema enforcement
  3. Event Time and Watermarking:

    • Handling late data
    • Watermarking to manage state
  4. Output Modes:

    • Append
    • Complete
    • Update
  5. Stateful Operations:

    • Aggregations
    • Window operations
  6. Fault Tolerance:

    • Checkpointing
    • Exactly-once semantics
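
As a minimal sketch of item 1, here is how a Kafka topic can be read as a streaming DataFrame; the broker address and topic name are assumptions, and the spark-sql-kafka connector package must be available on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaSourceSketch").getOrCreate()

# Subscribe to a Kafka topic (broker address and topic name are example values)
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Kafka delivers key and value as binary, so cast the value to a readable string
messages = events.selectExpr("CAST(value AS STRING) AS message")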

Practical Example

Setting Up a Streaming Data Source

Let's start with a simple example where we read streaming data from a socket source.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Create a Spark session
spark = SparkSession.builder \
    .appName("StructuredStreamingExample") \
    .getOrCreate()

# The socket source produces a fixed schema: a single string column named "value",
# so no explicit schema definition is required here

# Read streaming data from a socket source
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Start running the query that prints the running counts to the console
query = words.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Explanation

  1. Spark Session: We start by creating a Spark session.
  2. Schema: The socket source provides a fixed schema, a single string column named value, so no explicit schema is defined.
  3. Read Stream: Use readStream to read data from a socket source.
  4. Transformations: Split each line into words with split and explode.
  5. Write Stream: Write the resulting words to the console in append mode.
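
To test this locally, start a simple text server in another terminal with Netcat (nc -lk 9999) and type a few lines; the resulting words are printed to the console after each micro-batch.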

Handling Late Data with Watermarking

Watermarking lets you handle late-arriving data by declaring how far behind the latest observed event time an event may still arrive; anything later can be dropped, and Spark can discard the corresponding aggregation state.

from pyspark.sql.functions import window

# Read streaming data from a socket source, attaching an ingestion timestamp
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", "true") \
    .load()

# Split the lines into words, keeping the timestamp column
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word"),
    lines.timestamp
)

# Generate running word counts over sliding windows with a watermark
# (withWatermark must be applied before the aggregation)
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word
    ).count()

# Start running the query that prints the running counts to the console
query = windowedCounts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()

Explanation

  1. Timestamps: The includeTimestamp option attaches an ingestion timestamp to each line; the window and the watermark both operate on this column.
  2. Windowing: Group the data into 10-minute windows that slide every 5 minutes.
  3. Watermarking: Apply withWatermark before the aggregation to declare that data arriving more than 10 minutes behind the latest event time seen may be dropped, allowing Spark to clean up old window state.

Exercises

Exercise 1: Reading from a File Source

Read streaming data from a directory containing JSON files and print the results to the console.

Solution:

# Define a schema matching the JSON files (the field names here are only an example)
schema = "name STRING, age INT"

# Read streaming data from a directory of JSON files
jsonDF = spark.readStream \
    .format("json") \
    .schema(schema) \
    .load("/path/to/json/files")

# Start running the query that prints the results to the console
query = jsonDF.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Exercise 2: Aggregating Data

Read streaming data from a socket source, split the lines into words, and count the occurrences of each word.

Solution:

# Read streaming data from a socket source
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

Common Mistakes and Tips

  • Schema Mismatch: Ensure that the schema of the streaming data matches the expected schema.
  • Output Mode: Choose the correct output mode (append, complete, update) based on your use case.
  • Checkpointing: Always set a checkpoint location (the checkpointLocation option) so the query can recover from failures with exactly-once guarantees; a sketch follows this list.
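
As a sketch, checkpointing is enabled by pointing the query at a reliable checkpoint directory; the path below is an example, and wordCounts is the aggregated DataFrame from Exercise 2.

# Enable checkpointing by supplying a directory where query progress and state are saved
query = wordCounts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/wordcount") \
    .start()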

Conclusion

In this section, we covered the basics of Structured Streaming in Apache Spark, including setting up streaming data sources, handling late data with watermarking, and performing stateful operations. We also provided practical examples and exercises to reinforce the concepts. In the next module, we will delve into Spark MLlib for machine learning with Spark.
