Introduction
In this project, you will learn how to process real-time data using the Hadoop ecosystem. Real-time data processing is crucial for applications that require immediate insights and actions based on incoming data streams. We will use Apache Kafka for data ingestion, Apache Storm for real-time processing, and HDFS for storage.
Objectives
- Set up a real-time data processing pipeline.
- Ingest data using Apache Kafka.
- Process data in real-time using Apache Storm.
- Store processed data in HDFS.
- Analyze the stored data.
Prerequisites
- Basic understanding of Hadoop and its ecosystem.
- Familiarity with Apache Kafka and Apache Storm.
- Hadoop environment set up (HDFS, YARN).
- Java programming knowledge.
Step-by-Step Guide
Step 1: Setting Up Apache Kafka
- Download and Install Kafka:
  - Download Kafka from the official website.
  - Extract the downloaded file and navigate to the Kafka directory.
- Start ZooKeeper:
  bin/zookeeper-server-start.sh config/zookeeper.properties
- Start the Kafka Server:
  bin/kafka-server-start.sh config/server.properties
- Create a Kafka Topic:
  bin/kafka-topics.sh --create --topic real-time-data --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
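Before wiring up Storm, it can help to confirm that the topic exists and that messages flow through it. The commands below use the standard Kafka CLI tools shipped with the distribution; type a few test lines into the console producer and press Ctrl+C when you are done.

  bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  bin/kafka-console-producer.sh --topic real-time-data --bootstrap-server localhost:9092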
Step 2: Setting Up Apache Storm
- Download and Install Storm:
  - Download Storm from the official website.
  - Extract the downloaded file and navigate to the Storm directory.
- Start Storm Nimbus:
  bin/storm nimbus
- Start the Storm Supervisor:
  bin/storm supervisor
- Start the Storm UI:
  bin/storm ui
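Nimbus, the supervisor, and the UI all rely on the ZooKeeper instance started in Step 1. On a single machine the defaults usually already point at localhost, but if you need to adjust them, conf/storm.yaml is the place. The snippet below is a minimal sketch; the local directory path is an assumption and can be any writable location. Once the daemons are up, the Storm UI is available at http://localhost:8080 by default.

  storm.zookeeper.servers:
    - "localhost"
  nimbus.seeds: ["localhost"]
  storm.local.dir: "/tmp/storm"   # assumed scratch directory; use any writable path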
Step 3: Writing a Storm Topology
- Create a Maven Project:
  mvn archetype:generate -DgroupId=com.example -DartifactId=storm-topology -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
  cd storm-topology
- Add Dependencies to pom.xml:
  <dependencies>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>2.2.0</version>
      <!-- provided: the Storm cluster supplies storm-core at runtime -->
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.6.0</version>
    </dependency>
    <!-- Hadoop client libraries are needed by the HDFS bolt; match the version to your cluster -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>3.2.1</version>
    </dependency>
  </dependencies>
- Create a Kafka Spout:
  import java.time.Duration;
  import java.util.Collections;
  import java.util.Map;
  import java.util.Properties;

  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.storm.spout.SpoutOutputCollector;
  import org.apache.storm.task.TopologyContext;
  import org.apache.storm.topology.OutputFieldsDeclarer;
  import org.apache.storm.topology.base.BaseRichSpout;
  import org.apache.storm.tuple.Fields;
  import org.apache.storm.tuple.Values;

  public class KafkaSpout extends BaseRichSpout {
      private SpoutOutputCollector collector;
      private KafkaConsumer<String, String> consumer;

      @Override
      public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
          this.collector = collector;
          // Configure a Kafka consumer that reads the topic created in Step 1.
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");
          props.put("group.id", "storm-group");
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          consumer = new KafkaConsumer<>(props);
          consumer.subscribe(Collections.singletonList("real-time-data"));
      }

      @Override
      public void nextTuple() {
          // Poll Kafka and emit each record value as a one-field tuple.
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
          for (ConsumerRecord<String, String> record : records) {
              collector.emit(new Values(record.value()));
          }
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("message"));
      }
  }
- Create a Processing Bolt:
  import java.util.Map;

  import org.apache.storm.task.OutputCollector;
  import org.apache.storm.task.TopologyContext;
  import org.apache.storm.topology.OutputFieldsDeclarer;
  import org.apache.storm.topology.base.BaseRichBolt;
  import org.apache.storm.tuple.Fields;
  import org.apache.storm.tuple.Tuple;
  import org.apache.storm.tuple.Values;

  public class ProcessingBolt extends BaseRichBolt {
      private OutputCollector collector;

      @Override
      public void prepare(Map<String, Object> stormConf, TopologyContext context, OutputCollector collector) {
          this.collector = collector;
      }

      @Override
      public void execute(Tuple input) {
          String message = input.getStringByField("message");
          // Example processing: upper-case the incoming message.
          String processedMessage = message.toUpperCase();
          collector.emit(new Values(processedMessage));
          collector.ack(input);
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("processedMessage"));
      }
  }
- Create an HDFS Bolt:
  import java.io.IOException;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.storm.task.OutputCollector;
  import org.apache.storm.task.TopologyContext;
  import org.apache.storm.topology.OutputFieldsDeclarer;
  import org.apache.storm.topology.base.BaseRichBolt;
  import org.apache.storm.tuple.Tuple;

  public class HdfsBolt extends BaseRichBolt {
      private FileSystem fs;
      private Path filePath;
      private OutputCollector collector;

      @Override
      public void prepare(Map<String, Object> stormConf, TopologyContext context, OutputCollector collector) {
          this.collector = collector;
          Configuration conf = new Configuration();
          conf.set("fs.defaultFS", "hdfs://localhost:9000");
          try {
              fs = FileSystem.get(conf);
              filePath = new Path("/user/hadoop/real-time-data.txt");
              // Appending requires an existing file, so create it if it is not there yet.
              if (!fs.exists(filePath)) {
                  fs.createNewFile(filePath);
              }
          } catch (IOException e) {
              throw new RuntimeException("Failed to connect to HDFS", e);
          }
      }

      @Override
      public void execute(Tuple input) {
          String processedMessage = input.getStringByField("processedMessage");
          // Append each processed message as a new line in the HDFS file.
          try (FSDataOutputStream out = fs.append(filePath)) {
              out.writeBytes(processedMessage + "\n");
              collector.ack(input);
          } catch (IOException e) {
              collector.fail(input);
          }
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
          // This bolt is a sink and emits no tuples.
      }
  }
- Create the Topology:
  import org.apache.storm.Config;
  import org.apache.storm.StormSubmitter;
  import org.apache.storm.topology.TopologyBuilder;

  public class RealTimeProcessingTopology {
      public static void main(String[] args) throws Exception {
          TopologyBuilder builder = new TopologyBuilder();

          // Wire the components together: Kafka -> processing -> HDFS.
          builder.setSpout("kafka-spout", new KafkaSpout());
          builder.setBolt("processing-bolt", new ProcessingBolt()).shuffleGrouping("kafka-spout");
          builder.setBolt("hdfs-bolt", new HdfsBolt()).shuffleGrouping("processing-bolt");

          Config conf = new Config();
          conf.setDebug(true);

          StormSubmitter.submitTopology("real-time-processing-topology", conf, builder.createTopology());
      }
  }
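Before submitting to a real cluster, you may find it convenient to run the topology in Storm's in-process LocalCluster as a quick smoke test. The sketch below is a minimal example assuming the same spout, bolts, and configuration as above; the class name LocalTestRunner is purely illustrative. (For production use, note that Storm also ships ready-made Kafka and HDFS integrations, storm-kafka-client and storm-hdfs, which are more robust than the hand-rolled spout and bolt in this exercise.)

  import org.apache.storm.Config;
  import org.apache.storm.LocalCluster;
  import org.apache.storm.topology.TopologyBuilder;

  public class LocalTestRunner {
      public static void main(String[] args) throws Exception {
          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("kafka-spout", new KafkaSpout());
          builder.setBolt("processing-bolt", new ProcessingBolt()).shuffleGrouping("kafka-spout");
          builder.setBolt("hdfs-bolt", new HdfsBolt()).shuffleGrouping("processing-bolt");

          Config conf = new Config();
          conf.setDebug(true);

          // Run in-process for roughly a minute, then shut down cleanly.
          try (LocalCluster cluster = new LocalCluster()) {
              cluster.submitTopology("local-test", conf, builder.createTopology());
              Thread.sleep(60_000);
          }
      }
  }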
Step 4: Running the Topology
- Package the Project (see the note on bundling dependencies after this list):
  mvn package
- Submit the Topology:
  bin/storm jar target/storm-topology-1.0-SNAPSHOT.jar com.example.RealTimeProcessingTopology
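Note on packaging: the quickstart archetype builds a thin jar, so the Kafka and Hadoop client classes used by the spout and bolts are not included and the workers may fail with ClassNotFoundException. A common approach is to keep storm-core scoped as provided (the cluster already supplies it, as in the pom.xml above) and bundle the remaining dependencies with the Maven Shade plugin. The snippet below is a sketch of such a pom.xml addition; the plugin version shown is an assumption, so use whichever recent release you prefer.

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <!-- Version is an assumption; any recent release should work. -->
        <version>3.2.4</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>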
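Once the topology is submitted, you can confirm that it is running from the command line (or in the Storm UI) and stop it when you are done:

  bin/storm list
  bin/storm kill real-time-processing-topology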
Step 5: Analyzing the Data
- Access HDFS:
  hdfs dfs -cat /user/hadoop/real-time-data.txt
- Analyze the Data:
  - Use tools like Apache Hive or Apache Pig to query and analyze the stored data.
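As one possible approach with Hive, you could expose the output as an external table. The sketch below assumes the processed lines are first copied into their own HDFS directory (a Hive table location must be a directory, while the bolt above writes a single file); the directory and table names are illustrative only.

  hdfs dfs -mkdir -p /user/hadoop/real-time-data
  hdfs dfs -cp /user/hadoop/real-time-data.txt /user/hadoop/real-time-data/

  CREATE EXTERNAL TABLE real_time_messages (message STRING)
  LOCATION '/user/hadoop/real-time-data';

  SELECT COUNT(*) FROM real_time_messages;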
Conclusion
In this project, you have successfully set up a real-time data processing pipeline using Apache Kafka, Apache Storm, and HDFS. You learned how to ingest data in real-time, process it, and store the results in HDFS for further analysis. This project provides a foundation for building more complex real-time data processing applications using the Hadoop ecosystem.