Introduction
In this project, you will learn how to analyze a large dataset using Hadoop. This hands-on project will guide you through the process of setting up a Hadoop environment, loading data into HDFS, and performing data analysis using MapReduce. By the end of this project, you will have a solid understanding of how to leverage Hadoop for big data analysis.
Objectives
- Set up a Hadoop environment.
- Load a large dataset into HDFS.
- Write and execute a MapReduce job to analyze the data.
- Interpret the results of the analysis.
Prerequisites
- Basic understanding of Hadoop and its components.
- Familiarity with HDFS and MapReduce.
- Java programming knowledge (for writing MapReduce jobs).
Step-by-Step Guide
Step 1: Setting Up the Hadoop Environment
- Install Hadoop: Follow the instructions in Module 1, Section 4 to set up your Hadoop environment.
- Verify Installation: Ensure that Hadoop is correctly installed by running the following command:
hadoop version
You should see the Hadoop version information displayed.
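If the version information does not appear, the checks below can help narrow down the problem. This is a minimal sketch assuming a standard single-node installation set up as in Module 1, Section 4; adjust paths to your environment.

```bash
# Confirm the hadoop launcher is on your PATH.
which hadoop

# Hadoop needs a working Java installation; confirm the JDK is visible.
java -version

# HADOOP_HOME should point at your installation directory if you exported it
# during setup.
echo $HADOOP_HOME
```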
Step 2: Loading Data into HDFS
- Download the Dataset: For this project, we will use a sample dataset (sample-dataset.csv). Download it from the link provided with the course materials.
- Start HDFS: Start the HDFS service using the following command (if your installation runs MapReduce on YARN, also start it with start-yarn.sh):
start-dfs.sh
- Create a Directory in HDFS: Create a directory in HDFS to store the dataset:
hdfs dfs -mkdir -p /user/hadoop/project1
- Upload the Dataset to HDFS: Upload the downloaded dataset to the HDFS directory:
hdfs dfs -put sample-dataset.csv /user/hadoop/project1/
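Before moving on, it is worth confirming that the HDFS daemons are running and that the file landed where you expect. A quick sketch, assuming the single-node setup, paths, and file name used above:

```bash
# The NameNode and DataNode processes should appear in the JVM process list.
jps

# List the project directory; sample-dataset.csv should show up with its size.
hdfs dfs -ls /user/hadoop/project1/

# Peek at the first few lines of the uploaded file to confirm it is readable.
hdfs dfs -cat /user/hadoop/project1/sample-dataset.csv | head -n 5
```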
Step 3: Writing a MapReduce Job
- Create a Java Project: Create a new Java project in your preferred IDE.
- Add Hadoop Libraries: Add the Hadoop libraries to your project. In Hadoop 2.x/3.x these jars live under the share/hadoop directory of your Hadoop installation (older 1.x releases used a lib directory); the sketch at the end of this step shows one way to locate them.
- Write the Mapper Class: Create a Mapper class to process the input data. Here is an example of a Mapper class that counts the occurrences of each word in the dataset:
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        // Split the line on whitespace and emit (word, 1) for each token.
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}
```
- Write the Reducer Class: Create a Reducer class to aggregate the results from the Mapper:
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts emitted for the same word arrive together;
        // add them up and emit (word, total).
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
- Write the Driver Class: Create a Driver class to configure and run the MapReduce job:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer can double as a combiner because summing counts
        // is associative and commutative.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path in HDFS; args[1] is the output directory,
        // which must not already exist.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
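As referenced in the Add Hadoop Libraries step above, your project needs the Hadoop client jars on its classpath. A minimal sketch for locating them, assuming a standard Hadoop 3.x layout with HADOOP_HOME set:

```bash
# Print the full classpath the hadoop command itself uses; these are the jars
# to add to your IDE project.
hadoop classpath

# In Hadoop 2.x/3.x the core jars live under share/hadoop.
ls $HADOOP_HOME/share/hadoop/common/*.jar
ls $HADOOP_HOME/share/hadoop/mapreduce/*.jar
```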
Step 4: Running the MapReduce Job
- Compile the Code: Compile the Java code and create a JAR file.
```bash
# Create the output directory for the compiled classes, then compile against
# the jars on the Hadoop classpath.
mkdir -p wordcount_classes
javac -classpath `hadoop classpath` -d wordcount_classes \
    WordCountMapper.java WordCountReducer.java WordCountDriver.java

# Package the compiled classes into a runnable JAR.
jar -cvf wordcount.jar -C wordcount_classes/ .
```
- Run the Job: Execute the MapReduce job using the following command:
hadoop jar wordcount.jar WordCountDriver /user/hadoop/project1/sample-dataset.csv /user/hadoop/project1/output
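A couple of practical checks, assuming the file names and paths used above: you can list the JAR contents to confirm the three classes were packaged, and if you re-run the job you must first remove the output directory, because MapReduce will not overwrite an existing one.

```bash
# Verify the three compiled classes made it into the JAR.
jar -tf wordcount.jar

# MapReduce refuses to write into an existing output directory; remove it
# before re-running the job.
hdfs dfs -rm -r /user/hadoop/project1/output
```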
Step 5: Analyzing the Results
- View the Output: Once the job is complete, view the output stored in HDFS:
hdfs dfs -cat /user/hadoop/project1/output/part-r-00000
- Interpret the Results: The output lists each word alongside the number of times it appears in the dataset. Sorting by count (see the sketch after this list) highlights the most frequent terms and gives a first feel for the data.
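The part files are sorted by word, not by count. A minimal sketch for pulling out the most frequent words, assuming the output path used above:

```bash
# Concatenate all reducer output files, sort numerically by the count column
# (descending), and show the 20 most frequent words.
hdfs dfs -cat /user/hadoop/project1/output/part-r-* | sort -k2 -nr | head -n 20
```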
Conclusion
In this project, you have learned how to set up a Hadoop environment, load data into HDFS, write and execute a MapReduce job, and analyze the results. This hands-on experience will help you understand the practical aspects of using Hadoop for big data analysis. In the next project, you will build a data pipeline using various tools from the Hadoop ecosystem.