Introduction

In this section, we explore the MapReduce programming model and the Hadoop ecosystem, which together provide a way to process large datasets in a distributed computing environment.

Objectives

  • Understand the basic concepts of MapReduce.
  • Learn how Hadoop implements the MapReduce model.
  • Explore practical examples of MapReduce jobs.
  • Gain hands-on experience with Hadoop.

Basic Concepts of MapReduce

MapReduce is a programming model for processing large datasets with a distributed algorithm on a cluster. It consists of two main functions:

  1. Map Function: Processes input data and produces a set of intermediate key-value pairs.
  2. Reduce Function: Merges all intermediate values associated with the same intermediate key.

Example: Word Count

Let's consider a simple example of counting the number of occurrences of each word in a large text file.

Map Function

The map function takes a line of text as input and outputs key-value pairs, where the key is a word and the value is 1.

def map_function(line):
    words = line.split()
    for word in words:
        yield (word, 1)
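
For example, feeding a single line into this function (the sample text is arbitrary) yields one pair per word:

>>> list(map_function("to be or not to be"))
[('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]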

Reduce Function

The reduce function takes a key and a list of values and outputs the sum of the values.

def reduce_function(word, counts):
    yield (word, sum(counts))
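
After shuffling, the reducer receives each word together with all of its 1s. For instance, the grouped counts for "to" from the sample line above would reduce as follows:

>>> list(reduce_function("to", [1, 1]))
[('to', 2)]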

Execution Flow

  1. Splitting: The input data is divided into chunks (input splits), each handled by a separate map task.
  2. Mapping: The map function processes each chunk and produces intermediate key-value pairs.
  3. Shuffling: The intermediate key-value pairs are sorted and grouped by key, so that all values for the same key reach the same reducer.
  4. Reducing: The reduce function processes each group and produces the final output. A small Python sketch of the whole flow follows this list.
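
To make the flow concrete, here is a minimal, single-machine simulation in plain Python that reuses map_function and reduce_function from above. It is only an illustration of the four stages; Hadoop performs the same steps in a distributed, fault-tolerant way across a cluster.

from itertools import groupby
from operator import itemgetter

def run_word_count(lines):
    # Splitting: here each element of `lines` plays the role of one input split.
    # Mapping: apply the map function to every input line.
    intermediate = []
    for line in lines:
        intermediate.extend(map_function(line))

    # Shuffling: sort by key so identical words are adjacent, then group them,
    # mimicking how Hadoop routes all values for a key to one reducer.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))

    # Reducing: sum the counts for each word.
    results = []
    for word, pairs in grouped:
        results.extend(reduce_function(word, (count for _, count in pairs)))
    return results

print(run_word_count(["to be or not to be", "to be is to do"]))
# [('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1), ('to', 4)]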

Hadoop and MapReduce

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using the MapReduce programming model.

Key Components of Hadoop

  1. Hadoop Distributed File System (HDFS): A distributed file system that stores data as replicated blocks across multiple machines, providing fault tolerance.
  2. YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules the tasks of running applications.
  3. MapReduce Engine: Executes MapReduce jobs as applications on top of YARN.

Setting Up Hadoop

To set up Hadoop, follow these steps:

  1. Install Java: Hadoop requires a compatible Java runtime (JDK).
  2. Download Hadoop: Obtain a release from the Apache Hadoop website.
  3. Configure Hadoop: Edit configuration files such as core-site.xml, hdfs-site.xml, and mapred-site.xml (a minimal example follows this list).
  4. Format HDFS: Run hdfs namenode -format once before the first start.
  5. Start Hadoop: Use the start-dfs.sh and start-yarn.sh scripts to start HDFS and YARN.
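
As an illustration, a minimal core-site.xml for a single-node (pseudo-distributed) setup usually just points the default file system at a local HDFS instance; the host and port shown here are the conventional defaults and may differ in your installation.

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>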

Running a MapReduce Job on Hadoop

  1. Write the MapReduce Program: Implement the map and reduce functions in Java, or in Python or another language via Hadoop Streaming.
  2. Compile the Program: If using Java, compile the classes and package them into a JAR file.
  3. Submit the Job: Use the hadoop jar command to submit the job to the Hadoop cluster (a complete example follows the WordCount program below).

Example: Word Count in Hadoop

Mapper Class (Java)

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

Reducer Class (Java)

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Driver Class (Java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
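
As a rough guide, the three classes above can be compiled against the Hadoop libraries, packaged into a JAR, and submitted with the hadoop jar command. The JAR name, input file, and HDFS paths below are placeholders, and the output directory must not exist before the job runs.

javac -classpath "$(hadoop classpath)" TokenizerMapper.java IntSumReducer.java WordCount.java
jar cf wordcount.jar *.class
hdfs dfs -mkdir -p /input
hdfs dfs -put sample.txt /input
hadoop jar wordcount.jar WordCount /input /output
hdfs dfs -cat /output/part-r-00000

Here part-r-00000 is the conventional name of the first reducer's output file in the job's output directory.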

Practical Exercise

Exercise: Implement a MapReduce job to count the number of occurrences of each word in a text file using Hadoop.

  1. Write the Mapper and Reducer classes in Java.
  2. Compile the classes and create a JAR file.
  3. Submit the job to a Hadoop cluster.
  4. Verify the output.

Solution: Follow the example provided above for the Word Count program.

Common Mistakes and Tips

  • Configuration Issues: Ensure core-site.xml, hdfs-site.xml, and mapred-site.xml are set up consistently; misconfigured files are a common cause of startup failures.
  • Data Splitting: Understand how the input is divided into splits, since this determines how many map tasks run in parallel.
  • Resource Management: Monitor memory and CPU usage in YARN to avoid bottlenecks when running large jobs.

Conclusion

In this section, we covered the basics of the MapReduce programming model and how Hadoop implements it. We also walked through a practical example of a Word Count program and provided a hands-on exercise to reinforce the concepts. Understanding MapReduce and Hadoop is crucial for processing large datasets efficiently in a distributed environment.
