Introduction
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It is designed to store large datasets reliably and to stream those datasets at high bandwidth to user applications. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
Key Concepts
- Architecture
  - NameNode: The master server that manages the file system namespace and regulates client access to files.
  - DataNode: Worker nodes that store blocks and serve read and write requests from clients, performing block operations on instruction from the NameNode.
  - Secondary NameNode: A helper that periodically checkpoints the namespace by merging the edit log into the file system image; despite its name, it is not a standby NameNode.
- File System Namespace
- HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories.
- The file system namespace hierarchy is similar to most other existing file systems: it supports operations such as creating and deleting files, renaming files, and directories.
- Data Replication
  - HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size.
  - Blocks are replicated for fault tolerance. The block size and replication factor are configurable per file (see the replication sketch after this list).
- Fault Tolerance
  - HDFS is designed to detect faults and recover from them automatically.
  - Data is replicated across multiple DataNodes to ensure data availability even if some nodes fail.
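The namespace operations above map directly onto Hadoop's Java FileSystem API. Below is a minimal sketch, assuming a cluster whose configuration is on the classpath; the /user/hadoop/demo path and file names are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOpsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hadoop/demo");     // hypothetical directory
        hdfs.mkdirs(dir);                             // create a directory
        Path src = new Path(dir, "a.txt");
        hdfs.create(src).close();                     // create an empty file
        hdfs.rename(src, new Path(dir, "b.txt"));     // rename within the namespace
        hdfs.delete(dir, true);                       // recursive delete
        hdfs.close();
    }
}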
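Because the replication factor is a per-file setting, a client can also change it after a file has been written. A hedged sketch using FileSystem.setReplication; the path and the target factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/example.txt"); // illustrative path
        // Ask the NameNode to set this file's replication factor to 3;
        // the extra replicas are copied between DataNodes in the background.
        boolean scheduled = hdfs.setReplication(file, (short) 3);
        System.out.println("Replication change scheduled: " + scheduled);
        hdfs.close();
    }
}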
HDFS Architecture
NameNode and DataNode Interaction
- The NameNode maintains the file system namespace and the metadata for all the files and directories.
- The DataNodes are responsible for serving read and write requests from the file system’s clients.
- The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Block Management
- Files are split into one or more blocks, and these blocks are stored in a set of DataNodes.
- The NameNode maintains the mapping of blocks to DataNodes.
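A client can observe this block-to-DataNode mapping through FileSystem.getFileBlockLocations. A minimal sketch, assuming the file already exists; the path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FileStatus status = hdfs.getFileStatus(new Path("/user/hadoop/example.txt"));
        // Ask the NameNode which DataNodes hold each block of the file
        BlockLocation[] blocks = hdfs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        hdfs.close();
    }
}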
Heartbeats and Block Reports
- DataNodes send periodic heartbeats to the NameNode to confirm their availability.
- DataNodes also send block reports to the NameNode to provide information about the blocks they are storing.
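The heartbeat protocol itself has no public client API, but when the default file system is HDFS, a client can ask the NameNode for the DataNode view it builds from those heartbeats and block reports. A sketch, assuming fs.defaultFS points at an HDFS cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeReportSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The cast only succeeds when the default file system is HDFS
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        // The NameNode's current view of the DataNodes
        for (DatanodeInfo dn : hdfs.getDataNodeStats()) {
            System.out.println(dn.getHostName() + " capacity=" + dn.getCapacity()
                    + " used=" + dn.getDfsUsed());
        }
        hdfs.close();
    }
}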
Practical Examples
Writing a File to HDFS
- Client Request: A client requests to write a file to HDFS.
- NameNode Interaction: The client contacts the NameNode to get the list of DataNodes to store the file blocks.
- DataNode Interaction: The client writes the file blocks to the DataNodes.
- Replication: The DataNodes replicate the blocks to other DataNodes as per the replication factor.
// Example: Writing a file to HDFS using the Java API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.OutputStream;

public class HDFSWriteExample {
    public static void main(String[] args) {
        try {
            // Load core-site.xml / hdfs-site.xml from the classpath
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/example.txt");
            // create() asks the NameNode for target DataNodes, then streams to them
            OutputStream os = hdfs.create(file);
            os.write("Hello HDFS!".getBytes());
            os.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Reading a File from HDFS
- Client Request: A client requests to read a file from HDFS.
- NameNode Interaction: The client contacts the NameNode to get the list of DataNodes that store the file blocks.
- DataNode Interaction: The client reads the file blocks from the DataNodes.
// Example: Reading a file from HDFS using the Java API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;

public class HDFSReadExample {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/example.txt");
            // open() fetches block locations from the NameNode,
            // then reads the blocks directly from the DataNodes
            InputStream is = hdfs.open(file);
            byte[] buffer = new byte[256];
            int bytesRead = is.read(buffer);
            while (bytesRead > 0) {
                System.out.write(buffer, 0, bytesRead);
                bytesRead = is.read(buffer);
            }
            System.out.flush(); // System.out.write does not flush on its own
            is.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Exercises
Exercise 1: Write a File to HDFS
Task: Write a Java program to create a new file in HDFS and write the text "Hello, Hadoop!" into it.
Solution:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.OutputStream;

public class HDFSWriteExercise {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/exercise.txt");
            // Create the file and write the greeting as raw bytes
            OutputStream os = hdfs.create(file);
            os.write("Hello, Hadoop!".getBytes());
            os.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Exercise 2: Read a File from HDFS
Task: Write a Java program to read the content of the file created in Exercise 1 from HDFS and print it to the console.
Solution:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;

public class HDFSReadExercise {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/exercise.txt");
            // Read the file back in chunks and echo it to the console
            InputStream is = hdfs.open(file);
            byte[] buffer = new byte[256];
            int bytesRead = is.read(buffer);
            while (bytesRead > 0) {
                System.out.write(buffer, 0, bytesRead);
                bytesRead = is.read(buffer);
            }
            System.out.flush(); // System.out.write does not flush on its own
            is.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Summary
In this section, we covered the basics of HDFS, including its architecture, key components, and how it handles data storage and replication. We also provided practical examples of writing and reading files to and from HDFS using Java. Understanding HDFS is crucial for working with Hadoop, as it forms the backbone of data storage in the Hadoop ecosystem. In the next module, we will delve deeper into the MapReduce framework, which is used for processing the data stored in HDFS.