Introduction

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It is designed to store large datasets reliably and to stream them at high bandwidth to user applications. HDFS is highly fault-tolerant and is designed to be deployed on low-cost commodity hardware.

Key Concepts of HDFS Architecture

  1. Blocks

  • Definition: HDFS splits large files into smaller fixed-size blocks (default 128 MB).
  • Purpose: This allows HDFS to store large files across multiple nodes in a cluster.
  • Example: A 512 MB file will be split into four 128 MB blocks.
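
The block arithmetic above can be sketched in Python as a quick sanity check (a minimal illustration only; HDFS performs the actual splitting itself):

```python
import math

def block_count(file_size_mb: int, block_size_mb: int = 128) -> int:
    """Number of HDFS blocks needed for a file of the given size.
    The last block may be partial: HDFS does not pad it to a full
    block, so a 200 MB file occupies only 200 MB (before replication)."""
    return math.ceil(file_size_mb / block_size_mb)

print(block_count(512))  # 4 full 128 MB blocks
print(block_count(200))  # 1 full block + 1 partial 72 MB block = 2
```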

  2. NameNode

  • Role: The master server that manages the file system namespace and regulates access to files by clients.
  • Responsibilities:
    • Maintains the directory tree of all files in the file system.
    • Tracks the location of blocks across DataNodes.
    • Handles metadata operations like opening, closing, and renaming files and directories.

  3. DataNode

  • Role: The worker nodes that store the actual data.
  • Responsibilities:
    • Serve read and write requests from the file system’s clients.
    • Perform block creation, deletion, and replication upon instruction from the NameNode.

  4. Secondary NameNode

  • Role: Assists the primary NameNode.
  • Responsibilities:
    • Periodically merges the namespace image (fsimage) with the edit log to prevent the edit log from growing without bound.
    • Acts as a checkpointing mechanism but not a failover NameNode.

  5. Replication

  • Definition: HDFS stores multiple copies of data blocks to ensure reliability and fault tolerance.
  • Default Replication Factor: 3 (configurable via the dfs.replication property).
  • Example: Each block of a file is replicated on three different DataNodes.

  6. Rack Awareness

  • Definition: HDFS is aware of the rack topology of the cluster.
  • Purpose: To improve data reliability, availability, and network bandwidth utilization.
  • Example: With the default policy and a replication factor of 3, HDFS places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack, so the data remains available even if an entire rack fails.
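
The default placement policy can be sketched as a toy simulation (the rack and node names here are hypothetical, and real HDFS derives the topology from a configured rack-awareness script):

```python
import random

# Hypothetical cluster topology: rack name -> DataNodes on that rack
TOPOLOGY = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}

def place_replicas(writer_node: str, writer_rack: str) -> list:
    """Sketch of HDFS's default 3-replica placement: first replica on
    the writer's node, second on a node in a different rack, third on
    another node in that same remote rack."""
    remote_rack = next(r for r in TOPOLOGY if r != writer_rack)
    second = random.choice(TOPOLOGY[remote_rack])
    third = random.choice([n for n in TOPOLOGY[remote_rack] if n != second])
    return [(writer_node, writer_rack),
            (second, remote_rack),
            (third, remote_rack)]

replicas = place_replicas("dn1", "rack1")
print(replicas)  # replicas span two racks, surviving a full rack failure
```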

HDFS Architecture Diagram

+-------------------+
|     NameNode      |
| (Master Node)     |
+-------------------+
         |
         | Metadata Operations
         |
+-------------------+   +-------------------+   +-------------------+
|    DataNode 1     |   |    DataNode 2     |   |    DataNode 3     |
| (Worker Node)     |   | (Worker Node)     |   | (Worker Node)     |
+-------------------+   +-------------------+   +-------------------+
| Block 1, Block 2  |   | Block 2, Block 3  |   | Block 1, Block 3  |
+-------------------+   +-------------------+   +-------------------+

Practical Example

Example: Writing a File to HDFS

  1. Client Request: A client wants to write a file example.txt to HDFS.
  2. File Splitting: The file is split into blocks (e.g., example.txt is 256 MB, split into two 128 MB blocks).
  3. Metadata Update: The NameNode updates its metadata to include the new file and its blocks.
  4. Block Placement: The NameNode instructs DataNodes to store the blocks with replication.
  5. Data Storage: DataNodes store the blocks and replicate them as per the replication factor.
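
The steps above can be sketched as a small simulation. Note the simplifications: real HDFS chain-replicates each block through a write pipeline and chooses DataNodes with the rack-aware policy, whereas this sketch (with made-up DataNode names) just assigns replicas round-robin:

```python
import itertools
import math

BLOCK_MB = 128
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical cluster

def plan_write(file_size_mb: int) -> dict:
    """Conceptual sketch of steps 2-4: split the file into blocks,
    then have the 'NameNode' pick target DataNodes for each replica."""
    n_blocks = math.ceil(file_size_mb / BLOCK_MB)
    rotation = itertools.cycle(DATANODES)
    placement = {}
    for b in range(n_blocks):
        placement[f"block{b}"] = [next(rotation) for _ in range(REPLICATION)]
    return placement

# The 256 MB example.txt from the walkthrough: two blocks, three replicas each
print(plan_write(256))
```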

Code Example: Using HDFS Commands

# Create a directory in HDFS
hdfs dfs -mkdir /user/hadoop/example

# Copy a local file to HDFS
hdfs dfs -put example.txt /user/hadoop/example/

# List files in the HDFS directory
hdfs dfs -ls /user/hadoop/example/

Exercises

Exercise 1: Understanding HDFS Block Storage

Task: Given a file of size 512 MB and a default block size of 128 MB, determine how many blocks will be created and how they will be replicated.

Solution:

  • The file will be split into 4 blocks (512 MB / 128 MB).
  • Each block will be replicated 3 times (default replication factor).
  • Total blocks stored = 4 blocks * 3 replicas = 12 blocks.
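
The solution's arithmetic can be checked directly; the raw-storage figure is an extra number not asked for in the task, but worth knowing when sizing a cluster:

```python
FILE_MB, BLOCK_MB, REPLICATION = 512, 128, 3

blocks = FILE_MB // BLOCK_MB            # divides evenly here: 4 blocks
replicas = blocks * REPLICATION         # 12 stored block replicas
raw_storage_mb = FILE_MB * REPLICATION  # 1536 MB of raw cluster storage

print(blocks, replicas, raw_storage_mb)
```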

Exercise 2: Basic HDFS Commands

Task: Perform the following operations using HDFS commands:

  1. Create a directory /user/student/data.
  2. Upload a file data.txt from the local filesystem to the HDFS directory.
  3. List the contents of the HDFS directory.

Solution:

# Create a directory in HDFS
hdfs dfs -mkdir /user/student/data

# Upload a file to HDFS
hdfs dfs -put data.txt /user/student/data/

# List files in the HDFS directory
hdfs dfs -ls /user/student/data/

Conclusion

In this section, we explored the architecture of HDFS, including its key components such as NameNode, DataNode, and Secondary NameNode. We also discussed the concepts of blocks, replication, and rack awareness. Understanding these fundamentals is crucial for working effectively with HDFS and leveraging its capabilities for big data storage and processing. In the next section, we will delve deeper into HDFS commands and their practical applications.

© Copyright 2024. All rights reserved