Introduction
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It is designed to store large datasets reliably and to stream those datasets at high bandwidth to user applications. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
Key Concepts
- Architecture
  - NameNode: The master server that manages the file system namespace and regulates client access to files.
  - DataNode: Worker nodes that store blocks and serve read and write requests from clients, performing block operations on instruction from the NameNode.
  - Secondary NameNode: A helper that periodically checkpoints the namespace by merging the edit log into the file system image; despite its name, it is not a standby NameNode.
- File System Namespace
- HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories.
- The file system namespace hierarchy is similar to most other existing file systems: it supports operations such as creating and deleting files, renaming files, and directories.
- Data Replication
  - HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size.
  - Blocks are replicated for fault tolerance. The block size and replication factor are configurable per file (see the replication sketch after this list).
- Fault Tolerance
  - HDFS is designed to detect faults and recover from them automatically.
  - Data is replicated across multiple DataNodes to ensure data availability even if some nodes fail.
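The namespace operations above map directly onto Hadoop's Java FileSystem API. Below is a minimal sketch, assuming a cluster whose configuration is on the classpath; the /user/hadoop/demo path and file names are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOpsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hadoop/demo");     // hypothetical directory
        hdfs.mkdirs(dir);                             // create a directory
        Path src = new Path(dir, "a.txt");
        hdfs.create(src).close();                     // create an empty file
        hdfs.rename(src, new Path(dir, "b.txt"));     // rename within the namespace
        hdfs.delete(dir, true);                       // recursive delete
        hdfs.close();
    }
}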
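Because the replication factor is a per-file setting, a client can also change it after a file has been written. A hedged sketch using FileSystem.setReplication; the path and the target factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/example.txt"); // illustrative path
        // Ask the NameNode to set this file's replication factor to 3;
        // the extra replicas are copied between DataNodes in the background.
        boolean scheduled = hdfs.setReplication(file, (short) 3);
        System.out.println("Replication change scheduled: " + scheduled);
        hdfs.close();
    }
}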
HDFS Architecture
NameNode and DataNode Interaction
- The NameNode maintains the file system namespace and the metadata for all the files and directories.
- The DataNodes are responsible for serving read and write requests from the file system’s clients.
- The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Block Management
- Files are split into one or more blocks, and these blocks are stored in a set of DataNodes.
- The NameNode maintains the mapping of blocks to DataNodes.
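A client can observe this block-to-DataNode mapping through FileSystem.getFileBlockLocations. A minimal sketch, assuming the file already exists; the path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        FileStatus status = hdfs.getFileStatus(new Path("/user/hadoop/example.txt"));
        // Ask the NameNode which DataNodes hold each block of the file
        BlockLocation[] blocks = hdfs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        hdfs.close();
    }
}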
Heartbeats and Block Reports
- DataNodes send periodic heartbeats to the NameNode to confirm their availability.
- DataNodes also send block reports to the NameNode to provide information about the blocks they are storing.
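The heartbeat protocol itself has no public client API, but when the default file system is HDFS, a client can ask the NameNode for the DataNode view it builds from those heartbeats and block reports. A sketch, assuming fs.defaultFS points at an HDFS cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeReportSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The cast only succeeds when the default file system is HDFS
        DistributedFileSystem hdfs = (DistributedFileSystem) fs;
        // The NameNode's current view of the DataNodes
        for (DatanodeInfo dn : hdfs.getDataNodeStats()) {
            System.out.println(dn.getHostName() + " capacity=" + dn.getCapacity()
                    + " used=" + dn.getDfsUsed());
        }
        hdfs.close();
    }
}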
Practical Examples
Writing a File to HDFS
- Client Request: A client requests to write a file to HDFS.
- NameNode Interaction: The client contacts the NameNode to get the list of DataNodes to store the file blocks.
- DataNode Interaction: The client writes the file blocks to the DataNodes.
- Replication: The DataNodes replicate the blocks to other DataNodes as per the replication factor.
// Example: Writing a file to HDFS using the Java API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.OutputStream;

public class HDFSWriteExample {
    public static void main(String[] args) {
        try {
            // Load core-site.xml / hdfs-site.xml from the classpath
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/example.txt");
            // create() asks the NameNode for target DataNodes, then streams to them
            OutputStream os = hdfs.create(file);
            os.write("Hello HDFS!".getBytes());
            os.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Reading a File from HDFS
- Client Request: A client requests to read a file from HDFS.
- NameNode Interaction: The client contacts the NameNode to get the list of DataNodes that store the file blocks.
- DataNode Interaction: The client reads the file blocks from the DataNodes.
// Example: Reading a file from HDFS using the Java API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;

public class HDFSReadExample {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/example.txt");
            // open() fetches block locations from the NameNode,
            // then reads the blocks directly from the DataNodes
            InputStream is = hdfs.open(file);
            byte[] buffer = new byte[256];
            int bytesRead = is.read(buffer);
            while (bytesRead > 0) {
                System.out.write(buffer, 0, bytesRead);
                bytesRead = is.read(buffer);
            }
            System.out.flush(); // System.out.write does not flush on its own
            is.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Exercises
Exercise 1: Write a File to HDFS
Task: Write a Java program to create a new file in HDFS and write the text "Hello, Hadoop!" into it.
Solution:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.OutputStream;

public class HDFSWriteExercise {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/exercise.txt");
            // Create the file and write the greeting as raw bytes
            OutputStream os = hdfs.create(file);
            os.write("Hello, Hadoop!".getBytes());
            os.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Exercise 2: Read a File from HDFS
Task: Write a Java program to read the content of the file created in Exercise 1 from HDFS and print it to the console.
Solution:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;

public class HDFSReadExercise {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            FileSystem hdfs = FileSystem.get(configuration);
            Path file = new Path("/user/hadoop/exercise.txt");
            // Read the file back in chunks and echo it to the console
            InputStream is = hdfs.open(file);
            byte[] buffer = new byte[256];
            int bytesRead = is.read(buffer);
            while (bytesRead > 0) {
                System.out.write(buffer, 0, bytesRead);
                bytesRead = is.read(buffer);
            }
            System.out.flush(); // System.out.write does not flush on its own
            is.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Summary
In this section, we covered the basics of HDFS, including its architecture, key components, and how it handles data storage and replication. We also provided practical examples of writing and reading files to and from HDFS using Java. Understanding HDFS is crucial for working with Hadoop, as it forms the backbone of data storage in the Hadoop ecosystem. In the next module, we will delve deeper into the MapReduce framework, which is used for processing the data stored in HDFS.