Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. In this section, we will explore the basics of RL and how to implement RL algorithms using PyTorch.

Key Concepts in Reinforcement Learning

  1. Agent: The learner or decision-maker.
  2. Environment: The external system with which the agent interacts.
  3. State (s): A representation of the current situation of the agent.
  4. Action (a): A single move the agent can make; the set of all possible actions is called the action space.
  5. Reward (r): The feedback from the environment based on the action taken.
  6. Policy (π): The strategy that the agent employs to determine the next action based on the current state.
  7. Value Function (V): The expected cumulative reward obtained from a given state when following the policy.
  8. Q-Function (Q): The expected cumulative reward obtained from taking a given action in a given state and following the policy afterwards. The short sketch after this list shows how these pieces interact in code.
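
The sketch below assumes the gym package installed in the next subsection and uses a random policy on CartPole-v1, so the agent does not learn anything yet; it is only a minimal illustration of how state, action, and reward flow between the agent and the environment:

import gym

env = gym.make('CartPole-v1')                   # the environment
state = env.reset()                             # initial state s
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()          # a (random) policy chooses an action a
    state, reward, done, _ = env.step(action)   # the environment returns the next state and reward r
    total_reward += reward                      # cumulative reward for this episode

print(f"Episode finished with total reward {total_reward}")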

Setting Up the Environment

Before diving into the implementation, ensure you have the necessary libraries installed. The examples in this section use the classic Gym API, in which env.reset() returns only the observation and env.step() returns four values, so install a Gym release older than 0.26 (if you use Gym 0.26+ or Gymnasium, adjust the reset and step calls accordingly):

pip install torch "gym<0.26"

Implementing a Simple RL Algorithm: Q-Learning

Step 1: Import Libraries

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

Step 2: Define the Q-Network

A Q-Network approximates the Q-Function with a neural network: it takes a state as input and outputs one Q-value for each possible action.

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x
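
As a quick sanity check, you can pass a dummy state through the network and confirm that it returns one Q-value per action. This is a minimal sketch: the names check_net and dummy_state are illustrative, and the hard-coded sizes (4 state dimensions, 2 actions) match CartPole-v1, which the next step obtains programmatically.

check_net = QNetwork(state_size=4, action_size=2)
dummy_state = torch.zeros(1, 4)         # a batch containing a single state
print(check_net(dummy_state).shape)     # torch.Size([1, 2]) -- one Q-value per action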

Step 3: Initialize the Environment and Network

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

q_network = QNetwork(state_size, action_size)
optimizer = optim.Adam(q_network.parameters(), lr=0.001)
criterion = nn.MSELoss()

Step 4: Define the Training Loop
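
In each episode the agent selects actions with an epsilon-greedy policy: with probability epsilon it explores by taking a random action, and otherwise it exploits by taking the action with the highest predicted Q-value. After every step the network is trained toward the one-step Q-learning target r + γ · max_a' Q(s', a'), which reduces to just r when the episode has ended, so the loss is the mean squared error between the current Q-value estimate and this target.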

num_episodes = 1000
gamma = 0.99  # Discount factor
epsilon = 1.0  # Exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01

for episode in range(num_episodes):
    state = env.reset()
    state = torch.FloatTensor(state).unsqueeze(0)
    total_reward = 0
    
    for t in range(200):
        # Epsilon-greedy action selection
        if np.random.rand() <= epsilon:
            action = np.random.choice(action_size)
        else:
            with torch.no_grad():
                q_values = q_network(state)
                action = torch.argmax(q_values).item()
        
        next_state, reward, done, _ = env.step(action)
        next_state = torch.FloatTensor(next_state).unsqueeze(0)
        total_reward += reward
        
        # Compute the target Q-value (no bootstrapping from terminal states)
        with torch.no_grad():
            target_q_value = reward + gamma * torch.max(q_network(next_state)) * (1 - float(done))
        
        # Compute the current Q-value
        current_q_value = q_network(state)[0, action]
        
        # Compute the loss
        loss = criterion(current_q_value, target_q_value)
        
        # Optimize the Q-network
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        state = next_state
        
        if done:
            break
    
    # Decay epsilon
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    
    print(f"Episode {episode+1}/{num_episodes}, Total Reward: {total_reward}")

Step 5: Evaluate the Trained Agent
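
During evaluation the agent acts greedily: it always chooses the action with the highest predicted Q-value and performs no epsilon-greedy exploration.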

state = env.reset()
state = torch.FloatTensor(state).unsqueeze(0)
total_reward = 0

for t in range(200):
    with torch.no_grad():
        q_values = q_network(state)
        action = torch.argmax(q_values).item()
    
    next_state, reward, done, _ = env.step(action)
    next_state = torch.FloatTensor(next_state).unsqueeze(0)
    total_reward += reward
    state = next_state
    
    if done:
        break

print(f"Total Reward: {total_reward}")

Practical Exercises

Exercise 1: Modify the Q-Network Architecture

Task: Modify the Q-Network to include an additional hidden layer with 128 neurons. Train the network and observe the performance.

Solution:

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, action_size)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

Exercise 2: Implement Experience Replay

Task: Implement experience replay to store and sample past experiences to break the correlation between consecutive experiences.

Solution:

from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return state, action, reward, next_state, done
    
    def __len__(self):
        return len(self.buffer)

# Initialize replay buffer
replay_buffer = ReplayBuffer(10000)

# Modify the training loop to use experience replay
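# (q_network, optimizer, criterion, and the hyperparameters from the earlier steps are reused here;
#  re-create them first if you want to train from scratch)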
batch_size = 64

for episode in range(num_episodes):
    state = env.reset()
    state = torch.FloatTensor(state).unsqueeze(0)
    total_reward = 0
    
    for t in range(200):
        if np.random.rand() <= epsilon:
            action = np.random.choice(action_size)
        else:
            with torch.no_grad():
                q_values = q_network(state)
                action = torch.argmax(q_values).item()
        
        next_state, reward, done, _ = env.step(action)
        next_state = torch.FloatTensor(next_state).unsqueeze(0)
        total_reward += reward
        
        replay_buffer.push(state, action, reward, next_state, done)
        
        state = next_state
        
        if len(replay_buffer) > batch_size:
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
            states = torch.cat(states)
            next_states = torch.cat(next_states)
            actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)     # column of action indices for gather
            rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)
            dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)       # 1.0 for terminal transitions, 0.0 otherwise
            
            with torch.no_grad():
                target_q_values = rewards + gamma * torch.max(q_network(next_states), dim=1, keepdim=True)[0] * (1 - dones)
            
            current_q_values = q_network(states).gather(1, actions)
            
            loss = criterion(current_q_values, target_q_values)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        if done:
            break
    
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    
    print(f"Episode {episode+1}/{num_episodes}, Total Reward: {total_reward}")

Summary

In this section, we covered the basics of reinforcement learning and implemented a simple Q-Learning algorithm using PyTorch. We also explored practical exercises to modify the Q-Network architecture and implement experience replay. These exercises help reinforce the concepts and provide hands-on experience with RL in PyTorch.

In the next module, we will delve into more advanced topics and explore other RL algorithms and techniques.
