Introduction

In this project, we will apply the concepts learned in previous modules to build a Natural Language Processing (NLP) model using PyTorch. The project will involve the following steps:

  1. Data Collection and Preprocessing
  2. Building the NLP Model
  3. Training the Model
  4. Evaluating the Model
  5. Making Predictions

Step 1: Data Collection and Preprocessing

1.1 Data Collection

For this project, we will use the IMDb movie reviews dataset, which is a common dataset for sentiment analysis tasks. The dataset contains 50,000 movie reviews, split evenly into 25,000 training and 25,000 testing samples.

1.2 Data Preprocessing

Preprocessing steps include tokenization, converting tokens to numerical indices, and padding sequences to a uniform length within each batch. torchtext's Field abstraction handles these steps declaratively:

import torch
# Note: the legacy torchtext API used here was removed in torchtext 0.12;
# this code assumes an older release (roughly torchtext 0.9–0.11).
from torchtext.legacy import data, datasets

# Define the fields associated with the sequences.
TEXT = data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm', include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

# Load the IMDb dataset.
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# Build the vocabulary using the training data.
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

# Create iterators for the data.
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    device=device
)

Explanation

  • Field: Defines how the raw text and labels are tokenized and converted to tensors.
  • datasets.IMDB: Downloads and loads the IMDb dataset using the fields defined above.
  • build_vocab: Builds the vocabulary from the training data and attaches pre-trained GloVe embeddings (inspected in the snippet below).
  • BucketIterator: Creates iterators that batch examples of similar length together, minimizing padding.
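
To sanity-check the preprocessing, you can inspect the built vocabulary directly (a quick check, assuming the code above has run; exact frequency counts depend on the tokenizer):

# Vocabulary size is max_size plus the special <unk> and <pad> tokens.
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")    # 25002
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")  # 2
print(TEXT.vocab.freqs.most_common(5))   # most frequent tokens in the training reviews
print(LABEL.vocab.stoi)                  # label-to-index mapping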

Step 2: Building the NLP Model

We will build a simple LSTM-based model for sentiment analysis.

import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        # text: [seq_len, batch], text_lengths: [batch]
        embedded = self.dropout(self.embedding(text))
        # pack_padded_sequence expects the lengths tensor on the CPU.
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # Concatenate the final forward and backward hidden states (assumes bidirectional=True).
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)

Explanation

  • Embedding: Converts words to dense vectors.
  • LSTM: Processes the sequence of word vectors.
  • Linear Layer: Maps the LSTM output to the desired output dimension.
  • Dropout: Regularization to prevent overfitting.
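
As a quick sanity check, you can instantiate a toy version of the model and confirm the output shape (an illustrative sketch; the dimensions are arbitrary):

toy_model = LSTMModel(vocab_size=100, embedding_dim=8, hidden_dim=16,
                      output_dim=1, n_layers=2, bidirectional=True, dropout=0.5)

text = torch.randint(0, 100, (12, 4))    # [seq_len, batch] of random token indices
lengths = torch.tensor([12, 10, 7, 5])   # lengths sorted descending, as BucketIterator provides
print(toy_model(text, lengths).shape)    # expected: torch.Size([4, 1])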

Step 3: Training the Model

3.1 Define Training Parameters

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = LSTMModel(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)
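
Because the vocabulary was built with GloVe vectors, it is common to initialize the embedding layer with them (a minimal sketch; the model also trains without this step, just from randomly initialized embeddings):

# Copy the pre-trained GloVe vectors into the embedding layer.
pretrained_embeddings = TEXT.vocab.vectors       # shape: [INPUT_DIM, EMBEDDING_DIM]
model.embedding.weight.data.copy_(pretrained_embeddings)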

3.2 Define Optimizer and Loss Function

import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
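
Note that BCEWithLogitsLoss applies the sigmoid internally, which is why the model returns raw logits and why binary_accuracy below applies torch.sigmoid before rounding. A small check of this equivalence (illustrative values):

logits = torch.tensor([0.3, -1.2]).to(device)
labels = torch.tensor([1.0, 0.0]).to(device)
print(criterion(logits, labels))                     # sigmoid + binary cross-entropy in one step
print(nn.BCELoss()(torch.sigmoid(logits), labels))   # same value, computed in two steps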

3.3 Training Loop

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Explanation

  • binary_accuracy: Computes the accuracy of the model.
  • train: Trains the model for one epoch.

Step 4: Evaluating the Model

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Explanation

  • evaluate: Evaluates the model on the validation/test set.
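
The listings above define train and evaluate but never call them; a minimal driver loop might look like this (N_EPOCHS = 5 is an illustrative choice, and the test iterator from Step 1 is used for evaluation):

N_EPOCHS = 5

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    test_loss, test_acc = evaluate(model, test_iterator, criterion)
    print(f'Epoch {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'         | Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')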

Step 5: Making Predictions

import spacy

# Load the same spaCy pipeline used by the TEXT field for tokenization.
nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence, min_len=5):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor([len(indexed)])
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

Explanation

  • predict_sentiment: Predicts the sentiment of a given sentence.
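
Example usage (illustrative sentences; the returned score is the sigmoid output in [0, 1], so check LABEL.vocab.stoi to see which class maps to 1 before interpreting it):

print(predict_sentiment(model, "This film is terrible"))
print(predict_sentiment(model, "This film is absolutely wonderful"))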

Conclusion

In this project, we built an LSTM-based sentiment analysis model using PyTorch. We covered data collection and preprocessing, model building, training, evaluation, and making predictions. This project serves as a practical application of the concepts learned in the course and provides a foundation for more advanced NLP tasks.
