Introduction
In this project, we will apply the concepts learned in previous modules to build a Natural Language Processing (NLP) model using PyTorch. The project will involve the following steps:
- Data Collection and Preprocessing
- Building the NLP Model
- Training the Model
- Evaluating the Model
- Making Predictions
Step 1: Data Collection and Preprocessing
1.1 Data Collection
For this project, we will use the IMDb movie reviews dataset, which is a common dataset for sentiment analysis tasks. The dataset contains 50,000 movie reviews, split evenly into 25,000 training and 25,000 testing samples.
1.2 Data Preprocessing
Preprocessing includes tokenizing the reviews, padding them to a common length within each batch, and converting tokens to numerical indices. The torchtext legacy API handles these steps for us.
```python
import torch
from torchtext.legacy import data, datasets  # legacy API; requires torchtext < 0.12

# Define the fields that describe how the sequences are processed.
TEXT = data.Field(tokenize='spacy',
                  tokenizer_language='en_core_web_sm',
                  include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

# Load the IMDb dataset.
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# Build the vocabulary from the training data and load pre-trained GloVe embeddings.
TEXT.build_vocab(train_data,
                 max_size=25000,
                 vectors="glove.6B.100d",
                 unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

# Create iterators that yield batches of data.
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    device=device
)
```
Explanation
- Field: Specifies how the raw text and labels are tokenized and numericalized.
- datasets.IMDB: Downloads and loads the IMDb dataset using the fields defined above.
- build_vocab: Builds the vocabulary from the training split and attaches the pre-trained GloVe embeddings.
- BucketIterator: Creates iterators that group reviews of similar length into batches, minimizing padding (see the sanity-check sketch below).
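Before moving on, it can help to inspect the vocabulary and one batch from the iterator. A minimal sketch, assuming the setup above; the values in the comments are what this configuration typically produces, not guaranteed output:

```python
# Inspect the vocabulary built above.
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")    # 25,002 = 25,000 + <unk> + <pad>
print(f"Most common tokens: {TEXT.vocab.freqs.most_common(5)}")
print(f"Label mapping: {dict(LABEL.vocab.stoi)}")                # typically {'neg': 0, 'pos': 1}

# Peek at a single batch from the training iterator.
batch = next(iter(train_iterator))
text, text_lengths = batch.text    # include_lengths=True yields a (tensor, lengths) pair
print(text.shape)                  # [longest_seq_in_batch, BATCH_SIZE]
print(text_lengths[:5])            # lengths of the first five reviews in the batch
print(batch.label.shape)           # [BATCH_SIZE]
```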
Step 2: Building the NLP Model
We will build a simple LSTM-based model for sentiment analysis.
```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text: [seq_len, batch_size]
        embedded = self.dropout(self.embedding(text))
        # Pack the sequences so the LSTM skips padding tokens.
        # (pack_padded_sequence expects the lengths tensor on the CPU.)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # Concatenate the final forward and backward hidden states
        # (assumes bidirectional=True, as used throughout this project).
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)
```
Explanation
- Embedding: Converts token indices to dense word vectors.
- LSTM: A multi-layer (optionally bidirectional) LSTM processes the sequence of word vectors.
- Linear Layer: Maps the final LSTM hidden state to the single sentiment logit.
- Dropout: Regularization to prevent overfitting (a quick shape check follows below).
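As a quick check that the pieces fit together, the model can be run on a toy batch of random indices. The dimensions here are purely illustrative, not the training configuration:

```python
# Toy shape check with made-up dimensions.
toy_model = LSTMModel(vocab_size=100, embedding_dim=8, hidden_dim=16,
                      output_dim=1, n_layers=2, bidirectional=True, dropout=0.5)
seq_len, batch_size = 12, 4
text = torch.randint(0, 100, (seq_len, batch_size))   # [seq_len, batch_size]
text_lengths = torch.tensor([12, 10, 9, 7])            # lengths must be in descending order
print(toy_model(text, text_lengths).shape)             # torch.Size([4, 1]) -> one logit per review
```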
Step 3: Training the Model
3.1 Define Training Parameters
```python
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = LSTMModel(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                  N_LAYERS, BIDIRECTIONAL, DROPOUT)
```
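Because the vocabulary was built with glove.6B.100d vectors, a common follow-up step (not shown in the original snippet) is to copy those pre-trained vectors into the embedding layer and zero out the <unk> and <pad> rows. A minimal sketch:

```python
# Copy the pre-trained GloVe vectors into the embedding layer.
pretrained_embeddings = TEXT.vocab.vectors               # shape: [len(TEXT.vocab), EMBEDDING_DIM]
model.embedding.weight.data.copy_(pretrained_embeddings)

# <unk> and <pad> carry no pre-trained meaning, so start them at zero.
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
```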
3.2 Define Optimizer and Loss Function
```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
```
3.3 Training Loop
```python
def binary_accuracy(preds, y):
    # Round the sigmoid of the logits to 0/1 and compare with the labels.
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
```
Explanation
- binary_accuracy: Rounds the sigmoid of the logits and returns the fraction that matches the labels (a small worked example follows below).
- train: Runs one full pass over the training iterator, updating the weights batch by batch, and returns the average loss and accuracy.
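For intuition, here is what binary_accuracy does on a hypothetical mini-batch of three logits (the numbers are made up for illustration):

```python
preds = torch.tensor([0.8, -1.2, 2.5])    # raw model outputs before the sigmoid
labels = torch.tensor([1.0, 0.0, 0.0])
# sigmoid -> ~[0.69, 0.23, 0.92]; rounding -> [1., 0., 1.]
# Two of the three rounded predictions match the labels, so accuracy = 2/3.
print(binary_accuracy(preds, labels))      # tensor(0.6667)
```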
Step 4: Evaluating the Model
```python
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
```
Explanation
- evaluate: Evaluates the model on the validation/test set with dropout disabled (model.eval()) and gradient tracking turned off (torch.no_grad()).
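With train and evaluate defined, they can be tied together in a top-level epoch loop. The sketch below is one reasonable way to do it; the number of epochs and the checkpoint filename lstm-model.pt are illustrative choices, not part of the original code:

```python
N_EPOCHS = 5
best_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    test_loss, test_acc = evaluate(model, test_iterator, criterion)

    # Keep the weights that perform best on the held-out data.
    if test_loss < best_loss:
        best_loss = test_loss
        torch.save(model.state_dict(), 'lstm-model.pt')

    print(f'Epoch {epoch + 1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
    print(f'\tTest  Loss: {test_loss:.3f} | Test  Acc: {test_acc * 100:.2f}%')

# Reload the best checkpoint before making predictions.
model.load_state_dict(torch.load('lstm-model.pt'))
```

In practice you would carve a validation split out of train_data (for example with train_data.split()) and use it for model selection instead of reusing the test set.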
Step 5: Making Predictions
```python
import spacy

# Use the same spaCy tokenizer that the TEXT field was built with.
nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence, min_len=5):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # Pad very short sentences so the packed sequence is long enough.
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)                      # add a batch dimension
    length_tensor = torch.LongTensor([len(indexed)])
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()
```
Explanation
- predict_sentiment: Tokenizes a sentence with spaCy, converts the tokens to vocabulary indices, and returns the model's sigmoid output as a probability of positive sentiment.
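A quick usage check with two made-up reviews. Assuming the label vocabulary maps 'pos' to 1, values near 1 indicate positive sentiment and values near 0 negative sentiment; the probabilities in the comments are illustrative, since the exact numbers depend on training:

```python
print(predict_sentiment(model, "This film is terrible"))           # e.g. ~0.03 (negative)
print(predict_sentiment(model, "This film is absolutely great"))   # e.g. ~0.97 (positive)
```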
Conclusion
In this project, we built an LSTM-based sentiment analysis model using PyTorch. We covered data collection and preprocessing, model building, training, evaluation, and making predictions. This project serves as a practical application of the concepts learned in the course and provides a foundation for more advanced NLP tasks.
PyTorch: From Beginner to Advanced
Module 1: Introduction to PyTorch
- What is PyTorch?
- Setting Up the Environment
- Basic Tensor Operations
- Autograd: Automatic Differentiation
Module 2: Building Neural Networks
- Introduction to Neural Networks
- Creating a Simple Neural Network
- Activation Functions
- Loss Functions and Optimization
Module 3: Training Neural Networks
Module 4: Convolutional Neural Networks (CNNs)
- Introduction to CNNs
- Building a CNN from Scratch
- Transfer Learning with Pre-trained Models
- Fine-Tuning CNNs
Module 5: Recurrent Neural Networks (RNNs)
- Introduction to RNNs
- Building an RNN from Scratch
- Long Short-Term Memory (LSTM) Networks
- Gated Recurrent Units (GRUs)
Module 6: Advanced Topics
- Generative Adversarial Networks (GANs)
- Reinforcement Learning with PyTorch
- Deploying PyTorch Models
- Optimizing Performance