Introduction
In this project, we will apply the concepts learned in previous modules to build a Natural Language Processing (NLP) model using PyTorch. The project will involve the following steps:
- Data Collection and Preprocessing
- Building the NLP Model
- Training the Model
- Evaluating the Model
- Making Predictions
Step 1: Data Collection and Preprocessing
1.1 Data Collection
For this project, we will use the IMDb movie reviews dataset, which is a common dataset for sentiment analysis tasks. The dataset contains 50,000 movie reviews, split evenly into 25,000 training and 25,000 testing samples.
1.2 Data Preprocessing
Preprocessing includes tokenizing the reviews, padding them to a common length within each batch, and converting tokens to numerical indices. The torchtext legacy API handles these steps for us.
```python
import torch
from torchtext.legacy import data, datasets  # legacy API; requires torchtext < 0.12

# Define the fields that describe how the sequences are processed.
TEXT = data.Field(tokenize='spacy',
                  tokenizer_language='en_core_web_sm',
                  include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

# Load the IMDb dataset.
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# Build the vocabulary from the training data and load pre-trained GloVe embeddings.
TEXT.build_vocab(train_data,
                 max_size=25000,
                 vectors="glove.6B.100d",
                 unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

# Create iterators that yield batches of data.
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    device=device
)
```
Explanation
- Field: Specifies how the raw text and labels are tokenized and numericalized.
- datasets.IMDB: Downloads and loads the IMDb dataset using the fields defined above.
- build_vocab: Builds the vocabulary from the training split and attaches the pre-trained GloVe embeddings.
- BucketIterator: Creates iterators that group reviews of similar length into batches, minimizing padding (see the sanity-check sketch below).
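Before moving on, it can help to inspect the vocabulary and one batch from the iterator. A minimal sketch, assuming the setup above; the values in the comments are what this configuration typically produces, not guaranteed output:

```python
# Inspect the vocabulary built above.
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")    # 25,002 = 25,000 + <unk> + <pad>
print(f"Most common tokens: {TEXT.vocab.freqs.most_common(5)}")
print(f"Label mapping: {dict(LABEL.vocab.stoi)}")                # typically {'neg': 0, 'pos': 1}

# Peek at a single batch from the training iterator.
batch = next(iter(train_iterator))
text, text_lengths = batch.text    # include_lengths=True yields a (tensor, lengths) pair
print(text.shape)                  # [longest_seq_in_batch, BATCH_SIZE]
print(text_lengths[:5])            # lengths of the first five reviews in the batch
print(batch.label.shape)           # [BATCH_SIZE]
```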
Step 2: Building the NLP Model
We will build a simple LSTM-based model for sentiment analysis.
```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        # text: [seq_len, batch_size]
        embedded = self.dropout(self.embedding(text))
        # Pack the sequences so the LSTM skips padding tokens.
        # (pack_padded_sequence expects the lengths tensor on the CPU.)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        # Concatenate the final forward and backward hidden states
        # (assumes bidirectional=True, as used throughout this project).
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.fc(hidden)
```
Explanation
- Embedding: Converts token indices to dense word vectors.
- LSTM: A multi-layer (optionally bidirectional) LSTM processes the sequence of word vectors.
- Linear Layer: Maps the final LSTM hidden state to the single sentiment logit.
- Dropout: Regularization to prevent overfitting (a quick shape check follows below).
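As a quick check that the pieces fit together, the model can be run on a toy batch of random indices. The dimensions here are purely illustrative, not the training configuration:

```python
# Toy shape check with made-up dimensions.
toy_model = LSTMModel(vocab_size=100, embedding_dim=8, hidden_dim=16,
                      output_dim=1, n_layers=2, bidirectional=True, dropout=0.5)
seq_len, batch_size = 12, 4
text = torch.randint(0, 100, (seq_len, batch_size))   # [seq_len, batch_size]
text_lengths = torch.tensor([12, 10, 9, 7])            # lengths must be in descending order
print(toy_model(text, text_lengths).shape)             # torch.Size([4, 1]) -> one logit per review
```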
Step 3: Training the Model
3.1 Define Training Parameters
```python
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = LSTMModel(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                  N_LAYERS, BIDIRECTIONAL, DROPOUT)
```
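Because the vocabulary was built with glove.6B.100d vectors, a common follow-up step (not shown in the original snippet) is to copy those pre-trained vectors into the embedding layer and zero out the <unk> and <pad> rows. A minimal sketch:

```python
# Copy the pre-trained GloVe vectors into the embedding layer.
pretrained_embeddings = TEXT.vocab.vectors               # shape: [len(TEXT.vocab), EMBEDDING_DIM]
model.embedding.weight.data.copy_(pretrained_embeddings)

# <unk> and <pad> carry no pre-trained meaning, so start them at zero.
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
```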
3.2 Define Optimizer and Loss Function
```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
```
3.3 Training Loop
```python
def binary_accuracy(preds, y):
    # Round the sigmoid of the logits to 0/1 and compare with the labels.
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
```
Explanation
- binary_accuracy: Rounds the sigmoid of the logits and returns the fraction that matches the labels (a small worked example follows below).
- train: Runs one full pass over the training iterator, updating the weights batch by batch, and returns the average loss and accuracy.
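For intuition, here is what binary_accuracy does on a hypothetical mini-batch of three logits (the numbers are made up for illustration):

```python
preds = torch.tensor([0.8, -1.2, 2.5])    # raw model outputs before the sigmoid
labels = torch.tensor([1.0, 0.0, 0.0])
# sigmoid -> ~[0.69, 0.23, 0.92]; rounding -> [1., 0., 1.]
# Two of the three rounded predictions match the labels, so accuracy = 2/3.
print(binary_accuracy(preds, labels))      # tensor(0.6667)
```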
Step 4: Evaluating the Model
```python
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
```
Explanation
- evaluate: Evaluates the model on the validation/test set with dropout disabled (model.eval()) and gradient tracking turned off (torch.no_grad()).
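With train and evaluate defined, they can be tied together in a top-level epoch loop. The sketch below is one reasonable way to do it; the number of epochs and the checkpoint filename lstm-model.pt are illustrative choices, not part of the original code:

```python
N_EPOCHS = 5
best_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    test_loss, test_acc = evaluate(model, test_iterator, criterion)

    # Keep the weights that perform best on the held-out data.
    if test_loss < best_loss:
        best_loss = test_loss
        torch.save(model.state_dict(), 'lstm-model.pt')

    print(f'Epoch {epoch + 1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
    print(f'\tTest  Loss: {test_loss:.3f} | Test  Acc: {test_acc * 100:.2f}%')

# Reload the best checkpoint before making predictions.
model.load_state_dict(torch.load('lstm-model.pt'))
```

In practice you would carve a validation split out of train_data (for example with train_data.split()) and use it for model selection instead of reusing the test set.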
Step 5: Making Predictions
```python
import spacy

# Use the same spaCy tokenizer that the TEXT field was built with.
nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence, min_len=5):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # Pad very short sentences so the packed sequence is long enough.
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)                      # add a batch dimension
    length_tensor = torch.LongTensor([len(indexed)])
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()
```
Explanation
- predict_sentiment: Tokenizes a sentence with spaCy, converts the tokens to vocabulary indices, and returns the model's sigmoid output as a probability of positive sentiment.
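A quick usage check with two made-up reviews. Assuming the label vocabulary maps 'pos' to 1, values near 1 indicate positive sentiment and values near 0 negative sentiment; the probabilities in the comments are illustrative, since the exact numbers depend on training:

```python
print(predict_sentiment(model, "This film is terrible"))           # e.g. ~0.03 (negative)
print(predict_sentiment(model, "This film is absolutely great"))   # e.g. ~0.97 (positive)
```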
Conclusion
In this project, we built an LSTM-based sentiment analysis model using PyTorch. We covered data collection and preprocessing, model building, training, evaluation, and making predictions. This project serves as a practical application of the concepts learned in the course and provides a foundation for more advanced NLP tasks.
PyTorch: From Beginner to Advanced
Module 1: Introduction to PyTorch
- What is PyTorch?
- Setting Up the Environment
- Basic Tensor Operations
- Autograd: Automatic Differentiation
Module 2: Building Neural Networks
- Introduction to Neural Networks
- Creating a Simple Neural Network
- Activation Functions
- Loss Functions and Optimization
Module 3: Training Neural Networks
Module 4: Convolutional Neural Networks (CNNs)
- Introduction to CNNs
- Building a CNN from Scratch
- Transfer Learning with Pre-trained Models
- Fine-Tuning CNNs
Module 5: Recurrent Neural Networks (RNNs)
- Introduction to RNNs
- Building an RNN from Scratch
- Long Short-Term Memory (LSTM) Networks
- Gated Recurrent Units (GRUs)
Module 6: Advanced Topics
- Generative Adversarial Networks (GANs)
- Reinforcement Learning with PyTorch
- Deploying PyTorch Models
- Optimizing Performance