Introduction
Sentiment analysis, also known as opinion mining, is a technique used to determine the sentiment expressed in a piece of text. This project will guide you through the process of building a sentiment analysis model to classify social media posts as positive, negative, or neutral.
Objectives
- Understand the basics of sentiment analysis.
- Learn how to preprocess text data.
- Build and evaluate a sentiment analysis model using machine learning algorithms.
- Implement the model and test it on real-world social media data.
Step 1: Understanding Sentiment Analysis
What is Sentiment Analysis?
Sentiment analysis is the process of identifying and categorizing opinions expressed in a text to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral.
Applications
- Customer Feedback: Analyzing customer reviews to understand their satisfaction levels.
- Market Research: Gauging public opinion about products or services.
- Social Media Monitoring: Tracking sentiment trends on social media platforms.
Step 2: Data Collection
Sources of Data
- Twitter API: Collect tweets using Twitter's API.
- Kaggle Datasets: Use pre-existing datasets available on Kaggle.
- Web Scraping: Scrape data from social media websites.
Example: Collecting Data from Twitter
import tweepy

# Twitter API credentials
consumer_key = 'your_consumer_key'
consumer_secret = 'your_consumer_secret'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'

# Authenticate with the Twitter API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Collect tweets (in Tweepy 4.x the method is search_tweets; older versions used api.search)
tweets = api.search_tweets(q='Machine Learning', lang='en', count=100)
for tweet in tweets:
    print(tweet.text)
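If you would rather start from a pre-existing dataset than the Twitter API, a dataset downloaded from Kaggle can be loaded with pandas. This is a minimal sketch; the file name and the column names 'text' and 'sentiment' are assumptions and will differ from dataset to dataset.

import pandas as pd

# Load a downloaded Kaggle dataset (file name and column names are assumptions)
df = pd.read_csv('tweets.csv')

# Inspect the columns you plan to use: the post text and its sentiment label
print(df[['text', 'sentiment']].head())
print(df['sentiment'].value_counts())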
Step 3: Data Preprocessing
Text Cleaning
- Remove Punctuation: Strip punctuation marks.
- Lowercase Conversion: Convert all text to lowercase.
- Stop Words Removal: Remove common words that do not contribute to sentiment (e.g., "and", "the").
- Tokenization: Split text into individual words or tokens.
- Lemmatization/Stemming: Reduce words to their base or root form.
Example: Text Cleaning
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word not in stop_words]
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Example usage
sample_text = "I love Machine Learning! It's amazing."
cleaned_text = clean_text(sample_text)
print(cleaned_text)
Step 4: Feature Extraction
Bag of Words (BoW)
Convert text data into numerical features using the Bag of Words model, which represents each document by the counts of the words it contains.
Example: Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
texts = ["I love Machine Learning", "Machine Learning is great", "I hate spam emails"]

# Create the Bag of Words model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Display the feature names and the transformed data
print(vectorizer.get_feature_names_out())
print(X.toarray())
TF-IDF (Term Frequency-Inverse Document Frequency)
Another method for converting text into numerical features; it weights each word by its frequency in a document, discounted by how common that word is across all documents.
Example: TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF model
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(texts)

# Display the feature names and the transformed data
print(tfidf_vectorizer.get_feature_names_out())
print(X_tfidf.toarray())
Step 5: Building the Model
Choosing an Algorithm
- Logistic Regression
- Naive Bayes
- Support Vector Machines (SVM)
Example: Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Sample data and labels (a real project needs far more labeled examples than this)
texts = ["I love Machine Learning", "Machine Learning is great", "I hate spam emails"]
labels = [1, 1, 0]  # 1: Positive, 0: Negative

# Convert text data to numerical features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
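The other algorithms listed above follow the same fit/predict interface in scikit-learn, so they can be swapped in with minimal changes. A minimal sketch, assuming X_train, X_test, y_train, and y_test from the logistic regression example are still in scope:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Multinomial Naive Bayes works well with word-count or TF-IDF features
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb_model.predict(X_test)))

# A linear Support Vector Machine is a common strong baseline for text classification
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm_model.predict(X_test)))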
Step 6: Model Evaluation
Evaluation Metrics
- Accuracy: The ratio of correctly predicted instances to the total instances.
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- Recall: The ratio of correctly predicted positive observations to all observations in the actual positive class.
- F1 Score: The harmonic mean of Precision and Recall.
Example: Evaluation Metrics
from sklearn.metrics import confusion_matrix

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
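The individual metrics listed above can also be computed directly with scikit-learn. A minimal sketch, assuming y_test and y_pred from Step 5 are still in scope (on the tiny sample data some values may be undefined and are reported as 0 here):

from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
print("F1 Score:", f1_score(y_test, y_pred, zero_division=0))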
Step 7: Testing on Real-World Data
Example: Testing on New Data
# New data
new_texts = ["I am very happy with the service", "This is the worst experience ever"]

# Preprocess and transform new data
new_texts_cleaned = [clean_text(text) for text in new_texts]
new_X = vectorizer.transform(new_texts_cleaned)

# Predict sentiment
new_predictions = model.predict(new_X)
print("Predictions:", new_predictions)
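The model returns the numeric labels defined in Step 5 (1 for positive, 0 for negative). A small follow-up sketch to print them as readable strings:

# Map numeric predictions back to human-readable sentiment labels
label_names = {1: "Positive", 0: "Negative"}
for text, prediction in zip(new_texts, new_predictions):
    print(f"{text} -> {label_names[prediction]}")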
Conclusion
In this project, you have learned how to:
- Collect and preprocess text data for sentiment analysis.
- Extract features using Bag of Words and TF-IDF models.
- Build and evaluate a sentiment analysis model using logistic regression.
- Test the model on real-world social media data.
By completing this project, you have gained practical experience in sentiment analysis, which is a valuable skill in various fields such as marketing, customer service, and social media monitoring.
Machine Learning Course
Module 1: Introduction to Machine Learning
- What is Machine Learning?
- History and Evolution of Machine Learning
- Types of Machine Learning
- Applications of Machine Learning
Module 2: Fundamentals of Statistics and Probability
Module 3: Data Preprocessing
Module 4: Supervised Machine Learning Algorithms
- Linear Regression
- Logistic Regression
- Decision Trees
- Support Vector Machines (SVM)
- K-Nearest Neighbors (K-NN)
- Neural Networks
Module 5: Unsupervised Machine Learning Algorithms
- Clustering: K-means
- Hierarchical Clustering
- Principal Component Analysis (PCA)
- DBSCAN Clustering Analysis
Module 6: Model Evaluation and Validation
Module 7: Advanced Techniques and Optimization
Module 8: Model Implementation and Deployment
- Popular Frameworks and Libraries
- Model Implementation in Production
- Model Maintenance and Monitoring
- Ethical and Privacy Considerations
Module 9: Practical Projects
- Project 1: Housing Price Prediction
- Project 2: Image Classification
- Project 3: Sentiment Analysis on Social Media
- Project 4: Fraud Detection