Introduction
In this case study, we apply the concepts from the Data Visualization module to a real-world dataset. The goal is to create visualizations that help us understand the data and communicate our findings clearly. We will build each plot twice, once with base R graphics and once with the ggplot2 package, so the two approaches can be compared side by side.
Dataset
For this case study, we will use mtcars, a dataset built into R. It describes 32 car models (1973-74) with 11 variables, including miles per gallon (mpg), number of cylinders (cyl), horsepower (hp), and weight (wt).
Dataset Overview
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
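The overview above looks like the output of head(). As a minimal sketch, assuming a standard R session in which mtcars is available, the dataset could be inspected like this before plotting. The choice of columns passed to summary() is only an illustration.

```r
# First six rows (the default number shown by head())
head(mtcars)

# Dimensions: number of rows (cars) and columns (variables)
dim(mtcars)

# Column types and a short preview of the values
str(mtcars)

# Minimum, quartiles, mean and maximum for three of the plotted variables
summary(mtcars[, c("mpg", "hp", "wt")])
```

All columns are stored as numbers, which is why the later examples wrap cyl and gear in factor() when they should be treated as categories.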
Step-by-Step Visualization
- Scatter Plot: MPG vs Horsepower
Base R Graphics
# Scatter plot using base R
plot(mtcars$hp, mtcars$mpg,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "Miles Per Gallon",
     pch = 19, col = "blue")
ggplot2
# Load ggplot2 package
library(ggplot2)

# Scatter plot using ggplot2
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue") +
  labs(title = "MPG vs Horsepower",
       x = "Horsepower",
       y = "Miles Per Gallon")
- Box Plot: MPG by Number of Cylinders
Base R Graphics
# Box plot using base R
boxplot(mpg ~ cyl, data = mtcars,
        main = "MPG by Number of Cylinders",
        xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon",
        col = c("red", "green", "blue"))
ggplot2
# Box plot using ggplot2
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(aes(fill = factor(cyl))) +
  labs(title = "MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon") +
  scale_fill_manual(values = c("red", "green", "blue"))
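To read the box plots more precisely, it can help to look at the numbers behind them. The following is a rough sketch using only base R. The names mpg_median and mpg_stats are illustrative and do not come from the original material.

```r
# Median mpg for each cylinder count (the line drawn inside each box)
mpg_median <- tapply(mtcars$mpg, mtcars$cyl, median)
print(mpg_median)

# Statistics boxplot() computes per group, without drawing anything.
# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker.
mpg_stats <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)$stats
colnames(mpg_stats) <- names(mpg_median)
print(mpg_stats)
```

Comparing the median row across the three columns gives the same ordering the plot shows: fuel economy drops as the number of cylinders increases.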
- Histogram: Distribution of MPG
Base R Graphics
# Histogram using base R
hist(mtcars$mpg,
     main = "Distribution of MPG",
     xlab = "Miles Per Gallon",
     col = "purple",
     breaks = 10)
ggplot2
# Histogram using ggplot2
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "purple", color = "black") +
  labs(title = "Distribution of MPG",
       x = "Miles Per Gallon",
       y = "Frequency")
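The two histograms above need not use identical bins: in hist(), breaks = 10 is only a suggested number of cells, while binwidth = 2 in geom_histogram() fixes the bin width but leaves the bin placement to ggplot2's defaults. A minimal sketch, assuming one wants both plots to share the same bin edges, is to pass explicit breaks to each. The edges chosen here (every 2 mpg from 10 to 34) and the names edges, h, and p are assumptions made for illustration.

```r
library(ggplot2)

# Explicit bin edges every 2 mpg, wide enough to cover all mpg values
edges <- seq(10, 34, by = 2)

# Base R: pass the edges directly instead of a suggested bin count
h <- hist(mtcars$mpg, breaks = edges,
          main = "Distribution of MPG",
          xlab = "Miles Per Gallon",
          col = "purple")

# ggplot2: the breaks argument takes the same vector of edges
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(breaks = edges, fill = "purple", color = "black") +
  labs(title = "Distribution of MPG",
       x = "Miles Per Gallon",
       y = "Frequency")
print(p)
```

With shared edges, any remaining difference between the two plots is purely cosmetic, which makes the base R and ggplot2 versions easier to compare.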
- Bar Plot: Count of Cars by Gear
Base R Graphics
# Bar plot using base R
barplot(table(mtcars$gear),
        main = "Count of Cars by Gear",
        xlab = "Number of Gears",
        ylab = "Count",
        col = "orange")
ggplot2
# Bar plot using ggplot2
ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar(fill = "orange") +
  labs(title = "Count of Cars by Gear",
       x = "Number of Gears",
       y = "Count")
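Both bar plots are driven by the same counts: the base R version receives them from table(), and geom_bar() counts the rows for each x value by default. As a small sketch, the counts could be inspected directly. The names gear_counts and gear_df are illustrative.

```r
# Frequency table of the gear column: one count per distinct value
gear_counts <- table(mtcars$gear)
print(gear_counts)

# The same counts as a data frame, convenient for labelling or reuse
gear_df <- as.data.frame(gear_counts, stringsAsFactors = FALSE)
names(gear_df) <- c("gear", "count")
print(gear_df)
```

Checking the table first is a quick way to confirm that the bar heights in either plot match the data.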
Practical Exercises
Exercise 1: Scatter Plot with Regression Line
Create a scatter plot of wt (weight) vs mpg and add a regression line to it using ggplot2.
Solution
# Scatter plot with regression line using ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "MPG vs Weight with Regression Line",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon")
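With method = "lm", geom_smooth() fits a linear model of y on x and draws the fitted line, with a shaded confidence band by default. As a sketch of what that line represents, the same model can be fitted directly with lm(). The name fit is illustrative.

```r
# Fit the simple linear model mpg = a + b * wt
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept (a) and slope (b) of the fitted line
print(coef(fit))

# Share of the variance in mpg explained by wt
print(summary(fit)$r.squared)
```

The slope is negative, which matches the downward trend in the plot: heavier cars tend to get fewer miles per gallon.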
Exercise 2: Faceted Plot
Create a faceted plot of mpg vs hp for each number of cylinders using ggplot2.
Solution
# Faceted plot using ggplot2
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(title = "MPG vs Horsepower by Number of Cylinders",
       x = "Horsepower",
       y = "Miles Per Gallon")
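In the solution above, each panel strip shows only the bare value of cyl (4, 6, or 8). One possible refinement, sketched here as an optional variation rather than part of the exercise, is ggplot2's label_both labeller, which includes the variable name in each strip. The name p is illustrative.

```r
library(ggplot2)

# label_both puts the variable name in each panel strip, e.g. "cyl: 4"
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, labeller = label_both) +
  labs(title = "MPG vs Horsepower by Number of Cylinders",
       x = "Horsepower",
       y = "Miles Per Gallon")
print(p)
```

All panels share the same axes by default, so points can be compared across cylinder groups at a glance.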
Conclusion
In this case study, we explored various types of visualizations using both base R graphics and the ggplot2 package. We created scatter plots, box plots, histograms, and bar plots to analyze the mtcars dataset. Additionally, we practiced creating more complex visualizations such as scatter plots with regression lines and faceted plots. These skills are essential for effectively communicating data insights and making data-driven decisions.