Descriptive statistics are used to summarize and describe the main features of a dataset. This module will cover the fundamental concepts and techniques for performing descriptive statistics in R.

Key Concepts

Measures of Central Tendency:
- Mean
- Median
- Mode
Measures of Dispersion:
- Range
- Variance
- Standard Deviation
- Interquartile Range (IQR)
Measures of Shape:
- Skewness
- Kurtosis
Summary Statistics:
- Summary function
- Quantiles

Practical Examples

Measures of Central Tendency

Mean

The mean is the arithmetic average of a set of numbers: the sum of the values divided by how many values there are.

# Example dataset
data <- c(10, 20, 30, 40, 50)

# Calculate mean
mean_value <- mean(data)
print(mean_value)

Explanation:

data is a vector containing the dataset.
mean(data) calculates the mean of the dataset.
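
As a quick check of the definition, the sketch below computes the mean by hand and shows how a missing value affects the result. The vector with_missing is a made-up illustration, not part of the original dataset.

```r
# Example dataset from the text
data <- c(10, 20, 30, 40, 50)

# Mean by definition: sum of the values divided by the number of values
manual_mean <- sum(data) / length(data)
print(manual_mean)                        # 30

# Hypothetical vector with a missing value (illustration only)
with_missing <- c(10, 20, NA, 40)
print(mean(with_missing))                 # NA
print(mean(with_missing, na.rm = TRUE))   # mean of 10, 20 and 40
```

Passing na.rm = TRUE drops missing values before averaging; without it, a single NA makes the whole result NA.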

Median

The median is the middle value of a dataset when it is ordered. If the dataset has an even number of values, the median is the average of the two middle values.

# Calculate median
median_value <- median(data)
print(median_value)

Explanation:

median(data) calculates the median of the dataset.
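
A small sketch of how median() behaves with odd and even counts, and why it is less sensitive to extreme values than the mean. The vectors even_values and with_outlier are made-up illustrations.

```r
# Odd number of values: the median is the single middle value
odd_values <- c(10, 20, 30, 40, 50)
print(median(odd_values))     # 30

# Even number of values: the median is the mean of the two middle values
even_values <- c(10, 20, 30, 40)
print(median(even_values))    # 25

# Hypothetical outlier: the mean shifts a lot, the median does not
with_outlier <- c(10, 20, 30, 40, 500)
print(mean(with_outlier))     # 120
print(median(with_outlier))   # 30
```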

Mode

The mode is the value that appears most frequently in a dataset. R has no built-in function for the statistical mode (the base function mode() returns an object's storage type, not its most frequent value), so we can create a custom function.

# Custom function to calculate mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Calculate mode
mode_value <- get_mode(data)
print(mode_value)

Explanation:

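get_mode is a custom function that calculates the mode of a dataset. It counts how often each unique value occurs and returns the first value with the highest count.
Because every value in data occurs exactly once, get_mode(data) returns 10, the first element.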
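
When several values share the highest frequency, a function that returns only one of them hides the tie. The sketch below is one possible variant that returns every tied value; get_modes is a hypothetical helper name and the input vectors are made up for illustration.

```r
# Hypothetical helper: return every value that shares the highest frequency
get_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))   # frequency of each unique value
  uniqv[counts == max(counts)]
}

# Illustrative vectors (not from the original text)
print(get_modes(c(1, 2, 2, 3, 3)))   # 2 3
print(get_modes(c(4, 4, 5)))         # 4
```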

Measures of Dispersion

Range

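The range is the difference between the maximum and minimum values in a dataset.

# Calculate range (maximum minus minimum)
range_limits <- range(data)        # c(minimum, maximum)
range_value <- diff(range_limits)
print(range_value)

Explanation:

range(data) returns the minimum and maximum values of the dataset as a two-element vector.
diff() subtracts the minimum from the maximum, giving the range as a single number.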
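
Real datasets often contain missing values, which range() propagates by default. A minimal sketch, assuming a made-up vector named values, of getting the range as a single number while ignoring NA entries:

```r
# Hypothetical measurements with one missing value
values <- c(12, NA, 7, 30)

# Without na.rm, the missing value propagates
print(range(values))                  # NA NA

# Drop missing values, then take maximum minus minimum
limits <- range(values, na.rm = TRUE)
print(limits)                         # 7 30
spread <- diff(limits)
print(spread)                         # 23
```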

Variance

Variance measures how widely the data points spread around the mean. It is based on the squared deviations of the values from the mean.

# Calculate variance
variance_value <- var(data)
print(variance_value)

Explanation:

var(data) calculates the sample variance of the dataset, which divides the sum of squared deviations by n - 1 rather than n.
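
To make the calculation concrete, the following sketch reproduces var() by hand and contrasts it with the population variance, which divides by n. The variable names are illustrative only.

```r
data <- c(10, 20, 30, 40, 50)
n <- length(data)

# Squared deviations from the mean
squared_dev <- (data - mean(data))^2

# Sample variance: divide by n - 1 (this is what var() computes)
sample_var <- sum(squared_dev) / (n - 1)
print(sample_var)        # 250

# Population variance: divide by n
population_var <- sum(squared_dev) / n
print(population_var)    # 200
```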

Standard Deviation

Standard deviation is the square root of the variance. It is expressed in the same units as the data and describes the typical distance of the values from the mean.

# Calculate standard deviation
sd_value <- sd(data)
print(sd_value)

Explanation:

sd(data) calculates the sample standard deviation of the dataset.
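
The relationship to the variance can be checked directly. The sketch below also uses the standard deviation to express each value as a z-score (its distance from the mean measured in standard deviations), which is one common use of this measure.

```r
data <- c(10, 20, 30, 40, 50)

# sd() is the square root of the sample variance
print(all.equal(sd(data), sqrt(var(data))))   # TRUE

# z-scores: distance of each value from the mean, in standard deviations
z_scores <- (data - mean(data)) / sd(data)
print(round(z_scores, 2))   # -1.26 -0.63 0.00 0.63 1.26
```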

Interquartile Range (IQR)

The IQR is the difference between the third quartile (75th percentile) and the first quartile (25th percentile). It describes the spread of the middle 50% of the data.

# Calculate IQR
iqr_value <- IQR(data)
print(iqr_value)

Explanation:

IQR(data) calculates the interquartile range of the dataset.
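
The same number can be obtained from the quartiles themselves. A short sketch, using quantile() with its default method:

```r
data <- c(10, 20, 30, 40, 50)

# First and third quartiles (names = FALSE drops the "25%" / "75%" labels)
q1 <- quantile(data, 0.25, names = FALSE)
q3 <- quantile(data, 0.75, names = FALSE)
print(c(q1, q3))       # 20 40

# IQR is the third quartile minus the first
manual_iqr <- q3 - q1
print(manual_iqr)      # 20
print(isTRUE(all.equal(manual_iqr, IQR(data))))   # TRUE
```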

Measures of Shape

Skewness

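Skewness measures the asymmetry of the data distribution. Positive values indicate a longer right tail, negative values a longer left tail, and values near 0 a roughly symmetric distribution.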

# Install e1071 once (uncomment the next line if it is not installed yet), then load it
# install.packages("e1071")
library(e1071)

# Calculate skewness
skewness_value <- skewness(data)
print(skewness_value)

Explanation:

skewness(data) calculates the skewness of the dataset. The e1071 package is required for this function.
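
To see what the measure is built from, here is a base R sketch of the plain moment-based definition (third central moment divided by the second central moment raised to the power 3/2). moment_skewness is a hypothetical helper name; packages may apply small-sample adjustments, so their results can differ slightly from this formula.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
moment_skewness <- function(x) {
  dev <- x - mean(x)
  m2 <- mean(dev^2)   # second central moment
  m3 <- mean(dev^3)   # third central moment
  m3 / m2^(3 / 2)
}

# Symmetric data: skewness is 0
print(moment_skewness(c(10, 20, 30, 40, 50)))   # 0

# Illustrative right-skewed data: skewness is positive
print(moment_skewness(c(1, 2, 2, 3, 10)) > 0)   # TRUE
```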

Kurtosis

Kurtosis measures the "tailedness" of the data distribution, that is, how heavy its tails are compared with those of a normal distribution.

# Calculate kurtosis
kurtosis_value <- kurtosis(data)
print(kurtosis_value)

Explanation:

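kurtosis(data) calculates the kurtosis of the dataset. The e1071 package is required for this function. It reports excess kurtosis (kurtosis minus 3), so a normal distribution scores about 0 and negative values indicate lighter tails.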
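
For reference, a base R sketch of the plain moment-based excess kurtosis (fourth central moment divided by the squared second central moment, minus 3). moment_excess_kurtosis is a hypothetical helper name; package implementations may include small-sample adjustments and return slightly different values.

```r
# Hypothetical helper: moment-based excess kurtosis, m4 / m2^2 - 3
moment_excess_kurtosis <- function(x) {
  dev <- x - mean(x)
  m2 <- mean(dev^2)   # second central moment
  m4 <- mean(dev^4)   # fourth central moment
  m4 / m2^2 - 3
}

# Evenly spaced data has lighter tails than a normal distribution
print(moment_excess_kurtosis(c(10, 20, 30, 40, 50)))   # -1.3
```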

Summary Statistics

Summary Function

The summary function provides a quick overview of the dataset, including the minimum, first quartile, median, mean, third quartile, and maximum.

# Summary statistics
summary_stats <- summary(data)
print(summary_stats)

Explanation:

summary(data) provides a summary of the dataset.
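
The result of summary() on a numeric vector is a named vector, so individual statistics can be pulled out by name. A brief sketch:

```r
data <- c(10, 20, 30, 40, 50)
summary_stats <- summary(data)

# The result carries one name per statistic
print(names(summary_stats))
# "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."

# Extract a single statistic by name
print(summary_stats[["Median"]])    # 30
print(summary_stats[["3rd Qu."]])   # 40
```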

Quantiles

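Quantiles are cut points that divide the ordered data into groups containing equal proportions of the observations. Quartiles, for example, split the data into four groups of 25% each.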

# Calculate quantiles
quantiles <- quantile(data)
print(quantiles)

Explanation:

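quantile(data) calculates the quantiles of the dataset. With no further arguments, it returns the 0%, 25%, 50%, 75%, and 100% quantiles: the minimum, the three quartiles, and the maximum.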
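
Other cut points can be requested through the probs argument. A short sketch asking for the 10th and 90th percentiles and for the deciles:

```r
data <- c(10, 20, 30, 40, 50)

# 10th and 90th percentiles
tails <- quantile(data, probs = c(0.1, 0.9))
print(tails)   # 10%: 14, 90%: 46

# Deciles (10%, 20%, ..., 90%) without the percentage labels
deciles <- quantile(data, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)
print(deciles)
```

These values come from quantile()'s default interpolation method; the type argument selects other methods, which can give slightly different results on small datasets.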

Practical Exercises

Exercise 1: Calculate Descriptive Statistics

Given the dataset data <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50), calculate the following:

Mean
Median
Mode
Range
Variance
Standard Deviation
IQR
Skewness
Kurtosis
Summary statistics

Solution

data <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)

# Mean
mean_value <- mean(data)
print(mean_value)

# Median
median_value <- median(data)
print(median_value)

# Mode (every value occurs once, so get_mode returns the first value, 5)
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode_value <- get_mode(data)
print(mode_value)

# Range (maximum minus minimum)
range_value <- diff(range(data))
print(range_value)

# Variance
variance_value <- var(data)
print(variance_value)

# Standard Deviation
sd_value <- sd(data)
print(sd_value)

# IQR
iqr_value <- IQR(data)
print(iqr_value)

# Skewness (requires the e1071 package)
library(e1071)
skewness_value <- skewness(data)
print(skewness_value)

# Kurtosis
kurtosis_value <- kurtosis(data)
print(kurtosis_value)

# Summary statistics
summary_stats <- summary(data)
print(summary_stats)

Common Mistakes and Tips

Mistake: Forgetting to install and load necessary packages (e.g., e1071 for skewness and kurtosis).
- Tip: Always check if a package is required for a specific function and ensure it is installed and loaded.
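Mistake: Misinterpreting the results of descriptive statistics.
- Tip: Understand the meaning of each measure and how it describes the dataset. For example, range() returns the minimum and maximum rather than their difference, and var() and sd() compute sample statistics that divide by n - 1.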
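
For the package tip above, one way to check availability before loading is requireNamespace(), which returns FALSE instead of raising an error when a package is missing. The helper name package_available is hypothetical, and the sketch deliberately does not install anything on its own.

```r
# Hypothetical helper: report whether a package is installed, without installing it
package_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (package_available("e1071")) {
  library(e1071)
} else {
  message("Package 'e1071' is not installed; run install.packages(\"e1071\") once.")
}
```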

Conclusion

In this section, we covered the fundamental concepts and techniques for performing descriptive statistics in R. You learned how to calculate measures of central tendency, dispersion, and shape, as well as how to obtain summary statistics. These skills are essential for understanding and summarizing your data before performing further analysis. In the next section, we will delve into probability distributions, which are crucial for statistical analysis and hypothesis testing.
