Introduction

Pandas is a powerful and flexible open-source data manipulation and analysis library for Python. It provides data structures like Series and DataFrame, which are essential for handling structured data. In this module, we will explore the basics of Pandas and how to use it for data manipulation tasks.

Key Concepts

  1. Series: A one-dimensional labeled array capable of holding any data type.
  2. DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
  3. Indexing and Selecting Data: Techniques to access and modify data in Series and DataFrames.
  4. Data Cleaning: Handling missing values, duplicates, and data type conversions.
  5. Data Transformation: Applying functions, aggregations, and merging datasets.

Setting Up Pandas

Before we start, ensure you have Pandas installed. You can install it using pip:

pip install pandas

Importing Pandas

import pandas as pd

Series

A Series is a one-dimensional, array-like object that holds a sequence of values together with an associated array of labels, called its index.

Creating a Series

import pandas as pd

# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)

Output

0    1
1    2
2    3
3    4
4    5
dtype: int64

Accessing Elements in a Series

print(series[0])    # Output: 1
print(series[1:3])  # Output: 1    2
                    #         2    3
                    #         dtype: int64
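
When a Series has a non-default index, label-based and position-based access behave differently. The following is a minimal sketch of that difference; the `scores` Series and its labels are made up for illustration and are not part of the module's data.

```python
import pandas as pd

# Hypothetical labels chosen for illustration only
scores = pd.Series([90, 85, 70], index=['alice', 'bob', 'charlie'])

by_label = scores.loc['bob']              # label-based lookup
by_position = scores.iloc[0]              # position-based lookup
label_slice = scores.loc['alice':'bob']   # label slices include the end label
position_slice = scores.iloc[0:2]         # position slices exclude the end position

print(by_label)
print(by_position)
print(label_slice)
print(position_slice)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether a value inside the brackets is meant as a label or as a position.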

DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Creating a DataFrame

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Output

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
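
To see what "heterogeneous" and "size-mutable" mean in practice, it can help to inspect the DataFrame's shape and column types. The sketch below rebuilds the same DataFrame; the `Country` column is an invented addition used only to show that the shape changes.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # one dtype per column

# Size-mutable: adding a column changes the shape.
# 'Country' is a made-up column for illustration.
df['Country'] = 'USA'
print(df.shape)
```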

Accessing Data in a DataFrame

# Accessing columns
print(df['Name'])

# Accessing a row by integer position
print(df.iloc[0])

# Accessing a row by index label
print(df.loc[0])
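
Beyond picking a single row or column, rows are often selected by a condition and then modified. The following sketch shows one common way to do this with a boolean mask and `.loc`; the threshold of 28 and the updated age are arbitrary values chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Boolean mask: keep rows where Age is greater than 28
older = df[df['Age'] > 28]
print(older)

# .loc with a row condition and a list of column labels
names_and_cities = df.loc[df['Age'] > 28, ['Name', 'City']]
print(names_and_cities)

# .loc can also assign: update one cell by row label and column label
df.loc[0, 'Age'] = 26
print(df)
```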

Data Cleaning

Handling Missing Values

# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, None, 40],
    'City': ['New York', None, 'Chicago', 'San Francisco']
}
df = pd.DataFrame(data)

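# Filling missing values (assign the result back to the column;
# calling fillna with inplace=True on a selected column is unreliable
# in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].fillna('Unknown')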
print(df)

Output

      Name        Age           City
0    Alice  25.000000       New York
1      Bob  30.000000        Unknown
2  Charlie  31.666667        Chicago
3    David  40.000000  San Francisco
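
Filling is only one way to deal with missing values, and the Key Concepts list also mentions data type conversions. The sketch below, which reuses the same sample data, shows how one might count missing values, drop incomplete rows instead of filling them, and convert a column's type afterwards. Whether dropping or filling is appropriate depends on the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, None, 40],
    'City': ['New York', None, 'Chicago', 'San Francisco'],
})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Alternative to filling: drop any row that has a missing value
complete_rows = df.dropna()
print(complete_rows)

# Data type conversion: Age is float64 because of the missing value;
# once no values are missing it can be converted to an integer type.
complete_rows = complete_rows.astype({'Age': 'int64'})
print(complete_rows.dtypes)
```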

Removing Duplicates

# Creating a DataFrame with duplicate rows
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['New York', 'Los Angeles', 'New York', 'San Francisco']
}
df = pd.DataFrame(data)

# Removing duplicates
df.drop_duplicates(inplace=True)
print(df)

Output

    Name  Age           City
0  Alice   25       New York
1    Bob   30    Los Angeles
3  David   40  San Francisco
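
By default, `drop_duplicates` only removes rows that match in every column. The sketch below uses its `subset` and `keep` parameters to treat rows as duplicates based on a single column; the data is hypothetical and chosen so that two rows share a Name but differ in City.

```python
import pandas as pd

# Hypothetical data: two rows share a Name but differ in City
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'City': ['New York', 'Los Angeles', 'Boston'],
})

# No row is an exact duplicate, so nothing is removed
exact = df.drop_duplicates()
print(exact)

# Treat rows with the same Name as duplicates and keep the last one
by_name = df.drop_duplicates(subset='Name', keep='last')

# Renumber the rows from 0 after dropping
by_name = by_name.reset_index(drop=True)
print(by_name)
```

Resetting the index is optional; without it, the surviving rows keep their original labels, as in the output above where the index jumps from 1 to 3.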

Data Transformation

Applying Functions

# Applying a function to a column
df['Age'] = df['Age'].apply(lambda x: x + 1)
print(df)

Output

    Name  Age           City
0  Alice   26       New York
1    Bob   31    Los Angeles
3  David   41  San Francisco
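
For simple arithmetic like this, `apply` is not strictly needed: pandas can operate on the whole column at once. The following sketch compares the two approaches on a small stand-alone Series (the name `ages` is just for this example) and checks that they give the same result.

```python
import pandas as pd

ages = pd.Series([25, 30, 40])

with_apply = ages.apply(lambda x: x + 1)  # calls the function once per element
vectorized = ages + 1                     # operates on the whole Series at once

print(with_apply.equals(vectorized))
```

The vectorized form is usually shorter and faster on large columns, while `apply` remains useful for logic that cannot be expressed with built-in operations.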

Aggregations

# Aggregating data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
}
df = pd.DataFrame(data)

# Grouping by City and calculating mean age
grouped = df.groupby('City')['Age'].mean()
print(grouped)

Output

City
Chicago          35.0
Los Angeles      30.0
New York         25.0
San Francisco    40.0
Name: Age, dtype: float64
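
Because every city appears only once in the data above, each group mean is just that person's age. Grouping becomes more informative when groups contain several rows. The sketch below uses hypothetical data with repeated cities and computes several statistics per group with `agg`.

```python
import pandas as pd

# Hypothetical data in which each city appears more than once
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 40],
    'City': ['New York', 'Chicago', 'Chicago', 'New York', 'Chicago'],
})

# One row per city, one column per statistic
summary = df.groupby('City')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```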

Merging DataFrames

# Creating two DataFrames
data1 = {
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
}
data2 = {
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Combining the DataFrames row-wise with pd.concat (original row labels are kept)
merged_df = pd.concat([df1, df2])
print(merged_df)

Output

      Name  Age
0    Alice   25
1      Bob   30
0  Charlie   35
1    David   40
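
The example above stacks rows with `pd.concat`, which is why the index repeats 0 and 1. Combining tables on a shared key column is done with `pd.merge` instead. The sketch below shows both: `ignore_index=True` to renumber the stacked rows, and a left join against a `cities` lookup table, which is invented here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

# Stack rows and renumber the index 0..3 instead of repeating 0, 1
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)

# Hypothetical lookup table keyed on Name
cities = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Key-based join: a left join keeps every row of `stacked`;
# David has no match in `cities`, so his City is missing.
joined = pd.merge(stacked, cities, on='Name', how='left')
print(joined)
```

Other values of `how`, such as `'inner'` or `'outer'`, control which unmatched rows are kept.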

Practical Exercises

Exercise 1: Creating and Manipulating a DataFrame

Task: Create a DataFrame with the following data and perform the specified operations.

Name     Age  City
Alice    25   New York
Bob      30   Los Angeles
Charlie  35   Chicago
David    40   San Francisco

  1. Add a new column Salary with the values [70000, 80000, 90000, 100000].
  2. Replace the City value 'Los Angeles' with 'LA'.
  3. Calculate the mean salary.

Solution:

import pandas as pd

# Creating the DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
}
df = pd.DataFrame(data)

# Adding a new column
df['Salary'] = [70000, 80000, 90000, 100000]

# Replacing a value
df['City'] = df['City'].replace('Los Angeles', 'LA')

# Calculating the mean salary
mean_salary = df['Salary'].mean()
print(f"Mean Salary: {mean_salary}")

Output

Mean Salary: 85000.0

Conclusion

In this module, we covered the basics of Pandas for data manipulation. We learned about Series and DataFrames, how to clean and transform data, and how to perform aggregations and merge datasets. These skills are fundamental for data analysis and will be built upon in more advanced topics.

© Copyright 2024. All rights reserved