Skip to content

3. Pandas

Pandas is a fast, powerful, and flexible Python library used for data manipulation and analysis. It provides data structures like Series and DataFrame for handling structured data and offers tools for reading, writing, and transforming data in various formats.

1. Introduction to Pandas

Pandas is designed to handle data from various sources such as CSV files, Excel, SQL databases, and JSON. It provides easy-to-use functions for data cleaning, manipulation, and analysis, making it a go-to library for data scientists.

To start, install Pandas via:

Terminal window
pip install pandas

2. Data Structures: Series and DataFrame

Pandas provides two core data structures:

  1. Series: A one-dimensional array (like a column in a spreadsheet).
  2. DataFrame: A two-dimensional table with rows and columns.

2.1 Series

A Series is a one-dimensional labeled array capable of holding any data type.

import pandas as pd
# Create a Series
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(s)

Key features:

  • You can access elements using labels (s['a']).
  • Series can hold different types of data: integers, floats, strings, etc.

2.2 DataFrame

A DataFrame is a two-dimensional, mutable, and heterogeneous data structure with labeled axes (rows and columns).

# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)

Important DataFrame features:

  • Rows and columns have labels.
  • DataFrames can be created from dictionaries, lists, NumPy arrays, or files (CSV, Excel, etc.).

3. Reading and Writing Data

Pandas allows easy data import and export from multiple formats.

# Read from a CSV file
df = pd.read_csv('data.csv')
# Write to a CSV file
df.to_csv('output.csv', index=False)
# Read from an Excel file
df = pd.read_excel('data.xlsx')
# Write to an Excel file
df.to_excel('output.xlsx', index=False)

4. DataFrame Operations

4.1 Indexing and Selecting Data

You can select rows and columns using labels (loc) or positions (iloc).

# Select a single column
print(df['Name'])
# Select multiple columns
print(df[['Name', 'Age']])
# Select rows by label using loc
print(df.loc[0])
# Select rows by position using iloc
print(df.iloc[0:2])

The difference between iloc and loc in pandas comes down to how you index and select data in a DataFrame:

1. loc: Label-based indexing

  • Used to select rows and columns based on labels (index labels or column names).
  • It allows selecting both by row/column labels and a boolean condition.
  • Syntax: df.loc[rows, columns]

Example:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Using loc to select rows by index label
print(df.loc[0]) # Select the row where the index is 0
# Using loc to select columns by name
print(df.loc[:, 'Name']) # Select all rows, but only 'Name' column
# Using loc with a condition
print(df.loc[df['Age'] > 25]) # Select rows where 'Age' > 25

Select Multiple Columns

import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Using loc to select specific columns by name
selected_columns = df.loc[:, ['Name', 'City']]
print(selected_columns)
Explanation:
  • df.loc[:, ['Name', 'City']]:
    • The first argument (:) selects all rows.
    • The second argument (['Name', 'City']) specifies the two column names to select.
Output:
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie Chicago

2. iloc: Integer-based indexing

  • Used to select rows and columns based on their integer position (index position).
  • It works purely on positional indexing, which means you reference rows and columns by their position in the DataFrame (similar to how lists work in Python).
  • Syntax: df.iloc[rows, columns]

Example:

# Using iloc to select rows by index position
print(df.iloc[0]) # Select the first row (0th position)
# Using iloc to select columns by position
print(df.iloc[:, 0]) # Select all rows, but only the first column (position 0)
# Using iloc to slice rows and columns by position
print(df.iloc[0:2, 0:2]) # Select the first 2 rows and first 2 columns

3. Key Differences

Featureloc (Label-based)iloc (Integer-based)
IndexingUses labels (index labels or column names)Uses integer positions (row/column positions)
SelectionCan handle labels, conditions, and rangesWorks with integer slicing, similar to Python lists
FlexibilityMore flexible, allowing label-based selection or condition-based filteringStrictly based on position, not flexible for conditions
Error HandlingThrows an error if labels don’t existThrows an error if positions are out of bounds

4.2 Filtering, Sorting, and Grouping

Filtering: Apply conditions to filter rows.

# Filter rows where Age > 30
filtered_df = df[df['Age'] > 30]

Sorting: Sort by column values.

# Sort by Age
sorted_df = df.sort_values(by='Age')

Grouping: Use groupby() for aggregation.

# Group by a column and apply aggregation
grouped = df.groupby('Age')['Salary'].mean()

5. Handling Missing Data

Pandas provides ways to handle missing or NaN values.

# Check for missing values
print(df.isnull())
# Drop rows with missing values
df_cleaned = df.dropna()
# Fill missing values
df_filled = df.fillna(0)

6. Data Transformation

6.1 Adding/Removing Columns

# Add a new column
df['Bonus'] = [1000, 1500, 2000]
# Remove a column
df = df.drop('Bonus', axis=1)

6.2 Merging and Concatenating DataFrames

You can merge DataFrames similar to SQL joins.

# Merge two DataFrames
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 60000]})
merged_df = pd.merge(df1, df2, on='ID')

6.3 Pivoting and Melting

Pivoting reshapes data based on specific values, while melting transforms it back.

# Pivot a DataFrame
pivot_df = df.pivot(index='ID', columns='Name', values='Salary')
# Melt a DataFrame
melted_df = pd.melt(pivot_df, id_vars='ID')

7. Data Aggregation

Aggregation allows you to compute summaries of your data.

# Aggregate functions
df_agg = df.groupby('Name').agg({
'Age': 'mean',
'Salary': 'sum'
})
# Custom aggregation
custom_agg = df.groupby('Age').apply(lambda x: x['Salary'].sum())

8. Working with Time Series Data

Pandas has robust support for time series data.

# Convert a column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Set a datetime column as the index
df.set_index('Date', inplace=True)
# Resample time series data (e.g., monthly)
monthly_data = df.resample('M').mean()

Key time series features:

  • Date parsing and formatting.
  • Resampling for aggregating or downsampling data.
  • Rolling windows for moving averages.

9. Visualization with Pandas

Pandas integrates well with Matplotlib for quick plotting.

import matplotlib.pyplot as plt
# Simple plot
df['Salary'].plot(kind='line')
# Plotting grouped data
df.groupby('Age')['Salary'].mean().plot(kind='bar')
plt.show()

For more advanced visualizations, you can use libraries like Seaborn or Plotly in combination with Pandas.


10. Performance Optimization

Pandas offers several ways to improve performance:

10.1 Vectorized Operations

Pandas supports vectorized operations that are much faster than row-by-row operations.

# Vectorized operation
df['New_Column'] = df['Salary'] * 1.1

10.2 Avoiding Loops

Always prefer Pandas’ built-in functions over loops for better performance.

# Bad practice: looping through rows
for i, row in df.iterrows():
df.at[i, 'New_Column'] = row['Salary'] * 1.1
# Better practice: use vectorization
df['New_Column'] = df['Salary'] * 1.1

10.3 Optimizing Memory Usage

To optimize memory usage, you can downcast numeric types.

# Downcast integer types
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')
# Downcast float types
df['Salary'] = pd.to_numeric(df['Salary'], downcast='float')

10.4 Working with Large Datasets

For large datasets, consider loading only a portion of the data at a time.

# Load only the first 1000 rows
df = pd.read_csv('large_file.csv', nrows=1000)
# Load specific columns
df = pd.read_csv('large_file.csv', usecols=['Name', 'Age'])

Key Features in Pandas

  1. Flexible Data Structures: DataFrame and Series make handling structured data intuitive.
  2. Powerful File I/O: Supports reading from and writing to CSV, Excel, SQL, etc.
  3. Data Manipulation: Pandas provides extensive tools for reshaping, cleaning, and filtering data.
  4. Time Series Support: Built-in methods for handling datetime data and resampling.
  5. High-Level Plotting: Quick and simple visualizations for exploratory data analysis.
  6. Performance Optimizations: Use vectorized operations and memory-efficient methods for speed.

This advanced tutorial gives you a deeper understanding of the core concepts and more complex operations in Pandas. By focusing on performance, memory optimization, and flexible data handling, you’ll be equipped to handle even the largest datasets efficiently.