3. Pandas
Pandas is a fast, powerful, and flexible Python library used for data manipulation and analysis. It provides data structures like Series and DataFrame for handling structured data and offers tools for reading, writing, and transforming data in various formats.
1. Introduction to Pandas
Pandas is designed to handle data from various sources such as CSV files, Excel, SQL databases, and JSON. It provides easy-to-use functions for data cleaning, manipulation, and analysis, making it a go-to library for data scientists.
To start, install Pandas via:
pip install pandas
2. Data Structures: Series and DataFrame
Pandas provides two core data structures:
- Series: A one-dimensional array (like a column in a spreadsheet).
- DataFrame: A two-dimensional table with rows and columns.
2.1 Series
A Series is a one-dimensional labeled array capable of holding any data type.
import pandas as pd
# Create a Series
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(s)
Key features:
- You can access elements using labels (s['a']).
- A Series can hold different types of data: integers, floats, strings, etc.
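A short sketch of label-based access (the values here are illustrative):

```python
import pandas as pd

# A Series with string labels; values can be any dtype
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Access a single element by its label
first = s['a']

# Slicing by label is inclusive of the end label
subset = s['a':'c']
```

Note that label slices include both endpoints, unlike Python's usual end-exclusive slicing.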
2.2 DataFrame
A DataFrame is a two-dimensional, mutable, and heterogeneous data structure with labeled axes (rows and columns).
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
Important DataFrame features:
- Rows and columns have labels.
- DataFrames can be created from dictionaries, lists, NumPy arrays, or files (CSV, Excel, etc.).
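As a sketch of two of those construction paths (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# From a list of dicts: missing keys become NaN
df_rows = pd.DataFrame([{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob'}])

# From a NumPy array, with explicit column labels
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])
```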
3. Reading and Writing Data
Pandas allows easy data import and export from multiple formats.
# Read from a CSV file
df = pd.read_csv('data.csv')

# Write to a CSV file
df.to_csv('output.csv', index=False)

# Read from an Excel file
df = pd.read_excel('data.xlsx')

# Write to an Excel file
df.to_excel('output.xlsx', index=False)
4. DataFrame Operations
4.1 Indexing and Selecting Data
You can select rows and columns using labels (loc) or positions (iloc).
# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'Age']])

# Select rows by label using loc
print(df.loc[0])

# Select rows by position using iloc
print(df.iloc[0:2])
The difference between iloc and loc in Pandas comes down to how you index and select data in a DataFrame:
1. loc: Label-based indexing
- Used to select rows and columns based on labels (index labels or column names).
- It allows selecting by row/column labels as well as by a boolean condition.
- Syntax: df.loc[rows, columns]
Example:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Using loc to select rows by index label
print(df.loc[0])  # Select the row where the index is 0

# Using loc to select columns by name
print(df.loc[:, 'Name'])  # Select all rows, but only the 'Name' column

# Using loc with a condition
print(df.loc[df['Age'] > 25])  # Select rows where 'Age' > 25
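loc also works on the left-hand side of an assignment, which is the idiomatic way to update a filtered subset of rows. A small sketch (the 'Group' column and its values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Assign to a filtered subset: label rows by an age threshold
df.loc[df['Age'] > 25, 'Group'] = 'senior'
df.loc[df['Age'] <= 25, 'Group'] = 'junior'
```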
Select Multiple Columns
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Using loc to select specific columns by name
selected_columns = df.loc[:, ['Name', 'City']]
print(selected_columns)
Explanation of df.loc[:, ['Name', 'City']]:
- The first argument (:) selects all rows.
- The second argument (['Name', 'City']) specifies the two column names to select.
Output:
      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago
2. iloc: Integer-based indexing
- Used to select rows and columns based on their integer position (index position).
- It works purely on positional indexing: you reference rows and columns by their position in the DataFrame (similar to how lists work in Python).
- Syntax: df.iloc[rows, columns]
Example:
# Using iloc to select rows by index position
print(df.iloc[0])  # Select the first row (0th position)

# Using iloc to select columns by position
print(df.iloc[:, 0])  # Select all rows, but only the first column (position 0)

# Using iloc to slice rows and columns by position
print(df.iloc[0:2, 0:2])  # Select the first 2 rows and first 2 columns
3. Key Differences
Feature | loc (Label-based) | iloc (Integer-based) |
---|---|---|
Indexing | Uses labels (index labels or column names) | Uses integer positions (row/column positions) |
Selection | Handles labels, boolean conditions, and label ranges (end-inclusive) | Works with integer slicing, like Python lists (end-exclusive) |
Flexibility | More flexible: label-based selection or condition-based filtering | Strictly positional; not intended for conditions |
Error handling | Raises a KeyError if labels don't exist | Raises an IndexError if positions are out of bounds |
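The distinction is easiest to see on a DataFrame with a non-integer index. A minimal sketch (the index labels are made up):

```python
import pandas as pd

# With string index labels, loc and iloc clearly diverge
df = pd.DataFrame({'Age': [25, 30, 35]}, index=['alice', 'bob', 'charlie'])

by_label = df.loc['bob', 'Age']   # label-based lookup
by_position = df.iloc[1]['Age']   # position-based lookup of the same row
```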
4.2 Filtering, Sorting, and Grouping
Filtering: Apply conditions to filter rows.
# Filter rows where Age > 30
filtered_df = df[df['Age'] > 30]
Sorting: Sort by column values.
# Sort by Age
sorted_df = df.sort_values(by='Age')
Grouping: Use groupby() for aggregation.
# Group by a column and apply aggregation
grouped = df.groupby('Age')['Salary'].mean()
5. Handling Missing Data
Pandas provides ways to handle missing or NaN values.
# Check for missing values
print(df.isnull())

# Drop rows with missing values
df_cleaned = df.dropna()

# Fill missing values
df_filled = df.fillna(0)
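Beyond filling with a constant, two common strategies are forward-filling and filling with a statistic such as the mean. A small sketch:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

filled_forward = s.ffill()        # propagate the last valid value forward
filled_mean = s.fillna(s.mean())  # fill gaps with the mean of the valid values
```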
6. Data Transformation
6.1 Adding/Removing Columns
# Add a new column
df['Bonus'] = [1000, 1500, 2000]

# Remove a column
df = df.drop('Bonus', axis=1)
6.2 Merging and Concatenating DataFrames
You can merge DataFrames similar to SQL joins.
# Merge two DataFrames
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 60000]})
merged_df = pd.merge(df1, df2, on='ID')
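By default merge performs an inner join; the how parameter selects other join types, much like SQL. A sketch with made-up data where the keys only partially overlap:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 60000]})

inner = pd.merge(df1, df2, on='ID')             # keeps only matching IDs
left = pd.merge(df1, df2, on='ID', how='left')  # keeps all rows of df1; unmatched get NaN
```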
6.3 Pivoting and Melting
Pivoting reshapes data based on specific values, while melting transforms it back.
# Pivot a DataFrame
pivot_df = df.pivot(index='ID', columns='Name', values='Salary')

# Melt a DataFrame (reset the index first so 'ID' is a column again)
melted_df = pd.melt(pivot_df.reset_index(), id_vars='ID')
7. Data Aggregation
Aggregation allows you to compute summaries of your data.
# Aggregate functions
df_agg = df.groupby('Name').agg({'Age': 'mean', 'Salary': 'sum'})

# Custom aggregation
custom_agg = df.groupby('Age').apply(lambda x: x['Salary'].sum())
8. Working with Time Series Data
Pandas has robust support for time series data.
# Convert a column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Set a datetime column as the index
df.set_index('Date', inplace=True)

# Resample time series data (e.g., monthly; use 'ME' in pandas 2.2+)
monthly_data = df.resample('M').mean()
Key time series features:
- Date parsing and formatting.
- Resampling for aggregating or downsampling data.
- Rolling windows for moving averages.
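The rolling windows mentioned above can be sketched as follows (the dates and values are illustrative):

```python
import pandas as pd

# A tiny daily series
idx = pd.date_range('2024-01-01', periods=5, freq='D')
s = pd.Series([1, 2, 3, 4, 5], index=idx)

# 3-day moving average; the first two entries are NaN until the window fills
rolling_mean = s.rolling(window=3).mean()
```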
9. Visualization with Pandas
Pandas integrates well with Matplotlib for quick plotting.
import matplotlib.pyplot as plt
# Simple plot
df['Salary'].plot(kind='line')

# Plotting grouped data
df.groupby('Age')['Salary'].mean().plot(kind='bar')
plt.show()
For more advanced visualizations, you can use libraries like Seaborn or Plotly in combination with Pandas.
10. Performance Optimization
Pandas offers several ways to improve performance:
10.1 Vectorized Operations
Pandas supports vectorized operations that are much faster than row-by-row operations.
# Vectorized operation
df['New_Column'] = df['Salary'] * 1.1
10.2 Avoiding Loops
Always prefer Pandas’ built-in functions over loops for better performance.
# Bad practice: looping through rows
for i, row in df.iterrows():
    df.at[i, 'New_Column'] = row['Salary'] * 1.1

# Better practice: use vectorization
df['New_Column'] = df['Salary'] * 1.1
10.3 Optimizing Memory Usage
To optimize memory usage, you can downcast numeric types.
# Downcast integer types
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')

# Downcast float types
df['Salary'] = pd.to_numeric(df['Salary'], downcast='float')
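To check whether downcasting actually helped, compare memory usage before and after. A small sketch (the 1000-row range is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': list(range(1000))})  # stored as int64 by default

before = df['Age'].memory_usage(deep=True)
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')  # fits in int16
after = df['Age'].memory_usage(deep=True)
```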
10.4 Working with Large Datasets
For large datasets, consider loading only a portion of the data at a time.
# Load only the first 1000 rows
df = pd.read_csv('large_file.csv', nrows=1000)

# Load specific columns
df = pd.read_csv('large_file.csv', usecols=['Name', 'Age'])
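For files too large to hold in memory at all, read_csv also accepts a chunksize argument and returns an iterator of DataFrames, so you can aggregate incrementally. A sketch using an in-memory stand-in for a file on disk:

```python
import io
import pandas as pd

# Stand-in for a large file (in practice: pd.read_csv('large_file.csv', chunksize=...))
csv_data = io.StringIO('value\n' + '\n'.join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    # Each chunk is a regular DataFrame; aggregate as you go
    total += chunk['value'].sum()
```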
Key Features in Pandas
- Flexible Data Structures: DataFrame and Series make handling structured data intuitive.
- Powerful File I/O: Supports reading from and writing to CSV, Excel, SQL, etc.
- Data Manipulation: Pandas provides extensive tools for reshaping, cleaning, and filtering data.
- Time Series Support: Built-in methods for handling datetime data and resampling.
- High-Level Plotting: Quick and simple visualizations for exploratory data analysis.
- Performance Optimizations: Use vectorized operations and memory-efficient methods for speed.
This advanced tutorial gives you a deeper understanding of the core concepts and more complex operations in Pandas. By focusing on performance, memory optimization, and flexible data handling, you’ll be equipped to handle even the largest datasets efficiently.