Pandas Power-Up: Data Transformation & Combination – Level Up Your Analysis

Issue 22: Melt, Pivot, Merge, Concatenate... Unleash the Magic of Pandas! 🧙‍♂️

Jul 04, 2024

Welcome back, data ninjas!

You've mastered the art of slicing and dicing your data. Now, get ready to transform and combine your datasets like a pro! This newsletter will teach you powerful techniques to reshape, merge, and manipulate your information, opening up a world of new possibilities for your analysis.

Why Transform and Combine Data?

Think of your data like ingredients in a kitchen. You can transform them – chop, mix, blend – to create delicious dishes.Similarly, transforming data helps you reshape it into a format that's easier to analyze. Combining data, like merging multiple datasets, allows you to gain a more comprehensive view of your information.

Pandas Power Moves:

Melt: Turn wide data into long data (think spreadsheets vs. databases). This is handy when you want to analyze multiple columns as categories.
Example:

# Sample DataFrame
data = {'Product': ['A', 'B', 'C'],
        'Jan': [10, 20, 30],
        'Feb': [15, 25, 35]}
df = pd.DataFrame(data)

melted_df = df.melt(id_vars='Product', var_name='Month', value_name='Sales')
print(melted_df)

Output:

  Product Month  Sales
0       A   Jan     10
1       B   Jan     20
2       C   Jan     30
3       A   Feb     15
4       B   Feb     25
5       C   Feb     35

Pivot: The opposite of melt! Transform long data into wide data for summarizing.
Example: Using the 'melted_df' from above

pivoted_df = melted_df.pivot(index='Product', columns='Month', values='Sales')
print(pivoted_df)

Output:

Month   Jan  Feb
Product        
A       10   15
B       20   25
C       30   35

Merge: Combine two DataFrames based on shared columns, like merging customer information with purchase history.
Example:

# Customer DataFrame
customer_data = {'ID': [1, 2, 3],
                 'Name': ['Alice', 'Bob', 'Charlie']}
customers = pd.DataFrame(customer_data)

# Purchase DataFrame
purchase_data = {'ID': [1, 1, 2],
                 'Product': ['X', 'Y', 'Z']}
purchases = pd.DataFrame(purchase_data)

merged_df = pd.merge(customers, purchases, on='ID')
print(merged_df)

Output:

   ID    Name Product
0  1   Alice       X
1  1   Alice       Y
2  2     Bob       Z

Concatenate: Stack DataFrames on top of each other or side-by-side to increase your sample size.
Example:

# DataFrame 1
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# DataFrame 2
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

concatenated_df = pd.concat([df1, df2])
print(concatenated_df)

Output:

Challenge Time!

Data Transformation Challenge: Download a dataset of your choice and practice using melt or pivot to reshape it.
Data Combining Challenge: Find two related datasets and try merging or concatenating them to create a unified view.

Poll Time! 🗳️

Stay Tuned!

Solutions to the previous Challenges (Issue 21)

Let's assume your DataFrame from the previous newsletter looked like this:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [12, 15, 10, 12, 13],
        'Favorite_Sport': ['Soccer', 'Basketball', 'Swimming', 'Soccer', 'Tennis']}

df = pd.DataFrame(data)

Here are the answers to the challenges:

Challenge 1: Find the Youngest

youngest = df[df['Age'] == df['Age'].min()]
print(youngest)

Output:

     Name  Age Favorite_Sport
2  Charlie   10       Swimming

Explanation:

df['Age'] == df['Age'].min(): This part finds all rows where the Age column is equal to the minimum value in the Age column.
df[...]: This filters the DataFrame df to keep only the rows that satisfy the condition.

Challenge 2: Filter by Multiple Conditions

twelve_and_soccer = df[(df['Age'] == 12) & (df['Favorite_Sport'] == 'Soccer')]
print(twelve_and_soccer)

Output:

   Name  Age Favorite_Sport
0  Alice   12        Soccer
3  David   12        Soccer

Explanation:

df['Age'] == 12: This part finds all rows where the Age column is equal to 12.
df['Favorite_Sport'] == 'Soccer': This part finds all rows where the Favorite_Sport column is equal to 'Soccer'.
(...) & (...): The & operator combines the two conditions, so we only keep rows where both conditions are true.

Challenge 3: Create a New DataFrame

new_df = df[['Name', 'Favorite_Sport']]
print(new_df)

Output:

     Name Favorite_Sport
0   Alice        Soccer
1     Bob    Basketball
2  Charlie       Swimming
3   David        Soccer
4   Emily        Tennis

Explanation:

df[['Name', 'Favorite_Sport']]: By passing a list of column names to the DataFrame, we create a new DataFrame that includes only those selected columns.

Let me know if you'd like any more explanations or have other questions!

Young Pythoneers

Discussion about this post