Part 3: Advanced Data Analysis with Pandas: Going Beyond the Basics
Pandas is a powerful Python library for data manipulation and analysis. In this article, we will build on the basics and explore advanced data analysis techniques using Pandas. By the end, you will have a solid understanding of combining datasets, groupby operations, pivot tables, advanced indexing, vectorized string operations, and handling large datasets, allowing you to tackle complex data analysis tasks with ease.
Combining datasets:
Merge:
The merge() function in Pandas allows you to combine two DataFrames based on one or more keys. This operation is similar to SQL joins, and you can specify which key(s) to merge on.
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})
# Merge based on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:\n", merged_df)
#output
Merged DataFrame:
ID Name Age
0 1 Alice 25
1 2 Bob 30
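The how parameter controls the join type ('inner', 'outer', 'left', or 'right'). As a sketch, an outer merge keeps all keys from both frames and fills the gaps with NaN:
# Outer merge keeps IDs 3 and 4 even though each appears in only one frame
outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_df)
#output
ID Name Age
0 1 Alice 25.0
1 2 Bob 30.0
2 3 Charlie NaN
3 4 NaN 35.0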
Join:
The join() function in Pandas allows you to combine two DataFrames based on their indices. By default, join() performs a left join (how='left').
# Sample DataFrames
df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
df2 = pd.DataFrame({'Age': [25, 30, 35]}, index=[1, 2, 4])
# Join based on index
joined_df = df1.join(df2, how='inner')
print("Joined DataFrame:\n", joined_df)
#Output
Joined DataFrame:
Name Age
1 Alice 25
2 Bob 30
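For contrast, a minimal sketch of the default left join, which keeps every row of df1 and fills the missing age with NaN:
# Default how='left': all rows of df1 survive
left_joined = df1.join(df2)
print(left_joined)
#output
Name Age
1 Alice 25.0
2 Bob 30.0
3 Charlie NaN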
Append:
The append() function in Pandas allows you to append the rows of one DataFrame to another. append() returns a new object and does not modify the original DataFrame. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0; a pd.concat() equivalent is sketched after the example below.
# Appending rows
appended_df = df1.append({'Name': 'David'}, ignore_index=True)
print("Appended DataFrame:\n", appended_df)
#output
Appended DataFrame:
Name
0 Alice
1 Bob
2 Charlie
3 David
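On pandas 2.0 and later, the same result can be obtained with pd.concat(); a minimal sketch using the df1 defined above:
# pd.concat() equivalent of the append() call above
new_row = pd.DataFrame([{'Name': 'David'}])
appended_df = pd.concat([df1, new_row], ignore_index=True)
print("Appended DataFrame:\n", appended_df)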
Concatenate:
The concat()
function in Pandas allows you to concatenate multiple DataFrames along rows or columns.
# Concatenating along rows
concatenated_df = pd.concat([df1, df2], axis=0)
print("Concatenated DataFrame (along rows):\n", concatenated_df)
#output
Concatenated DataFrame (along rows):
Name Age
1 Alice NaN
2 Bob NaN
3 Charlie NaN
1 NaN 25.0
2 NaN 30.0
4 NaN 35.0
pd.concat() preserves the original indices even if the result contains duplicates. We can see that indices 1 and 2 are repeated. If you want to avoid this, pass the ignore_index=True flag:
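For example, with the same df1 and df2:
# Resetting the index while concatenating
concatenated_df = pd.concat([df1, df2], ignore_index=True)
print(concatenated_df)
#output
Name Age
0 Alice NaN
1 Bob NaN
2 Charlie NaN
3 NaN 25.0
4 NaN 30.0
5 NaN 35.0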
When concatenating along axis=1, DataFrames are combined horizontally, i.e. column-wise. Here's how you can do it using the concat() function:
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
# Concatenating along columns (horizontally)
concatenated_df = pd.concat([df1, df2], axis=1)
print("Concatenated DataFrame (along columns):\n", concatenated_df)
#Output
Concatenated DataFrame (along columns):
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
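If you need to remember which frame each column came from, concat() also accepts a keys argument that builds a hierarchical column index; a small sketch:
# Tagging each source frame with a key
labelled_df = pd.concat([df1, df2], axis=1, keys=['first', 'second'])
print(labelled_df)
#output
first second
A B C D
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12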
These examples demonstrate how to use the merge(), join(), append(), and concat() functions in Pandas to combine datasets in various ways, providing flexibility in data manipulation and analysis.
Groupby Operations:
Similar to SQL, Pandas lets you group your data and run calculations over each group. Groupby operations split your dataset into groups based on one or more keys so you can perform operations on each group.
Here's a breakdown of what the groupby() function returns and how it works:
GroupBy Object:
The initial return value of the groupby() function is a GroupBy object. It doesn't contain actual data but represents a collection of DataFrame groups based on the grouping criteria.
Aggregation:
When you apply an aggregation function (e.g., sum(), mean(), count()) to the GroupBy object, Pandas performs the aggregation on each group individually. The result is returned as a new DataFrame or Series, depending on the function used.
Iteration:
You can iterate over the GroupBy object to access each group individually using a loop or the apply() function. This allows you to perform custom operations on each group or extract specific information.
Transformation:
Additionally, you can transform groups using the transform() function. This applies a function to each group and returns a DataFrame with the same shape as the original, where each value corresponds to the result of the transformation applied to its group.
We'll explore various aggregation functions like sum, mean, median, etc., and how to apply custom aggregation functions to your groups.
import pandas as pd
# Sample DataFrame
data = {
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [100, 200, 150, 300, 120],
'Region': ['North', 'South', 'North', 'South', 'North']
}
df = pd.DataFrame(data)
# Grouping by 'Region'
grouped = df.groupby('Region')
# Displaying the GroupBy object
print(grouped)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f81a0651580>
The output shows that grouped is a DataFrameGroupBy object. You can also iterate over the GroupBy object to access each group:
# Iterating over groups and displaying first few rows of each group
for name, group in grouped:
    print("Region:", name)
    print(group.head())
    print()
#output
Region: North
Product Sales Region
0 A 100 North
2 A 150 North
4 A 120 North
Region: South
Product Sales Region
1 B 200 South
3 B 300 South
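If you only need one group, get_group() fetches it directly without a loop:
# Retrieving a single group by its key
north = grouped.get_group('North')
print(north)
#output
Product Sales Region
0 A 100 North
2 A 150 North
4 A 120 North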
The .agg() method allows you to compute one or more aggregation functions on each group of a DataFrame.
# Aggregating with .agg()
aggregated_data = grouped.agg({'Sales': ['sum', 'mean']})
print(aggregated_data)
#output
Sales
sum mean
Region
North 370 123.333333
South 500 250.000000
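On pandas 0.25 and later, named aggregation achieves the same thing with flat, readable column names:
# Named aggregation: you choose the output column names
named = grouped.agg(total_sales=('Sales', 'sum'), avg_sales=('Sales', 'mean'))
print(named)
#output
total_sales avg_sales
Region
North 370 123.333333
South 500 250.000000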
The .filter() method allows you to subset groups based on a condition.
# Filtering groups with .filter()
filtered_data = grouped.filter(lambda x: x['Sales'].sum() > 400)
print(filtered_data)
#output
Product Sales Region
1 B 200 South
3 B 300 South
The .transform() method applies a function to each group independently and returns a DataFrame with the same shape as the original.
# Transforming with .transform()
transformed_data = grouped['Sales'].transform(lambda x: x - x.mean())
print(transformed_data)
#output
0 -23.333333
1 -50.000000
2 26.666667
3 50.000000
4 -3.333333
Name: Sales, dtype: float64
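A common use of transform() is broadcasting a group-level statistic back onto the original rows, for example each sale's share of its region's total:
# Share of each sale within its region (North total = 370, South total = 500)
df['Sales_share'] = df['Sales'] / grouped['Sales'].transform('sum')
print(df['Sales_share'].round(2))
#output
0 0.27
1 0.40
2 0.41
3 0.60
4 0.32
Name: Sales_share, dtype: float64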
The .apply() method allows you to apply a function to each group, and it can return a scalar, DataFrame, or Series.
# Applying custom function with .apply()
def custom_function(group):
    return group.nlargest(2, columns='Sales')
applied_data = grouped.apply(custom_function)
print(applied_data)
#output
Product Sales Region
Region
North 2 A 150 North
4 A 120 North
South 3 B 300 South
1 B 200 South
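When the function returns a scalar per group, apply() collapses the result into a Series; for example, the sales range within each region:
# Scalar per group: the result is a Series indexed by Region
sales_range = grouped.apply(lambda g: g['Sales'].max() - g['Sales'].min())
print(sales_range)
#output
Region
North 50
South 100
dtype: int64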
Convert a DataFrame into a Pivot Table:
A pivot table is useful for summarizing and analyzing the patterns in your data.
Pandas supports spreadsheet-like pivot tables that allow quick data summarization. If you want to turn your DataFrame into a pivot table, use pd.pivot_table().
import pandas as pd
# Sample DataFrame
data = {
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [100, 200, 150, 300, 120],
'Region': ['North', 'South', 'North', 'South', 'North']
}
df = pd.DataFrame(data)
# Creating a pivot table
pivot_table = df.pivot_table(index='Region', columns='Product', values='Sales', aggfunc='sum', fill_value=0)
print(pivot_table)
#output
Product A B
Region
North 370 0
South 0 500
In this example:
index specifies the column to use for the index of the pivot table (in this case, 'Region').
columns specifies the column to use for the columns of the pivot table (in this case, 'Product').
values specifies the column to use for populating the values in the pivot table (in this case, 'Sales').
aggfunc specifies the aggregation function to use when multiple values correspond to the same index/column pair. Here, we use 'sum' to sum the sales values.
fill_value specifies the value used to replace missing values in the pivot table (optional).
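pivot_table() also accepts margins=True, which appends row and column totals under the label 'All'; a quick sketch:
# Adding grand totals with margins=True
pivot_with_totals = df.pivot_table(index='Region', columns='Product', values='Sales', aggfunc='sum', fill_value=0, margins=True)
print(pivot_with_totals)
#output
Product A B All
Region
North 370 0 370
South 0 500 500
All 370 500 870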
Unpivot a DataFrame:
The pd.melt() function in Pandas reshapes a DataFrame from wide format to long format. It "unpivots" the data, often making it easier to analyze and visualize.
# Reshaping back to long format using melt
melted_df = pd.melt(pivot_table.reset_index(), id_vars='Region', var_name='Product', value_name='Sales')
print("\nReshaped DataFrame (long format):\n", melted_df)
#output
Reshaped DataFrame (long format):
Region Product Sales
0 North A 370
1 South A 0
2 North B 0
3 South B 500
Advanced Indexing:
Pandas offers powerful indexing capabilities beyond basic row/column selection. We'll dive into:
Hierarchical Indexing: Creating and manipulating DataFrames with multiple index levels.
Boolean Indexing: Filtering data based on boolean conditions.
Understanding these indexing techniques will enable you to efficiently extract and manipulate subsets of your data.
# Creating a DataFrame with hierarchical index
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['Letter', 'Number'])
df_hierarchical = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=index)
# Selecting data using loc
print(df_hierarchical.loc['A'])
# Selecting data using iloc
print(df_hierarchical.iloc[1:3])
# Boolean indexing
print(df_hierarchical[df_hierarchical['Values'] > 20])
#output
Values
Number
1 10
2 20

Values
Letter Number
A 2 20
B 1 30

Values
Letter Number
B 1 30
B 2 40
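For selecting across levels of a MultiIndex, the xs() cross-section method is also convenient; a sketch:
# Cross-section: every row whose 'Number' level equals 1
print(df_hierarchical.xs(1, level='Number'))
#output
Values
Letter
A 10
B 30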
Vectorized string operations:
Vectorized string operations in Pandas allow you to efficiently perform element-wise string operations on entire arrays of data without having to loop over each element individually. This capability is particularly useful when dealing with string data in DataFrame columns.
Here are some common vectorized string operations available in Pandas:
Accessing String Methods:
You can access string methods directly on a Pandas Series containing string data using the .str accessor.
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David']}
df = pd.DataFrame(data)
# Convert all names to uppercase
df['Name'] = df['Name'].str.upper()
print(df)
#Output
Name
0 ALICE
1 BOB
2 CHARLIE
3 DAVID
String Contains:
Check whether each string contains a specific substring:
# Check if names contain 'li'
df['Contains_li'] = df['Name'].str.contains('li')
print(df)
#output
Name Contains_li
0 ALICE True
1 BOB False
2 CHARLIE True
3 DAVID False
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method:
str.upper(): Convert strings to uppercase.
str.lower(): Convert strings to lowercase.
str.capitalize(): Capitalize the first character of each string.
str.title(): Capitalize the first character of each word.
str.strip(): Remove leading and trailing whitespace.
str.lstrip(): Remove leading whitespace.
str.rstrip(): Remove trailing whitespace.
str.startswith(): Check if strings start with a specified substring.
str.endswith(): Check if strings end with a specified substring.
str.contains(): Check if strings contain a specified substring.
str.replace(): Replace occurrences of a substring with another string.
str.split(): Split strings into substrings based on a delimiter.
str.join(): Join a list of strings into a single string using a specified delimiter.
str.extract(): Extract substrings using a regular expression.
str.extractall(): Extract all occurrences of substrings using a regular expression.
str.len(): Compute the length of each string.
str.isalnum(): Check if strings are alphanumeric.
str.isalpha(): Check if strings are alphabetic.
str.isnumeric(): Check if strings are numeric.
str.isdecimal(): Check if strings are decimal.
str.isdigit(): Check if strings consist of digits.
str.islower(): Check if strings are lowercase.
str.isupper(): Check if strings are uppercase.
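As a quick illustration of the regex-based methods, str.extract() pulls the first match of a capture group out of each string (a sketch reusing the df above, whose names were upper-cased earlier):
# Extract the first two letters of each name with a regex capture group
df['Initials'] = df['Name'].str.extract(r'^(\w{2})', expand=False)
print(df[['Name', 'Initials']])
#output
Name Initials
0 ALICE AL
1 BOB BO
2 CHARLIE CH
3 DAVID DA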
Handling Large Datasets:
As datasets grow in size, memory management becomes crucial. We'll discuss strategies for handling large datasets efficiently:
Chunking: Processing data in smaller, manageable chunks.
Memory Optimization: Techniques for reducing memory usage, such as using categorical data types.
Out-of-Memory Computation: Leveraging tools like Dask for parallel and out-of-memory computation.
# Creating a large DataFrame with a repetitive string column
large_df = pd.DataFrame({'A': range(1000000), 'B': range(1000000)})
large_df['Category'] = ['X', 'Y'] * 500000
# Chunking: read a (hypothetical) large_data.csv in manageable pieces
for chunk in pd.read_csv('large_data.csv', chunksize=10000):
    process(chunk)  # placeholder for your per-chunk logic
# Memory Optimization: repetitive strings shrink dramatically as categoricals
large_df['Category'] = large_df['Category'].astype('category')
# Out-of-Memory Computation using Dask
import dask.dataframe as dd
ddf = dd.from_pandas(large_df, npartitions=4)  # e.g. one partition per core
result = ddf.groupby('Category').sum().compute()
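To see what the categorical conversion buys you, compare the column's memory footprint before and after; a minimal sketch:
# Measuring the memory saved by the category dtype
as_object = large_df['Category'].astype('object')
print("object dtype: ", as_object.memory_usage(deep=True), "bytes")
print("category dtype:", large_df['Category'].memory_usage(deep=True), "bytes")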
These techniques will help you work with datasets that exceed the available memory of your system.
Conclusion:
By mastering advanced data analysis techniques with Pandas, you'll be better equipped to derive meaningful insights from your data. Whether you're performing complex aggregations, manipulating multi-level indices, or handling large datasets, Pandas provides the tools you need to tackle diverse data analysis tasks effectively. Keep exploring and experimenting with Pandas to become proficient in the field of data science.