Part 2: Exploring Data with Pandas: Essential Operations and Functions
In this article, we'll focus on the essential operations of selecting, filtering, and transforming data using Pandas, along with key functions for descriptive statistics, handling missing data, and dealing with duplicates.
Let's dive into each of these core functionalities:
1. Selecting Data:
Pandas provides powerful mechanisms for selecting specific data from a DataFrame or Series.
Basic indexing and slicing operations allow you to select rows, columns, or individual elements based on their labels or positions.
Additionally, Pandas offers advanced selection techniques using methods like .loc[] and .iloc[], which enable selecting data based on labels or integer positions, respectively.
Basic slicing and indexing:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)
# Selecting a single column
column_A = df['A']
print(column_A)
# Selecting multiple columns
columns_AB = df[['A', 'B']]
print(columns_AB)
# Slicing rows
sliced_rows = df[1:4] # Select rows from index 1 to 3
print(sliced_rows)
# Slicing rows and columns simultaneously
sliced_rows_cols = df.loc[1:3, ['A', 'B']]
print(sliced_rows_cols)
# Output
column_A:
0 1
1 2
2 3
3 4
4 5
Name: A, dtype: int64
columns_AB:
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
sliced_rows:
A B
1 2 b
2 3 c
3 4 d
sliced_rows_cols:
A B
1 2 b
2 3 c
3 4 d
Key differences between .loc[] and .iloc[], and when to use each:
.loc[]: Selects data using labels (index names or column names). It's generally preferred for readability and maintainability, especially when working with named indices or columns.
.iloc[]: Selects data using integer positions (0-based indexing). It's primarily used when you need exact positional access, such as when indices or column names are unreliable or dynamic.
import pandas as pd
# Create a DataFrame with named index and columns
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
# Select row with index 'B' (by label)
row_by_label = df.loc['B']
# Select column 'Age' (by label)
column_by_label = df.loc[:, 'Age']
# Select specific rows and columns by labels
subset_by_label = df.loc[['A', 'C'], ['Name', 'Age']]
print("Row with index 'B' (by label):")
print(row_by_label)
print("Column 'Age' (by label):")
print(column_by_label)
print("Specific rows and columns by labels:")
print(subset_by_label)
# Select the first row (0-based index)
first_row_by_position = df.iloc[0]
# Select the second column (position 1, 0-based)
second_column_by_position = df.iloc[:, 1]
# Select rows at positions 1 and 2 (the stop position 3 is excluded)
subset_by_position = df.iloc[1:3]
# Select the first two rows and the last column
mixed_selection = df.iloc[:2, -1]
print("First row (by position):")
print(first_row_by_position)
print(first_row_by_position)
print("Second column (by position):")
print(Rows from index 1 to 2 (excluding the last one):)
print(subset_by_position)
print("First two rows and the last column:")
print(mixed_selection)
# Output
Row with index 'B' (by label):
Name Bob
Age 30
Name: B, dtype: object
Column 'Age' (by label):
A 25
B 30
C 35
Name: Age, dtype: int64
Specific rows and columns by labels:
Name Age
A Alice 25
C Charlie 35
First row (by position):
Name Alice
Age 25
Name: A, dtype: object
Second column (by position):
A 25
B 30
C 35
Name: Age, dtype: int64
Rows at positions 1 and 2:
Name Age
B Bob 30
C Charlie 35
First two rows and the last column:
A 25
B 30
Name: Age, dtype: int64
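One subtle difference worth illustrating: .loc[] slices are inclusive of both endpoints because they refer to labels, while .iloc[] slices follow Python's usual end-exclusive convention. Here is a minimal sketch of that behavior on a DataFrame with a default integer index (the variable names are just for illustration):
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30, 40, 50]})
# .loc[1:3] treats 1 and 3 as labels, so both endpoints are included
print(df.loc[1:3])   # rows labeled 1, 2, and 3
# .iloc[1:3] treats 1 and 3 as positions, so the stop position is excluded
print(df.iloc[1:3])  # rows at positions 1 and 2 only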
2. Filtering Data:
Filtering data involves selecting rows or columns based on certain conditions. Pandas allows you to filter data using Boolean indexing, where you specify conditions that determine which rows or columns to include. You can also use the .query() method to filter data using SQL-like queries, providing a more expressive and intuitive way to filter datasets.
import pandas as pd
# Create a DataFrame
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]}
df = pd.DataFrame(data)
# Boolean indexing to filter rows where column A values are greater than 2
filtered_df = df[df['A'] > 2]
print(filtered_df)
# Using .query() method to filter rows where column A values are greater than 2
filtered_df = df.query('A > 2')
print(filtered_df)
# Output
A B
2 3 7
3 4 8
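Conditions can also be combined. With Boolean indexing you use & (and), | (or), and ~ (not), wrapping each condition in parentheses; .query() accepts the keywords and/or directly. A short sketch of both styles, reusing the same data (the variable names here are illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
# Combine conditions with &; each condition needs its own parentheses
combined = df[(df['A'] > 1) & (df['B'] < 8)]
print(combined)
# The equivalent .query() expression reads closer to SQL
print(df.query('A > 1 and B < 8'))
# isin() keeps rows whose values appear in a list of candidates
print(df[df['A'].isin([1, 3])])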
3. Transforming Data:
Data transformation involves modifying the structure or content of a DataFrame or Series.
Pandas provides various functions and methods for transforming data, including:
map() and apply(), which allow you to apply functions element-wise or along specific axes.
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Square each value in column 'A' element-wise with map()
df['A_squared'] = df['A'].map(lambda x: x**2)
# Define a function to double the values
def double(x):
    return x * 2
# Apply the function element-wise to column 'A'
df['A_doubled'] = df['A'].apply(double)
print(df)
# Output
A B A_squared A_doubled
0 1 10 1 2
1 2 20 4 4
2 3 30 9 6
3 4 40 16 8
4 5 50 25 10
assign(), which enables creating new columns or overwriting existing ones based on computed values. Note that assign() returns a new DataFrame; the original is not modified unless you reassign it.
# Create a new column as the sum of 'A' and 'B'
df = df.assign(A_plus_B=df['A'] + df['B'])
print("\nDataFrame after applying assign():\n", df)
# Output
DataFrame after applying assign():
A B A_squared A_doubled A_plus_B
0 1 10 1 2 11
1 2 20 4 4 22
2 3 30 9 6 33
3 4 40 16 8 44
4 5 50 25 10 55
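apply() also works on a whole DataFrame, where the axis parameter controls whether the function receives each column (axis=0, the default) or each row (axis=1). Here's a brief sketch of both, on a small illustrative frame rather than the one above:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
# axis=0 (default): the function receives each column as a Series
print(df.apply(lambda col: col.max() - col.min()))  # A: 2, B: 20
# axis=1: the function receives each row as a Series
print(df.apply(lambda row: row['A'] + row['B'], axis=1))  # 11, 22, 33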
4. Descriptive Statistics:
Descriptive statistics provide summaries and insights into the distribution and characteristics of a dataset.
Pandas offers a wide range of descriptive statistical functions, such as mean(), median(), sum(), std(), min(), and max(), among others. These functions can be applied to entire datasets, specific columns, or rows to compute summary statistics.
import pandas as pd
# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data)
# Summary statistics using describe()
print(df.describe())
# Mean of each column
print("Mean:")
print(df.mean())
# Median of each column
print("Median:")
print(df.median())
# Mode of each column
print("Mode:")
print(df.mode())
# Standard deviation of each column
print("Standard Deviation:")
print(df.std())
# Variance of each column
print("Variance:")
print(df.var())
# Minimum of each column
print("Minimum:")
print(df.min())
# Maximum of each column
print("Maximum:")
print(df.max())
# 25th, 50th, and 75th percentiles of each column
print("Quantiles:")
print(df.quantile([0.25, 0.5, 0.75]))
# Output
A B
count 5.000000 5.000000
mean 3.000000 7.000000
std 1.581139 1.581139
min 1.000000 5.000000
25% 2.000000 6.000000
50% 3.000000 7.000000
75% 4.000000 8.000000
max 5.000000 9.000000
Mean:
A 3.0
B 7.0
dtype: float64
Median:
A 3.0
B 7.0
dtype: float64
Mode:
A B
0 1 5
1 2 6
2 3 7
3 4 8
4 5 9
Standard Deviation:
A 1.581139
B 1.581139
dtype: float64
Variance:
A 2.5
B 2.5
dtype: float64
Minimum:
A 1
B 5
dtype: int64
Maximum:
A 5
B 9
dtype: int64
Quantiles:
A B
0.25 2.0 6.0
0.50 3.0 7.0
0.75 4.0 8.0
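The same statistics can be computed for a single column, or across rows by passing axis=1; agg() lets you request several at once. A quick sketch using the DataFrame above:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 6, 7, 8, 9]})
# Statistic for a single column
print(df['A'].mean())      # 3.0
# Row-wise statistics with axis=1
print(df.mean(axis=1))     # mean of A and B for each row
# agg() computes several statistics in one call
print(df.agg(['mean', 'std', 'min', 'max']))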
5. Handling Missing Data:
Missing data is a common issue in real-world datasets and needs to be addressed before analysis.
Pandas provides functions like isnull(), notnull(), dropna(), and fillna(), which allow you to detect missing values, drop rows or columns containing missing data, or fill missing values with specified values or methods.
import pandas as pd
import numpy as np
# Create a DataFrame with missing data
data = {
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [np.nan, 10, 11, 12]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Detect missing data
print("\nMissing Data:")
print(df.isnull())
# Drop rows with any missing data
cleaned_df = df.dropna()
print("\nDataFrame after dropping rows with missing data:")
print(cleaned_df)
# Fill missing data with a scalar value
filled_df = df.fillna(0)
print("\nDataFrame after filling missing data with 0:")
print(filled_df)
# Fill missing data with the mean of the column
mean_filled_df = df.fillna(df.mean())
print("\nDataFrame after filling missing data with mean:")
print(mean_filled_df)
# Output
Original DataFrame:
A B C
0 1.0 5.0 NaN
1 2.0 NaN 10.0
2 NaN NaN 11.0
3 4.0 8.0 12.0
Missing Data:
A B C
0 False False True
1 False True False
2 True True False
3 False False False
DataFrame after dropping rows with missing data:
A B C
3 4.0 8.0 12.0
DataFrame after filling missing data with 0:
A B C
0 1.0 5.0 0.0
1 2.0 0.0 10.0
2 0.0 0.0 11.0
3 4.0 8.0 12.0
DataFrame after filling missing data with mean:
A B C
0 1.000000 5.0 11.0
1 2.000000 6.5 10.0
2 2.333333 6.5 11.0
3 4.000000 8.0 12.0
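fillna() also accepts a dictionary mapping column names to fill values, so each column can follow its own strategy. A minimal sketch, reusing the same DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8],
                   'C': [np.nan, 10, 11, 12]})
# Each column gets its own fill value: 0 for A, the column mean for B, -1 for C
custom_filled_df = df.fillna({'A': 0, 'B': df['B'].mean(), 'C': -1})
print(custom_filled_df)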
Forward fill (ffill) and backward fill (bfill) are methods used to propagate non-null values forward or backward along a specified axis to fill in missing data.
Here's a brief explanation of each:
Forward Fill (ffill): This method fills missing values with the most recent non-null value in the same column before the missing value.
Backward Fill (bfill): This method fills missing values with the next non-null value in the same column after the missing value.
import pandas as pd
import numpy as np
# Create a DataFrame with missing data
data = {
'A': [1, np.nan, 3, np.nan, 5],
'B': [np.nan, 2, np.nan, 4, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Forward fill missing values
filled_forward_df = df.ffill()
print("\nDataFrame after forward fill:")
print(filled_forward_df)
# Backward fill missing values
filled_backward_df = df.bfill()
print("\nDataFrame after backward fill:")
print(filled_backward_df)
# Output
Original DataFrame:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
DataFrame after forward fill:
A B
0 1.0 NaN
1 1.0 2.0
2 3.0 2.0
3 3.0 4.0
4 5.0 4.0
DataFrame after backward fill:
A B
0 1.0 2.0
1 3.0 2.0
2 3.0 4.0
3 5.0 4.0
4 5.0 NaN
6. Dealing with Duplicates:
Duplicate rows or columns can skew analysis results and need to be identified and handled appropriately.
Pandas offers methods like duplicated() and drop_duplicates() to identify and remove duplicate rows, optionally restricted to specific columns (see the subset sketch after the example below). These functions help ensure data integrity and accuracy by eliminating redundant information from the dataset.
import pandas as pd
# Create a sample DataFrame with duplicate rows
data = {
'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
'Age': [25, 30, 35, 25, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Check for duplicate rows based on all columns
print("\nDuplicate Rows:")
print(df[df.duplicated()])
# Drop duplicate rows
cleaned_df = df.drop_duplicates()
print("\nDataFrame after dropping duplicates:")
print(cleaned_df)
# Output
Original DataFrame:
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
3 John 25 New York
4 Alice 30 Los Angeles
Duplicate Rows:
Name Age City
3 John 25 New York
4 Alice 30 Los Angeles
DataFrame after dropping duplicates:
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
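Both methods accept a subset parameter to restrict the comparison to specific columns, and a keep parameter that controls which occurrence survives ('first', 'last', or False to drop every copy). A short sketch, reusing the DataFrame above:
import pandas as pd
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
})
# Deduplicate on 'Name' alone, keeping the last occurrence of each name
print(df.drop_duplicates(subset=['Name'], keep='last'))
# keep=False drops every row that has a duplicate anywhere
print(df.drop_duplicates(keep=False))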
7. Sorting DataFrames:
You can sort a DataFrame by its index using the sort_index() method in Pandas. Here's how you can do it:
import pandas as pd
# Sample DataFrame
data = {
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [100, 200, 150, 300, 120],
'Region': ['North', 'South', 'North', 'South', 'North']
}
df = pd.DataFrame(data)
# Setting 'Product' column as the index
df.set_index('Product', inplace=True)
# Sorting by index
sorted_df = df.sort_index()
print("Sorted by index:\n", sorted_df)
# Output
Sorted by index:
Sales Region
Product
A 100 North
B 200 South
C 150 North
D 300 South
E 120 North
You can sort a DataFrame by its values using the sort_values() method in Pandas. Here's how you can do it:
import pandas as pd
# Sample DataFrame
data = {
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [100, 200, 150, 300, 120],
'Region': ['North', 'South', 'North', 'South', 'North']
}
df = pd.DataFrame(data)
# Sorting by a single column
sorted_df = df.sort_values(by='Sales')
print("Sorted by Sales:\n", sorted_df)
# Sorting by multiple columns
sorted_df = df.sort_values(by=['Region', 'Sales'], ascending=[True, False])
print("\nSorted by Region and then by Sales:\n", sorted_df)
# Output
Sorted by Sales:
Product Sales Region
0 A 100 North
4 E 120 North
2 C 150 North
1 B 200 South
3 D 300 South
Sorted by Region and then by Sales:
Product Sales Region
2 C 150 North
4 E 120 North
0 A 100 North
3 D 300 South
1 B 200 South
In the first example, the DataFrame is sorted by the 'Sales' column in ascending order. In the second example, the DataFrame is sorted first by the 'Region' column in ascending order, and then by the 'Sales' column in descending order. You can specify multiple columns to sort by and also control the sorting order for each column.
By mastering these essential operations and functions in Pandas, you'll be well-equipped to manipulate and explore datasets efficiently, paving the way for deeper analysis and insights in your data projects.