Part 2: Exploring Data with Pandas: Essential Operations and Functions
In this article, we'll focus on the essential operations of selecting, filtering, and transforming data using Pandas, along with key functions for descriptive statistics, handling missing data, and dealing with duplicates.
Let's dive into each of these core functionalities:
1. Selecting Data:
Pandas provides powerful mechanisms for selecting specific data from a DataFrame or Series.
Basic indexing and slicing operations allow you to select rows, columns, or individual elements based on their labels or positions.
Additionally, Pandas offers advanced selection techniques using methods like .loc[] and .iloc[], which enable selecting data based on labels or integer positions, respectively.
Basic slicing and indexing:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)
# Selecting a single column
column_A = df['A']
print(column_A)
# Selecting multiple columns
columns_AB = df[['A', 'B']]
print(columns_AB)
# Slicing rows
sliced_rows = df[1:4] # Select rows from index 1 to 3
print(sliced_rows)
# Slicing rows and columns simultaneously
sliced_rows_cols = df.loc[1:3, ['A', 'B']]
print(sliced_rows_cols)
# Output
column_A:
0 1
1 2
2 3
3 4
4 5
Name: A, dtype: int64
columns_AB:
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
sliced_rows:
A B
1 2 b
2 3 c
3 4 d
sliced_rows_cols:
A B
1 2 b
2 3 c
3 4 d
Key differences between .loc[] and .iloc[], and when to use each:
.loc[]: Selects data using labels (index names or column names). It's generally preferred for readability and maintainability, especially when working with named indices or columns.
.iloc[]: Selects data using integer positions (0-based indexing). It's primarily used when you need exact positional access, such as when indices or column names are unreliable or dynamic.
import pandas as pd
# Create a DataFrame with named index and columns
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
# Select row with index 'B' (by label)
row_by_label = df.loc['B']
# Select column 'Age' (by label)
column_by_label = df.loc[:, 'Age']
# Select specific rows and columns by labels
subset_by_label = df.loc[['A', 'C'], ['Name', 'Age']]
print("Row with index 'B' (by label):")
print(row_by_label)
print("Column 'Age' (by label):")
print(column_by_label)
print("Specific rows and columns by labels:")
print(subset_by_label)
# Select the first row (0-based index)
first_row_by_position = df.iloc[0]
# Select the second column (position 1, 0-based)
second_column_by_position = df.iloc[:, 1]
# Select rows at positions 1 and 2 (the stop position 3 is excluded)
subset_by_position = df.iloc[1:3]
# Select the first two rows and the last column
mixed_selection = df.iloc[:2, -1]
print("First row (by position):")
print(first_row_by_position)
print(first_row_by_position)
print("Second column (by position):")
print(Rows from index 1 to 2 (excluding the last one):)
print(subset_by_position)
print("First two rows and the last column:")
print(mixed_selection)
# Output
Row with index 'B' (by label):
Name Bob
Age 30
Name: B, dtype: object
Column 'Age' (by label):
A 25
B 30
C 35
Name: Age, dtype: int64
Specific rows and columns by labels:
Name Age
A Alice 25
C Charlie 35
First row (by position):
Name Alice
Age 25
Name: A, dtype: object
Second column (by position):
A 25
B 30
C 35
Name: Age, dtype: int64
Rows at positions 1 and 2:
Name Age
B Bob 30
C Charlie 35
First two rows and the last column:
A 25
B 30
Name: Age, dtype: int64
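One subtle difference worth illustrating: .loc[] slices are inclusive of both endpoints because they refer to labels, while .iloc[] slices follow Python's usual end-exclusive convention. Here is a minimal sketch of that behavior on a DataFrame with a default integer index (the variable names are just for illustration):
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30, 40, 50]})
# .loc[1:3] treats 1 and 3 as labels, so both endpoints are included
print(df.loc[1:3])   # rows labeled 1, 2, and 3
# .iloc[1:3] treats 1 and 3 as positions, so the stop position is excluded
print(df.iloc[1:3])  # rows at positions 1 and 2 only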
2. Filtering Data:
Filtering data involves selecting rows or columns based on certain conditions. Pandas allows you to filter data using Boolean indexing, where you specify conditions that determine which rows or columns to include. You can also use the .query() method to filter data using SQL-like queries, providing a more expressive and intuitive way to filter datasets.
import pandas as pd
# Create a DataFrame
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]}
df = pd.DataFrame(data)
# Boolean indexing to filter rows where column A values are greater than 2
filtered_df = df[df['A'] > 2]
print(filtered_df)
# Using .query() method to filter rows where column A values are greater than 2
filtered_df = df.query('A > 2')
print(filtered_df)
# Output
A B
2 3 7
3 4 8
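Conditions can also be combined. With Boolean indexing you use & (and), | (or), and ~ (not), wrapping each condition in parentheses; .query() accepts the keywords and/or directly. A short sketch of both styles, reusing the same data (the variable names here are illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
# Combine conditions with &; each condition needs its own parentheses
combined = df[(df['A'] > 1) & (df['B'] < 8)]
print(combined)
# The equivalent .query() expression reads closer to SQL
print(df.query('A > 1 and B < 8'))
# isin() keeps rows whose values appear in a list of candidates
print(df[df['A'].isin([1, 3])])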
3. Transforming Data:
Data transformation involves modifying the structure or content of a DataFrame or Series.
Pandas provides various functions and methods for transforming data, including:
map() and apply(), which allow you to apply functions element-wise or along specific axes.
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Square each value in column 'A' element-wise with map()
df['A_squared'] = df['A'].map(lambda x: x**2)
# Define a function to double the values
def double(x):
    return x * 2
# Apply the function element-wise to column 'A'
df['A_doubled'] = df['A'].apply(double)
print(df)
# Output
A B A_squared A_doubled
0 1 10 1 2
1 2 20 4 4
2 3 30 9 6
3 4 40 16 8
4 5 50 25 10
assign(), which enables creating new columns or overwriting existing ones based on computed values. Note that assign() returns a new DataFrame; the original is not modified unless you reassign it.
# Create a new column as the sum of 'A' and 'B'
df = df.assign(A_plus_B=df['A'] + df['B'])
print("\nDataFrame after applying assign():\n", df)
# Output
DataFrame after applying assign():
A B A_squared A_doubled A_plus_B
0 1 10 1 2 11
1 2 20 4 4 22
2 3 30 9 6 33
3 4 40 16 8 44
4 5 50 25 10 55
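apply() also works on a whole DataFrame, where the axis parameter controls whether the function receives each column (axis=0, the default) or each row (axis=1). Here's a brief sketch of both, on a small illustrative frame rather than the one above:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
# axis=0 (default): the function receives each column as a Series
print(df.apply(lambda col: col.max() - col.min()))  # A: 2, B: 20
# axis=1: the function receives each row as a Series
print(df.apply(lambda row: row['A'] + row['B'], axis=1))  # 11, 22, 33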
4. Descriptive Statistics:
Descriptive statistics provide summaries and insights into the distribution and characteristics of a dataset.
Pandas offers a wide range of descriptive statistical functions, such as mean(), median(), sum(), std(), min(), and max(), among others. These functions can be applied to entire datasets, specific columns, or rows to compute summary statistics.
import pandas as pd
# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5],
'B': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data)
# Summary statistics using describe()
print(df.describe())
# Mean of each column
print("Mean:")
print(df.mean())
# Median of each column
print("Median:")
print(df.median())
# Mode of each column
print("Mode:")
print(df.mode())
# Standard deviation of each column
print("Standard Deviation:")
print(df.std())
# Variance of each column
print("Variance:")
print(df.var())
# Minimum of each column
print("Minimum:")
print(df.min())
# Maximum of each column
print("Maximum:")
print(df.max())
# 25th, 50th, and 75th percentiles of each column
print("Quantiles:")
print(df.quantile([0.25, 0.5, 0.75]))
# Output
A B
count 5.000000 5.000000
mean 3.000000 7.000000
std 1.581139 1.581139
min 1.000000 5.000000
25% 2.000000 6.000000
50% 3.000000 7.000000
75% 4.000000 8.000000
max 5.000000 9.000000
Mean:
A 3.0
B 7.0
dtype: float64
Median:
A 3.0
B 7.0
dtype: float64
Mode:
A B
0 1 5
1 2 6
2 3 7
3 4 8
4 5 9
Standard Deviation:
A 1.581139
B 1.581139
dtype: float64
Variance:
A 2.5
B 2.5
dtype: float64
Minimum:
A 1
B 5
dtype: int64
Maximum:
A 5
B 9
dtype: int64
Quantiles:
A B
0.25 2.0 6.0
0.50 3.0 7.0
0.75 4.0 8.0
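The same statistics can be computed for a single column, or across rows by passing axis=1; agg() lets you request several at once. A quick sketch using the DataFrame above:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 6, 7, 8, 9]})
# Statistic for a single column
print(df['A'].mean())      # 3.0
# Row-wise statistics with axis=1
print(df.mean(axis=1))     # mean of A and B for each row
# agg() computes several statistics in one call
print(df.agg(['mean', 'std', 'min', 'max']))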
5. Handling Missing Data:
Missing data is a common issue in real-world datasets and needs to be addressed before analysis.
Pandas provides functions like isnull(), notnull(), dropna(), and fillna(), which allow you to detect missing values, drop rows or columns containing missing data, or fill missing values with specified values or methods.
import pandas as pd
import numpy as np
# Create a DataFrame with missing data
data = {
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [np.nan, 10, 11, 12]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Detect missing data
print("\nMissing Data:")
print(df.isnull())
# Drop rows with any missing data
cleaned_df = df.dropna()
print("\nDataFrame after dropping rows with missing data:")
print(cleaned_df)
# Fill missing data with a scalar value
filled_df = df.fillna(0)
print("\nDataFrame after filling missing data with 0:")
print(filled_df)
# Fill missing data with the mean of the column
mean_filled_df = df.fillna(df.mean())
print("\nDataFrame after filling missing data with mean:")
print(mean_filled_df)
# Output
Original DataFrame:
A B C
0 1.0 5.0 NaN
1 2.0 NaN 10.0
2 NaN NaN 11.0
3 4.0 8.0 12.0
Missing Data:
A B C
0 False False True
1 False True False
2 True True False
3 False False False
DataFrame after dropping rows with missing data:
A B C
3 4.0 8.0 12.0
DataFrame after filling missing data with 0:
A B C
0 1.0 5.0 0.0
1 2.0 0.0 10.0
2 0.0 0.0 11.0
3 4.0 8.0 12.0
DataFrame after filling missing data with mean:
A B C
0 1.000000 5.0 11.0
1 2.000000 6.5 10.0
2 2.333333 6.5 11.0
3 4.000000 8.0 12.0
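fillna() also accepts a dictionary mapping column names to fill values, so each column can follow its own strategy. A minimal sketch, reusing the same DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8],
                   'C': [np.nan, 10, 11, 12]})
# Each column gets its own fill value: 0 for A, the column mean for B, -1 for C
custom_filled_df = df.fillna({'A': 0, 'B': df['B'].mean(), 'C': -1})
print(custom_filled_df)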
Forward fill (ffill) and backward fill (bfill) are methods used to propagate non-null values forward or backward along a specified axis to fill in missing data.
Here's a brief explanation of each:
Forward Fill (ffill): This method fills missing values with the most recent non-null value in the same column before the missing value.
Backward Fill (bfill): This method fills missing values with the next non-null value in the same column after the missing value.
import pandas as pd
import numpy as np
# Create a DataFrame with missing data
data = {
'A': [1, np.nan, 3, np.nan, 5],
'B': [np.nan, 2, np.nan, 4, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Forward fill missing values
filled_forward_df = df.ffill()
print("\nDataFrame after forward fill:")
print(filled_forward_df)
# Backward fill missing values
filled_backward_df = df.bfill()
print("\nDataFrame after backward fill:")
print(filled_backward_df)
# Output
Original DataFrame:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
DataFrame after forward fill:
A B
0 1.0 NaN
1 1.0 2.0
2 3.0 2.0
3 3.0 4.0
4 5.0 4.0
DataFrame after backward fill:
A B
0 1.0 2.0
1 3.0 2.0
2 3.0 4.0
3 5.0 4.0
4 5.0 NaN
6. Dealing with Duplicates:
Duplicate rows or columns can skew analysis results and need to be identified and handled appropriately.
Pandas offers methods like duplicated() and drop_duplicates() to identify and remove duplicate rows, optionally restricted to specific columns (see the subset sketch after the example below). These functions help ensure data integrity and accuracy by eliminating redundant information from the dataset.
import pandas as pd
# Create a sample DataFrame with duplicate rows
data = {
'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
'Age': [25, 30, 35, 25, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Check for duplicate rows based on all columns
print("\nDuplicate Rows:")
print(df[df.duplicated()])
# Drop duplicate rows
cleaned_df = df.drop_duplicates()
print("\nDataFrame after dropping duplicates:")
print(cleaned_df)
# Output
Original DataFrame:
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
3 John 25 New York
4 Alice 30 Los Angeles
Duplicate Rows:
Name Age City
3 John 25 New York
4 Alice 30 Los Angeles
DataFrame after dropping duplicates:
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
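Both methods accept a subset parameter to restrict the comparison to specific columns, and a keep parameter that controls which occurrence survives ('first', 'last', or False to drop every copy). A short sketch, reusing the DataFrame above:
import pandas as pd
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'John', 'Alice'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
})
# Deduplicate on 'Name' alone, keeping the last occurrence of each name
print(df.drop_duplicates(subset=['Name'], keep='last'))
# keep=False drops every row that has a duplicate anywhere
print(df.drop_duplicates(keep=False))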
7. Sorting DataFrames:
You can sort a DataFrame by its index using the sort_index() method in Pandas. Here's how you can do it:
import pandas as pd
# Sample DataFrame
data = {
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [100, 200, 150, 300, 120],
'Region': ['North', 'South', 'North', 'South', 'North']
}
df = pd.DataFrame(data)
# Setting 'Product' column as the index
df.set_index('Product', inplace=True)
# Sorting by index
sorted_df = df.sort_index()
print("Sorted by index:\n", sorted_df)
# Output
Sorted by index:
Sales Region
Product
A 100 North
B 200 South
C 150 North
D 300 South
E 120 North
You can sort a DataFrame by its values using the sort_values() method in Pandas. Here's how you can do it:
import pandas as pd
# Sample DataFrame
data = {
'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [100, 200, 150, 300, 120],
'Region': ['North', 'South', 'North', 'South', 'North']
}
df = pd.DataFrame(data)
# Sorting by a single column
sorted_df = df.sort_values(by='Sales')
print("Sorted by Sales:\n", sorted_df)
# Sorting by multiple columns
sorted_df = df.sort_values(by=['Region', 'Sales'], ascending=[True, False])
print("\nSorted by Region and then by Sales:\n", sorted_df)
# Output
Sorted by Sales:
Product Sales Region
0 A 100 North
4 E 120 North
2 C 150 North
1 B 200 South
3 D 300 South
Sorted by Region and then by Sales:
Product Sales Region
2 C 150 North
4 E 120 North
0 A 100 North
3 D 300 South
1 B 200 South
In the first example, the DataFrame is sorted by the 'Sales' column in ascending order. In the second example, the DataFrame is sorted first by the 'Region' column in ascending order, and then by the 'Sales' column in descending order. You can specify multiple columns to sort by and also control the sorting order for each column.
By mastering these essential operations and functions in Pandas, you'll be well-equipped to manipulate and explore datasets efficiently, paving the way for deeper analysis and insights in your data projects.