Part 1: Introduction to Pandas: Your Gateway to Data Analysis in Python

Feb 15, 2024

Welcome to the world of Pandas! In this article, we'll take our first steps with Pandas, a Python library that makes data analysis a breeze. Whether you're a beginner or an experienced data enthusiast, Pandas is an essential tool for working with structured data.

What is Pandas?

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It's built on top of NumPy, another popular Python library for numerical computing, and offers additional features specifically tailored for data manipulation and analysis.

Getting Started with Pandas

Before we dive into Pandas, make sure you have it installed. You can install Pandas using pip, the Python package manager, with the following command:

pip install pandas

Once Pandas is installed, you can import it into your Python scripts or Jupyter notebooks using the following convention:

import pandas as pd

The pd alias is a common shorthand used by the Pandas community when importing the library. It allows us to refer to Pandas functions and objects more conveniently.

Understanding Series and DataFrame

At the core of Pandas are two main data structures: Series and DataFrame.

Series: A one-dimensional labeled array capable of holding any data type. It's similar to a Python list or NumPy array, but every value is paired with an index label.
DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It's akin to a spreadsheet or SQL table, where data is organized in rows and columns.

These data structures provide a powerful way to work with structured data, making it easy to perform various operations such as filtering, grouping, and aggregation.
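As a quick taste of the operations mentioned above, here is a minimal sketch of filtering, grouping, and aggregation. The team names and scores are hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "score": [85, 90, 70, 60],
})

# Filtering: keep only the rows where the score is at least 70
passed = df[df["score"] >= 70]

# Grouping and aggregation: mean score per team
mean_by_team = df.groupby("team")["score"].mean()

print(passed)
print(mean_by_team)
```

We'll look at these operations properly later in the series; for now, the point is only that each one is a single short expression.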

Constructing Series Objects:

You can create a Pandas Series in several ways, including from Python lists, NumPy arrays, dictionaries, and scalar values.

Here are examples of different methods to create a Series:

1. From a Python List:

s = pd.Series([10, 20, 30])
print(s)

0    10
1    20
2    30
dtype: int64

A Series combines a sequence of values with an explicit sequence of index labels, which we can access through the values and index attributes. The values are a NumPy array. The index is an array-like object of type pd.Index, which we discuss in more detail later in this article.

print(s.values)
print(s.index)

[10 20 30]
RangeIndex(start=0, stop=3, step=1)

The index need not consist of integers; it can hold labels of any desired data type.

data = [10, 20, 30]
index = ['A', 'B', 'C']
s = pd.Series(data, index=index)
print(s)

A    10
B    20
C    30
dtype: int64
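To see why labels are useful, here is a small sketch that looks up the same element two ways: by its label with loc and by its integer position with iloc. It reuses the Series from the example above.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["A", "B", "C"])

by_label = s.loc["B"]     # look up by index label
by_position = s.iloc[1]   # look up by integer position (0-based)

print(by_label, by_position)
```

Both lookups return the same element here, because the label 'B' sits at position 1.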

2. From a NumPy Array:

import pandas as pd
import numpy as np

# Create a NumPy array
data = np.array([10, 20, 30])

# Create a Series from a NumPy array
s = pd.Series(data)
print(s)

0    10
1    20
2    30
dtype: int64

3. From a Dictionary:

You can create a Pandas Series directly from a Python dictionary, with the keys becoming index labels and the values becoming data elements.

# Create a Python dictionary
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Convert the dictionary to a Pandas Series
s = pd.Series(data)
print(s)

a    10
b    20
c    30
d    40
dtype: int64

The index can be set explicitly to control the order of the keys or to select a subset of them.

# Create a Python dictionary
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Build a Series from only the 'c' and 'b' keys, in that order
s = pd.Series(data, index=['c', 'b'])
print(s)

c    30
b    20
dtype: int64
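A related case worth knowing: if the explicit index contains a label that is not a key in the dictionary, Pandas fills that entry with NaN. The sketch below assumes a hypothetical label 'z' that is absent from the dictionary.

```python
import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# 'z' is a hypothetical label that does not exist in the dictionary
s = pd.Series(data, index=['c', 'b', 'z'])
print(s)

# The missing entry is NaN, which forces the Series to a float dtype
print(s.dtype)
```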

4. From a Scalar Value:

import pandas as pd

# Create a Series from a scalar value
s = pd.Series(5, index=['a', 'b', 'c', 'd'])
print(s)

a    5
b    5
c    5
d    5
dtype: int64

Creating DataFrame Objects:

Creating a DataFrame in Pandas is straightforward. You can create a DataFrame from various data sources, including Python dictionaries, lists, NumPy arrays, and even other DataFrames.

Here are examples of different methods to create a DataFrame:

1. From a Python Dictionary:

import pandas as pd

# Create a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Score': [85, 90, 88]}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
print(df)
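After building a DataFrame, it often helps to check its structure. The following sketch rebuilds the same DataFrame and inspects its shape, column labels, and per-column data types.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Score': [85, 90, 88]}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels, taken from the dictionary keys
print(df.dtypes)         # one data type per column
```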

2. From a List of Lists:

import pandas as pd

# Create a list of lists
data = [['Alice', 25, 85],
        ['Bob', 30, 90],
        ['Charlie', 35, 88]]

# Create a DataFrame from the list of lists
df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
print(df)

3. From a NumPy Array:

import pandas as pd
import numpy as np

# Create a NumPy array of integers
data = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

# Create a DataFrame from the NumPy array
df = pd.DataFrame(data)

# Print the DataFrame
print("DataFrame from NumPy array:")
print(df)

DataFrame from NumPy array:
    0   1   2
0  10  20  30
1  40  50  60
2  70  80  90
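The default labels 0, 1, 2 are rarely descriptive. As a small sketch, you can pass your own column and row labels when constructing the DataFrame. The names "x", "y", "z" and "r1", "r2", "r3" are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

# Column and row labels here are hypothetical
df = pd.DataFrame(data, columns=["x", "y", "z"], index=["r1", "r2", "r3"])
print(df)
```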

4. From a Dictionary of Series or Lists:

import pandas as pd

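# Define a dictionary of Pandas Series objects.
# Every Series is indexed by city name so that Pandas can align the rows.
data = {
    "City": pd.Series(["New York", "Los Angeles", "Chicago", "Houston"], index=["New York", "Los Angeles", "Chicago", "Houston"]),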
    "Population": pd.Series([8537673, 3976322, 2695598], index=["New York", "Los Angeles", "Chicago"]),
    "Area (sq mi)": pd.Series([468.9, 468.7, 227.3, 637.5], index=["New York", "Los Angeles", "Chicago", "Houston"]),
    "Founded": pd.Series([1624, 1781, 1833], index=["New York", "Los Angeles", "Chicago"]),
}

# Create a DataFrame from the dictionary
cities_df = pd.DataFrame(data)

# Print the DataFrame
print(cities_df)


                  City  Population  Area (sq mi)  Founded
Chicago        Chicago   2695598.0         227.3   1833.0
Houston        Houston         NaN         637.5      NaN
Los Angeles  Los Angeles   3976322.0         468.7   1781.0
New York      New York   8537673.0         468.9   1624.0

The "Population" Series has entries only for New York, Los Angeles, and Chicago. Pandas aligns every Series on the combined set of index labels, so Houston's population is NaN because the "Population" Series has no Houston label.
Similarly, the "Founded" Series has no Houston label, so that value is NaN as well.
The "Area (sq mi)" Series contains values for all four cities, so that column has no missing values. The rows appear in sorted label order because the Series indices differ, and the "Population" and "Founded" columns are shown as floats because NaN is a floating-point value.

This example demonstrates how Pandas automatically aligns indices when creating a DataFrame from a dictionary of Pandas Series objects, and inserts NaN values where indices do not align.
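To find out where alignment left gaps, you can ask the DataFrame directly. This is a minimal sketch using a cut-down version of the data above (two Series only) together with isna().

```python
import pandas as pd

population = pd.Series([8537673, 3976322],
                       index=["New York", "Los Angeles"])
area = pd.Series([468.9, 468.7, 637.5],
                 index=["New York", "Los Angeles", "Houston"])

df = pd.DataFrame({"Population": population, "Area (sq mi)": area})

# isna() marks the cells that alignment filled with NaN
print(df.isna())

# An integer column that gains a NaN is stored as float64
print(df["Population"].dtype)
```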

Index Object:

The Index object in a DataFrame holds the labels of the rows. It provides a way to identify each row, although the labels are not required to be unique. The index can hold different types of labels, such as integers, strings, datetimes, or a mix of these.

The Index object is immutable, meaning that once it is created, its contents cannot be changed in place. This immutability protects data integrity and consistency, since modifying individual labels could lead to unintended consequences such as data misalignment. To change the labels, you replace the whole index with a new one.
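Here is a short sketch of what immutability means in practice: assigning to a single element of an Index raises a TypeError, while replacing the entire index of a DataFrame is allowed. The labels used are arbitrary examples.

```python
import pandas as pd

idx = pd.Index(['X', 'Y', 'Z'])

raised = False
try:
    idx[0] = 'W'  # in-place item assignment on an Index
except TypeError as err:
    raised = True
    print("TypeError:", err)

df = pd.DataFrame({'A': [1, 2, 3]}, index=idx)
df.index = ['P', 'Q', 'R']  # replacing the whole index is allowed
print(df.index)
```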

When you create a DataFrame without specifying an index, Pandas assigns a default integer index starting from 0.

You can specify a custom index when creating a DataFrame.

import pandas as pd

# Create a DataFrame with custom index
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['X', 'Y', 'Z'])
print(df.index)

You can set a column as the index of the DataFrame.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['X', 'Y', 'Z']})

# Set column 'C' as index
df.set_index('C', inplace=True)
print(df.index)
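As an alternative sketch, set_index can also be used without inplace=True, in which case it returns a new DataFrame and leaves the original untouched. reset_index does the reverse and moves the index back into an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['X', 'Y', 'Z']})

indexed = df.set_index('C')        # returns a new DataFrame; df is unchanged
restored = indexed.reset_index()   # moves the index back into a column

print(indexed.index)
print(list(restored.columns))
```

Which style to use is largely a matter of preference; the non-inplace form makes it easier to keep the original DataFrame around.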

Conclusion

In this article, we've scratched the surface of what Pandas has to offer. We've learned about its core data structures: Series, DataFrame, and Index.

In the next article, we'll delve deeper into the core functionalities of Pandas. We'll explore essential operations such as selecting, filtering, and transforming data.

Stay tuned for more Pandas goodness!

DataJourney
