Intro to Pandas — Class Notes

Try me

Open In ColabBinder

Introduction

Motivation

  • Pandas is the absolute foundation for data science in Python.

  • Essential tool if you are dealing with any kind of data.

Objectives

  • Cut through all the complexity and understand the core concepts.

  • High impact knowledge of how Pandas actually works.

  • Get you started with real-world data manipulation tasks.

Intro for non-programmers (I)

  • “Spreadsheets on steroids”: powerful data manipulation tool (like Excel, but way more powerful).

  • Overwhelmed by endless lists or arrays? Pandas provides:

    • Convenient and robust Data structures.

    • Functions to filter, aggregate, and transform data easily.

  • “Nervous system” for data analysis in Python.

Intro for non-programmers (II)

  • Use case examples:

    • Complex financial data analysis.

    • Cleaning messy datasets like survey results.

    • Combining multiple data sources.

Intro for non-programmers (III)

  • Pandas revolves around two core data structures:

    • Series: 1D labeled array (like a column in a spreadsheet).

    • DataFrame: 2D labeled table (like an entire spreadsheet with rows and columns).

  • Any single column in a DataFrame is under the hood a Series.

  • All Series in a dataframe share the same index (row labels).

  • Columns in a DataFrame can have different data types (ints, floats, strings, etc).

  • This flexibility makes Pandas ideal for real-world datasets: Different data types in a single container.

Intro for non-programmers (IV)

  • Pandas facilitates tne entire data lifecycle:

    • Loading data from various formats (CSV, JSON, Excel, SQL databases, etc).

    • Cleaning and transforming data (handling missing values, filtering rows, aggregating data, etc).

    • Analyzing and visualizing data (summary statistics, plots, etc).

  • Pandas is optimized for performance with large datasets (millions of rows).

  • Integrates seamlessly with other Python libraries (NumPy, Matplotlib, Scikit-learn) stay tuned for more!

Agenda

  • Intro and agenda (20 min)

  • Loading data (15 min)

  • Querying (15 min)

Setup pandas

Pandas is available off-the-shelf in Colab . If you are running locally, you may need to install it first:

pip install pandas

If not, just import:

[ ]:
import pandas as pd

This pdalias (de-facto standard) will become your best friend when working with data!

Create DataFrames and load data

We are going to present two different ways to create DataFrames that cover most common use cases:

  1. From scratch (e.g., for small datasets or testing): Creating datasets from Python dictionaries/lists.

  2. Loading from JSON/CSV files (most common in real-world scenarios). Other methods are supported (e.g., Excel, SQL databases, Parquet, etc) but out of scope for this intro.

Pandas can create DataFrames directly from Python dictionaries/lists.

  • Each key in the dictionary becomes a column

  • The values become the rows.

[ ]:
data = {
    "id":   [1,2,3,4,5],
    "name": ["Ada","Alan","Grace","Linus","Edsger"],
    "score":[9.7, 8.2, 9.6, 7.5, 8.9],
    "program": ["CS","CS","DS","SE","CS"]
}
df1 = pd.DataFrame(data)
df1

Creating a dataframe from records

You can also create DataFrames from a list of records (dictionaries), where each dictionary represents a row.

[ ]:
students = [
    {"id": 101, "name": "Ada Lovelace",  "program":"CS", "score": 9.5},
    {"id": 102, "name": "Alan Turing",   "program":"CS", "score": 8.8},
    {"id": 103, "name": "Grace Hopper",  "program":"DS", "score": 9.6},
]
df2 = pd.DataFrame.from_records(students)
df2

Importing large datasets from JSON/CSV

In real-world scenarios, datasets are often stored in files (locally or remotely). Pandas provides convenient functions to load data from common formats like JSON and CSV. We are going to load data from a CSV with information about diabetes patients.

The dataset contains the following columns:

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test

  • BloodPressure: Diastolic blood pressure (mm Hg)

  • SkinThickness: Triceps skin fold thickness (mm)

  • Insulin: 2-Hour serum insulin (mu U/ml)

  • BMI: Body mass index (weight in kg/(height in m)^2)

  • DiabetesPedigreeFunction: Diabetes pedigree function

  • Age: Age (years)

  • Outcome: Class variable (0 or 1) determining if the patient has diabetes

  • 268 of 768 are 1, the others are 0

  • Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)

The following code loads the dataset into a Pandas dataframe:

[ ]:
df = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')
df

Inspecting and accessing data

Inspecting data

  • df.head(n): first n rows (default 5)

  • df.tail(n): last n rows (default 5)

  • df.dtypes: data types of each column

  • df.info(): summary of dataframe (rows, columns, dtypes, memory usage)

  • df.describe(): statistical summary of numeric columns

[ ]:
df.info()
[ ]:
df.describe()
[ ]:
df.head()

See the numbers to the left of each row? That’s the index (row labels). By default, it’s just sequential numbers 0,1,2,… telling you the order of the row, but you can customize it, setting one of the columns as index or providing your own labels.

[ ]:
df1 = df1.set_index("id")
df1

Selecting rows/columns (very brief)

  • Column(s): Select 1 colum df['name'], df[['name','score']]

  • Row slice: df[1:4] (by position)

  • loc/iloc: df.loc[row_labels, col_labels], df.iloc[row_idx, col_idx]

[ ]:
print(df1['name']) # one column
print(df1[['name','score']]) # double brackets for multiple columns
print(df1[1:4]) # row slice by position
print(df1.iloc[:3, :2]) # first 3 rows, first 2 columns by position

You can also access individual cells:

  • df.at[row_label, col_label]: access by labels

However, this is not very common in practice: You leverage the power of Pandas by working with entire columns/rows at once.

Querying: masks, .query, string ops, isin, between

Pandas provides multiple ways to filter/query data based on conditions. Here are some common methods:

  • Boolean masks: create a boolean Series and use it to filter rows

  • .query(): use a query string to filter rows

  • String operations: use .str accessor for string columns

  • isin(): filter rows where column values are in a list

  • between(): filter rows where column values are within a range

[ ]:
high = df1[df1["score"] >= 9] # Boolean mask: Filter rows with score >= 9
high
[ ]:
diabetes = df[df["Outcome"] == 1] # Boolean mask: patients with diabetes
diabetes
[ ]:
cs = df1.query("program == 'CS' and score >= 8.5") # Querying combining different mask
cs
[ ]:
names_with_a = df1[df1["name"].str.contains("a", case=False)]
names_with_a
[ ]:
subset = df1[df1["program"].isin(["CS","DS"]) & df1["score"].between(8.5, 9.8)]
subset
[ ]:
## Playground! Try your own queries here
# e.g., filter students with score < 8 or blood pressure > 80 of patients with diabetes

B-PI Cards

  • P1: Predict rows for df.query("program == 'CS' and score >= 9").

  • P2: Minimal fix: df[df['program' == 'CS']].

  • P3: Predict: df[df['name'].str.startswith(('A','G'))].

Mini-challenges

  1. CSV parsing: modify read_csv to read a ;-separated version of csv_text and parse id as integer and missing scores as NaN.

  2. JSON lines: build a JSON-lines string from students and load it with pd.read_json(..., lines=True).

  3. Top-K filter: keep only the top-3 score students in df via thresholding (no nlargest shortcut).

[ ]:
# Hints (not executed)
# 1) pd.read_csv(StringIO(csv_text_semicolon), sep=';', dtype={'id':'Int64'}, na_values=['', 'NA'])
# 2) js_lines = '\n'.join(json.dumps(r, ensure_ascii=False) for r in students); pd.read_json(StringIO(js_lines), lines=True)
# 3) thr = df['score'].nlargest(3).min(); top3 = df[df['score'] >= thr]

Takeaways

  • JSON: great for nested data; ensure valid JSON (no comments/trailing commas); control with indent, ensure_ascii.

  • CSV: plain tabular text; be explicit with delimiter/quoting; watch commas in text; use newline='' on write.

  • pandas: read JSON/CSV, inspect dtypes, select, and query with masks or .query(); string helpers like .str.contains are key.