Intro to Pandas — Class Notes¶
Try me¶
Introduction¶
Motivation¶
Pandas is the absolute foundation for data science in Python.
Essential tool if you are dealing with any kind of data.
Objectives¶
Cut through all the complexity and understand the core concepts.
High impact knowledge of how Pandas actually works.
Get you started with real-world data manipulation tasks.
Intro for non-programmers (I)¶
“Spreadsheets on steroids”: powerful data manipulation tool (like Excel, but way more powerful).
Overwhelmed by endless lists or arrays? Pandas provides:
Convenient and robust Data structures.
Functions to filter, aggregate, and transform data easily.
“Nervous system” for data analysis in Python.
Intro for non-programmers (II)¶
Use case examples:
Complex financial data analysis.
Cleaning messy datasets like survey results.
Combining multiple data sources.
Intro for non-programmers (III)¶
Pandas revolves around two core data structures:
Series: 1D labeled array (like a column in a spreadsheet).
DataFrame: 2D labeled table (like an entire spreadsheet with rows and columns).
Any single column in a DataFrame is under the hood a Series.
All Series in a dataframe share the same index (row labels).
Columns in a DataFrame can have different data types (ints, floats, strings, etc).
This flexibility makes Pandas ideal for real-world datasets: Different data types in a single container.
Intro for non-programmers (IV)¶
Pandas facilitates tne entire data lifecycle:
Loading data from various formats (CSV, JSON, Excel, SQL databases, etc).
Cleaning and transforming data (handling missing values, filtering rows, aggregating data, etc).
Analyzing and visualizing data (summary statistics, plots, etc).
Pandas is optimized for performance with large datasets (millions of rows).
Integrates seamlessly with other Python libraries (NumPy, Matplotlib, Scikit-learn) stay tuned for more!
Agenda¶
Intro and agenda (20 min)
Loading data (15 min)
Querying (15 min)
Setup pandas¶
Pandas is available off-the-shelf in Colab . If you are running locally, you may need to install it first:
pip install pandas
If not, just import:
[ ]:
import pandas as pd
This pdalias (de-facto standard) will become your best friend when working with data!
Create DataFrames and load data¶
We are going to present two different ways to create DataFrames that cover most common use cases:
From scratch (e.g., for small datasets or testing): Creating datasets from Python dictionaries/lists.
Loading from JSON/CSV files (most common in real-world scenarios). Other methods are supported (e.g., Excel, SQL databases, Parquet, etc) but out of scope for this intro.
Pandas can create DataFrames directly from Python dictionaries/lists.
Each key in the dictionary becomes a column
The values become the rows.
[ ]:
data = {
"id": [1,2,3,4,5],
"name": ["Ada","Alan","Grace","Linus","Edsger"],
"score":[9.7, 8.2, 9.6, 7.5, 8.9],
"program": ["CS","CS","DS","SE","CS"]
}
df1 = pd.DataFrame(data)
df1
Creating a dataframe from records¶
You can also create DataFrames from a list of records (dictionaries), where each dictionary represents a row.
[ ]:
students = [
{"id": 101, "name": "Ada Lovelace", "program":"CS", "score": 9.5},
{"id": 102, "name": "Alan Turing", "program":"CS", "score": 8.8},
{"id": 103, "name": "Grace Hopper", "program":"DS", "score": 9.6},
]
df2 = pd.DataFrame.from_records(students)
df2
Importing large datasets from JSON/CSV¶
In real-world scenarios, datasets are often stored in files (locally or remotely). Pandas provides convenient functions to load data from common formats like JSON and CSV. We are going to load data from a CSV with information about diabetes patients.
The dataset contains the following columns:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1) determining if the patient has diabetes
268 of 768 are 1, the others are 0
Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)
The following code loads the dataset into a Pandas dataframe:
[ ]:
df = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')
df
Inspecting and accessing data¶
Inspecting data¶
df.head(n): first n rows (default 5)df.tail(n): last n rows (default 5)df.dtypes: data types of each columndf.info(): summary of dataframe (rows, columns, dtypes, memory usage)df.describe(): statistical summary of numeric columns
[ ]:
df.info()
[ ]:
df.describe()
[ ]:
df.head()
See the numbers to the left of each row? That’s the index (row labels). By default, it’s just sequential numbers 0,1,2,… telling you the order of the row, but you can customize it, setting one of the columns as index or providing your own labels.
[ ]:
df1 = df1.set_index("id")
df1
Selecting rows/columns (very brief)¶
Column(s): Select 1 colum
df['name'],df[['name','score']]Row slice:
df[1:4](by position)loc/iloc:
df.loc[row_labels, col_labels],df.iloc[row_idx, col_idx]
[ ]:
print(df1['name']) # one column
print(df1[['name','score']]) # double brackets for multiple columns
print(df1[1:4]) # row slice by position
print(df1.iloc[:3, :2]) # first 3 rows, first 2 columns by position
You can also access individual cells:
df.at[row_label, col_label]: access by labels
However, this is not very common in practice: You leverage the power of Pandas by working with entire columns/rows at once.
Querying: masks, .query, string ops, isin, between¶
Pandas provides multiple ways to filter/query data based on conditions. Here are some common methods:
Boolean masks: create a boolean Series and use it to filter rows
.query(): use a query string to filter rowsString operations: use
.straccessor for string columnsisin(): filter rows where column values are in a listbetween(): filter rows where column values are within a range
[ ]:
high = df1[df1["score"] >= 9] # Boolean mask: Filter rows with score >= 9
high
[ ]:
diabetes = df[df["Outcome"] == 1] # Boolean mask: patients with diabetes
diabetes
[ ]:
cs = df1.query("program == 'CS' and score >= 8.5") # Querying combining different mask
cs
[ ]:
names_with_a = df1[df1["name"].str.contains("a", case=False)]
names_with_a
[ ]:
subset = df1[df1["program"].isin(["CS","DS"]) & df1["score"].between(8.5, 9.8)]
subset
[ ]:
## Playground! Try your own queries here
# e.g., filter students with score < 8 or blood pressure > 80 of patients with diabetes
B-PI Cards¶
P1: Predict rows for
df.query("program == 'CS' and score >= 9").P2: Minimal fix:
df[df['program' == 'CS']].P3: Predict:
df[df['name'].str.startswith(('A','G'))].
Mini-challenges¶
CSV parsing: modify
read_csvto read a;-separated version ofcsv_textand parseidas integer and missing scores asNaN.JSON lines: build a JSON-lines string from
studentsand load it withpd.read_json(..., lines=True).Top-K filter: keep only the top-3
scorestudents indfvia thresholding (nonlargestshortcut).
[ ]:
# Hints (not executed)
# 1) pd.read_csv(StringIO(csv_text_semicolon), sep=';', dtype={'id':'Int64'}, na_values=['', 'NA'])
# 2) js_lines = '\n'.join(json.dumps(r, ensure_ascii=False) for r in students); pd.read_json(StringIO(js_lines), lines=True)
# 3) thr = df['score'].nlargest(3).min(); top3 = df[df['score'] >= thr]
Takeaways¶
JSON: great for nested data; ensure valid JSON (no comments/trailing commas); control with
indent,ensure_ascii.CSV: plain tabular text; be explicit with delimiter/quoting; watch commas in text; use
newline=''on write.pandas: read JSON/CSV, inspect dtypes, select, and query with masks or
.query(); string helpers like.str.containsare key.