Introduction to data manipulation and analysis with Pandas¶
Try me¶
Pandas is a powerful library that provides convenient data structures and functions to work with data. Throughout the next chapters, we will use Pandas for data manipulation and analysis. To use Pandas in your project, you first need to install it in your environment. Additionally, in this tutorial we will import the display and Markdown libraries to display the dataframes as tables.
[ ]:
!pip install pandas
!pip install ipython
[1]:
# Import pandas
import pandas as pd
# Import display and Markdown
from IPython.display import display, Markdown
Introduction to Pandas data structures¶
This introduction covers the fundamentals of Pandas data structures used in this course. It is however recommended that you also take a look to the official documentation.
There are primarily two types of Pandas data structures:
Dataframes: Dataframes are two dimensional data structures, similar to a two dimensional numpy array, a spreadsheet or an SQL table.
Series: Series are one-dimensional data structures, similar to one-dimensional numpy arrays
Dataframes¶
Data frames facilitates the manipulation and analysis of two dimensional data structures, combining convenient functions to:
Extract data from different sources, like CSV files, Excel sheets, or SQL databases
Transform data, applying functions to change the values, or combining different data sources to prepare the data for our application
Load data into output CSV files, Excel sheets or SQL databases
Analyse Analyse the data
Pandas columns in a dataframe are not necessarily of the same type. In the course, we will primarily work with numeric types, string types and the datetime type, so we will focus on those two types in this tutorial. We will also store complex objects, and find convenient Pandas functions to manipulate the dataframe using them, so you can see that Pandas is going to be very useful throughout the course!
You can create a dataframe from different structures: DataFrame accepts many different kinds. Among others, in this course, we will primarily use:
A dictionary of one dimensional numpy arrays, lists, dicts, or Series
CSV, or Excel files
Another Pandas dataframe
A Pandas series
Let us start with a simple example, using a dictionary.
[2]:
# This is a dictionary
dict = {"one": [1, 2, 3, 4, 5], "two": [5, 4, 3, 2, 1], "three":["Awesome", "Brutal", "Cool", "Divine", "Exquisite"], "date": ['2021-01-01', '2021-01-02', '2021-01-02', '2021-02-03', '2021-02-05'] }
# And this is a dataframe
dataframe = pd.DataFrame.from_dict(dict)
display(dataframe)
| one | two | three | date | |
|---|---|---|---|---|
| 0 | 1 | 5 | Awesome | 2021-01-01 |
| 1 | 2 | 4 | Brutal | 2021-01-02 |
| 2 | 3 | 3 | Cool | 2021-01-02 |
| 3 | 4 | 2 | Divine | 2021-02-03 |
| 4 | 5 | 1 | Exquisite | 2021-02-05 |
Our Pandas dataframe, Pandas, has three columns, with names, ‘one’, ‘two’, and ‘three’. Notice the integer number in the column to the left. This is the index, and by default, the index has no name.
Another convenient way to create a dataframe is to use a set of records in an array:
[3]:
# This is an array of dict
records = [{"one": 1, "two": 5, "three":"Awesome", "date":"2021-01-01"},
{"one": 2, "two": 4, "three":"Brutal", "date":"2021-01-02"},
{"one": 3, "two": 3, "three":"Cool", "date":"2021-01-03"},
{"one": 4, "two": 2, "three":"Divine", "date":"2021-02-04"},
{"one": 5, "two": 1, "three":"Exquisite", "date":"2021-02-05"}
]
# This is a dataframe
dataframe_2 = pd.DataFrame.from_records(records)
display(dataframe)
| one | two | three | date | |
|---|---|---|---|---|
| 0 | 1 | 5 | Awesome | 2021-01-01 |
| 1 | 2 | 4 | Brutal | 2021-01-02 |
| 2 | 3 | 3 | Cool | 2021-01-02 |
| 3 | 4 | 2 | Divine | 2021-02-03 |
| 4 | 5 | 1 | Exquisite | 2021-02-05 |
Another convenient function to create a dataset is loading a csv file:
online_retail = pd.read_csv('datasets/Online Retail_2.csv', usecols=['InvoiceNo', 'StockCode', 'Quantity','InvoiceDate')
This line of code will create a dataset from the file ‘datasets/Online Retail_2.csv’, loading only data from the columns labeled as ‘InvoiceNo’, ‘StockCode’, ‘QUantity’, and ‘InvoiceDate’.
Accessing data¶
There are multiple ways to access the data in a dataframe. Dataframe columns are series, the other important data structure in Pandas. You can access columns in a dataframe using column names as keys:
[4]:
names = dataframe['three']
display(names)
0 Awesome
1 Brutal
2 Cool
3 Divine
4 Exquisite
Name: three, dtype: object
Notice that series are also indexed, you can access elements or slices just as in numpy:
[5]:
print(names[0])
print(names[1])
print(names[0:2])
Awesome
Brutal
0 Awesome
1 Brutal
Name: three, dtype: object
Obviously, you can access the elements directly from the data frame:
[6]:
print(dataframe['three'][0])
Awesome
Queries¶
You can use queries to access parts of your dataframe, that is, to filter the data, using python expressions. The function query() is used to issue the query. For instance, the following query finds all rows where the column ‘one’ is less or equal than two:
[7]:
filtered_df = dataframe.query('one <= 2')
display(filtered_df)
| one | two | three | date | |
|---|---|---|---|---|
| 0 | 1 | 5 | Awesome | 2021-01-01 |
| 1 | 2 | 4 | Brutal | 2021-01-02 |
You can combine different columns in the expressions, for instance, the following example filters data where the column ‘two’ is greater than 3 and the column ‘three’ is equal to “Awesome”. The second example shows values where the column ‘one’ is equal to the column ‘two’
[8]:
filtered_df2 = dataframe.query('two > 3 and three == "Awesome"')
display(filtered_df2)
filtered_df3 = dataframe.query('one == two')
display(filtered_df3)
| one | two | three | date | |
|---|---|---|---|---|
| 0 | 1 | 5 | Awesome | 2021-01-01 |
| one | two | three | date | |
|---|---|---|---|---|
| 2 | 3 | 3 | Cool | 2021-01-02 |
Vectorized functions¶
Pandas series have a huge set of vectorized functions you can apply to the entire series. For instance, the different functions in the numpy tutorial:
[9]:
display(dataframe['two'].sum())
display(dataframe['one'].cumsum())
display(dataframe['two'].mean())
15
0 1
1 3
2 6
3 10
4 15
Name: one, dtype: int64
3.0
One interesting function is describe(), which returns another dataframe with statistical information about the series or dataframe:
[10]:
dataframe.describe()
[10]:
| one | two | |
|---|---|---|
| count | 5.000000 | 5.000000 |
| mean | 3.000000 | 3.000000 |
| std | 1.581139 | 1.581139 |
| min | 1.000000 | 1.000000 |
| 25% | 2.000000 | 2.000000 |
| 50% | 3.000000 | 3.000000 |
| 75% | 4.000000 | 4.000000 |
| max | 5.000000 | 5.000000 |
Another important vectorized function is unique(). We will use unique()
All string type functions are vectorised and can be accessed through str. For instance, the following example gets the first character of the string:
[11]:
upper_three = dataframe['three'].str.upper()
display(upper_three)
0 AWESOME
1 BRUTAL
2 COOL
3 DIVINE
4 EXQUISITE
Name: three, dtype: object
str provides vectorized access to the string value of the series, and thus you can access parts of every string in a series using indexes:
[12]:
first_character = dataframe['three'].str[0]
display(first_character)
0 A
1 B
2 C
3 D
4 E
Name: three, dtype: object
Similar to str, dt implements vectorized functions to work with dates.The following examples first transforms column “date” into a datetime, and then calculates the month.
[13]:
month = pd.to_datetime(dataframe['date']).dt.to_period("M")
display(month)
0 2021-01
1 2021-01
2 2021-01
3 2021-02
4 2021-02
Name: date, dtype: period[M]
The function pd.to_datetime converts the series in column ‘date’ into a datetime. Then, dt.to_period() transforms this column into a period, in this case, the letter “M” represents months, and therefore, the period obtained is months.
Finally, apply allows us to apply and lambda function taking all the values of the dataframe or just a series:
[14]:
plus_one = dataframe['one'].apply(lambda x: x +1)
display(plus_one)
0 2
1 3
2 4
3 5
4 6
Name: one, dtype: int64
In the code above, we apply a newly created lambda function that takes one parameter x and just adds one. apply() will apply this function to the series in column ‘one’. There is a huge set of functions available, and many things you can do which are not convered in this tutorial, so you better check the documentation at Pandas Series API Reference
Indexing¶
By default, the dataframes we have created have an integer index to access data in a raw, but we can define any type of index. Let us create another dataframe with another type of index:
[15]:
data_2 = {"2021-01-01": ['003', '46021'], "2021-01-02": ['005', '46022'], "2021-01-03": ['007', '46022'],
"2021-02-04": ['003', '46021'], "2021-02-05": ['003', '46027']}
dataframe_3 = pd.DataFrame.from_dict(data_2, orient='index', columns=['client_id', 'postal_code'])
display(dataframe_3)
| client_id | postal_code | |
|---|---|---|
| 2021-01-01 | 003 | 46021 |
| 2021-01-02 | 005 | 46022 |
| 2021-01-03 | 007 | 46022 |
| 2021-02-04 | 003 | 46021 |
| 2021-02-05 | 003 | 46027 |
We have used two new parameters to create our dataframe from a dictionary (from_dict), orient, and columns. When orient is index, the dictionary keys are used as the index of the created dataframe. In this case, we are using a string. The columns parameters allow us to give a name to the columns of our dataframe. Now, we can access the data using the new index:
[16]:
print(dataframe_3['client_id']['2021-01-01'])
003
We can also set the index of an existing dataframe using its columns. Let us change the index of our first dataframe:
[17]:
new_dataframe = dataframe.set_index(['one'])
display(new_dataframe)
| two | three | date | |
|---|---|---|---|
| one | |||
| 1 | 5 | Awesome | 2021-01-01 |
| 2 | 4 | Brutal | 2021-01-02 |
| 3 | 3 | Cool | 2021-01-02 |
| 4 | 2 | Divine | 2021-02-03 |
| 5 | 1 | Exquisite | 2021-02-05 |
Notice, that since our column had a name, now our index also has a name. We can set the name of an index like:
[18]:
dataframe_3.index.set_names(['date'], inplace=True)
display(dataframe_3)
| client_id | postal_code | |
|---|---|---|
| date | ||
| 2021-01-01 | 003 | 46021 |
| 2021-01-02 | 005 | 46022 |
| 2021-01-03 | 007 | 46022 |
| 2021-02-04 | 003 | 46021 |
| 2021-02-05 | 003 | 46027 |
We can use different columns as index to create a dataframe similar to a pivot table:
[19]:
dataframe_4 = dataframe.set_index(['one', 'two', 'three'])
display(dataframe_4)
print(dataframe_4['date'][1][5]['Awesome'])
| date | |||
|---|---|---|---|
| one | two | three | |
| 1 | 5 | Awesome | 2021-01-01 |
| 2 | 4 | Brutal | 2021-01-02 |
| 3 | 3 | Cool | 2021-01-02 |
| 4 | 2 | Divine | 2021-02-03 |
| 5 | 1 | Exquisite | 2021-02-05 |
2021-01-01
New columns¶
To add a new column, you just need to define the column name. For instance, let us to the original dataframe a new column containing the month:
[20]:
dataframe['month'] = pd.to_datetime(dataframe['date']).dt.month
display(dataframe)
| one | two | three | date | month | |
|---|---|---|---|---|---|
| 0 | 1 | 5 | Awesome | 2021-01-01 | 1 |
| 1 | 2 | 4 | Brutal | 2021-01-02 | 1 |
| 2 | 3 | 3 | Cool | 2021-01-02 | 1 |
| 3 | 4 | 2 | Divine | 2021-02-03 | 2 |
| 4 | 5 | 1 | Exquisite | 2021-02-05 | 2 |
Grouping¶
Grouping allows us to group by values that have the same index value:
[21]:
ones_in_month = dataframe.set_index(['month']).groupby(['month']).sum()
display(ones_in_month)
| one | two | |
|---|---|---|
| month | ||
| 1 | 6 | 12 |
| 2 | 9 | 3 |
Note that, first, we define the index month, and then we group by this index. Finally, we apply the vectorized function sum to sum all values of the columns with the same value of the index month. Column three silently disappears since its type does not support this function.
Joining¶
Indexes allow us to join different dataframes that have an index with the same name:
[22]:
dataframe_5 = dataframe.set_index(['date'])
dataframe_6 = dataframe_5.join(dataframe_3, how="inner")
display(dataframe_6)
| one | two | three | month | client_id | postal_code | |
|---|---|---|---|---|---|---|
| date | ||||||
| 2021-01-01 | 1 | 5 | Awesome | 1 | 003 | 46021 |
| 2021-01-02 | 2 | 4 | Brutal | 1 | 005 | 46022 |
| 2021-01-02 | 3 | 3 | Cool | 1 | 005 | 46022 |
| 2021-02-05 | 5 | 1 | Exquisite | 2 | 003 | 46027 |
The how parameter allows us to control which method is used to join the columns. Use the figure below as a reference.

Basically, we need to bear in mind that not all rows in both tables are going to meet the condition. The SQL engine is going to evaluate the condition and then return the data in either table that meets the condition depending on the type of join:
Inner join: To get only the data that meets the condition.
Left join: To get all data from the left table and the data of the right table that meets the condition.
Right join: Opposite of the left join. Normally, the use of left join is preferred.
Outer join: Opposite of inner join, gets all data from both tables. It is seldom used in practice.