Introduction to data manipulation and analysis with Pandas¶

Try me¶

Pandas is a powerful library that provides convenient data structures and functions to work with data. Throughout the next chapters, we will use Pandas for data manipulation and analysis. To use Pandas in your project, you first need to install it in your environment. Additionally, in this tutorial we will import the display and Markdown libraries to display the dataframes as tables.

[ ]:

!pip install pandas
!pip install ipython

[1]:

# Import pandas
import pandas as pd

# Import display and Markdown
from IPython.display import display, Markdown

Introduction to Pandas data structures¶

This introduction covers the fundamentals of Pandas data structures used in this course. It is however recommended that you also take a look to the official documentation.

There are primarily two types of Pandas data structures:

Dataframes: Dataframes are two dimensional data structures, similar to a two dimensional numpy array, a spreadsheet or an SQL table.
Series: Series are one-dimensional data structures, similar to one-dimensional numpy arrays

Dataframes¶

Data frames facilitates the manipulation and analysis of two dimensional data structures, combining convenient functions to:

Extract data from different sources, like CSV files, Excel sheets, or SQL databases
Transform data, applying functions to change the values, or combining different data sources to prepare the data for our application
Load data into output CSV files, Excel sheets or SQL databases
Analyse Analyse the data

Pandas columns in a dataframe are not necessarily of the same type. In the course, we will primarily work with numeric types, string types and the datetime type, so we will focus on those two types in this tutorial. We will also store complex objects, and find convenient Pandas functions to manipulate the dataframe using them, so you can see that Pandas is going to be very useful throughout the course!

You can create a dataframe from different structures: DataFrame accepts many different kinds. Among others, in this course, we will primarily use:

A dictionary of one dimensional numpy arrays, lists, dicts, or Series
CSV, or Excel files
Another Pandas dataframe
A Pandas series

Let us start with a simple example, using a dictionary.

[2]:

# This is a dictionary
dict = {"one": [1, 2, 3, 4, 5], "two": [5, 4, 3, 2, 1], "three":["Awesome", "Brutal", "Cool", "Divine", "Exquisite"], "date": ['2021-01-01', '2021-01-02', '2021-01-02', '2021-02-03', '2021-02-05'] }

# And this is a dataframe
dataframe = pd.DataFrame.from_dict(dict)

display(dataframe)

	one	two	three	date
0	1	5	Awesome	2021-01-01
1	2	4	Brutal	2021-01-02
2	3	3	Cool	2021-01-02
3	4	2	Divine	2021-02-03
4	5	1	Exquisite	2021-02-05

Our Pandas dataframe, Pandas, has three columns, with names, ‘one’, ‘two’, and ‘three’. Notice the integer number in the column to the left. This is the index, and by default, the index has no name.

Another convenient way to create a dataframe is to use a set of records in an array:

[3]:

# This is an array of dict
records = [{"one": 1, "two": 5, "three":"Awesome", "date":"2021-01-01"},
           {"one": 2, "two": 4, "three":"Brutal", "date":"2021-01-02"},
           {"one": 3, "two": 3, "three":"Cool", "date":"2021-01-03"},
           {"one": 4, "two": 2, "three":"Divine", "date":"2021-02-04"},
           {"one": 5, "two": 1, "three":"Exquisite", "date":"2021-02-05"}
           ]

# This is a dataframe
dataframe_2 = pd.DataFrame.from_records(records)
display(dataframe)

	one	two	three	date
0	1	5	Awesome	2021-01-01
1	2	4	Brutal	2021-01-02
2	3	3	Cool	2021-01-02
3	4	2	Divine	2021-02-03
4	5	1	Exquisite	2021-02-05

Another convenient function to create a dataset is loading a csv file:

online_retail = pd.read_csv('datasets/Online Retail_2.csv', usecols=['InvoiceNo', 'StockCode', 'Quantity','InvoiceDate')

This line of code will create a dataset from the file ‘datasets/Online Retail_2.csv’, loading only data from the columns labeled as ‘InvoiceNo’, ‘StockCode’, ‘QUantity’, and ‘InvoiceDate’.

Accessing data¶

There are multiple ways to access the data in a dataframe. Dataframe columns are series, the other important data structure in Pandas. You can access columns in a dataframe using column names as keys:

[4]:

names = dataframe['three']
display(names)

0      Awesome
1       Brutal
2         Cool
3       Divine
4    Exquisite
Name: three, dtype: object

Notice that series are also indexed, you can access elements or slices just as in numpy:

[5]:

print(names[0])
print(names[1])
print(names[0:2])

Awesome
Brutal
0    Awesome
1     Brutal
Name: three, dtype: object

Obviously, you can access the elements directly from the data frame:

[6]:

print(dataframe['three'][0])

Awesome

Queries¶

You can use queries to access parts of your dataframe, that is, to filter the data, using python expressions. The function query() is used to issue the query. For instance, the following query finds all rows where the column ‘one’ is less or equal than two:

[7]:

filtered_df = dataframe.query('one <= 2')
display(filtered_df)

	one	two	three	date
0	1	5	Awesome	2021-01-01
1	2	4	Brutal	2021-01-02

You can combine different columns in the expressions, for instance, the following example filters data where the column ‘two’ is greater than 3 and the column ‘three’ is equal to “Awesome”. The second example shows values where the column ‘one’ is equal to the column ‘two’

[8]:

filtered_df2 = dataframe.query('two > 3 and three == "Awesome"')
display(filtered_df2)
filtered_df3 = dataframe.query('one == two')
display(filtered_df3)

	one	two	three	date
0	1	5	Awesome	2021-01-01

	one	two	three	date
2	3	3	Cool	2021-01-02

Vectorized functions¶

Pandas series have a huge set of vectorized functions you can apply to the entire series. For instance, the different functions in the numpy tutorial:

[9]:

display(dataframe['two'].sum())
display(dataframe['one'].cumsum())
display(dataframe['two'].mean())

0     1
1     3
2     6
3    10
4    15
Name: one, dtype: int64

3.0

One interesting function is describe(), which returns another dataframe with statistical information about the series or dataframe:

[10]:

dataframe.describe()

[10]:

	one	two
count	5.000000	5.000000
mean	3.000000	3.000000
std	1.581139	1.581139
min	1.000000	1.000000
25%	2.000000	2.000000
50%	3.000000	3.000000
75%	4.000000	4.000000
max	5.000000	5.000000

Another important vectorized function is unique(). We will use unique()

All string type functions are vectorised and can be accessed through str. For instance, the following example gets the first character of the string:

[11]:

upper_three = dataframe['three'].str.upper()
display(upper_three)

0      AWESOME
1       BRUTAL
2         COOL
3       DIVINE
4    EXQUISITE
Name: three, dtype: object

str provides vectorized access to the string value of the series, and thus you can access parts of every string in a series using indexes:

[12]:

first_character = dataframe['three'].str[0]
display(first_character)

0    A
1    B
2    C
3    D
4    E
Name: three, dtype: object

Similar to str, dt implements vectorized functions to work with dates.The following examples first transforms column “date” into a datetime, and then calculates the month.

[13]:

month = pd.to_datetime(dataframe['date']).dt.to_period("M")
display(month)

0    2021-01
1    2021-01
2    2021-01
3    2021-02
4    2021-02
Name: date, dtype: period[M]

The function pd.to_datetime converts the series in column ‘date’ into a datetime. Then, dt.to_period() transforms this column into a period, in this case, the letter “M” represents months, and therefore, the period obtained is months.

Finally, apply allows us to apply and lambda function taking all the values of the dataframe or just a series:

[14]:

plus_one = dataframe['one'].apply(lambda x: x +1)
display(plus_one)

0    2
1    3
2    4
3    5
4    6
Name: one, dtype: int64

In the code above, we apply a newly created lambda function that takes one parameter x and just adds one. apply() will apply this function to the series in column ‘one’. There is a huge set of functions available, and many things you can do which are not convered in this tutorial, so you better check the documentation at Pandas Series API Reference

Indexing¶

By default, the dataframes we have created have an integer index to access data in a raw, but we can define any type of index. Let us create another dataframe with another type of index:

[15]:

data_2 = {"2021-01-01": ['003', '46021'], "2021-01-02": ['005', '46022'], "2021-01-03": ['007', '46022'],
          "2021-02-04": ['003', '46021'], "2021-02-05": ['003', '46027']}

dataframe_3 = pd.DataFrame.from_dict(data_2, orient='index', columns=['client_id', 'postal_code'])
display(dataframe_3)

	client_id	postal_code
2021-01-01	003	46021
2021-01-02	005	46022
2021-01-03	007	46022
2021-02-04	003	46021
2021-02-05	003	46027

We have used two new parameters to create our dataframe from a dictionary (from_dict), orient, and columns. When orient is index, the dictionary keys are used as the index of the created dataframe. In this case, we are using a string. The columns parameters allow us to give a name to the columns of our dataframe. Now, we can access the data using the new index:

[16]:

print(dataframe_3['client_id']['2021-01-01'])

We can also set the index of an existing dataframe using its columns. Let us change the index of our first dataframe:

[17]:

new_dataframe = dataframe.set_index(['one'])
display(new_dataframe)

	two	three	date
one
1	5	Awesome	2021-01-01
2	4	Brutal	2021-01-02
3	3	Cool	2021-01-02
4	2	Divine	2021-02-03
5	1	Exquisite	2021-02-05

Notice, that since our column had a name, now our index also has a name. We can set the name of an index like:

[18]:

dataframe_3.index.set_names(['date'], inplace=True)
display(dataframe_3)

	client_id	postal_code
date
2021-01-01	003	46021
2021-01-02	005	46022
2021-01-03	007	46022
2021-02-04	003	46021
2021-02-05	003	46027

We can use different columns as index to create a dataframe similar to a pivot table:

[19]:

dataframe_4 = dataframe.set_index(['one', 'two', 'three'])
display(dataframe_4)
print(dataframe_4['date'][1][5]['Awesome'])

			date
one	two	three
1	5	Awesome	2021-01-01
2	4	Brutal	2021-01-02
3	3	Cool	2021-01-02
4	2	Divine	2021-02-03
5	1	Exquisite	2021-02-05

2021-01-01

New columns¶

To add a new column, you just need to define the column name. For instance, let us to the original dataframe a new column containing the month:

[20]:

dataframe['month'] = pd.to_datetime(dataframe['date']).dt.month
display(dataframe)

	one	two	three	date	month
0	1	5	Awesome	2021-01-01	1
1	2	4	Brutal	2021-01-02	1
2	3	3	Cool	2021-01-02	1
3	4	2	Divine	2021-02-03	2
4	5	1	Exquisite	2021-02-05	2

Grouping¶

Grouping allows us to group by values that have the same index value:

[21]:

ones_in_month = dataframe.set_index(['month']).groupby(['month']).sum()
display(ones_in_month)

	one	two
month
1	6	12
2	9	3

Note that, first, we define the index month, and then we group by this index. Finally, we apply the vectorized function sum to sum all values of the columns with the same value of the index month. Column three silently disappears since its type does not support this function.

Joining¶

Indexes allow us to join different dataframes that have an index with the same name:

[22]:

dataframe_5 = dataframe.set_index(['date'])
dataframe_6 = dataframe_5.join(dataframe_3, how="inner")
display(dataframe_6)

	one	two	three	month	client_id	postal_code
date
2021-01-01	1	5	Awesome	1	003	46021
2021-01-02	2	4	Brutal	1	005	46022
2021-01-02	3	3	Cool	1	005	46022
2021-02-05	5	1	Exquisite	2	003	46027

The how parameter allows us to control which method is used to join the columns. Use the figure below as a reference.

Types of join clauses

Basically, we need to bear in mind that not all rows in both tables are going to meet the condition. The SQL engine is going to evaluate the condition and then return the data in either table that meets the condition depending on the type of join:

Inner join: To get only the data that meets the condition.
Left join: To get all data from the left table and the data of the right table that meets the condition.
Right join: Opposite of the left join. Normally, the use of left join is preferred.
Outer join: Opposite of inner join, gets all data from both tables. It is seldom used in practice.