Pandas Dataframe Exercises

In the first exercises, we are going to use the open dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is available on Kaggle. The dataset contains medical measurements for a group of patients, some of whom tested positive for diabetes. You can find it at this URL:

https://www.kaggle.com/uciml/pima-indians-diabetes-database

We have downloaded the dataset and uploaded a copy to the course repository. You can find it at the following URL:

https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv

The dataset contains the following columns:

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

  • BloodPressure: Diastolic blood pressure (mm Hg)

  • SkinThickness: Triceps skin fold thickness (mm)

  • Insulin: 2-Hour serum insulin (mu U/ml)

  • BMI: Body mass index (weight in kg/(height in m)^2)

  • DiabetesPedigreeFunction: Diabetes pedigree function

  • Age: Age (years)

  • Outcome: Class variable (0 or 1), where 1 is interpreted as “tested positive for diabetes”

Class distribution: 268 of the 768 records have Outcome 1; the others have Outcome 0.

The following code loads the dataset into a Pandas dataframe:

[1]:
import pandas as pd
diabetes_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')
diabetes_pd
[1]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
...          ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

768 rows × 9 columns

  1. Display the statistical summary of the dataset. What can you say about the dataset?

[ ]:
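As a starting hint, the sketch below calls `describe()` on a small frame built by hand from the first five rows displayed above, so it runs without downloading anything. The name `sample_pd` is made up for this illustration; in the exercise you would call the same method on `diabetes_pd`.

```python
import pandas as pd

# First five rows of the dataset as displayed above, typed in by hand.
# `sample_pd` is a hypothetical name used only for this sketch.
sample_pd = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# describe() reports count, mean, std, min, quartiles and max
# for every numeric column.
summary = sample_pd.describe()
print(summary)
```

When reading the summary of the full dataset, the `min` and `max` rows are a good place to start: a minimum of 0 in a column such as Insulin may deserve a closer look.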

  2. Use the query function to find the average BMI of the patients with diabetes and the average BMI of the patients without diabetes. What can you say about the results?

[ ]:
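One possible approach, sketched on the same five hand-typed rows rather than on the full dataset, is to filter with `query()` and then take the mean of the BMI column. The names `sample_pd`, `bmi_positive` and `bmi_negative` are assumptions made for this example.

```python
import pandas as pd

# BMI and Outcome for the first five rows displayed above.
# `sample_pd` is a hypothetical name used only for this sketch.
sample_pd = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# query() takes a boolean expression written as a string and
# returns only the rows for which it is true.
bmi_positive = sample_pd.query("Outcome == 1")["BMI"].mean()
bmi_negative = sample_pd.query("Outcome == 0")["BMI"].mean()

print(f"Mean BMI, Outcome 1: {bmi_positive:.2f}")
print(f"Mean BMI, Outcome 0: {bmi_negative:.2f}")
```

Five rows are far too few to draw conclusions; the comparison only becomes meaningful on all 768 records.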

For the next exercises, we are going to use datasets from DataHub, a platform for sharing and discovering open data. First, we are going to use the COVID-19 dataset, which contains information about the COVID-19 pandemic. You can find it at the following URL:

https://datahub.io/core/covid-19

[5]:
# Uncomment to use the version in datahub
# covid_pd = pd.read_csv('https://datahub.io/core/covid-19/r/countries-aggregated.csv')
covid_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/covid.csv')

covid_pd
[5]:
           Date   China        US  United_Kingdom    Italy    France  Germany    Spain     Iran
0    2020-01-22     548         1               0        0         0        0        0        0
1    2020-01-23     643         1               0        0         0        0        0        0
2    2020-01-24     920         2               0        0         2        0        0        0
3    2020-01-25    1406         2               0        0         3        0        0        0
4    2020-01-26    2075         5               0        0         3        0        0        0
...         ...     ...       ...             ...      ...       ...      ...      ...      ...
728  2022-01-19  118370  68684431        15610069  9219391  15288014  8361262  8676916  6231909
729  2022-01-20  118470  69329860        15718193  9418256  15715329  8502132  8834363  6236567
730  2022-01-21  118544  70209840        15814617  9603856  16116748  8635461  8975458  6241843
731  2022-01-22  118616  70495874        15891905  9781191  16506090  8716804  8975458  6245346
732  2022-01-23  118773  70699416        15966838  9923678  16807733  8773030  8975458  6250490

733 rows × 9 columns

  3. Use the date functions to create another column with the month of each date.

[ ]:
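A minimal sketch of one way to do this, assuming the Date column is read as plain text and must first be converted with `pd.to_datetime`. The frame `sample_pd` holds three rows copied from the output above, and the column names `Month` and `YearMonth` are choices made for this example.

```python
import pandas as pd

# Three rows copied from the table above; `sample_pd` is a hypothetical name.
sample_pd = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2022-01-23"],
    "China": [548, 643, 118773],
})

# read_csv leaves dates as strings unless told otherwise,
# so convert the column before using the .dt accessor.
sample_pd["Date"] = pd.to_datetime(sample_pd["Date"])

# Month number (1-12) of each date.
sample_pd["Month"] = sample_pd["Date"].dt.month

# Optional: a year-month period, which keeps January 2020
# and January 2022 apart.
sample_pd["YearMonth"] = sample_pd["Date"].dt.to_period("M")

print(sample_pd)
```

Because the data spans more than one year, the plain month number puts rows from different years in the same group; the year-month period avoids that if it matters for your analysis.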

  4. Now, group the data by month and country and create a dataset containing the total number of confirmed cases and deaths for each month and country. What can you say about the results?

[ ]:
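The table displayed above has one column per country, so one possible route is to reshape it to long format with `melt()` before grouping. The sketch below does this on four hand-copied rows and covers only the single figure per country that the table shows. The values appear to be cumulative, so this sketch takes the maximum within each month rather than the sum; that choice is an assumption about the data, as are the names `wide_pd`, `long_pd`, `monthly_pd`, `Country` and `Confirmed`.

```python
import pandas as pd

# Four rows and two countries copied from the table above.
wide_pd = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-22", "2020-01-23", "2022-01-22", "2022-01-23"]
    ),
    "China": [548, 643, 118616, 118773],
    "US": [1, 1, 70495874, 70699416],
})

# Reshape from one column per country to one row per (date, country).
long_pd = wide_pd.melt(id_vars="Date", var_name="Country", value_name="Confirmed")

# Year-month period to group on.
long_pd["Month"] = long_pd["Date"].dt.to_period("M")

# Assumption: the counts are cumulative, so the monthly figure is taken
# as the largest value reached in that month.
monthly_pd = long_pd.groupby(["Month", "Country"], as_index=False)["Confirmed"].max()

print(monthly_pd)
```

If the figures turn out to be daily counts instead, replacing `.max()` with `.sum()` would give monthly totals.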

  5. For this exercise, we are going to use another public data repository, the UCI Machine Learning Repository, which hosts many datasets for different purposes. We are going to use the Wine Quality dataset. Download the dataset as a CSV file, then import it into a Pandas dataframe. What are the columns of the dataset? Display the statistical summary.

[ ]:

[ ]:
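A hint for the import step: the Wine Quality CSV files use a semicolon as the field separator, so `read_csv` needs `sep=";"`. The sketch below illustrates this with an in-memory stand-in for the downloaded file; the three rows and the reduced set of columns are made up for the example, and `wine_pd` is a hypothetical name.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: made-up rows and only a few columns.
# With the real file, pass its path to read_csv instead of the StringIO object.
csv_text = (
    '"fixed acidity";"alcohol";"quality"\n'
    "7.4;9.4;5\n"
    "7.8;9.8;5\n"
    "11.2;9.8;6\n"
)

# sep=";" tells pandas that fields are separated by semicolons, not commas.
wine_pd = pd.read_csv(io.StringIO(csv_text), sep=";")

print(wine_pd.columns.tolist())
print(wine_pd.describe())
```

If every value ends up in a single column after loading, the separator is the first thing to check.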