Pandas Dataframe Exercises
In the first exercises, we are going to use the open dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is available on Kaggle. The dataset contains diagnostic measurements for patients, together with whether each one tested positive for diabetes. You can find it at this URL:
https://www.kaggle.com/uciml/pima-indians-diabetes-database
We have downloaded the dataset and uploaded it to the course repository; the code further down loads it from there.
The dataset contains the following columns:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
Class distribution: 268 of the 768 rows have Outcome 1 (interpreted as “tested positive for diabetes”); the others are 0
The following code loads the dataset into a Pandas dataframe:
[1]:
import pandas as pd
diabetes_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')
diabetes_pd
[1]:
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 9 columns
Display the statistical summary of the dataset. What can you say about the dataset?
[ ]:
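As a hint for the exercise above, here is a minimal sketch of a statistical summary using `describe()`. It runs on a tiny made-up frame (the name `sample_pd` and its values are illustrative assumptions, not the real dataset), so the same call still has to be applied to `diabetes_pd`.

```python
import pandas as pd

# Tiny made-up sample with a few of the dataset's columns (illustrative values only)
sample_pd = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# describe() reports count, mean, std, min, quartiles and max for each numeric column
summary = sample_pd.describe()
print(summary)
```

When reading the summary of the full dataset, the `min` row is worth a look: the preview above shows zeros in columns such as Insulin and SkinThickness, which may stand for missing measurements rather than real values.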
Use the query function to find the average BMI of the patients with diabetes and the average BMI of the patients without diabetes. What can you say about the results?
[ ]:
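The pattern for this exercise can be sketched as follows: `query()` filters the rows with a boolean expression written as a string, and `mean()` is then applied to the selected column. The frame `sample_pd` and its values are made up for illustration; the results on the real dataset will differ.

```python
import pandas as pd

# Made-up sample (illustrative values only)
sample_pd = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Filter rows by outcome, then average the BMI column of each subset
bmi_positive = sample_pd.query("Outcome == 1")["BMI"].mean()
bmi_negative = sample_pd.query("Outcome == 0")["BMI"].mean()

print("Average BMI, Outcome == 1:", bmi_positive)
print("Average BMI, Outcome == 0:", bmi_negative)
```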
For the next exercises, we are going to use datasets from DataHub, a platform for sharing and discovering open data. First, we are going to use the COVID dataset, which contains information about the COVID-19 pandemic. You can find it at the following URL:
https://datahub.io/core/covid-1
[5]:
# Uncomment to use the version in datahub
# covid_pd = pd.read_csv('https://datahub.io/core/covid-19/r/countries-aggregated.csv')
covid_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/covid.csv')
covid_pd
[5]:
| | Date | China | US | United_Kingdom | Italy | France | Germany | Spain | Iran |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-22 | 548 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2020-01-23 | 643 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2020-01-24 | 920 | 2 | 0 | 0 | 2 | 0 | 0 | 0 |
| 3 | 2020-01-25 | 1406 | 2 | 0 | 0 | 3 | 0 | 0 | 0 |
| 4 | 2020-01-26 | 2075 | 5 | 0 | 0 | 3 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 728 | 2022-01-19 | 118370 | 68684431 | 15610069 | 9219391 | 15288014 | 8361262 | 8676916 | 6231909 |
| 729 | 2022-01-20 | 118470 | 69329860 | 15718193 | 9418256 | 15715329 | 8502132 | 8834363 | 6236567 |
| 730 | 2022-01-21 | 118544 | 70209840 | 15814617 | 9603856 | 16116748 | 8635461 | 8975458 | 6241843 |
| 731 | 2022-01-22 | 118616 | 70495874 | 15891905 | 9781191 | 16506090 | 8716804 | 8975458 | 6245346 |
| 732 | 2022-01-23 | 118773 | 70699416 | 15966838 | 9923678 | 16807733 | 8773030 | 8975458 | 6250490 |
733 rows × 9 columns
Use the date functions to create another column with the month of each date.
[ ]:
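One possible approach, sketched on a small made-up frame (`sample_pd` and its values are assumptions for illustration): convert the `Date` column with `pd.to_datetime` and then read the month through the `.dt` accessor.

```python
import pandas as pd

# Made-up sample mimicking the shape of the COVID table (illustrative values only)
sample_pd = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2020-02-01"],
    "US": [1, 1, 8],
})

# read_csv loads dates as plain strings unless told otherwise, so convert them first
sample_pd["Date"] = pd.to_datetime(sample_pd["Date"])

# Month number (1-12) of each date
sample_pd["Month"] = sample_pd["Date"].dt.month

# Year and month together, useful when the data spans several years
sample_pd["YearMonth"] = sample_pd["Date"].dt.to_period("M")

print(sample_pd)
```

Because the COVID table runs from 2020 into 2022, a plain month number would mix, for example, January 2020 with January 2021; the `YearMonth` column keeps them apart.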
Now, group the data by month and country and create a dataset containing the total number of confirmed cases and deaths for each month and country. What can you say about the results?
[ ]:
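Since the table above has one column per country, one way to group by month and country is to reshape it to long format with `melt()` first. The sketch below uses a made-up frame (`wide_pd`, illustrative values only) and assumes the counts are cumulative, as the preview suggests, so it takes the maximum per month rather than the sum. If your data holds daily counts instead, `sum()` would be the appropriate aggregation.

```python
import pandas as pd

# Made-up wide table: one row per date, one column per country (illustrative values only)
wide_pd = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"]),
    "China": [10, 15, 20, 30],
    "US": [1, 2, 2, 5],
})

# Reshape to long format: one row per (date, country) pair
long_pd = wide_pd.melt(id_vars="Date", var_name="Country", value_name="Cases")
long_pd["Month"] = long_pd["Date"].dt.to_period("M")

# Assumption: counts are cumulative, so the monthly figure is the largest value reached
monthly_pd = (
    long_pd.groupby(["Month", "Country"])["Cases"]
    .max()
    .reset_index()
)

print(monthly_pd)
```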
For this exercise, we are going to use another public data repository, the UCI Machine Learning Repository, which hosts many datasets for different purposes. We are going to use the Wine Quality dataset. Download the dataset as a CSV file, then import it into a Pandas dataframe. What are the columns of the dataset? Display the statistical summary.
[ ]:
[ ]:
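A minimal sketch of the loading step, under two assumptions that should be checked against the file you actually download: that the file is separated by semicolons rather than commas, and that its columns resemble the shortened, made-up sample below. To keep the example self-contained, it reads from an in-memory string instead of a file on disk.

```python
import io
import pandas as pd

# Made-up, shortened sample in a semicolon-separated layout (illustrative values only)
csv_text = (
    '"fixed acidity";"pH";"alcohol";"quality"\n'
    "7.4;3.51;9.4;5\n"
    "7.8;3.20;9.8;5\n"
    "11.2;3.16;9.8;6\n"
)

# With a downloaded file, pass its path instead of the StringIO object,
# e.g. pd.read_csv("winequality-red.csv", sep=";") (hypothetical file name)
wine_pd = pd.read_csv(io.StringIO(csv_text), sep=";")

# Column names, then the statistical summary
columns = wine_pd.columns.tolist()
summary = wine_pd.describe()

print(columns)
print(summary)
```

If `read_csv` returns a single column whose name contains all the headers, the separator is probably wrong, and the `sep` argument is the first thing to adjust.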