Pandas DataFrame Exercises

In the first exercises, we are going to use an open dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is available on Kaggle. The dataset contains diagnostic measurements for a group of patients, together with whether each patient tested positive for diabetes. You can find it at this URL:

https://www.kaggle.com/uciml/pima-indians-diabetes-database

We have downloaded the dataset and uploaded it to the course repository. You can find it at the following URL:

https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv

The dataset contains the following columns:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1), where class value 1 is interpreted as “tested positive for diabetes”
Class distribution: 268 of the 768 records are 1, the others are 0

The following code loads the dataset into a Pandas dataframe:

[1]:

import pandas as pd
diabetes_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')
diabetes_pd

[1]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1
...	...	...	...	...	...	...	...	...	...
763	10	101	76	48	180	32.9	0.171	63	0
764	2	122	70	27	0	36.8	0.340	27	0
765	5	121	72	23	112	26.2	0.245	30	0
766	1	126	60	0	0	30.1	0.349	47	1
767	1	93	70	31	0	30.4	0.315	23	0

768 rows × 9 columns

Display the statistical summary of the dataset. What can you say about the dataset?
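
As a starting point, here is a minimal sketch of one possible approach, not a reference solution. `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column. The sketch uses only the first five rows from the output above so that it runs without network access. The name `sample` and the zero-count check are illustrative additions, and reading zeros as missing measurements is an assumption to verify against the full data.

```python
import pandas as pd

# First five rows copied from the output shown above (illustrative subset only).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Statistical summary of every numeric column.
summary = sample.describe()
print(summary)

# Assumption: a zero in these columns is physiologically implausible and may
# stand for a missing measurement, so counting zeros per column is informative.
measured = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
zero_counts = (sample[measured] == 0).sum()
print(zero_counts)
```

On the full dataset, the same two calls can be applied to `diabetes_pd` directly.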

[ ]:

Use the query function to find the average BMI of the patients with diabetes and the average BMI of the patients without diabetes. What can you say about the results?
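
A minimal sketch of how `query` can be combined with `mean` is shown below. It is one possible approach, again using only the five rows visible in the output above, so the numbers it produces say nothing about the full dataset. The variable names `bmi_positive` and `bmi_negative` are illustrative.

```python
import pandas as pd

# BMI and Outcome for the first five rows shown above (illustrative subset only).
sample = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# query() filters rows with a boolean expression written as a string.
bmi_positive = sample.query("Outcome == 1")["BMI"].mean()
bmi_negative = sample.query("Outcome == 0")["BMI"].mean()

print("Average BMI, Outcome == 1:", round(bmi_positive, 2))
print("Average BMI, Outcome == 0:", round(bmi_negative, 2))
```

Replacing `sample` with `diabetes_pd` gives the averages for all 768 records.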

[ ]:

For the next exercises, we are going to use datasets from Datahub, a platform for sharing and discovering open data. First, we are going to use the COVID-19 dataset, which contains information about the COVID-19 pandemic. You can find it at the following URL:

https://datahub.io/core/covid-19

[5]:

# Uncomment to use the version in datahub
# covid_pd = pd.read_csv('https://datahub.io/core/covid-19/r/countries-aggregated.csv')
covid_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/covid.csv')

covid_pd

[5]:

	Date	China	US	United_Kingdom	Italy	France	Germany	Spain	Iran
0	2020-01-22	548	1	0	0	0	0	0	0
1	2020-01-23	643	1	0	0	0	0	0	0
2	2020-01-24	920	2	0	0	2	0	0	0
3	2020-01-25	1406	2	0	0	3	0	0	0
4	2020-01-26	2075	5	0	0	3	0	0	0
...	...	...	...	...	...	...	...	...	...
728	2022-01-19	118370	68684431	15610069	9219391	15288014	8361262	8676916	6231909
729	2022-01-20	118470	69329860	15718193	9418256	15715329	8502132	8834363	6236567
730	2022-01-21	118544	70209840	15814617	9603856	16116748	8635461	8975458	6241843
731	2022-01-22	118616	70495874	15891905	9781191	16506090	8716804	8975458	6245346
732	2022-01-23	118773	70699416	15966838	9923678	16807733	8773030	8975458	6250490

733 rows × 9 columns

Use the date functions to create another column with the month of the date.
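
One way this could be done is sketched below: convert the column with `pd.to_datetime`, then read date parts through the `.dt` accessor. The sketch uses four rows and two country columns taken from the output above. The column names `Month` and `YearMonth` are illustrative choices. Because the data spans more than one year, a year-month period may be more useful than the bare month number, which would merge January 2020 with January 2022.

```python
import pandas as pd

# Four rows and two country columns copied from the output above (illustrative subset).
sample = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2022-01-22", "2022-01-23"],
    "China": [548, 643, 118616, 118773],
    "US": [1, 1, 70495874, 70699416],
})

# Parse the strings into datetime values so that the .dt accessor is available.
sample["Date"] = pd.to_datetime(sample["Date"])

# Month number (1 to 12), which ignores the year.
sample["Month"] = sample["Date"].dt.month

# Year and month together, which keeps different years apart.
sample["YearMonth"] = sample["Date"].dt.to_period("M")

print(sample)
```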

[ ]:

Now, group the data by month and country and create a dataset containing the total number of confirmed cases and deaths for each month and country. What can you say about the results?
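
The grouping step can be sketched as follows, under several stated assumptions. The file shown above has one column per country, so the sketch first reshapes it to one row per date and country with `melt`. The values in the output grow over time, which suggests cumulative confirmed counts, so the monthly figure is taken as the maximum within the month rather than the sum. The file shown above has no deaths column, so only confirmed cases are aggregated. The names `long_df`, `Country`, `Confirmed`, and `monthly` are illustrative.

```python
import pandas as pd

# Four rows and two country columns copied from the output above (illustrative subset).
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2022-01-22", "2022-01-23"]),
    "China": [548, 643, 118616, 118773],
    "US": [1, 1, 70495874, 70699416],
})

# Reshape from one column per country to one row per (Date, Country) pair.
long_df = sample.melt(id_vars="Date", var_name="Country", value_name="Confirmed")

# Year-month period used as the grouping key.
long_df["Month"] = long_df["Date"].dt.to_period("M")

# Assumption: counts are cumulative, so the month's total is its largest value.
monthly = (
    long_df.groupby(["Month", "Country"])["Confirmed"]
    .max()
    .reset_index()
)

print(monthly)
```

If the counts turned out to be daily new cases instead, `.sum()` would be the appropriate aggregation.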

[ ]:

For this exercise, we are going to use another public data repository, the UCI Machine Learning Repository, which hosts many datasets for different purposes. We are going to use the Wine Quality dataset. Download the dataset as a CSV file, then import it into a Pandas dataframe. What are the columns of the dataset? Display the statistical summary.
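
A minimal sketch of the import step is given below. It assumes the downloaded file is semicolon-separated, which is worth checking by opening your download in a text editor. To stay runnable without the file, the sketch reads a small in-memory stand-in whose column subset and values are made up for illustration. In practice the path to the downloaded CSV would replace `csv_text`.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded file (hypothetical columns and values).
# Replace csv_text with the path to the real CSV file.
csv_text = io.StringIO(
    '"fixed acidity";"pH";"alcohol";"quality"\n'
    "7.0;3.3;9.5;5\n"
    "8.1;3.2;10.2;6\n"
    "6.5;3.4;11.0;7\n"
)

# sep=";" is an assumption about the file layout; use "," for a comma-separated file.
wine_pd = pd.read_csv(csv_text, sep=";")

# Column names, followed by the statistical summary.
print(wine_pd.columns.tolist())
print(wine_pd.describe())
```

If every value ends up in a single column after loading, the separator is probably wrong.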

[ ]:

[ ]: