A6—Matplotlib and seaborn
Due 2023-11-03, 11:59pm EST. 16 pts.
Follow all the instructions below.
Please post any questions about this assignment on Slack.
Warning: This is an individual assignment.
Table of contents
- Change log
- Aim of the assignment
- Instructions
- Submission instructions
- Grading
Change log
- 2023-11-01 fixed link to dataset.
- 2023-10-30 added sources for datasets.
- Due date was corrected to 2023-11-03. A student got a bug bounty!
Aim of the assignment
Matplotlib and seaborn are two classic approaches for visualizing data in Jupyter Notebooks. This assignment will introduce you to using both to create a variety of visualizations.
Instructions
Setup instructions
- Accept the GitHub Classroom assignment invitation by clicking this link to get your repository:
  https://classroom.github.com/a/Feozg6P-
  For reference, this is the template repository your repository is being created from: https://github.com/NEU-DS-4200-F23/A6--matplotlib_seaborn.
- Follow our usual steps for creating a virtual environment, using `pip` to install everything in `requirements.txt`, and starting Jupyter Lab.
Part 1 - Matplotlib: California Housing Analysis
Our first objective is to analyse and understand various metrics of California housing using Matplotlib. Perform the following steps in `california_housing.ipynb`. Reference the Matplotlib plot types, user guide, and example gallery.
Part 1.A: Data Exploration
- Load the dataset named `california_housing.csv`. This is from Kaggle.
- Display the first few rows of the dataset to understand its structure.
- Identify and list the unique metrics available in the dataset.
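
A minimal sketch of these exploration steps, assuming pandas is installed from `requirements.txt` and that the CSV sits next to the notebook. The helper name `explore` is hypothetical and not part of the template repository.

```python
import pandas as pd


def explore(csv_path: str, n_rows: int = 5) -> pd.DataFrame:
    """Load a CSV file, print its first rows, and list its columns."""
    df = pd.read_csv(csv_path)
    print(df.head(n_rows))        # first few rows, to see the structure
    print(df.columns.tolist())    # the metrics (columns) available
    print(df.dtypes)              # data type of each metric
    return df


# Assumed relative path; adjust it to wherever the file lives in your repository.
# housing = explore("california_housing.csv")
```

In a notebook, ending a cell with `df.head()` renders a formatted table, which is usually easier to read than the printed version.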
Part 1.B: Data Distribution Analysis
- Plot histograms for each feature in the dataset to understand their distributions.
- Create scatterplots to understand relationships between selected features. Specifically:
  - Visualise the relationship between `median_income` and `median_house_value`.
  - Visualise the geographical distribution of `median_house_value` using `longitude` and `latitude`.
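
One possible way to approach these plots with Matplotlib is sketched below. The function names, the three-column grid, and the styling values (`bins`, `s`, `alpha`, `cmap`) are illustrative assumptions, not requirements of the assignment.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df: pd.DataFrame, bins: int = 30):
    """Draw one histogram per numeric column on a grid of axes."""
    numeric = df.select_dtypes(include="number")
    n_cols = 3                                            # assumed grid width
    n_rows = max(1, -(-len(numeric.columns) // n_cols))   # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 3 * n_rows), squeeze=False)
    all_axes = axes.ravel()
    for ax, column in zip(all_axes, numeric.columns):
        ax.hist(numeric[column].dropna(), bins=bins)
        ax.set_title(column)
    for ax in all_axes[len(numeric.columns):]:
        ax.set_visible(False)                             # hide unused grid cells
    fig.tight_layout()
    return fig


def plot_income_vs_value(df: pd.DataFrame):
    """Scatter median_income against median_house_value."""
    fig, ax = plt.subplots()
    ax.scatter(df["median_income"], df["median_house_value"], s=5, alpha=0.3)
    ax.set_xlabel("median_income")
    ax.set_ylabel("median_house_value")
    return fig


def plot_geography(df: pd.DataFrame):
    """Place each row at its longitude/latitude, coloured by house value."""
    fig, ax = plt.subplots(figsize=(7, 6))
    points = ax.scatter(df["longitude"], df["latitude"],
                        c=df["median_house_value"], cmap="viridis", s=5, alpha=0.5)
    fig.colorbar(points, ax=ax, label="median_house_value")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    return fig
```

Low `alpha` and small markers are one way to reduce overplotting when many points overlap; other choices may suit your data better.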
Part 1.C: Advanced Visualizations
- Create a Hexbin plot to observe the density and relationship between `median_income` and `median_house_value`.
- Visualise the `housing_median_age` using an Area Plot.
- Create a Pie Chart to represent the distribution of houses based on the binned `housing_median_age`.
- Generate a Polar Plot to represent the median incomes of houses in different `housing_median_age` bins.
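
The four plots above could be sketched roughly as follows. The bin edges and labels for `housing_median_age` are assumptions chosen for illustration, as is the reading of "area plot" as a filled count of rows per age and of "median incomes" as the per-bin median of `median_income`. Pick bins and aggregations that you can justify for your own analysis.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

AGE_EDGES = [0, 10, 20, 30, 40, 60]                    # assumed bin edges (years)
AGE_LABELS = ["0-10", "11-20", "21-30", "31-40", "41+"]


def bin_ages(df: pd.DataFrame) -> pd.Series:
    """Assign each row to an age bin (right-closed intervals)."""
    return pd.cut(df["housing_median_age"], bins=AGE_EDGES,
                  labels=AGE_LABELS, include_lowest=True)


def plot_hexbin(df: pd.DataFrame):
    fig, ax = plt.subplots()
    hb = ax.hexbin(df["median_income"], df["median_house_value"],
                   gridsize=40, cmap="viridis", mincnt=1)
    fig.colorbar(hb, ax=ax, label="count")
    ax.set_xlabel("median_income")
    ax.set_ylabel("median_house_value")
    return fig


def plot_age_area(df: pd.DataFrame):
    """Filled curve of how many rows fall at each housing_median_age."""
    counts = df["housing_median_age"].value_counts().sort_index()
    fig, ax = plt.subplots()
    ax.fill_between(counts.index, counts.values, alpha=0.5)
    ax.set_xlabel("housing_median_age")
    ax.set_ylabel("number of rows")
    return fig


def plot_age_pie(df: pd.DataFrame):
    counts = bin_ages(df).value_counts(sort=False)     # keep bin order
    fig, ax = plt.subplots()
    ax.pie(counts.values, labels=list(counts.index), autopct="%1.0f%%")
    return fig


def plot_income_polar(df: pd.DataFrame):
    """One bar per age bin, bar length = median of median_income."""
    medians = df.groupby(bin_ages(df), observed=False)["median_income"].median()
    angles = np.linspace(0, 2 * np.pi, len(medians), endpoint=False)
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.bar(angles, medians.values, width=2 * np.pi / len(medians), alpha=0.6)
    ax.set_xticks(angles)
    ax.set_xticklabels(list(medians.index))
    return fig
```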
Part 1.D: Research Matplotlib Documentation and Create another Data Visualization
- Create a new visualization that’s different from the previous ones.
- Explain the data used, the type of visualization, and what story is being told.
- Justify your choice of visual encoding and visualization design choices (e.g., marks, channels, perceptual ordering, data type).
Part 1.E: Reflection
Based on your visual analysis, summarise the key insights you gathered about California housing metrics in a Markdown cell at the end. Provide recommendations or areas of focus for property investments in California.
Part 2 - Seaborn: Diamond Characteristics Analysis
The second objective is to analyze and understand diamond characteristics using seaborn. Perform the following steps in `diamonds.ipynb`.
Part 2.A: Data Exploration
- Load the dataset named `diamonds.csv`. This is from ggplot2 via seaborn-data.
- Display the first few rows to understand the diamond characteristics.
- Identify and describe the metrics available in the dataset.
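
When describing the metrics, it may help to separate numeric columns from categorical ones, since they call for different plots later. The sketch below shows one way to do that with pandas; the helper name `summarize_columns` is hypothetical.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> dict:
    """Split column names into numeric and non-numeric (categorical) groups."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return {"numeric": numeric, "categorical": categorical}


# Assumed relative path; adjust to your repository layout.
# diamonds = pd.read_csv("diamonds.csv")
# groups = summarize_columns(diamonds)
# print(diamonds[groups["numeric"]].describe())   # summary statistics
# for column in groups["categorical"]:
#     print(column, diamonds[column].unique())    # distinct levels
```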
Part 2.B: Data Distribution Analysis
- Plot histograms to understand the distributions of diamond features like `carat`, `depth`, `price`, and `table`.
- Create scatterplots to observe the relationships between:
  - `carat` and `price`
  - `depth` and `table`
Part 2.C: Advanced Visualizations
- Create box plots to understand the price distribution based on the diamond's `cut`.
- Generate violin plots to understand the distribution of diamond price based on its `clarity`.
- Use a pair plot to visualise the relationships between `carat`, `depth`, `table`, and `price`.
- Generate a swarm plot to understand the distribution of diamond price based on its `cut` (due to computational intensity, consider sampling a subset of the data for this visualization).
Part 2.D: Research Seaborn Documentation and Create another Data Visualization
- Create a new visualization that’s different from the previous ones.
- Explain the data used, the type of visualization, and what story is being told.
- Justify your choice of visual encoding and visualization design choices (e.g., marks, channels, perceptual ordering, data type).
Part 2.E: Reflection
Based on your visual analysis, summarise the key insights you gathered about diamonds in a Markdown cell at the end. Provide recommendations or areas of focus for diamond traders.
Submission instructions
- Ensure that:
  - All of your edits to the Jupyter Notebook files are committed and pushed to the remote repository on GitHub which was generated by GitHub Classroom. We will grade based on what is available in that repository.
- Submit the URL of your repository to the assignment A6—Matplotlib and seaborn in GradeScope.
  Warning: Do not put a link to a personal repository. It must be within our class GitHub organization.
Grading
Criteria | Points |
---|---|
Notebook 1 | 8 pts |
Notebook 2 | 8 pts |
Total | 16 pts |