A6—Matplotlib and seaborn

Due 2023-11-03, 11:59pm EST 16pts

Follow all the instructions below.

Please post any questions about this assignment on Slack.

Warning: This is an individual assignment.

Change log

  • 2023-11-01 fixed link to dataset.
  • 2023-10-30 added sources for datasets.
  • Due date was corrected to 2023-11-03. A student got a bug bounty!

Aim of the assignment

Matplotlib and seaborn are two classic approaches for visualizing data in Jupyter Notebooks. This assignment will introduce you to using both to create a variety of visualizations.

Instructions

Setup instructions

  1. Accept the GitHub Classroom assignment invitation by clicking this link to get your repository:

    https://classroom.github.com/a/Feozg6P-

    For reference, this is the template repository your repository is being created from: https://github.com/NEU-DS-4200-F23/A6--matplotlib_seaborn.

  2. Follow our usual steps for creating a virtual environment, using pip to install everything in requirements.txt, and starting Jupyter Lab.

Part 1 - Matplotlib: California Housing Analysis

Our first objective is to analyze and understand various metrics of California housing using Matplotlib. Perform the following steps in california_housing.ipynb. Refer to the Matplotlib plot types, user guide, and example gallery as you work.

Part 1.A: Data Exploration

  • Load the dataset named california_housing.csv. This is from Kaggle.
  • Display the first few rows of the dataset to understand its structure.
  • Identify and list the unique metrics available in the dataset.
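
The exploration steps above could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the `explore` helper is a hypothetical name, and the small `demo` frame is made-up stand-in data so the snippet runs without the assignment file. In the notebook you would load the real data with `pd.read_csv("california_housing.csv")` instead.

```python
import pandas as pd


def explore(df: pd.DataFrame) -> dict:
    """Summarize a DataFrame: its shape, column dtypes, and missing-value counts."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }


# Made-up stand-in rows (assumption); replace with
# df = pd.read_csv("california_housing.csv") in the notebook.
demo = pd.DataFrame(
    {
        "median_income": [8.3, 7.2, None],
        "median_house_value": [452600.0, 358500.0, 352100.0],
    }
)

print(demo.head())          # first few rows
print(list(demo.columns))   # the metrics (columns) available
summary = explore(demo)
print(summary)
```

Checking the missing-value counts early is useful because histograms and scatterplots silently drop or mishandle missing entries.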

Part 1.B: Data Distribution Analysis

  • Plot histograms for each feature in the dataset to understand their distributions.
  • Create scatterplots to understand relationships between selected features. Specifically:
    • Visualise the relationship between median_income and median_house_value.
    • Visualise the geographical distribution of median_house_value using longitude and latitude.
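
One possible shape for these plots is sketched below. The helper names `plot_histograms` and `plot_scatterplots` are hypothetical, the grid layout, bin count, and colormap are illustrative choices, and `demo` is synthetic data that only borrows the column names mentioned above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_histograms(df, ncols=3, bins=30):
    """Draw one histogram per numeric column on a grid of subplots."""
    numeric = df.select_dtypes("number")
    nrows = -(-numeric.shape[1] // ncols)  # ceiling division
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False
    )
    panels = axes.ravel()
    for ax, col in zip(panels, numeric.columns):
        ax.hist(numeric[col].dropna(), bins=bins)
        ax.set_title(col)
    for ax in panels[numeric.shape[1]:]:
        ax.set_visible(False)  # hide unused panels in the last row
    fig.tight_layout()
    return fig


def plot_scatterplots(df):
    """Income vs. house value, plus a longitude/latitude map colored by value."""
    fig, (ax_income, ax_map) = plt.subplots(1, 2, figsize=(12, 5))
    ax_income.scatter(df["median_income"], df["median_house_value"], s=8, alpha=0.4)
    ax_income.set_xlabel("median_income")
    ax_income.set_ylabel("median_house_value")
    points = ax_map.scatter(
        df["longitude"], df["latitude"],
        c=df["median_house_value"], cmap="viridis", s=8, alpha=0.6,
    )
    ax_map.set_xlabel("longitude")
    ax_map.set_ylabel("latitude")
    fig.colorbar(points, ax=ax_map, label="median_house_value")
    return fig


# Synthetic stand-in data (assumption), not the real dataset.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame(
    {
        "longitude": rng.uniform(-124.3, -114.3, n),
        "latitude": rng.uniform(32.5, 42.0, n),
        "median_income": rng.gamma(4.0, 1.0, n),
    }
)
demo["median_house_value"] = 40_000 * demo["median_income"] + rng.normal(0, 30_000, n)

hist_fig = plot_histograms(demo)
scatter_fig = plot_scatterplots(demo)
```

A low `alpha` helps with overplotting in dense regions, and mapping `median_house_value` to color lets the geographic scatterplot carry a third variable.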

Part 1.C: Advanced Visualizations

  • Create a Hexbin plot to observe the density and relationship between median_income and median_house_value.
  • Visualise the housing_median_age using an Area Plot.
  • Create a Pie Chart to represent the distribution of houses based on the binned housing_median_age.
  • Generate a Polar Plot to represent the median incomes of houses in different housing_median_age bins.
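
The four chart types above could be sketched along these lines. Everything specific here is an assumption made for illustration: the data is synthetic, the bin edges and labels for `housing_median_age` are arbitrary, and the area plot shows the count of rows per age value, which is only one of several reasonable readings of that step.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption), not the real dataset.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame(
    {
        "housing_median_age": rng.integers(1, 53, size=n),
        "median_income": rng.gamma(4.0, 1.0, size=n),
    }
)
demo["median_house_value"] = 40_000 * demo["median_income"] + rng.normal(0, 30_000, n)

# Illustrative bin edges and labels; choose bins that suit the real data.
edges = [0, 10, 20, 30, 40, 52]
labels = ["1-10", "11-20", "21-30", "31-40", "41-52"]
demo["age_bin"] = pd.cut(demo["housing_median_age"], bins=edges, labels=labels)

counts = demo.groupby("age_bin", observed=False).size()
mean_income = demo.groupby("age_bin", observed=False)["median_income"].mean()

fig = plt.figure(figsize=(12, 10))

# Hexbin: density of points in the income/value plane.
ax_hex = fig.add_subplot(2, 2, 1)
hb = ax_hex.hexbin(
    demo["median_income"], demo["median_house_value"],
    gridsize=30, cmap="viridis", mincnt=1,
)
fig.colorbar(hb, ax=ax_hex, label="count")
ax_hex.set_xlabel("median_income")
ax_hex.set_ylabel("median_house_value")

# Area plot: number of rows at each housing_median_age value.
ax_area = fig.add_subplot(2, 2, 2)
age_counts = demo["housing_median_age"].value_counts().sort_index()
ax_area.fill_between(age_counts.index, age_counts.to_numpy(), alpha=0.5)
ax_area.set_xlabel("housing_median_age")
ax_area.set_ylabel("count")

# Pie chart: share of rows in each age bin.
ax_pie = fig.add_subplot(2, 2, 3)
ax_pie.pie(counts.to_numpy(), labels=list(counts.index), autopct="%1.0f%%")

# Polar bar chart: mean median_income per age bin.
ax_polar = fig.add_subplot(2, 2, 4, projection="polar")
theta = np.linspace(0, 2 * np.pi, len(mean_income), endpoint=False)
ax_polar.bar(theta, mean_income.to_numpy(), width=2 * np.pi / len(mean_income) * 0.9)
ax_polar.set_xticks(theta)
ax_polar.set_xticklabels(list(mean_income.index))

fig.tight_layout()
```

Binning once with `pd.cut` and reusing the `age_bin` column keeps the pie chart and the polar plot consistent with each other.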

Part 1.D: Research Matplotlib Documentation and Create another Data Visualization

  • Create a new visualization that’s different from the previous ones.
  • Explain the data used, the type of visualization, and what story is being told.
  • Justify your choice of visual encoding and visualization design choices (e.g., marks, channels, perceptual ordering, data type).

Part 1.E: Reflection

Based on your visual analysis, summarise the key insights you gathered about California housing metrics in a Markdown cell at the end. Provide recommendations or areas of focus for property investments in California.

Part 2 - Seaborn: Diamond Characteristics Analysis

The second objective is to analyze and understand diamond characteristics using seaborn. Perform the following steps in diamonds.ipynb.

Part 2.A: Data Exploration

  • Load the dataset named diamonds.csv. This is from ggplot2 via seaborn-data.
  • Display the first few rows to understand the diamond characteristics.
  • Identify and describe the metrics available in the dataset.
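
Since the diamonds data mixes numeric and categorical metrics, one way to describe them is to summarize each kind separately, as in the sketch below. The `demo` rows are a made-up stand-in using a few of the column names this assignment mentions; in the notebook you would start from `pd.read_csv("diamonds.csv")`.

```python
import pandas as pd

# Made-up stand-in rows (assumption); replace with
# df = pd.read_csv("diamonds.csv") in the notebook.
demo = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.29, 0.31],
        "cut": ["Ideal", "Premium", "Premium", "Good"],
        "clarity": ["SI2", "SI1", "VS2", "SI2"],
        "price": [326, 326, 334, 335],
    }
)

print(demo.head())

# Split the columns by kind so each gets an appropriate summary.
numeric_cols = demo.select_dtypes("number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(demo[numeric_cols].describe())    # count, mean, std, quartiles
for col in categorical_cols:
    print(demo[col].value_counts())     # frequency of each category
```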

Part 2.B: Data Distribution Analysis

  • Plot histograms to understand the distributions of diamond features like carat, depth, price, and table.
  • Create scatterplots to observe the relationships between:
    • carat and price
    • depth and table

Part 2.C: Advanced Visualizations

  • Create box plots to understand the price distribution based on the diamond’s cut.
  • Generate violin plots to understand the distribution of diamond price based on its clarity.
  • Use a pair plot to visualise the relationships between carat, depth, table, and price.
  • Generate a swarm plot to understand the distribution of diamond price based on its cut (due to computational intensity, consider sampling a subset of the data for this visualization).

Part 2.D: Research Seaborn Documentation and Create another Data Visualization

  • Create a new visualization that’s different from the previous ones.
  • Explain the data used, the type of visualization, and what story is being told.
  • Justify your choice of visual encoding and visualization design choices (e.g., marks, channels, perceptual ordering, data type).

Part 2.E: Reflection

Based on your visual analysis, summarise the key insights you gathered about diamonds in a Markdown cell at the end. Provide recommendations or areas of focus for diamond traders.

Submission instructions

  1. Ensure that:

    1. All of your edits to the Jupyter Notebook files are committed and pushed to the remote repository on GitHub which was generated by GitHub Classroom. We will grade based on what is available in that repository.
  2. Submit the URL of your repository to the assignment A6—Matplotlib and seaborn in GradeScope.

    Warning: Do not put a link to a personal repository. It must be within our class GitHub organization.

Grading

Criteria      Points
Notebook 1    8 pts
Notebook 2    8 pts
Total         16 pts

© 2023 Cody Dunne. Released under the CC BY-SA license.