Visualizing the Palmer Penguins Data Set with Plotly

Step 1. Read the data into Python using pandas

The pandas Python package allows us to read the data set, and facilitates data manipulation. After importing the pandas package, we read the url containing our dataset into pandas with the read_csv function.

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Here we do some preliminary data filtering, such as removing any NA values and respecifying our column names. Each column name gives a specific variable of interest for the penguins.

#data filtering
penguins = penguins.dropna(subset = ["Body Mass (g)", "Sex"])
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins = penguins[penguins["Sex"] != "."]
#renaming columns
cols = ["Species", "Island", "Sex", "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)"]
penguins = penguins[cols]

This is what our dataset looks like:

penguins

penguin_output.png

Step 2. Use an appropriate plot for the information that you want to visualize

Now that we have our dataset in proper format, it is time to choose some variables of interest that we would like to visualize. Here’s an example:

Say we wanted to compare the distribution of Culmen size (the length and width of the tip of the Penguin’s beak) accross various penguin species. One way to do this is to generate a plot of the penguins with the Culmen length as the x axis, and the culmen depth as the y axis.

Although we could generate seperate plots for each species, plotting them together and coloring them by species allows us to compare these distributions directly.

Step 3. Plot our data!

Here, we import the plotly.express module, which provides various plotting functions for visualization of our data.

We will be using the scatterplot function, which besides color, has various other arguments such as “size”, “symbol”, and “hover_data” which we can use to represent various other variables in our data set.

Here is an example where we color penguins by species. We give symbols corresponding to the island inhabited by the penguin, and set the size of the points to correspond to the penguins’ body mass.

In addition, we also use marginal plots (see “marginal_y” and “marginal_x” in the function parameters), which provide summary statistics of the data across either the x or y variable. For example, the x marginal plot gives violin plots of the culmen length distribution corresponding to each species and island, and the y marginal plot gives violin plots of the culmen depth distribution corresponding to each species and island.

from plotly import express as px

fig = px.scatter(data_frame = penguins, #dataset we want to visualize
                 title = "Culmen Size Distribution among Three Penguin Species", #figure title
                 x = "Culmen Length (mm)", #column from dataset to use for x axis
                 y = "Culmen Depth (mm)", #column from dataset to use for y axis
                 color = "Species", #column to use for color
                 size = "Body Mass (g)", #column to use for point size
                 symbol = "Island", #column to use for point symbol symbol
                 size_max = 8, # setting max point size
                 opacity = 0.7, #setting point opacity
                 hover_name = "Species", #column data to use as the title when hovering over data points
                 hover_data = ["Island", "Sex"], #column data to use when hovering over data points
                 width = 800, #width (in pixels) of the plot
                 height = 500, #height (in pixels) of the plot
                marginal_y = "violin", #specifying marginal plots gives supplementary summary statistics of x or y axes
                marginal_x = "violin")

# show the plot
fig.show()

penguinPlot.png And now we’re done! Note: plotly actually provides interactive plots, so if this code is actually executed in Jupyter notebook, hovering over the points would actually give the various statistics we provided in the hover_data and hover_title parameters.

Have fun plotting!

Written on April 7, 2022