This session takes an intermediate look at data visualisation in Python with the package seaborn. If you haven't completed the introductory session, it is strongly advised you work through that material first, as the fundamentals of plotting with seaborn are covered there. This session builds on that foundation and explores more specific forms of visualisation, using facets, regressions and other plot types.
The easiest way to use Python 3, seaborn and Spyder is to install the Anaconda Distribution, a data science platform for Windows, Linux and macOS.
Open the Anaconda Navigator (you might have to run anaconda-navigator from a terminal on Linux), and launch Spyder. On some operating systems, you might be able to find Spyder directly in your applications.
In order to keep everything nicely contained in one directory, and to find files more easily, we need to create a project.
- Projects -> New project...
- New directory
- Project name: "python_seaborn_intermediate"
- Choose a location that suits you on your computer
- Click "Create"
This will move our working directory to the directory we just created, and Python will look for files (and save files) in this same directory by default.
Spyder opens a temporary script automatically. You can save that as a file into our project directory:
- File -> Save as...
- Make sure you are located in the project directory
- Name the script "process.py"
Working in a script allows us to write code more comfortably, and save a process as a clearly defined list of commands that others can review and reuse.
seaborn is a Python module useful for statistical data visualisation. It is a high-level module built upon another module called matplotlib. Effectively, seaborn is a more user-friendly module which uses matplotlib in the background to do all the work. It is also possible to use matplotlib directly, although it is less intuitive. Matplotlib is modelled on MATLAB, so if you have MATLAB experience, then it should be easier to use.
To be able to use the functions included in seaborn, we'll need to import it first.
import seaborn as sns
sns is the usual nickname for seaborn, as writing out seaborn. ... every time we need to use a function gets cumbersome pretty quickly.
To begin with, let's import a dataset to visualise. seaborn has some inbuilt data, and we'll use the one called "tips". To import the data, we use the sns.load_dataset( ... ) function as follows:
tips = sns.load_dataset("tips")
Notice that we have to start with sns. since we're using a seaborn function.
Now that our data is stored in the variable tips, let's have a look at it to see what it contains. Simply run
tips
Here we can see that our data is separated into seven variables (columns):
total_bill
tip
sex
smoker
day
time
size
with 244 rows. It looks like this data contains statistics on the tips that a restaurant or cafe received in a week.
We can visualise different aspects of this data, and we will have different options depending on the types of variables we choose. Data can be categorical or numerical, and ordered or unordered, leading to a variety of visualisation possibilities. If you would like to explore this more, check out From data to Viz, a tool which helps you to find the best option.
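One way to check which variables are numerical and which are categorical is to ask pandas for the column types. The sketch below is only illustrative: it uses a small hand-made data frame (called demo here, a hypothetical stand-in with a few tips-like columns) so that it runs without downloading anything. The same select_dtypes calls should work on tips itself.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the tips data (not the real dataset)
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "sex": pd.Categorical(["Female", "Male", "Male"]),
    "size": [2, 3, 3],
})

# Columns holding numbers (integers or floats)
numerical = demo.select_dtypes(include="number").columns.tolist()

# Columns stored with the pandas "category" type
categorical = demo.select_dtypes(include="category").columns.tolist()

print("Numerical:", numerical)
print("Categorical:", categorical)
```

Numerical columns suit relational and distribution plots, while categorical columns are natural choices for grouping with colour or facets.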
Let's set our theme so that our plots look nice. You can choose whichever preferences you would like (see Customisation -> Themes in the introductory session). Our preference for today is
sns.set_theme(context = "talk", style = "ticks")
In this session, we're going to write a single plot over multiple lines for readability. To run those lines together, we can section off areas of our code to run called cells.
To start a cell, use the code #%% . You should see a horizontal bar marking the start of the cell and a highlighted section of code (the cell). The cell will end before the next #%% , so if you need to close it too, place one at the end. Press Ctrl + Return to run the code in the cell, or Shift + Return to run the cell and move to the next. In our session we'll exclusively use Ctrl + Return.
Note that running a cell is equivalent to highlighting the code and pressing F9 (run selection or current line). Cells are advantageous as they both save time and allow us to rerun the same highlighted code.
In this session, we'll include the start of a cell in every snippet, but not the end. If you have code which follows the snippet, you may need to insert an extra #%% to start a new cell below.
In the introductory session we explored the fundamentals of plotting with seaborn and some of the data visualisation tools available. Now, let's look at some more advanced possibilities that we can use.
You may remember that there are three functions for normal, figure-level plots:
- sns.relplot (relational plots, e.g. scatter plots, line plots)
- sns.catplot (categorical plots, e.g. bar plots, box plots)
- sns.displot (distributions, e.g. histograms, kernel density distributions)
We are yet to use the third type, sns.displot. These are distributions, which provide an alternative way to analyse data.
To begin with, let's produce a histogram showing the distribution of the variable total bill.
#%%
sns.displot(data = tips,
x = "total_bill",
kind = "hist")
Remember to press Ctrl + Return to run the current cell
Immediately, we can see that the data is skewed, with a mean likely higher than the median due to a longer rightward tail.
Despite appearances, a histogram is not a bar plot (which is found in sns.catplot). A histogram is a distribution, where changing the number of bins (columns) can potentially reveal different results. Notice that total bill is numerical. Plotting it this way wouldn't be practical with a bar plot, which would draw a column for every distinct value - that's a lot of columns!
While it appears simple, there are a lot of features available in sns.displot. In our previous plot, the statistic is count - the total number of observations. We can change that, using stat = , to any of the following options
stat = ... | Description |
---|---|
"count" | Count, as seen above |
"frequency" | The number of observations divided by the bin width |
"probability" | Normalises the observations such that bar heights sum to 1 |
"density" | Normalises such that the total area of the bars (all together) sums to 1 |
Depending on your version, you may also have access to
stat = ... | Description |
---|---|
"proportion" | Almost identical to probability |
"percent" | Normalises such that bar heights sum to 100 |
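To make the difference between these statistics concrete, the following sketch computes them by hand with numpy from the raw bin counts. It is only an illustration of the definitions in the tables above, not seaborn's own code, and the bills array is a hypothetical handful of values rather than the tips data.

```python
import numpy as np

# Hypothetical bill amounts (not the tips dataset)
bills = np.array([8.0, 10.5, 12.0, 13.5, 15.0, 16.0, 18.5, 21.0, 24.0, 38.0])

# Raw counts per bin, plus the bin edges and widths
counts, edges = np.histogram(bills, bins=4)
widths = np.diff(edges)

frequency = counts / widths                    # observations per unit of bin width
probability = counts / counts.sum()            # bar heights sum to 1
percent = 100 * probability                    # bar heights sum to 100
density = counts / (counts.sum() * widths)     # total bar area sums to 1

print("count:      ", counts)
print("frequency:  ", frequency)
print("probability:", probability)
print("percent:    ", percent)
print("density:    ", density)
```

Note that "probability" and "density" only coincide when every bin has a width of 1, which is why the y-axis values change when you switch between them.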
Let's use "probability" to normalise the histogram
#%%
sns.displot(data = tips,
x = "total_bill",
kind = "hist",
stat = "probability")
Next, we can adjust the bin (column) properties, such as width, range and aesthetics.
Using bins = , we can specify the number of bins used. Above, there are 14. Notice that if we reduce the number, perhaps to 7, we could draw some different conclusions.
#%%
sns.displot(data = tips,
x = "total_bill",
kind = "hist",
stat = "probability",
bins = 7)
See how here the tail only decreases, while in the previous plot, the last bin is higher? Whatever choice we make, by grouping in bins, we always risk masking some data between bins. Similarly, we can increase the number to something high, say 50.
#%%
sns.displot(data = tips,
x = "total_bill",
kind = "hist",
stat = "probability",
bins = 50)
Here, the visualisation may be too sparse, making it harder to draw conclusions. The best number of bins can be found algorithmically or chosen manually.
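As a rough illustration of the algorithmic route, numpy provides several named reference rules for choosing bins through np.histogram_bin_edges. The sketch below compares three of them on a randomly generated, right-skewed sample (a hypothetical stand-in for total bill, not the real data). According to the seaborn documentation, the bins parameter can also be given the name of such a rule, although that is not shown here.

```python
import numpy as np

# Hypothetical right-skewed sample standing in for total_bill
rng = np.random.default_rng(seed=0)
sample = rng.gamma(shape=4.0, scale=5.0, size=244)

# Number of bins suggested by three of numpy's reference rules
bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1

for rule, n_bins in bin_counts.items():
    print(f"{rule}: {n_bins} bins")
```

Different rules can suggest quite different bin counts for the same data, so it is worth treating any automatic choice as a starting point rather than a final answer.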
Instead of visualising the individual observations, we could instead choose to view the cumulative observations. This may provide clarity on the general curve of the data, and could be especially useful for temporal data.
Creating a cumulative histogram is as simple as including the parameter cumulative = True.
#%%
sns.displot(data = tips,
x = "total_bill",
kind = "hist",
stat = "probability",
cumulative = True)
Here we can see that 60% of the data is below $20, just by looking at the cumulative distribution.
By including a third variable with hue = , we can produce multiple histograms separated by colour. Let's introduce the variable time.
#%%
sns.displot(data = tips,
x = "total_bill",
hue = "time",
kind = "hist",
stat = "probability")
Automatically, the times overlay. It looks like diners spend more in the evening than at lunch. The attribute multiple = determines how the two plots are displayed:
multiple = ... | Description |
---|---|
"layer" | Overlaid, as above. |
"dodge" | Bars are interwoven/side-by-side, like prison bars. |
"stack" | Bars are stacked. |
"fill" | Stacked bars which all reach 1, displaying the probability of obtaining one group over another. |
Additionally, the attribute element = provides alternatives to bars which still display the same visualisation. These possibilities are
element = ... | Description |
---|---|
"bars" | As above. |
"step" | A continuous outline, maintaining the vertical structure of the bins. |
"poly" | A continuous outline formed by straight lines between data points. |
Let's combine multiple = "stack" and element = "step"
#%%
sns.displot(data = tips,
x = "total_bill",
hue = "time",
kind = "hist",
stat = "probability",
multiple = "stack",
element = "step")
Finally, it is also possible to overlay a KDE (kernel density estimate) distribution too, using kde = True. We'll examine the KDE on its own now.
KDE (kernel density estimate) plots are smooth distributions which fit statistical data. Their smoothness may reflect reality in a way histograms don't, although it's important to acknowledge that KDEs are estimations and will have some distortion of the data. Normally, they provide an accurate picture of the sample distribution.
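To see where the smoothness comes from, a KDE can be built by hand: place a small Gaussian "bump" on every observation and average them. The sketch below does this with numpy on a hypothetical handful of bill amounts. The bandwidth uses Scott's rule, which is an assumption made for illustration, and seaborn's own bandwidth choice and implementation may differ.

```python
import numpy as np

# Hypothetical bill amounts (not the tips dataset)
bills = np.array([8.0, 10.5, 12.0, 13.5, 15.0, 16.0, 18.5, 21.0, 24.0, 38.0])
n = bills.size

# Bandwidth (width of each bump) from Scott's rule: an assumed, common default
bandwidth = bills.std(ddof=1) * n ** (-1 / 5)

# Points at which the smooth curve is evaluated
grid = np.linspace(bills.min() - 4 * bandwidth, bills.max() + 4 * bandwidth, 400)

# One Gaussian bump per observation, evaluated at every grid point
z = (grid[:, None] - bills[None, :]) / bandwidth
bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# The KDE is the average of the bumps
density = bumps.mean(axis=1)

# The area under the curve should be very close to 1
step = grid[1] - grid[0]
area = density.sum() * step
print(f"bandwidth = {bandwidth:.2f}, area under curve = {area:.3f}")
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one follows individual points more closely, which is the distortion trade-off mentioned above.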
KDE plots are produced by changing the attribute kind in our sns.displot function.
#%%
sns.displot(data = tips,
x = "total_bill",
hue = "time",
kind = "kde")
Notice that the
The ECDF (empirical cumulative distribution function) plots provide a cumulative visualisation of the data. These are unique in that they require no aggregation or estimation - no bins or fitting function, the ECDF just plots observations as a running total.
#%%
sns.displot(data = tips,
x = "total_bill",
hue = "time",
kind = "ecdf")
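Because no binning or fitting is involved, the ECDF is simple to compute by hand: sort the observations and step up by 1/n at each one. The following is a minimal sketch with numpy on a hypothetical set of bill amounts, not the tips data.

```python
import numpy as np

# Hypothetical bill amounts (not the tips dataset)
bills = np.array([8.0, 10.5, 12.0, 13.5, 15.0, 16.0, 18.5, 21.0, 24.0, 38.0])

# x positions: the observations in increasing order
x = np.sort(bills)

# y positions: the proportion of observations at or below each x
y = np.arange(1, x.size + 1) / x.size

# The same idea answers questions such as "what share of bills is under $20?"
share_below_20 = np.mean(bills < 20)

for value, proportion in zip(x, y):
    print(f"{value:5.1f} -> {proportion:.1f}")
print("Share below 20:", share_below_20)
```

Every observation appears as its own step, so nothing is hidden between bins.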
The final feature of distributive plots we'll examine is bivariate plotting. For plots of kind = "hist" and kind = "kde", it is possible to also pass a y attribute. Let's use tip with a histogram.
#%%
sns.displot(data = tips,
x = "total_bill",
y = "tip",
kind = "hist")
These plots have all the options of their univariate counterparts, such as changing the bins, introducing another variable, etc.
We can also create a bivariate KDE plot by changing to kind = "kde"
#%%
sns.displot(data = tips,
x = "total_bill",
y = "tip",
kind = "kde")
Visualised here are contour lines, corresponding to values of the estimated probability. Contours tend to circle around maxima.
Can you produce a histogram which examines the variable tip with 13 bins, and also introduces the variable size with hue = ? In addition, use the "stack" option for multiple variables. As a tricky extra step, you can superimpose a KDE distribution using kde = True (which is something we haven't yet tried).
Solution
The code is
#%%
sns.displot(data = tips,
x = "tip",
hue = "size",
bins = 13,
multiple = "stack",
kde = True)
And the plot is
Notice that by choosing a numerical variable for colour, seaborn uses a sequential palette, rather than a qualitative one.
When data visualisations include multiple separate plots, where the difference in plots corresponds to a variable but the
The advantage of facetting is that each facet is a relatively 'clean' plot and comparison across the facetting variable is simple. This is quite easily achieved in seaborn, simply by using the row = and col = parameters in any figure-level plots.
Let's start by producing a scatter plot comparing the total bill and tips variables.
#%%
sns.relplot(data = tips,
x = "total_bill",
y = "tip",
kind = "scatter")
Now, let's separate each day into columns. To do this, we introduce col = "day"
#%%
sns.relplot(data = tips,
x = "total_bill",
y = "tip",
kind = "scatter",
col = "day")
As you can see, there are now four separate plots, which correspond to the four days in our dataset (Thur, Fri, Sat, Sun).
As before, we can group by hue = , size = or style = to include another variable. Let's include the smoker variable with colour.
#%%
sns.relplot(data = tips,
x = "total_bill",
y = "tip",
kind = "scatter",
col = "day",
hue = "smoker")
We can also include a set of facets for the row. Let's use the sex variable
#%%
sns.relplot(data = tips,
x = "total_bill",
y = "tip",
kind = "scatter",
col = "day",
row = "sex",
hue = "smoker")
As you can see, some facets work better than others, and it's worth being prudent about how many variables to include. Too many rows and columns make the data in each facet sparse, which can hide trends that are more obvious when the data is viewed together.
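To keep track of how many facets a call produces, it can help to look at the object that a figure-level function returns. The sketch below assumes a small hand-made data frame (demo, a hypothetical stand-in for tips with two days and two values of sex) and a non-interactive matplotlib backend so it runs outside Spyder. It shows that the returned grid holds one set of axes per row and column combination.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside Spyder
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips data (not the real dataset)
demo = pd.DataFrame({
    "total_bill": [10.0, 12.5, 20.0, 18.0, 25.0, 30.5, 14.0, 22.0],
    "tip": [1.5, 2.0, 3.0, 2.5, 4.0, 5.0, 2.0, 3.5],
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "sex": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
})

# Figure-level functions return a FacetGrid describing the whole figure
grid = sns.relplot(data=demo,
                   x="total_bill",
                   y="tip",
                   kind="scatter",
                   col="day",
                   row="sex")

# One set of axes per (row, column) combination
n_rows, n_cols = grid.axes.shape
print(n_rows, "rows x", n_cols, "columns")
```

With the real tips data, four days and two values of sex would give eight facets, each holding only a fraction of the 244 observations.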
We're free to add any of the customisation that we could in the introductory session, as we are using the same function. Facetting works in the same way for any sns.displot() or sns.catplot() graphs too.
Pair plots appear very similar to facet plots, and are in a way a subset of facetting. They are often a useful way of summarising all the numerical data. To understand them, let's first create one.
#%%
sns.pairplot(data = tips)
This is a large plot. What pairplot does is produce scatterplots for all numerical variables, on the non-diagonal facets, and produce histograms on the diagonal. These histograms show the count for the variable they represent (e.g., the centre histogram shows the distribution of tips).
This pairplot looks a little odd because the size variable is discrete. Nevertheless, it is still useful.
We can customise the pairplot in a few ways. Firstly, as with all our plots, we can group by colour, using hue = . Grouping by time,
#%%
sns.pairplot(data = tips,
hue = "time")
Notice that the distributions have changed from histograms to layered kernel density estimates (KDEs). If we want to go back to histograms, we could try with the parameter kind = "hist"
#%%
sns.pairplot(data = tips,
hue = "time",
kind = "hist")
However, this also changes our scatterplots to bivariate histograms. If we only want to adjust the diagonal, we can instead use diag_kind = "hist"
#%%
sns.pairplot(data = tips,
hue = "time",
diag_kind = "hist")
This yields stacked histograms and scatterplots.
If you want to change both the diagonal and off-diagonal plots, simply use both kind and diag_kind in your plot.
Finally, if you're only interested in one set of the off-diagonals, which may often be the case due to the symmetry of the plots, then the parameter corner = True will only display the bottom-left half.
#%%
sns.pairplot(data = tips,
hue = "time",
diag_kind = "hist",
corner = True)
A third type of multi-plot visualisation is possible with the sns.jointplot() function. This visualisation plots distribution graphs along the axes of a bivariate plot (like a scatter plot or bivariate histogram). We can again compare our total bill and tip variables,
#%%
sns.jointplot(data = tips,
x = "total_bill",
y = "tip")
If we group by a variable, say smoker, again using hue = , we see (like in the pair plots) a KDE distribution instead of a histogram
#%%
sns.jointplot(data = tips,
x = "total_bill",
y = "tip",
hue = "smoker")
We can change our plot types using the kind = parameter as before. The options available are
kind = ... | Description |
---|---|
"scatter" | A scatter plot with histograms. |
"kde" | A univariate and bivariate KDE plot. |
"hist" | A univariate and bivariate histogram. |
"hex" | A univariate and bivariate histogram with hexagonal bins. |
"reg" | A scatter plot with a linear regression and histograms with a KDE. |
"resid" | The residuals of a linear regression with histograms. |
Let's try with "hist".
#%%
sns.jointplot(data = tips,
x = "total_bill",
y = "tip",
hue = "smoker",
kind = "hist")
Joint plots provide a powerful tool for analysing data, particularly grouped data, because the distribution plots on the margins indicate whether specific variables cluster in certain regions. In this case, it looks like the smoker and non-smoker data are both distributed around the same means, with similar tails, indicating that there may not be a relationship between smoker and total bill or smoker and tip. One of the most important tools for this form of analysis, however, is linear regression.
The final multi-plot visualisation technique we will explore displays multiple different plots on the same set of axes. To do that, we need to call one plotting function for each of our plots. Let's start by viewing a scatter plot of size vs tip
#%%
sns.relplot(data = tips,
x = "size",
y = "tip",
kind = "scatter")
Let's also create the graph that we want to overlay. We could try a line graph of the averages
#%%
sns.relplot(data = tips,
x = "size",
y = "tip",
kind = "line")
Now, let's try to put the line graph on top of the scatter. Here, we need to pause and consider the difference between figure-level and axes-level functions for a moment.
- Figure-level plots create a new figure (the whole image) every time they're used. sns.relplot, sns.displot and sns.catplot are the main figure-level functions (sns.pairplot and sns.jointplot, used earlier, also create their own figures).
- Axes-level plots create a new plot on the current axes every time they're used. Using these, we can put multiple plots on the same image. Examples include sns.lineplot, sns.barplot, sns.histplot etc. We haven't used any of these yet, but the syntax is almost identical to those figure-level functions.
The best way to understand this difference is to try them both. If I call two figure-level plots, it's going to create two separate images. Try running the two plots above in the same cell.
#%%
sns.relplot(data = tips,
x = "size",
y = "tip",
kind = "scatter")
sns.relplot(data = tips,
x = "size",
y = "tip",
kind = "line")
What you should find is that two separate images are produced, the same ones as above! That's because the second plot, sns.relplot( ... , kind = "line"), is a figure-level plot, so it makes a whole new image for the visualisation. Instead, let's try replacing the second plot with sns.lineplot. We also need to omit the kind = "line" argument, since it is now redundant.
#%%
sns.relplot(data = tips,
x = "size",
y = "tip",
kind = "scatter")
sns.lineplot(data = tips,
x = "size",
y = "tip")
Et voila! The line plot has been placed on top of the scatter plot. If we wanted to include any more plots, we would need to do so with additional axes-level functions.
Note that by keeping the first plot as figure-level, we retain the visual aesthetic of figure-level plots. In general, the first plot we use will define the default appearance.
sns.lineplot is one example of many axes-level functions. In fact, for every kind = " ... " option there is a corresponding sns. ... plot() axes-level function. So
- sns.displot( ... , kind = "hist") corresponds to sns.histplot
- sns.catplot( ... , kind = "count") corresponds to sns.countplot
- sns.relplot( ... , kind = "scatter") corresponds to sns.scatterplot
- etc.
Can you produce a group of histograms which look at the distribution of total bill? The group should separate the responses to smoker into columns and time into rows.
Solution
The code is
#%%
sns.displot(data = tips,
x = "total_bill",
col = "smoker",
row = "time")
And the plot is
There are four categorical plots that we'll look at now: box plots, violin plots, swarm plots and point plots. To produce these, we'll move to the function sns.catplot.
Bar plots were covered in the introductory session.
Box plots are a classic categorical plot, showing key statistics about the dataset. To produce one, we can simply use
#%%
sns.catplot(data = tips,
kind = "box")
Boxplots are composed of a box (coloured), whiskers (extending from the box in each direction) and outliers (diamond points). The box spans the middle two quartiles (from 25% to 75%, half of the data), with the median drawn as the interior line. By default, the whiskers extend to the most extreme observations lying within 1.5 times the interquartile range of the box edges, and any observations beyond the whiskers are drawn individually as outlier points.
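The numbers behind a box plot can be worked out by hand. The following sketch assumes the common rule that whiskers reach the most extreme observations within 1.5 times the interquartile range (IQR) of the box, and uses a hypothetical handful of bill amounts rather than the tips data.

```python
import numpy as np

# Hypothetical bill amounts (not the tips dataset)
bills = np.array([8.0, 10.5, 12.0, 13.5, 15.0, 16.0, 18.5, 21.0, 24.0, 38.0])

# The box: first quartile, median and third quartile
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point is treated as an outlier (assumed 1.5 x IQR rule)
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# The whiskers stop at the most extreme observations inside those limits
inside = bills[(bills >= low_limit) & (bills <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything else is drawn as an individual point
outliers = bills[(bills < low_limit) | (bills > high_limit)]

print("Box:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```

In this example the largest bill sits well beyond the upper limit, so it would be drawn as a separate point rather than stretching the whisker.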
Obviously, the observations don't scale too well between variables here, so let's isolate a specific variable by assigning one to x
#%%
sns.catplot(data = tips,
x = "total_bill",
kind = "box")
Next, we can introduce another variable with y = . Let's use day.
#%%
sns.catplot(data = tips,
x = "total_bill",
y = "day",
kind = "box")
Here four boxplots display the data for each day. We can introduce a third variable using hue, so let's include smoker.
#%%
sns.catplot(data = tips,
x = "total_bill",
y = "day",
hue = "smoker",
kind = "box")
Bear in mind, boxplots don't display the number of data points, and by separating by day and smoker status, we have probably reduced our groups to small sample sizes which may be statistically unreliable.
Violin plots are similar to boxplots, but use a smooth design rather than a box, to emphasise the distribution.
To produce a violin plot like our previous boxplot, we simply need to change the attribute kind = "violin".
#%%
sns.catplot(data = tips,
x = "total_bill",
y = "day",
hue = "smoker",
kind = "violin")
Violin plots replace the boxes with Kernel Density Estimates (remember them?) for the underlying distribution. Note that these may be misleading if there are only a small number of data points. We can check that by using the attribute inner.
With the inner attribute, there are five options:
inner = ... | Description |
---|---|
"box" | Default, shows the box and whisker plot inside |
"quartile" | Shows the quartiles, with markers at 25% (Q1), 50% (median) and 75% (Q3) |
"point" | Shows individual data entries as points |
"stick" | Shows individual data entries as lines |
None | Shows nothing |
Let's use inner = "stick".
#%%
sns.catplot(data = tips,
x = "total_bill",
y = "day",
hue = "smoker",
kind = "violin",
inner = "stick")
As you can see, some plots have more data points than others. In particular, the Friday, Yes plot has only a handful of points - the plot may be unreasonably smooth.
Finally, let's examine swarm plots. These are a way of creating scatterplot-like visualisations for categorical data, adjusting the position of individual points so that all observations are visible. We can create one with the variables we used previously (let's exclude hue for now)
#%%
sns.catplot(data = tips,
x = "total_bill",
y = "day",
kind = "swarm")
Swarm plots give a clear qualitative indication of the spread of data.
For our final challenge, try to produce this plot
Hint: the colour palette used is called "crest".
Solution
The code is
#%%
sns.catplot(data = tips,
x = "size",
y = "total_bill",
palette = "crest",
kind = "violin")
Here we briefly explore an extension to our previous plot, encountering some of the limitations with seaborn. What if we wanted to swap our axes above, displaying the violin plots horizontally? Well, we could try swapping the x and y variables:
#%%
sns.catplot(data = tips,
x = "total_bill",
y = "size",
palette = "crest",
kind = "violin")
But this produces a strange graph (and takes some time to do so) by considering every entry in total bill as a separate categorical variable, with its own plot.
Here we see that seaborn automatically assumes the x variable is categorical if both variables are numerical when using sns.catplot. We can tell seaborn to instead consider size as categorical by including orient = "h".
That's better! But hold on, the ordering of size is misleading. We should have 1 at the bottom and 6 at the top! Seaborn doesn't know this, because we've told it that size is categorical, so it doesn't know how to interpret the order. Changing this requires a manual solution, using order = . If we know the order we want, we could simply include
order = [6,5,4,3,2,1]
which does the trick.
However, what if there were many different options? It's hard work to put them all in manually. We can do it with the pandas module. Firstly, let's import it
import pandas as pd
Next, let's create the list of the unique values in size that we want (like [6,5,4, ... ] above) and order them.
myOrder = sorted(pd.unique(tips["size"]), reverse = True)
Let's unpack this. Here, we've isolated our size data with tips["size"], made a list of its unique values with pd.unique( ... ), and sorted them in reverse order with sorted( ... , reverse = True). All together, we get the same list [6,5,4, ... ].
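To check the ordering step on its own, here is a small sketch using a hand-made pandas Series (sizes, a hypothetical stand-in for tips["size"]). The values are converted to plain Python integers only so that the printed list is easy to read.

```python
import pandas as pd

# Hypothetical stand-in for tips["size"]
sizes = pd.Series([2, 3, 3, 2, 4, 1, 6, 5, 2], name="size")

# Unique values, in the order they first appear
unique_sizes = pd.unique(sizes)

# Sorted from largest to smallest, as plain Python integers
my_order = [int(value) for value in sorted(unique_sizes, reverse=True)]

print(my_order)
```

The same approach works for any number of categories, which is what makes it preferable to typing the list out by hand.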
If we then put myOrder into the order = attribute, it works!
#%%
myOrder = sorted(pd.unique(tips["size"]), reverse = True)
sns.catplot(data = tips,
x = "total_bill",
y = "size",
palette = "crest",
kind = "violin",
orient = "h",
order = myOrder)
That was a lot of different plots. Here is a summary of what we just covered.
Topic | Description |
---|---|
Cells | By using #%% , we can create a cell which runs all the code within it using Ctrl + Return |
Plotting | Remember that we use sns.relplot for relational plots (line & scatter), sns.displot for distributions and sns.catplot for categorical plots. See the summaries below for more depth. |
Distributive Plots | Histograms, KDE plots and ECDF plots all allow analysis of data distributions, which can be modified by statistic, bin dimensions, etc. |
Faceting | With col = and row = we can facet by variables within our dataset. Joint plots and Pair plots offer special types of facet plots. |
Overlaying plots | By combining a figure-level plot with multiple axes-level plots, we can overlay multiple graphs onto the same visualisation |
Categorical Plots | A number of categorical plots are available (e.g. box plots, swarm plots) which provide alternative visualisations for categorical data |
Below is a summary of all available* plots in seaborn. Most of these have been examined in either the introductory session or this one; however, there are some which we have not yet looked at. The seaborn documentation and tutorials provide descriptions and advice for all available plots.
*As of v0.12.2
All the plots below are figure-level. To produce the axes-level plot of the same type, simply use sns.****plot() where **** is given in kind = "****" for the corresponding figure-level plot. For example,
sns.relplot( ..., kind = "scatter", ... ) # Figure-level scatter plot
sns.scatterplot( ... ) # Axes-level scatter plot
Plot Name | Code | Notes |
---|---|---|
Scatter Plot | sns.relplot( ... , kind = "scatter", ... ) | Requires numerical data |
Line Plot | sns.relplot( ... , kind = "line", ... ) | Requires numerical data |
Plot Name | Code | Notes |
---|---|---|
Histogram | sns.displot( ... , kind = "hist", ... ) | Can be univariate (x only) or bivariate (x and y) |
Kernel Density Estimate | sns.displot( ... , kind = "kde", ... ) | Can be univariate (x only) or bivariate (x and y) |
ECDF* | sns.displot( ... , kind = "ecdf", ... ) | |
Rug Plot | sns.displot( ... , rug = True, ... ) | Combine with another sns.displot, plots marginal distributions |
*Empirical Cumulative Distribution Functions
Plot Name | Code | Notes |
---|---|---|
Strip Plot | sns.catplot( ... , kind = "strip", ... ) | Like a scatterplot for categorical data |
Swarm Plot | sns.catplot( ... , kind = "swarm", ... ) | |
Box Plot | sns.catplot( ... , kind = "box", ... ) | One variable is always interpreted categorically |
Violin Plot | sns.catplot( ... , kind = "violin", ... ) | One variable is always interpreted categorically |
Enhanced Box Plot | sns.catplot( ... , kind = "boxen", ... ) | A box plot with additional quantiles |
Point Plot | sns.catplot( ... , kind = "point", ... ) | Like a line plot for categorical data |
Bar Plot | sns.catplot( ... , kind = "bar", ... ) | Aggregates data |
Count Plot | sns.catplot( ... , kind = "count", ... ) | A bar plot with the total number of observations |
Plot Name | Code | Notes |
---|---|---|
Pair Plot | sns.pairplot( ... ) | A form of facetting |
Joint Plot | sns.jointplot( ... ) | |
Regressions | sns.lmplot( ... ) | |
Residual Plot | sns.residplot( ... ) | The residuals of a linear regression |
Heatmap | sns.heatmap( ... ) | |
Clustermap | sns.clustermap( ... ) | |
If you have any further questions, would like to learn about more Python content or would like support with your work, we would love to hear from you at [email protected]