Pick’n’mix plots

3 variations of boxplot mixed with data plots.

Follow-up to: Showing distributions

 

When I wrote about the half-and-half plot, many of you replied with further discussion points, tips, and tutorials. I have tried to collect them here to make them available to everyone.

More mixed boxplots

Aaron Ellison @AMaxEll17 brought to our attention that he published a plot in 1993 in which he overlaid the box plot with the data points (see fig 1A). Along with it he published the code, long before GitHub and the like. Aaron was inspired by the just published “Grammar of Graphics” by Wilkinson. He seems to be the first person to have published such a plot in a paper?

Today, boxplot/data plots are common and easy to make in R with ggplot2. Declan O’Regan @DrDeclanORegan shows us one example in figure 1B. An “exploded” version, where the boxplot and its metrics are barely visible and the focus is on the data points, is shown in figure 1C (provided by the Cystic Fibrosis Gene Therapy group @CFGT_Edinburgh).

Figure 1: Boxplot mixed with data plots.
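For readers who prefer Python, here is a minimal sketch of the same idea: a boxplot with every observation overlaid as a slightly jittered point. It is not the code behind any of the panels in figure 1. The function names, the jitter width and the example values are my own assumptions, and the drawing function needs matplotlib to be installed.

```python
import random
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for one group of observations."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def jittered_positions(n_points, centre, width=0.2, seed=0):
    """Spread n_points randomly within +/- width/2 around the box centre."""
    rng = random.Random(seed)
    return [centre + rng.uniform(-width / 2, width / 2) for _ in range(n_points)]


def draw_box_with_points(groups):
    """Draw one box per group and overlay all observations (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for drawing

    fig, ax = plt.subplots()
    labels = list(groups)
    # showfliers=False: outliers are already visible as raw data points
    ax.boxplot([groups[name] for name in labels], showfliers=False)
    for position, name in enumerate(labels, start=1):
        xs = jittered_positions(len(groups[name]), position)
        ax.plot(xs, groups[name], linestyle="", marker="o", alpha=0.5)
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    return fig


# Made-up example values, for illustration only
groups = {
    "control": [4.1, 4.8, 5.0, 5.2, 5.9, 6.3],
    "treated": [5.5, 6.0, 6.1, 6.8, 7.4, 8.0],
}
summary = five_number_summary(groups["control"])
```

Calling `draw_box_with_points(groups)` returns the figure. Making the jitter narrower than the box keeps the points visually attached to their category.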

Box’n’Bee

There is also the overlay of boxplot with the bee-swarm plot. Here, individual data points are ordered and arranged in a U-shape instead of randomly placed. An example is shown by Darren Wisniewski @Dmwizzle, who made this in ggplot2 (fig 2A).

But beware of the bee-swarm: the ordered arrangement of the data (most commonly U- or A-shaped) may introduce visual artifacts. Personally, I draw a mental line through the U-shaped branches and straighten it to understand the data. This is error-prone and of course a waste of time when the line could just as well be straight. In figure 2B I have plotted the same data as a bee plot and as a dot plot for a direct comparison. I feel it is easier to see how the data is distributed in the data/dot plot. (Data: gene expression of RNAs that are localized at the poles in the fruit fly oocyte. RNAs that localize at the posterior for days have higher expression than RNAs at the anterior pole that are localized for just a few hours.)

Pick_fig2-01.png
Figure 2: A. Boxplot and bee swarm plot. B. Comparison of bee versus data/dot plot.
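To illustrate what I mean by straight lines, here is a small sketch of one possible dot plot layout: values are grouped into bins, and each bin becomes a single horizontal row of dots centred on the category. This is an assumed, simplified layout and not the code used for figure 2B. The function name and parameters are hypothetical, and note that every dot is drawn at the centre of its bin rather than at its exact value.

```python
from collections import defaultdict


def dotplot_layout(values, binwidth, centre=0.0, spacing=1.0):
    """Return (x, y) pairs: each bin becomes one straight, centred row of dots."""
    bins = defaultdict(list)
    for value in values:
        # index of the nearest bin centre
        bins[round(value / binwidth)].append(value)

    points = []
    for index in sorted(bins):
        members = bins[index]
        y = index * binwidth  # all dots of one bin share the same height
        # centre the row on the category position
        start = centre - spacing * (len(members) - 1) / 2
        for k in range(len(members)):
            points.append((start + k * spacing, y))
    return points


# Made-up example values, for illustration only
layout = dotplot_layout([1.0, 1.02, 0.98, 2.0], binwidth=0.1, centre=1.0, spacing=0.1)
```

The width of each row is then simply the number of observations in that bin, so the layout reads like a histogram drawn with individual points.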

Histogram & boxplot

Robert Grant @robertstats pointed us to an interesting histogram overlaid with statistical summaries that was originally designed by @f2harrell (here is a link to a tutorial with R), see figure 3. The horizontal histogram shown below has particularly small bins and the median and quartiles indicated below – for my taste a bit too small.

Picture5
Figure 3: histogram with boxplot overlay.
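The two ingredients of such a plot, fine-grained bin counts and the quartiles drawn underneath, can be sketched in a few lines of Python. This is not @f2harrell's implementation. The function names are hypothetical, and the plain-text rendering is only there to keep the example free of plotting dependencies.

```python
import math
import statistics
from collections import Counter


def fine_histogram(values, binwidth):
    """Count observations per bin; keys are the lower edges of the bins.

    Values that sit exactly on a bin edge may fall on either side
    because of floating-point rounding.
    """
    counts = Counter(math.floor(value / binwidth) for value in values)
    return {index * binwidth: counts[index] for index in sorted(counts)}


def quartile_markers(values):
    """Return (Q1, median, Q3), the summaries shown below the histogram."""
    return tuple(statistics.quantiles(values, n=4, method="inclusive"))


def render(values, binwidth):
    """Render a horizontal text histogram with the quartiles underneath."""
    lines = [f"{edge:6.2f} | {'#' * count}"
             for edge, count in fine_histogram(values, binwidth).items()]
    q1, median, q3 = quartile_markers(values)
    lines.append(f"Q1={q1:.2f}  median={median:.2f}  Q3={q3:.2f}")
    return "\n".join(lines)
```

For example, `print(render(measurements, binwidth=0.1))` gives a quick look at a list of numbers (here the hypothetical `measurements`) before a proper figure is made.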

Violin and data

Of course, there are also mixed plots with violin plots. Violin plots themselves are most often already overlaid with a boxplot. Another possibility, shown by Wouter de Coster @wouter_decoster, is to mix the violin plot with a bee swarm plot, which he implemented with Python's seaborn (fig 4A). As you know, I personally would have preferred the actual data instead of the bee swarm, see above.
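When mixing a violin with data points it helps to remember what the violin shape encodes: a smoothed density estimate, mirrored around the category axis. The sketch below computes such an outline with a plain Gaussian kernel density estimate. It is not what seaborn does internally. The function names are hypothetical, the bandwidth (Scott's rule of thumb) is an assumption, and the data need at least two distinct values.

```python
import math
import statistics


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each grid position."""
    n = len(values)
    if bandwidth is None:
        # Scott's rule of thumb; an assumption, other choices are just as valid
        bandwidth = statistics.stdev(values) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values) / norm
        for g in grid
    ]


def violin_outline(values, max_half_width=0.4, n_grid=50):
    """Return (y, half_width) pairs; the violin spans centre +/- half_width."""
    lo, hi = min(values), max(values)
    pad = 0.1 * (hi - lo)  # extend the outline slightly beyond the data range
    step = (hi - lo + 2 * pad) / (n_grid - 1)
    grid = [lo - pad + i * step for i in range(n_grid)]
    density = gaussian_kde(values, grid)
    peak = max(density)
    # scale so that the widest part of the violin equals max_half_width
    return [(y, max_half_width * d / peak) for y, d in zip(grid, density)]
```

Drawing only one side of this outline and placing the raw data on the other side gives a half-violin style of mixed plot.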

Joey Burant @jbburant put forward the idea of mixing data points as a histogram with half of a violin plot, see figure 4B.

pick_3-01

Joey also nicely documented how to do this on GitHub:


## GOAL:
## re-create a figure similar to Fig. 2 in Wilson et al. (2018),
## Nature 554: 183-188. Available from:
## https://www.nature.com/articles/nature25479#s1
##
## combines a boxplot (or violin) with the raw data, by splitting each
## category location in two (box on left, raw data on right)

# initial set-up ----

## check the working directory
getwd()

## call required packages
## (mean_sdl() used below also needs the Hmisc package to be installed)
library(tidyverse)
library(ggthemes)

## load source code
## sourced from github "dgrtwo/geom_flat_violin.R"
devtools::source_gist("2a1bb0133ff568cbe28d",
                      filename = "geom_flat_violin.R")

## set plotting theme
theme_set(theme_few())

## import data
iris <- iris

# half violin plot with raw data ----

## create a violin plot of Sepal.Length per species
## using the custom function geom_flat_violin()
ggplot(data = iris,
       mapping = aes(x = Species,
                     y = Sepal.Length,
                     fill = Species)) +
  geom_flat_violin(scale = "count",
                   trim = FALSE) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               geom = "pointrange",
               position = position_nudge(0.05)) +
  geom_dotplot(binaxis = "y",
               dotsize = 0.5,
               stackdir = "down",
               binwidth = 0.1,
               position = position_nudge(0.025)) +
  theme(legend.position = "none") +
  labs(x = "Species",
       y = "Sepal length (cm)")

When the histo-violin is flipped horizontally, it looks like a raining cloud. Rogier Kievit @rogierK therefore named it the raincloud plot and has just deposited a preprint article about this plot type and its implementation. For matplotlib users, a guide was posted on GitHub:


import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# load the iris data and collect the sepal lengths per species
iris = sns.load_dataset('iris')
f, ax = plt.subplots(figsize=(10, 8))
species_list = np.unique(iris.species)
color_list = ['r', 'g', 'b']
data = [np.array(iris[iris.species == s].sepal_length) for s in species_list]

# half violins: clip each violin body to the right of its centre
v1 = ax.violinplot(data, vert=True, showextrema=False, points=100, bw_method=0.3)
for b, c in zip(v1['bodies'], color_list):
    m = np.mean(b.get_paths()[0].vertices[:, 0])
    b.get_paths()[0].vertices[:, 0] = np.clip(b.get_paths()[0].vertices[:, 0], m, np.inf)
    b.set_color(c)
    b.set_edgecolor('k')
    b.set_linewidth(2)
    b.set_alpha(0.6)

# raw data: jittered points to the left of each half violin
for n in range(len(species_list)):
    ax.plot(0.1 * np.random.uniform(size=len(data[n])) + n + 0.85, data[n],
            color_list[n], marker='.', ms=10, linestyle='', alpha=0.5)

# mean (dot) with standard deviation (error bar)
for n in range(len(species_list)):
    ax.errorbar(n + 1.03, np.mean(data[n]), yerr=np.std(data[n]), color='k')
    ax.plot(n + 1.03, np.mean(data[n]), 'k.', ms=20)

# axis limits, labels and ticks
fontsize = 24
ax.set_ylim([4, 8.5])
ax.set_ylabel('Sepal Length', fontsize=fontsize)
ax.set_yticks(range(4, 9))
ax.set_yticklabels(range(4, 9), fontsize=fontsize)
ax.set_xlim([0.5, 3.5])
ax.set_xlabel('Species', fontsize=fontsize)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(species_list, fontsize=fontsize)
plt.show()

In excel…

Jorge Camoes @wisevis shows us that such plot types are also possible in Excel: he shows a horizontal boxplot with data points above, taken from his book (fig 5). I generally like horizontal boxplots, especially when comparing lots of categories! The half-and-half plot was also re-created in Excel. Both are phenomenal, I had no idea Excel could do this much!

Picture8
Figure 5: Boxplot with data in excel

… and matlab

And finally, MATLAB users, rejoice: it is also possible to make mixed plots in your favorite environment. Matt Cooper @mattguycooper suggests using the ‘notboxplot’ function on the File Exchange, which creates ‘box plots’ with dot plots overlaid. This gives you plots as shown in figure 6:

Picture9
Figure 6: boxplots with data in matlab

More: Tutorials and interactive plots

Bogdan Micu @trizniak points us to a nice interactive violin plot: https://plot.ly/r/violin/.

A couple of tutorials: Frank Soboczenski @h21k shows us the code for making half-and-half boxplots in R: https://github.com/h21k/R/blob/master/snippets/half_box.R. James Rooney @jpkrooney pointed us to a great tutorial for making violin plots with ggplot2 by Katherine Wood @kathmwood: https://inattentionalcoffee.wordpress.com/2017/02/14/data-in-the-raw-violin-plots/. And @lisadebruine compares different plot types using the same data: https://debruine.github.io/plot_comparison.html.

 

 

 
