Showing distributions

When reading about the co-evolution of prey and predators, I stumbled across a cute new plot type: a half boxplot, half dot plot. Wilson et al. created this plot to simultaneously visualize both the summaries of their data (center, spread) and the actual data points. This allows us, the audience, to learn a lot about their results: that the cheetah data are perhaps bimodally distributed and have outliers, or that the zebra data show a curious clustering.

Wilson_redraw
Half boxplot, half dot plot, re-drawn from Wilson et al. 2018

Your quick guide to distribution plots

The half-and-half plot, aka “dox-plot” (a friend's name for it), led me to explore which visuals are commonly used for showing distributions.

Raw data

To show how the raw data are distributed, we simply use dots or bars (as in barcode plots). When data points overlap, we can use transparent fill colors or “jittering”, which spreads the data points randomly within a small area so that they no longer hide each other.
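To make the jittering idea concrete, here is a minimal sketch in plain Python. The function name `jitter`, the `width` default and the example values are my own assumptions for illustration, not taken from any particular plotting package. It only computes the coordinates; the measured values stay untouched and only the horizontal position gets a small random offset.

```python
import random

def jitter(values, group_position, width=0.4, seed=0):
    """Return (x, y) pairs: y is the measured value, x is the group
    position plus a small random horizontal offset."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    half = width / 2
    return [(group_position + rng.uniform(-half, half), y) for y in values]

# Hypothetical measurements with ties that would overlap without jitter
speeds = [12.0, 12.0, 12.0, 13.5, 13.5, 15.0]
points = jitter(speeds, group_position=1.0)

for x, y in points:
    print(f"x = {x:.3f}, y = {y}")
```

The resulting pairs can be handed to the scatter function of whatever plotting software you use. Transparency is the alternative: keep all dots at the same x position and set a low opacity so that stacked points appear darker.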

Summarizing the data: center and spread

Often, we are interested in summary statistics to judge and compare data. By convention, the center (median) is indicated with a horizontal bar and the spread (variance, standard deviation) with vertical whiskers. Common plot types for this are the “Star Wars rebel fighter”, the dot plot and the boxplot. Bar plots used to be widely used, but many journals now discourage them because they conceal most of the relevant information, so they are included here only for completeness (see previous post). These days I very often see boxplots overlaid with the data points; this works really well for up to 100 data points and is easy to implement with most software.
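As a rough sketch of the numbers behind such a summary plot, the snippet below computes the median, the quartiles, whisker ends and outliers with Python's standard `statistics` module. It assumes the common Tukey convention for boxplots (whiskers reach to the most extreme points within 1.5 times the interquartile range), which is only one of several conventions in use. The function name `box_summary` and the example data are made up for illustration.

```python
import statistics

def box_summary(values):
    """Summary statistics as drawn in a Tukey-style boxplot (an assumption;
    other whisker conventions exist)."""
    data = sorted(values)
    # Quartiles; "inclusive" treats the data as the full population
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # most extreme point inside the fences
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "stdev": statistics.stdev(data),  # sample standard deviation
    }

# Hypothetical data with one extreme value
summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

In this example the single extreme value ends up in `outliers` and barely moves the median, while it inflates the standard deviation considerably. That is one reason why the median and quartiles are often preferred for skewed data.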

Part_2_FocusCenter

Show shape and unusual features

For normally distributed data, the center and the spread are highly informative. However, in life science we often have bimodal distributions, clusters, or gaps. In those cases boxplots are insufficient and can conceal interesting aspects, or even be outright misleading.

For faithfully showing distributions, histograms have a long history. Here, one has to be very careful when choosing bin sizes: bins that are too large or too small can greatly distort the shape of the histogram and result in a misleading chart. Choosing bin sizes is a science in itself (for details see Wikipedia), but basically it again depends on the shape of the data and the sampling depth.
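The effect of the bin size is easy to demonstrate with a few lines of plain Python. This is only a sketch with made-up data (two clusters separated by a gap) and a hypothetical helper `histogram_counts`; it assumes all values are at or above the `start` of the first bin.

```python
def histogram_counts(values, bin_width, start=0.0):
    """Count values per bin; bin i covers
    [start + i * bin_width, start + (i + 1) * bin_width).
    Assumes every value is >= start."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

# Hypothetical data: two clusters with a gap in between
values = [1.0, 1.1, 1.2, 1.3, 1.4, 3.0, 3.1, 3.2, 3.3, 3.4]

narrow = histogram_counts(values, bin_width=1.0)
wide = histogram_counts(values, bin_width=4.0)

print("bin width 1:", narrow)
print("bin width 4:", wide)
```

With a bin width of 1 the two clusters and the empty bin between them are visible. With a bin width of 4 everything falls into a single bar and the gap disappears, even though the data are identical.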

An alternative to histograms are density plots. They show only how the data are distributed, but are not yet frequently used in life science. They become very useful for large data sets that can no longer be visualized as raw data points, or where the eye would no longer be able to comprehend the spread intuitively. A rather recent but so far “happy marriage” is the violin plot. Violin plots are a fusion of the boxplot and its summary statistics with the density/shape of the data (Hintze and Nelson, 1998, doi: 10.1080/00031305.1998.10480559).
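For readers who wonder where the smooth curve of a density plot comes from, here is a minimal sketch of a kernel density estimate in plain Python. It assumes a Gaussian kernel and a hand-picked bandwidth of 0.3; both are choices made for this illustration, and plotting software usually picks the bandwidth automatically. The function name `gaussian_kde` and the data are hypothetical.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x by placing a
    Gaussian bump of the given bandwidth on every data point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical data: two clusters with a gap in between
values = [1.0, 1.1, 1.2, 1.3, 1.4, 3.0, 3.1, 3.2, 3.3, 3.4]
density = gaussian_kde(values, bandwidth=0.3)

# Evaluate the curve on a grid of x positions
for x in [0.0, 1.2, 2.2, 3.2, 4.5]:
    print(f"x = {x}: density = {density(x):.3f}")
```

The bandwidth plays a role similar to the bin size of a histogram: a very large value smooths the two clusters into one hump, a very small value produces a spiky curve. A violin plot draws this kind of curve mirrored on both sides of a central axis.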

Part_3_FocusShape

 

Pros and cons of different plot types:

Table_procon.png

All figures as one:

geneology of distribution charts_5-01

 

9 thoughts on “Showing distributions”

    1. Hi Dorian,
      bee-swarm plots are basically the same as plotting the original data. However, the original data are plotted in a misleading manner: the data points are not aligned horizontally as they should be, so visually the dots at the distal ends appear to have a higher y-value than they actually have! Here is a paper about this problem written by Leland Wilkinson: http://moderngraphics11.pbworks.com/f/wilkinson_1999.DotPlots.pdf
      In summary: bee-swarm plots are better avoided!
      Best,
      Helena

      1. Imho that depends on the rendering. In some (more classical) implementations the horizontal alignment is indeed not correct, and those should be avoided. However, in others like https://github.com/eclarke/ggbeeswarm#geom_quasirandom alignment is preserved. When used correctly I also like beeswarms to complement box or violin plots as they show the raw data but prevent overplotting (without alpha or jittering) by mimicking the shape of the distribution.

  1. Really useful, and a great summary! Out of interest, what software programs would you recommend for generating these kinds of charts (both licensed and freeware)? Any advice much appreciated…

    1. Hi Brooke, I personally only use R for such plots. Excel might work for most of them too; I have heard it is also capable of doing boxplots, but I have not found out yet. For distributions, I am not aware that they work with Excel at all. Greetings!
