When reading about co-evolution of prey and predators, I stumbled across a cute new plot type: a half boxplot, half dot plot. Wilson et al. created this plot to simultaneously visualize both the summaries of their data (center, spread) and the actual data points. This allows us, the audience, to learn a lot about their results: that cheetahs are maybe binomially distributed and have outliers, or that zebras show a curious clustering.
Your quick guide to distribution plots
The half-and-half, aka dox-plot (a friend), led me to explore which visuals are commonly used for showing distributions.
To show how the raw data is distributed, we simply use dots or bars (as in barcode plots). When there are overlapping data points, we can use transparent fill colors or “jittering”, which distributes the data points in a given area, for increased clarity.
Summarizing the data: center and spread
Often, we are interested in summarizing statistics to judge and compare data. By convention the center (median) is indicated with a horizontal bar and the spread (variance, standard deviation) with vertical whiskers. Common plot types for this are the “star-wars rebel fighter”, the dot plot and boxplot. Bar plot used to be widely used, but are now banned by most journals for concealing most relevant information, so they are here only for completeness (see previous post). Very often these days I see boxplots that are overlaid with the data points – this works really well for up to 100 data points and is easy to implement with most software.
Show shape and unusual features
For normally distributed data, the center and the spread are highly informative. However, in life science we often have bimodal distributions, clusters, or gaps. Then boxplots become very insufficient and might even conceal interesting aspects (if not outright be misleading).
For faithfully showing distributions, histograms have a long history. Here, one has to be very careful with choosing bin sizes: too large or too small bins can greatly distort the histogram shape and result in a misleading chart. Choosing bin sizes is a science in itself, for details see wikipedia – but basically, it again depends on the data shape and sampling depth.
An alternative to histograms are density plots. They show only how the data are distributed, but in life science are not frequently used yet. They become very useful for large data sets that can’t be visualized as raw data points anymore, or where the eye would not be able comprehend the spread intuitively anymore. A rather recent but so far “happy marriage” is the violin plot. Violin plots are a fusion of the boxplot and its summary statistics, with the density/shape of the data (Hintze and Nelson, 1998 doi: 10.1080/00031305.1998.10480559).
Pro and cons of different plot types:
All figures as one: