A quick guide to better figures

A German version of this appears in Laborjournal 9/2019.

It took me around a year to get from data analysis to the final visualization of my postdoc results. A year in which I tried many visualization strategies and simplified them again and again. Helpful at that time were Edward Tufte's book "The Visual Display of Quantitative Information" and my pre-science background in applied art, which taught me about the effects of color and shape.

While visualizations have been used in science since antiquity, their number and complexity have increased significantly. Nevertheless, even today budding scientists rarely learn how to encode and effectively decode visualizations. This is evident in the numerous examples of hard-to-read figures and outright misleading charts. In the past years, I have taught data visualization to hundreds of scientists. A few tips and tricks to increase the readability and effectiveness of data figures are summarized here.

 

Planning

As always, the plan comes first. Who is my audience, what do I want to say, and what is my message? Only when these questions have been answered succinctly can you begin to implement the visualization. Otherwise you end up with charts with 50 categories, 12 colors, or the wrong diagram type altogether.

 

Testing

Testing a visualization is a must. Does the visualization work on paper? Is the electronic version readable? We should always look for external, objective reviewers. Often these are not the laboratory colleagues who already know our results and are used to our awkward visuals. Ideally, you should find several guinea pigs to get a feeling for how representative the answers are.

 

Choose diagram type

A diagram should represent data truthfully and functionally. It is therefore important to select the diagram type that suits the data type. A line chart shows a temporal progression, a column chart compares amounts across categories, and if the category names are long, a horizontal bar chart works better (Figure 1). Pie charts effectively show percentages for up to five categories of rather different sizes. They quickly become a disaster when used for more categories, in 3D, or when two pie charts are to be compared precisely.

To avoid being misleading, some basic visualization rules should be followed: omitting the zero baseline distorts relative size differences in column and bar charts (Figure 1C). In line charts, on the other hand, the zero line can be omitted because only trends are read; in fact, at times (link to twitter) a zero baseline in line charts can make the presentation very misleading. Caution is advised with charts for statistical distributions: the box plot is only suitable for normally distributed data, and in histograms the number of bins must be matched to the number of data points. Also important for readability: the independent variable, e.g. time, goes on the x-axis, the dependent variable on the y-axis (not as in Fig. 1D!).

Diagramm_types-01
A bar chart (A) focuses on absolute amounts and is not ideal for showing trends. Line charts (B) are better suited to show trends over time. Bar charts without a zero baseline distort the relative category sizes: Germany’s ERC successes are over-emphasized when the baseline starts at 10 (C). Conventions about choosing data for the x- and y-axes are important; when these are ignored, charts rapidly become incomprehensible (D).
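To make this concrete, here is a minimal sketch in R with ggplot2; the data frames, country names, and grant counts below are invented purely for illustration:

    library(ggplot2)

    # Hypothetical yearly counts, for illustration only
    trend <- data.frame(year = 2013:2018, grants = c(40, 75, 70, 68, 72, 80))

    # Temporal progression: line chart (points mark the actual observations)
    ggplot(trend, aes(year, grants)) + geom_line() + geom_point()

    # Categories with long names: horizontal bars, sorted by size for a simple overall form
    countries <- data.frame(
      country = c("Bosnia and Herzegovina", "United Kingdom", "Czech Republic"),
      grants  = c(2, 61, 8)
    )
    ggplot(countries, aes(x = grants, y = reorder(country, grants))) + geom_col()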

Label

Without text, each diagram is just an abstract piece of art. To be self-explanatory, images need a title, axis labels, and a legend. Nowadays, the title often simply sits above the diagram itself; otherwise it is found in the caption. Legends should be placed where they are needed, i.e. not tucked away in the caption but close to the data (see: Layout).

The following applies to all text elements: abbreviations should be avoided, with the exception of very, very common ones. Again, find colleagues to try out your favorite abbreviation. You’d be surprised how often an abbreviation turns out to be just lab jargon or really specific to your field of research. But your papers should not be! (Googling it also works: if the abbreviation is not found among the top entries, you probably should not use it.)

It is also important to avoid redundancies (especially in the axis labels!), filler text, and passive language.

 

Layout

In a fraction of a second we capture visual information and understand its content, color, and organization. The better this information is structured, the faster we understand it.

Most importantly, we usually read images like text, from top left to bottom right. That’s why posters have titles at the top and references at the bottom. Just like on posters, however, we can also increase the reading speed of individual charts: a meaningful title might be placed above a diagram (left-aligned is usually good!) and color codes should be explained right there. Scale information, data source, or sample size, on the other hand, can happily be placed at the bottom right.

Because we read visual information similarly to text, composite images should be organized in either rows or columns. Once such a reading direction is set, it should not be changed (best: keep the reading direction for all figures of the manuscript). To maintain an overall organized look, you are best advised to maintain the (imaginary) borders of the columns and rows; in layout design this is known as the underlying grid.

A wonderful resource for making visual information effective is the set of “Gestalt principles” described in the 1920s. They describe how we understand visual information. Helpful are: a simple overall form, for example ordering bars according to their size; symmetry in the layout; proximity of objects that belong together; and a similar appearance of objects that belong together, e.g. the same color code across the manuscript and easily recognizable symbols for the same category in scatterplots.

 

Colorize

Henry Dreyfuss described color as the exclamation mark of a design. He meant that color invariably provokes a reaction, attracts attention, and, like the punctuation mark, should be used last and strategically. In diagrams, color is used to create groupings (see Gestalt principles above), to convey quantities (e.g. red stands for high, blue for low temperatures), or to highlight part of the data (e.g. a red line among many gray lines, Fig. 2 before/after).

Color_choice-01
More than three to five color shades are confusing and hard to distinguish (top). Color is better used strategically for the core message (bottom), or alternatively to mark groups. Note: incorporating the legend into the title increases readability and saves space.

To make sure I do not overuse color, I start with greyscales. Sometimes a black bar among grey bars is already enough to focus the reader. If color is to be used, it has to match the data. Quantitative data with a common scale (e.g. animal age in days) should be shown with one hue in different intensities. Diverging data (temperature, gene expression above/below zero) are shown with two hues merging into each other. Only categorical data can be coded with different hues.
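As a hedged sketch of these three color mappings in ggplot2 (the data frame, values, and palette names below are illustrative choices, not part of the original figures):

    library(ggplot2)

    # Hypothetical data frame, for illustration only
    df <- data.frame(
      category = LETTERS[1:5],
      value    = c(3, 7, 2, 9, 5),          # quantitative, common scale
      change   = c(-2, 1.5, -0.5, 3, 0.8)   # diverges around zero
    )

    # Sequential: one hue in different intensities, for quantities on a common scale
    ggplot(df, aes(category, value, fill = value)) + geom_col() +
      scale_fill_distiller(palette = "Blues", direction = 1)

    # Diverging: two hues merging at a midpoint, for data above/below zero
    ggplot(df, aes(category, change, fill = change)) + geom_col() +
      scale_fill_gradient2(low = "steelblue", mid = "grey90", high = "firebrick", midpoint = 0)

    # Qualitative: distinct hues, for categorical data only
    ggplot(df, aes(category, value, fill = category)) + geom_col() +
      scale_fill_brewer(palette = "Dark2")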

Don’t forget: each color must be explained in the illustration or legend (see labeling) and should be used consistently within a figure, poster, or manuscript (color code).

In any case, colors should also be accessible to colorblind people. So please do not mix red and green tones in one image.

 

Improve

Once a first version of a figure is done, it should ideally be improved iteratively until the core statement stands out at first glance (1-second test!) and the figure is self-explanatory. As always, helpful colleagues are gold!

Effective and clear figures are helpful because they convey your message faster, but they also make your science accessible to a wider audience (which means more citations!) and to future readers (reproducibility!). And, if you are extremely successful, some figures even become icons of an entire field, like Darwin’s phylogenetic diagram, which is re-used again and again.

 

 

Further reading & training

Train and test

Catalogue of chart types: http://www.datavizcatalogue.com/

Train to label axes: mathbench.umd.edu/modules/visualization_graph/page05.htm

Why all data points should be shown: https://www.autodeskresearch.com/publications/samestats

Guess the correlation:  http://guessthecorrelation.com/

Online tools for making charts etc

Venn: http://bioinformatics.psb.ugent.be/webtools/Venn/ (never use for more than 3-4 categories. Alternative: UpSetR plots). Boxplot: http://shiny.chemgrid.org/boxplotr/. Michaelis-Menten: http://www.physiologyweb.com/calculators/michaelis_menten_equation_interactive_graph.html

Graphical Abstracts: https://biorender.com/

Colors

Color schemes: colorbrewer2.org

Test slides, figures and posters for color blind safety: ColorOracle

Study on color perception by XKCD: https://blog.xkcd.com/2010/05/03/color-survey-results/

Read on

Edward R. Tufte, "The Visual Display of Quantitative Information"

William S. Cleveland, "Visualizing Data"

Alberto Cairo, "The Truthful Art"

 

 

 


Hello world, I am your ugliest color!

PANTONE448C and color perception

Understanding abstract visualizations is an almost inherent human trait. Without much training, kindergarten kids get that a longer item means more: bigger kids are older, longer lollipops mean more candy, etc.

Perception of color is immediate, but individually very distinct: around 10% of the male population is red-green color blind, and seeing colors very much depends on individual photo-receptor composition. During a logo development, I was once very pleased with my beautiful blue-to-green gradient design. Alas, it turned out that the colleague who needed this logo was unable to even see the gradient! He literally thought I was making fun of him by showing him just a blue dot.

Gradient-01

Color and shape

Colors alone usually have no meaning. But they can strongly influence the meaning of a shape, and this meaning is a learned cultural standard: a red octagon symbolizes “STOP”, a red human shape indicates “FEMALE”, a red heart means “LOVE”, and a red cable marks the plus pole of an electric current.

SHAPES-01

Changing the color of a symbol can change its meaning: while a red octagon means STOP, a green octagon at airports inverts this and means “GO AHEAD”.

 

XKCD color survey: or what is the guys’ name for lavender?

A fun read about color perception is the XKCD color survey with >200,000 participants: https://blog.xkcd.com/2010/05/03/color-survey-results

Ugliest color

Despite our individual color perception, an Australian research project has identified the world’s ugliest color in a quest to find a repellent cigarette packaging. The unanimous lead spot was won by PANTONE448C, a murky mix of muddy green and brown. It proved so successful in discouraging smokers that it has since been rolled out in other countries too.

5___I_am_pantone448C_ugliestcolor-01

You are free to use it, but beware that it might signal something BAD! 🙂

Get an ERC starting grant / all about tables

Tables, the ancient gold of DataViz.

Tables are one of the most successful visualizations in the history of science. They existed long before charts were invented. Miescher reported the discovery of nucleic acids in a small table with just three rows. Janni Nüsslein-Volhard and Eric Wieschaus compiled their observations that gene expression controls embryogenesis in a table. And the seminal discovery that cells are enclosed by a lipid bilayer warranted a full-page table.

Tables are still common in scientific papers and presentations today. In a recent issue of Nature, 70% of life science manuscripts incorporated some form of table. Simple tables are often relegated to the supplemental section. Tables in the main section of a manuscript may also appear in a fancier format, such as heatmaps, HiC plots, or databases. This common use of tables explains the high interest in the subject among participants of my DataViz classes.

Data suitable for tables

When seeing a figure, we focus on its most prominent features: thick lines, the tallest bars, patterns in a scatter plot. If the figure is done well, this gives us instant insight into the main message. Tables display text and numbers in an organized form. When seeing a table, we approach it like text: we read from top left to bottom right. And this is the intention of a table: it presents a complete dataset without a punchy key message. Tables force the reader to come up with a conclusion and thus are more work. And tables quite simply require more space than any summary chart.

Despite these disadvantages, tables are very useful for reporting precise numbers and for precisely comparing numbers across rows and columns. Tables are also great for presenting large datasets in which every member of our audience is interested in a different aspect of the data. Each reader of the table on ERC starting grants will likely look up their country of residence first. If, however, I wanted to show which country gets the most ERCs, a bar chart works better (for categorical data, not statistical summaries). One can immediately spot the longest bar, and read it even faster when the data is sorted.

Table_vs_Chart_2-01
Same data presented in table (above) and in bar chart (below): each medium has a slightly different message and purpose.

 

Designing a clear table

Because tables are organized text, alignment and typography are critical for their legibility, and good legibility allows faster reading.

Organize rows and columns.

Which data is used as the key for presenting the result? This goes into the first column of every row. In our case these are the country names. All observations for a country (ERC grants, population, funding rate) get a column each.

Alignment is your friend.

Text is best left aligned, since that is where we start reading. Numbers are right aligned to make comparisons by digit along a column easier. The column headers are aligned with the content: the header for a text column is left aligned, the header for a number column right aligned.

Best practice for text and number alignment in tables: align text left, numbers right, and headers with the content.
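As a small sketch of these alignment rules in R, knitr::kable lets you set the alignment per column; the ERC-style table below uses made-up placeholder numbers:

    library(knitr)

    # Hypothetical ERC summary, for illustration only
    erc <- data.frame(
      Country    = c("Germany", "France", "United Kingdom"),
      Grants     = c(76, 52, 61),
      Population = c(83.2, 67.8, 67.0)   # in millions
    )

    # "l" left-aligns the text column, "r" right-aligns the number columns;
    # kable aligns each header with its column content
    kable(erc, align = c("l", "r", "r"))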
Font.

Choosing a legible font is always a good idea, but it is really critical for numbers. To be legible, all numbers should have the same height (lining figures, as opposed to old-style figures) and the same width (tabular or monospaced figures). The same width is important for comparing digits within columns; the same height (so no ascenders and descenders in numbers) looks less cluttered overall.

It is not possible to make an easy recommendation for a specific font, since fonts are modified depending on the operating system and program. For example, Arial has proportionally spaced characters, which would not work for numbers in a table; yet Microsoft Word on Mac renders its numbers with tabular spacing, while Adobe Illustrator on Mac does not.

Different fonts and their usability in tables: old-style and proportional fonts are not good for presenting numbers.
Font style influences how well we can compare numbers in tables.
Decimals.

To best compare numbers in a column, they should have the same number of decimal places. This helps both the alignment and the understanding. In general, it is always worth considering whether the decimals are even important: the number of grants does not warrant decimals because it can only be a whole number. The same goes for numbers of people, although populations in millions might reasonably be presented with one decimal.

Showing decimal numbers in tables: make sure numbers within a column have consistent decimals, and consider whether decimals are necessary at all.
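Continuing the hypothetical kable sketch from above, decimals can be made consistent per column before printing: counts stay whole numbers, populations in millions get one decimal.

    # Whole numbers for counts, one consistent decimal for millions of people
    erc$Grants     <- round(erc$Grants, 0)
    erc$Population <- formatC(erc$Population, format = "f", digits = 1)
    kable(erc, align = c("l", "r", "r"))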

 

Going pro

With the above rules, you usually end up with an organized and legible table. But the good news is, you can do more! Removing the left and right borders creates extra space and lets you use the entire width of a table cell for the content, which is sometimes rather important in grant applications!

You can do still more. The minimalist Edward Tufte even says ‘every pixel should have a meaning’. In this spirit, removing all gridlines often works just as well. The well-aligned content is often sufficient to guide the eye through columns and rows.

Gridlines-01
Gridlines in tables. Try leaving out vertical gridlines, or try without any gridlines. Organized text is often sufficient to guide the eye.

Very long tables, however, are hard to read without any guides. Two options exist: either the content is grouped into blocks of 5 or 10 rows, which are then visually separated from each other by white space, or the table includes a horizontal gridline every 5 or 10 rows.

LongTables-01
Horizontal gridlines every 5 rows make long tables more readable. Alternatively, white space can be used to group rows.

 

Color

Last, and only last, think about color (black and grey are colors too!). I personally like highlighting table headers by giving them a fill color. Usually a light grey works fine, but this very much depends on the overall presentation and the purpose; when projected, some light greys no longer work. Also think of your color code (and never mix red and green). I consistently used pink here, so using the same color for the header would give it a coordinated look. If your fill color is dark, you have to switch to white labels, and these in turn require a larger font to achieve equal legibility.

Header-01

 

Glossary

Heatmap. A heatmap is a matrix/table in which the cells are shaded according to a color scale representing an observed value. Heatmaps are particularly used for many-to-many comparisons. They can display very large matrices at a very small scale and allow us to rapidly compare numbers and even see coherent patterns in the data. Heatmaps may also be combined with clustering algorithms (or simple sorting by value), which facilitates seeing patterns in the data. Heatmaps are not useful for reading off precise numbers.
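A minimal ggplot2 sketch of such a heatmap, using randomly generated values for illustration only:

    library(ggplot2)

    # Hypothetical many-to-many comparison matrix, for illustration only
    set.seed(42)
    m <- expand.grid(sample = paste0("S", 1:6), gene = paste0("gene", 1:10))
    m$value <- rnorm(nrow(m))

    # One tile per matrix cell, shaded by value (diverging scale around 0)
    ggplot(m, aes(sample, gene, fill = value)) +
      geom_tile() +
      scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0)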

Microarrays. A heatmap that informs about gene expression levels across samples. Gene expression is shown as relative expression compared to a reference state. Up-regulation is typically shown in red, down-regulation in green (microarray heatmaps are thus not color-blind safe).

HiC plot. HiC plots show heat maps where each pixel represents counts for DNA interactions between two genomic regions. The pixel intensity indicates the number of reads (one color scale) or the divergence of reads from a control (dual color scale). The axes each show the genomic regions that are compared, usually binned to e.g. 1Mb.

Database: online formats of tables to present a large dataset.

Table-chart hybrid: A table with several observations in columns and one of the observations presented as a small chart (dot plot, box plot, bar plot) adjacent to the respective row. The chart column usually highlights a particularly important observation.

 

* and Herzegovina

Happy Birthday VIZBI

Note: This was published as guest blog at EMBL events

Ten years after it all started, VIZBI came back to its original stomping grounds, the ATC at EMBL in Heidelberg. As its name suggests, VIZBI (“Visualizing Biological Data”) is a blend of several worlds: of biology, with its long history of visualization that goes back to ancient Greek textbooks, and of art and scientific illustration.

VizBi-01
Venn diagram of VIZBI disciplines: microscopy and EM data, transcriptomics and computer science. (Note: a 5-circle Venn cannot show all possible overlaps, which is fully intended here)

 

VIZBI is also inseparable from computer science and its tools to transform big data into human readable entities. And finally, VIZBI incorporates concepts of design and visual perception to make visualizations engaging and enlightening.

Highlighting spectacular biological images

At VIZBI 2010, microscopic images were omnipresent. Back then, I was embarking on my postdoc project, a large-scale microscopy screen of RNAs in cells. My memory tells me that this was the main focus of the conference. Indeed, a quick check of the 2010 program confirms that almost the entire community of light-sheet microscopy and image processing was in attendance at the first-ever event.

VIZBI 2019 continued to highlight spectacular biological images. A phenomenal augmented reality installation showed them in 3D, EM-tomography simulations by Peijun Zhang animated the 64 million atoms assembling into HIV particles, and Lucy Collinson shared the vast amounts of high-resolution EM data collected at the Francis Crick Institute. These large data volumes are annotated with the help of amateurs, for example in the citizen science project “Etch a Cell” on the Zooniverse.

Colourful confocal images and images of tissues also provided the inspiration for many of the illustrators’ works on display that combined science and art, for example the double win of best poster and best art for a depiction of tubulin in a mitotic spindle by Beata Mierzwa @beatascienceart, a hugely talented artist and scientist (who also sells cool cytoskeleton-printed leggings and mini-brain organoid dresses).

Data visualization

At VIZBI 2019, visualizations of data, as opposed to images, gained a much more prominent spot. All keynote speakers were from the technology side. Hadley Wickham presented the history of ggplot2. Ggplot2 (and yes, there once was a ggplot1!) is the R universe for visualizing pretty much everything that comes in numbers and is now merged into the tidyverse. Fittingly for a visualization talk, all slides were themselves beautiful; I loved the tidyverse playfully represented as stars of our universe! The second keynote was by Janet Iwasa, who presented her animation work that relies heavily on the 3D and computer graphics software used for animation films. Instead of earning her money in the film industry, she decided to put her skills to good use for biology. Janet first used them in her PhD project to visualize motor proteins “walking” along the cytoskeleton, and these days produces Oscar®-worthy movies showing biology, such as the origin of life or the life cycle of HIV. And everyone take note: all her films start as a storyboard on paper, which is what I teach as good practice for all visualization designs.

Making the invisible visible

The third keynote was by Moritz Stefaner, a data designer who is enticed by biological data but appalled by the time-scales in biological projects (too long!). Luckily, he hasn’t given up on us just yet, and keeps producing phenomenal visualizations. For example, showing absence and loss is notoriously hard, but Moritz found a beautiful way to make the invisible visible in his designs for “Where the wild bees are” with Ferris Jabr for Scientific American.

Moritz_vizBiPicture1
Making absence visible, a project by Keynote speaker Moritz Stefaner. Photo: H.Jambor

Moritz left us hungry for more when he also showed his data-cuisine project, which visualizes data about food and turns food into data: the number of berries picked in Finland becomes a layered dessert, and common causes of death are encoded as praline fillings; you never know which one you’ll get! (Luckily this was with Belgian pralines, so all deaths are sweet.)

Feedback wanted!

Visualizations of data were in the spotlight of many other projects too. This is of course owed to the many large-scale methods that have swamped biology with data in recent years: RNA-seq, inexpensive genome sequencing, mass spec at fantastic scales, robotics-driven biochemistry and medicine, image processing that turns images into insights by quantifying signals, and so on. Sequencing, for example, fuelled Susan Clark’s project tracing methylations in cancer, Philippe Collas’ ambitious endeavour to understand 3D genome architecture, and the project of Kirsten Bos tracing human pathogens back thousands of years by sequencing tiny dental samples; Charlotte Soneson’s “iSEE” software empowers the interactive analysis of data from such high-throughput experiments. And of course, one of the biggest data projects in biology is the ENSEMBL genome browser, which was officially released as a pre-alpha version at VIZBI (check it out: 2020.ensembl.org); the very approachable Andy Yates and his team are looking for feedback!

Technical Challenges

Visualizations of high-dimensional datasets are not without problems. The technical challenges were addressed by David Sehnal, who showed computational infrastructure to visualize protein structures (MolStar). The mathematical problems of dimensionality reduction were a topic of Wolfgang Huber’s talk, and a tool to visualize, and thereby find(!), batch effects, “proBatch”, was presented in the flash talk by Jelena Čuklina (they welcome beta-testing by users!). Teaching science visualization, I often see a great need to discuss ethical and practical aspects. Critically assessing limitations and challenges of scientific visualizations might be a topic to be expanded in the future, when VIZBI enters its second decade. This should be coupled with visual perception research; after all, we are no longer limited by computational power, but rather by what our eyes and brains can comprehend (see Miller 1956).

Flash talks

art_and_bio_sub_for_upload
“Data dancing” © Alex Diaz

 

Speaking of flash talks: the conference organisers did such a great job in highlighting every single one (!) of the posters by one-minute talks. I tremendously enjoyed them, admittedly in part because I have a short attention span. Among the talks and art was also “Data dancing” by Alex Diaz. He showed that art and beauty can also be found in statistics and numbers blossoming like flowers across the page. On that note: see you next year in San Francisco!

 

P.S. There were many more highlights that I was unable to cover here. Check https://vizbi.org/2019/ for all posters and slides of the flash talks, check #VIZBI on Twitter, and see my public collection of participants’ Twitter handles (https://twitter.com/helenajambor/lists/vizbi2019).

 

 

Non-zero baselines: the good, the bad, and the ugly

Or: who gets the most ERC funding?

Of all the charts being ridiculed at WTFviz, many get shamed for their lack of a zero baseline. When teaching DataViz, zero baselines are invariably a topic of debate, even in the quietest groups. To participants, the rules for when zero is necessary to understand the data, and when it may happily be omitted, are often unclear. Therefore, let’s quickly recap.

Bar charts: always show zero

When amounts are encoded by length, as done in bar charts, the zero baseline is critical to our intuitive understanding of the data. A bar twice as long represents a category with twice the count. The number of prestigious ERC starting grants to German host institutes roughly doubled from 2013 to 2014, correctly shown by a bar twice as long in (A).

If, however, the y-axis does not start at zero, as in (B), the increase from 2013 to 2014 is hugely over-emphasized and looks roughly 4 to 5 times as high. In example (C) the baseline starts above the first data point and misleads the audience into thinking that only Germany received ERC funding in 2018.

Bar_varying_yaxis_2-01

Non-zero baselines skew the relative difference between categories and are misleading. (The same applies to axis-breaks in bar charts!). Non-zero baselines are often used to save space.

In most cases however, the chart could simply be shown with less overall height. This option maintains the relative bar sizes faithfully. When reading bar charts we are always interested in relative, not absolute size differences among our categories. (And I learned that Israel is part of the ERC funding consortium!)

Bar_varying_yaxisHEIGHT_2-01
Number of submitted ERC grants varies a lot across countries. Varying the physical height of the plot faithfully maintains the relative differences. 
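A hedged ggplot2 sketch of both options, with invented grant counts: bar charts keep their zero baseline by default, and if space is tight the fix is to export the figure with less height rather than to crop the axis.

    library(ggplot2)

    # Hypothetical ERC grant counts per year, for illustration only
    grants <- data.frame(year = 2013:2018, n = c(38, 75, 70, 68, 72, 80))

    # Correct: geom_col() starts the bars at zero by default
    p <- ggplot(grants, aes(factor(year), n)) + geom_col()

    # Misleading: cropping the y-axis at 10 exaggerates the differences between bars
    p_bad <- p + coord_cartesian(ylim = c(10, 85))

    # Better: keep the full bars and simply save the figure with less height
    ggsave("erc_bars.png", p, width = 12, height = 4, units = "cm")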

Line charts are happy without zero

The situation is entirely different for line charts. We use them to show trends, e.g. an increase or decrease in a category over time. The rate of change is encoded by the slope of the line relative to the horizon. We usually evaluate the slope independently of its distance to zero. For example, seeing the zero is not important for assessing that ERC successes in Germany fluctuate, while the UK and France have stable funding rates. And, no matter where the zero baseline is: why does the UK have such a curious funding peak in 2012, what happened there!?

Line_varying-yaxis_2-01
For understanding trends in line charts, we do not critically need to see the zero baseline.

Sometimes showing zero is misleading

Importantly, showing the zero baseline in line charts may itself be misleading. For example, mapping human body temperature on a scale from 0 to 100˚C would effectively prevent us from seeing a life-threatening increase from 39 to 40˚C in a patient. Similarly, showing global temperatures on a scale from 0 to 120˚C results in an entirely flat line, and this was used by opponents of climate research to hide man-made global temperature changes; an outcry on Twitter swiftly followed.

ClimateChange_noChange
Line chart misleading BECAUSE of a zero-baseline. Tweet: @EcoSenseNow, 23rd April 2019

Distributions: it depends on the data

When showing statistical summaries, again the zero usually does not need to be visible. We are interested in the shape of the data (normal or bimodal), its median, and outliers. How far the majority of data points are from zero is usually not of interest as long as all data is shown. Instead, the relative distances of individual data points from each other are key.

Good practice for non-zero baselines

When using non-zero baselines, the common practice is to unlink the x- and y-axes so that the gap alerts readers to the cut. For educational purposes, I also cut data from the right example. This is dangerous territory and in some cases may mislead the audience: in this example, I effectively hide the early lead of the UK in winning ERCs!

Line_varying-yaxis_size_2-01
One possibility to alert readers to a non-zero baseline in your charts.

Data European Research Council, https://erc.europa.eu/projects-figures/statistics, starting grants from 2007-2018.

 

AI CARE for you

– a new AI-based tool for microscopy

Sharp images need a lot of light or a long exposure time. But too much light damages cells, which are also always in motion. In microscopy, images are therefore often taken with short exposure times and low laser power, and are consequently blurred and noisy. Computer scientists at the Center for Systems Biology Dresden / Max Planck Institute of Molecular Cell Biology and Genetics developed software that avoids this problem. The artificial-intelligence-based software CARE (content-aware image restoration) can calculate razor-sharp microscopy data from noisy images. This enables biologists to obtain high-resolution images and films without exposing the cells to the risk of light toxicity.

split_tribolium2.png
Before and after CARE: images are effectively de-noised with CARE. (c) M.Weigert et al, 2018

Big Data Problems in microscopy

The researchers involved had been focusing on big data in microscopy for some time. In vivo and light-sheet microscopy quickly amass giga- or terabytes of data. These data need de-noising and de-convolution before analysis, a time-consuming and computationally intensive step. Recently, computer scientists have more and more used machine learning with neural networks to solve complex tasks. In machine learning, computers are fed data and learn to solve a specific problem. This approach is behind the recent successes of computers in chess and Go, and also powers the translation tool DeepL. It was thus not far-fetched to also use machine learning for de-convolving bioimage data.

Image de-convolution thanks to AI

The first author of the study, Martin Weigert, says: “It was already known that machine learning for 2D images actually delivers very good results. What we had to do was transfer this to biological data.” Martin tried it out just before heading home for Christmas in 2016. The results were so overwhelming that he initially assumed he had made a mistake. Martin Weigert: “I called Loïc [Royer, another author] over and we discussed how this is impossibly good, this can’t be right, there must be some mistake!”

It wasn’t a mistake. Martin Weigert was finally convinced when he saw pictures of fruit flies. In the noisy original image, no structures were visible to the eye. But after a machine learning network was trained, it calculated immaculate images showing detailed distributions of protein in the membrane of fruit fly wings.

To really adapt machine learning for biological images, the research team had to overcome several challenges. Microscopic images are often three- or four-dimensional, biological materials have different densities, microscopes have certain point-spread-functions and cameras have certain sensitivities. But eventually, CARE was born.

IMG_2108.jpeg
Some members of the CARE team: Florian Jug, Akanksha Jain, Martin Weigert. (c) H. Jambor

Happy embryos thanks to AI

The new machine learning tool CARE not only accelerates de-convolution but also has other advantages. Besides large data sets, photo-toxicity is an ever-present problem of in vivo microscopy: when embryos are imaged for days, they may die while being observed.

Akanksha Jain is interested in the early development of the Tribolium beetle and says, “Even if my beetles don’t die, I am concerned that the laser power I need for good images will cause artefacts.” For Akanksha, a postdoc in Pavel Tomancak’s laboratory, this problem is solved with CARE: “Once the network is trained, I can image at only 0.1% laser power, you literally see nothing, and then reconstruct the images with CARE. It’s mind-blowing how well that works.” So far, CARE has been convincing in every tissue and model organism tested. Martin Weigert and his colleagues have already used it with mouse liver, zebrafish retina, fruit fly embryos, flatworms, Drosophila wings and Akanksha’s beetles.

How does CARE work?

To train a CARE network you need only a few pictures. These images are specific to a microscope set-up, tissue, and fluorescent marker, and the image pairs must be acquired at low and high laser intensity. CARE then calculates the correction values from the typical deviations.

CARE in application

In practice, the training set of images could be taken at the beginning and the end of an experiment, or also much later. And, once trained, a CARE network can be re-used indefinitely for future (and in fact also for past) experiments.

The demands on the microscope are also low: CARE can be used for any microscope type. The only requirement is that users may vary the laser intensity for the training images. Akanksha Jain confirms this: “CARE does not change the type of experiments I do, but it increases the possibilities of how I can capture images: it becomes faster and more robust”.

Florian Jug also sees the use of CARE for biologists as unproblematic. In his opinion, the biggest difficulty is “that people can’t install it on a computer.” For this, the team has taken precautions: the publication is accompanied by a detailed online documentation explaining step by step how CARE works with Windows or Linux.

Careful with CARE?

An important question is always whether software can introduce artefacts into the data. After Martin Weigert got CARE up and running in just a few days, it took the team about a year to be fully convinced that it works error-free. “We had to find out: can I trust the network?” says Florian Jug. The team first showed that independently trained CARE networks deliver comparable results, similar to two people solving the same math problem via different paths. The team also checked that the variance is not visible at the single-pixel level, demonstrating that CARE does not insert or change any data, but merely sharpens what is already present.

CARE thus allows biologists to get better images without having to invest in better microscopes. Martin Weigert is finishing his doctoral thesis, but the next innovations are already in the pipeline. A method to calculate optimized images without reference images is in the planning stage.

Published in Nature Methods, December 2018 and at BioRxiv

This is an English version of an article that also appeared in Laborjournal 3/2019.

Pick’n’mix plots

Follow-up to: Showing distributions

 

When I wrote about the half-and-half plot, many of you replied with further discussion points, tips, and tutorials. I have collected them here to make them available to everyone.

More mixed boxplots

Aaron Ellison @AMaxEll17 brought to our attention that he published a plot in 1993 in which he overlaid the box plot with the data points (see fig 1A). Along with it he published the code, long before GitHub and the like. Aaron was inspired by Wilkinson’s “Grammar of Graphics”. He seems to be the first person to have published such a plot in a paper.

Today, boxplot/data overlays are common and easy to make in R with ggplot2 (a minimal sketch follows Figure 1). Declan O’Regan @DrDeclanORegan shows us one example in figure 1B. An “exploded” version, where the boxplot and its metrics are barely visible and the focus is on the data points, is shown in figure 1C (provided by the Cystic Fibrosis Gene Therapy group @CFGT_Edinburgh).

3 variations of boxplot mixed with data plots.
Figure 1: Boxplot mixed with data plots.
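For readers who want to try this themselves, here is a minimal ggplot2 sketch of a boxplot with the raw data overlaid; the two groups and their values are simulated purely for illustration:

    library(ggplot2)

    # Simulated measurements in two groups, for illustration only
    set.seed(1)
    dat <- data.frame(
      group = rep(c("anterior", "posterior"), each = 40),
      value = c(rnorm(40, mean = 5, sd = 1), rnorm(40, mean = 8, sd = 2))
    )

    # Boxplot with jittered raw data points on top
    ggplot(dat, aes(group, value)) +
      geom_boxplot(outlier.shape = NA) +   # hide boxplot outliers; they re-appear as raw points
      geom_jitter(width = 0.15, alpha = 0.5)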

Box’n’Bee

There is also the overlay of a boxplot with a bee-swarm plot. Here, individual data points are ordered and arranged in a U-shape instead of being randomly placed. An example is shown by Darren Wisniewski @Dmwizzle, who made this in ggplot2 (fig 2A).

But beware of the bee swarm: the ordered arrangement of the data (U- or A-shapes are most common) may introduce visual artifacts. And, personally, I draw a mental line through the U-shaped branches and straighten it to understand the data. This is error-prone and of course a waste of time when the line could equally be straight. In figure 2B I have plotted the same data as a bee plot and as a dot plot for a direct comparison. I feel it is easier to see how the data is distributed in the data/dot plot. (Data: gene expression of RNAs that are localized at the poles of the fruit fly oocyte. RNAs that localize at the posterior for days have higher expression than RNAs at the anterior pole, which are localized for just a few hours.)

Pick_fig2-01.png
Figure 2: A. Boxplot and bee swarm plot. B. Comparison of bee versus data/dot plot.

Histogram & boxplot

Robert Grant @robertstats pointed us to an interesting histogram overlaid with statistical summaries that was originally designed by @f2harrell (here is a link to a tutorial with R), see figure 3. The horizontal histogram shown below has particularly small bins, with the median and quartiles indicated underneath; for my taste the bins are a bit too small.

Picture5
Figure 3: histogram with boxplot overlay.

Violin and data

Of course, there are also mixed plots involving violin plots. Violin plots themselves are most often already overlaid with a boxplot. Another possibility, by Wouter de Coster @wouter_decoster, is to mix the violin plot with a bee-swarm plot, which he implemented with python seaborn (fig 4A). As you know, I personally would have preferred the actual data instead of the bee swarm, see above.
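A short ggplot2 sketch of this idea, using jittered raw data points instead of a bee swarm (my preference, as noted above); it re-uses the simulated data frame from the boxplot sketch in the previous section:

    # Violin plot with the raw data overlaid (re-using the simulated 'dat' from the boxplot sketch)
    ggplot(dat, aes(group, value)) +
      geom_violin() +
      geom_jitter(width = 0.1, alpha = 0.5)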

Joey Burant @jbburant put forward the idea of mixing data points shown as a histogram with half of a violin plot, see figure 4B.
pick_3-01

Joey also nicely documented how to do this on GitHub.

When the histo-violin is flipped horizontally, it looks like a raining cloud; Rogier Kievit @rogierK therefore named it the raincloud plot and has just deposited a preprint article about this plot type and its implementation. For matplotlib users, a guide was posted on GitHub.

In excel…

Jorge Camoes @wisevis shows us that such plot types can also be made in Excel: a horizontal boxplot with the data points above it, taken from his book (fig 5). I generally like horizontal boxplots, especially when comparing lots of categories! The half-and-half plot has also been re-created in Excel. Both are phenomenal, I had no idea Excel could do this much!

Picture8
Figure 5: Boxplot with data in excel

… and matlab

And finally, Matlab users rejoice: it is also possible to make mixed plots in your favorite environment. Matt Cooper @mattguycooper suggests using the ‘notboxplot’ function from the File Exchange, which creates box plots with dot plots overlaid and gives you plots as shown in figure 6:

Picture9
Figure 6: boxplots with data in matlab

More: Tutorials and interactive plots

Bogdan Micu @trizniak points us to a nice interactive violin plot: https://plot.ly/r/violin/.

A couple of tutorials: Frank Soboczenski @h21k shows us the code for making half-and-half boxplots in R: https://github.com/h21k/R/blob/master/snippets/half_box.R, James Rooney @jpkrooney pointed us to a great tutorial for making violin plots with ggplot2 by Katherine Wood @kathmwood: https://inattentionalcoffee.wordpress.com/2017/02/14/data-in-the-raw-violin-plots/, and @lisadebruine compares how different plot types display the same data: https://debruine.github.io/plot_comparison.html.