helena * jambor

scientist interested in RNA, genomics and science visualizations

AI CARE for you

– a new AI-based tool for microscopy

Sharp images need a lot of light or a long exposure time. But too much light damages cells, which are also always in motion. In microscopy, images are therefore often taken with short exposure times and low laser power, and as a result are blurred and noisy. Computer scientists at the Center for Systems Biology Dresden and the Max Planck Institute of Molecular Cell Biology and Genetics in Dresden developed software that avoids this problem. The artificial intelligence-based software CARE (content-aware image restoration) can calculate razor-sharp microscopy data from noisy images. This enables biologists to obtain high-resolution images and films without exposing the cells to the risk of light toxicity.

split_tribolium2.png

Before and after CARE: images are effectively de-noised with CARE. (c) M.Weigert et al, 2018

Big Data Problems in microscopy

The researchers involved had been focusing on Big Data in microscopy for some time. In vivo and light sheet microscopy quickly amass giga- or terabytes of data. These data need de-noising and de-convolution before analysis, a time-consuming and computationally intensive step. Recently, computer scientists have increasingly used machine learning with neural networks to solve complex tasks. In machine learning, computers are fed data and learn to solve a specific problem. This approach is behind the recent successes of computers in chess and Go, and also powers the translation tool DeepL. It was thus not far-fetched to also use machine learning for the de-convolution of bioimage data.

Image de-convolution thanks to AI

The first author of the study, Martin Weigert, says: “It was already known that machine learning for 2D images actually delivers very good results. What we had to do was transfer this to biological data.” Martin tried it out just before heading home for Christmas in 2016. The results were so overwhelming that he initially assumed he had made a mistake. Martin Weigert: “I called Loïc [Royer, another author] over and we discussed how this is impossibly good, this can’t be right, there must be some mistake!”

It wasn’t a mistake. Martin Weigert was finally convinced when he saw pictures of fruit flies. In the noisy original image, no structures were visible to the eye. But after a machine learning network was trained, it calculated immaculate images showing detailed distributions of protein in the membrane of fruit fly wings.

To really adapt machine learning for biological images, the research team had to overcome several challenges. Microscopic images are often three- or four-dimensional, biological materials have different densities, microscopes have certain point-spread-functions and cameras have certain sensitivities. But eventually, CARE was born.

IMG_2108.jpeg

Some members of the CARE team: Florian Jug, Akanksha Jain, Martin Weigert. (c) H. Jambor

Happy embryos thanks to AI

The new machine learning tool CARE not only accelerates de-convolution but also has other advantages. Besides large data sets, photo-toxicity is an ever-present problem of in vivo microscopy. When embryos are imaged for days, they can die while being observed.

Akanksha Jain is interested in the early development of the Tribolium beetle and says, “Even if my beetles don’t die, I am concerned that the laser power I need for good images will cause artefacts.” For Akanksha, a postdoc in Pavel Tomancak’s laboratory, this problem is solved with CARE: “Once the network is trained, I can image at only 0.1% laser power, you literally see nothing, and then reconstruct the images with CARE. It’s mind-blowing how good that works.” So far, CARE has been convincing in every tissue and model organism tested. Martin Weigert and his colleagues have already used it with mouse liver, zebrafish retina, fruit fly embryos, flatworms, Drosophila wings and Akanksha’s beetles.

How does CARE work?

To train a CARE network you need only a few pictures. These images are specific to a microscope set-up, tissue and fluorescent marker, and they must be acquired as image pairs at low and high laser intensity. CARE then calculates the correction values from the typical deviations between the two.

CARE in application

In practice, the training set of images could be taken at the beginning and the end of an experiment, or also much later. And, once trained, a CARE network can be re-used indefinitely for future (and in fact also for past) experiments.

The demands on the microscope are also low: CARE can be used for any microscope type. The only requirement is that users can vary the laser intensity for the training images. Akanksha Jain confirms this: “CARE does not change the type of experiments I do, but it increases the possibilities of how I can capture images: it becomes faster and more robust”.

Florian Jug also sees the use of CARE for biologists as unproblematic. In his opinion, the biggest difficulty is “that people can’t install it on a computer.” For this, the team has taken precautions: the publication is accompanied by detailed online documentation explaining step by step how CARE works with Windows or Linux.

Careful with CARE?

An important question is always whether software can introduce artefacts into the data. After Martin Weigert got CARE up and running in just a few days, it took the team about a year to be fully convinced that it works error-free. “We had to find out: can I trust the network?” says Florian Jug. The team first showed that independently trained CARE networks deliver comparable results, similar to two people solving the same math problem along different paths. The team also verified that the variance isn’t visible on a single-pixel basis, demonstrating that CARE does not insert or change any data, but merely sharpens what is already present.

CARE thus allows biologists to get better images without having to invest in better microscopes. Martin Weigert is finishing his doctoral thesis, but the next innovations are already in the pipeline. A method to calculate optimized images without reference images is in the planning stage.

Published in Nature Methods, December 2018 and at BioRxiv

This is an English version of an article that also appeared in issue 3/2019 of Laborjournal.


Pick’n’mix plots

Follow-up to: Showing distributions

 

When writing about the half-and-half plot, many of you replied with further discussion points, tips, and tutorials. I tried to collect them here to make them available to everyone.

More mixed boxplots

Aaron Ellison @AMaxEll17 brought to our attention that he published a plot in 1993, where he overlaid the box plot with the data points (see fig 1A). Along with it he published the code, pre github et al. Aaron was inspired by the just published “Grammar of Graphics” by Wilkinson. He seems to be the first person to have published it in a paper?

Today, boxplot/data plots are common and easy to plot in R with ggplot2. Declan O’Regan @DrDeclanORegan shows us one example in figure 1B. An “exploded” version, where the boxplot and its metrics are barely visible and the focus is on the data points, is shown in figure 1C (provided by the cystic fibrosis Gene therapy group @CFGT_Edinburgh).

3 variations of boxplot mixed with data plots.

Figure 1: Boxplot mixed with data plots.

Box’n’Bee

There is also the overlay of boxplot with the bee-swarm plot. Here, individual data points are ordered and arranged in a U-shape instead of randomly placed. An example is shown by Darren Wisniewski @Dmwizzle, who made this in ggplot2 (fig 2A).

But beware of the bee swarm: the ordered arrangement of the data (U- or A-shape most common) may introduce visual artifacts. And, personally, I draw a mental line through the U-shaped branches and straighten it to understand the data. This is error-prone and of course a waste of time when the line could equally be straight. In figure 2B I have plotted the same data as bee plot and dot plot for a direct comparison. I feel it is easier to see how the data is distributed in the data/dot plot. (Data: gene expression of RNAs that are localized at the poles in the fruit fly oocyte. RNAs that localize at the posterior for days have higher expression than RNAs at the anterior pole that are localized just for a few hours).

Pick_fig2-01.png

Figure 2: A. Boxplot and bee swarm plot. B. Comparison of bee versus data/dot plot.

Histogram & boxplot

Robert Grant @robertstats pointed us to an interesting histogram overlaid with statistical summaries that was originally designed by @f2harrell (here is a link to a tutorial with R), see figure 3. The horizontal histogram shown below has particularly small bins and the median and quartiles indicated below – for my taste a bit too small.

Picture5

Figure 3: histogram with boxplot overlay.

Violin and data

Of course, there are also mixed plots with violin plots. Violin plots themselves are most often already overlaid with a boxplot. Another possibility by Wouter de Coster @wouter_decoster is to mix the violin plot with a bee swarm plot, which he implemented with python seaborn (fig 4A). As you know, I personally would have preferred the actual data instead of the bee swarm, see above.

Joey Burant @jbburant put forward the idea of mixing data points as a histogram with half of a violin plot, see figure 4B.

pick_3-01

Joey also nicely documented how in github:

When the histo-violin is flipped horizontally it looks like a raining cloud. Rogier Kievit @rogierK therefore named it the raincloud plot and just deposited a preprint article about this plot type and its implementation. For matplotlib users, a guide has been posted on github.

In excel…

Jorge Camoes @wisevis shows us that such plot types can also be made in excel – he shows us a horizontal boxplot with data points above from his book (fig 5). I generally like horizontal boxplots, especially when comparing lots of categories! The half-and-half plot has also been re-created in excel. Both are phenomenal, I had no idea excel could do this much!

Picture8

Figure 5: Boxplot with data in excel

… and matlab

And finally, matlab users rejoice: it is also possible to make mixed plots in your favorite environment. Matt Cooper @mattguycooper suggests using the ‘notboxplot’ function on the file exchange, which creates ‘box plots’ with dot plots overlaid. This gives you plots as shown in figure 6:

Picture9

Figure 6: boxplots with data in matlab

More: Tutorials and interactive plots

Bogdan Micu @trizniak points us to a nice interactive violin plot: https://plot.ly/r/violin/.

A couple of tutorials: Frank Soboczenski @h21k shows us the code for making half-and-half boxplots in R: https://github.com/h21k/R/blob/master/snippets/half_box.R, James Rooney @jpkrooney pointed us to a great tutorial for making violin plots with ggplot2 by Katherine Wood @kathmwood https://inattentionalcoffee.wordpress.com/2017/02/14/data-in-the-raw-violin-plots/ and @lisadebruine compares different plots with the same data: https://debruine.github.io/plot_comparison.html.

 

 

 

Axis breaks

I challenge you: it is almost never necessary to break an axis! I found this one in a recent scientific article. The axis break here is 100% unnecessary. All that was needed was a little re-scaling of the dot size and the y-axis.

(or in fact: not reporting the one number with a graph in the first place. Or: showing an almost complete pie- or bar chart).

Imported File (2)-01

When you cannot avoid an axis break, for example when showing highly divergent values, do it well: the break point must be clear in the axis AND in the data! Another solution is to show the values along a logarithmic scale. However, this works only for some data types and is almost always harder to read (make sure to then include grid lines to focus your audience right away on the unusual axis layout!).

Also, never use a design that accidentally gives the impression of a broken axis. In this example instead of a boxplot with a black bar for median, a white one was chosen, giving the impression of some interruption in the data.

IMG_0273

 

Showing distributions

When reading about co-evolution of prey and predators, I stumbled across a cute new plot type: a half boxplot, half dot plot. Wilson et al. created this plot to simultaneously visualize both the summaries of their data (center, spread) and the actual data points. This allows us, the audience, to learn a lot about their results: that cheetahs are maybe bimodally distributed and have outliers, or that zebras show a curious clustering.

Wilson_redraw

Half boxplot- half data plot, re-drawn from Wilson et al. 2018

Your quick guide to distribution plots

The half-and-half, aka dox-plot (a friend), led me to explore which visuals are commonly used for showing distributions.

Raw data

To show how the raw data is distributed, we simply use dots or bars (as in barcode plots). When there are overlapping data points, we can use transparent fill colors or “jittering”, which distributes the data points in a given area, for increased clarity.
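As a rough illustration of jittering, here is a minimal Python sketch. The helper name `jitter_positions`, the uniform offsets and the default width are assumptions made for this example, not the method of any particular plotting package. The point is that each measurement keeps its value and only gets a small random offset along the category axis.

```python
import random

def jitter_positions(n_points, center=0.0, width=0.2, seed=0):
    """Return x positions scattered uniformly within +/- width/2 of center.

    Hypothetical helper: only the position along the category axis is
    randomized, the measured values themselves are never altered.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    half = width / 2
    return [center + rng.uniform(-half, half) for _ in range(n_points)]

values = [4.1, 4.1, 4.2, 4.1, 4.3]  # made-up, overlapping measurements
x_positions = jitter_positions(len(values), center=1.0)
for x, y in zip(x_positions, values):
    print(f"x={x:.3f}  y={y}")
```

A narrow width keeps the groups visually separate; a fixed seed means the figure looks the same every time it is redrawn.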

Summarizing the data: center and spread

Often, we are interested in summary statistics to judge and compare data. By convention the center (median) is indicated with a horizontal bar and the spread (variance, standard deviation) with vertical whiskers. Common plot types for this are the “star-wars rebel fighter”, the dot plot and the boxplot. Bar plots used to be widely used, but are now banned by most journals for concealing most of the relevant information, so they are here only for completeness (see previous post). Very often these days I see boxplots that are overlaid with the data points – this works really well for up to 100 data points and is easy to implement with most software.
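To make the summary numbers concrete, here is a small Python sketch of what a boxplot typically encodes. It assumes the common convention of drawing whiskers to the most extreme points within 1.5 times the interquartile range; other conventions exist, and the function name `box_summary` and the data are made up for illustration.

```python
import statistics

def box_summary(values):
    """Compute quartiles, whisker ends and outliers for one group of data.

    Assumption: whiskers reach to the most extreme data points that lie
    within 1.5 * IQR of the box; anything beyond is reported as an outlier.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value
print(box_summary(data))
```

Overlaying the raw points on such a summary shows at a glance whether the single outlier or the bulk of the data drives the picture.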

Part_2_FocusCenter

Show shape and unusual features

For normally distributed data, the center and the spread are highly informative. However, in life science we often have bimodal distributions, clusters, or gaps. Then boxplots become insufficient and might even conceal interesting aspects (if not be outright misleading).

For faithfully showing distributions, histograms have a long history. Here, one has to be very careful with choosing bin sizes: too large or too small bins can greatly distort the histogram shape and result in a misleading chart. Choosing bin sizes is a science in itself, for details see wikipedia – but basically, it again depends on the data shape and sampling depth.
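Two widely used rules of thumb for the number of bins are Sturges' rule and the Freedman-Diaconis rule. The sketch below is a plain-Python illustration with made-up data and hypothetical function names. It is meant to show how much the two rules can disagree, not to recommend either of them.

```python
import math
import statistics

def sturges_bins(values):
    """Sturges' rule: bin count grows with the logarithm of the sample size."""
    return math.ceil(math.log2(len(values))) + 1

def freedman_diaconis_bins(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    width = 2 * (q3 - q1) / len(values) ** (1 / 3)
    if width == 0:  # all central values identical: fall back to one bin
        return 1
    return max(1, math.ceil((max(values) - min(values)) / width))

data = list(range(1, 101))  # made-up, evenly spread sample of 100 values
print("Sturges:", sturges_bins(data))
print("Freedman-Diaconis:", freedman_diaconis_bins(data))
```

Sturges only looks at the sample size, while Freedman-Diaconis also takes the spread of the data into account, which is why the shape and the sampling depth both matter.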

An alternative to histograms are density plots. They show only how the data are distributed, but in life science are not frequently used yet. They become very useful for large data sets that can’t be visualized as raw data points anymore, or where the eye would not be able to comprehend the spread intuitively anymore. A rather recent but so far “happy marriage” is the violin plot. Violin plots are a fusion of the boxplot and its summary statistics, with the density/shape of the data (Hintze and Nelson, 1998 doi: 10.1080/00031305.1998.10480559).

Part_3_FocusShape

 

Pros and cons of different plot types:

Table_procon.png

All figures as one:

geneology of distribution charts_5-01

 

It’s playtime: Bad Poster Bingo!

Your next conference is coming up and you need a distraction for the poster session? Then consider playing “Bad Poster Bingo” and win a prize*! Better even, use this guide to avoid common pitfalls we experience when presenting scientific data on a poster! Tips are also in my previous guide to good poster design & references therein.

How to play Bad Poster Bingo

Complete 5 in a row, horizontally/vertically/diagonally, send me an example & win!

> Download Bad_Poster_Bingo_bunt or Bad Poster Bingo (plain) <

Bad_Poster_Bingo_bunt-01.png
* your choice for a win: a Helena-drawn portrait, a Helena-made badge of honor, or an honorary mention on twitter 🙂

Venn, Euler, upset: visualize overlaps in datasets

Visualizations for comparing datasets are a topic in all my data viz classes. Current solutions for comparing 2, 3, 4 and more datasets are diverse and some are controversial. A one-size-fits-all solution does not exist, but there are well-working solutions, and some that should be avoided.

1-3 datasets

When comparing two or three datasets, Venn diagrams work well. Most people already learn about them in school, and if not they are intuitive*. Each dataset is shown as a circle and they are arranged such that all possible overlaps are shown. Done.

4 and more datasets

Things get problematic when comparing more than three datasets. Mathematically, it is not possible to show all overlaps of four or more datasets with circles. One possibility is to leave out some overlaps, as is often done in Euler diagrams. In the example below the overlap between “oocyte stage 2-7” and “oocyte stage 9” is for instance not visualized (RNAs localized in the oocytes across development, see publication). I find it however confusing when data is left out, and sometimes “no overlap” is important information in itself.

Venn_euler_fourdatasets-01

Datasets of RNAs subcellularly localized in Drosophila oocytes.

Venn himself devised a diagram for comparing four and more datasets by switching from circles to ellipses. Branko Grünbaum developed the ellipse representation further for the comparison of five datasets. Their strategies are used by the online tool Draw Venn (Yves Van de Peer, Univ of Gent) where you can make Venn plots by simply uploading your data. A variation is used by Heberle et al here (publication). There is also an R package by Victor Quesada.

I find there are two problems for Venn diagrams with more than three datasets. First, it takes a long time to read them and extract all information: comparing four datasets gives a diagram with 15 regions/11 overlaps, five datasets gives a diagram with 31 regions/26 overlaps! I invariably end up writing the numbers down into my own table. Secondly, the areas can’t possibly be representative of the overlap size – and this information is lost.
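The region counts quoted above follow from simple counting: n datasets have 2^n - 1 non-empty combinations, and n of those are the "only in one set" regions. A short Python sketch (the function name is made up for this example) enumerates them:

```python
from itertools import combinations

def venn_regions(n_sets):
    """Count all regions of a complete Venn diagram and how many are overlaps."""
    names = range(n_sets)
    # every non-empty combination of datasets is one region of the diagram
    regions = [combo
               for size in range(1, n_sets + 1)
               for combo in combinations(names, size)]
    overlaps = [combo for combo in regions if len(combo) >= 2]
    return len(regions), len(overlaps)

for n in (2, 3, 4, 5):
    total, shared = venn_regions(n)
    print(f"{n} datasets: {total} regions, {shared} overlaps")
```

Each additional dataset roughly doubles the number of regions, which is why diagrams with more than three sets get hard to read so quickly.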

New: upset plots

An alternative solution, the upset plot, was developed by Nils Gehlenborg and Jake Conway. Presence of dataset elements in a given intersect is shown with a dot in a simple table. The size of the intersect is represented with a bar chart. Both are simple visuals that are easy to consume. Their package is available in R and simple to use.
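The table behind such a plot is easy to compute by hand. The following Python sketch is not the R package's API but a plain illustration with made-up dataset names and elements: every element is assigned to exactly the combination of datasets it occurs in, and the size of each combination becomes one bar.

```python
from collections import Counter

def upset_counts(datasets):
    """Count elements per exact combination of datasets.

    datasets: dict mapping a dataset name to a set of elements.
    Returns a Counter keyed by the frozenset of dataset names in which
    an element is present (hypothetical helper for illustration).
    """
    membership = {}
    for name, elements in datasets.items():
        for element in elements:
            membership.setdefault(element, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())

# made-up example: RNAs found at three stages
datasets = {
    "early": {"a", "b", "c", "d"},
    "mid": {"c", "d", "e"},
    "late": {"d", "e", "f"},
}
counts = upset_counts(datasets)
# one table row per combination: dots for the datasets, bar for the size
for combo, size in sorted(counts.items(), key=lambda item: -item[1]):
    dots = " ".join("●" if name in combo else "·" for name in datasets)
    print(f"{dots}  {'#' * size} ({size})")
```

Because each element is counted once, the bars add up to the total number of distinct elements, which is a handy sanity check.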

 

Customising upset plots

While the upset plots are simple, I think they can be improved. In upset plots the intersect is shown above the actual datasets, which serve as the legend. Basically, one is forced to read the upset from the bottom up. By flipping the plot horizontally this caveat is overcome: now the datasets are on the left, where we typically read first, and the bar is shown on the right nicely accompanying the respective set. Another improvement is to clearly label the intersects e.g. “present in one set”, “two sets” and to group them visually. Additionally, I have also color-coded the datasets to provide a quicker way of orienting the reader.

Depending on your message, you will have to find the optimal ordering strategy. I visualized the subcellular enrichments of RNAs and how they change localization during the development of the fruit fly oocyte. I would want to learn e.g. what happens to the hundreds of specific RNAs that enrich at early stages? Do they remain localized at all stages? It turns out the majority give up their specific subcellular enrichment and instead become distributed inside the cell while other RNAs (not visualized here) take their place (more information on the biology).

Upset_RNAs_AI_bunt_sorted-01

Note, I did all the fine-tuning of the upset plot with Illustrator, but most likely it is also possible in R directly.

 

* Be aware that more people than you expect do not know Venn diagrams & require an introduction!

 

 

How big is a ribonucleic acid*?

I am often surprised about the real dimensions of biological entities versus how they are shown in textbooks and scientific illustrations and this is very striking for ribonucleic acid (RNA). Ribonucleic acids themselves are not photogenic as they move and wiggle, and in textbooks are shown as short strands bound by 1-2 proteins. Not really – ribonucleic acids are bundled up, associate with hundreds of proteins, cations, and other small molecules, and have a higher spherical dimension than proteins.

Quiz time! What is your guess for the physical length of a “typical human ribonucleic acid*” (let’s say 2-5 kilobases)? Don’t look it up! Draw it on the image below in relation to a human egg, a skin cell, a yeast, bacterium, or viral capsid & send to [hjambor – at – gmail.com], I’ll include it in the collection below. Or just post your guess in micro-, nano-, or picometers in the comments!

 

Answer:

……

 

……

 

……

A single nucleotide, which is the smallest building block, spans 3.4 Angstrom, or 340 picometers, or 0.34 nanometers. Three nucleotides encode one amino acid in a protein, therefore ribonucleic acids* are three times longer than the respective protein. In addition, ribonucleic acids have many nucleotides that only serve regulatory purposes: they help with or block protein translation, or they influence stability and degradation.

Screen Shot 2018-03-04 at 21.45.50

The average yeast ribonucleic acid is ~1500 nucleotides (Miura, BMC Genomics, 2008), which adds up to a whopping 510 nanometers, or 0.5 micrometers, spanning a good portion of the length of the entire budding yeast itself!

The average human ribonucleic acid molecule is 2000 to 6000 nucleotides, resulting in a physical length of 0.7 to 2 micrometers (Strachan and Read, 1999, Human molecular genetics). This is after a process called splicing, which removes about 60-80% of the nucleotides before a protein is even made from it. Before splicing, right when they are transcribed from the DNA template, human ribonucleic acids are 3-5 micrometers long – that is longer than a virus capsid, a bacterial cell, a yeast cell, and even larger than the diameter of the nucleus it is transcribed in! These are just averages, the longest human ribonucleic acids measure 100 (Titin) and even 600 micrometers (caspr2). To fit inside a cell, and the nucleus of a cell, ribonucleic acids curl up and are compacted. And even in the cytoplasm, where they are shorter, ribonucleic acids take up a lot of space – on average about half of the genes are transcribed at any given time point, and typically each ribonucleic acid is present in multiple copies.
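The lengths quoted above are simply the nucleotide count multiplied by the 3.4 Angstrom (0.34 nanometers) per nucleotide. A few lines of Python reproduce the arithmetic; note that this assumes a fully stretched molecule, which is the simplification used throughout this post.

```python
NM_PER_NUCLEOTIDE = 0.34  # 3.4 Angstrom per nucleotide, as stated in the text

def rna_length_nm(n_nucleotides):
    """Length in nanometers of a fully stretched RNA (simplifying assumption)."""
    return n_nucleotides * NM_PER_NUCLEOTIDE

examples = [
    ("yeast average", 1500),
    ("human, lower end", 2000),
    ("human, upper end", 6000),
]
for label, n in examples:
    length = rna_length_nm(n)
    print(f"{label}: {n} nt = {length:.0f} nm = {length / 1000:.2f} micrometers")
```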

Now compare your guess to the answers I got from molecular biologists – their replies varied from 10 nanometers to 100 micrometers! Mind you, my own guess was far off as well, and that after having worked with localized ribonucleic acids for over 10 years!

1_

What do we learn? Biological entities span 10 orders of magnitude, therefore faithful representations of size are neither possible nor expected in illustrations that merely symbolize information. On the other hand, our visual memory is pretty good – once we saw information as a picture, we tend to believe it. By memorizing false relative scales, we may thus lose important information that may help us interpret research data.

* For the enthusiast: I mean messenger ribonucleic acids (mRNAs), the class that encodes proteins. These are generally longer than other categories of RNAs that do not encode proteins, such as rRNAs and tRNAs, miRNAs, and piRNAs.

 

Conformation of the insulin receptor

A few days back, my fellow CNV grantee Theresia Gutmann from the Coskun lab casually told me over dinner about her PhD work. In collaboration with the Rockefeller University NYC, Theresia had visualized the changing conformation of the human insulin receptor upon insulin binding (paper). Having just started at the Center for Regenerative Therapies Dresden with its focus on Diabetes, I could not believe that this had not been done before! To honor her achievement, I made a #sketchnote of the discovery and a GIF explaining insulin in our body (below).

theresia_new.png

Insulin:

insulin_6

Paper: Gutmann, Kim et al. (2018): Visualization of ligand-induced transmembrane signaling in the full-length human insulin receptor. Journal of Cell Biology, DOI: 10.1083/jcb.201711047

 

 

Real viz coming soon, today status: tired!

I have a couple of things I want to prepare and show, but we, like many in Germany, are submitting a DFG excellence strategy grant next week. It’s a giant project, for seven years, with many, many players and a lot of coordination, politics and details… To relax in the evening, I do what I always did to calm myself: drawing!

More information design soon!

Mom, can you draw a unicorn for me…

img_7807.jpg

Stop drawing me mom!

img_7804.jpg

How big is stuff in biology?

It is easy for everyone, already from kindergarten age on, to judge and compare sizes and lengths. Which lollipop is biggest, that the Eiffel tower is tall, and that matchbox cars are smaller than real ones. But it is rather difficult to understand sizes at macroscopic and microscopic scale, because we never get to see it with the unaided eye, and most of us just see images taken by others.

I probably read hundreds and hundreds of times that a cell is around 20um; I vaguely remember that many bacteria are 1/10th of that size because a difference of one order of magnitude is easy to remember. But how much bigger a cell is than a virus, and how much smaller in relative terms than my finger, I read up on again and again.

To help myself, I started drawing the relative sizes of various biological entities that I am fascinated with. Myself (here: my thumb), a fruit fly (my model organism in research for 10 years), eggs of various sizes, cells and my beloved ribosome, a wonderful machine made of many proteins and importantly, RNA that exists in every organism.

Screen Shot 2018-01-25 at 22.14.01

Screen Shot 2018-01-25 at 22.13.50

Screen Shot 2018-01-25 at 22.13.41

4_yeast-bacteria-hiv

Screen Shot 2018-01-25 at 22.13.06

Screen Shot 2018-01-25 at 22.11.44

Screen Shot 2018-01-25 at 22.10.49

While making the drawings and looking up sizes, I was once more mesmerized to rediscover that a membrane lipid is not that much bigger than a water molecule! And that a human egg, which itself is 10 times larger than an “average cell”, is almost visible by eye! Also, consider this: cells come in vastly different sizes, the longest cell in the human body is around one meter long, while the smallest is around 10um. In other words, cells can vary in size over five orders of magnitude, from 10 to 1 000 000um! That means, if you think of the smallest cell as a tennis ball, the largest would be in comparison as tall as Mount Everest (and, their nucleus is still the same size…)!
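The comparison can be checked with a quick back-of-the-envelope calculation in Python. The tennis ball diameter (about 6.7 cm) and the height of Mount Everest (about 8849 m) are assumptions added for this sketch; the cell sizes are the ones given above.

```python
import math

SMALLEST_CELL_UM = 10          # smallest human cell, from the text
LONGEST_CELL_UM = 1_000_000    # longest human cell, about one meter

TENNIS_BALL_M = 0.067          # assumption: diameter of roughly 6.7 cm
EVEREST_M = 8849               # assumption: approximate height in meters

def orders_of_magnitude(small, large):
    """Number of powers of ten separating two sizes."""
    return math.log10(large / small)

scale_factor = LONGEST_CELL_UM / SMALLEST_CELL_UM
scaled_ball_m = TENNIS_BALL_M * scale_factor

print(f"orders of magnitude: {orders_of_magnitude(SMALLEST_CELL_UM, LONGEST_CELL_UM):.0f}")
print(f"tennis ball scaled up: {scaled_ball_m / 1000:.1f} km")
print(f"Mount Everest: {EVEREST_M / 1000:.1f} km")
```

Under these assumptions the scaled-up tennis ball lands in the kilometer range, the same order of magnitude as the mountain, so the picture holds as a rough comparison.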

Have fun looking through the comparisons! A beautiful inspiration is here.

PS Also take note how one can use both relative size and scale bars for showing the size of an object! Please, never ever forget to add scale bars to your images, they are the only clue that allows your audience to relate the content to reality!
