Our life can be described with numbers and visualized in diagrams. However, every mean and every chart is a simplification. Useful and fascinating, but also obstructive and restrictive.
Premiere in Hellerau
Yesterday we experienced a wonderful visualization of two “lives in numbers” at the Festspielhaus Hellerau. The two choreographers and dancers Katia Manjate and Anna Till meet in the Dresden premiere of “Life in numbers”. Four years ago, the artists began to compare their lives on the basis of statistics, and through these numbers got to know each other. This encounter gave the impulse for “Life in numbers”, and it is also how the piece begins: with a comparison of life in Dresden, where Till lives, and Maputo, where Manjate lives. Over the course of the eight parts, we first get to know key data, which are translated into dance on a Cartesian coordinate system, with movements along invisible x- and y-axes. The dancers themselves form the data points, always in relationship to each other, yet changing, converging, and diverging over time.
Differences and similarities
In the middle part of the piece, we turn away from mean values and towards individual data points. The question is no longer “How long does a woman live in Germany, and how long in Mozambique?”; instead, Till and Manjate ask each other directly: “How old are you?”. The confrontational, curious questions are staged like a duel and bathed in bright light. The shadows of the bodies are projected into rectangles formed by the spotlights, visualizing our thinking in boxes. The audience waits with great anticipation for the next answer.
In the third part the tension is released, the light becomes warm and the dance playful. The artists now begin a joyful confrontation with their differences and celebrate their similarities. This is supported by music and rhythms from both countries.
Numbers are beautiful
In the end, Till asks: can we live without numbers? “Life in numbers” answers “no”, but also shows the life within numbers. And it demonstrates that numbers are more than economic values. Numbers may describe how often we laugh per day, how long we see the sun, and what rhythms our feet can dance.
As a scientist, I avoid 3D diagrams because they distort data. With “Life in numbers”, you can experience how dance, a three- or even four-dimensional visualization, can make numbers directly tangible and fascinating.
We keep discussing axis layouts and the problematic case of non-zero baselines in bar charts. Here is another example from the city of Dresden. Dresden is a really pretty place and always worth a visit. With the chart below, the city wanted to showcase that new record tourist numbers are set each year.
Truthful bar chart
Now, since this isn’t the first time we discuss baselines, you should immediately spot that a rise from around 200,000 to 300,000 isn’t even close to the tripling that the bar lengths visually suggest. In fact, the bar lengths do not represent the increase at all; they rather seem to have been chosen to fit an imaginary linear increase. I re-plotted the bar chart with a zero-baseline. Lo and behold, the rising number of tourists is still visible, but clearly not nearly as record-worthy.
The image, however, circulated in 2017 – three years after the last data point shown! Now, if you know that since the end of 2014 Dresden has been plagued by very prominent weekly demonstrations of right-wing activists, having no data after 2014 is alarming. In the local science and business community the problems are very evident: we see a clear drop in international scientists applying for and accepting jobs in the city! I therefore went to the Dresden city website to get the data for the subsequent years, and this confirmed what I suspected: tourist numbers no longer rose; they even dropped!
Line or bar chart?
Time trends are usually more visible in line charts. Indeed, the drop in tourist numbers since 2014 is very apparent in a line chart, and even more so when we leave out the zero-baseline, which otherwise flattens the data (note: in line charts, leaving out the zero-baseline is OK and sometimes even necessary!).
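For readers plotting along at home, here is a minimal matplotlib sketch of the bar-versus-line contrast. The visitor numbers are invented placeholders, not the real Dresden figures:

```python
import matplotlib
matplotlib.use("Agg")              # render off-screen, no display needed
import matplotlib.pyplot as plt

# Invented visitor numbers (millions), purely for illustration
years  = [2011, 2012, 2013, 2014, 2015, 2016]
visits = [1.9, 2.0, 2.1, 2.2, 2.1, 2.0]

fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(8, 3))

ax_bar.bar(years, visits)
ax_bar.set_ylim(0, None)           # bar chart: the zero-baseline is mandatory
ax_bar.set_title("Bar: keep zero")

ax_line.plot(years, visits, marker="o")
ax_line.set_ylim(1.8, 2.3)         # line chart: zoom in so the trend is visible
ax_line.set_title("Line: zoom is fine")

fig.savefig("tourists.png", dpi=150)
```

In the zoomed line panel the 2014 peak and subsequent drop jump out; in the zero-based bar panel the same data looks almost flat, which is exactly the honest impression.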
Fun with Excel
And, did you know you can use a picture as the background for your chart in Excel!?!
Bar charts encode categorical values by length. By comparing bar lengths, we can visually compare the category sizes.
When a bar is truncated due to a missing zero-baseline or an interrupted y-axis, the relative size difference between the bars changes. The bars then no longer visually encode the actual category values. (Read more in a previous blog.)
Misleading bar chart
The above is a DataViz classic. FOX NEWS reported an (to them: alarming) increase in Obamacare enrollments over a few days with a bar chart. They apparently feared imminent bankruptcy of the USA and therefore saved the nation by overemphasizing the increase with truncated bars. Instead of the moderate ~20% rise (6 to 7 million), their bars showed a 300% increase in length!
I quickly re-designed the chart in Excel. The increase is still clearly visible, even with a zero-baseline.
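The distortion is easy to quantify with a few lines of arithmetic. The 5.5 million axis start below is an assumption for illustration, since the exact baseline of the FOX chart is not given here:

```python
def bar_ratio(a, b, baseline=0.0):
    """Visual length ratio of two bars when the y-axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

# Honest chart: zero-baseline, 7 million vs 6 million enrollments
true_ratio = bar_ratio(6.0, 7.0)                 # ~1.17x, a modest increase

# Truncated chart: axis assumed to start at 5.5 million
truncated = bar_ratio(6.0, 7.0, baseline=5.5)    # 3.0x, a huge-looking jump

# Tufte's "lie factor": how much the visual impression exaggerates the data
lie_factor = truncated / true_ratio
print(f"true {true_ratio:.2f}x, truncated {truncated:.2f}x, lie factor {lie_factor:.1f}")
```

The same two numbers, but the truncated axis makes the second bar three times as long instead of 17% longer.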
A German version of this appears in Laborjournal 9/2019.
It took me around one year from data analysis to the final visualization of my postdoc results. A year in which I tried many visualization strategies and simplified them again and again. Helpful at that time were Edward Tufte’s book “The Visual Display of Quantitative Information” and my pre-science background in applied art, which taught me the effects of color and shape.
While visualizations have been used in science since antiquity, their number and complexity have increased significantly. Nevertheless, even today budding scientists do not learn to encode and effectively decode visualizations. This is evident in the numerous examples of difficult-to-read figures and outright misleading charts. In the past years, I have taught data visualization to hundreds of scientists. A few tips and tricks to increase the readability and effectiveness of data figures are summarized here.
As always, the plan comes first. Who is my audience, what do I want to say, and what is my message? Only when these questions have been answered succinctly can you begin to implement the visualization. What otherwise happens are charts with 50 categories, 12 colors, or an unsuitable diagram type altogether.
Testing a visualization is a must. Does the visualization work on paper? Is the electronic version readable? We should always look for external, objective reviewers. Often these are not the laboratory colleagues who already know our results and our awkward visuals. Ideally, you should find several guinea pigs to get a feeling for how representative the answers are.
Choose diagram type
A diagram should represent data truthfully and functionally. It is therefore important to select the diagram type suitable for the data type. A line chart shows a temporal progression, a column chart compares different categories, and if the category names are long, a horizontal bar chart accommodates them (Figure 1). Pie charts effectively show percentages for up to 5 rather differently sized categories. They quickly become a disaster if used for more categories, in 3D, or if two pie charts are to be compared precisely.
To avoid being misleading, some basic visualization rules should be followed: omitting the zero line distorts relative size differences in column and bar charts (Figure 1C). In line charts, on the other hand, the zero line can be omitted because only trends are read; in fact, at times a zero-baseline in line charts can itself be very misleading. Utmost caution is advised with charts for statistical distributions: the box plot is only suitable for normally distributed data, and in histograms the number of bins must match the number of data points. Also important for readability: the independent variable, e.g. time, is shown on the x-axis, the dependent variable on the y-axis (not as in Fig. 1D!).
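As a concrete illustration of matching the bin count to the data size, here is one common rule of thumb (Sturges’ formula; the article itself does not prescribe a specific rule, so treat this as one option among several):

```python
import math

def sturges_bins(n):
    """Sturges' rule of thumb: the bin count grows with log2 of the sample size."""
    return 1 + math.ceil(math.log2(n))

for n in (30, 100, 1000):
    print(n, "data points ->", sturges_bins(n), "bins")
```

With 30 data points you get 6 bins; forcing 50 bins onto 30 points would leave most bins empty and the histogram unreadable.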
Without text, a diagram is just an abstract piece of art. To be self-explanatory, figures need a title, axis labels, and a legend. Nowadays, the title often sits directly above the diagram itself; otherwise it is found in the caption. Legends should be placed where they are needed – i.e. not in the caption, but close to the data (see: Layout).
The following applies to all text elements: abbreviations should be avoided, with the exception of very, very common ones. Again, find colleagues to try out your favorite abbreviation. You’d be surprised how often an abbreviation turns out to be lab jargon or really specific to your field of research. But your papers should not be! (Googling also works: if the abbreviation is not found among the top entries, you probably should not use it.)
It is also important to avoid redundancies (especially in axis labels!), filler text, and passive language.
In a fraction of a second, we capture visual information and understand its content, color, and organization. The better this information is structured, the faster we understand it.
Most importantly, we usually read images like text, from top left to bottom right. That’s why posters have titles at the top and references at the bottom. Just like on posters, however, we can also increase the reading speed of individual charts: a meaningful title might be placed above a diagram (left-aligned is usually good!) and color codes should be explained right there. On the other hand, scale information, data source, or sample size can happily be placed at the bottom right.
Because we read visual information similarly to text, composite images should be organized in either rows or columns. Once such a reading direction is set, it should not be changed (best: keep the reading direction for all figures of the manuscript). To maintain an overall organized look, you are best advised to maintain the (imaginary) borders of the columns and rows – known in layout as the underlying grid.
A wonderful resource for making visual information effective are the “Gestalt principles” described in the 1920s, which describe how we understand visual information. Helpful are: a simple overall form, for example ordering bars according to their size; symmetry in the layout; proximity of objects that belong together; and a similar appearance of objects that belong together: the same color code across the manuscript, easily recognizable symbols for the same category in scatterplots, etc.
Henry Dreyfuss described color as the exclamation mark of a design. He meant that color invariably provokes a reaction, attracts attention, and, like the punctuation mark, should be used last and strategically. In diagrams, color is used to create groupings (see Gestalt principles above), to convey quantities (e.g. red stands for high, blue for low temperatures), or to highlight part of the data (e.g. a red line among many gray lines, Fig. 2 before/after).
To make sure I do not overuse color, I start with grayscales. Sometimes, a black bar among gray bars is already enough to focus the reader. If color is to be used, it has to match the data: quantitative data with a common scale (e.g. animal age in days) should be shown with one hue in different intensities; divergent data (temperature or gene expression above/below zero) with two hues merging into each other; only categorical data can be coded with different hues.
Don’t forget: each color must be explained in the illustration or legend (see labeling) and should be used consistently within a figure, poster, or manuscript (color code).
In any case, colors should also be accessible to colorblind people. So please do not mix red and green tones in one image.
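In matplotlib terms, these three data types map onto the three classes of built-in colormaps; viridis and RdBu also happen to be colorblind-safe choices (a sketch of the mapping, not the only valid palettes):

```python
import matplotlib

# Match the colormap class to the data type (all names are matplotlib built-ins)
sequential  = matplotlib.colormaps["viridis"]  # one hue ramp: common-scale quantities
diverging   = matplotlib.colormaps["RdBu"]     # two hues meeting in the middle: above/below zero
qualitative = matplotlib.colormaps["tab10"]    # distinct hues: unordered categories

# Sample the sequential ramp at both ends: the colors clearly differ
low, high = sequential(0.0), sequential(1.0)
```

Passing a normalized value in [0, 1] to a colormap returns the RGBA color, which is how plotting functions translate numbers into shades.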
Once a first version of a figure is done, it should ideally be improved iteratively until the core statement stands out at first glance (1-second test!) and the figure is self-explanatory. As always, helpful colleagues are gold!
Effective and clear figures are helpful as they transport your message faster, but they also make your science accessible to a wider audience (which means more citations!) and to future readers (reproducibility!). And, if you are extremely successful, some figures even become icons for an entire field, like Darwin’s phylogenetic diagram that is re-used again and again.
Understanding abstract visualizations is an almost inherent human trait. Without much training, kindergarten kids get that a longer item means more – bigger kids are older, longer lollipops are more etc.
Perception of color is immediate, but individually very distinct: around 10% of the male population is red-green colorblind, and seeing colors very much depends on individual photoreceptor composition. During a logo development, I was once very pleased with my beautiful blue-to-green gradient design. Alas, it turned out that the colleague who needed this logo was unable to even see the gradient! He literally thought I was making fun of him by showing him just a blue dot.
Color and shape
Colors alone usually have no meaning. But they can strongly influence the meaning of a shape, and this meaning is a learned cultural standard. A red octagon symbolizes “STOP”, a red human shape indicates “FEMALE”, a red heart means “LOVE”, and a red cable marks the PLUS pole of an electric current.
Changing the color of a symbol can change the meaning of a symbol: While a red octagon means STOP, a green octagon at airports inverts this and means “GO AHEAD”.
XKCD color survey: or what is the guys’ name for lavender?
Despite our individual color perception, an Australian research project has identified the world’s ugliest color in a quest to find a repelling cigarette packaging. The unanimous winner was PANTONE 448C, a murky mix of muddy green and brown. It proved so successful in discouraging smokers that it has since been rolled out in other countries too.
You are free to use it, but beware that it might signal something BAD! 🙂
Tables are one of the most successful visualizations in the history of science. They existed long before charts were invented. Miescher reported the discovery of nucleic acids in a small table with just three rows. Christiane Nüsslein-Volhard and Eric Wieschaus compiled their observations that gene expression controls embryogenesis in a table. And the seminal discovery that cells are enclosed by a lipid bilayer warranted a full-page table:
Gorter, 1925: Table reporting cells have lipid bilayers.
Different types of tables published in a recent issue of Nature.
Tables are still common in scientific papers and presentations today. In a recent issue of Nature, 70% of life science manuscripts incorporated some form of table. Simple tables are often relegated to the supplemental section. Tables in the main section of a manuscript may also appear in a fancier format, such as heatmaps, Hi-C plots, or databases. This common use of tables explains the high interest in the subject among participants of my DataViz classes.
Data suitable for tables
When seeing a figure, we focus on its most prominent features: thick lines, the tallest bars, patterns in a scatter plot. If the figure is done well, this gives us instant insight into the main message. Tables display text and numbers in an organized form. When seeing a table, we approach it like text: we read from top left to bottom right. And this is the intention of a table: tables present complete datasets without a punchy key message. Tables force the reader to come up with a conclusion and are thus more work. And tables quite simply require more space than any summary chart.
Despite these disadvantages, tables are very useful for reporting precise numbers, and for precisely comparing numbers across rows and columns. Tables are also great for presenting large datasets in which every member of our audience is interested in a different aspect of the data. Each reader of the table on ERC starting grants will likely look up their country of residence first. If, however, I wanted to show which country gets the most ERC grants, a bar chart works better (for categorical data, not statistical summaries). One can immediately spot the longest bar, and read it even faster when the data is sorted.
Designing a clear table
Because tables are organized text, alignment and typography are critical for their legibility, and good legibility allows faster reading.
Organize rows and columns.
Which data is used as the key for presenting the results? This goes into the first column of every row. In our case these are the country names. All observations for a country – ERC grants, population, funding rate – get a column each.
Alignment is your friend.
Text is best left-aligned, because that is where we start reading. Numbers are right-aligned to make digit-by-digit comparisons along a column easier. The column headers are aligned with their content: the header of a text column is left-aligned, the header of a number column right-aligned.
Choosing a legible font is always a good idea, but it is really critical for numbers. To be legible, all numbers should have the same height (lining figures) and the same width (tabular/monospaced figures). The same width is important for comparing digits within columns; the same height (no ascenders and descenders in numbers) looks less cluttered overall.
It is not possible to make a simple recommendation for a specific font, since fonts are modified depending on the operating system and program. For example, Arial has proportional numerals, which would not work for numbers in a table; yet Microsoft Word on Mac renders them tabular, while Adobe Illustrator on Mac does not.
To best compare numbers in a column, they should have the same number of decimal places. This helps both alignment and understanding. In general, it is always worth considering whether the decimals are even important: the number of grants does not warrant a decimal because it can only be a whole number. The same holds for humans, but humans in millions might reasonably be presented with one decimal.
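These alignment and decimal rules can be sketched with plain Python format strings; the country figures below are invented for illustration, not the real ERC data:

```python
# Hypothetical rows: (country, grant count, population in millions)
rows = [("Germany", 67, 83.2), ("France", 41, 67.4), ("Israel", 22, 9.3)]

# Text left-aligned, numbers right-aligned; headers follow their column's content
header = f"{'Country':<10}{'Grants':>8}{'Pop. (M)':>10}"

# Grants as whole numbers (no decimals), populations with exactly one decimal
lines = [f"{c:<10}{g:>8d}{p:>10.1f}" for c, g, p in rows]

print(header)
print("\n".join(lines))
```

Because every number in a column uses the same width and decimal count, the digits line up and can be compared at a glance.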
With the above rules, you usually end up with an organized and legible table. But the good news is, you can do more! Removing the left and right borders creates extra space and lets the content use the entire width of a table cell – sometimes rather important for grants!
You can do still more. The minimalist Edward Tufte even says ‘every pixel should have a meaning’. In this spirit, removing all gridlines often works just as well. The well-aligned content is often sufficient to guide the eye through columns and rows.
Very long tables, however, are hard to read without any guides. Two options exist: either the content may be grouped into blocks of 5 or 10 rows each, with the blocks visually separated by white space, or the table can include a horizontal gridline every 5 or 10 rows.
Last, and only last, think about color (black and gray are colors too!). I personally like highlighting table headers by giving them a fill color. Usually a light gray works fine, but this very much depends on the overall presentation and the purpose: if shown with a beamer, some light grays no longer work. Also think of your color code (and never mix red and green). I consistently used pink, so using the same color for the header would give the table a coordinated look. If your fill color is dark, you have to use white labels, and these in turn require larger fonts to achieve equal legibility.
Heatmap. A heatmap is a matrix/table in which the cells are shaded according to a color scale representing an observed value. Heatmaps are particularly used for many-to-many comparisons. They can display very large matrices at a very small scale and allow us to rapidly compare numbers and even see coherent patterns in the data. Heatmaps may also be combined with clustering algorithms (or simple sorting by value), which facilitates seeing patterns in the data. Heatmaps are not useful for reading off precise numbers.
Microarrays. A heatmap that reports gene expression levels across samples. Gene expression is shown relative to a ground-truth state: up-regulation in green, down-regulation in red (microarray heatmaps are thus not colorblind-safe).
Hi-C plot. Hi-C plots are heatmaps in which each pixel represents counts of DNA interactions between two genomic regions. The pixel intensity indicates the number of reads (single color scale) or the divergence of reads from a control (dual color scale). The axes show the genomic regions that are compared, usually binned to e.g. 1 Mb.
Database: an online table format for presenting a large dataset.
Table-chart hybrid: a table with several observations in columns, one of which is presented as a small chart (dot plot, box plot, bar plot) adjacent to the respective row. The chart column usually highlights a particularly important observation.
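The heatmap entry above can be sketched in a few lines of matplotlib, including the simple sorting by value that helps patterns emerge (the matrix is random, purely for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.poisson(5, size=(20, 8)).astype(float)   # made-up 20x8 count matrix

# Simple sorting by row totals, so similar rows end up next to each other
order = np.argsort(data.sum(axis=1))

fig, ax = plt.subplots()
im = ax.imshow(data[order], cmap="viridis", aspect="auto")
fig.colorbar(im, label="value")
fig.savefig("heatmap.png", dpi=150)
```

Swapping the `argsort` for a clustering step (e.g. hierarchical clustering of rows) gives the familiar clustered-heatmap look.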
10 years after it all started, VIZBI came back to its original stomping grounds, the ATC at EMBL in Heidelberg. As its name suggests, VIZBI “Visualizing Biological Data” is a blend of several worlds: of biology, with its long history of visualization going back to Ancient Greek textbooks, and of art and scientific illustration.
VIZBI is also inseparable from computer science and its tools to transform big data into human readable entities. And finally, VIZBI incorporates concepts of design and visual perception to make visualizations engaging and enlightening.
Highlighting spectacular biological images
At VIZBI 2010, microscopic images were omnipresent. Back then, I was embarking on my postdoc project, a large-scale microscopy screen of RNAs in cells. My memories tell me that this was the main focus of the conference. Indeed, a quick check of the 2010 program confirms that almost the entire community of light sheet microscopy and image processing were in attendance at the first ever event.
VIZBI 2019 continued to highlight spectacular biological images. A phenomenal augmented reality installation showed them in 3D, EM-tomography simulations by Peijun Zhang animated the 64 million atoms assembling into HIV particles, and Lucy Collinson shared the vast amounts of high-resolution EM data collected at the Francis Crick Institute. These data are annotated with the help of amateurs, for example in their citizen-science project “Etch a cell” at the Zooniverse.
Colourful confocal images and images of tissues also provided the inspiration for many of the illustrators’ works on display that combined science and art, for example the double win of best poster and best art for a depiction of tubulin in a mitotic spindle by Beata Mierzwa @beatascienceart, a hugely talented artist and scientist (who also sells cool cytoskeleton-printed leggings and mini-brain organoid dresses).
At VIZBI 2019, visualizations of data – as opposed to images – gained a much more prominent spot. All keynote speakers were from the technology side. Hadley Wickham presented the history of ggplot2. Ggplot2 (and yes, there once was a ggplot1!) is the R universe for visualizing pretty much everything that comes in numbers and is now merged into the tidyverse. Being a visualization talk, all slides were themselves beautiful, I love the tidyverse playfully represented as stars of our universe! The second keynote was by Janet Iwasa who presented her animation work that heavily relies on 3D and computer graphics software used for animation films. Instead of earning her money in the film industry, she decided to put it to good use for biology. Janet first used her skills in her PhD project to visualize motor proteins “walking” along the cytoskeleton, and these days produces Oscar®-worthy movies showing biology, such as the origin of life or the life cycle of HIV. And everyone take note: all her films start as a storyboard on paper, which is what I teach as good practice for all visualization designs.
Making the invisible visible
The third keynote was by Moritz Stefaner, a data designer who is enticed by biological data but appalled by the time-scales in biological projects (too long!). Luckily, he hasn’t given up on us just yet, and keeps producing phenomenal visualizations. For example, showing absence and loss is notoriously hard, but Moritz found a beautiful way to make the invisible visible in his designs for “Where the wild bees are” with Ferris Jabr for Scientific American.
Moritz left us hungry for more when also showing his data-cuisine project, which visualizes data about food and turns food into data: the number of berries picked in Finland becomes a layered dessert, and common causes of death are encoded as praline fillings – you never know which one you’ll get! (Luckily these were Belgian pralines, so all deaths are sweet.)
Visualizations of data were in the spotlight of many other projects too. This is of course owed to the many large-scale methods that have swamped biology with data in recent years: RNAseq, inexpensive genome sequencing, mass-spec at fantastic scales, robotics-driven biochemistry and medicine, image processing that turns images into insights by quantifying signals, and so on. RNA sequencing, for example, fuelled Susan Clark’s project tracing methylation in cancer and Philippe Collas’ ambitious endeavour to understand 3D genome architecture; it is empowered by Charlotte Soneson’s “iSEE” software for interactively analysing data from high-throughput experiments, and by Kirsten Bos’ project tracing human pathogens back thousands of years by sequencing tiny dental samples. And of course, one of the biggest data projects in biology is the ENSEMBL genome browser, which was officially released as a pre-alpha version at VIZBI (check it out: 2020.ensembl.org); the very approachable Andy Yates and his team are looking for feedback!
Visualizations of high-dimensional datasets are not without problems. The technical challenges were addressed by David Sehnal, who showed computational infrastructure to visualize protein structures (MolStar). The mathematical problems of dimensionality reduction were a topic of Wolfgang Huber’s talk, and “proBatch”, a tool to visualize, and thereby find(!), batch effects, was presented in the flash talk by Jelena Čuklina (they welcome beta-testing by users!). Teaching science visualization, I often see a great need to discuss ethical and practical aspects. Critically assessing the limitations and challenges of scientific visualizations might be a topic to expand in the future, when VIZBI enters its second decade. This should be coupled with visual perception research; after all, we are no longer limited by computational power, but rather by what our eyes and brains can comprehend (see Miller 1956).
Speaking of flash talks: the conference organisers did a great job highlighting every single one (!) of the posters with one-minute talks. I tremendously enjoyed them, admittedly in part because I have a short attention span. Among the talks and art was also “Data dancing” by Alex Diaz, who showed that art and beauty can also be found in statistics, with numbers blossoming like flowers across the page. On that note: see you next year in San Francisco!
Of all the charts being ridiculed at WTFviz, many get shamed for their lack of a zero-baseline. When teaching DataViz, zero-baselines are invariably a topic of debate, even in the quietest groups. To participants, the rules for when zero is necessary to understand the data, and when it may be happily omitted, are often unclear. Therefore, let’s quickly recap.
Bar charts: always show zero
When amounts are encoded by length, as done in bar charts, the zero-baseline is critical to our intuitive understanding of the data. A bar twice as long represents a category with twice the count. The number of prestigious ERC starting grants at German host institutes roughly doubled from 2013 to 2014, correctly shown by a bar twice as long in (A).
If, however, the y-axis does not start at zero, as in (B), the increase from 2013 to 2014 is hugely over-emphasized and looks roughly 4 to 5 times as high. In example (C), the baseline starts above the first data point and misleads the audience into thinking that only Germany received ERC funding in 2018.
Non-zero baselines skew the relative difference between categories and are misleading (the same applies to axis breaks in bar charts!). Non-zero baselines are often used to save space.
In most cases, however, the chart could simply be drawn with less overall height. This option maintains the relative bar sizes faithfully. When reading bar charts, we are always interested in relative, not absolute, size differences among categories. (And I learned that Israel is part of the ERC funding consortium!)
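The space-saving alternative is easy to sketch: shrink the figure height instead of truncating the axis (the grant counts below are illustrative, not the real ERC numbers):

```python
import matplotlib
matplotlib.use("Agg")                     # render off-screen
import matplotlib.pyplot as plt

years, grants = [2013, 2014], [40, 80]    # illustrative counts, not real ERC data

fig, ax = plt.subplots(figsize=(4, 1.5))  # a shallow figure saves space...
ax.bar(years, grants)
ax.set_ylim(0, None)                      # ...while the zero-baseline keeps bar ratios honest
ax.set_xticks(years)
fig.savefig("erc_grants.png", bbox_inches="tight")
```

The doubling from 40 to 80 still reads as a doubling; only the chart's footprint shrinks.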
Line charts are happy without zero
The situation is entirely different for line charts. We use them to show trends, e.g. the increase or decrease of a category over time. The rate of change is encoded by the slope of the line relative to the horizon, and we usually evaluate the slope independently of its distance from zero. For example, seeing zero is not important for assessing that ERC successes in Germany fluctuate, while the UK and France have stable funding rates. And, no matter where the zero-baseline is: why does the UK have such a curious funding peak in 2012, what happened there!?
Sometimes showing zero is misleading
Importantly, showing the zero-baseline in line charts may itself be misleading. For example, mapping human body temperature on a scale from 0 to 100 ˚C would effectively prevent us from seeing a life-threatening increase from 39 to 40 ˚C in a patient. Similarly, showing global temperatures on a scale from 0 to 120 ˚C results in an entirely flat line; this was used by opponents of climate research to hide man-made global temperature change, and an outcry on Twitter swiftly followed.
Distributions: it depends on the data
When showing statistical summaries, the zero is again usually not necessary. We are interested in the shape of the data (normal or bimodal), its median, and outliers. How far the majority of data points lie from zero is usually not of interest, as long as all data are shown. Instead, the relative distances of individual data points from each other are key.
Good practice for non-zero baselines
When using non-zero baselines, common practice is to unlink the x- and y-axes. For educational purposes, I cut data from the right example. This is dangerous territory and may in some cases mislead the audience. In this example, I effectively hide the UK's early lead in winning ERC grants!
Sharp images need a lot of light or a long exposure time. But too much light damages cells, which are, moreover, constantly in motion. In microscopy, images are therefore often taken with short exposure times and low laser power, and come out blurred and noisy. Computer scientists at the Center for Systems Biology Dresden / Max Planck Institute of Molecular Cell Biology and Genetics developed software that avoids this problem. The artificial-intelligence-based software CARE (content-aware image restoration) can compute razor-sharp microscopy data from noisy images. This enables biologists to obtain high-resolution images and films without exposing the cells to the risk of phototoxicity.
Big Data Problems in microscopy
The researchers involved had been focusing on big data in microscopy for some time. In vivo and light-sheet microscopy quickly amass gigabytes or terabytes of data. These data need de-noising and de-convolution before analysis, a time-consuming and computationally intensive step. In recent years, computer scientists have increasingly used machine learning with neural networks to solve complex tasks. In machine learning, computers are fed data and learn to solve a specific problem. This approach is behind the recent successes of computers in chess and Go, and also powers the translation tool DeepL. It was thus not far-fetched to also use machine learning for de-convolving bioimage data.
Image de-convolution thanks to AI
The first author of the study, Martin Weigert, says: “It was already known that machine learning for 2D images actually delivers very good results. What we had to do was transfer this to biological data.” Martin tried it out just before heading home for Christmas in 2016. The results were so overwhelming that he initially assumed he had made a mistake. Martin Weigert: “I called Loïc [Royer, another author] over and we discussed how this is impossibly good, this can’t be right, there must be some mistake!”
It wasn’t a mistake. Martin Weigert was finally convinced when he saw images of fruit flies. In the noisy original image, no structures were visible to the eye. But once a machine learning network was trained, it calculated immaculate images showing detailed protein distributions in the membranes of fruit fly wings.
To really adapt machine learning to biological images, the research team had to overcome several challenges. Microscopic images are often three- or four-dimensional, biological materials have different densities, microscopes have specific point-spread functions, and cameras have specific sensitivities. But eventually, CARE was born.
Happy embryos thanks to AI
The new machine learning tool CARE not only accelerates de-convolution but also has other advantages. Besides large data sets, phototoxicity is an ever-present problem of in vivo microscopy. When embryos are imaged for days, they may die while being observed.
Akanksha Jain is interested in the early development of the Tribolium beetle and says: “Even if my beetles don’t die, I am concerned that the laser power I need for good images will cause artefacts.” For Akanksha, a postdoc in Pavel Tomancak’s laboratory, this problem is solved with CARE: “Once the network is trained, I can image at only 0.1% laser power, you literally see nothing, and then reconstruct the images with CARE. It’s mind-blowing how well that works.” So far, CARE has delivered convincing results in every tissue and model organism tested. Martin Weigert and his colleagues have already used it on mouse liver, zebrafish retina, fruit fly embryos, flatworms, Drosophila wings and Akanksha’s beetles.
How does CARE work?
To train a CARE network, you need only a few images. These images are specific to a microscope set-up, tissue and fluorescent marker, and image pairs must be acquired at low and high laser intensity. From the typical deviations between these pairs, CARE then calculates the correction values.
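The principle behind such paired training can be illustrated with a deliberately tiny toy model. CARE itself trains a convolutional neural network on many image pairs; the NumPy sketch below is not the CARE algorithm, and the synthetic images and the per-pixel linear fit are purely illustrative assumptions. It only shows the core idea: from pairs of the same field of view at low and high laser intensity, a model learns a mapping that restores the bright image from the dim one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one training pair: "high" plays the role of the
# image acquired at full laser intensity, "low" the same field of view
# at 1% laser power, dim and corrupted by noise. (Synthetic data,
# for illustration only.)
high = rng.uniform(100, 1000, size=(64, 64))
low = 0.01 * high + rng.normal(0, 0.5, size=(64, 64))

# Supervised learning step: fit high ~ a * low + b over all pixels.
# Real CARE learns a far richer, content-aware mapping with a CNN.
A = np.stack([low.ravel(), np.ones(low.size)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, high.ravel(), rcond=None)

# Apply the learned mapping to "restore" the dim image
restored = a * low + b
err = np.mean(np.abs(restored - high)) / np.mean(high)
print(f"learned gain: {a:.1f}, relative restoration error: {err:.2%}")
```

Even this crude linear fit recovers the bright image to within a few percent on average; the point of a trained network is that it can additionally exploit the typical structures in the images, which is what makes restoration from "you literally see nothing" inputs possible.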
CARE in application
In practice, the training set of images could be taken at the beginning and the end of an experiment, or even much later. And once trained, a CARE network can be re-used indefinitely for future (and in fact also for past) experiments.
The demands on the microscope are also low: CARE can be used with any microscope type. The only requirement is that users can vary the laser intensity for the training images. Akanksha Jain confirms this: “CARE does not change the type of experiments I do, but it increases the possibilities of how I can capture images: it becomes faster and more robust”.
Florian Jug also sees the use of CARE as unproblematic for biologists. In his opinion, the biggest difficulty is “that people can’t install it on a computer.” For this, the team has taken precautions: the publication is accompanied by detailed online documentation explaining step by step how to run CARE on Windows or Linux.
Careful with CARE?
An important question is always whether software can introduce artefacts into the data. After Martin Weigert got CARE up and running in just a few days, it took the team about a year to be fully convinced that it works error-free. “We had to find out: can I trust the network?” says Florian Jug. The team first showed that independently trained CARE networks deliver comparable results, similar to two people solving the same maths problem via different paths. The team also verified that the variance is not visible at the single-pixel level, demonstrating that CARE does not insert or change any data, but merely sharpens what is already present.
CARE thus allows biologists to get better images without having to invest in better microscopes. Martin Weigert is finishing his doctoral thesis, but the next innovations are already in the pipeline: a method to calculate optimized images without reference images is already being planned.