Our life can be described with numbers and visualized in diagrams. However, every mean and every chart is a simplification: useful and fascinating, but also obstructive and restrictive.
Premiere in Hellerau
Yesterday we experienced a wonderful visualization of two “lives in numbers” at the Festspielhaus Hellerau. The two choreographers and dancers Katia Manjate and Anna Till meet in the Dresden premiere of “Life in numbers”. Four years ago, the artists began to compare their lives on the basis of statistics and got to know each other this way. This gave the impulse for “Life in numbers”, and it is also how the piece begins: with a comparison of life in Dresden, where Till lives, and Maputo, where Manjate lives. Over the course of the 8 parts, we first get to know key data, which are translated into dance on a Cartesian coordinate system, with movements along the invisible x- and y-axes. The dancers themselves form the data points, always in relation to each other, but changing, converging, and diverging over time.
Differences and similarities
In the middle part of the piece we turn away from the mean values and towards individual data points. The question is no longer “How long does a woman live in Germany, and how long in Mozambique?”; instead, Till and Manjate ask each other directly: “How old are you?” The confrontational, curious questions are staged like a duel and bathed in glaring light. The shadows of the bodies are projected into rectangles formed by the spotlights, thus visualizing our thinking in boxes. Together with the questioner, the audience waits in great suspense for the next answer.
In the third part the tension is released, the light becomes warm and the dance playful. The artists now begin a joyful confrontation with their differences and celebrate their similarities. This is supported by music and rhythms from both countries.
Numbers are beautiful
In the end, Till asks: can we live without numbers? “Life in numbers” answers “no”, but also shows the life within numbers. And it demonstrates that numbers are more than economic values. Numbers can also describe how often we laugh per day, how long we see the sun, and what rhythms our feet can dance.
As a scientist, I avoid 3D diagrams because they distort data. With “Life in numbers” you can experience how dance, a three- or even four-dimensional visualization, can make numbers immediately tangible and fascinating.
We keep discussing axis layouts and the problematic cases of non-zero baselines (in bar charts). Here is another example from the city of Dresden. Dresden is a really pretty place and always worth a visit. With the chart below, the city wanted to showcase that tourist numbers set a new record every year.
Truthful bar chart
Now, since this isn’t the first time we have discussed baselines, you should immediately spot that a rise from around 200,000 to 300,000 isn’t even close to the tripling that the bar lengths visually suggest. Overall, the bar lengths do not actually represent the increase at all. Rather, they seem to have been chosen to fit an imaginary linear increase. I re-plotted the bar chart with a zero-baseline. Lo and behold, the rising number of tourists is still visible, but clearly not nearly as record-worthy.
The image, however, circulated in 2017, three years after the last data point shown! Now, if you know that since the end of 2014 Dresden has been plagued by very prominent weekly demonstrations of right-wing activists, having no data after 2014 is alarming. In the local science and business community the problems are very evident: we have a clear drop in international scientists applying for and accepting jobs in the city! I therefore went to the Dresden city website to get the data for the subsequent years, and this confirmed what I suspected: tourist numbers no longer rose; instead, they even dropped!
Line or bar chart?
Time trends are usually more visible in line charts. Indeed, the drop in tourist numbers since 2014 is very apparent in a line chart, and even more so when we leave out the zero-baseline, because including it somewhat flattens the data (note: in line charts, leaving out the zero-baseline is ok and sometimes even necessary!).
Fun with Excel
And, did you know you can use a picture as the background for your chart in Excel!?!
Bar charts encode categorical values by length. By comparing bar lengths, we can visually compare the category sizes.
When a bar is truncated due to a missing zero-baseline or an interrupted y-axis, the relative size difference between the bars changes. The bars then no longer visually encode the actual category values. (Read more in a previous blog post.)
Misleading bar chart
The above is a DataViz classic. FOX NEWS used a bar chart to report an (to them: alarming) increase in Obamacare enrollments over a few days. They apparently feared the imminent bankruptcy of the USA and therefore tried to save the nation by overemphasizing the increase with truncated bars. Instead of the moderate rise of roughly 17% (6 to 7 million), their bars showed a 300% increase in length!
I quickly re-designed the chart in Excel. The increase is still clearly visible, even with a zero-baseline.
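To put a number on the distortion, here is a minimal Python sketch that compares the drawn bar lengths with and without a zero-baseline. The values 6 and 7 (million) come from the example above; the cut-off at 5.5 million is a purely hypothetical baseline chosen for illustration, not the axis of the original broadcast chart, and the function name `apparent_ratio` is my own.

```python
def apparent_ratio(smaller, larger, baseline=0.0):
    """Ratio of the drawn bar lengths when the value axis starts at `baseline`."""
    if baseline >= smaller:
        raise ValueError("baseline must lie below the smaller value")
    return (larger - baseline) / (smaller - baseline)


# Enrollment figures from the text, in millions.
before, after = 6.0, 7.0

# With a zero-baseline the bar lengths reflect the true ratio of the values.
true_ratio = apparent_ratio(before, after)

# Hypothetical truncated axis starting at 5.5 million (an assumption).
cut_ratio = apparent_ratio(before, after, baseline=5.5)

print(f"zero-baseline: second bar is {true_ratio:.2f} times as long")
print(f"baseline 5.5:  second bar is {cut_ratio:.2f} times as long")
```

The closer the baseline moves to the smaller value, the more the visual ratio explodes, although the data stay the same.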
A German version of this appears in Laborjournal 9/2019.
It took me around one year to get from data analysis to the final visualization of my postdoc results. A year in which I tried many visualization strategies and simplified them again and again. Helpful at that time were Edward Tufte’s book “The Visual Display of Quantitative Information” and my pre-science background in applied art, which taught me the effects of color and shape.
While visualizations have been used in science since antiquity, their number and complexity have now increased significantly. Nevertheless, even today budding scientists do not learn how to encode visualizations or how to decode them effectively. This is evident in the numerous examples of difficult-to-read figures and outright misleading charts. In the past years, I taught data visualization to hundreds of scientists. A few tips and tricks to increase the readability and effectiveness of data figures are summarized here.
As always, the plan comes first. Who is my audience, what do I want to say, and why: what is my message? Only when these questions have been answered succinctly can you begin to implement the visualization. What otherwise happens are charts with 50 categories, 12 colors, or an unsuitable diagram type altogether.
Testing a visualization is a must. Does the visualization work on paper? Is the electronic version readable? We should always look for external, objective reviewers. Often these are not the laboratory colleagues, who already know our results and our awkward visuals. Ideally, you should find several guinea pigs to get a feeling for how representative the answers are.
Choose diagram type
A diagram should represent data truthfully and functionally. It is therefore important to select the diagram type suitable for the data type. A line chart shows a temporal progression, a column chart compares quantities across different categories, and if the category names are long, you can turn it into a horizontal bar chart (Figure 1). Pie charts effectively show percentages for up to 5 rather differently sized categories. They quickly become a disaster if used for more categories, in 3D, or if two pie charts are to be compared precisely.
To avoid being misleading, some basic visualization rules should be followed. Omitting the zero line distorts relative size differences in column and bar charts (Figure 1C). In line charts, on the other hand, the zero line can be omitted because only trends are read; in fact, a zero-baseline in a line chart can at times present the information in a very misleading way. Particular caution is advised with charts for statistical distributions: the box plot is suitable for normally distributed data; in histograms, the number of bins must match the number of data points. And important for readability: the independent variable, e.g. time, is shown on the x-axis, the dependent variable on the y-axis (not as in Fig. 1D!).
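For the histogram advice, two widely used rules of thumb tie the number of bins to the number of data points. The sketch below shows Sturges' rule and the square-root rule; these are my choice for illustration and should be read as starting points, since the best bin count also depends on the shape of the distribution.

```python
import math


def sturges_bins(n_points):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n data points."""
    if n_points < 1:
        raise ValueError("need at least one data point")
    return math.ceil(math.log2(n_points)) + 1


def sqrt_bins(n_points):
    """Square-root rule: ceil(sqrt(n)) bins for n data points."""
    if n_points < 1:
        raise ValueError("need at least one data point")
    return math.ceil(math.sqrt(n_points))


for n in (20, 100, 1000):
    print(f"{n:>5} points: Sturges {sturges_bins(n):>2} bins, "
          f"square root {sqrt_bins(n):>2} bins")
```

Both rules grow slowly with the sample size: a handful of measurements justifies only a few bins, while thousands of measurements can carry a much finer histogram.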
Without text, each diagram is just an abstract piece of art. To be self-explanatory, images need a title, axis labels, and a legend. Nowadays, the title often simply sits above the diagram itself; otherwise it is found in the caption. Legends should be placed where they are needed, i.e. preferably not in the caption but close to the data (see: Layout).
The following applies to all text elements: abbreviations should be avoided, with the exception of very, very common ones. Again, find colleagues to try out your favorite abbreviation. You’d be surprised how often an abbreviation turns out to be just lab jargon or really specific to your field of research. But: your papers should not be! (Googling it is also possible: if it is not found among the top entries, you probably should not use that abbreviation.)
It is also important to avoid redundancies (especially in the axis labels!!), filler texts and passive language.
In a fraction of a second we can capture visual information, understand content, color and organization. The better this information is structured, the faster we understand it.
Most importantly, we usually read images like text, from top left to bottom right. That’s why posters have titles at the top and references at the bottom. Just like on posters, however, we can also increase the reading speed of individual charts: a meaningful title can be placed above a diagram (left alignment is usually good!) and color codes should be explained right there. Scale information, data source, or sample size, on the other hand, can happily be placed at the bottom right.
Because we read visual information similarly to text, composite images should be organized in either rows or columns. Once such a reading direction is set, it should not be changed (best: keep the reading direction for all figures of the manuscript). To maintain an overall organized look, you are best advised to respect the (imaginary) borders of the columns and rows; in layout design this is known as the underlying grid.
A wonderful resource for making visual information effective is the set of “Gestalt principles” described in the 1920s. They describe how we understand visual information. Helpful are: a simple overall form, for example ordering the bars by size; symmetry in the layout; proximity of objects belonging together; and a similar appearance of objects belonging together: the same color code across the manuscript, easily recognizable symbols for the same category, e.g. in scatterplots, etc.
Henry Dreyfuss described color as the exclamation mark of a design. He meant that color invariably provokes a reaction, attracts attention, and, like the punctuation mark, should be used last and strategically. In diagrams, color is used to create groupings (see Gestalt principles above), to convey quantities (e.g. red stands for high, blue for low temperatures), or to highlight part of the data (e.g. a red line among many gray lines, Fig. 2 before/after).
To make sure I do not overuse color, I start with greyscales. Sometimes, a black bar among grey bars is already enough to focus the reader. If color is to be used, it has to match the data. Quantitative data that have a common scale (e.g. animal age in days) should be shown with a single hue in different intensities. Divergent data (temperature, gene expression above/below zero) are shown with two hues merging into each other. Only categorical data can be coded with different hues.
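The difference between a single-hue and a two-hue scale can be sketched with Python's built-in `colorsys` module. This is a rough illustration under my own assumptions (the hue values, the lightness range, and the light grey midpoint are arbitrary picks), not a replacement for a carefully designed, perceptually tested palette.

```python
import colorsys


def to_hex(red, green, blue):
    """Convert RGB floats in the range 0..1 to a hex color string."""
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


def sequential(hue, steps, light=0.85, dark=0.30, saturation=0.6):
    """One hue in `steps` intensities, from light (low) to dark (high)."""
    if steps < 2:
        raise ValueError("need at least two steps")
    levels = [light + (dark - light) * i / (steps - 1) for i in range(steps)]
    return [to_hex(*colorsys.hls_to_rgb(hue, level, saturation))
            for level in levels]


def diverging(low_hue, high_hue, steps_per_side):
    """Two hues that fade towards a shared neutral midpoint."""
    low_side = sequential(low_hue, steps_per_side)[::-1]   # dark to light
    high_side = sequential(high_hue, steps_per_side)       # light to dark
    return low_side + ["#f0f0f0"] + high_side


# Hue 0.6 is a blue, hue 0.0 is a red (assumed example values).
print("sequential:", sequential(0.6, 5))
print("diverging: ", diverging(0.6, 0.0, 3))
```

For categorical data one would instead pick clearly different hues at similar lightness, so that no category looks more important than the others.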
Don’t forget: each color must be explained in the illustration or legend (see labeling) and should be used consistently within a figure, poster, or manuscript (color code).
In any case, colors should also be accessible to colorblind people. So please do not mix red and green tones in one image.
Once a first version of a figure is done, it should ideally be improved iteratively until the core statement stands out at first glance (1-second test!) and the figure is self-explanatory. As always, helpful colleagues are gold!
Effective and clear figures are helpful because they transport your message faster, but they also make your science accessible to a wider audience (which means more citations!) and to future readers (reproducibility!). And, if you are extremely successful, some figures even become icons of an entire field, like Darwin’s phylogenetic diagram, which is re-used again and again.
Understanding abstract visualizations is an almost inherent human trait. Without much training, kindergarten kids get that a longer item means more – bigger kids are older, longer lollipops are more etc.
Perception of color is immediate, but individually very distinct: around 10% of the male population is red-green color blind, and how we see colors depends very much on our individual photoreceptor composition. During a logo development, I was once very pleased with my beautiful blue-to-green gradient design. Alas, it turned out that the colleague in need of this logo was unable to even see the gradient! He literally thought I was making fun of him by showing him just a blue dot.
Color and shape
Colors alone usually have no meaning. But they can strongly influence the meaning of a shape. And this meaning is a learned cultural standard: a red octagon symbolizes “STOP”, a red human shape indicates “FEMALE”, a red heart means “LOVE”, and a red cable marks the “PLUS” pole in electrical wiring.
Changing the color of a symbol can change the meaning of a symbol: While a red octagon means STOP, a green octagon at airports inverts this and means “GO AHEAD”.
XKCD color survey: or what is the guys’ name for lavender?
Despite our individual color perception, an Australian research project has identified the world’s ugliest color in a quest to find a repelling cigarette packaging. The unanimous lead spot was won by Pantone 448 C, a murky mix of muddy green and brown! It proved so successful in discouraging smokers that it has since been rolled out in other countries too.
You are free to use it, but beware that it might signal something BAD! 🙂
Tables are one of the most successful visualizations in the history of science. They existed long before charts were invented. Miescher reported the discovery of nucleic acids in a small table with just three rows. Christiane “Janni” Nüsslein-Volhard and Eric Wieschaus compiled their observations that gene expression controls embryogenesis in a table. And the seminal discovery that cells are enclosed by a lipid bilayer warranted a full-page table:
Gorter, 1925: Table reporting cells have lipid bilayers.
Different types of tables published in a recent issue of Nature.
Tables are still common in scientific papers and presentations today. In a recent issue of Nature, 70% of life science manuscripts incorporated some form of table. Simple tables are often relegated to the supplemental section. Tables in the main section of a manuscript may also come in a fancier format, such as heatmaps, HiC plots, or databases. This common use of tables explains the high interest in the subject among participants of my DataViz classes.
Data suitable for tables
When seeing a figure, we focus on its most prominent features: thick lines, tallest bars, patterns in a scatter plot. If the figure is done well, this provides us with instant insight into the main message. Tables display text and numbers in an organized form. When seeing a table, we approach it like a text: we read from top left to bottom right. And this is the intention of a table: it presents complete datasets without a punchy key message. Tables force the reader to come up with a conclusion and thus are more work. And tables quite simply require more space than any summary chart.
Despite these disadvantages, tables are very useful for reporting precise numbers, and for precisely comparing numbers across rows and columns. Tables are also great for presenting large datasets in which every member of our audience is interested in a different aspect of the data. Each reader of the table on ERC starting grants will likely look up their country of residence first. If, however, I wanted to show which country gets the most ERC grants, a bar chart works better (for categorical data, not statistical summaries). One can immediately spot the longest bar, and read the chart even faster when the data is sorted.
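As a small illustration of why sorting helps, the following sketch prints a horizontal bar chart as plain text, with bars drawn from zero and ordered from largest to smallest. The country names and counts are invented placeholders, not real ERC figures, and the function assumes all counts are positive.

```python
def text_bar_chart(counts, width=40):
    """Return the lines of a text bar chart, sorted from largest to smallest.

    `counts` maps a category name to a positive number.
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    largest = ordered[0][1]
    label_width = max(len(name) for name in counts)
    lines = []
    for name, value in ordered:
        # Bar length is proportional to the value, measured from zero.
        bar = "#" * round(width * value / largest)
        lines.append(f"{name:<{label_width}} {bar} {value}")
    return lines


# Hypothetical example data.
grants = {"Country A": 12, "Country B": 47, "Country C": 30, "Country D": 5}
print("\n".join(text_bar_chart(grants)))
```

In the sorted output the longest bar is always the first line, so the answer to "who gets most?" needs no searching.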
Designing a clear table
Because tables are organized text, alignment and typography are critical for their legibility, and good legibility allows faster reading.
Organize rows and columns.
Which data is used as the key for presenting the result? This goes to the first column of every row. In our case these are the country names. All observations for this country, ERC grants, population, funding rate, get a column each.
Alignment is your friend.
Text is best left aligned, where we start reading text. Numbers are right aligned to make comparisons by digits along a column easier. The column headers are aligned with the content. That means the header for a text column is left aligned, the header for a number column right aligned.
Choosing a legible font is always great, but really critical for numbers. To be legible, all numbers should have the same height (“lining” figures) and the same width (“tabular”/monospaced figures). The same width is important to compare digits within columns; the same height (so no ascenders and descenders in numbers) looks less cluttered overall.
It is not possible to make an easy recommendation for a specific font, since fonts are modified depending on the operating system and program. For example, Arial has proportional characters, which would not work for numbers in a table. Microsoft Word/Mac has adapted the numbers to be tabular, while Adobe Illustrator/Mac has not.
To best compare numbers in a column, they should have the same number of decimal places. This helps both alignment and understanding. In general, it is always worth considering whether the decimals are even important: the number of grants does not warrant a decimal because it can only be a whole number. The same goes for humans, but humans in millions might reasonably be presented with one decimal.
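The alignment and decimal rules can be tried out in a plain-text table. The sketch below left-aligns the text column, right-aligns the number columns together with their headers, and gives each number column a fixed number of decimals. The rows are invented placeholder data, and `format_table` is a helper written for this example only.

```python
def format_table(headers, rows, decimals):
    """Format rows of (text, number, number, ...) as aligned text lines.

    `decimals` lists the number of decimal places per number column.
    """
    # Render every number with the fixed decimals chosen for its column.
    cells = [
        [str(row[0])] + [f"{value:.{places}f}"
                         for value, places in zip(row[1:], decimals)]
        for row in rows
    ]
    all_lines = [list(headers)] + cells
    widths = [max(len(line[col]) for line in all_lines)
              for col in range(len(headers))]

    def render(line):
        text = line[0].ljust(widths[0])                  # text: left aligned
        numbers = [cell.rjust(width)                     # numbers: right aligned
                   for cell, width in zip(line[1:], widths[1:])]
        return "  ".join([text] + numbers)

    return [render(line) for line in all_lines]


# Hypothetical example data: grants as whole numbers, population in millions.
rows = [("Country A", 47, 83.2), ("Country B", 5, 8.9), ("Country C", 112, 10.5)]
for line in format_table(["Country", "Grants", "Population (M)"], rows, [0, 1]):
    print(line)
```

In a monospaced output the decimal points then line up on their own, which is exactly what tabular figures achieve in a typeset table.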
With the above rules, you usually end up with an organized and legible table. But the good news is, you can do more! Removing the left and right borders is fantastic for creating extra space and using the entire width of a table cell for the content (sometimes rather important for grants!).
You can do still more. The minimalist Edward Tufte even says ‘every pixel should have a meaning’. In this spirit, removing all gridlines often works just as well. Often the well-aligned content is sufficient to guide the eye through columns and rows.
Very long tables, however, are hard to read without any guides. Two options exist. Either the content is grouped into blocks of 5 or 10 rows each, which are then visually separated from each other by white space, or the table includes a horizontal gridline every 5 or 10 rows.
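The first option, grouping rows into blocks separated by white space, is easy to sketch for a text table. The helper below is a hypothetical example that works on already formatted lines; replacing the empty string with a line of dashes would give the gridline variant.

```python
def group_rows(lines, block_size=5):
    """Insert an empty line between blocks of `block_size` rows."""
    grouped = []
    for index, line in enumerate(lines):
        # Start a new block before every block_size-th row, except the first.
        if index > 0 and index % block_size == 0:
            grouped.append("")
        grouped.append(line)
    return grouped


# Twelve placeholder rows, grouped into blocks of five.
table_rows = [f"row {number:>2}" for number in range(1, 13)]
print("\n".join(group_rows(table_rows)))
```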
Last, and only last, think also about color (black and grey are colors too!). I personally like highlighting table headers by giving them a fill color. Usually a light grey works fine, but this very much depends on the overall presentation and the purpose. Shown with a projector, some light greys no longer work. Also think of your color code (and never mix red and green). I consistently used pink, so using the same color for the header may give the table a coordinated look. If your fill color is dark, you have to think about white labels. And they in turn require larger fonts to achieve equal legibility.
Heatmap. A heatmap is a matrix/table in which the cells are shaded according to a color scale representing an observed value. Heatmaps are particularly used to represent many-to-many comparisons. They can display very large matrices at a very small scale and allow us to rapidly compare numbers and even see coherent patterns in the data. Heatmaps may also be combined with clustering algorithms (or simple sorting by value), which facilitates seeing patterns in the data. Heatmaps are not useful for reading off precise numbers.
Microarrays. A heatmap that informs about gene expression levels across samples. Gene expression is shown relative to a reference state. Up-regulation is shown in green, down-regulation in red (microarrays are thus not color-blind safe).
HiC plot. HiC plots are heatmaps in which each pixel represents counts of DNA interactions between two genomic regions. The pixel intensity indicates the number of reads (one color scale) or the divergence of reads from a control (dual color scale). The axes each show the genomic regions that are compared, usually binned to e.g. 1 Mb.
Database: online formats of tables to present a large dataset.
Table-chart hybrid: A table with several observations in columns, in which one of the observations is presented as a small chart (dot plot, box plot, bar plot) adjacent to the respective row. The chart column usually highlights a particularly important observation.
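The heatmap and HiC plot entries above can be sketched in a few lines: interaction positions are binned to 1 Mb, counted in a symmetric matrix, and each cell is then shaded on a single scale. Everything here is an illustrative assumption: the read pairs are invented, the five text shades stand in for a color scale, and real HiC pipelines add normalization steps that are left out.

```python
SHADES = " ░▒▓█"  # five steps, from low to high counts


def contact_matrix(read_pairs, region_length, bin_size=1_000_000):
    """Count interactions between bins of `bin_size` base pairs.

    Positions must be smaller than `region_length`.
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for position_a, position_b in read_pairs:
        row, col = position_a // bin_size, position_b // bin_size
        matrix[row][col] += 1
        if row != col:
            matrix[col][row] += 1  # keep the matrix symmetric
    return matrix


def shade(value, low, high):
    """Map a value onto one of the shades (single, sequential scale)."""
    if high == low:
        return SHADES[0]
    position = (value - low) / (high - low)
    return SHADES[min(int(position * len(SHADES)), len(SHADES) - 1)]


def text_heatmap(matrix):
    """Render a matrix as text, two characters per cell."""
    values = [value for row in matrix for value in row]
    low, high = min(values), max(values)
    return ["".join(shade(value, low, high) * 2 for value in row)
            for row in matrix]


# Hypothetical read pairs within a 3 Mb region.
pairs = [(150_000, 2_400_000), (900_000, 2_100_000), (1_200_000, 1_700_000)]
for line in text_heatmap(contact_matrix(pairs, 3_000_000)):
    print(line)
```

The shading shows at a glance which bins interact most, but, as noted above, it is useless for reading off the exact counts.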