The gift of small edits and subtraction

While making the chart on fertility rates (link), I came across a problem that pops up quite often, and is  ignored by most software programs.

Here is an earlier version of the chart I later discarded:

Junkcharts_redofertilitychart_2

Compare this to the version I published in the blog post:

Junkcharts_redofertilitychart_1

Aside from adding the chart title, there is one major change. I removed the empty plots from the grid. This is a visualization trick that should be called adding by subtracting. The empty scaffolding on the first chart increases our cognitive load without yielding any benefit. The whitespace brings out the message that only countries in Asia and Africa have fertility rates above 5.0. 

This is a small edit. But small edits accumulate and deliver a big impact. Bear this in mind the next time you make a chart.

 

P.S.

(1) You'd have to use a lower-level coding language to execute this small edit. Most software programs are quite rigid when it comes to making small-multiples (facet) charts.

(2) If there is a next iteration, I'd reverse the Asia and Oceania rows.

 


Visualizing fertility rates around the globe

The following chart dropped on my Twitter feed.

Twitter_fertility_chart

It's an ambitious chart that tries to do a lot. The underlying data set contains fertility rate data from over 200 countries over 20 years.

The basic chart form is a column chart that is curled up into a ball. The column chart is given colors that map to continents. All countries are grouped into five continents. The column chart can only take a single data series, so the 2019 fertility rate is chosen.

Beyond this basic setup, the designer embellishes the chart with a trove of information. Here's a close up:

Twitter_fertilityrate_excerpt

The first number is the 2019 fertility rate, which means all the data encoded into the columns are also printed on the chart itself. Then, the flag of each country forms the next ring. Then, the name of the country. Finally, in brackets, the percent change in fertility rate between 2000 and 2019.

That is not all. Some contextual information are injected in those arrows that connect the columns to the data labels. A green arrow indicates that the fertility rate is trending lower - which is the case in most countries around the world. Once in a while, a purple arrow pops up. In the above excerpt, Seychelles gets a purple arrow because this island nation has increased the fertility rate from 2000 to 2019.

Also hiding in the background are several dashed rings. I think only the one that partially overlaps with the column chart contains any information - the other rings are inserted for an artistic reason. To decipher this dashed ring, we must look at the inset in the top left corner. We learn that the value of 2.1 children per woman is known as the replacement fertility rate. So it's also possible to assess whether each country is above or below the replacement fertility rate threshold.

Twitter_fertility_world_trend

[I'm presuming that this replacement threshold is about the births necessary to avoid a population decline. If that's the case, then comparing each country's fertility rate to a global fertility rate threshold is too simplistic because fertility is only one of several key factors driving a country's population growth. A more sophisticated model should generate country-level thresholds.]

***

Data graphics serve many functions. This chart works well as an embellished data table. It does take some time to find a specific country because the columns have been sorted by decreasing 2019 fertility rate but once we locate the column, all the other data fields are clearly laid out.

As a generator of data insights, this chart is less effective. The main insight I obtained from it is a rough ranking of continents, with African countries predominantly having higher fertility rates, followed by Asia and Oceania, then Americas, and finally, Europe which has the lowest fertility rates. If this is the key message, a standard choropleth map brings it out more directly.

***

Here is a small-multiples rendering of the fertility dataset. I chose 1999 values instead of 2000 to make a complete two-decade view.

Junkcharts_redofertilitychart_1

The columns represent a grouping of countries based on their 1999 fertility rates. The left column contains countries with the lowest number of births per woman, and the fertility rate increases left to right - both within an individual plot and in the grid.

If you're wondering, the hidden vertical axis sorts the countries by their 1999 rank. The lighter colors are 1999 values while the darker colors are 2019 values. For most countries the dots are shifting left over the 20 years. There are some exceptions. I have labeled several of these exceptions (e.g. Kazakhstan and Mongolia), and rendered them in italic.

 

 

 


Visually displaying multipliers

As I'm preparing a blog about another real-world study of Covid-19 vaccines, I came across the following chart (the chart title is mine).

React1_original

As background, this is the trend in Covid-19 cases in the U.K. in the last couple of months, courtesy of OurWorldinData.org.

Junkcharts_owid_uk_case_trend_july_august_2021

The React-1 Study sends swab kits to randomly selected people in England in order to assess the prevalence of Covid-19. Every month, there is a new round of returned swabs that are tested for Covid-19. This measurement method captures asymptomatic cases although it probably missed severe and hospitalized cases. Despite having some shortcomings, this is a far better way to measure cases than the hotch-potch assembling of variable-quality data submitted by different jurisdictions that has become the dominant source of our data.

Rounds 12 and 13 captured an inflection point in the pandemic in England. The period marked the beginning of the end of the belief that widespread vaccination will end the pandemic.

The chart I excerpted up top broke the data down by age groups. The column heights represent the estimated prevalence of Covid-19 during each round - also, described precisely in the paper as "swab positivity." Based on the study's design, one may generalize the prevalence to the population at large. About 1.5% of those aged 13-24 in England are estimated to have Covid-19 around the time of Round 13 (roughly early July).

The researchers came to the following conclusion:

We show that the third wave of infections in England was being driven primarily by the Delta variant in younger, unvaccinated people. This focus of infection offers considerable scope for interventions to reduce transmission among younger people, with knock-on benefits across the entire population... In our data, the highest prevalence of infection was among 12 to 24 year olds, raising the prospect that vaccinating more of this group by extending the UK programme to those aged 12 to 17 years could substantially reduce transmission potential in the autumn when levels of social mixing increase

***

Raise your hand if the graphics software you prefer dictates at least one default behavior you can't stand. I'm sure most hands are up in the air. No matter how much you love the software, there is always something the developer likes that you don't.

The first thing I did with today's chart is to get rid of all such default details.

Redo_react1_cleanup

For me, the bottom chart is cleaner and more inviting.

***

The researchers wanted readers to think in terms of Round 3 numbers as multiples of Round 2 numbers. In the text, they use statements such as:

weighted prevalence in round 13 was nine-fold higher in 13-17 year olds at 1.56% (1.25%, 1.95%) compared with 0.16% (0.08%, 0.31%) in round 12

It's not easy to perceive a nine-fold jump from the paired column chart, even though this chart form is better than several others. I added some subtle divisions inside each orange column in order to facilitate this task:

Redo_react1_multiples

I have recommended this before. I'm co-opting pictograms in constructing the column chart.

An alternative is to plot everything on an index scale although one would have to drop the prevalence numbers.

***

The chart requires an additional piece of context to interpret properly. I added each age group's share of the population below the chart - just to illustrate this point, not to recommend it as a best practice.

Redo_react1_multiples_popshare

The researchers concluded that their data supported vaccinating 13-17 year olds because that group experienced the highest multiplier from Round 12 to Round 13. Notice that the 13-17 year old age group represents only 6 percent of England's population, and is the least populous age group shown on the chart.

The neighboring 18-24 age group experienced a 4.5 times jump in prevalence in Round 13 so this age group is doing much better than 13-17 year olds, right? Not really.

While the same infection rate was found in both age groups during this period, the slightly older age group accounted for 50% more cases -- and that's due to the larger share of population.

A similar calculation shows that while the infection rate of people under 24 is about 3 times higher than that of those 25 and over, both age groups suffered over 175,000 infections during the Round 3 time period (the difference between groups was < 4,000).  So I don't agree that focusing on 13-17 year olds gives England the biggest bang for the buck: while they are the most likely to get infected, their cases account for only 14% of all infections. Almost half of the infections are in people 25 and over.

 


Stumped by the ATM

The neighborhood bank recently installed brand new ATMs, with tablet monitors and all that jazz. Then, I found myself staring at this screen:

Banknote_picker_us

I wanted to withdraw $100. I ordinarily love this banknote picker because I can get the $5, $10, $20 notes, instead of $50 and $100 that come out the slot when I don't specify my preference.

Something changed this time. I find myself wondering which row represents which note. For my non-U.S. readers, you may not know that all our notes are the same size and color. The screen resolution wasn't great and I had to squint really hard to see the numbers of those banknote images.

I suppose if I grew up here, I might be able to tell the note values from the figureheads. This is an example of a visualization that makes my life harder!

***
I imagine that the software developer might be a foreigner. I imagine the developer might live in Europe. In this case, the developer might have this image in his/her head:

Banknote_picker_euro

Euro banknotes are heavily differentiated - by color, by image, by height and by width. The numeric value also occupies a larger proportion of the area. This makes a lot of sense.

I like designs to be adaptable. Switching data from one country to another should not alter the design. Switching data at different time scales should not affect the design. This banknote picker UI is not adaptable across countries.

***

Once I figured out the note values, I learned another reason why I couldn't tell which row is which note. It's because one note is absent.

Banknote_us_2

Where is the $10 note? That and the twenty are probably the most frequently used. I am also surprised people want $1 notes from an ATM. But I assume the bank knows something I don't.


Metaphors, maps, and communicating data

There are some data visualization that are obviously bad. But what makes them bad?

Here is an example of such an effort:

Carbon footprint 2021-02-15_0

This visualization of carbon emissions is not successful. There is precious little that a reader can learn from this chart without expensing a lot of effort. It's relatively easy to identify the largest emitters of carbon but since the data are not expressed per-capita, the chart mainly informs us which countries have the largest populations. 

The color of the bubbles informs readers which countries belong to which parts of the world. However, it distorts the location of countries within regions, and regions relative to regions, as the primary constraint is fitting the bubbles inside the shape of a foot.

The visualization gives a very rough estimate of the relative sizes of total emissions. The circles not being perfect circles don't help. 

It's relatively easy to list the top emitters in each region but it's hard to list the top 10 emitters in the world (try!) 

The small emitters stole all of the attention as they account for most of the labels - and they engender a huge web of guiding lines - an unsightly nuisance.

The diagram clings dearly to the "carbon footprint" metaphor. Does this metaphor help readers consume the emissions data? Conversely, does it slow them down?

A more conventional design uses a cartogram, a type of map in which the positioning of countries are roughly preserved while the geographical areas are coded to the data. Here's how it looks:

Carbonatlasthumb

I can't seem to source this effort. If any reader can find the original source, please comment below.

This cartogram is a rearrangement of the footprint illustration. The map construct eliminates the need to include a color legend which just tells people which country is in which continent. The details of smaller countries are pushed to the bottom. 

In the footprint visualization, I'd even consider getting rid of the legend completely. This means trusting that readers know South Africa is part of Africa, and China is part of Asia.

Carbonfootprint_part

Imagine: what if this chart comes without a color legend? Do we really need it?

***

I'd like to try a word cloud visual for this dataset. Something that looks like this (obviously with the right data encoding):

Michaeltompsett_worldmapwords

(This map is by Michael Tompsett who sells it here.)

 


Pies, bars and self-sufficiency

Andy Cotgreave asked Twitter followers to pick between pie charts and bar charts:

Ac_pie_or_bar

The underlying data are proportions of people who say they won't get the coronavirus vaccine.

I noticed two somewhat unusual features: the use of pies to show single proportions, and the aspect ratio of the bars (taller than typical). Which version is easier to understand?

To answer this question, I like to apply a self-sufficiency test. This test is used to determine whether the readers are using the visual elements of the chart to udnerstand the data, or are they bypassing the visual elements and just reading the data labels? So, let's remove the printed data from the chart and take another look:

Junkcharts_selfsufficiency_pieorbar

For me, these charts are comparable. Each is moderately hard to read. That's because the percentages fall into a narrow range at one end of the range. For both charts, many readers are likely to be looking for the data labels.

Here's a sketch of a design that is self-sufficient.

Junkcharts_selfsufficientdesign

The data do not appear on this chart.

***

My first reaction to Andy's tweet turned out to be a misreading of the charts. I thought he was disaggregating the pie chart, like we can unstack a stacked bar chart.

Junkcharts_probabilities_proportions

Looking at the data more carefully, I realize that the "proportions" are not part to the whole. Or rather, the whole isn't "all races" or "all education levels". The whole is all respondents of a particular type.

 

 


Making graphics last over time

Yesterday, I analyzed the data visualization by the White House showing the progress of U.S. Covid-19 vaccinations. Here is the chart.

Whgov_proportiongettingvaccinated

John who tweeted this at me, saying "please get a better data viz".

I'm happy to work with them or the CDC on better dataviz. Here's an example of what I do.

Junkcharts_redo_whgov_usvaccineprogress

Obviously, I'm using made-up data here and this is a sketch. I want to design a chart that can be updated continuously, as data accumulate. That's one of the shortcomings of that bubble format they used.

In earlier months, the chart can be clipped to just the lower left corner.

Junkcharts_redo_whgov_usvaccineprogress_2


Reading an infographic about our climate crisis

Let's explore an infographic by SCMP, which draws attention to the alarming temperature recorded at Verkhoyansk in Russia on June 20, 2020. The original work was on the back page of the printed newspaper, referred to in this tweet.

This view of the globe brings out the two key pieces of evidence presented in the infographic: the rise in temperature in unexpected places, and the shrinkage of the Arctic ice.

Scmp_russianheat_1a

A notable design decision is to omit the color scale. On inspection, the scale is present - it was sewn into the graphic.

Scmp_russianheat_colorscale

I applaud this decision as it does not take the reader's eyes away from the graphic. Some information is lost as the scale isn't presented in full details but I doubt many readers need those details.

A key takeaway is that the temperature in Verkhoyansk, which is on the edge of the Arctic Circle, was the same as in New Delhi in India on that day. We can see how the red was encroaching upon the Arctic Circle.

***Scmp_russianheat_2a

Next, the rapid shrinkage of the Arctic ice is presented in two ways. First, a series of maps.

The annotations are pared to the minimum. The presentation is simple enough such that we can visually judge that the amount of ice cover has roughly halved from 1980 to 2009.

A numerical measure of the drop is provided on the side.

Then, a line chart reinforces this message.

The line chart emphasizes change over time while the series of maps reveals change over space.

Scmp_russianheat_3a

This chart suggests that the year 2020 may break the record for the smallest ice cover since 1980. The maps of Australia and India provide context to interpret the size of the Arctic ice cover.

I'd suggest reversing the pink and black colors so as to refer back to the blue and pink lines in the globe above.

***

The final chart shows the average temperature worldwide and in the Arctic, relative to a reference period (1981-2000).

Scmp_russianheat_4

This one is tough. It looks like an area chart but it should be read as a line chart. The darker line is the anomaly of Arctic average temperature while the lighter line is the anomaly of the global average temperature. The two series are synced except for a brief period around 1940. Since 2000, the temperatures have been dramatically rising above that of the reference period.

If this is a stacked area chart, then we'd interpret the two data series as summable, with the sum of the data series signifying something interesting. For example, the market shares of different web browsers sum to the total size of the market.

But the chart above should not be read as a stacked area chart because the outside envelope isn't the sum of the two anomalies. The problem is revealed if we try to articulate what the color shades mean.

Scmp_russianheat_4_inset

On the far right, it seems like the dark shade is paired with the lighter line and represents global positive anomalies while the lighter shade shows Arctic's anomalies in excess of global. This interpretation only works if the Arctic line always sits above the global line. This pattern is broken in the late 1990s.

Around 1999, the Arctic's anomaly is negative while the global anomaly is positive. Here, the global anomaly gets the lighter shade while the Arctic one is blue.

One possible fix is to encode the size of the anomaly into the color of the line. The further away from zero, the darker the red/blue color.

 

 


Book Review: Visualizing with Text by Richard Brath

Richardbarth_bookcoverThe creative process is sometimes described in terms of diverge-converge cycles. The diverge step involves experimentation and rewards suspending disbelief, while excesses are curbed and concepts refined during the converge step. Richard Brath's just-released book Visualizing with Text is an important resource that expands our appreciation for the place of text in visual displays.

Books on data visualization fall into recognizable types, of which two popular ones are the style guide, such as Edward Tufte, Dona Wong, and Alberto Cairo, and the coding manual, such as Ben Fry (processing) and Hadley Wickham (ggplot, Shiny). Brath's volume belongs to neither of those - it reads more like an encyclopedic catalog of how text can be incorporated into charts and graphs. He challenges us to blow up our imaginative space for characters, words, sentences, paragraphs and prose. It is a valuable aid for the diverge step of our creative process.

In modern data visualization, text is treated as an accessory, frequently found in titles, labels, legends, footnotes or surrounding text. Brath wants us to elevate text to the starring attraction. Starting with baby steps, such as direct labeling of lines and objects, and coordinating colors between chart elements and words, he experiments with inserting text into unlikely crannies, not shying away from ideas that even he admits may be somewhat of a dead-end.

One of the more immediately useful examples is the use of text labels that hug the lines on a line chart, similar to how roads and rivers are labeled on maps. I wish all software developers implement this function without delay.

Barth_riverlabelsonlines

A more esoteric example is to replace these lines with small-size text, as Brath makes an analogy between sentences and lines.

Barth_textinlines

I am still deciding if this is a gold mine or a minefield. It is thought-provoking nonetheless.

Finally, the book includes some flights of fancy, like this one:

Barth_french_departments

The red superscripts are numeric codes for French departments (provinces), arranged in ascending order of a given metric, and placed in proportional distance within the prose!

The converge step is left to the reader, as Brath refrains from bullhorning his opinions about chart types, which is why readers should not expect a style guide. He includes many experimental graphics, and may provide the pros and cons of a form without registering a judgement.

Because many of these ideas have yet to enter the mainstream, we'd need to implement these ideas on our own, which is why readers will not find a coding manual. As mentioned above, even the simplest and least controversial tactic of directly labeling lines is not available in Excel, let alone text that hugs or replaces lines. (This proves Brath's point that our community has done text a disservice.) Other ideas explored in later chapters require such features as italicizing numeric proportions of a word, rather than the entire word.

Recently, text has become a mainstay of Big Data. Visualizing with Text is timely, relevant and provocative. It is also clearly written, and tightly organized. Chapter 13 neatly summarizes the key concepts that have appeared along the way. There are plenty of use cases, primarily derived from research or business. After reading this book, you'll revel in the new sandbox of text, and long to free yourself from the constraints of your tool.


***

I recommend that you get the paper copy of the book. I reviewed the electronic version, and what irony! As you may have guessed, the electronic version ruins the typesetting. On every page, certain paragraphs show up in tiny font that resist all attempts to magnify, making Brath's case that legibility is an important metric for text visualization. Some of the more unusual fonts are dropped. The images are too small, even when popped up.

[P.S. Richard has a webpage where he included larger images and some code.]


Election visual 3: a strange, mash-up visualization

Continuing our review of FiveThirtyEight's election forecasting model visualization (link), I now look at their headline data visualization. (The previous posts in this series are here, and here.)

538_topchartofmaps

It's a set of 22 maps, each showing one election scenario, with one candidate winning. What chart form is this?

Small multiples may come to mind. A small-multiples chart is a grid in which every component graphic has the same form - same chart type, same color scheme, same scale, etc. The only variation from graphic to graphic is the data. The data are typically varied along a dimension of interest, for example, age groups, geographic regions, years. The following small-multiples chart, which I praised in the past (link), shows liquor consumption across the world.

image from junkcharts.typepad.com

Each component graphic changes according to the data specific to a country. When we scan across the grid, we draw conclusions about country-to-country variations. As with convention, there are as many graphics as there are countries in the dataset. Sometimes, the designer includes only countries that are directly relevant to the chart's topic.

***

What is the variable FiveThirtyEight chose to vary from map to map? It's the scenario used in the election forecasting model.

This choice is unconventional. The 22 scenarios is a subset of the 40,000 scenarios from the simulation - we are left wondering how those 22 are chosen.

Returning to our question: what chart form is this?

Perhaps you're reminded of the dot plot from the previous post. On that dot plot, the designer summarized the results of 40,000 scenarios using 100 dots. Since Biden is the winner in 75 percent of all scenarios, the dot plot shows 75 blue dots (and 25 red).

The map is the new dot. The 75 blue dots become 16 blue maps (rounded down) while the 25 red dots become 6 red maps.

Is it a pictogram of maps? If we ignore the details on the maps, and focus on the counts of colors, then yes. It's just a bit challenging because of the hole in the middle, and the atypical number of maps.

As with the dot plot, the map details are a nice touch. It connects readers with the simulation model which can feel very abstract.

Oddly, if you're someone familiar with probabilities, this presentation is quite confusing.

With 40,000 scenarios reduced to 22 maps, each map should represent 1818 scenarios. On the dot plot, each dot should represent 400 scenarios. This follows the rule for creating pictograms. Each object in a pictogram - dot, map, figurine, etc. - should encode an equal amount of the data. For the 538 visualization, is it true that each of the six red maps represents 1818 scenarios? This may be the case but not likely.

Recall the dot plot where the most extreme red dot shows a scenario in which Trump wins 376 out of 538 electoral votes (margin = 214). Each dot should represent 400 scenarios. The visualization implies that there are 400 scenarios similar to the one on display. For the grid of maps, the following red map from the top left corner should, in theory, represent 1,818 similar scenarios. Could be, but I'm not sure.

538_electoralvotemap_topleft

Mathematically, each of the depicted scenario, including the blowout win above, occurs with 1/40,000 chance in the simulation. However, one expects few scenarios that look like the extreme scenario, and ample scenarios that look like the median scenario.  

So, the right way to read the 538 chart is to ignore the map details when reading the embedded pictogram, and then look at the small multiples of detailed maps bearing in mind that extreme scenarios are unique while median scenarios have many lookalikes.

(Come to think about it, the analogous situation in the liquor consumption chart is the relative population size of different countries. When comparing country to country, we tend to forget that the data apply to large numbers of people in populous countries, and small numbers in tiny countries.)

***

There's a small improvement that can be made to the detailed maps. As I compare one map to the next, I'm trying to pick out which states that have changed to change the vote margin. Conceptually, the number of states painted red should decrease as the winning margin decreases, and the states that shift colors should be the toss-up states.

So I'd draw the solid Republican (Democratic) states with a lighter shade, forming an easily identifiable bloc on all maps, while the toss-up states are shown with a heavier shade.

Redo_junkcharts_538electoralmap_shading

Here, I just added a darker shade to the states that disappear from the first red map to the second.