Avoid concentric circles

A Twitter follower sent me this chart by way of Munich:

Msc_staggereddonut

The logo of the Munich Security Conference (MSC) is quite cute. It looks like an ear. Perhaps that inspired this, em, staggered donut chart.

I like to straighten curves out so the donut chart becomes a bar chart:

Redo_junkcharts_msc_germanallies_distortion

The blue and gray bars mimic the lengths of the arcs in the donut chart. The yellow bars show the relative size of the underlying data. You can see that three of the four arcs under-represent the size of the data.

Why is that so? It's due to the staggering. Inner circles have smaller circumferences than outer circles. The designer keeps the angles the same so the arc lengths have been artificially reduced.

Junkcharts_redo_munichgermanallies_donuts
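The size of the distortion follows from the fact that an arc's length is its radius times its central angle. Here is a minimal Python sketch, using a made-up share and made-up radii rather than the values from the MSC chart:

```python
import math

def arc_length(radius, share):
    """Length of an arc whose central angle is `share` of a full turn."""
    return radius * 2 * math.pi * share

# Hypothetical numbers for illustration only: the same 25% share drawn
# on an outer ring (radius 2) and on a staggered inner ring (radius 1).
share = 0.25
outer = arc_length(2.0, share)
inner = arc_length(1.0, share)

# Same angle, same data, but the inner arc is only half as long.
print(round(inner / outer, 2))
```

Halving the radius halves the arc, so a reader comparing arc lengths across rings would see the inner value as half its true size.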

***

The donut chart is just a pie chart with a hole punched in the middle. For both pie charts and donut charts, the data are encoded in the angles at the center of the circle. Under normal circumstances, pie charts can also be read by comparing sector areas, and donut charts using arc lengths, as those are proportional to the angles.

The area and arc interpretation fails when the designer alters the radii of the sections. Look at the following pair of pie charts, produced by filling the hole in the above donuts:

Junkcharts_redo_munichgermanallies_pies

The staggered pie chart distorts the data if the reader compares areas but not so if the reader compares angles at the center. The pie chart can be read both ways so long as the designer does not alter the radii.
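The same point can be checked for areas, since a sector's area is the share of a full turn times pi times the radius squared. This is a rough sketch with hypothetical radii, not a reconstruction of the charts above:

```python
import math

def sector_area(radius, share):
    """Area of a pie sector whose central angle is `share` of a full turn."""
    return math.pi * radius ** 2 * share

# Hypothetical example: two sectors with the same 25% share (same angle),
# one drawn at radius 2 and one at a shrunken radius of 1.
full = sector_area(2.0, 0.25)
shrunk = sector_area(1.0, 0.25)
ratio = shrunk / full  # the angle is unchanged, the area is not
print(round(ratio, 2))
```

Area shrinks with the square of the radius, so the area reading is distorted even more than the arc reading, while the angle at the center stays faithful to the data.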

 


Election visuals 4: the snake pit is the best election graphic ever

This is the final post in the series on the data visualizations deployed by FiveThirtyEight to explain their election forecasting model. The previous posts are here, here and here.

I'm saving the best for last.

538_snakepit

This snake-pit chart brings me great joy - I wish I had come up with it!

This chart wins by focusing on a limited set of questions, and doing so excellently. Like many election observers, we understand that the U.S. presidential election will turn on so-called "swing states," and the candidates' strength in these swing states is variable, as the name suggests. Thus, we like to know which states are in play, and within these states, which ones are most unpredictable.

This chart lines up all the states from the reddest of red up top to the bluest of blue at the bottom. Each state is ranked by the voting margin predicted by 538's election forecasting model. The swing states are found in the middle.

Since each state confers a fixed number of electoral votes, and a candidate must amass 270 to win, there is a "tipping" state. In the diagram above, it's Pennsylvania. This pivotal state is neatly foregrounded as the one crossing the line in the middle.
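The idea of a tipping state can be sketched in a few lines: line the states up by predicted margin and add up electoral votes until the total reaches 270. The function name and the five "states" below are made up for illustration, and this is a simplification that may differ from how 538 actually determines the tipping point:

```python
def tipping_state(states, votes_to_win=270):
    """Return the state whose electoral votes push the running total to the winning line.

    `states` is a list of (name, predicted_margin, electoral_votes) tuples,
    where a positive margin favors the candidate being counted.
    """
    running_total = 0
    # Line the states up from the candidate's strongest to weakest.
    for name, margin, votes in sorted(states, key=lambda s: s[1], reverse=True):
        running_total += votes
        if running_total >= votes_to_win:
            return name
    return None  # the candidate cannot reach the threshold

# Five made-up "states" whose electoral votes sum to 538.
example = [("A", 20, 100), ("B", 8, 120), ("C", 2, 60), ("D", -3, 118), ("E", -15, 140)]
print(tipping_state(example))  # C
```

The segment lengths in the chart play the role of `votes` here, and the ordering of segments plays the role of the sort.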

The lengths of the segments correspond to the number of electoral votes and so do not change with the data. What changes is the sequencing of the segments and the color shading.

This data visualization is a gem of visual story-telling. The form lends itself to a story.

***

The snake-pit chart succeeds by not doing too much. There are many items that the chart does not directly communicate.

The exact number of electoral votes by state is not explicit, nor is it easy to compare the lengths of bending segments. The color scale for conveying the predicted voting margins is crude, and it's not clear what the difference is between a deep color and a light color. It's also challenging to learn the electoral vote split; the actual winning margin is not even stated.

The reality is the average reader doesn't care. I got everything I wanted from the chart, and I ain't got the time to explore every state.

There is a hover-over effect that reveals some of the additional information:

538_snakepitchart_detail

One can keep going on. I have no idea how the 40,000 scenarios presented in the other graphics in this series have been reduced to the forecast shown in the inset. But again, those omissions did not lessen my enjoyment. The point is: let your graphics breathe.

***

I'm thinking of potential variations even though I'm fully satisfied with this effort.

I wonder if the color shading should be reversed. The light shading encodes a smaller voting margin, which indicates a tighter race. But our attention is typically drawn first to the darker shades. If the shading scheme is reversed, the color should be described as how tight the race is.

I also wonder if a third color (purple) should be introduced. Doing so would require the editors to make judgment calls on which set of states are swing states.

One strange thing about election day is the specific sequence of when TV stations (!) call the state results, which not only correlates with voting margin but also with time zones. I wonder if the time zone information can be worked into the sequencing of segments.

Let me know what you think of these ideas, or leave your own ideas, in the comments below.

***

I already praised this graphic when it first came out in 2016. (link)

A key improvement is tilting the chart, which avoids vertical state labels.

The previous post was written around election day 2016. The snake pit further cements its status as a story-telling device. As states are called, they are taken out of the picture. So it works very well as a dynamic chart on election day.

I'm nominating this snake-pit chart as the best election graphic ever. Kudos to the FiveThirtyEight team.


Election visuals 2: informative and playful

In yesterday's post, I reviewed one section of 538's visualization of its election forecasting model, focusing on the probability plot.

The visualization, technically called a pdf (probability density function), is a mainstay of statistical graphics. While every one of 40,000 scenarios shows up on this chart, it doesn't offer a direct answer to our topline question. What is Nate's call at this point in time? Elsewhere in their post, we learn that the 538 model currently gives Biden a 75% chance of winning, three times Trump's chance.

538_pdf_pair

In graphical terms, the area to the right of the 270-line is three times the size of the left area (on the bottom chart). That's not apparent in the pdf representation. Addressing this, statisticians may convert the pdf into a cdf (cumulative distribution function), which depicts the cumulative area as we sweep from the left to the right along the horizontal axis.

The cdf visualization rarely leaves the pages of a scientific journal because it's not easy for a novice to understand, not least because the relevant probability is 1 minus the cumulative probability. The cdf for the bottom chart will show 25% at the 270-line while the chance of Biden winning is 1 - 25% = 75%.
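The "1 minus" step can be sketched with an empirical cdf. The eight vote totals below are made up to stand in for the 40,000 scenarios, and the sketch ignores the possibility of a 269-269 tie:

```python
def empirical_cdf(samples, threshold):
    """Share of samples at or below the threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)

# Eight made-up electoral-vote totals for Biden, standing in for the 40,000 scenarios.
biden_votes = [210, 255, 280, 295, 310, 330, 350, 400]

p_short_of_270 = empirical_cdf(biden_votes, 269)  # cumulative probability left of the 270-line
p_win = 1 - p_short_of_270
print(p_short_of_270, p_win)  # 0.25 0.75
```

The cdf reports the losing side of the line, so the reader has to subtract from 1 to get the number they actually care about.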

The cdf presentation is also wasteful for the election scenario. No one cares about any threshold other than the 270 votes needed to win, but the standard cdf shows every possible threshold.

The second graphical concept in the 538 post (link) is an attempt to solve this problem.

538_dotplot

If you drop all the dots to an imaginary horizontal baseline, the above dotplot looks like this:

Redo_junkcharts_538electionforecast_dotplot_1

There is a recent trend toward centering dots to produce symmetry. It's actually harder to perceive the differences in heights of the band.

The secret sauce is to put down 100 dots, with a 75-25 blue-red split that conveys the 75% chance of a Biden win. Superimposing the pdf line from the other visualization, I find that the density of dots roughly mimics the probability of outcomes.

Redo_junkcharts_538electionforecast_dotplot_2

It's easier to estimate the blue vs red areas using those dots than the lines.
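One way to get 100 dots whose density follows the pdf is to sort the scenarios and keep evenly spaced quantiles. This is only a guess at the mechanics, not 538's documented method, and the simulated vote totals below are random made-up numbers:

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Made-up stand-in for the 40,000 simulated electoral-vote totals.
scenarios = sorted(
    min(538, max(0, round(random.gauss(320, 70)))) for _ in range(40_000)
)

# Keep one scenario from the middle of every block of 400: 100 evenly spaced quantiles.
dots = [scenarios[i * 400 + 200] for i in range(100)]

share_winning = sum(s >= 270 for s in scenarios) / len(scenarios)
dots_winning = sum(d >= 270 for d in dots)
print(len(dots), dots_winning, round(100 * share_winning))
```

Because each dot stands for the same number of scenarios, counting blue dots approximates the winning probability to within about a percentage point.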

The dots are stuffed toys. Clicking on each dot reveals a map showing one of the 40,000 scenarios. It displays which candidate wins which state. For example, the most extreme example of a Trump win is:

538_dotplot_redextreme

Here is a scenario of a razor-tight election won by Trump:

538_dotplot_redmiddle

This presentation has a weakness as well. It gives the impression that each of the dots is equally important because they are the same size. In reality, the importance of each dot is proportional to the height of the band. Since the band is generally taller near the middle, the dots near the middle are more likely scenarios than the dots shown on the two edges.

On balance, I like this visualization that is both informative and playful.

As before, what strikes me about the simulation result is the flatness of the probability surface. This feature is obscured when we summarize the result as 75% chance of a Biden victory.


The hidden bad assumption behind most dual-axis time-series charts

[Note: As of Monday afternoon, Typepad is having problems rendering images. Please try again later if the charts are not loading properly.]

DC sent me the following chart over Twitter. It supposedly showcases one sector that has bucked the economic collapse, and has conversely been boosted by the stay-at-home orders around the world.

Covid19-pornhubtraffic


At first glance, I was drawn to the yellow line and the axis title on the right side. I understood the line to depict the growth rate in traffic "vs a normal day". The trend is clear as day. Since March 10 or so, the website has become more popular by the week.

For a moment, I thought the thin black line was a trendline that fits the rather ragged traffic growth data. But looking at the last few data points, I was afraid it was a glove that didn't fit. That's when I realized this is a dual-axis chart. The black line shows the worldwide total Covid-19 cases, with the axis shown on the left side.

As with any dual-axis chart, you can modify the relationship between the two scales to paint a different picture.

This next chart says that the site traffic growth lagged Covid-19 growth until around March 14.

Junkcharts_ph_dualaxis1

This one gives an ambiguous picture. One can't really say there is a strong correlation between the two time series.

Junkcharts_ph_dualaxis2
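The effect of rescaling can be sketched by computing where each line is drawn as a fraction of the axis height. The data and axis limits below are made up, not the numbers behind the charts:

```python
def axis_position(value, axis_min, axis_max):
    """Vertical position of a value as a fraction of the axis height."""
    return (value - axis_min) / (axis_max - axis_min)

# Made-up data, not the numbers behind the chart.
traffic_growth = [2, 5, 10, 12]                   # percent, right axis
cases = [100_000, 150_000, 250_000, 400_000]      # count, left axis

def traffic_above_cases(traffic_axis_max, cases_axis_max):
    """For each day, is the traffic line drawn above the cases line?"""
    return [
        axis_position(t, 0, traffic_axis_max) > axis_position(c, 0, cases_axis_max)
        for t, c in zip(traffic_growth, cases)
    ]

stretched_cases_axis = traffic_above_cases(12, 1_600_000)
stretched_traffic_axis = traffic_above_cases(48, 400_000)
print(stretched_cases_axis, stretched_traffic_axis)
```

With the same data, one choice of axis limits draws traffic above cases on every day and the other draws it below on every day, so the visual "lead" or "lag" is an artifact of the scales.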

***

Now, let's look at the chart from the DATA corner of the Trifecta Checkup (link). The analyst selected definitions that are as far apart as possible. So this chart gives a good case study of the intricacy of data definitions.

First, notice the smoothness of the line of Covid-19 cases. This data series is naturally "smoothed" because it is an aggregate of country-level counts, which themselves are aggregates of regional counts.

By contrast, the line of traffic growth rates has not been smoothed. That's why we see sharp ups and downs. This series should be smoothed as well.

Junkcharts_ph_smoothedtrafficgrowth

The seven-day moving average line indicates a steady growth in traffic. The day-to-day fluctuations represent noise that distracts us from seeing the trendline.
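A seven-day moving average can be sketched as follows. This version is a trailing average (each value averages the current day and the six before it), which is an assumption; the chart above may use a centered window instead:

```python
def moving_average(values, window=7):
    """Trailing moving average; the first window-1 days have no value."""
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed

daily = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up daily values
print(moving_average(daily))
```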

Second, the Covid-19 series is a cumulative count, which means it's constantly heading upward over time (on rare days, it may go flat but never decrease). The traffic series represents change, is not cumulative, and so it can go up or down over time. To bring the data closer together, the Covid-19 series can be converted into new cases so they are change values.

Junkcharts_ph_smoothedcovidnewcases
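Converting a cumulative count into new cases is a matter of taking day-to-day differences. A minimal sketch with made-up counts:

```python
def daily_new(cumulative):
    """Day-to-day differences of a cumulative series (one element shorter)."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]

cumulative_cases = [100, 120, 150, 150, 210]  # made-up counts
print(daily_new(cumulative_cases))  # [20, 30, 0, 60]
```

The flat day in the cumulative series becomes a zero in the new-cases series, which can now move up or down like the traffic series.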

Third, the traffic series consists of growth rates expressed as percentages, while the Covid-19 series consists of counts. It is possible to turn Covid-19 counts into growth rates as well. Like this:

Junkcharts_ph_smoothedcovidcasegrowth
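A day-on-day growth rate is the change from one day to the next divided by the earlier day's value. The sketch below applies it to made-up cumulative counts; which series the rate in the chart was computed on is an assumption here:

```python
def day_on_day_growth(series):
    """Percentage change from each day to the next."""
    return [100 * (today - yesterday) / yesterday
            for yesterday, today in zip(series, series[1:])]

cumulative_cases = [100, 120, 150, 150, 210]  # made-up counts
growth = day_on_day_growth(cumulative_cases)  # in percent, like the traffic series
print(growth)
```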

By standardizing the units of measurement, both time series can be plotted on the same axis. Here is the new plot:

Redo_junkcharts_ph_trafficgrowthcasegrowth

Fourth, the two growth rates have different reference levels. The Covid-19 growth rate I computed is day-on-day growth. This is appropriate since we don't presume there is a seasonal effect - it doesn't seem plausible that, say, new cases on Mondays are typically larger than new cases on Tuesdays.

Thanks to this helpful explainer (link), I learned what the data analyst meant by a "normal day". The growth rate of traffic is not day-on-day change. It is the change in traffic relative to the average traffic in the last four weeks on the same day of week. If it's a Monday, the change in traffic is relative to the average traffic of the last four Mondays.

This type of seasonal adjustment is used if there is a strong day-of-week effect. For example, if the website reliably gets higher traffic during weekends than weekdays, then the Saturday traffic may always exceed the Friday traffic; instead of comparing Saturday to the day before, we index Saturday to the previous Saturday, Friday to the previous Friday, and then compare those two values.
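The "normal day" comparison can be sketched as below. The function name and the traffic numbers are hypothetical, and the real calculation may differ in details such as how holidays are handled:

```python
def change_vs_normal_day(traffic, i, weeks=4):
    """Percent change of day i relative to the average of the same weekday
    in the previous `weeks` weeks. Requires i >= 7 * weeks."""
    same_weekday = [traffic[i - 7 * k] for k in range(1, weeks + 1)]
    baseline = sum(same_weekday) / weeks
    return 100 * (traffic[i] - baseline) / baseline

# Made-up traffic: weekdays at 100, weekends at 150, repeated for four weeks,
# followed by a fifth week in which every day is 10% higher.
week = [100] * 5 + [150] * 2
traffic = week * 4 + [v * 11 // 10 for v in week]

saturday = 33  # first weekend day of the fifth week
vs_normal = change_vs_normal_day(traffic, saturday)
vs_yesterday = 100 * (traffic[saturday] - traffic[saturday - 1]) / traffic[saturday - 1]
print(vs_normal, vs_yesterday)
```

Against the same-weekday baseline the Saturday shows the true 10 percent lift, whereas the day-on-day comparison with Friday is dominated by the weekend effect.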

***

Let's consider the last chart above, the one where I got rid of the dual axes.

A major problem with trying to establish correlation of two time series is time lag. Most charts like this make a critical and unspoken assumption - that the effect of X on Y is immediate. This chart assumes that the higher the number of Covid-19 cases, the more people stay home that day, and the more people swarm the site that day. Said that way, you might see it's ridiculous.

What is true of any correlation in the wild is that there is always some amount of time lag. It usually is hard to know how much lag.
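One common way to probe for lag is to shift one series against the other and see which shift gives the strongest correlation. This is a bare-bones sketch on made-up data in which y simply repeats x two days later; real series are noisier and the best lag is rarely this clear:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(x, y, lag):
    """Correlate x on day t with y on day t + lag."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])

# Made-up data: y is x delayed by two days (padded with zeros at the start).
x = [3, 8, 1, 9, 4, 7, 2, 6, 5, 10, 0, 11]
y = [0, 0] + x[:-2]

by_lag = {lag: lagged_correlation(x, y, lag) for lag in range(5)}
best_lag = max(by_lag, key=by_lag.get)
print(best_lag)
```

Plotting the two series on the same dates amounts to looking only at lag 0, which in this example hides a relationship that is perfect at lag 2.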

***

Finally, the chart omitted a huge factor driving the growth in traffic. At various times depending on the country, the website rolled out a free premium service offer. This is the primary reason for the spike around mid-March. How much of the traffic growth is due to the popular marketing campaign, and how much is due to stay-at-home orders - that's the real question.


Proportions and rates: we are no dupes

Reader Lucia G. sent me this chart, from Ars Technica's FAQ about the coronavirus:

Arstechnica_covid-19-2.001-1280x960

She notices something wrong with the axis.

The designer took the advice not to make a dual axis, but didn't realize that the two metrics are not measured on the same scale even though both are expressed as percentages.

The blue bars, labeled "cases", show the distribution of cases by age group. The sum of the blue bars should be 100 percent.

The orange bars show fatality rates by age group. Each orange bar's rate is based on the number of cases in that age group. The orange bars will not add up to 100 percent.

In general, the rates will have much lower values than the proportions. At least that should be the case for viruses that are not extremely fatal.
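The difference between the two kinds of percentages comes down to the denominator. A small sketch with made-up counts (not the data in the Ars Technica chart):

```python
# Made-up counts by age group, for illustration only.
cases = {"0-39": 400, "40-69": 500, "70+": 100}
deaths = {"0-39": 1, "40-69": 10, "70+": 12}

total_cases = sum(cases.values())

# Proportion: each group's share of ALL cases. These share one denominator.
case_share = {g: 100 * n / total_cases for g, n in cases.items()}

# Rate: deaths divided by cases WITHIN the group. Each has its own denominator.
fatality_rate = {g: 100 * deaths[g] / cases[g] for g in cases}

print(sum(case_share.values()))     # proportions add up to 100
print(sum(fatality_rate.values()))  # rates add up to nothing meaningful
```

Because the rates have different denominators, putting them on the same axis as the proportions invites comparisons that have no meaning.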

This is what the 80 and over section looks like.

Screen Shot 2020-03-12 at 1.19.46 AM

It is true that fatality rate (orange) is particularly high for the elderly while this age group accounts for less than 5 percent of total cases (blue). However, the cases that are fatal, which inhabit the orange bar, must be a subset of the total cases for 80 and over, which are shown in the blue bar. Conceptually, the orange bar should be contained inside the blue bar. So, it's counter-intuitive that the blue bar is so much shorter than the orange bar.

The following chart fixes this issue. It reveals the structure of the data. Total cases are separated by age group, then within each age group, a proportion of the cases are fatal.

Junkcharts_redo_arstechnicacovid19

This chart also shows that most patients recover in every age group. (This is only approximately true as some of the cases may not have been discharged yet.)

***

This confusion of rates and proportions reminds me of something about exit polls I just wrote about the other day on the sister blog.

When the media make statements about trends in voter turnout rate in the primary elections, e.g. when they assert that youth turnout has not increased, their evidence is from exit polls, which can measure only the distribution of voters by age group. Exit polls do not and cannot measure the turnout rate, which is the proportion of registered (or eligible) voters in the specific age group who voted.

Like the coronavirus data, the scales of these two metrics are different even though they are both percentages: the turnout rate is typically a number between 30 and 70 percent, and summing the rates across all age groups will exceed 100 percent many times over. Summing the proportions of voters across all age groups should be 100 percent, and no more.

Changes in the proportion of voters aged 18-29 and changes in the turnout rate of people aged 18-29 are not the same thing. The former is affected by the turnout of all age groups while the latter is a clean metric affected only by 18- to 29-year-olds.
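A made-up example shows how the two metrics can move in opposite directions. The electorate sizes and turnout rates below are hypothetical:

```python
def youth_metrics(youth_eligible, youth_turnout, older_eligible, older_turnout):
    """Return (youth turnout rate, youth share of all voters), both in percent."""
    youth_voters = youth_eligible * youth_turnout
    older_voters = older_eligible * older_turnout
    share = 100 * youth_voters / (youth_voters + older_voters)
    return 100 * youth_turnout, share

# Made-up electorate: 100 eligible young voters, 300 eligible older voters.
rate_before, share_before = youth_metrics(100, 0.30, 300, 0.50)
rate_after, share_after = youth_metrics(100, 0.35, 300, 0.70)

print(round(rate_before), round(rate_after))          # youth turnout rate goes up
print(round(share_before, 1), round(share_after, 1))  # youth share of voters goes down
```

Youth turnout rises in this example, yet the youth share of voters falls because older voters turned out in even greater numbers. An exit poll would only see the second number.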

Basically, ignore pundits who use exit polls to comment on turnout trends. No matter how many times they repeat their nonsense, proportions and rates are not to be confused. Which means, ignore comments on turnout trends because the only data they've got come from exit polls which don't measure rates.

 

P.S. Here is some further explanation of my chart, as a response to a question from Enrico B. on Twitter.

The chart can be thought of as two distributions, one for cases (gray) and one for deaths (red). Like this:

Junkcharts_redo_arstechnicacoronavirus_2

The side-by-side version removes the direct visualization of the fatality rate within each age group. To understand fatality rate requires someone to do math in their head. Readers can qualitatively assess that people aged 80 and over accounted for 3 percent of cases but also about 21 percent of deaths. People aged 70 to 79 however accounted for 9 percent of cases but 30 percent of deaths, etc.

What I did was to scale the distribution of deaths so that they can be compared to the cases. It's like fitting the red distribution inside the gray distribution. Within each age group, the proportion of red against the length of the bar is the fatality rate.

For every 100 cases regardless of age, 3 cases are for people aged 80 and over, of which 0.5 are fatal (red).

So, the axis labels are correct. The values are proportions of total cases, although as the designer of the chart, I hope people are paying attention more to the proportion of red, as opposed to the units.

What might strike people as odd is that the biggest red bar does not appear against 80 and above. We might believe it's deadlier the older you are. That's because on an absolute scale, more people aged 70-79 died than those 80 and above. The absolute number of deaths is the product of the proportion of cases and the fatality rate. That's really a different story from the usual plot of fatality rates by age group. In those charts, we "control" for the prevalence of cases. If every age group were infected at the same frequency, then COVID-19 would kill more of the 80 and over.
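The arithmetic can be sketched using only the rounded figures quoted above. The overall deaths per 100 cases is backed out from those rounded numbers, which is an assumption of this sketch, so the results are approximate:

```python
# Rounded figures quoted in the text, in percent.
case_share = {"70-79": 9, "80+": 3}     # share of all cases
death_share = {"70-79": 30, "80+": 21}  # share of all deaths
fatal_per_100_cases_80plus = 0.5        # red portion of the 80+ bar

# Assumption: back out overall deaths per 100 cases from the rounded 80+ figures.
overall_fatal_per_100 = fatal_per_100_cases_80plus / (death_share["80+"] / 100)

fatal_per_100 = {}   # length of the red bar: deaths per 100 total cases
fatality_rate = {}   # red as a share of the whole bar, in percent
for group in case_share:
    fatal_per_100[group] = overall_fatal_per_100 * death_share[group] / 100
    fatality_rate[group] = 100 * fatal_per_100[group] / case_share[group]

print(fatal_per_100)
print(fatality_rate)
```

The 70-79 group ends up with the longer red bar even though the 80 and over group has the higher fatality rate, because three times as many cases fall in the 70-79 group.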

 

 

 


Too many colors on a chart is bad, but why?

The following chart is bad, but how so?

Junkcharts_colors_columnchart

The chart is annoying because of the misuse of colors.

What is the purpose of the multiple colors used in this chart? It's not encoding any data. Colors are used here to differentiate one bar from its two neighbors. Or perhaps to make the chart more "appealing".

The reason why the coloring scheme backfires is that readers may look for meaning in the colors. What's common between Iceland, United States and Germany for them to be assigned green? What about Japan, New Zealand, Spain and France, all of which are shown in yellow?

The readers' instinct is driven by a set of unspoken rules that govern the production of data visualization. Specifically, the rule here is: color differences reflect data differences. When such a rule is violated, the reader is misled and confused.

***

For more about this rule, other rules related to making bar charts, and still more rules for making data graphics, please read my Long Read article, here.

 


The unspoken rules of visualization

My latest is at DataJournalism.com.

Ejc_unspokenrulesbanner

It's an essay on the following observation:

The efficiency and multidimensionality of the visual medium arise from a set of conventions and rules, which regularises the communications between producers of data visualisation and its consumers. These conventions and rules are often unspoken: it's the visual equivalent of saying ‘it goes without saying’.

There are lots of little things visualization designers do in their sleep that don't get mentioned. When a visual design deviates from these rules, the readers may get confused.

Here is one example I discussed in the article (hat tip to Xan Gregg).

Fig04_piechart_diverging

This pie chart is not easy to read beyond the obvious point that English is the most popular. The following pie chart is much easier on the readers:

Fig03_piechart_conforming

Why?

The designer follows some common conventions, such as placing the first slice at the top vertical, sorting the slices from largest to smallest (excepting the "other"), and introducing multiple colors only to encode data differences.

These rules are silently applied, and are not announced to the reader. There is a network effect: the more practitioners use these rules, the stronger they stick.

My essay attempts to outline some of the most important unspoken rules of visualization. For more, see here.


Habit-busting designs don't work

The design changes that most frustrate users are those that bust their habits.

Case in point. Apple re-designed the bottom navigator of the iPhone mail app. See what it looked like before and what it looks like today:

Iphone_mail_bottom_nav

Notice how the 2nd slot from the bottom right used to be for replying, and after the re-design, it has become the button for deleting. So when I intend to reply to a message, my finger instinctively presses that 2nd button and now the message gets deleted instead!

In the last few years, my finger hit that button thousands of times whenever the brain said to reply. Now, it's really hard to change this habit. I kept having to undo the delete. It's frustrating beyond belief.

This also shows the habit is in the muscle memory, and I'm no longer paying attention to the visual icon. A more direct dataviz analogy is when you belatedly discover that the horizontal axis in a line chart isn't representing time because you didn't read the axis labels.

***

A similar thing happened inside an elevator (lift) recently.

Most elevator panels place the Door Open and Door Close buttons side by side. Typically, the Door Open is on the left and the Door Close is on the right.

This particular elevator panel has the Door Open button on top, and Door Close at the bottom, laid out vertically. To the right of the Door Open button is the Alarm button! So I sounded the Alarm when I intended the doors to close.

(I didn't take a photo at the time. The figure on the right is a rough sketch of what the panel looked like.)

Junkcharts_elevatorpaneldesign

I bet the alarm is pressed multiple times a day by mistake.


Light entertainment: people of color

What colors does the "average" person like the most and the least? The following chart found here (Scott Design) tells you favorite and least favorite colors by age groups:

Color-preferences-by-age

(This is one of a series of charts. A total of 10 colors is covered by the survey. The same color can appear in both favorites and least favorites since these are aggregate proportions. Almost 40% of the respondents are under 18 and only one percent are over 70.)

Here's one item that has stumped me thus far: how are the colors ordered within each figurine?


Does this chart tell the sordid tale of TI's decline?

The Hustle has an interesting article on the demise of the TI calculator, which is popular in business circles. The article uses this bar chart:

Hustle_ti_calculator_chart

From a Trifecta Checkup perspective, this is a Type DV chart. (See this guide to the Trifecta Checkup.)

The chart addresses a nice question: is the TI graphing calculator a victim of new technologies?

The visual design is marred by the use of the calculator images. The images add nothing to our understanding and create potential for confusion. Here is a version without the images for comparison.

Redo_junkcharts_hustlet1calc

The gridlines are placed to reveal the steepness of the decline. The sales in 2019 will likely be half those of 2014.

What about the Data? This would have been straightforward if the revenues shown are sales of the TI calculator. But according to the subtitle, the data include a whole lot more than calculators - it's the "other revenues" category in the financial reports of Texas Instruments, which markets the TI.

It requires a leap of faith to believe this data. It is entirely possible that TI calculator sales increased while total "other revenues" decreased! The decline of the TI calculator could be more drastic than shown here. We simply don't have enough data to say for sure.

 

P.S. [10/3/2019] Fixed TI.