Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)

Wsj_roundup_img1


The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.

***

In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.

Wsj_roundup_img2

Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).

Redo_roundupwsj0

The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design prerogative and does not encode data.

I’m going to stick with the more traditional overlapping bubbles here, as I want to get to a different matter.

***

The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters: the more acres, the more complaints one should expect.

Let's consider computing directly the number of complaints per million acres.
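
As a sketch of that computation (the numbers below are made up, since I'm not reproducing the underlying dataset), the relative metric is simply the complaint count divided by the acreage:

```python
# A minimal sketch (hypothetical numbers) of the relative metric:
# complaints per million acres of farmland, by state.
states = {
    # state: (Dicamba complaints, farmland in million acres) -- made-up values
    "Arkansas": (200, 14),
    "Illinois": (700, 27),
    "Iowa": (300, 30),
}

for state, (complaints, acres) in states.items():
    rate = complaints / acres
    print(f"{state}: {rate:.1f} complaints per million acres")
```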

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.

Redo_dicambacomplaints1

Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.
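
To see why, here is a minimal sketch (with made-up numbers) of how area-proportional bubbles are sized: the radius grows only with the square root of the value, so proportional growth in the data is muted to the eye.

```python
# Minimal sketch: sizing bubbles so that area (not radius) is proportional
# to the data value. Radii grow only with the square root of the values,
# which is why proportional growth in area is hard to judge visually.
import math

acres = [5, 10, 20, 40]                  # million acres of farmland (made up)
complaints = [a * 30 for a in acres]     # complaints grow proportionally with acreage

radii = [math.sqrt(c / math.pi) for c in complaints]   # area = pi * r^2, so r = sqrt(area/pi)
for a, c, r in zip(acres, complaints, radii):
    print(f"{a:>2} million acres, {c:>4} complaints, radius {r:.1f}")
# Doubling the complaints multiplies the radius by only sqrt(2), about 1.41.
```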

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.

Redo_dicambacomplaints2

Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.


All these charts lament the high prices charged by U.S. hospitals

Nyt_medicalprocedureprices

A former student asked me about this chart from the New York Times that highlights much higher prices of hospital procedures in the U.S. relative to a comparison group of seven countries.

The dot plot is clearly thought through. It is not a default chart that pops out of software.

Based on its design, we surmise that the designer has the following intentions:

  1. The names of the medical procedures are printed to be read, thus the long text is placed horizontally.

  2. The actual price is not as important as the relative price, expressed as an index with the U.S. price at 100%. These reference values are printed in glaring red, unignorable.

  3. Notwithstanding the above point, the actual prices are still shown, albeit of secondary importance: the values are provided as a supplement to the row labels. Getting to the actual prices in the comparison countries requires further effort, and a calculator.

  4. The primary comparison is between the U.S. and the rest of the world (or the group of seven countries included). It is less important to distinguish specific countries in the comparison group, and thus the non-U.S. dots are given pastels that take some effort to differentiate.

  5. Probably due to reader feedback, the font size is subject to a minimum so that some labels are split into two lines to prevent the text from dominating the plotting region.

***

In the Trifecta Checkup view of the world, there is no single best design. The best design depends on the intended message and what’s in the available data.

To illustrate this, I will present a few variants of the above design, and discuss how these alternative designs reflect the designer's intentions.

Note that in all my charts, I expressed the relative price in terms of discounts, which is the mirror image of premiums. Instead of saying Country A's price is 80% of the U.S. price, I prefer to say Country A's price is a 20% saving (or discount) off the U.S. price.

First up is the following chart that emphasizes countries instead of hospital procedures:

Redo_medicalprice_hor_dot

This chart encourages readers to draw conclusions such as "Hospital prices are 60-80 percent cheaper in Holland relative to the U.S." But it is more taxing to compare the cost of a specific procedure across countries.

The indexing strategy already creates a barrier to understanding relative costs of a specific procedure. For example, the value for angioplasty in Australia is about 55% and in Switzerland, about 75%. The difference 75%-55% is meaningless because both numbers are savings relative to the U.S. baseline. Comparing Australia and Switzerland requires a ratio of the actual prices: Australia pays 45% of the U.S. price and Switzerland pays 25%, and 0.45/0.25 = 1.8, so Australia's prices are 80% above Swiss prices; alternatively, Swiss prices are about a 44% discount off Australia's prices.
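
Here is that arithmetic worked out step by step (the 55% and 75% values are approximate readings off my chart):

```python
# Worked example: converting indexed discounts back to a cross-country
# price ratio (values are approximate readings off the chart).
us_price = 1.0                       # U.S. price indexed at 100%
discount_australia = 0.55            # Australia pays 55% less than the U.S.
discount_switzerland = 0.75          # Switzerland pays 75% less than the U.S.

price_australia = us_price * (1 - discount_australia)       # 0.45
price_switzerland = us_price * (1 - discount_switzerland)   # 0.25

ratio = price_australia / price_switzerland                 # 1.8
print(f"Australia's price is {ratio - 1:.0%} above Switzerland's")    # 80%
print(f"Switzerland's price is {1 - 1/ratio:.0%} below Australia's")  # 44%
```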

The following design takes it even further, excluding details of individual procedures:

Redo_medicalprice_hor_bar

For some readers, less is more. It’s even easier to get a rough estimate of how much cheaper prices are in the comparison countries. Note that, except for two “outliers”, the chart does not display individual values.

The widths of these bars reveal that in some countries, the amount of savings depends on the specific procedure.

The bar design releases the designer from a horizontal orientation. The country labels are shorter and can be placed at the bottom in a vertical design:

Redo_medicalprice_vert_bar

It's not that one design is obviously superior to the others. Each version does some things better. A good designer recognizes the strengths and weaknesses of each design, and selects one to fulfil his/her intentions.

 

P.S. [1/3/20] Corrected a computation, explained in Ken's comment.


Conceptualizing a chart using Trifecta: a practical example

In response to the reader who left a comment asking for ideas for improving the "marginal abatements chart" that was discussed here, I thought it might be helpful to lay out the process I go through when conceptualizing a chart. (Just a reminder, here is the chart we're dealing with.)

Ar_submit_Fig-3-2-The-policy-cost-curve-525

First, I'm very concerned about the long program names. I see their proper placement in a horizontal orientation as a hard constraint on the design. I'd reject any design that displays the text vertically or at an angle, hides it behind a hover effect, or abbreviates or abridges the text.

Second, I strongly suggest re-thinking the "cost-effectiveness" metric on the vertical axis. Flipping the sign of this metric turns it into a return-on-investment-type metric, which is much more intuitive. Just to reiterate a prior point, it feels odd to be selecting more negative projects before more positive projects.

Third, I'd like to decide what metrics to place on the two axes. There are three main possibilities: a) benefits (that is, the average annual emissions abatement shown on the horizontal axis currently), b) costs, and c) some function that ties together costs and benefits (currently, this design uses cost per unit benefit, and calls it cost effectiveness, but there are a variety of similar metrics that can be defined).

For each of these metrics, there is a secondary choice. I can use the by-project value or the cumulative value. The cumulative value is dependent on a selection order, in this case, determined by the criterion of selecting from the most cost-effective program to the least (regardless of project size or any other criteria).
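
Here is a minimal sketch (hypothetical programs and numbers) of the difference between by-project and cumulative values, and of how the cumulative value depends on the sort order:

```python
# A minimal sketch (made-up programs) of the by-project vs. cumulative choice.
# The cumulative value depends on the ordering, here from most to least
# cost-effective (most negative cost per unit of abatement first).
programs = [
    # (name, cost per million metric tons abated, annual abatement in million metric tons)
    ("Program C", 15, 60),
    ("Program A", -20, 90),
    ("Program B", 5, 40),
]

programs.sort(key=lambda p: p[1])      # most cost-effective (lowest cost) first
cumulative = 0
for name, cost_per_unit, abatement in programs:
    cumulative += abatement            # the cumulative value depends on this sort order
    print(f"{name}: by-project = {abatement}, cumulative = {cumulative}")
```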

This is where I'd bring in the Trifecta Checkup framework (see here for a guide).

Trifectacheckup_junkcharts_image
The decision of which metrics to use on the axes means I'm operating in the "D" corner. But this decision must be made with respect to the "Q" corner, thus the green arrow between the two. Which two metrics are the most relevant depends on what we want the chart to accomplish. That in turn depends on the audience and what specific question we are addressing for them.

Fourth, if the purpose of the chart is exploratory - that is to say, we use it to guide decision-makers in choosing a subset of programs, then I would want to introduce an element of interactivity. Imagine an interface that allows the user to move programs in and out of the chart, while the chart updates itself to compute the total costs and total benefits.

This last point ties together the entire Trifecta Checkup framework (link). The Question being exploratory in nature suggests a certain way of organizing and analyzing the Data as well as a Visual form that facilitates interacting with the information.

 

 


How to read this cost-benefit chart, and why it is so confusing

Long-time reader Antonio R. found today's chart hard to follow, and he isn't alone. It took the two of us multiple emails and some Web searching before we felt we "got it".

Ar_submit_Fig-3-2-The-policy-cost-curve-525

 

Antonio first encountered the chart in a book review (link) of Hal Harvey et al., Designing Climate Solutions. It addresses the general topic of costs and benefits of various programs to abate CO2 emissions. The reviewer praised the "wealth of graphics [in the book] which present complex information in visually effective formats." He presented the above chart as evidence, and described its function as:

policy-makers can focus on the areas which make the most difference in emissions, while also being mindful of the cost issues that can be so important in getting political buy-in.

(This description is much more informative than the original chart title, which states "The policy cost curve shows the cost-effectiveness and emission reduction potential of different policies.")

Spend a little time with the chart now before you read the discussion below.

Warning: this is a long read but well worth it.

 

***

 

If your experience is anything like ours, scraps of information flew at you from different parts of the chart, and you had a hard time piecing together a story.

What are the reasons why this data graphic is so confusing?

Everyone recognizes that this is a column chart. For a column chart, we interpret the heights of the columns so we look first at the vertical axis. The axis title informs us that the height represents "cost effectiveness" measured in dollars per million metric tons of CO2. In a cost-benefit sense, that appears to mean the cost to society of obtaining the benefit of reducing CO2 by a given amount.

That's how far I went before hitting the first roadblock.

For environmental policies, opponents frequently object to the high price of implementation. For example, we can't have higher fuel efficiency in cars because it would raise the price of gasoline too much. Asking about cost-effectiveness makes sense: a cost-benefit trade-off analysis encapsulates the something-for-something principle. What doesn't follow is that the vertical scale sinks far into the negative. The chart depicts the majority of the emissions abatement programs as having negative cost effectiveness.

What does it mean to be negatively cost-effective? Does it mean society saves money (makes a profit) while also reducing CO2 emissions? Wouldn't those policies - more than half of the programs shown - be slam dunks? Who can object to programs that improve the environment at no cost?

I tabled that thought, and proceeded to the horizontal axis.

I noticed that this isn't a standard column chart, in which the width of the columns is fixed and uneventful. Here, the widths of the columns are varying.

***

In the meantime, my eyes are distracted by the constellation of text labels. Text occupies at least half of the viewing area of this column chart. These labels tell me that each column represents a program to reduce CO2 emissions.

The dominance of text labels is a feature of this design. For a conventional column chart, the labels are situated below each column. Since the width does not usually carry any data, we tend to keep the columns narrow - Tufte, ever the minimalist, has even advocated reducing columns to vertical lines. That leaves insufficient room for long labels. Have you noticed that government programs hold long titles? It's tough to capture even the outline of a program with fewer than three big words, e.g. "Renewable Portfolio Standard" (what?).

The design solution here is to let the column labels run horizontally. So the graphical element for each program is a vertical column coupled with a horizontal label that invades the territories of the next few programs. Like this:

Redo_fueleconomystandardscars

The horror of this design constraint is fully realized in the following chart, a similar design produced for the state of Oregon (lifted from the Plan Washington webpage listed as a resource below):

Figure 2 oregon greenhouse

In a re-design, horizontal labeling should be a priority.

 

***

Realizing that I've been distracted by the text labels, back to the horizontal axis I went.

This is where I encountered the next roadblock.

The axis title says "Average Annual Emissions Abatement" measured in millions metric tons. The unit matches the second part of the vertical scale, which is comforting. But how does one reconcile the widths of columns with a continuous scale? I was expecting each program to have a projected annual abatement benefit, and those would fall as dots on a line, like this:

Redo_abatement_benefit_dotplot

Instead, we have line segments sitting on a line, like this:

Redo_abatement_benefit_bars_end2end_annuallabel

Think of these bars as the bottom edges of the columns. These line segments can be better compared to each other if structured as a bar chart:

Redo_abatement_benefit_bars

Instead, the design arranges these lines end-to-end.

To unravel this mystery, we go back to the objective of the chart, as announced by the book reviewer. Here it is again:

policy-makers can focus on the areas which make the most difference in emissions, while also being mindful of the cost issues that can be so important in getting political buy-in.

The chart is primarily a decision-making tool for policy-makers who are evaluating programs. Each program has a cost and also a benefit. The cost is shown on the vertical axis and the benefit is shown on the horizontal. The decision-maker will select some subset of these programs based on the cost-benefit analysis. That subset of programs will have a projected total expected benefit (CO2 abatement) and a projected total cost.

By stacking the line segments end to end on top of the horizontal axis, the chart designer elevates the task of computing the total benefits of a subset of programs, relative to the task of learning the benefits of any individual program. Thus, the horizontal axis is better labeled "Cumulative annual emissions abatement".

 

Look at that axis again. Imagine you are required to learn the specific benefit of the program titled "Fuel Economy Standards: Cars & SUVs".

Redo_abatement_benefit_bars_end2end_cumlabel

This is impossible to do without pulling out a ruler and a calculator. What the axis labels do tell us is that if all the programs to the left of Fuel Economy Standards: Cars & SUVs were adopted, the cumulative benefits would be 285 million metric tons of CO2 per year. And if Fuel Economy Standards: Cars & SUVs were also implemented, the cumulative benefits would rise to 375 million metric tons.
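
In other words, the program's own benefit has to be recovered by subtracting adjacent cumulative values (assuming both boundary labels are even printed), along these lines:

```python
# Backing out a single program's benefit from the cumulative axis labels.
cumulative_without = 285   # million metric tons: all programs to its left adopted
cumulative_with = 375      # million metric tons: after also adopting this program

individual_benefit = cumulative_with - cumulative_without
print(individual_benefit)  # 90 million metric tons for "Fuel Economy Standards: Cars & SUVs"
```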

***

At long last, we have arrived at a reasonable interpretation of the cost-benefit chart.

Policy-makers are considering throwing their support behind specific programs aimed at abating CO2 emissions. Different organizations have come up with different ways to achieve this goal. This goal may even have specific benchmarks; the government may have committed to an international agreement, for example, to reduce emissions by some set amount by 2030. Each candidate abatement program is evaluated on both cost and benefit dimensions. Benefit is given by the amount of CO2 abated. Cost is measured as a "marginal cost," the amount of dollars required to achieve each million metric ton of abatement.

This "marginal abatement cost curve" aids the decision-making. It lines up the programs from the most cost-effective to the least cost-effective. The decision-maker is presumed to prefer a more cost-effective program than a less cost-effective program. The chart answers the following question: for any given subset of programs (so long as we select them left to right contiguously), we can read off the cumulative amount of CO2 abated.

***

There are still more limitations of the chart design.

  • We can't directly read off the cumulative cost of the selected subset of programs because the vertical axis is not cumulative. The cumulative cost turns out to be the total area of all the columns that correspond to the selected programs. (Area is height x width, which is cost per unit benefit multiplied by benefit, which leaves us with the cost.) Unfortunately, it takes rulers and calculators to compute this total area; see the sketch after this list.

  • We have presumed that policy-makers will make the go/no-go decision based on cost-effectiveness alone. This point of view has already been contradicted. Remember the mystery around negatively cost-effective programs: their existence shows that some programs are stalled even when they reduce emissions in addition to making money!

  • Since many, if not most, programs have negative cost-effectiveness (by the way they measured it), I'd flip the metric over and call it profitability (or return on investment). Doing so removes another barrier to our understanding. With the current cost-effectiveness metric, policy-makers are selecting the "negative" programs before the "positive" programs. It makes more sense to select the "positive" programs before the "negative" ones!
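
Here is the sketch referenced in the first bullet above: a minimal illustration, with made-up numbers, of how the total benefit and total cost of a selected subset would be computed:

```python
# A minimal sketch (made-up numbers) of recovering the cumulative cost of a
# selected subset of programs: each program's cost is the area of its column,
# i.e. cost per unit benefit multiplied by its benefit.
selected = [
    # (cost per million metric tons abated, annual abatement in million metric tons)
    (-30, 50),   # a "negative cost" (profitable) program
    (10, 90),
    (25, 40),
]

total_benefit = sum(benefit for _, benefit in selected)
total_cost = sum(cost * benefit for cost, benefit in selected)
print(total_benefit)   # 180 million metric tons abated per year
print(total_cost)      # -1500 + 900 + 1000 = 400 (in the chart's cost units)
```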

***

In a Trifecta Checkup (guide), I rate this chart Type V. The chart has a great purpose, and the design reveals a keen sense of the decision-making process. It's not a data dump for sure. In addition, an impressive amount of data gathering and analysis - and synthesis - went into preparing the two data series required to construct the chart. (Sure, for something so subjective and speculative, the analysis methodology will inevitably be challenged by wonks.) Those two data series are reasonable measures for the stated purpose of the chart.

The chart form, though, has various shortcomings, as shown here.  

***

In our email exchange, Antonio and I found the Plan Washington website useful. This is where we learned that this chart is called the marginal abatement cost curve.

Also, the consulting firm McKinsey is responsible for popularizing this chart form. They have published this long report that explains even more of the analysis behind constructing this chart, for those who want further details.


Marketers want millennials to know they're millennials

When I posted about the lack of a standard definition of "millennials", Dean Eckles tweeted about the arbitrary division of age into generational categories. His view is further reinforced by the following chart, courtesy of PewResearch by way of MarketingCharts.com.

PewResearch-Generational-Identification-Sept2015

Pew asked people what generation they belong to. The number of people who fail to place themselves in the right category is remarkable. One way to interpret this finding is that these are marketing categories created by the marketing profession. We learned in my other post that even people who use the term millennial do not have a consensus definition of it. Perhaps the 8 percent of "millennials" who identify as "boomers" are handing in a protest vote!

The chart is best read row by row - the use of stacked bar charts provides a clue. Forty percent of millennials identified as millennials, which leaves sixty percent identifying as some other generation (with about 5 percent indicating "other" responses). 

While this chart is not pretty, and may confuse some readers, it actually shows a healthy degree of analytical thinking. Arranging for the row-first interpretation is a good start. The designer also realizes the importance of the diagonal entries - what proportion of each generation self-identify as a member of that generation. Dotted borders are deployed to draw eyes to the diagonal.

***

The design doesn't do full justice to the analytical intelligence. Despite the use of the bar chart form, readers may be tempted to read column by column due to the color scheme. The chart doesn't have an easy column-by-column interpretation.

It's not obvious which axis has the true category and which, the self-identified category. The designer adds a hint in the sub-title to counteract this problem.

Finally, the dotted borders are no match for the differential colors. So a key message of the chart is buried.

Here is a revised chart, using a grouped bar chart format:

Redo_junkcharts_millennial_id

***

In a Trifecta checkup (link), the original chart is a Type V chart. It addresses a popular, pertinent question, and it shows mature analytical thinking but the visual design does not do full justice to the data story.

 

 


The rule governing which variable to put on which axis, served a la mode

When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:

20160402_WOC883_icecream_PISA

The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)

Redo_econ_icecreamIQ_1A

Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50-point increase in PISA score (IQ), this line says to expect ice cream consumption to rise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.

***

If the convention is respected, then the following scatter plot results:

Redo_econ_icecreamIQ_2

The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.
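
A small synthetic-data sketch (made-up stand-ins for ice cream consumption and PISA scores, using numpy) illustrates why the two regressions are not interchangeable:

```python
# A minimal sketch (synthetic data) showing that regressing y on x is not
# equivalent to regressing x on y: the fitted lines differ.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)               # stand-in for ice cream consumption
y = 0.5 * x + rng.normal(size=200)     # stand-in for PISA score, with noise

slope_y_on_x = np.polyfit(x, y, 1)[0]  # slope from fitting y ~ x
slope_x_on_y = np.polyfit(y, x, 1)[0]  # slope from fitting x ~ y

# If the two regressions were interchangeable, slope_y_on_x would equal
# 1 / slope_x_on_y. With noisy data, the two values differ substantially.
print(slope_y_on_x, 1 / slope_x_on_y)
```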

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).

***

When you make a scatter plot, you have two variables for which you want to analyze their correlation. In most cases, you are exploring a cause-effect relationship.

Higher-income households care more about politics.
Less-educated citizens are more likely to not register to vote.
Companies with more diverse workforces have better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome and the other variable is the explanatory factor. Always put the outcome on the vertical axis, and the explanation on the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention; otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!

 

[PS. 11/3/2019: The comments below contain different theories that link the two variables, including theories that treat PISA score ("IQ") as the explanatory variable and ice cream consumption as the outcome. Also, I elaborated that the rule does not dictate which variable is the outcome - the designer effectively signals to the reader which variable is regarded as the outcome by placing it in the vertical axis.]


Does this chart tell the sordid tale of TI's decline?

The Hustle has an interesting article on the demise of the TI calculator, which is popular in business circles. The article uses this bar chart:

Hustle_ti_calculator_chart

From a Trifecta Checkup perspective, this is a Type DV chart. (See this guide to the Trifecta Checkup.)

The chart addresses a nice question: is the TI graphing calculator a victim of new technologies?

The visual design is marred by the use of the calculator images. The images add nothing to our understanding and create potential for confusion. Here is a version without the images for comparison.

Redo_junkcharts_hustlet1calc

The gridlines are placed to reveal the steepness of the decline. The sales in 2019 will likely be half those of 2014.

What about the Data? This would have been straightforward if the revenues shown were sales of the TI calculator. But according to the subtitle, the data include a whole lot more than calculators - it's the "other revenues" category in the financial reports of Texas Instruments, which markets the TI.

It requires a leap of faith to believe this data. It is entirely possible that TI calculator sales increased while total "other revenues" decreased! The decline of the TI calculator could be more drastic than shown here. We simply don't have enough data to say for sure.

 

P.S. [10/3/2019] Fixed TI.

 

 


Women workers taken for a loop or four

I was drawn to the following chart in Business Insider because of the calendar metaphor. (The accompanying article is here.)

Businessinsider_payday

Sometimes, the calendar helps readers grasp concepts faster but I'm afraid the usage here slows us down.

The underlying data consist of just four numbers: the wage gaps across race and gender groups in the U.S., considered simply from an aggregate median personal income perspective. The analyst adopts the median annual salary of a white male worker as a baseline. Then, s/he imputes the number of extra days that others must work to attain the same level of income. For example, the median Asian female worker must work 64 extra days (at her daily salary level) to match the white guy's annual pay. Meanwhile, Hispanic female workers must work 324 days extra.
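
Here is a rough sketch of that imputation, using a hypothetical annual income for the comparison group (the analyst's exact day-counting convention isn't spelled out, so I simply use 365 calendar days):

```python
# A rough sketch (hypothetical incomes) of how the extra-days figure is imputed:
# how many additional days, at her own daily wage, must a worker put in to
# match the white male median annual income?
white_male_annual = 46000    # roughly the 2017 median cited later in the post
group_annual = 39100         # made-up figure chosen to yield about 64 extra days

daily_wage = group_annual / 365
extra_days = (white_male_annual - group_annual) / daily_wage
# equivalently: 365 * (white_male_annual / group_annual - 1)
print(round(extra_days))     # about 64 extra days
```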

There are a host of reasons why the calendar metaphor backfired.

Firstly, it draws attention to an uncomfortable detail of the analysis that would otherwise be papered over: weekends and public holidays are counted as workdays. The coloring of the boxes compounds this issue. (And the designer also got confused and slipped up when applying the purple color for Hispanic women.)

Secondly, the calendar focuses on Year 2 while Year 1 lurks in the background - white men have to work to get that income (roughly $46,000 in 2017 according to the Census Bureau).

Thirdly, the calendar view exposes another sore point around the underlying analysis. In reality, the white male workers are continuing to earn wages during Year 2.

The realism of the calendar clashes with the hypothetical nature of the analysis.

***

One can just use a bar chart, comparing the number of extra days needed. The calendar design can be considered a set of overlapping bars, wrapped around the shape of a calendar.

The staid bars do not bring to life the extra toil - the message is that these women have to work harder to get the same amount of pay. This led me to a different metaphor - the white men got to the destination in a straight line but the women must go around loops (extra days) before reaching the same endpoint.

Redo_businessinsider_racegenderpaygap

While the above is a rough sketch, I made sure that the total length of the lines including the loops roughly matches the total number of days the women needed to work to earn $46,000.

***

The above discussion focuses solely on the V(isual) corner of the Trifecta Checkup, but this data visualization is also interesting from the D(ata) perspective. Statisticians won't like such a simple analysis that ignores, among other things, the different mix of jobs and industries underlying these aggregate pay figures.

Now go to my other post on the sister (book) blog for a discussion of the underlying analysis.

 

 


This Wimbledon beauty will be ageless

Ft_wimbledonage


This Financial Times chart paints the picture of an emerging trend in Wimbledon men’s tennis: the average age of players has been rising, hitting 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot  reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgeable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.

 

Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.

 

Finally, a key data processing decision is disclosed in the chart header and sub-header. The chart only plots the players who reached the fourth round (the last 16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data; the chart would have appeared much more cluttered otherwise.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.

***

This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)


What is a bad chart?

In the recent issue of Madolyn Smith’s Conversations with Data newsletter hosted by DataJournalism.com, she discusses “bad charts,” featuring submissions from several dataviz bloggers, including myself.

What is a “bad chart”? Based on this collection of curated "bad charts", it is not easy to nail down “bad-ness”. The common theme is the mismatch between the message intended by the designer and the message received by the reader, a classic error of communication. How such mismatch arises depends on the specific example. I am able to divide the “bad charts” into two groups: charts that are misinterpreted, and charts that are misleading.

 

Charts that are misinterpreted

The Causes of Death entry, submitted by Alberto Cairo, is a “well-designed” chart that requires “reading the story where it is inserted and the numerous caveats.” So readers may misinterpret the chart if they do not also read the story at Our World in Data, which runs over 1,500 words, not including the appendix.

Ourworldindata_causesofdeath

The map of Canada, submitted by Highsoft, highlights in green the provinces where the majority of residents are members of the First Nations. The “bad” is that readers may incorrectly “infer that a sizable part of the Canadian population is First Nations.”

Highsoft_CanadaFirstNations

In these two examples, the graphic is considered adequate and yet the reader fails to glean the message intended by the designer.

 

Charts that are misleading

Two fellow bloggers, Cole Knaflic and Jon Schwabish, offer the advice to start bars at zero (here's my take on this rule). The “bad” is the distortion introduced when encoding the data into the visual elements.

The Color-blindness pictogram, submitted by Severino Ribecca, commits a similar faux pas. To compare the rates among men and women, the pictograms should use the same baseline.

Colourblindness_pictogram

In these examples, readers who correctly read the charts nonetheless leave with the wrong message. (We assume the designer does not intend to distort the data.) The readers misinterpret the data without misinterpreting the graphics.

 

Using the Trifecta Checkup

In the Trifecta Checkup framework, these problems are second-level problems, represented by the green arrows linking up the three corners. (Click here to learn more about using the Trifecta Checkup.)

Trifectacheckup_img

The visual design of the Causes of Death chart is not under question, and the intended message of the author is clearly articulated in the text. Our concern is that the reader must go outside the graphic to learn the full message. This suggests a problem related to the syncing between the visual design and the message (the QV edge).

By contrast, in the Color Blindness graphic, the data are not under question, nor is the use of pictograms. Our concern is how the data got turned into figurines. This suggests a problem related to the syncing between the data and the visual (the DV edge).

***

When you complain about a misleading chart, or a chart being misinterpreted, what do you really mean? Is it a visual design problem? a data problem? Or is it a syncing problem between two components?