Two metrics in-fighting

The Wall Street Journal shows the following chart which pits two metrics against each other:

Wsj_salaries25to29

The primary metric is the change in median yearly salary between the two periods of time. We presume it's primary because of its presence in the chart title, and the blue bars being more readable than the green bubbles. The secondary metric is the median yearly salary in the later period.

That, I believe, was the intended design. When I saw this chart, my eyes went to the numbers inside the green bubbles. Perhaps it's because I didn't read the chart title first, and the horizontal axis wasn't labelled so it wasn't obvious what the blue bars coded.

As with most bubble charts, the data labels exist to cover up the inadequacy of circular areas. The self-sufficiency test - removing the data labels - shows this well:

Redo_wsj_salaries25to29

It's simply impossible to know what values should be in each bubble, or to perceive the relative sizes of those bubbles.

***

Reversing the order of the blue bars also helps:

Redo_wsjsalaries25to29_2

The original order is one of the more annoying features in most visualization packages. Because internally, the categories are numbered 1, 2, 3, ..., and because the convention is to have values run higher as they run up the vertical axis, these packages would place the top-ranked item at the bottom of the chart.

Most people read top to bottom, which means that they read the least important item first, and the most important item last!

In most visualization packages, it takes only 1 click or 1 action to reverse the order of the items. Please do it!

***

For change over time, I like using a Bumps chart, otherwise called a slope graph:

Redo_wsjsalaries25to29_3


What is the question is the question

I picked up a Fortune magazine while traveling, and saw this bag of bubbles chart.

Fortune_global500 copy

This chart is visually appealing, that must be said. Each circle represents the reported revenues of a corporation that belongs to the “Global 500 Companies” list. It is labeled by the location of the company’s headquarters. The largest bubble shows Beijing, the capital of China, indicating that companies based in Beijing count $6 trillion dollars of revenues amongst them. The color of the bubbles show large geographical units; the red bubbles are cities in Greater China.

I appreciate a couple of the design decisions. The chart title and legend are placed on the top, making it easy to find one’s bearing – effective while non-intrusive. The labeling signals a layering: the first and biggest group have icons; the second biggest group has both name and value inside the bubbles; the third group has values inside the bubbles but names outside; the smallest group contains no labels.

Note the judgement call the designer made. For cities that readers might not be familiar with, a country name (typically abbreviated) is added. This is a tough call since mileage varies.

***

As I discussed before (link), the bag of bubbles does not elevate comprehension. Just try answering any of the following questions, which any of us may have, using just the bag of bubbles:

  • What proportion of the total revenues are found in Beijing?
  • What proportion of the total revenues are found in Greater China?
  • What are the top 5 cities in Greater China?
  • What are the ranks of the six regions?

If we apply the self-sufficiency test and remove all the value labels, it’s even harder to figure out what’s what.

***

_trifectacheckup_image

Moving to the D corner of the Trifecta Checkup, we aren’t sure how to interpret this dataset. It’s unclear if these companies derive most of their revenues locally, or internationally. A company headquartered in Washington D.C. may earn most of its revenues in other places. Even if Beijing-based companies serve mostly Chinese customers, only a minority of revenues would be directly drawn from Beijing. Some U.S. corporations may choose its headquarters based on tax considerations. It’s a bit misleading to assign all revenues to one city.

As we explore this further, it becomes clear that the designer must establish a target – a strong idea of what question s/he wants to address. The Fortune piece comes with a paragraph. It appears that an important story is the spatial dispersion of corporate revenues in different countries. They point out that U.S. corporate HQs are more distributed geographically than Chinese corporate HQs, which tend to be found in the key cities.

There is a disconnect between the Question and the Data used to create the visualization. There is also a disconnect between the Question and the Visual display.


One bubble is a tragedy, and a bag of bubbles is...

From Kathleen Tyson's twitter account, I came across a graphic showing the destinations of Ukraine's grain exports since 2022 under the auspices of a UN deal. This graphic, made by AFP, uses one of the chart forms that baffle me - the bag of bubbles.

Ukraine_grains_bubbles

The first trouble with a bag of bubbles is the single bubble. The human brain is just not fit for comparing bubble sizes. The self-sufficiency test is my favorite device for demonstrating this weakness. The following is the European section of the above chart, with the data labels removed.

Redo_junkcharts_afp_ukrainegrains_europe_1

How much bigger is Spain than the Netherlands? What's the difference between Italy and the Netherlands? The answers don't come easily to mind. (The Netherlands is about 40% the size of Spain, and Italy is about 20% larger than the Netherlands.)

While comparing relative circular areas is a struggle, figuring out the relative ranks is not. Sure, it gets tougher with small differences (Germany vs S. Korea, Belgium vs Portugal) but saying those pairs are tied isn't a tragedy.

***

Another issue with bubble charts is how difficult it is to assess absolute values. A circle on its own has no reference point. The designer needs to add data labels or a legend. Adding data labels is an act of giving up. The data labels become the primary instrument for communicating the data, not the visual construct. Adding one data label is not enough, as the following shows:

Redo_junkcharts_afpukrainegrains_2

Being told that Spain's value is 4.1 does little to help estimate the values for the non-labelled bubbles.

The chart does come with the following legend:

Afp_ukrianegrains_legend

For this legend to work, the sample bubble sizes should span the range of the data. Notice that it's difficult to extrapolate from the size of the 1-million-ton bubble to 2-million, 4-million, etc. The analogy is a column chart in which the vertical axis does not extend through the full range of the dataset.

The designer totally gets this. The chart therefore contains both selected data labels and the partial legend. Every bubble larger than 1 million tons has an explicit data label. That's one solution for the above problem.

Nevertheless, why not use another chart form that avoids these problems altogether?

***

In Tyson's tweet, she showed another chart that pretty much contains the same information, this one from TASS.

Ukraine_grains_flows

This chart uses the flow diagram concept - in an abstract way, as I explained in previous post.

This chart form imposes structure on the data. The relative ranks of the countries within each region are listed from top to bottom. The relative amounts of grains are shown in black columns (and also in the thickness of the flows).

The aggregate value of movements within each region is called out in that middle section. It is impossible to learn this from the bag of bubbles version.

The designer did print the entire dataset onto this chart (except for the smallest countries grouped together as "other"). This decision takes away from the power of the underlying flow chart. Instead of thinking about the proportional representation of each country within its respective region, or the distribution of grains among regions, our eyes hone in on the data labels.

This brings me back to the principle of self-sufficiency: if we expect readers to consume the data labels - which comprise the entire dataset, why not just print a data table? If we decide to visualize, make the visual elements count!


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K.  They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles lost most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank has multiples of the absolute dollars of uninsured deposits as the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out of the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allow readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.


Lay off bubbles

Wall Street Journal says that the scale of layoffs in the tech industry recently is worse than those caused by the pandemic lockdown. Here is the chart:

Redo_wsj_tech_layoffs_sufficiency

It's the dreaded bubble chart, complete with overlapping circles. Each bubble represents the total number of employees laid off in the U.S. in a given month.

The above isn't really the chart you find in the Journal. I have removed the two data labels from the chart. Look at the highlighted months of April 2020 and November 2022. Can you guess how much larger is the number of laid-off employees in November 2022 relative to April 2020?

***

If you guessed it's 100% - that the larger bubble is twice the size of the smaller one, then you're much better than I at reading bubble charts. Here is the published chart with the data labels:

Wsj tech layoffs

I like to run this exercise - removing data labels - in order to reveal whether the graphical elements on the page are sufficient to convey the underlying data. Bubbles are typically not great at this. (This is what I call the self-sufficiency test.)

***

Another problem with bubble charts is that the sizes of the bubbles are arbitrary. This allows the designer to convey different messages with the same data.

Take a look at these two bubble charts:

Redo_wsj_layoff_bubbles

The first one has huge bubbles, and lots of overlapping while the second one is roughly the same as the WSJ chart (I pulled a different dataset so the numbers may not be exactly the same).

Both charts are made from exactly the same data! In the second chart, the smallest bubbles are made very small while in the first chart, the smallest bubbles are still quite large.

Think twice before you make a bubble chart.

 


Energy efficiency deserves visual efficiency

Long-time contributor Aleksander B. found a good one, in the World Energy Outlook Report, published by IEA (International Energy Agency).

Iea_balloonchart_emissions

The use of balloons is unusual, although after five minutes, I decided I must do some research to have any hope of understanding this data visualization.

A lot is going on. Below, I trace my own journey through this chart.

The text on the top left explains that the chart concerns emissions and temperature change. The first set of balloons (the grey ones) includes helpful annotations. The left-right position of the balloons indicates time points, in 10-year intervals except for the first.

The trapezoid that sits below the four balloons is more mysterious. It's labelled "median temperature rise in 2100". I debate two possibilities: (a) this trapezoid may serve as the fifth balloon, extending the time series from 2050 to 2100. This interpretation raises a couple of questions: why does the symbol change from balloon to trapezoid? why is the left-right time scale broken? (b) this trapezoid may represent something unrelated to the balloons. This interpretation also raises questions: its position on the horizontal axis still breaks the time series; and  if the new variable is "median temperature rise", then what determines its location on the chart?

That last question is answered if I move my glance all the way to the right edge of the chart where there are vertical axis labels. This axis is untitled but the labels shown in degree Celsius units are appropriate for "median temperature rise".

Turning to the balloons, I wonder what the scale is for the encoded emissions data. This is also puzzling because only a few balloons wear data labels, and a scale is nowhere to be found.

Iea_balloonchart_emissions_legend

The gridlines suggests that the vertical location of the balloons is meaningful. Tracing those gridlines to the right edge leads me back to the Celsius scale, which seems unrelated to emissions. The amount of emissions is probably encoded in the sizes of the balloons although none of these four balloons have any data labels so I'm rather flustered. My attention shifts to the colored balloons, a few of which are labelled. This confirms that the size of the balloons indeed measures the amount of emissions. Nevertheless, it is still impossible to gauge the change in emissions for the 10-year periods.

The colored balloons rising above, way above, the gridlines is an indication that the gridlines may lack a relationship with the balloons. But in some charts, the designer may deliberately use this device to draw attention to outlier values.

Next, I attempt to divine the informational content of the balloon strings. Presumably, the chart is concerned with drawing the correlation between emissions and temperature rise. Here I'm also stumped.

I start to look at the colored balloons. I've figured out that the amount of emissions is shown by the balloon size but I am still unclear about the elevation of the balloons. The vertical locations of these balloons change over time, hinting that they are data-driven. Yet, there is no axis, gridline, or data label that provides a key to its meaning.

Now I focus my attention on the trapezoids. I notice the labels "NZE", "APS", etc. The red section says "Pre-Paris Agreement" which would indicate these sections denote periods of time. However, I also understand the left-right positions of same-color balloons to indicate time progression. I'm completely lost. Understanding these labels is crucial to understanding the color scheme. Clearly, I have to read the report itself to decipher these acronyms.

The research reveals that NZE means "net zero emissions", which is a forecasting scenario - an utterly unrealistic one - in which every country is assumed to fulfil fully its obligations, a sort of best-case scenario but an unattainable optimum. APS and STEPS embed different assumptions about the level of effort countries would spend on reducing emissions and tackling global warming.

At this stage, I come upon another discovery. The grey section is missing any acronym labels. It's actually the legend of the chart. The balloon sizes, elevations, and left-right positions in the grey section are all arbitrary, and do not represent any real data! Surprisingly, this legend does not contain any numbers so it does not satisfy one of the traditional functions of a legend, which is to provide a scale.

There is still one final itch. Take a look at the green section:

Iea_balloonchart_emissions_green

What is this, hmm, caret symbol? It's labeled "Net Zero". Based on what I have been able to learn so far, I associate "net zero" to no "emissions" (this suggests they are talking about net emissions not gross emissions). For some reason, I also want to associate it with zero temperature rise. But this is not to be. The "net zero" line pins the balloon strings to a level of roughly 2.5 Celsius rise in temperature.

Wait, that's a misreading of the chart because the projected net temperature increase is found inside the trapezoid, meaning at "net zero", the scientists expect an increase in 1.5 degrees Celsius. If I accept this, I come face to face with the problem raised above: what is the meaning of the vertical positioning of the balloons? There must be a reason why the balloon strings are pinned at 2.5 degrees. I just have no idea why.

I'm also stealthily presuming that the top and bottom edges of the trapezoids represent confidence intervals around the median temperature rise values. The height of each trapezoid appears identical so I'm not sure.

I have just learned something else about this chart. The green "caret" must have been conceived as a fully deflated balloon since it represents the value zero. Its existence exposes two limitations imposed by the chosen visual design. Bubbles/circles should not be used when the value of zero holds significance. Besides, the use of balloon strings to indicate four discrete time points breaks down when there is a scenario which involves only three buoyant balloons.

***

The underlying dataset has five values (four emissions, one temperature rise) for four forecasting scenarios. It's taken a lot more time to explain the data visualization than to just show readers those 20 numbers. That's not good!

I'm sure the designer did not set out to confuse. I think what happened might be that the design wasn't shown to potential readers for feedback. Perhaps they were shown only to insiders who bring their domain knowledge. Insiders most likely would not have as much difficulty with reading this chart as did I.

This is an important lesson for using data visualization as a means of communications to the public. It's easy for specialists to assume knowledge that readers won't have.

For the IEA chart, here is a list of things not found explicitly on the chart that readers have to know in order to understand it.

  • Readers have to know about the various forecasting scenarios, and their acronyms (APS, NZE, etc.). This allows them to interpret the colors and section titles on the chart, and to decide whether the grey section is missing a scenario label, or is a legend.
  • Since the legend does not contain any scale information, neither for the balloon sizes nor for the temperatures, readers have to figure out the scales on their own. For temperature, they first learn from the legend that the temperature rise information is encoded in the trapezoid, then find the vertical axis on the right edge, notice that this axis has degree Celsius units, and recognize that the Celsius scale is appropriate for measuring median temperature rise.
  • For the balloon size scale, readers must resist the distracting gridlines around the grey balloons in the legend, notice the several data labels attached to the colored balloons, and accept that the designer has opted not to provide a proper size scale.

Finally, I still have several unresolved questions:

  • The horizontal axis may have no meaning at all, or it may only have meaning for emissions data but not for temperature
  • The vertical positioning of balloons probably has significance, or maybe it doesn't
  • The height of the trapezoids probably has significance, or maybe it doesn't

 

 


Finding the right context to interpret household energy data

Bloomberg_energybillBloomberg's recent article on surging UK household energy costs, projected over this winter, contains data about which I have long been intrigued: how much energy does different household items consume?

A twitter follower alerted me to this chart, and she found it informative.

***
If the goal is to pick out the appliances and estimate the cost of running them, the chart serves its purpose. Because the entire set of data is printed, a data table would have done equally well.

I learned that the mobile phone costs almost nothing to charge: 1 pence for six hours of charging, which is deemed a "single use" which seems double what a full charge requires. The games console costs 14 pence for a "single use" of two hours. That might be an underestimate of how much time gamers spend gaming each day.

***

Understanding the design of the chart needs a bit more effort. Each appliance is measured by two metrics: the number of hours considered to be "single use", and a currency value.

It took me a while to figure out how to interpret these currency values. Each cost is associated with a single use, and the duration of a single use increases as we move down the list of appliances. Since the designer assumes a fixed cost of electicity (shown in the footnote as 34p per kWh), at first, it seems like the costs should just increase from top to bottom. That's not the case, though.

Something else is driving these numbers behind the scene, namely, the intensity of energy use by appliance. The wifi router listed at the bottom is turned on 24 hours a day, and the daily cost of running it is just 6p. Meanwhile, running the fridge and freezer the whole day costs 41p. Thus, the fridge&freezer consumes electricity at a rate that is almost 7 times higher than the router.

The chart uses a split axis, which artificially reduces the gap between 8 hours and 24 hours. Here is another look at the bottom of the chart:

Bloomberg_energycost_bottom

***

Let's examine the choice of "single use" as a common basis for comparing appliances. Consider this:

  • Continuous appliances (wifi router, refrigerator, etc.) are denoted as 24 hours, so a daily time window is also implied
  • Repeated-use appliances (e.g. coffee maker, kettle) may be run multiple times a day
  • Infrequent use appliances may be used less than once a day

I prefer standardizing to a "per day" metric. If I use the microwave three times a day, the daily cost is 3 x 3p = 9 p, which is more than I'd spend on the wifi router, run 24 hours. On the other hand, I use the washing machine once a week, so the frequency is 1/7, and the effective daily cost is 1/7 x 36 p = 5p, notably lower than using the microwave.

The choice of metric has key implications on the appearance of the chart. The bubble size encodes the relative energy costs. The biggest bubbles are in the heating category, which is no surprise. The next largest bubbles are tumble dryer, dishwasher, and electric oven. These are generally not used every day so the "per day" calculation would push them lower in rank.

***

Another noteworthy feature of the Bloomberg chart is the split legend. The colors divide appliances into five groups based on usage category (e.g. cleaning, food, utility). Instead of the usual color legend printed on a corner or side of the chart, the designer spreads the category labels around the chart. Each label is shown the first time a specific usage category appears on the chart. There is a presumption that the reader scans from top to bottom, which is probably true on average.

I like this arrangement as it delivers information to the reader when it's needed.

 

 

 


Working hard at clarity

As I am preparing another blog post about the pandemic, I came across the following data graphic, recently produced by the CDC for a vaccine advisory board meeting:

CDC_positivevaccineintent

This is not an example of effective visual communications.

***

For one thing, readers are directed to scour the footnotes to figure out what's going on. If we ignore those for the moment, we see clusters of bubbles that have remained pretty stable from December 2020 to August 2021. The data concern some measure of Americans' intent to take the COVID-19 vaccine. That much we know.

There may have been a bit of an upward trend between January and May, although if you were shown the clusters for December, February and April, you'd think the trend's been pretty flat. 

***

But those colors? What could they represent? You'd surely have to fish this one out of the footnotes. Specifically, this obtuse sentence: "Surveys with multiple time points are shown with the same color bubble for each time point." I had to read it several times. I think it simply means "Color represents the pollster." 

Then it adds: "Surveys with only one time point are shown in gray." which simply means "All pollsters who have only one entry in the dataset are grouped together and shown in gray."

Another problem with this chart is over-plotting. Look at the July cluster. It's impossible to tell how many polls were conducted in July because the circles pile on top of one another. 

***

The appearance of the flat trend is a result of two unfortunate decisions made by the designer. If I retained the chart form, I'd have produced something that looks like this:

Junkcharts_redo_cdcvaccineintent_sameform

The first design choice is to expand the vertical axis to range from 0% to 100%. This effectively squeezes all the bubbles into a small range.

Junkcharts_redo_cdcvaccineintent_startatzero

The second design choice is to enlarge the bubbles causing copious amount of overlapping. 

Junkcharts_redo_cdcvaccineintent_startatzero_bigdots

In particular, this decision blows up the Pew poll (big pink bubble) that contained 10 times the sample size of most of the other polls. The Pew outcome actually came in at 70% but the top of the pink bubble extends to over 80%. Because of this, the outlier poll of December 2020 - which surprisingly printed the highest number of all polls in the entire time window - no longer looks special. 

***

Now, let's see what else we can do to enhance this chart. 

I don't like how bubble size is used to encode the sample size. It creates a weird sensation for anyone who's familiar with sampling errors, and confidence regions. The Pew poll with 10 times the sample size is the most reliable poll of them all. Reliability means the error bars around the Pew poll outcome is the smallest of them all. I tend to think of the area around a point estimate as showing the sampling error so the Pew poll would be a dot, showing the high precision of that estimate. 

But that won't work because larger bubbles catch more of the reader's attention. So, in the following version, all dots have the same size. I encode reliability in the opacity of the color. The darker dots are polls that are more reliable, that have larger sample sizes.

Junkcharts_redo_cdcvaccineintent_opacity

Two of the pollsters have more frequent polling than others. In this next version, I highlighted those two, which reveals the trend better.

Junkcharts_redo_cdcvaccineintent_opacitywithlines

 

 

 


Metaphors, maps, and communicating data

There are some data visualization that are obviously bad. But what makes them bad?

Here is an example of such an effort:

Carbon footprint 2021-02-15_0

This visualization of carbon emissions is not successful. There is precious little that a reader can learn from this chart without expensing a lot of effort. It's relatively easy to identify the largest emitters of carbon but since the data are not expressed per-capita, the chart mainly informs us which countries have the largest populations. 

The color of the bubbles informs readers which countries belong to which parts of the world. However, it distorts the location of countries within regions, and regions relative to regions, as the primary constraint is fitting the bubbles inside the shape of a foot.

The visualization gives a very rough estimate of the relative sizes of total emissions. The circles not being perfect circles don't help. 

It's relatively easy to list the top emitters in each region but it's hard to list the top 10 emitters in the world (try!) 

The small emitters stole all of the attention as they account for most of the labels - and they engender a huge web of guiding lines - an unsightly nuisance.

The diagram clings dearly to the "carbon footprint" metaphor. Does this metaphor help readers consume the emissions data? Conversely, does it slow them down?

A more conventional design uses a cartogram, a type of map in which the positioning of countries are roughly preserved while the geographical areas are coded to the data. Here's how it looks:

Carbonatlasthumb

I can't seem to source this effort. If any reader can find the original source, please comment below.

This cartogram is a rearrangement of the footprint illustration. The map construct eliminates the need to include a color legend which just tells people which country is in which continent. The details of smaller countries are pushed to the bottom. 

In the footprint visualization, I'd even consider getting rid of the legend completely. This means trusting that readers know South Africa is part of Africa, and China is part of Asia.

Carbonfootprint_part

Imagine: what if this chart comes without a color legend? Do we really need it?

***

I'd like to try a word cloud visual for this dataset. Something that looks like this (obviously with the right data encoding):

Michaeltompsett_worldmapwords

(This map is by Michael Tompsett who sells it here.)

 


Circular areas offer misleading cues of their underlying data

John M. pointed me on Twitter to this chart about the progress of U.S.'s vaccination campaign:

Whgov_proportiongettingvaccinated

This looks like a White House production, retweeted by WHO. John is unhappy about this nested bubble format, which I'll come back to later.

Let's zoom in on what matters:

Whgov_proportiongettingvaccinated_clip

An even bigger problem with this chart is the Q corner in our Trifecta Checkup. What is the question they are trying to address? It would appear to be the proportion of population that has "already received [one or more doses of] vaccine". And the big words tell us the answer is 8 percent.

_junkcharts_trifectacheckupBut is that really the question? Check out the dark blue circle. It is labeled "population that has already received vaccine" and thus we infer this bubble represents 8 percent. Now look at the outer bubble. Its annotation is "new population that received vaccine since January 27, 2021". The only interpretation that makes sense is that 8 percent  is not the most current number. If that is the case, why would the headline highlight an older statistic, and not the most up-to-date one?

Perhaps the real question is how fast is the progress in vaccination. Perhaps it took weeks to get to the dark circle and then days to get beyond. In order to improve this data visualization, we must first decide what the question really is.

***

Now let's get to those nested bubbles. The bubble chart is a format that is not "sufficient," by which I mean the visual by itself does not convey the data without the help of aids such as labels. Try to answer the following questions:

Junkcharts_whgov_vaccineprogress_bubblequiz

In my view, if your answer to the last question is anything more than 5 seconds, the dataviz has failed. A successful data visualization should not make readers solve puzzles.

The first two questions depict the confusing nature of concentric circle diagrams. The first data point is coded to the inner circle. Where is the second data point? Is it encoded to the outer circle, or just the outer ring?

In either case, human brains are not trained to compare circular areas. For question 1, the outer circle is 70% larger than the smaller circle. For question 2, the ring is 70% of the area of the dark blue circle. If you're thinking those numbers seem unreasonable, I can tell you that was my first reaction too! So I made the following to convince myself that the calculation was correct:

Junkcharts_whgov_vaccineprogress_bubblequiz_2

Circular areas offer misleading visual cues, and should be used sparingly.

[P.S. 2/10/2021. In the next post, I sketch out an alternative dataviz for this dataset.]