## Aligning V and Q by way of D

##### Apr 08, 2024

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.

***

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts show this encoding:

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is vritually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values.

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.

## The curse of dimensions

##### Mar 20, 2024

Usually the curse of dimensions concerns data with many dimensions. But today I want to talk about a different kind of curse. This is the curse of dimensions in mapping.

We are only talking about a few dimensions, typically between 3 and 6, so small number of dimensions. And yet it's already a curse. Maps are typically drawn in two dimensions. Those two dimensions are usually spoken for: they show the x- and y-coordinate of space. If we want to include a third, fourth or fifth dimension of data on the map, we have to appeal to colors, shapes, and so on. Cartographers have long realized that adding dimensions involves tradeoffs.

***

Andrew featured some colored bubble maps in a recent post. Here is one example:

The above map shows the proportion of population in each U.S. county that is Hispanic. Each county is represented by a bubble pinned to the centroid of the county. The color of the bubble shows the data, divided into demi-deciles so they are using a equal-width binning method. The size of a bubble indicates the size of a county.

The map is sometimes called a "Dorling map" after its presumptive original designer.

I'm going to use this map to explore the curse of dimensions.

***

It's clear from the design that county-level details are regarded as extremely important. As there are about 3,000 counties in the U.S., I don't see how any visual design can satisfy this requirement without giving up clarity.

More details require more objects, which spread readers' attention. More details contain more stories, but that too dilutes their focus.

Another principle of this map is to not allow bubbles to overlap. Of course, having bubbles overlap or print on top of one another is a visual faux pas. But to prevent such behavior on this particular design means the precise locations are sacrificed. Consider the eastern seaboard where there are densely populated counties: they are not pinned to their centroids. Instead, the counties are pushed out of their normal positions, similar to making a cartogram.

I remarked at the start – erroneously but deliberately – that each bubble is centered at the centroid of each county. I wonder how many of you noticed the inaccuracy of that statement. If that rule were followed, then the bubbles in New England would have overlapped and overprinted.

This tradeoff affects how we perceive regional patterns, as all the densely populated regions are bent out of shape.

Another aspect of the data that the designer treats as important is county population, or rather relative county population. Relative – because bubble size don't portray absolutes, plus the designer didn't bother to provide a legend to decipher bubble sizes.

The tradeoff is location. The varying bubble sizes, coupled with the previous stipulation of no overlapping, push bubbles from their proper centroids. This forced displacement disproportionately affects larger counties.

***

What if we are willing to sacrifice county-level details?

In this setting, we are not obliged to show every single county. One alternative is to perform spatial smoothing. Intuitively, think about the following steps: plot all these bubbles in their precise locations, turn the colors slightly transparent, let them overlap, blend away the edges, and then we have a nice picture of where the Hispanic people are located.

I have sacrificed the county-level details but the regional pattern becomes much clearer, and we don't need to deviate from the well-understood shape of the standard map.

This version reminds me of the language maps that Josh Katz made.

Here is an old post about these maps.

This map design only reduces but does not eliminate the geographical inaccuracy. It uses the same trick as the Dorling map: the "vertical" density of population has been turned into "horizontal" span. It's a bit better because the centroids are not displaced.

***

Which map is better depends on what tradeoffs one is making. In the above example, I'd have made different choices.

One final thing – it's minor but maybe not so minor. Most of the bubbles on the map especially in the middle are tiny; as most of them have Hispanic proportions that are on the left side of the scale, they should be showing light orange. However, all of them appear darker than they ought to be. That's because each bubble has a dark border. For small bubbles, the ratio of ink on the border is a high proportion of the ink for the entire object.

## Two metrics in-fighting

##### Oct 10, 2023

The Wall Street Journal shows the following chart which pits two metrics against each other:

The primary metric is the change in median yearly salary between the two periods of time. We presume it's primary because of its presence in the chart title, and the blue bars being more readable than the green bubbles. The secondary metric is the median yearly salary in the later period.

That, I believe, was the intended design. When I saw this chart, my eyes went to the numbers inside the green bubbles. Perhaps it's because I didn't read the chart title first, and the horizontal axis wasn't labelled so it wasn't obvious what the blue bars coded.

As with most bubble charts, the data labels exist to cover up the inadequacy of circular areas. The self-sufficiency test - removing the data labels - shows this well:

It's simply impossible to know what values should be in each bubble, or to perceive the relative sizes of those bubbles.

***

Reversing the order of the blue bars also helps:

The original order is one of the more annoying features in most visualization packages. Because internally, the categories are numbered 1, 2, 3, ..., and because the convention is to have values run higher as they run up the vertical axis, these packages would place the top-ranked item at the bottom of the chart.

Most people read top to bottom, which means that they read the least important item first, and the most important item last!

In most visualization packages, it takes only 1 click or 1 action to reverse the order of the items. Please do it!

***

For change over time, I like using a Bumps chart, otherwise called a slope graph:

## What is the question is the question

##### Sep 20, 2023

I picked up a Fortune magazine while traveling, and saw this bag of bubbles chart.

This chart is visually appealing, that must be said. Each circle represents the reported revenues of a corporation that belongs to the “Global 500 Companies” list. It is labeled by the location of the company’s headquarters. The largest bubble shows Beijing, the capital of China, indicating that companies based in Beijing count \$6 trillion dollars of revenues amongst them. The color of the bubbles show large geographical units; the red bubbles are cities in Greater China.

I appreciate a couple of the design decisions. The chart title and legend are placed on the top, making it easy to find one’s bearing – effective while non-intrusive. The labeling signals a layering: the first and biggest group have icons; the second biggest group has both name and value inside the bubbles; the third group has values inside the bubbles but names outside; the smallest group contains no labels.

Note the judgement call the designer made. For cities that readers might not be familiar with, a country name (typically abbreviated) is added. This is a tough call since mileage varies.

***

As I discussed before (link), the bag of bubbles does not elevate comprehension. Just try answering any of the following questions, which any of us may have, using just the bag of bubbles:

• What proportion of the total revenues are found in Beijing?
• What proportion of the total revenues are found in Greater China?
• What are the top 5 cities in Greater China?
• What are the ranks of the six regions?

If we apply the self-sufficiency test and remove all the value labels, it’s even harder to figure out what’s what.

***

Moving to the D corner of the Trifecta Checkup, we aren’t sure how to interpret this dataset. It’s unclear if these companies derive most of their revenues locally, or internationally. A company headquartered in Washington D.C. may earn most of its revenues in other places. Even if Beijing-based companies serve mostly Chinese customers, only a minority of revenues would be directly drawn from Beijing. Some U.S. corporations may choose its headquarters based on tax considerations. It’s a bit misleading to assign all revenues to one city.

As we explore this further, it becomes clear that the designer must establish a target – a strong idea of what question s/he wants to address. The Fortune piece comes with a paragraph. It appears that an important story is the spatial dispersion of corporate revenues in different countries. They point out that U.S. corporate HQs are more distributed geographically than Chinese corporate HQs, which tend to be found in the key cities.

There is a disconnect between the Question and the Data used to create the visualization. There is also a disconnect between the Question and the Visual display.

## One bubble is a tragedy, and a bag of bubbles is...

##### Jul 19, 2023

From Kathleen Tyson's twitter account, I came across a graphic showing the destinations of Ukraine's grain exports since 2022 under the auspices of a UN deal. This graphic, made by AFP, uses one of the chart forms that baffle me - the bag of bubbles.

The first trouble with a bag of bubbles is the single bubble. The human brain is just not fit for comparing bubble sizes. The self-sufficiency test is my favorite device for demonstrating this weakness. The following is the European section of the above chart, with the data labels removed.

How much bigger is Spain than the Netherlands? What's the difference between Italy and the Netherlands? The answers don't come easily to mind. (The Netherlands is about 40% the size of Spain, and Italy is about 20% larger than the Netherlands.)

While comparing relative circular areas is a struggle, figuring out the relative ranks is not. Sure, it gets tougher with small differences (Germany vs S. Korea, Belgium vs Portugal) but saying those pairs are tied isn't a tragedy.

***

Another issue with bubble charts is how difficult it is to assess absolute values. A circle on its own has no reference point. The designer needs to add data labels or a legend. Adding data labels is an act of giving up. The data labels become the primary instrument for communicating the data, not the visual construct. Adding one data label is not enough, as the following shows:

Being told that Spain's value is 4.1 does little to help estimate the values for the non-labelled bubbles.

The chart does come with the following legend:

For this legend to work, the sample bubble sizes should span the range of the data. Notice that it's difficult to extrapolate from the size of the 1-million-ton bubble to 2-million, 4-million, etc. The analogy is a column chart in which the vertical axis does not extend through the full range of the dataset.

The designer totally gets this. The chart therefore contains both selected data labels and the partial legend. Every bubble larger than 1 million tons has an explicit data label. That's one solution for the above problem.

Nevertheless, why not use another chart form that avoids these problems altogether?

***

In Tyson's tweet, she showed another chart that pretty much contains the same information, this one from TASS.

This chart uses the flow diagram concept - in an abstract way, as I explained in previous post.

This chart form imposes structure on the data. The relative ranks of the countries within each region are listed from top to bottom. The relative amounts of grains are shown in black columns (and also in the thickness of the flows).

The aggregate value of movements within each region is called out in that middle section. It is impossible to learn this from the bag of bubbles version.

The designer did print the entire dataset onto this chart (except for the smallest countries grouped together as "other"). This decision takes away from the power of the underlying flow chart. Instead of thinking about the proportional representation of each country within its respective region, or the distribution of grains among regions, our eyes hone in on the data labels.

This brings me back to the principle of self-sufficiency: if we expect readers to consume the data labels - which comprise the entire dataset, why not just print a data table? If we decide to visualize, make the visual elements count!

## Showing both absolute and relative values on the same chart 1

##### Apr 13, 2023

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed \$250K.  They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles lost most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank has multiples of the absolute dollars of uninsured deposits as the smaller banks.

Here is a revised version:

The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out of the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allow readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.

## Lay off bubbles

##### Mar 03, 2023

Wall Street Journal says that the scale of layoffs in the tech industry recently is worse than those caused by the pandemic lockdown. Here is the chart:

It's the dreaded bubble chart, complete with overlapping circles. Each bubble represents the total number of employees laid off in the U.S. in a given month.

The above isn't really the chart you find in the Journal. I have removed the two data labels from the chart. Look at the highlighted months of April 2020 and November 2022. Can you guess how much larger is the number of laid-off employees in November 2022 relative to April 2020?

***

If you guessed it's 100% - that the larger bubble is twice the size of the smaller one, then you're much better than I at reading bubble charts. Here is the published chart with the data labels:

I like to run this exercise - removing data labels - in order to reveal whether the graphical elements on the page are sufficient to convey the underlying data. Bubbles are typically not great at this. (This is what I call the self-sufficiency test.)

***

Another problem with bubble charts is that the sizes of the bubbles are arbitrary. This allows the designer to convey different messages with the same data.

Take a look at these two bubble charts:

The first one has huge bubbles, and lots of overlapping while the second one is roughly the same as the WSJ chart (I pulled a different dataset so the numbers may not be exactly the same).

Both charts are made from exactly the same data! In the second chart, the smallest bubbles are made very small while in the first chart, the smallest bubbles are still quite large.

Think twice before you make a bubble chart.

## Energy efficiency deserves visual efficiency

##### Dec 05, 2022

Long-time contributor Aleksander B. found a good one, in the World Energy Outlook Report, published by IEA (International Energy Agency).

The use of balloons is unusual, although after five minutes, I decided I must do some research to have any hope of understanding this data visualization.

A lot is going on. Below, I trace my own journey through this chart.

The text on the top left explains that the chart concerns emissions and temperature change. The first set of balloons (the grey ones) includes helpful annotations. The left-right position of the balloons indicates time points, in 10-year intervals except for the first.

The trapezoid that sits below the four balloons is more mysterious. It's labelled "median temperature rise in 2100". I debate two possibilities: (a) this trapezoid may serve as the fifth balloon, extending the time series from 2050 to 2100. This interpretation raises a couple of questions: why does the symbol change from balloon to trapezoid? why is the left-right time scale broken? (b) this trapezoid may represent something unrelated to the balloons. This interpretation also raises questions: its position on the horizontal axis still breaks the time series; and  if the new variable is "median temperature rise", then what determines its location on the chart?

That last question is answered if I move my glance all the way to the right edge of the chart where there are vertical axis labels. This axis is untitled but the labels shown in degree Celsius units are appropriate for "median temperature rise".

Turning to the balloons, I wonder what the scale is for the encoded emissions data. This is also puzzling because only a few balloons wear data labels, and a scale is nowhere to be found.

The gridlines suggests that the vertical location of the balloons is meaningful. Tracing those gridlines to the right edge leads me back to the Celsius scale, which seems unrelated to emissions. The amount of emissions is probably encoded in the sizes of the balloons although none of these four balloons have any data labels so I'm rather flustered. My attention shifts to the colored balloons, a few of which are labelled. This confirms that the size of the balloons indeed measures the amount of emissions. Nevertheless, it is still impossible to gauge the change in emissions for the 10-year periods.

The colored balloons rising above, way above, the gridlines is an indication that the gridlines may lack a relationship with the balloons. But in some charts, the designer may deliberately use this device to draw attention to outlier values.

Next, I attempt to divine the informational content of the balloon strings. Presumably, the chart is concerned with drawing the correlation between emissions and temperature rise. Here I'm also stumped.

I start to look at the colored balloons. I've figured out that the amount of emissions is shown by the balloon size but I am still unclear about the elevation of the balloons. The vertical locations of these balloons change over time, hinting that they are data-driven. Yet, there is no axis, gridline, or data label that provides a key to its meaning.

Now I focus my attention on the trapezoids. I notice the labels "NZE", "APS", etc. The red section says "Pre-Paris Agreement" which would indicate these sections denote periods of time. However, I also understand the left-right positions of same-color balloons to indicate time progression. I'm completely lost. Understanding these labels is crucial to understanding the color scheme. Clearly, I have to read the report itself to decipher these acronyms.

The research reveals that NZE means "net zero emissions", which is a forecasting scenario - an utterly unrealistic one - in which every country is assumed to fulfil fully its obligations, a sort of best-case scenario but an unattainable optimum. APS and STEPS embed different assumptions about the level of effort countries would spend on reducing emissions and tackling global warming.

At this stage, I come upon another discovery. The grey section is missing any acronym labels. It's actually the legend of the chart. The balloon sizes, elevations, and left-right positions in the grey section are all arbitrary, and do not represent any real data! Surprisingly, this legend does not contain any numbers so it does not satisfy one of the traditional functions of a legend, which is to provide a scale.

There is still one final itch. Take a look at the green section:

What is this, hmm, caret symbol? It's labeled "Net Zero". Based on what I have been able to learn so far, I associate "net zero" to no "emissions" (this suggests they are talking about net emissions not gross emissions). For some reason, I also want to associate it with zero temperature rise. But this is not to be. The "net zero" line pins the balloon strings to a level of roughly 2.5 Celsius rise in temperature.

Wait, that's a misreading of the chart because the projected net temperature increase is found inside the trapezoid, meaning at "net zero", the scientists expect an increase in 1.5 degrees Celsius. If I accept this, I come face to face with the problem raised above: what is the meaning of the vertical positioning of the balloons? There must be a reason why the balloon strings are pinned at 2.5 degrees. I just have no idea why.

I'm also stealthily presuming that the top and bottom edges of the trapezoids represent confidence intervals around the median temperature rise values. The height of each trapezoid appears identical so I'm not sure.

I have just learned something else about this chart. The green "caret" must have been conceived as a fully deflated balloon since it represents the value zero. Its existence exposes two limitations imposed by the chosen visual design. Bubbles/circles should not be used when the value of zero holds significance. Besides, the use of balloon strings to indicate four discrete time points breaks down when there is a scenario which involves only three buoyant balloons.

***

The underlying dataset has five values (four emissions, one temperature rise) for four forecasting scenarios. It's taken a lot more time to explain the data visualization than to just show readers those 20 numbers. That's not good!

I'm sure the designer did not set out to confuse. I think what happened might be that the design wasn't shown to potential readers for feedback. Perhaps they were shown only to insiders who bring their domain knowledge. Insiders most likely would not have as much difficulty with reading this chart as did I.

This is an important lesson for using data visualization as a means of communications to the public. It's easy for specialists to assume knowledge that readers won't have.

For the IEA chart, here is a list of things not found explicitly on the chart that readers have to know in order to understand it.

• Readers have to know about the various forecasting scenarios, and their acronyms (APS, NZE, etc.). This allows them to interpret the colors and section titles on the chart, and to decide whether the grey section is missing a scenario label, or is a legend.
• Since the legend does not contain any scale information, neither for the balloon sizes nor for the temperatures, readers have to figure out the scales on their own. For temperature, they first learn from the legend that the temperature rise information is encoded in the trapezoid, then find the vertical axis on the right edge, notice that this axis has degree Celsius units, and recognize that the Celsius scale is appropriate for measuring median temperature rise.
• For the balloon size scale, readers must resist the distracting gridlines around the grey balloons in the legend, notice the several data labels attached to the colored balloons, and accept that the designer has opted not to provide a proper size scale.

Finally, I still have several unresolved questions:

• The horizontal axis may have no meaning at all, or it may only have meaning for emissions data but not for temperature
• The vertical positioning of balloons probably has significance, or maybe it doesn't
• The height of the trapezoids probably has significance, or maybe it doesn't

## Finding the right context to interpret household energy data

##### Oct 12, 2022

Bloomberg's recent article on surging UK household energy costs, projected over this winter, contains data about which I have long been intrigued: how much energy does different household items consume?

A twitter follower alerted me to this chart, and she found it informative.

***
If the goal is to pick out the appliances and estimate the cost of running them, the chart serves its purpose. Because the entire set of data is printed, a data table would have done equally well.

I learned that the mobile phone costs almost nothing to charge: 1 pence for six hours of charging, which is deemed a "single use" which seems double what a full charge requires. The games console costs 14 pence for a "single use" of two hours. That might be an underestimate of how much time gamers spend gaming each day.

***

Understanding the design of the chart needs a bit more effort. Each appliance is measured by two metrics: the number of hours considered to be "single use", and a currency value.

It took me a while to figure out how to interpret these currency values. Each cost is associated with a single use, and the duration of a single use increases as we move down the list of appliances. Since the designer assumes a fixed cost of electicity (shown in the footnote as 34p per kWh), at first, it seems like the costs should just increase from top to bottom. That's not the case, though.

Something else is driving these numbers behind the scene, namely, the intensity of energy use by appliance. The wifi router listed at the bottom is turned on 24 hours a day, and the daily cost of running it is just 6p. Meanwhile, running the fridge and freezer the whole day costs 41p. Thus, the fridge&freezer consumes electricity at a rate that is almost 7 times higher than the router.

The chart uses a split axis, which artificially reduces the gap between 8 hours and 24 hours. Here is another look at the bottom of the chart:

***

Let's examine the choice of "single use" as a common basis for comparing appliances. Consider this:

• Continuous appliances (wifi router, refrigerator, etc.) are denoted as 24 hours, so a daily time window is also implied
• Repeated-use appliances (e.g. coffee maker, kettle) may be run multiple times a day
• Infrequent use appliances may be used less than once a day

I prefer standardizing to a "per day" metric. If I use the microwave three times a day, the daily cost is 3 x 3p = 9 p, which is more than I'd spend on the wifi router, run 24 hours. On the other hand, I use the washing machine once a week, so the frequency is 1/7, and the effective daily cost is 1/7 x 36 p = 5p, notably lower than using the microwave.

The choice of metric has key implications on the appearance of the chart. The bubble size encodes the relative energy costs. The biggest bubbles are in the heating category, which is no surprise. The next largest bubbles are tumble dryer, dishwasher, and electric oven. These are generally not used every day so the "per day" calculation would push them lower in rank.

***

Another noteworthy feature of the Bloomberg chart is the split legend. The colors divide appliances into five groups based on usage category (e.g. cleaning, food, utility). Instead of the usual color legend printed on a corner or side of the chart, the designer spreads the category labels around the chart. Each label is shown the first time a specific usage category appears on the chart. There is a presumption that the reader scans from top to bottom, which is probably true on average.

I like this arrangement as it delivers information to the reader when it's needed.

## Working hard at clarity

##### Sep 02, 2021

As I am preparing another blog post about the pandemic, I came across the following data graphic, recently produced by the CDC for a vaccine advisory board meeting:

This is not an example of effective visual communications.

***

For one thing, readers are directed to scour the footnotes to figure out what's going on. If we ignore those for the moment, we see clusters of bubbles that have remained pretty stable from December 2020 to August 2021. The data concern some measure of Americans' intent to take the COVID-19 vaccine. That much we know.

There may have been a bit of an upward trend between January and May, although if you were shown the clusters for December, February and April, you'd think the trend's been pretty flat.

***

But those colors? What could they represent? You'd surely have to fish this one out of the footnotes. Specifically, this obtuse sentence: "Surveys with multiple time points are shown with the same color bubble for each time point." I had to read it several times. I think it simply means "Color represents the pollster."

Then it adds: "Surveys with only one time point are shown in gray." which simply means "All pollsters who have only one entry in the dataset are grouped together and shown in gray."

Another problem with this chart is over-plotting. Look at the July cluster. It's impossible to tell how many polls were conducted in July because the circles pile on top of one another.

***

The appearance of the flat trend is a result of two unfortunate decisions made by the designer. If I retained the chart form, I'd have produced something that looks like this:

The first design choice is to expand the vertical axis to range from 0% to 100%. This effectively squeezes all the bubbles into a small range.

The second design choice is to enlarge the bubbles causing copious amount of overlapping.

In particular, this decision blows up the Pew poll (big pink bubble) that contained 10 times the sample size of most of the other polls. The Pew outcome actually came in at 70% but the top of the pink bubble extends to over 80%. Because of this, the outlier poll of December 2020 - which surprisingly printed the highest number of all polls in the entire time window - no longer looks special.

***

Now, let's see what else we can do to enhance this chart.

I don't like how bubble size is used to encode the sample size. It creates a weird sensation for anyone who's familiar with sampling errors, and confidence regions. The Pew poll with 10 times the sample size is the most reliable poll of them all. Reliability means the error bars around the Pew poll outcome is the smallest of them all. I tend to think of the area around a point estimate as showing the sampling error so the Pew poll would be a dot, showing the high precision of that estimate.

But that won't work because larger bubbles catch more of the reader's attention. So, in the following version, all dots have the same size. I encode reliability in the opacity of the color. The darker dots are polls that are more reliable, that have larger sample sizes.

Two of the pollsters have more frequent polling than others. In this next version, I highlighted those two, which reveals the trend better.