## The most dangerous day

##### Aug 16, 2024

Our World in Data published this interesting chart about infant mortality in the U.S.

The article that sent me to this chart called the first day of life the "most dangerous day". This dot plot seems to support the notion, as the "per-day" death rate is highest on the day of birth, and then drops quite fast (note the log scale) over the first year of life.

***

Based on the same dataset, I created the following different view of the data, using the same dot plot form:

By this measure, a baby has a 99.63% chance of surviving the first 30 days, while the survival rate drops to 99.5% by day 180.

There is an important distinction between these two metrics.

The "per day" death rate is the chance of dying on a given day, conditional on having survived up to that day. The chance of dying on day 2 is lower partly because some of the more vulnerable babies have died on day 0 or day 1.

The survival rate metric is cumulative: it measures how many babies are still alive given they were born on day 0. The survival rate can never go up, so long as we can't bring back the dead.

***

If we are assessing a 5-day-old baby's chance of surviving day 6, the "per-day" death rate is relevant since that baby has not died in the first 5 days.

If the baby has just been born, and we want to know the chance it might die in the first five days (or survive beyond day 5), then the cumulative survival rate curve is the answer. If we use the per-day death rates, we can't simply add the first five of them. The correct calculation is more involved: the chance of dying on day 0, plus the chance of surviving day 0 but dying on day 1, plus the chance of surviving days 0 and 1 but dying on day 2, and so on.
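A minimal Python sketch of that compounding, using made-up per-day rates rather than the actual OWID numbers:

```python
# Made-up per-day death rates for days 0 through 4 (not the OWID data).
per_day_death_rate = [0.002, 0.0006, 0.0003, 0.0002, 0.00015]

# The cumulative survival rate compounds the per-day rates:
# survive day 0, then survive day 1 given day 0, and so on.
survival = 1.0
for rate in per_day_death_rate:
    survival *= 1 - rate

# The chance of dying within the first five days is the complement.
death_within_5_days = 1 - survival
print(f"Survival through day 4: {survival:.5f}")
```

Note that `death_within_5_days` comes out slightly below the naive sum of the five rates, because each rate is conditional on having survived the prior days.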

## Making colors and groups come alive

##### Aug 05, 2024

In the May 2024 issue of Significance, there is an enlightening article (link, paywall) about a new measure of inflation being adopted by the U.K. government known as HCI (Household Costs Indices). This is expected to replace CPI which is the de facto standard measure used around the world. In Chapter 7 of Numbersense (link), I discuss the construction of the CPI, which critics have alleged is manipulated by public officials to be over-optimistic.

The HCI looks promising as it addresses several weaknesses in the CPI measure. First, it accounts for household spending on housing, which has always been a tricky subject, especially for those who own homes rather than rent. Second, it recognizes that the average inflation number, which represents the average price changes on the average basket of goods purchased by the average person, does not reflect the experience of many. The HCI measures are broken down into demographic subgroups, so it's possible to compare the HCI of retirees vs. non-retirees, for example.

Then comes this multi-colored bar chart:

***

The chart is serviceable: the reader can find the story. For almost all the subgroups listed, the HCI measure comes in higher than the CPI measure (black). For the income deciles, the reader senses that the relationship is not linear, that is to say, inflation does not increase (or decrease) with income. It appears that inflation is highest at both ends of the spectrum, and lowest for those in deciles 6 to 8. The only subgroup for whom CPI overestimates inflation is "private renter," which makes sense since the CPI index previously did not account for "owner-occupier housing" costs.

This is a chart with 19 bars, and 19 colors. The colors do not encode any data at all, which is a bit wasteful. We can make the colors come alive by encoding subgroup identity. This is what the grouped bar chart looks like:

While this is still messy, this version makes it a bit easier to compare across subgroups. The chart simultaneously plots four different grouping methods: by retired/not, by income decile, by housing situation, and by having children/not. Within each grouping, the segments are mutually exclusive, but between groupings, the segments overlap. For example, the same person can be counted in both Retired and Has Children; some retirees have children while others don't.

***

To better display the interactions between groups and subgroups, I prefer using a dot plot.

This is not a simple dot plot either. It's a grouped dot plot with four levels that correspond to each grouping method. One can see the distribution of HCI values across the subgroups within each grouping, and also compare the range of values from one group to another group.

One side benefit of using the dot plot is to get rid of the non-informative space between values 0 and 20. When using a bar chart, we have to start the bars at zero to avoid distorting the encoding. Not so for a dot plot.

P.S. In the next iteration, I'd consider flipping the axes as that might simplify labeling the subgroups.

## Reading log: HBR's specialty bar charts

##### Mar 28, 2024

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – it's a reading report that traces how one consumes a dataviz work, from where the eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number being 74%, two percentage points smaller. I couldn't explain how the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

I became even more confused. The first row showed labels (green number 60%, blue number 44%, performance gap -16%). This bar is much bigger than the one in the previous figure, even though 60% was less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference that would have been merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the positions of the vertical blue lines were roughly proportional to the blue numbers. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile, the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.
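The decoded scheme can be written down in a few lines, using the two labels from the first row of the HBR chart:

```python
# Decoded scheme for one bar: the full bar maps to the green number,
# the left section to the blue number, the right section to the gap.
should_act = 0.76   # green label: proportion saying business should take responsibility
doing_well = 0.29   # blue label: proportion saying business is doing well

full_bar_width = should_act
left_section_width = doing_well
right_section_width = full_bar_width - left_section_width  # the performance gap

print(f"Performance gap: {right_section_width:.2f}")
```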

Here is an edited version that clarifies the encoding scheme:

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Howie also said:

I interpret the authors' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.

## A nice plot of densities, but what's behind the colors?

##### Feb 08, 2024

I came across this chart by Planet Anomaly that compares air quality across the world's cities (link). The chart is in long form. The top part looks like this:

The bottom part looks like this:

You can go to the Visual Capitalist website to see the entire chart.

***

Plots of densities are relatively rare. The metric for air quality is micrograms of fine particulate matter (PM) per cubic meter, so showing densities is natural.

It's pretty clear that the cities with the worst air quality at the bottom have a lot more PM in the air than the cleanest cities shown at the top.

This density chart plays looser with the data than our canonical chart types. The perceived densities of dots inside the squares do not represent the actual concentrations of PM. It's certainly not true that in New Delhi, the air is packed tightly with PM.

Further, a random number generator is required to scatter the red dots inside the squares. Thus, different software or designers will make the same chart look a bit different: the densities will be the same but the locations of the dots will not be.

I don't have a problem with this. Do you?
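A sketch of that randomness, using Python's standard `random` module; the PM value and the dots-per-unit scaling here are arbitrary choices, not the chart's actual parameters:

```python
import random

def scatter_dots(pm_value, dots_per_unit=2, seed=None):
    """Place dots uniformly at random inside a unit square,
    with the dot count proportional to the PM reading."""
    rng = random.Random(seed)
    n = round(pm_value * dots_per_unit)
    return [(rng.random(), rng.random()) for _ in range(n)]

# Two seeds give the same density (dot count) but different locations,
# just as two designers' versions of the chart would differ.
a = scatter_dots(50, seed=1)
b = scatter_dots(50, seed=2)
assert len(a) == len(b) and a != b
```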

***

Another notable feature of this chart is the double encoding. The same metric is not just presented as densities; it is also encoded in a color scale.

I don't think this adds much.

Both color and density are hard for humans to perceive precisely, so adding color does not convey extra precision to readers.

The color scale is gradated, so it effectively divides the cities into seven groups. But I don't attach particular significance to this classification. If that grouping were important, it would be clearer to put boxes around the groups of plots. So I don't think the color scale conveys clustering to readers effectively.

There is one important grouping which is defined by WHO's safe limit of 5 micrograms per cubic meter. A few cities pass this test while almost every other place fails. But the design pays no attention to this threshold, as it uses the same hue on both sides, and even the same tint changes on either side of the limit.

***

Another notable project that shows densities as red dots is this emotional chart by Mona Chalabi about measles, which I wrote about in 2019.

## The cult of raw unadjusted data

##### Jan 23, 2024

The author attached a message: "Let's look at raw, unadjusted temperature data from remote US thermometers. What story do they tell?"

I suppose this post came from a climate change skeptic, and the story we're expected to take away from the chart is that there is nothing to see here.

***

What are we looking at, really?

"Nothing to see" probably refers to the patch of blue squares that cover the entire plot area, as time runs left to right from the 1910s to the present.

But we can't really see what's going on in the middle of the patch. So, "nothing to see" is effectively only about the top-to-bottom range of roughly 29.8 to 82.0. What does that range signify?

The blue patch is subdivided into vertical lines consisting of blue squares. Each line is a year's worth of temperature measurements. Each square is the average temperature on a specific day. The vertical range is the difference between the maximum and minimum daily temperatures in a given year. These are extreme values that say almost nothing about the temperatures in the other ~363 days of the year.

We know quite a bit more about the density of squares along each vertical line. They are broken up roughly by seasons. Those values near the top came from summers while the values near the bottom came from winters. The density is the highest near the middle, where the overplotting is so severe that we can barely see anything.

Within each vertical line, the data are not ordered chronologically. This is a key observation. From left to right, the data are ordered from earliest to latest, but not from top to bottom! Therefore, it is impossible for the human eye to trace the entire trajectory of the daily temperature readings from this chart. At best, you can trace the yearly average temperature, and only extremely roughly, by eyeballing where the annual averages sit inside the blue patch.

Indeed, there is "nothing to see" on this chart because its design has pulverized the data.

***

In Numbersense (link), I wrote "not adjusting the raw data is to knowingly publish bad information. It is analogous to a restaurant's chef knowingly sending out spoilt fish."

It's a fallacy to think that "raw unadjusted" data are the best kind of data. It's actually the opposite. Adjustments are designed to correct biases or other problems in the data. Of course, adjustments can be subverted to introduce biases in the data as well. It is subversive to presume that all adjustments are of the subversive kind.

What kinds of adjustments are of interest in this temperature dataset?

Foremost is the seasonal adjustment. See my old post here. If we want to learn whether temperatures have risen over these decades, we can't do so without separating out the seasons.

The whole dataset can be simplified by drawing the smoothed annual average temperature grouped by season of the year, and when that is done, the trend of rising temperatures is obvious.
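A sketch of that simplification in Python, with made-up readings standing in for the station data:

```python
from statistics import mean

# Made-up (year, season, temperature) readings, two per season-year.
readings = [
    (1990, "winter", 0.8), (1990, "winter", 1.2),
    (1990, "summer", 23.5), (1990, "summer", 24.5),
    (1991, "winter", 1.2), (1991, "winter", 1.6),
    (1991, "summer", 24.0), (1991, "summer", 25.0),
]

# Separate out the seasons, then average within each (season, year).
groups = {}
for year, season, temp in readings:
    groups.setdefault((season, year), []).append(temp)
seasonal_avg = {key: mean(temps) for key, temps in groups.items()}

# One smoothed line per season would then reveal the trend, e.g.:
winter_series = [seasonal_avg[("winter", y)] for y in (1990, 1991)]
print(winter_series)
```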

***

The following chart by the EPA roughly implements the above:

The original can be found here. They made one adjustment which isn't the one I expected.

Note the vertical scale is titled "temperature anomaly". So, they are not plotting the actual recorded average temperatures, but the "anomalies", i.e. the difference between the recorded temperatures and some kind of "expected" temperature. This is a type of data adjustment as well. The purpose is to focus attention on the relative rather than absolute values. Think of this formula: recorded value = expected value + anomaly. The chart shows how many degrees above or below expectation, rather than how many degrees.

For a chart like this, there should be a required footnote that defines what "anomaly" is. Specifically, the reader should know about the model behind the "expectation". Typically, it's a kind of long-term average value.
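A sketch of the anomaly calculation, where I assume the expectation is a long-term baseline average (the EPA's actual baseline definition may differ):

```python
from statistics import mean

# Made-up annual average temperatures, not the EPA's data.
recorded = [14.1, 14.3, 14.0, 14.6, 14.8, 15.0]

# Assume the "expected" value is the average over an early baseline period.
baseline = mean(recorded[:3])
anomalies = [t - baseline for t in recorded]

# By construction: recorded value = expected value + anomaly.
print([round(a, 2) for a in anomalies])
```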

For me, this adjustment is not necessary. Without the adjustment, the four panels can be combined into one panel with four lines. That's because the data nicely fit into four levels based on seasons.

The further adjustment I'd have liked to see is "smoothing". Each line above has a "smooth" trend, as well as some variability around this trend. The latter is not a big part of the story.

***

It's weird to push back on climate change advocacy by attacking data adjustments. The more productive direction, in my view, is to ask whether the observed trend is caused by human activities or part of some long-term up-and-down cycle. That is a very challenging question to answer.

## The efficiency of visual communications

##### Dec 12, 2023

Visual Capitalist has this wonderful chart showing the gaps between the stock market returns expected by "investors" compared to "professionals".

It's a model of clarity. The chart form is a dot plot.

The blue dots represent what investors (individuals?) expect to earn from investing in the stock market in the long run. The orange dots represent the professional viewpoint. Each row shows survey results in a different country.

At first glance, U.S. investors are vastly more optimistic than professionals. There is excess enthusiasm in most other countries as well.

The exceptions are Chile, Mexico and Singapore in which the two groups are almost perfectly aligned. The high degree of concordance in these countries makes me wonder if their investors are demographically similar to professionals.

***

Those are the first insights one can take from the dot plot, with almost no effort.

But there's more.

The global average is shown in the middle of the chart, allowing readers to compare each country against it. (It's not clear what average this represents though - maybe the average return expected by investors?)

There's more.

The chart shows what's professional about professionals. Not only do professionals hold a much more pessimistic view of stock returns in general, they also exhibit a much lower variance in expectations.

This reflects that professionals adhere to an orthodoxy - they went to the same schools, were taught from the same textbooks, took the same professional exams, and live in their own echo chambers.

Chile, Mexico and Singapore, however, stick out. For a change, the professionals share the enthusiasm of investors.

***

This chart shows the power of data visualization. So much information can be conveyed in a small space, if one designs the visual well.

## Graphics that stretch stomachs and make merry

##### Jul 26, 2023

Washington Post has a fun article about the Hot Dog Eating Contest in Coney Island here.

This graphic shows various interesting insights about the annual competition:

Joey Chestnut is the recent king of hot-dog eating. Since the late 2000s, he's dominated the competition. He typically chows down over 60 hot dogs in 10 minutes. This is shown by the yellow line. Even at that high level, Chestnut has shown steady growth over time.

The legend tells us that the chart shows the results of all the other competitors. It's pretty clear that few have been able to even get close to Chestnut all these years. Most contestants were able to swallow 30 hot dogs or fewer.

It doesn't appear that the general standard has increased over time.

In 2011, a separate competition for women started. There is also a female champion (Miki Sudo), who has won almost every competition since she started competing.

One strange feature is the lack of competition in the early years. The footnote informs us that the trend is not real - they simply did not keep records of other competitors in early contests.

The only questions I can't answer from this chart concern the general standard and the number of female competitors. The chart designer chooses not to differentiate between male and female contestants, other than the champions. I can understand that. Adding another dimension to the chart is a double-edged sword.

***

There is even more fun. There is a little video illustrating theories about what kind of human bodies can take in that many hot dogs in a short time. Here is a screen shot of it:

##### Mar 08, 2023

This dataviz project by CT Mirror is excellent. The project walks through key statistics of the state of Connecticut.

Here are a few charts I enjoyed.

The first one shows the industries employing the most CT residents. The left and right arrows are perfect, much better than the usual dot plots.

The industries are sorted by decreasing size from top to bottom, based on employment in 2019. The chosen scale is absolute, showing the number of employees. The relative change is shown next to the arrow heads in percentages.

The inclusion of both absolute and relative scales may be a source of confusion, as the lengths of the arrows encode the absolute differences, not the relative differences indicated by the data labels. This type of decision is always difficult for the designer. Selecting just one of the two scales may improve clarity, but readers may feel a sense of loss (of information).
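The distinction can be made concrete with hypothetical employment numbers for one industry (the years and values below are illustrative, not CT Mirror's data):

```python
# Hypothetical employment counts for one industry at two time points.
emp_2019, emp_2021 = 250_000, 230_000

absolute_change = emp_2021 - emp_2019          # what the arrow's length encodes
relative_change = absolute_change / emp_2019   # what the label next to it shows

print(absolute_change)
print(f"{relative_change:.0%}")
```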

***

The next example is a bumps chart showing the growth in residents with at least a bachelor's degree.

This is more like a slopegraph as it appears to draw straight lines between two time points 9 years apart, omitting the intervening years. Each line represents a state. Connecticut's line is shown in red. The message is clear. Connecticut is among the most highly educated out of the 50 states. It maintained this advantage throughout the period.

I'd prefer to use solid lines for the background states, and the axis labels can be sparser.

It's a little odd that pretty much every line has the same slope. I'm suspecting that the numbers came out of a regression model, with varying slopes by state, but the inter-state variance is low.

In the online presentation, one can click on each line to see the values.

***

The final example is a two-sided bar chart:

This shows migration in and out of the state. The red bars represent the number of people who moved out, while the green bars represent those who moved into the state. The states are arranged from the most in-migrants to the least.

I have clipped the bottom of the chart as it extends to 50 states, and the bottom half is barely visible since the absolute numbers are so small.

I'd suggest showing the top 10 states. Then group the rest of the states by region, and plot them as regions. This change makes the chart more compact, as well as more useful.

***

There are many other charts, and I encourage you to visit and support this data journalism.

## If you blink, you'd miss this axis trick

##### Jan 31, 2023

When I set out to write this post, I was intending to make a quick point about the following chart found in the current issue of Harvard Magazine (link):

This chart concerns the "tectonic shift" of undergraduates to STEM majors at the expense of humanities in the last 10 years.

I like the chart. The dot plot is great for showing this data. They placed the long text horizontally. The use of color is crucial, allowing us to visually separate the STEM majors from the humanities majors.

My intended post is to suggest dividing the chart into four horizontal slices, each showing one of the general fields. It's a small change that makes the chart even more readable. (It has the added benefit of not needing a legend box.)

***

Then, the axis announced itself.

I was baffled, then disgusted.

Here is a magnified view of the axis:

It's not a linear scale, as one would have expected. What kind of transformation did they use? It's baffling.

Notice the following features of this transformed scale:

• It can't be a log scale because many of the growth values are negative.
• The interval for 0%-25% is longer than for 25%-50%. The interval for 0%-50% is also longer than for 50%-100%. On the positive side, the larger values are pulled in and the smaller values are pushed out.
• The interval for -20%-0% is the same length as that for 0%-25%. So, the transformation is not symmetric around 0.

I have no idea what transformation was applied. I took the growth values, measured the locations of the dots, and asked Excel to fit a polynomial function; it returned a quadratic fit with R-squared above 99%.

This formula fits the values within the range extremely well. I hope this isn't the actual transformation. That would be disgusting. Regardless, they ought to have advised readers of their unusual scale.

***

Without having the fitted formula, there is no way to retrieve the actual growth values except for those that happen to fall on the vertical gridlines. Using the inverse of the quadratic formula, I deduced what the actual values were. The hardest one is for Computer Science, since the dot sits to the right of the last gridline. I checked that value against IPEDS data.

The growth values are not extreme, falling between -50% and 125%. There is nothing to be gained by transforming the scale.

The following chart undoes the transformation, and groups the majors by field as indicated above:

***

Yesterday, I published a version of this post on Andrew's blog. Several readers there figured out that the scale is the log of the relative ratio of the number of degrees granted: the position is log10(100% + x), where x is the percent change in the number of degrees between 2011 and 2021.
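That transform can be checked against the earlier observations about the axis. In particular, 0.8 and 1.25 are reciprocals, so log10(1.25) = -log10(0.8), which is exactly why the -20%-to-0% interval matches the 0%-to-25% interval:

```python
import math

def position(x):
    # Axis position under the readers' identified scale: log10(100% + x),
    # with x expressed as a fraction (0.25 for +25% growth).
    return math.log10(1 + x)

def growth(pos):
    # Inverse: recover the growth value from an axis position.
    return 10 ** pos - 1

# The -20%..0% interval has the same length as the 0%..25% interval,
# matching what was measured on the chart.
assert abs((position(0) - position(-0.20)) - (position(0.25) - position(0))) < 1e-12

# Round trip: axis positions can be decoded back into growth values.
assert abs(growth(position(-0.5)) - (-0.5)) < 1e-12
```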

Here is a side-by-side view of the two scales:

The chart on the right spreads the negative growth values further apart while slightly compressing the large positive values. I still don't think there is much benefit to transforming this set of data.

P.S. [1/31/2023]

(1) A reader on Andrew's blog asked what's wrong with using the log relative-ratio scale. What's wrong is exactly what this post is about: for any non-linear scale, the reader can't make out the values between gridlines. In the original chart, four points sit between 0% and 25%. What values are those? That chart is even harder because, now that we know what the transform is, we'd need to first think in terms of relative ratios (1.25 instead of 25%), then in terms of logs, to work out what those values are.

(2) The log scale used for change values is often said to have the advantage that equal distances on either side represent counterbalancing values. For example, (3/2) × (2/3) = 1. But this is a very specific scenario that doesn't actually apply to our dataset. Consider these scenarios:

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: # degrees went from 2000 to 3000 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while the number of Psychology degrees grew by 1000 (Psychology, I think, is the more popular major)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 1000 to 1500, i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 500
(Assume same starting values)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 666 to 666*3/2 = 999 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 333
(Assume Psychology's starting value to be History's ending value)

Psychology: # degrees went from 1000 to 1500 i.e. Relative ratio = 3/2
History: # degrees went from 1500 to 1000 i.e. Relative ratio = 2/3

The # of Psychology degrees grew by 500 while the # of History degrees dropped by 500
(Assume History's starting value to be Psychology's ending value)

## Visualizing the impossible

##### Jul 06, 2022

Note [July 6, 2022]: Typepad's image loader is broken yet again. There is no way for me to fix the images right now. They are not showing despite being loaded properly yesterday. I also cannot load new images. Apologies!

Note 2: Manually worked around the automated image loader.

Note 3: Thanks to Glenn for letting me know about the image-loading problem. It turns out the comment-approval function is also broken, so I am not able to approve comments.

***

A twitter user sent me this chart:

It's, hmm, mystifying. It performs magic, as I explain below.

What's the purpose of the gridlines and axis labels? Even if there is a rationale for printing those numbers, they make it harder, not easier, for readers to understand the chart!

I think the following chart shows the main message of this poll result. Democrats are much more likely to think of immigration as a positive compared to Republicans, with Independents situated in between.

***

The axis title gives a hint as to what the chart designer was aiming for with the unconventional axis. It reads "Overall Percentage for All Participants". It appears that the total length of the stacked bar is the weighted aggregate response rate. Roughly 17% of Americans thought this development to be "very positive", combining 8% of Republicans, 27% of Democrats and 12% of Independents. Since the three segments are not equal in size, 17% is a weighted average of the three proportions.

Within each of the three political affiliations, the data labels add to 100%. These numbers therefore are unweighted response rates for each segment. (If weighted, they should add up to the proportion of each segment.)

This sets up an impossible math problem. The three segments within each bar represent three proportions, each unweighted within its own segment. Adding these unweighted proportions does not yield the desired weighted average response rate; to get that, we need to sum the weighted segment response rates instead.
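A sketch of the arithmetic, with assumed segment weights; the chart does not disclose the shares of Republicans, Democrats and Independents in the sample:

```python
# Unweighted "very positive" response rates, read off the data labels.
rates = {"Republicans": 0.08, "Democrats": 0.27, "Independents": 0.12}

# Assumed shares of the sample; these are NOT given in the chart.
weights = {"Republicans": 0.30, "Democrats": 0.40, "Independents": 0.30}

weighted_avg = sum(rates[g] * weights[g] for g in rates)
unweighted_sum = sum(rates.values())

# The weighted average lands near the 17% on the axis only if the assumed
# weights are roughly right; stacking the unweighted rates gives 47%,
# a number with no meaningful interpretation.
print(f"weighted: {weighted_avg:.3f}, stacked: {unweighted_sum:.2f}")
```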

This impossible math problem somehow got resolved visually. We can see that each bar segment faithfully represents the unweighted response rate shown in its data label. Summing them would not yield the aggregate response rates shown in the axis title. The difference is not a simple multiplicative constant because each segment must be weighted by a different multiplier. So, your guess is as good as mine: what is the magic that makes the impossible possible?

[P.S. Another way to see this inconsistency: the sum of all the data labels is 300% because the proportions within each segment add up to 100%. At the same time, the axis title implies that the sum of the lengths of all five bars should be 100%. So, the chart asserts that 300% = 100%.]

***

This poll question is a perfect classroom fodder to discuss how wording of poll questions affects responses (something called "response bias"). Look at the following variants of the same questions. Are we likely to get answers consistent with the above question?

As you know, the demographic makeup of America is changing and becoming more diverse, while the U.S. Census estimates that white people will still be the largest race in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

***

As you know, the demographic makeup of America is changing and becoming more diverse, with the U.S. Census estimating that black people will still be a minority in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

***

As you know, the demographic makeup of America is changing and becoming more diverse, with the U.S. Census estimating that Hispanic, black, Asian and other non-white people together will be a majority in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

What is also amusing is that in the world described by the pollster 25 years from now, every race will qualify as a "minority". There will no longer be a majority since no race will constitute at least 50% of the U.S. population. By then, the word "minority" will have lost its meaning.