
Best chart I have seen this year

Marvelling at this chart:

 

***

The credit ultimately goes to a Reddit user (account deleted). I first saw it in this nice piece of data journalism by my friends at System 2 (link). They linked to Visual Capitalism (link).

There are so many things on this one chart that make me smile.

The animation. The message of the story is an aging population. Average age is moving up. This uptrend is clear from the chart, as the bulge of the population pyramid migrates upward.

The trend happens to be slow, and that gives the movement a mesmerizing, soothing effect.

Other items on the chart are synced to the time evolution: the year label at the top, the year labels on the right side of the chart, and the counts of total population at the bottom.

OMG, it even gives me average age, and life expectancy, and how those statistics are moving up as well.

Even better, the designer adds useful context to the data: look at the names of the generations paired with the birth years.

This chart is also an example of dual axes that work. Age, birth year and current year are connected to each other, and given two of the three, the third is fixed. So even though there are two vertical axes, there is only one scale.
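The one-scale claim comes down to a line of arithmetic. Here is a minimal sketch, assuming the chart uses the simple relation age = current year - birth year (ignoring whether the birthday has passed). The function name and the tick values are my own illustration, not taken from the chart.

```python
def birth_year_for_age(current_year: int, age: int) -> int:
    """Map a tick on the age axis to the matching tick on the birth-year axis."""
    return current_year - age


# Assumed tick marks on the left (age) axis.
age_ticks = [0, 20, 40, 60, 80, 100]

# For any one frame of the animation, the right axis is the left axis relabeled.
frame_year = 2021
birth_year_ticks = [birth_year_for_age(frame_year, a) for a in age_ticks]
print(birth_year_ticks)
```

Because the mapping is a fixed shift, equal gaps on the age axis are equal gaps on the birth-year axis, which is why the two axes never disagree.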

The only thing I'm not entirely convinced about is placing the scroll bar at the very top. It's a redundant piece that belongs in a less prominent part of the chart.


Think twice before you spiral

After Nathan at FlowingData sang the praises of the following chart, a debate ensued on Twitter, as others disliked it.

Nyt_spiral_covidcases

The chart was printed in an opinion column in the New York Times (link).

I have found few uses for spiral charts, and this example has not changed my mind.

The canonical time-series chart is like this:

Junkcharts_redo_nyt_covidcasesspiral_1

 

***

The area chart takes no effort to understand. We can see when the peaks occurred. We notice that the current surge is already double the last peak seen a year ago.

It's instructive to trace how one gets from the simple area chart to the spiral chart.

Junkcharts_redo_nyt_covidcasesspiral_2

Step 1 is to center the area on the zero line, instead of using the zero line as the bottom edge of the area. While this technique frequently makes for a more pleasant visual (because of our preference for symmetry), it actually makes it harder to see the trend over time. Effectively, any change is split in half between the upper and lower edges, which is why the envelope of the area is less sharp.
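To see the halving in numbers, here is a small sketch with a made-up series (not the NYT data). It compares the step-to-step change of the top edge under the two layouts.

```python
def centered_envelope(values):
    """Return the (lower, upper) edges of an area centered on the zero line."""
    upper = [v / 2 for v in values]
    lower = [-v / 2 for v in values]
    return lower, upper


cases = [50, 100, 250, 125, 500]  # hypothetical values, for illustration only
lower, upper = centered_envelope(cases)

# Change in the top edge when the area sits on the zero line.
baseline_steps = [b - a for a, b in zip(cases, cases[1:])]
# Change in the top edge when the area is centered.
centered_steps = [b - a for a, b in zip(upper, upper[1:])]

print(baseline_steps)
print(centered_steps)
```

Every rise or fall in the centered version is exactly half of its baseline counterpart, with the other half going to the mirrored lower edge.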

Junkcharts_redo_nyt_covidcasesspiral_3

In Step 2, I massively compress the vertical scale. That's because when you plot a spiral, you are forced to fit each cycle of data into a much shorter range. Such compression causes the year-on-year doubling of cases to appear less dramatic. (Actually, the aspect ratio is devastated because while the vertical scale is hugely compressed, the horizontal scale is dramatically stretched out due to the curled-up design.)

Junkcharts_redo_nyt_covidcasesspiral_4

Step 3 may elude your attention. If you simply curl up the compressed, centered area chart, you don't get the spiral chart. The key is to ask about the radius of the spiral. As best I can tell, the radius has no meaning; it is gradually increased so that each year of data has its own "orbit". What would the change in radius translate to on our non-circular chart? It should mean that the center of the area is gradually lifted away from the zero line. On the right chart, I mimic this effect. (I only measured the change in radius every 3 months, so the change is more angular than displayed in the spiral chart.) The problem I have with this step is that it serves no purpose, while it complicates cognition.

In Step 4, just curl up the object into a ball based on aligning months of the year.
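The steps can be put together in a rough sketch of how one day of data might be placed on such a spiral. This is my reconstruction, not the NYT's code. The function names, the parameter defaults, and the choice to measure the angle clockwise from 12 o'clock are all assumptions.

```python
import math


def spiral_band(day_index, cases, base_radius=1.0, growth_per_year=1.0,
                cases_per_unit=300_000, days_per_year=365):
    """Return (angle, inner_radius, outer_radius) for one day of data.

    All parameter defaults are illustrative assumptions.
    """
    years_elapsed = day_index / days_per_year
    # Step 4: the position within the year sets the angle, so the same
    # calendar day lands on the same spoke in every orbit.
    angle = 2 * math.pi * (years_elapsed % 1.0)
    # Step 3: the center line drifts outward over time; it encodes no data.
    center = base_radius + growth_per_year * years_elapsed
    # Steps 1 and 2: the band is centered on that line and compressed.
    half_width = cases / cases_per_unit / 2
    return angle, center - half_width, center + half_width


def to_xy(angle, radius):
    """Convert to Cartesian, measuring the angle clockwise from 12 o'clock."""
    return radius * math.sin(angle), radius * math.cos(angle)


a0, inner0, outer0 = spiral_band(0, 150_000)
a1, inner1, outer1 = spiral_band(365, 150_000)
print(a0 == a1)          # same spoke, one year apart
print(inner1 - inner0)   # the whole band has moved outward
```

Note that the only place the case count enters is `half_width`. The outward drift depends on time alone, which is the sense in which the radius carries no information.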

Junkcharts_redo_nyt_covidcasesspiral_5

This is the point when I realized I missed a Step 2B. I carefully aligned the scales of both charts so that the 150K cases shown in the legend on the right have the same vertical representation as on the left. This exposes a severe horizontal rescaling. The length of the horizontal axis on the left chart is many times smaller than the circumference of the spiral! That's why I said earlier that one of the biggest features of this spiral chart is that it imposes a dubious aspect ratio, one that is extremely wide and extremely short.
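A back-of-envelope sketch gives a feel for the stretch. It compares the length of a spiral's center line with a straight time axis as wide as the spiral's footprint. The two turns, the starting radius, and the growth per turn are assumed values, not measurements from the NYT chart.

```python
import math


def spiral_length(turns=2.0, base_radius=1.0, growth_per_turn=1.0, steps=10_000):
    """Approximate the length of the spiral's center line with short chords."""
    total = 0.0
    prev = None
    for i in range(steps + 1):
        theta = 2 * math.pi * turns * i / steps
        r = base_radius + growth_per_turn * theta / (2 * math.pi)
        point = (r * math.cos(theta), r * math.sin(theta))
        if prev is not None:
            total += math.dist(prev, point)
        prev = point
    return total


length = spiral_length()
# Diameter of the outermost orbit: a straight axis this wide would fill the
# same footprint on the page (an assumption made for comparison only).
footprint_width = 2 * (1.0 + 1.0 * 2.0)
ratio = length / footprint_width
print(round(ratio, 1))
```

Under these assumptions the curled-up time axis is roughly four times as long as a straight one occupying the same width, and more orbits would push the ratio higher.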

As usual, think twice before you spiral.

 

 


Visual design is hard, brought to you by NYC subway

This poster showed up in a NY subway train recently.

Rootin-sm

Visual design is hard!

What is the message? The intention is, of course, to say Rootine is better than others. (That's the Q corner, if you're following the Trifecta Checkup.)

What is the visual telling us (V corner)? It says Rootine is yellow while Others are purple. What do these colors mean? There is no legend to help decipher them. And yellow-purple doesn't have a canonical interpretation (unlike, say, red-green). In theory, purple can be better than yellow.

The other mystery is the black dot on the fifth item. (This is the NYC subway so the poster could have been vandalized.) It could mean "diet + lifestyle analyzed" is a unique feature of Rootine, not available on any other platform. That implies purple means available but not as effective, which significantly lessens the impact of the chart.

***

Finally, let's imagine the data that may exist to support this chart.

The aggregation of all competitors to "Others" imposes a major challenge. If yellow means yes, and purple means no, we'd expect few if any purple dots because across all competitors, there is a good chance that at least one of them has a particular feature.

Next, I'm dubious about the claim of "precision dosed, unique to you". I'm imagining they are selling some kind of medicine or health food, which can be "dosed". Predictive modelers like to market their models as "personalized," unique to each person but such a thing is impractical. Before you start using their products, they have no data on you, or your response to those products. How could the recommendation be "precision dosed, unique to you"?

Even if you've used the product for a while, it will be tough to achieve a good level of optimality with so little data. In fact, given that your past data are used to generate actions intended to improve your health - that is to say, to cause the future data to diverge from the past data, how do you know that any change you observe next period is caused by the actions you took? The pre-post difference is affected by both temporal shifts and the actions you've taken. If the next period's metric improves, you may want to believe that the actions worked. If the next period's metric declines, are you willing to conclude that the actions you took backfired?
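A toy calculation shows why the pre-post difference alone cannot settle the question. It assumes the simplest possible model, in which the observed change is the sum of a temporal shift and the effect of your actions. The numbers are invented.

```python
def pre_post_difference(temporal_shift, action_effect):
    """Observed change between periods under an assumed additive model."""
    return temporal_shift + action_effect


# Three hypothetical worlds that all produce the same observed improvement.
actions_did_nothing = pre_post_difference(temporal_shift=3.0, action_effect=0.0)
actions_worked = pre_post_difference(temporal_shift=0.0, action_effect=3.0)
actions_backfired = pre_post_difference(temporal_shift=5.0, action_effect=-2.0)

print(actions_did_nothing, actions_worked, actions_backfired)
```

One observed number, two unknowns: without a comparison group or some other handle on the temporal shift, all three stories fit the data equally well.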

"Formulas improve with you". This makes me more worried than relieved.

***

Problems like these can be solved by showing our work to others. Sometimes, we're so immersed in our own world that we don't see we have left off key information.

 

 


Start at zero, or start at wherever

Andrew's post about start-at-zero helps me refine my own thinking on this evergreen topic.

The specific example he gave is this one:

Andrewgelman_invitezeroin

The dataset is a numeric variable (y) with values over time (x). The minimum numeric value is around 3 and the range of values is from around 3 to just above 20. His advice is "If zero is in the neighborhood, invite it in". (Link)

The rule, as usual, sounds simpler than it really is. In the discussion, Andrew highlights several considerations.

Is zero a meaningful reference value? In his example, we assume it is and so we invite zero in. But, as Andrew also says, if zero is meaningless, then recall the invitation. So context must be accounted for.
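To make the rule concrete, here is one way it might be written down as a helper that picks the lower limit of the vertical axis. Andrew gives no numeric definition of "neighborhood", so the threshold below (zero counts as nearby when the gap from zero to the minimum is at most half the data range) is purely my assumption, as are the function name and the sample values.

```python
def y_axis_lower_limit(values, zero_is_meaningful, neighborhood=0.5):
    """Pick the lower limit of the y-axis.

    `neighborhood` is an assumed threshold: zero is treated as nearby when
    the minimum sits within that fraction of the data range above zero.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    zero_is_nearby = 0 <= lo <= neighborhood * span
    if zero_is_meaningful and zero_is_nearby:
        return 0.0  # invite zero in
    return lo       # otherwise, start at the data


# Made-up values spanning roughly the range in Andrew's example (3 to ~20).
series = [3.0, 8.0, 12.0, 20.5]
print(y_axis_lower_limit(series, zero_is_meaningful=True))
print(y_axis_lower_limit(series, zero_is_meaningful=False))
```

The `zero_is_meaningful` flag is the part no formula can supply. It has to come from the context of the data.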

In Chapter 1 of Numbersense (link), I looked at some SAT score data of applicants to competitive colleges. Is zero a meaningful reference value for SAT scores? Someone might argue yes, since it is the theoretical minimum score that anyone could get on the test. Any statistician will likely say no, since a competitive college will never have seen an applicant submitting a score of zero, or anywhere close to zero. Thus, starting such a chart at zero inserts a lot of whitespace and draws attention to a useless insight - how far above the theoretical worst performer someone's score is.

***

What about the left panel of Andrew's chart makes us uncomfortable? I ask myself this question. My answer is that the horizontal axis highlights an arbitrary value that distracts from the key patterns of the data.

As shown below, the arbitrary value is ~2.5. This is utterly meaningless.

Redo_andrewgelman_invitezeroin

What if 0 is also a meaningless value for this dataset? I'd recommend "bench the axis". Like this:

Redo_andrewgelman_benchtheaxis

An axis is a tool to help readers understand a chart. If it isn't serving a function, an axis doesn't need to be there. When I choose a line chart for time-series data, I'm drawing attention to temporal change in the numeric values, or the range of values. I'm not saying something about the values relative to some reference number.

From this example, we also see that the horizontal axis should not be regarded as a hanger for time labels. Time labels can exist by themselves.