The radial is still broken

It's puzzling to me why people like radial charts. Here is a recent set of radial charts that appeared in an article in Significance magazine (link, currently paywalled), analyzing NBA basketball data.

Significance radial nba

This example is not as bad as usual (the color scheme notwithstanding) because the story is quite simple.

The analysts divided the data into three time periods: 1980-94, 1995-2015, 2016-23. The NBA seasons were summarized using a battery of 15 metrics arranged in a circle. In the first period, all but 3 of the metrics sat well above the average level (indicated by the inner circle). In the second period, all 15 metrics dropped below the average, and the third period is something of a mirror image of the first, which is the main message.
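
For concreteness, here's a minimal sketch of that kind of standardization - presumably close to what the analysts did, though the data and most metric names below are made up: each season-level metric is converted to a z-score against its all-season average, then averaged within each era.

    import numpy as np
    import pandas as pd

    # Hypothetical season-level data: one row per season, one column per metric.
    rng = np.random.default_rng(0)
    seasons = pd.DataFrame(
        {"P3": rng.normal(7, 3, 44), "P3A": rng.normal(20, 8, 44), "P2": rng.normal(30, 4, 44)},
        index=range(1980, 2024),
    )

    # Standardize each metric across all seasons (mean 0, sd 1); the inner
    # circle of the radial chart corresponds to 0, i.e. the all-period average.
    z = (seasons - seasons.mean()) / seasons.std()

    # Average the standardized metrics within each era.
    eras = pd.cut(z.index, bins=[1979, 1994, 2015, 2023],
                  labels=["1980-94", "1995-2015", "2016-23"])
    print(z.groupby(eras).mean())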

***

The puzzle: why prefer this circular arrangement to a rectangular arrangement?

Here is what the same graph looks like in a rectangular arrangement:

Junkcharts_redo_significanceslamdunkstats

One plausible justification for the circular arrangement is that the metrics can be clustered so that nearby metrics are semantically related.

Nevertheless, the same semantics carry over to a rectangular arrangement. For example, P3-P3A are three-point scores and attempts while P2-P2A are two-pointers - metrics that capture a key trend in the game. They are neighbors in this arrangement just as they are in the circular arrangement.

So the real advantage comes when the metrics have some kind of periodicity, and the wraparound point matters; or when the data are indexed to directions, so that north, east, south and west are meaningful concepts.
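
For instance, here is a minimal sketch (made-up hourly data, not from the article) of a situation where a circular axis earns its keep: the hours of the day wrap around, so hour 23 really is adjacent to hour 0.

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up hourly traffic counts; hour 23 wraps around to hour 0,
    # so a circular axis reflects the structure of the data.
    hours = np.arange(24)
    counts = 100 + 80 * np.sin((hours - 8) / 24 * 2 * np.pi) ** 2

    theta = hours / 24 * 2 * np.pi
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.plot(np.append(theta, theta[0]), np.append(counts, counts[0]))  # close the loop
    ax.set_xticks(theta[::3], labels=[f"{h}:00" for h in hours[::3]])
    ax.set_theta_zero_location("N")     # midnight at the top
    ax.set_theta_direction(-1)          # clockwise, like a clock face
    plt.show()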

If you've found other use cases, feel free to comment below.

***


I can't end this post without returning to the colors. If one can take a negative image of the original chart, one should. Notice that the colors that dominate our attention - the yellow background, and the black lines - carry no data: yellow is the canvas, and black is the gridlines. The data are found in the white polygons.

The other informative element, as one learns from the caption, is the "blue dashed line" that represents the value zero (i.e. average) on the standardized scale. Because the image was printed small in the magazine I was reading, and the designers chose a dark blue encroaching on black, I had to squint hard to find the blue line.


Adjust, and adjust some more

This Financial Times report illustrates the reason why we should adjust data.

The story explores trends in economic statistics during 14 years of Conservative government. One of those metrics is so-called council funding (money for local governments). The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows the change in funding over time, expressed as an index relative to the 2010 level. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding had recovered to its 2010 level, and in recent years it expanded rapidly.
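
To be concrete, "indexed to 2010" just means dividing every value by the 2010 value (the funding numbers below are made up, not the FT's):

    import pandas as pd

    # Hypothetical nominal council funding, in billions of pounds.
    nominal = pd.Series({2010: 50.0, 2016: 40.0, 2020: 50.0, 2024: 63.0})

    # Index relative to 2010: the 2010 value becomes 100.
    print(nominal / nominal[2010] * 100)   # 2024 shows up at 126, i.e. 26% above 2010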

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart paints a completely different picture. The line dropped from 2010 to 2016 as before. Then it went flat, and after 2021 it started rising, although by 2024 the value was still about 10 percent below the 2010 level.

What happened? The data journalist took the data from the first chart, and adjusted the values for inflation. Inflation was rampant in recent years, so some of the raw growth has been dampened. In economics, adjusting for inflation is also called expressing values in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point where, in real terms, council funding has fallen by 20 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means they performed a second adjustment, this time for population change. It is a simple adjustment: divide by the population. The numbers look worse probably because the population has grown over these years. Thus, even if the amount of funding had stayed the same, the money would have to be split among more people. The per-capita adjustment makes this point clear.
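
Extending the little sketch above, each adjustment is a single division; a hypothetical price index and population series (again, not the FT's data) are enough to show how the picture flips:

    import pandas as pd

    # Hypothetical inputs: nominal funding (billions of pounds), a price index
    # (2010 = 100), and population (millions). None of these are the FT's numbers.
    df = pd.DataFrame(
        {"nominal": [50.0, 40.0, 50.0, 63.0],
         "prices":  [100,  112,  122,  140],
         "pop":     [62.8, 65.6, 67.1, 68.3]},
        index=[2010, 2016, 2020, 2024],
    )

    # Adjustment 1: deflate by the price index to express funding in "real terms".
    df["real"] = df["nominal"] * 100 / df["prices"]

    # Adjustment 2: divide by population to get real funding per person.
    df["real_per_capita"] = df["real"] / df["pop"]

    # Each series indexed to its 2010 level, as in the FT's three charts.
    print(df.div(df.loc[2010]).mul(100).round(1))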

***

The final story is very different from the initial one. Not only is the magnitude of change different, but the direction of change is reversed.

When it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also a subjective decision. Not adjusting is usually much worse.


Excess delay

The hot topic in New York at the moment is congestion pricing for vehicles entering Manhattan, which is set to debut in June. I found this chart (link) that purports to prove the effectiveness of a similar scheme introduced in London a while back.

Transportxtra_2

This is a case of the visual fighting against the data. The visual feels very busy and yet the story lying beneath the data isn't that complex.

This chart was probably designed to accompany some text, which isn't freely available at that link, so I haven't seen it. The reader's expectation is to compare the periods before and after the introduction of congestion charges. But even the task of figuring out the pre- and post-periods takes more time than necessary. In particular, "WEZ" is not defined. (I looked it up: it's the "Western Extension Zone", so presumably they expanded the area in which charges were applied when travel rates went back to pre-charging levels.)

The one element of the graphic that raises eyebrows is the legend which screams to be read.

Transportxtra_londoncongestioncharge_legend

Why are there four colors for two items? The legend is not self-sufficient. The reader has to look at the chart itself and realize that purple marks the pre-charging period while green (and blue) mark the post-charging period (ignoring the distinction between CCZ and WEZ).

While we are solving this puzzle, we also notice that the bottom two colors are used to represent an unchanging quantity - the travel rate that defines "no congestion". This no-congestion travel rate is a constant throughout the chart, and yet a lot of ink in two colors has been spilled on it. The real story is in the excess delay, which the congestion charging scheme was supposed to reduce.
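
The arithmetic behind that framing is simple; the numbers below are made up, just to show the decomposition:

    # Hypothetical travel rates in minutes per kilometre. The uncongested rate
    # is a constant, so only the excess above it carries any information.
    UNCONGESTED_RATE = 1.9   # min/km if there were no congestion (assumed constant)

    observed_rate = {
        "2002 (pre-charging)":  4.2,
        "2003 (charge begins)": 3.3,
        "2007 (WEZ added)":     4.1,
    }

    # Excess delay = observed travel rate minus the no-congestion travel rate.
    excess_delay = {period: round(rate - UNCONGESTED_RATE, 2)
                    for period, rate in observed_rate.items()}
    print(excess_delay)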

The excess on the chart isn't harmless. The excess delay on the roads has been transferred to the chart reader. It distracts from the story the analyst wants to tell. Presumably, the story is that excess delays dropped quite a bit after congestion charging was introduced; about four years later, travel rates had crept back to pre-charging levels, whereupon the authorities responded by extending the charging zone to the WEZ (which, as of the period covered by the chart, did not appear to be bringing the travel rate down).

Instead of that story, the excess on the chart makes me wonder... the roads are still highly congested, with travel rates far above the level required for no congestion, even after the charging scheme was introduced.

***

I started removing some of the excess from the chart. Here's the first cut:

Junkcharts_redo_transportxtra_londoncongestioncharge

This is better but it is still very busy. One problem is the choice of columns, even though the data are found strictly at the top of each column. (Besides, by chopping off the unchanging sections of the columns, I created a start-not-from-zero problem.) Also, the labeling of the months leaves much to be desired, there are too many gridlines, and so on.

***

Here is the version I landed on. Instead of columns, I use lines. When lines are used, there is no need for month labels since we can assume a reader knows the structure of months within a year.

Junkcharts_redo_transportxtra_londoncongestioncharge-2

A principle I hold dear is not to use a legend unless it is absolutely required. In this case, there is no need for one. I also brought back the notion of an uncongested travel rate, with a single line (and annotation).
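
For what it's worth, here is a generic matplotlib sketch of that approach - made-up data, not my actual redo - with direct labels in place of a legend and a single reference line for the uncongested rate:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up monthly travel rates (minutes per km) for the charging zone.
    months = np.arange(72)                                   # six years of months
    rate = 4.2 - 1.1 * np.exp(-((months - 16) / 12) ** 2) + 0.008 * months
    UNCONGESTED_RATE = 1.9

    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(months, rate, color="firebrick")
    ax.axhline(UNCONGESTED_RATE, color="gray", linewidth=1)

    # Direct annotation replaces the legend.
    ax.text(months[-1] + 1, rate[-1], "travel rate", va="center", color="firebrick")
    ax.text(months[-1] + 1, UNCONGESTED_RATE, "uncongested rate", va="center", color="gray")

    # One tick per year is enough; readers know the months in between.
    ax.set_xticks(np.arange(0, 73, 12), labels=range(2002, 2009))
    ax.set_ylabel("minutes per km")
    ax.set_ylim(0, 5)
    plt.tight_layout()
    plt.show()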

***

The chart raises several questions about the underlying analysis. I'd be interested in learning more about "moving car observer" surveys. What are those? Are they reliable?

Further, for evidence of efficacy, I think the pre-charging period must be expanded to multiple years. Was 2002 a particularly bad year?

Thirdly, assuming WEZ indicates the expansion of the program to a new geographical area, I'm not sure whether the data prior to its introduction represent the travel rate including the WEZ (despite no charging there) or excluding it. Arguments can be made for either case, so the key, from a dataviz perspective, is to clarify what was actually done.

P.S. [6-6-24] On the day I posted this, the New York State Governor decided to cancel the congestion pricing scheme that was set to start at the end of June.


Prime visual story-telling

A story from the New York Times about New York City neighborhoods has been making the rounds on my LinkedIn feed. The LinkedIn post sends me to this interactive data visualization page (link).

Here, you will find a multi-colored map.

Nyt_newyorkneighborhoodsmap

The colors show the extent of named neighborhoods in the city. If you look closely, the boundaries between neighborhoods are blurred, since it's often not clear where one neighborhood ends and another begins. I was expecting this effect once I recognized the names of the authors, who have previously published other maps that obsess over spatial uncertainty.

I clicked on an area where I knew there might be differing opinions:

Nyt_newyorkneighborhoods_example

There was less controversy than I expected.

***

What was the dataset behind this dataviz project? How did they get such detailed data on every block of the city? Wouldn't they have to interview a lot of residents to compile the data?

I'm quite impressed with what they did. They put up a very simple survey (emphasis on: very simple), which is only possible with modern browser technology. It asks the respondent to pinpoint where they live, and to name their neighborhood. Then it asks the respondent to draw a polygon around their residence covering the extent of the named neighborhood; this takes a few simple mouse clicks on a map showing the road network. Finally, the survey collects optional information, such as alternative names for the neighborhood.

When they process the data, they assign the respondent's neighborhood name to all blocks encircled by the polygon. This creates a lot of data with a few brush strokes, so to speak - a small (and worthwhile) tradeoff, even though the respondent didn't literally give an answer for every block.
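
The Times hasn't published its processing code, but the assignment step could look something like this sketch (shapely, with one hypothetical response and two hypothetical block centroids):

    from shapely.geometry import Point, Polygon

    # Hypothetical block centroids (longitude, latitude).
    blocks = {
        "block_001": Point(-73.998, 40.733),
        "block_002": Point(-73.958, 40.800),
    }

    # One hypothetical survey response: a neighborhood name plus the polygon
    # the respondent drew around it.
    response = {
        "neighborhood": "Greenwich Village",
        "polygon": Polygon([(-74.008, 40.726), (-73.990, 40.726),
                            (-73.990, 40.740), (-74.008, 40.740)]),
    }

    # Every block whose centroid falls inside the polygon inherits the name -
    # one response generates answers for many blocks at once.
    votes = {}
    for block_id, centroid in blocks.items():
        if response["polygon"].contains(centroid):
            votes.setdefault(block_id, []).append(response["neighborhood"])

    print(votes)   # {'block_001': ['Greenwich Village']}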

***

Bear with me, I'm getting to the gist of this blog post. The major achievement isn't the page that was linked to above. The best thing the dataviz team did here is the visual story that walks the reader through insights drawn from the dataviz. You can find the visual story here.

What are the components of a hugely impressive visual story?

  • It combines data visualization with old-fashioned archival research. The historical tidbits add a lot of depth to the story.
  • It combines data visualization with old-fashioned reporting. The quotations add context to how people think about neighborhoods - something that cannot be obtained from the arm's-length process of conducting an online survey.
  • It highlights curated insights from the underlying data - even walking the reader step by step through the relevant sections of the dataviz that illustrate these insights.

At the end of this story, some fraction of users may be tempted to go back to the interactive dataviz to search for other insights, or obtain answers to their personalized questions. They are much better prepared to do so, having just seen how to use the interactive tool!

***

The part of the visual story I like best is toward the end. Instead of plotting all the data on the map, they practice some restraint, and filter the data. They show the boundaries that have reached at least a certain level of consensus among the respondents.

The following screenshot shows those areas for which at least 90% agree.

Nyt_newyorkneighborhoods_90pc

Pardon the white text box, I wasn't able to remove it.
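
Continuing the sketch from earlier, the consensus filter is straightforward once each block has a tally of votes (hypothetical counts below):

    from collections import Counter

    # Hypothetical tallies: every respondent whose polygon covered the block
    # contributed one vote for a neighborhood name.
    votes = {
        "block_001": ["Greenwich Village"] * 47 + ["West Village"] * 3,
        "block_002": ["SoHo"] * 30 + ["NoHo"] * 25,
    }

    # Keep a block only if its most common name reaches 90% agreement.
    consensus = {}
    for block_id, names in votes.items():
        top_name, top_count = Counter(names).most_common(1)[0]
        if top_count / len(names) >= 0.90:
            consensus[block_id] = top_name

    print(consensus)   # {'block_001': 'Greenwich Village'}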

***

One last thing...

Every time an analyst touches data, or does something with it, s/he imposes assumptions, and sometimes these assumptions are so subtle that even the analyst may not notice them. Frequently, these assumptions are baked into the analytical "models," which is why they can fall through the cracks.

One such assumption in making this map is that every block in the city belongs to at least one named neighborhood. An alternative assumption is that neighborhoods get named only when certain blocks have things in common; since these naming events occur spontaneously, it's perfectly fine to have blocks that aren't part of any named neighborhood.