Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I had already planned a second post - this one - to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked height of each column should be in the hundreds, since he has added 100 to each cell. But that’s not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the YouTube video. My source is here.]
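
To see the mechanics, here is a minimal sketch in Python with matplotlib (not the Excel workflow from the video, which I could not reproduce). To get the block edges to land at -3, 5, 9, 14 and 24, you stack the successive differences on top of an invisible base, rather than stacking the raw values.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([-3, 5, 9, 14, 24])   # the five averages for London, Ontario

# Stacking the raw values (Excel's default) piles up cumulative sums:
# block tops at -3, 2, 11, 25, 49 -- meaningless, as in the chart above.
# A "range" chart instead stacks the gaps between consecutive values,
# sitting on an invisible base equal to the lowest value.
gaps = np.diff(values)                   # [8, 4, 5, 10]

fig, ax = plt.subplots()
ax.bar(0, values[0], color="none")       # invisible base block (from 0 down to -3)
bottom = values[0]
for gap in gaps:
    ax.bar(0, gap, bottom=bottom)        # each block's top edge is one data value
    bottom += gap
ax.set_xticks([0], ["London, Ontario"])
ax.set_ylabel("Temperature (C)")
plt.show()
```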

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.
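
In symbols (my notation, not the video's), the temperature on day t at hour h is roughly

\[ T_{t,h} \approx \bar{T} + s_{\text{season}(t)} + d_{\text{day/night}(h)} \]

where \( \bar{T} \) is the overall average (9 C for London, Ontario), s is the seasonal adjustment, and d is the diurnal adjustment. The five plotted values are averages of this quantity over different subsets of days and hours.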

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 if he homes in on max daily temperatures, and to 24 if he further restricts the analysis to the summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have a larger magnitude than the diurnal fluctuations. Said differently, the effect of season on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


What is this "stacked range chart"?

Long-time reader Aleksander B. sent me to this video (link), in which a Youtuber ranted that most spreadsheet programs do not make his favorite chart. This one:

Two questions immediately come to mind: a) what kind of chart is this? and b) is it useful?

Evidently, the point of the above chart is to tell readers there are (at least) three places called “London”, only one of which features red double-decker buses. He calls this a “stacked range chart”. This example has three stacked columns, one for each place called London.

What can we learn from this chart? The range of temperatures is narrowest in London, England while it is broadest in London, Ontario (Canada). The highest temperature is in London, Kentucky (USA) while the lowest is in London, Ontario.

But what kind of “range” are we talking about? Do the top and bottom of each stacked column indicate the maximum and minimum temperatures as we’ve interpreted them to be? In theory, yes, but in this example, not really.

Let’s take one step back, and think about the data. Elsewhere in the video, another version of this chart contains a legend giving us hints about the data. (It's the chart on the right of the screenshot.)

Each column contains four values: the average maximum and minimum temperatures in each place, the average maximum temperature in summer, and the average minimum temperature in winter. These metrics are mouthfuls of words, because the analyst has to describe what choices were made while aggregating the raw data.

The raw data comprise daily measurements of temperatures at each location. (To make things even more complex, there are likely multiple measurement stations in each town, and thus, the daily temperatures themselves may already be averages; or else, the analyst has picked a representative station for each town.) From this single sequence of daily data, we extract two subsequences: the maximum daily, and the minimum daily. This transformation acknowledges that temperatures fluctuate, sometimes massively, over the course of each day.

The two subsequences are then aggregated into four representative numbers. The first pair of max, min is just the averages of the respective subsequences. The remaining two numbers require even more explanation. The “summer average maximum temperature” should be the average of the max subsequence after filtering it down to the “summer” months. Thus, it’s a trimmed average of the max subsequence, or the average of the summer subsequence of the max subsequence. Since summer temperatures are the highest of the four seasons, this number stands in for the maximum of the max subsequence, but it’s not the maximum daily maximum since it’s still an average. Similarly, the “winter average minimum temperature” is another trimmed average, computed over the winter months, which is related to but not exactly the minimum daily minimum.

Thus, the full range of each column is the difference between the trimmed summer average and the trimmed winter average. I assume weather scientists use this metric instead of the full range of max to min temperature because it’s less affected by outlier values.
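
To make the aggregation concrete, here is a minimal sketch in Python with pandas, assuming a hypothetical table of daily readings with columns date, tmax and tmin (the file and column names are my invention, not the video's):

```python
import pandas as pd

# Hypothetical daily data: one row per day, columns date, tmax, tmin
df = pd.read_csv("london_daily_temps.csv", parse_dates=["date"])

# Label each day with its meteorological season
season = df["date"].dt.month.map(
    {12: "winter", 1: "winter", 2: "winter",
     3: "spring", 4: "spring", 5: "spring",
     6: "summer", 7: "summer", 8: "summer",
     9: "fall", 10: "fall", 11: "fall"})

avg_max = df["tmax"].mean()                                  # average daily maximum
avg_min = df["tmin"].mean()                                  # average daily minimum
summer_avg_max = df.loc[season == "summer", "tmax"].mean()   # "summer average maximum"
winter_avg_min = df.loc[season == "winter", "tmin"].mean()   # "winter average minimum"

print(winter_avg_min, avg_min, avg_max, summer_avg_max)
```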

***

Stepping out of the complexity, I’ll say this: what the “stacked range chart” depicts are selected values along the distribution of a single numeric data series. In this sense, this chart is a type of “boxplot”.

Here is a random one I grabbed from a search engine.

Analytica_tukeyboxplot

A boxplot, per its inventor Tukey, shows a five-number summary of a distribution: the median, the 25th and 75th percentiles, and two “whisker values”. Effectively, the boxplot shows five percentile values. The two whisker values are also percentiles, but not fixed percentiles like the 25th, 50th, and 75th. The placement of the whiskers is determined automatically by a formula that sets the threshold for outliers, which in turn depends on the shape of the data distribution. Anything contained within the whiskers is regarded as a “normal” value of the distribution, not an outlier. Any value larger than the upper whisker value, or smaller than the lower whisker value, is an outlier. (Outliers are shown individually as dots above or below the whiskers - I see this as an optional feature because it doesn't make sense to show them individually for large datasets with lots of outliers.)
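
For the record, Tukey's usual rule puts the whisker ends at the most extreme data points that still fall inside the "fences"

\[ [\, Q_1 - 1.5 \times IQR,\;\; Q_3 + 1.5 \times IQR \,], \qquad IQR = Q_3 - Q_1, \]

where Q1 and Q3 are the 25th and 75th percentiles; anything outside the fences is flagged as an outlier.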

The stacked range chart of temperatures picks off different waypoints along the distribution but in spirit, it is a boxplot.

***

This discussion leads me to the answer to our second question: is the "stacked range chart" useful?  The boxplot is indeed useful. It does a good job describing the basic shape of any distribution.

I make variations of the boxplot all the time, with different percentiles. One variation commonly seen out there replaces the whisker values with the maximum and minimum values. Thus all the data live within the whiskers. This wasn’t what Tukey originally intended but the max-min version can be appropriate in some situations.

Most statistical software makes the boxplot. Excel is the one big exception. It has always been a mystery to me why the Excel developers are so hostile to the boxplot.

 

P.S. Here is the official manual for making a box plot in Excel. I wonder if they are the leading promoter of the max-min boxplot that strays from Tukey's original. It is possible to make the original whiskers but I suppose they don't want to explain it, and it's much easier to have people compute the maximum and minimum values in the dataset.

The max-min boxplot is misleading if the dataset contains true outliers. If the maximum value is really far from the 75th percentile, then most of the data between the 75th and 100th percentile could be sitting just above the top of the box.

 

P.S. [1/9/2025] See the comments below. Steve made me realize that the color legend of the London chart actually has five labels; the last one is white, which blends into the white background. Note that, in the next post in this series, I found that I could not replicate the guy's process for producing the stacked column chart in Excel, so I went in a different direction.


the wtf moment

You're reading some article that contains a standard chart. You're busy looking for the author's message on the chart. And then, the wtf moment strikes.

It's the moment when you discover that the chart designer has done something unexpected, something that changes how you should read the chart. It's when you learn that time is running right to left, for example. It's when you realize that negative numbers are displayed up top. It's when you notice that the columns are ordered by descending y-value despite time being on the x-axis.

Tell me about your best wtf moments!

***

The latest case of the wtf moment occurred to me when I was reading Rajiv Sethi's blog post on his theory that Kennedy voters crowded out Cheney voters in the 2024 Presidential election (link). Was the strategy to cosy up to Cheney and push out Kennedy wise?

In the post, Rajiv has included this chart from Pew:

Pew_science_confidence

The chart is actually about the public's confidence in scientists. Rajiv summarizes the message as: 'Public confidence in scientists has fallen sharply since the early days of the pandemic, especially among Republicans. There has also been a shift among Democrats, but of a slightly different kind—the proportion with “a great deal” of trust in scientists to act in our best interests rose during the first few months of the pandemic but has since fallen back.'

Pew produced a stacked column chart, with three levels for each demographic segment and month of the survey. The question about confidence in scientists admits three answers: a great deal, a fair amount, and not too much/None at all. [It's also possible that they offered 4 responses, with the bottom two collapsed as one level in the visual display.]

As I scanned around the chart to understand the data, I suddenly realized that the three responses were not listed in the expected order. The top (light blue) section is the middling response of "a fair amount", while the middle (dark blue) section is the "a great deal" answer.

wtf?

***

Looking more closely, this stacked column chart has bells and whistles, indicating that the person who made it expended quite a bit of effort. Whether it's worthwhile effort is for us readers to decide.

By placing "a great deal" right above the horizon, the designer made it easier to see the trend in the proportion responding with "a great deal". It's also easy to read the trend of those picking the "negative" response because of how the columns are anchored. In effect, the designer is expressing the opinion that the middle group (which is also the most popular answer) is just background, and readers should not pay much attention to it.

The designer expects readers to care about one other trend, that of the "top 2 box" proportion. This is why sitting atop the columns are the data labels called "NET" which is the sum of those responding "a great deal" or "a fair amount".

***

For me, it's interesting to know whether the prior believers in science who lost faith went down one notch or two. Looking at the Republicans, the proportion of "a great deal" went down by roughly 10 percentage points while the proportion saying "Not too much/None at all" went up by about 13 percentage points. Thus, the shift in the middle segment wasn't enough to explain all of the jump in negative sentiment; a good portion went from believer to skeptic during the pandemic.

As for Democrats, the proportion of believers also dropped by about 10 percentage points, while the proportion saying "a fair amount" went up by almost 10 percentage points, accounting for most of the shift. The proportion of skeptics increased by about 2 percentage points.

So, for Democrats, I'm imagining a gentle slide in confidence that applies to the whole distribution while for Republicans, if someone loses confidence, it's likely straight to the bottom.

If I'm interested in the trends of all three responses, it's more effective to show the data in a panel like this:

Junkcharts_redo_pew_scientists

***

Remember to leave a comment when you hit your wtf moment next time!

 


Election coverage prompts good graphics

The election broadcasts in the U.S. are full-day affairs, and they make a great showcase for interactive graphics.

The election setting is optimal as it demands clear graphics that are instantly digestible. Anything else would have left viewers confused or frustrated.

The analytical concepts conveyed by the talking heads during these broadcasts are quite sophisticated, and the broadcasters did a wonderful job conveying them.

***

One such concept is the value of comparing statistics against a benchmark (or, even multiple benchmarks). This analytics tactic comes in handy in the 2024 election especially, because both leading candidates are in some sense incumbents. Kamala was part of the Biden ticket in 2020, while Trump competed in both 2016 and 2020 elections.

Msnbc_2024_ga_douglas

In the above screenshot, taken around 11 pm on election night, the MSNBC host (who looks like Steve K.) was searching for Kamala votes because it appeared that she was losing the state of Georgia. The question of the moment: were there enough votes left for her to close the gap?

In the graphic (first numeric column), we were seeing Kamala winning 65% of the votes, against Trump's 34%, in Douglas county in Georgia. At first sight, one would conclude that Kamala did spectacularly well here.

But, is 65% good enough? One can't answer this question without knowing past results. How did Biden-Harris do in the 2020 election when they won the presidency?

The host touched the interactive screen to reveal the second column of numbers, which allowed viewers to directly compare the results. At the time of the screenshot, with 94% of the votes counted, Kamala was performing better in this county than the Biden-Harris ticket did in 2020 (65% vs 62%). This should help her narrow the gap.

If in 2020, they had also won 65% of the Douglas county votes, then, we should not expect the vote margin to shrink after counting the remaining 6% of votes. This is why the benchmark from 2020 is crucial. (Of course, there is still the possibility that the remaining votes were severely biased in Kamala's favor but that would not be enough, as I'll explain further below.)

All stations used this benchmark; some did not show the two columns side by side, making it harder to do the comparison.

Interesting side note: Douglas county has been rapidly shifting blue in the last two decades. The proportion of whites in the county has dropped from 76% to 35% since 2000 (link).

***

Though Douglas county was encouraging for Kamala supporters, the vote gap in the state of Georgia at the time was over 130,000 in favor of Trump. The 6% in Douglas represented only about 4,500 votes (= 70,000*0.06/0.94). Even if she won all of them (extremely unlikely), it would be far from enough.

So, the host flipped to Fulton county, the most populous county in Georgia, and also a Democratic stronghold. This is where the battle should be decided.

Msnbc_2024_ga_fulton

Using the same format - an interactive version of a small-multiples arrangement - the host looked at the situation in Fulton. The encouraging sign was that 22% of the votes here had not yet been counted. Moreover, she captured 73% of those votes that had been tallied, 8 percentage points better than her performance in Douglas, Ga. So, we know that many more votes were coming in from Fulton, with the vast majority being Democratic.

But that wasn't the full story. We have to compare these statistics to our 2020 benchmark. This comparison revealed that she faced a tough road ahead. That's because Biden-Harris also won 73% of the Fulton votes in 2020. She might not earn additional votes here that could be used to close the state-wide gap.

If the 73% share held to the end of the count, she would win 90,000 additional votes in Fulton but Trump would win 33,000, so the state-wide gap should narrow by 57,000 votes. Let's round that up, and say Fulton halved Trump's lead in Georgia. But where else could she claw back the other half?
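
Working backwards from those figures (my back-of-the-envelope, not the broadcast's): the 90,000 and 33,000 imply roughly 123,000 Fulton votes still to be counted, and

\[ 0.73 \times 123{,}000 - 0.27 \times 123{,}000 \approx 90{,}000 - 33{,}000 = 57{,}000 \]

additional net votes for Kamala, against a state-wide deficit of about 135,000.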

***

From this point, the analytics can follow one of two paths, which should lead to the same conclusion. The first path runs down the list of Georgia counties. The second path goes up a level to a state-wide analysis, similar to what was done in my post on the book blog (link).

Cnn_2024_ga

Around this time, Georgia had counted 4.8 million votes, with another 12% outstanding. So, about 650,000 votes had not been assigned to any candidate. The margin was about 135,000 in Trump's favor, which amounted to 20% of the outstanding votes. But that was 20% on top of her base value of 48% share, meaning she had to claim 68% of all remaining votes. (If, in the outstanding votes, she got the same 48% share as in the votes already counted, she would lose the state by the same percentage margin, and by an even larger number of votes.)
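
The arithmetic behind those round numbers: with 4.8 million counted votes representing 88% of the expected total,

\[ \text{outstanding} \approx 4.8\text{M} \times \frac{0.12}{0.88} \approx 650{,}000, \qquad \frac{135{,}000}{650{,}000} \approx 20\%. \]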

The reason why the situation was more hopeless than it even sounded here is that the 48% base value came from the 2024 votes that had been counted; thus, for example, it included her better-than-benchmark performance in Douglas county. She would have to do even better to close the gap! In Fulton, which has the biggest potential, she was unable to push the vote share above the 2020 level.

That's why in my book blog (link), I suggested that the networks could have called Georgia (and several other swing states) earlier, if they used "numbersense" rather than mathematical impossibility as the criterion.

***

Before ending, let's praise the unsung heroes - the data analysts who worked behind the scenes to make these interactive graphics possible.

The graphics require data feeds, which cover a broad scope, from real-time vote tallies to total votes cast, both at the county level and the state level. While the focus is on the two leading candidates, any votes going to other candidates have to be tabulated, even if not displayed. The talking heads don't just want raw vote counts; in order to tell the story of the election, they need some understanding of how many votes are still to be counted, where they are coming from, what the partisan lean of those votes is, how likely the result is to deviate from past elections, and so on.

All those computations must be automated, but manually checked. The graphics software has to be reliable; the hosts can touch any part of the map to reveal details, and it's not possible to predict all of the user interactions in advance.

Most importantly, things go wrong unexpectedly on election night, so many data analysts were on standby, scrambling to fix issues like the breakage of a data feed from some county in some state.


Book review: Getting (more out of ) Graphics by Antony Unwin

Unwin_gettingmoreoutofgraphics_cover

Antony Unwin, a statistics professor at Augsburg, has published a new dataviz textbook called "Getting (more out of) Graphics", and he kindly sent me a review copy. (Amazon link)

I am - not surprisingly - in the prime audience for such a book. It covers some gaps in the market:
a) it emphasizes exploratory graphics rather than presentation graphics
b) it deals not just with designing graphics but also interpreting (i.e. reading) them
c) it covers data pre-processing and data visualization in a more balanced way
d) it develops full case studies involving multiple graphics from the same data sources

The book is divided into two parts: the first, which covers 75% of the material, details case studies, while the final quarter of the book offers "advice". The book has a GitHub page containing R code which, as I shall explain below, is indispensable to the serious reader.

Given the aforementioned design, the case studies in Unwin's book have a certain flavor: most of the data sets are relatively complex, with many variables, including a time component. The primary goal of Unwin's exploratory graphics can be stated as stimulating "entertaining discussions" about and "involvement" with the data. They are open-ended, and frequently inconclusive. This is a major departure from other data visualization textbooks on the market, and also from many of my own blog posts, where we focus on selecting a good graphic for presenting insights visually to an intended audience, without assuming domain expertise.

I particularly enjoyed the following sections: a discussion of building graphs via "layering" (starting on p. 326), enumeration of iterative improvement to graphics (starting on p. 402), and several examples of data wrangling (e.g. p.52).

Unwin_fig4.7

Unwin does not give "advice" in the typical style of do this, don't do that. His advice is fashioned in the style of an analyst. He frames and describes the issues, shows rather than tells. This paragraph from the section about grouping data is representative:

Sorting into groups gets complicated when there are several grouping variables. Variables may be nested in a hierarchy... or they may have no such structure... Groupings need to be found that reflect the aims of the study. (p. 371)

He writes down what he has done, may provide a reason for his choices, but is always understated. He sees no point in selling his reasoning.

The structure of the last part of the book, the "advice" chapters, is quite unusual. The chapter headers are: (data) provenance and quality; wrangling; colour; setting the scene (scaling, layout, etc.); ordering, sorting and arranging; what affects interpretation; and varieties of plots.

What you won't find are extended descriptions of chart forms, rules of visualization, or flowcharts tying data types to chart forms. Those are easily found online if you want them (you probably won't care if you're reading Unwin's book.)

***

For the serious reader, the book should be consumed together with the code on github. Find specific graphs from the case studies that interest you, open the code in your R editor, and follow how Unwin did it. The "advice" chapters highlight points of interest from the case studies presented earlier so you may start there, cross-reference the case studies, then jump to the code.

Unfortunately, the code is sparsely commented. So also open up your favorite chatbot, which helps to explain the code, and annotate it yourself. Unwin uses R, and in particular, lives in the "tidyverse".

To understand the data manipulation bits, reviewing the code is essential. It's hard to grasp what is being done to the data without actually seeing the datasets. There are no visuals of the datasets in the book, as the text is primarily focused on the workflow leading to a graphic. The data processing can get quite involved, as in Chapter 16.

I'm glad Unwin has taken the time to write this book and publish the code. It rewards the serious reader with skills that are not commonly covered in other textbooks. For example, I was rather amazed to find this sentence (p. 366):

To ensure that a return to a particular ordering is always possible, it is essential to have a variable with a unique value for every case, possibly an ID variable constructed for just this reason. Being able to return to the initial order of a dataset is useful if something goes wrong (and something will).

Anyone who has analyzed real-world datasets would immediately recognize this as good advice but who'd have thought to put it down in a book?


Expert handling of multiple dimensions of data

I enjoyed reading this Washington Post article about immigration in America. It features a number of graphics. Here's one graphic I particularly like:

Wpost_smallmultiplesmap

This is a small multiples of six maps, showing the spatial distribution of immigrants from different countries. The maps reveal some interesting patterns: Los Angeles is a big favorite of Guatemalans while Houston is preferred by Hondurans. Venezuelans like Salt Lake City and Denver (where there are also some Colombians and Mexicans). The breadth of the spatial distribution surprises me.

The dataset behind this graphic is complex. It's got country of origin, place of settlement, and time of arrival. The maps above collapsed the time dimension, while drawing attention to the other two dimensions.

***

They have another set of charts that highlights the time dimension while collapsing the place-of-settlement dimension. Here's one view of it:

Wpost_inkblot_overall

There are various names for this chart form; "streamgraph" is one. I like to call it "inkblot", where the two sides are symmetric around the middle vertical line. The chart shows that the number of "migrants in the U.S. immigration court system" has grown substantially since the end of the Covid-19 pandemic, during which they had stopped coming.

I'm not a fan of the inkblot. One reason is visible in the following view, which showcases three Central American countries.

Wpost_inkblot_centralamerica

The main message is clear enough. The volume of immigrants from these three countries has been relatively stable over the last decade, with a bulge in the late 2000s. The recent spurt in migrants has come from other places.

But try figuring out what proportion of total immigration is accounted for by these three countries, say, in 2024. It's a task that is tougher than it should be, and the culprit is that the "other countries" category has been split in half, with the two halves separated.

 


Adjust, and adjust some more

This Financial Times report illustrates the reason why we should adjust data.

The story explores the trend in economic statistics during 14 years of Conservative government. One of those metrics is so-called council funding, i.e. funding for local governments. The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows the change in funding relative to 2010, i.e. an index anchored at the 2010 level. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding had recovered to the 2010 level, and it expanded rapidly in recent years.

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart paints a completely different picture. The line dropped from 2010 to 2016 as before. Then, it went flat, and after 2021, it started rising, even though by 2024 the value was still 10 percent below the 2010 level.

What happened? The data journalist has taken the data from the first chart, and adjusted the values for inflation. Inflation was rampant in recent years, so some of the raw growth has been dampened. In economics, adjusting for inflation is also called expressing values in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point that, in real terms, council funding has fallen by 20 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means they performed a second adjustment, this time for population change. It is a simple adjustment of dividing by the population. The numbers look worse probably because the population has grown during these years. Thus, even if the amount of funding stayed the same, the money would have to be split amongst more people. The per-capita adjustment makes this point clear.
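
Here is a minimal sketch of the two adjustments in Python with pandas, assuming hypothetical annual series for nominal funding, a price index and population (the file and column names are my invention, not the FT's):

```python
import pandas as pd

# Hypothetical annual data, indexed by year:
# columns funding_nominal, price_index, population
df = pd.read_csv("council_funding.csv", index_col="year")

# Adjustment 1: express funding in real terms (constant 2010 prices)
df["funding_real"] = (df["funding_nominal"]
                      * df.loc[2010, "price_index"] / df["price_index"])

# Adjustment 2: express real funding per capita
df["funding_real_percap"] = df["funding_real"] / df["population"]

# Re-express each series as an index with 2010 = 100, as in the FT charts
for col in ["funding_nominal", "funding_real", "funding_real_percap"]:
    df[col + "_idx"] = 100 * df[col] / df.loc[2010, col]

print(df.filter(like="_idx"))
```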

***

The final story is much different from the initial one. Not only was the magnitude of change different, but the direction of change was reversed.

When it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also subjective. Not adjusting is usually much worse.



Prime visual story-telling

A story from the New York Times about New York City neighborhoods has been making the rounds on my Linkedin feed. The Linkedin post sends me to this interactive data visualization page (link).

Here, you will find a multi-colored map.

Nyt_newyorkneighborhoodsmap

The colors show the extent of named neighborhoods in the city. If you look closely, the boundaries between neighborhoods are blurred, since it's often not clear where one neighborhood ends and another begins. I was expecting this effect when I recognized the names of the authors, who have previously published other maps that obsess over spatial uncertainty.

I clicked on an area for which I know there may be differing opinions:

Nyt_newyorkneighborhoods_example

There was less controversy than I expected.

***

What was the dataset behind this dataviz project? How did they get such detailed data on every block of the city? Wouldn't they have to interview a lot of residents to compile the data?

I'm quite impressed with what they did. They put up a very simple survey (emphasis on: very simple). This survey is only possible with modern browser technology. It asks the respondent to pinpoint the location of where they live, and name their neighborhood. Then it asks the respondent to draw a polygon around their residence to cover the extent of the named neighborhood. This consists of a few simple mouse clicks on the map, which shows the road network. Finally, the survey collects optional information on alternative names for the neighborhood, etc.

When they process the data, they assign the respondent's neighborhood name to all blocks encircled by the polygon. This creates a lot of data in a few brush strokes, so to speak. This is a small (worthwhile) tradeoff even though the respondent didn't really give an answer for every block.
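
A minimal sketch of that processing step, in Python with shapely (the data structures are my guess at the workflow, not the Times' actual code):

```python
from shapely.geometry import Point, Polygon

# One survey response: a neighborhood name plus the polygon the respondent drew
response_name = "Greenpoint"
response_polygon = Polygon([(-73.96, 40.72), (-73.94, 40.72),
                            (-73.94, 40.74), (-73.96, 40.74)])

# City blocks, each represented here by its centroid (longitude, latitude)
blocks = {
    "block_001": Point(-73.95, 40.73),
    "block_002": Point(-73.90, 40.70),
}

# Assign the respondent's neighborhood name to every block inside the polygon
votes = {}   # block id -> list of neighborhood names "voted" for that block
for block_id, centroid in blocks.items():
    if response_polygon.contains(centroid):
        votes.setdefault(block_id, []).append(response_name)

print(votes)   # {'block_001': ['Greenpoint']}
```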

***

Bear with me, I'm getting to the gist of this blog post. The major achievement isn't the page that was linked to above. The best thing the dataviz team did here is the visual story that walks the reader through insights drawn from the dataviz. You can find the visual story here.

What are the components of a hugely impressive visual story?

  • It combines data visualization with old-fashioned archival research. The historical tidbits add a lot of depth to the story.
  • It combines data visualization with old-fashioned reporting. The quotations add context to how people think about neighborhoods - something that cannot be obtained from the arms-length process of conducting an online survey.
  • It highlights curated insights from the underlying data - even walking the reader step by step through the relevant sections of the dataviz that illustrate these insights.

At the end of this story, some fraction of users may be tempted to go back to the interactive dataviz to search for other insights, or obtain answers to their personalized questions. They are much better prepared to do so, having just seen how to use the interactive tool!

***

The part of the visual story I like best is toward the end. Instead of plotting all the data on the map, they practice some restraint, and filter the data. They show the boundaries that have reached at least a certain level of consensus among the respondents.

The following screenshot shows those areas for which at least 90% agree.

Nyt_newyorkneighborhoods_90pc

Pardon the white text box, I wasn't able to remove it.

***

One last thing...

Every time an analyst touches data, or does something with data, s/he imposes assumptions, and sometimes, these assumptions are so subtle that even the analyst may not have noticed. Frequently, these assumptions are baked into the analytical "models," which is why they may fall through the cracks.

One such assumption in making this map is that every block in the city belongs to at least one named neighborhood. An alternative assumption is that neighborhoods are named only because certain blocks have things in common, and because these naming events occur spontaneously, it's perfectly ok to have blocks that aren't part of any named neighborhood.

 

 


Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – a report that traces how one consumes a dataviz work, from where the eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend, which told me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover that the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number, 74%, which is 2 percentage points smaller. I couldn't explain how the left sections of both bars had the same width, which confirmed that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed three labels (green number 60%, blue number 44%, performance gap -16%). This bar is much longer than the one in the previous figure, even though 60% is less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference would have merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the positions of the vertical blue lines were roughly proportional to the blue numbers. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full width of the bar. In other words, the left edges of all the bars represent 0%. Meanwhile, the proportion saying business is doing well is encoded in the width of the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2
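
For readers who want to reproduce the decoded scheme, here is a minimal sketch in Python with matplotlib (the row labels and colors are mine; the numbers are the two rows discussed above):

```python
import matplotlib.pyplot as plt

# Decoded encoding: full bar = "should take responsibility" (green number),
# left section = "doing well" (blue number), right section = performance gap.
rows = [
    ("Issue from the first figure",  76, 29),
    ("Issue from the second figure", 60, 44),
]

fig, ax = plt.subplots()
for i, (label, responsibility, doing_well) in enumerate(rows):
    gap = doing_well - responsibility                       # e.g. -16
    ax.barh(i, doing_well, color="lightgray")               # left section
    ax.barh(i, -gap, left=doing_well, color="steelblue")    # right (shaded) section
    ax.plot([doing_well, doing_well], [i - 0.4, i + 0.4],
            color="navy", linewidth=3)                      # the bold vertical line
    ax.text(responsibility + 1, i, f"{gap}%", va="center")  # performance gap label
ax.set_yticks(range(len(rows)), [r[0] for r in rows])
ax.set_xlabel("% of respondents (bars start at 0%)")
plt.show()
```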

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2
Howie also said:

I interpret the authors' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


The cult of raw unadjusted data

Long-time reader Aleks came across the following chart on Facebook:

Unadjusted temp data fgfU4-ia fb post from aleks

The author attached a message: "Let's look at raw, unadjusted temperature data from remote US thermometers. What story do they tell?"

I suppose this post came from a climate change skeptic, and the story we're expected to take away from the chart is that there is nothing to see here.

***

What are we looking at, really?

"Nothing to see" probably refers to the patch of blue squares that cover the entire plot area, as time runs left to right from the 1910s to the present.

But we can't really see what's going on in the middle of the patch. So, "nothing to see" is effectively only about the top-to-bottom range of roughly 29.8 to 82.0. What does that range signify?

The blue patch is subdivided into vertical lines consisting of blue squares. Each line is a year's worth of temperature measurements. Each square is the average temperature on a specific day. The vertical range is the difference between the maximum and minimum daily temperatures in a given year. These are extreme values that say almost nothing about the temperatures in the other ~363 days of the year.

We know quite a bit more about the density of squares along each vertical line. They are broken up roughly by seasons. Those values near the top came from summers while the values near the bottom came from winters. The density is the highest near the middle, where the overplotting is so severe that we can barely see anything.

Within each vertical line, the data are not ordered chronologically. This is a key observation. From left to right, the data are ordered from earliest to latest, but not from top to bottom! Therefore, it is impossible for the human eye to trace the entire trajectory of the daily temperature readings from this chart. At best, you can trace the yearly average temperature – but only very roughly, by eyeballing where the annual averages sit inside the blue patch.

Indeed, there is "nothing to see" on this chart because its design has pulverized the data.

***

_numbersense_bookcover

In Numbersense (link), I wrote "not adjusting the raw data is to knowingly publish bad information. It is analogous to a restaurant's chef knowingly sending out spoilt fish."

It's a fallacy to think that "raw unadjusted" data are the best kind of data. It's actually the opposite. Adjustments are designed to correct biases or other problems in the data. Of course, adjustments can be subverted to introduce biases in the data as well. It is subversive to presume that all adjustments are of the subversive kind.

What kinds of adjustments are of interest in this temperature dataset?

Foremost is the seasonal adjustment. See my old post here. If we want to learn whether temperatures have risen over these decades, we can't do so without separating out the seasons.

The whole dataset can be simplified by drawing the smoothed annual average temperature grouped by season of the year, and when that is done, the trend of rising temperatures is obvious.
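
A minimal sketch of that simplification, in Python with pandas, assuming a hypothetical file of daily average temperatures with columns date and temp (names are mine, not from the Facebook post):

```python
import pandas as pd

# Hypothetical daily data: columns date, temp (one remote US station)
df = pd.read_csv("remote_station_daily.csv", parse_dates=["date"])

# Label each day with its meteorological season
season = df["date"].dt.month.map(
    {12: "winter", 1: "winter", 2: "winter",
     3: "spring", 4: "spring", 5: "spring",
     6: "summer", 7: "summer", 8: "summer",
     9: "fall", 10: "fall", 11: "fall"})

# Average temperature by season and year: one line per season
annual = (df.assign(season=season, year=df["date"].dt.year)
            .groupby(["season", "year"])["temp"].mean()
            .unstack("season"))

# Smooth each seasonal series with a 10-year rolling average
smoothed = annual.rolling(10, center=True).mean()
smoothed.plot(ylabel="Average temperature")
```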

***

The following chart by the EPA roughly implements the above:

Epa-seasonal-temperature_2022

The original can be found here. They made one adjustment which isn't the one I expected.

Note the vertical scale is titled "temperature anomaly". So, they are not plotting the actual recorded average temperatures, but the "anomalies", i.e. the difference between the recorded temperatures and some kind of "expected" temperature. This is a type of data adjustment as well. The purpose is to focus attention on the relative rather than absolute values. Think of this formula: recorded value = expected value + anomaly. The chart shows how many degrees above or below expectation, rather than how many degrees.

For a chart like this, there should be a required footnote that defines what "anomaly" is. Specifically, the reader should know about the model behind the "expectation". Typically, it's a kind of long-term average value.
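
In symbols, if the expectation is a long-term average over some reference period (the usual convention; the EPA's exact baseline should be stated in that footnote):

\[ \text{anomaly}_t = T_t - \bar{T}_{\text{reference period}} \]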

For me, this adjustment is not necessary. Without the adjustment, the four panels can be combined into one panel with four lines. That's because the data nicely fit into four levels based on seasons.

The further adjustment I'd have liked to see is "smoothing". Each line above has a "smooth" trend, as well as some variability around this trend. The latter is not a big part of the story.

***

It's weird to push back on climate change advocacy by attacking data adjustments. The more productive direction, in my view, is to ask whether the observed trend is caused by human activities or part of some long-term up-and-down cycle. That is a very challenging question to answer.