Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – it's a reading report that traces how one consumes a dataviz work from where your eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number being 74%, 2 percent smaller. I couldn't explain how the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed labels (green number 60%, blue number 44%, performance gap -16%). This bar is much bigger than the one in the previous figure, even though 60% was less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference that would have been merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the vertical blue lines were roughly in proportion. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2
Howie also said:

I interpret the authros' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


The cult of raw unadjusted data

Long-time reader Aleks came across the following chart on Facebook:

Unadjusted temp data fgfU4-ia fb post from aleks

The author attached a message: "Let's look at raw, unadjusted temperature data from remote US thermometers. What story do they tell?"

I suppose this post came from a climate change skeptic, and the story we're expected to take away from the chart is that there is nothing to see here.

***

What are we looking at, really?

"Nothing to see" probably refers to the patch of blue squares that cover the entire plot area, as time runs left to right from the 1910s to the present.

But we can't really see what's going on in the middle of the patch. So, "nothing to see" is effectively only about the top-to-bottom range of roughly 29.8 to 82.0. What does that range signify?

The blue patch is subdivided into vertical lines consisting of blue squares. Each line is a year's worth of temperature measurements. Each square is the average temperature on a specific day. The vertical range is the difference between the maximum and minimum daily temperatures in a given year. These are extreme values that say almost nothing about the temperatures in the other ~363 days of the year.

We know quite a bit more about the density of squares along each vertical line. They are broken up roughly by seasons. Those values near the top came from summers while the values near the bottom came from winters. The density is the highest near the middle, where the overplotting is so severe that we can barely see anything.

Within each vertical line, the data are not ordered chronologically. This is a very key observation. From left to right, the data are ordered from earliest to latest but not from top to bottom! Therefore, it is impossible for the human eye to trace the entire trajectory of the daily temperature readings from this chart. At best, you can trace the yearly average temperature – but only extremely roughly by eyeballing where the annual averages are inside the blue patch.

Indeed, there is "nothing to see" on this chart because its design has pulverized the data.

***

_numbersense_bookcoverIn Numbersense (link), I wrote "not adjusting the raw data is to knowingly publish bad information. It is analogous to a restaurant's chef knowingly sending out spoilt fish."

It's a fallacy to think that "raw unadjusted" data are the best kind of data. It's actually the opposite. Adjustments are designed to correct biases or other problems in the data. Of course, adjustments can be subverted to introduce biases in the data as well. It is subversive to presume that all adjustments are of the subversive kind.

What kinds of adjustments are of interest in this temperature dataset?

Foremost is the seasonal adjustment. See my old post here. If we want to learn whether temperatures have risen over these decades, we can't do so without separating out the seasons.

The whole dataset can be simplified by drawing the smoothed annual average temperature grouped by season of the year, and when that is done, the trend of rising temperatures is obvious.

***

The following chart by the EPA roughly implements the above:

Epa-seasonal-temperature_2022

The original can be found here. They made one adjustment which isn't the one I expected.

Note the vertical scale is titled "temperature anomaly". So, they are not plotting the actual recorded average temperatures, but the "anomalies", i.e. the difference between the recorded temperatures and some kind of "expected" temperature. This is a type of data adjustment as well. The purpose is to focus attention on the relative rather than absolute values. Think of this formula: recorded value = expected value + anomaly. The chart shows how many degrees above or below expectation, rather than how many degrees.

For a chart like this, there should be a required footnote that defines what "anomaly" is. Specifically, the reader should know about the model behind the "expectation". Typically, it's a kind of long-term average value.

For me, this adjustment is not necessary. Without the adjustment, the four panels can be combined into one panel with four lines. That's because the data nicely fit into four levels based on seasons.

The further adjustment I'd have liked to see is "smoothing". Each line above has a "smooth" trend, as well as some variability around this trend. The latter is not a big part of the story.

***

It's weird to push back on climate change advocacy by attacking data adjustments. The more productive direction, in my view, is to ask whether the observed trend is caused by human activities or part of some long-term up-and-down cycle. That is a very challenging question to answer.


To a new year of pleasant surprises

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

Visualcapitalist_01_Generative_AI_World_map sm

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson of where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to comprehend what they could have meant by that line. I believe this is what that sentence means: Google is not the only search engine out there so by only researching Google search volume, they undercount the true search volume. How did they deal with the missing data problem? They "scaled up" so if Google is 80% of the search volume in a country, then they divide the Google volume by 80% to "scale up" to 100%.

Whenever we use heuristics like this, we should investigate its foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong assumptions – if true, the market would have no need for more than one search engine. Each search engine's audience is just a random sample from the population of all users.

Let's make up some numbers. Let's say Google has 80% share of search volume in Country A, and AI-related search 10% of the overall Google search volume. The remaining search engines have 20% share. Scaling up here means taking the 8% of Google AI-related search volume, divide by 80%, which yields 10%. Since Google owns 8% of the 10%, the other search engines see 2% of overall search volume attributed to AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has 20% share of Country B's search volume. AI-related search on Google is 2%, which is 10% of its total. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume in the dominant search engines in Country B to be also 10%.

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.

What's also true? The less dominant Google is, the smaller proportion of the total data the analysts are able to see, the lower the quality of the available information. So the heuristic is the most influential where it has the greatest uncertainty.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.


Why some dataviz fail

Maxim Lisnic's recent post should delight my readers (link). Thanks Alek for the tip. Maxim argues that charts "deceive" not merely by using visual tricks but by a variety of other non-visual means.

This is also the reasoning behind my Trifecta Checkup framework which looks at a data visualization project holistically. There are lots of charts that are well designed and constructed but fail for other reasons. So I am in agreement with Maxim.

He analyzed "10,000 Twitter posts with data visualizations about COVID-19", and found that 84% are "misleading" while only 11% of the 84% "violate common design guidelines". I presume he created some kind of computer program to evaluate these 10,000 charts, and he compiled some fixed set of guidelines that are regarded as "common" practice.

***

Let's review Maxim's examples in the context of the Trifecta Checkup.

_trifectacheckup_image

The first chart shows Covid cases in the U.S. in July and August of 2021 (presumably the time when the chart was published) compared to a year ago (prior to the vaccination campaign).

Maxim_section1

Maxim calls this cherry-picking. He's right - and this is a pet peeve of mine, even with all the peer-reviewed scientific research. In my paper on problems with observational studies (link), my coauthors and I call for a new way forward: researchers should put their model calculations up on a website which is updated as new data arrive, so that we can be sure that the conclusions they published apply generally to all periods of time, not just the time window chosen for the publication.

Looking at the pair of line charts, readers can quickly discover its purpose, so it does well on the Q(uestion) corner of the Trifecta. The cherry-picking relates to the link between the Question and the Data, showing that this chart suffers from subpar analysis.

In addition, I find that the chart also misleads visually - the two vertical scales are completely different: the scale on the left chart spans about 60,000 cases while on the right, it's double the amount.

Thus, I'd call this a Type DV chart, offering opportunities to improve in two of the three corners.

***

The second chart cited by Maxim plots a time series of all-cause mortality rates (per 100,000 people) from 1999 to 2020 as columns.

The designer does a good job drawing our attention to one part of the data - that the average increase in all-cause mortality rate in 2020 over the previous five years was 15%. I also like the use of a different color for the pandemic year.

Then, the designer lost the plot. Instead of drawing a conclusion based on the highlighted part of the data, s/he pushed a story that the 2020 rate was about the same as the 2003 rate. If that was the main message, then instead of computing a 15% increase relative to the past five years, s/he should have shown how the 2003 and 2020 levels are the same!

On a closer look, there is a dashed teal line on the chart but the red line and text completely dominate our attention.

This chart is also Type DV. The intention of the designer is clear: the question is to put the jump in all-cause mortality rate in a historical context. The problem lies again with subpar analysis. In fact, if we take the two insights from the data, they both show how serious a problem Covid was at the time.

When the rate returned to the level of 2003, we have effectively gave up all the gains made over 17 years in a few months.

Besides, a jump in 15% from year to year is highly significant if we look at all other year-to-year changes shown on the chart.

***

The next section concerns a common misuse of charts to suggest causality when the data could only indicate correlation (and where the causal interpretation appears to be dubious). I may write a separate post about this vast topic in the future. Today, I just want to point out that this problem is acute with any Covid-19 research, including official ones.

***

I find the fourth section of Maxim's post to be less convincing. In the following example, the tweet includes two charts, one showing proportion of people vaccinated, and the other showing the case rate, in Iceland and Nigeria.

Maxim_section4

This data visualization is poor even on the V(isual) corner. The first chart includes lots of countries that are irrelevant to the comparison. It includes the unnecessary detail of fully versus partially vaccinated, unnecessary because the two countries selected are at two ends of the scale. The color coding is off sync between the two charts.

Maxim's critique is:

The user fails to account, however, for the fact that Iceland had a much higher testing rate—roughly 200 times as high at the time of posting—making it unreasonable to compare the two countries.

And the section is titled "Issues with Data Validity". It's really not that simple.

First, while the differential testing rate is one factor that should be considered, this factor alone does not account for the entire gap. Second, this issue alone does not disqualify the data. Third, if testing rate differences should be used to invalidate this set of data, then all of the analyses put out by official sources lauding the success of vaccination should also be thrown out since there are vast differences in testing rates across all countries (and also across different time periods for the same country).

One typical workaround for differential testing rate is to look at deaths rather than cases. For the period of time plotted on the case curve, Nigeria's cumulative death per million is about 1/8th that of Iceland. The real problem is again in the Data analysis, and it is about how to interpret this data casually.

This example is yet another Type DV chart. I'd classify it under problems with "Casual Inference". "Data Validity" is definitely a real concern; I just don't find this example convincing.

***

The next section, titled "Failure to account for statistical nuance," is a strange one. The example is a chart that the CDC puts out showing the emergence of cases in a specific county, with cases classified by vaccination status. The chart shows that the vast majority of cases were found in people who were fully vaccinated. The person who tweeted concluded that vaccinated people are the "superspreaders". Maxim's objection to this interpretation is that most cases are in the fully vaccinated because most people are fully vaccinated.

I don't think it's right to criticize the original tweeter in this case. If by superspreader, we mean people who are infected and out there spreading the virus to others through contacts, then what the data say is exactly that most such people are fully vaccinated. In fact, one should be very surprised if the opposite were true.

Indeed, this insight has major public health implications. If the vaccine is indeed 90% effective at stopping cases, we should not be seeing that level of cases. And if the vaccine is only moderately effective, then we may not be able to achieve "herd immunity" status, as was the plan originally.

I'd be reluctant to complain about this specific data visualization. It seems that the data allow different interpretations - some of which are contradictory but all of which are needed to draw a measured conclusion.

***
The last section on "misrepresentation of scientific results" could use a better example. I certainly agree with the message: that people have confirmation bias. I have been calling this "story-first thinking": people with a set story visualize only the data that support their preconception.

However, the example given is not that. The example shows a tweet that contains a chart from a scientific paper that apparently concludes that hydroxychloroquine helps treat Covid-19. Maxim adds this study was subsequently retracted. If the tweet was sent prior to the retraction, then I don't think we can grumble about someone citing a peer reviewed study published in Lancet.

***

Overall, I like Maxim's message. In some cases, I think there are better examples.

 

 


Visual story-telling: do you know or do you think?

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.

***

Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with stuff they think they know, but they don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.

Latimes_lifeexpectancy_postcovid

The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?

***

The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map has heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to its politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.

Junkcharts_lifeexpectancy_elections

Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it isn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When placed next to each other, it's obvious that politics don't explain the variance in life expectancy well. The Midwest is deep red and yet they have above average life expectancies. I have circled out various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.

***

The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.

Junkcharts_lifeexpectancy_income

A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.

***

The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.

Junkcharts_lifeexpectancy_covidcases

The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.

***

What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.

 


Parsons Student Projects

I had the pleasure of attending the final presentations of this year's graduates from Parsons's MS in Data Visualization program. You can see the projects here.

***

A few of the projects caught my eye.

A project called "Authentic Food in NYC" explores where to find "authentic" cuisine in New York restaurants. The project is notable for plowing through millions of Yelp reviews, and organizing the information within. Reviews mentioning "authentic" or "original" were extracted.

During the live presentation, the student clicked on Authentic Chinese, and the name that popped up was Nom Wah Tea Parlor, which serves dim sum in Chinatown that often has lines out the door.

Shuyaoxiao_authenticfood_parsons

Curiously, the ranking is created from raw counts of authentic reviews, which favors restaurants with more reviews, such as restaurants that have been operating for a longer time. It's unclear what rule is used to transfer authenticity from reviews to restaurants: does a single review mentioning "authentic" qualify a restaurant as "authentic", or some proportion of reviews?

Later, we see a visualization of the key words found inside "authentic" reviews for each cuisine. Below are words for Chinese and Italian cuisines:

Shuyaoxiao_authenticcuisines_parsons_words

These are word clouds with a twist. Instead of encoding the word counts in the font sizes, she places each word inside a bubble, and uses bubble sizes to indicate relative frequency.

Curiously, almost all the words displayed come from menu items. There isn't any subjective words to be found. Algorithms that extract keywords frequently fail in the sense that they surface the most obvious, uninteresting facts. Take the word cloud for Taiwanese restaurants as an example:

Shuyaoxiao_authenticcuisines_parsons_taiwan

The overwhelming keyword found among reviews of Taiwanese restaurants is... "taiwanese". The next most important word is "taiwan". Among the remaining words, "886" is the name of a specific restaurant, "bento" is usually associated with Japanese cuisine, and everything else is a menu item.

Getting this right is time-consuming, and understandably not a requirement for a typical data visualization course.

The most interesting insight is found in this data table.

Shuyaoxiao_authenticcuisines_ratios

It appears that few reviewers care about authenticity when they go to French, Italian, and Japanese restaurants but the people who dine at various Asian restaurants, German restaurants, and Eastern European restaurants want "authentic" food. The student concludes: "since most Yelp reviewers are Americans, their pursuit of authenticity creates its own trap: Food authenticity becomes an americanized view of what non-American food is."

This hits home hard because I know what authentic dim sum is, and Nom Wah Tea Parlor it ain't. Let me check out what Yelpers are saying about Nom Wah:

  1. Everything was so authentic and delicious - and cheap!!!
  2. Your best bet is to go around the corner and find something more authentic.
  3. Their dumplings are amazing everything is very authentic and tasty!
  4. The food was delicious and so authentic, and the staff were helpful and efficient.
  5. Overall, this place has good authentic dim sum but it could be better.
  6. Not an authentic experience at all.
  7. this dim sum establishment is totally authentic
  8. The onions, bean sprouts and scallion did taste very authentic and appreciated that.
  9. I would skip this and try another spot less hyped and more authentic.
  10. I would have to take my parents here the next time I visit NYC because this is authentic dim sum.

These are the most recent ten reviews containing the word "authentic". Seven out of ten really do mean authentic, the other three are false friends. Text mining is tough business! The student removed "not authentic" which helps. As seen from above, "more authentic" may be negative, and there may be words between "not" and "authentic". Also, think "not inauthentic", "people say it's authentic, and it's not", etc.

One thing I learned from this project is that "authentic" may be a synonym for "I like it" when these diners enjoy the food at an ethnic restaurant. I'm most curious about what inauthentic onions, bean sprouts and scallion taste like.

I love the concept and execution of this project. Nice job!

***

Another project I like is about tourism in Venezuela. The back story is significant. Since a dictatorship took over the country, the government stopped reporting tourism statistics. It's known that tourism collapsed, and that it may be gradually coming back in recent years.

This student does not have access to ready-made datasets. But she imaginatively found data to pursue this story. Specifically, she mentioned grabbing flight schedules into the country from the outside.

The flow chart is a great way to explore this data:

Ibonnet_parsons_dataviz_flightcities

A map gives a different perspective:

Ibonnet_parsons_dataviz_flightmap

I'm glad to hear the student recite some of the limitations of the data. It's easy to look at these visuals and assume that the data are entirely reliable. They aren't. We don't know that what proportion of the people traveling on those flights are tourists, how full those planes are, or the nationalities of those on board. The fact that a flight originated from Panama does not mean that everyone on board is Panamanian.

***

The third project is interesting in its uniqueness. This student wants to highlight the effect of lead in paint on children's health. She used the weight of lead marbles to symbolize the impact of lead paint. She made a dress with two big pockets to hold these marbles.

Scherer_parsons_dataviz_leaddress sm

It's not your standard visualization. One can quibble that dividing the marbles into two pockets doesn't serve a visualziation purpose, and so on. But at the end, it's a memorable performance.


Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

Ctmirror_highschools

This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is 11 years from 2010 to 2021. The column of numbers show the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what looks to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so because if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35% percent point improvement! This is a dissatisfactory feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate by 35% over the decade. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

Redo_junkcharts_ctmirrorhighschoolsgraduation_1

The right panel shows the group of schools with the next higher level of graduation rates in 2010. This group of schools too increased their graduation rates almost always. The rate of improvement in this group is lower than in the previous group of schools.

The next set of charts show school districts that already achieved excellent graduation rates (over 85%) by 2010. The most interesting group of schools consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups. The majority of districts did even better while others regressed.

Redo_junkcharts_ctmirrorhighschoolsgraduation_2

Overall, there is less variability than I'd expect in the top two school groups. They generally appeared to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts are moving within a few percentages.)

One more note about the charts: The trend lines are "smoothed" to focus on the trends rather than the year to year variability. Because of smoothing, there is some awkward-looking imprecision e.g. the end-to-end differences read from the curves versus the observed differences in the data. These discrepancies can easily be fixed if these charts were to be published.


Thoughts on Daniel's fix for dual-axes charts

I've taken a little time to ponder Daniel Z's proposed "fix" for dual-axes charts (link). The example he used is this:

Danielzvinca_dualaxes_linecolumn

In that long post, Daniel explained why he preferred to mix a line with columns, rather than using the more common dual lines construction: to prevent readers from falsely attributing meaning to crisscrossing lines. There are many issues with dual-axes charts, which I won't repeat in this post; one of their most dissatisfying features is the lack of connection between the two vertical scales, and thus, it's pretty easy to manufacture an image of correlation when it doesn't exist. As shown in this old post, one can expand or restrict one of the vertical axes and shift the line up and down to "match" the other vertical axis.

Daniel's proposed fix retains the dual axes, and he even restores the dual lines construction.

Danielzvinca_dualaxes_estimatedy

How is this chart different from the typical dual-axes chart, like the first graph in this post?

Recall that the problem with using two axes is that the designer could squeeze, expand or shift one of the axes in any number of ways to manufacture many realities. What Daniel effectively did here is selecting one specific way to transform the "New Customers" axis (shown in gray).

His idea is to run a simple linear regression between the two time series. Think of fitting a "trendline" in Excel between Revenues and New Customers. Then, use the resulting regression equation to compute an "estimated" revenues based on the New Customers series. The coefficients of this regression equation then determines the degree of squeezing/expansion and shifting applied to the New Customers axis.

The main advantage of this "fix" is to eliminate the freedom to manufacture multiple realities. There is exactly one way to transform the New Customers axis.

The chart itself takes a bit of time to get used to. The actual values plotted in the gray line are "estimated revenues" from the regression model, thus the blue axis values on the left apply to the gray line as well. The gray axis shows the respective customer values. Because we performed a linear fit, each value of estimated revenues correspond to a particular customer value. The gray line is thus a squeezed/expanded/shifted replica of the New Customers line (shown in orange in the first graph). The gray line can then be interpreted on two connected scales, and both the blue and gray labels are relevant.

***

What are we staring at?

The blue line shows the observed revenues while the gray line displays the estimated revenues (predicted by the regression line). Thus, the vertical gaps between the two lines are the "residuals" of the regression model, i.e. the estimation errors. If you have studied Statistics 101, you may remember that the residuals are the components that make up the R-squared, which measures the quality of fit of the regression model. R-squared is the square of r, which stands for the correlation between Customers and the observed revenues. Thus the higher the (linear) correlation between the two time series, the higher the R-squared, the better the regression fit, the smaller the gaps between the two lines.

***

There is some value to this chart, although it'd be challenging to explain to someone who has not taken Statistics 101.

While I like that this linear regression approach is "principled", I wonder why this transformation should be preferred to all others. I don't have an answer to this question yet.

***

Daniel's fix reminds me of a different, but very common, chart.

Forecastvsactualinflationchart

This chart shows actual vs forecasted inflation rates. This chart has two lines but only needs one axis since both lines represent inflation rates in the same range.

We can think of the "estimated revenues" line above as forecasted or expected revenues, based on the actual number of new customers. In particular, this forecast is based on a specific model: one that assumes that revenues is linearly related to the number of new customers. The "residuals" are forecasting errors.

In this sense, I think Daniel's solution amounts to rephrasing the question of the chart from "how closely are revenues and new customers correlated?" to "given the trend in new customers, are we over- or under-performing on revenues?"

Instead of using the dual-axes chart with two different scales, I'd prefer to answer the question by showing this expected vs actual revenues chart with one scale.

This does not eliminate the question about the "principle" behind the estimated revenues, but it makes clear that the challenge is to justify why revenues is a linear function of new customers, and no other variables.

Unlike the dual-axes chart, the actual vs forecasted chart is independent of the forecasting method. One can produce forecasted revenues based on a complicated function of new customers, existing customers, and any other factors. A different model just changes the shape of the forecasted revenues line. We still have two comparable lines on one scale.

 

 

 

 

 


Getting simple charts right

Ian K. submitted this chart on Twitter:

Iankos_chicagocops

The chart comes from a video embedded in this report (link) about Chicago cops leaving their jobs.

Let's start with the basics. This is an example of a simple line chart illustrating a time series of five observations. The vertical axis starts at 10,000 instead of 0. With this choice, the designer wants to focus on the point-to-point change in values, rather than its relation to the initial value.

Every graph has add-ons that assist cognition. On this chart, we have axis labels, gridlines and data labels. Every add-on increases reading time so we should be sparing.

First consider the gridlines. In the following chart, I conduct a self-sufficiency test by removing the data labels from the chart:

Redo_wgn9chicagocops_junkcharts_selfsufficiency

You can see that the last three values present no problems. The first two, especially the first value, are hard to read - because the top gridline is missing! The next chart restores the bounding gridline, so you can see the difference that one small detail can make:

Redo_wgn9chicagocops_junkcharts_addedgridline

***

Next, let's compare the following versions of the chart. The left one contains data labels without gridlines and axis labels. The right one has the gridlines and axis labels but no data labels.

Redo_wgn9chicagocops_gridlinesdatalabels

The left chart prints the entire dataset onto the chart. The reader in essence is reading the raw data. That appears to be the intention of the chart designer as the data labels are in large size, placed inside shiny white boxes. The level of the boxes determines the reader's perception as those catch more of our attention than the dots that actually represent the data.

The right chart highlights the dots and the lines between them. The gridlines are way too thick and heavy so as to distract rather than abet. This chart presumes that the reader isn't that interested in the precise numbers as she is in the trend.

***

As Ian pointed out, one of the biggest problems with this chart is the appearance of even time intervals when all except one of the date values are January. This seemingly innocent detail destroys the chart. The line segments of the chart encodes the pre-post change in the staffing numbers. For most of the line segments, the metric is year-on-year change but the last two line segments on the right show something else: a 19-month change, followed by a 5-month change.

I did the following analysis to understand how big of a staffing problem CPD faces.

Redo_wgn9chicagocops_trendanalysis
First I restored the January 2022 time value, while shifting the Aug 2022 value to its rightful place on the time axis. Next, I added the dashed brown line, which represents a linear extension of the trend seen between January 2020-2021, before the sudden dip. We don't know what the true January 2022 value is but the projected value based on past trend is around 12,200. By August, the projected value is around 11,923, about 300 above the actual value of 11,611. By January 2023, the projected value is almost exactly the same as the actual value.

This linear trending analysis is likely too simplistic but it offers a baseline to start thinking about what the story is. The long-term trend is still down but the apparent dip in 2022 may not be meaningful.

 

 


Longest life, shortest length

Racetrack charts refuse to die. For old time's sake, here is a blog post from 2005 in which I explain why they don't make good dataviz.

Our latest example comes from Visual Capitalist (link), which publishes a fair share of nice dataviz. In this infographics, they feature a racetrack chart, just because the topic is the lifespan of cars.

Visualcapitalist_lifespan_cars_top

The whole infographic has four parts, each a racetrack chart. I'll focus on the first racetrack chart (shown above), which deals with the product category of sedans and hatchbacks.

The first thing I noticed is the reference value of 100,000 miles, which is described as the expected lifespan of a typical car made in the 1970s. This is of dubious value since the top of the page informs us the current relevant reference value is 200,000 miles, which is unlabeled. We surmise that 200,000 miles is indicated by the end of the grey sections of the racetrack. (This is eventually confirmed in the next racettrack chart for SUVs in the second sectiotn of the infographic.)

Now let's zoom in on the brown section of the track. Each of the four sections illustrates the same datum = 100,000 miles and yet they exhibit different lengths. From this, we learn that the data are not encoded in the lengths of these tracks -- but rather the data are to be found in the angle sustained at the centre of the concentric circles. The problem with racetrack charts is that readers are drawn to the lengths of the tracks rather than the angles at the center, which are not explicitly represented.

The Avalon model has the longest life span on this chart, and yet it is shown as the shortest curve.

***

The most baffling part of this chart is not the visual but the analysis methodology.

I quote:

iSeeCars analyzed over 2M used cars on the road between Jan. and Oct. 2022. Rankings are based on the mileage that the top 1% of cars within each model obtained.

According to this blurb, the 245,710 miles number for Avalon is the average mileage found in the top 1% of Avalons within the iSeeCars sample of 2M used cars.

The word "lifespan" strikes me as incorporating a date of death, and yet nothing in the above text indicates that any of the sampled cars are at end of life. The cars they really need are not found in their sample at all.

I suppose taking the top 1% is meant to exclude younger cars but why 1%? Also, this sample completely misses the cars that prematurely died, e.g. the cars that failed after 100,000 miles but before 200,000 miles. This filtering also ensures that newer models are excluded from the sample.

_trifectacheckup_imageIn the Trifecta Checkup, this qualifies as Type DV. The dataset does not answer the question of concern while the visual form distorts the data.