Chart without an axis

When it comes to global warming, most reports cite a single number such as an average temperature rise of Y degrees by year X. Most reports also claim the existence of a consensus among scientists. The Guardian presented the following chart that shows the spread of opinions amongst the experts.

Guardian_globalwarming

Experts were asked how many degrees they expect average global temperature to increase by 2100. The estimates ranged from "below 1.5 degrees" to "5 degrees or more". The most popular answer was 2.5 degrees. Roughly three out of four respondents picked a number at 2.5 degrees or above. The distribution is close to symmetric around the middle.

***

What kind of chart is this?

It's a type of histogram, given that the horizontal axis shows binned ranges of temperature change while the vertical axis shows number of respondents (out of 380).

A (count) histogram typically encodes the count data in the vertical axis. Did you notice there isn't a vertical axis?

That's because the chart has an unconventional vertical axis. Each of the 380 respondents is shown here as a cell. What looks like a "column" is actually two-dimensional. Each row of cells has 10 slots. To find out how many respondents chose the 2.5 degrees Celsius category, you count the number of full rows, multiply by 10, and then add the number of stray items on top. (It's 132.)

Only the top row of cells can be partially filled so the general shape of the distribution isn't affected much. However, the lack of axis labels makes it hard to learn the count of each column.

It's even harder to learn the proportions of respondents, which should be the primary message of the chart. The proportions could have been shown if the maximum number of rows were set to 38, since 38 rows of 10 slots hold all 380 respondents. The maximum number of rows on the above chart is 22. Using 38 rows leads to a chart with a lot of white space, as the tallest column (count of 132) is roughly 35% of the total responses.
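The cell arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the split of 132 into 13 full rows plus 2 stray cells is inferred from the count and the 10-slot rows rather than read off the chart.

```python
SLOTS_PER_ROW = 10        # from the chart: each row of cells has 10 slots
TOTAL_RESPONDENTS = 380   # from the chart


def count_from_cells(full_rows, stray_cells, slots_per_row=SLOTS_PER_ROW):
    """Recover a column's count from its full rows plus the partial top row."""
    return full_rows * slots_per_row + stray_cells


def rows_needed(count, slots_per_row=SLOTS_PER_ROW):
    """Number of rows (including a partial top row) needed to hold `count` cells."""
    return -(-count // slots_per_row)  # ceiling division


# Assumed split: 13 full rows and 2 stray cells give the stated count of 132.
count_25 = count_from_cells(13, 2)
share_25 = count_25 / TOTAL_RESPONDENTS

# Rows that would let a full-height column stand for 100% of respondents.
rows_for_everyone = rows_needed(TOTAL_RESPONDENTS)

print(count_25, round(share_25, 3), rows_for_everyone)
```

With a 38-row scale, the height of any column divided by 38 rows is the proportion of respondents, which is what the 22-row chart cannot show.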

In the end, I'm not sure this variant of histogram beats the standard histogram.


One doesn't have to plot raw data

Visual Capitalist chose a treemap to show us where gold is produced (link):

Viscap_gold2023

The treemap is embedded into a brick of gold. Any treemap is difficult to read, mostly because some blocks are vertical, others horizontal. A rough understanding is nevertheless possible: the entire global production can be roughly divided into four parts: China plus three other Asian producers account for roughly (not quite) a quarter; "rest of the world" (i.e. all countries not individually listed) is a quarter; Russia and Australia together are again a bit less than a quarter.

***

When I look at datasets that rank countries by some metric, I'm hoping to present insights, rather than the raw data. Insights typically involve comparing countries, or sets of countries, or one country against a set of countries. So, I made the following chart that includes some of these insights I found in the gold production dataset:

Junkcharts_redo_viscap_gold2023

For example, the top 4 producers in Asia account for almost a quarter of the world's output; Canada, U.S. and Australia together also roughly produce a quarter; the rest of the world has a similar output. In Asia, China's output is about the sum of the next 3 producers, which is about the same as U.S. and Canada, which is about the same as the top 5 in Africa.

 


Aligning V and Q by way of D

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

Vitamins

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.

***

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts shows this encoding:

Junkcharts_redo_msn_dairyplantmilks_2

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is virtually no difference between the four "milk" types in providing vitamin D.
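The change of indexing is a one-line transformation. Here is a minimal sketch in Python, using only the riboflavin index values quoted earlier (oat at 100%, soy at 80% of oat, dairy at 60% of oat); the function name `reindex` is made up for this illustration, and almond is left out because its value isn't quoted.

```python
def reindex(values, baseline):
    """Express every value as a multiple of the chosen baseline category."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


# Riboflavin, indexed to the group maximum (oat), as in the original chart.
riboflavin_vs_max = {"oat": 1.00, "soy": 0.80, "dairy": 0.60}

# The same data, indexed to dairy, as in the redo.
riboflavin_vs_dairy = reindex(riboflavin_vs_max, "dairy")

for name, index in riboflavin_vs_dairy.items():
    print(f"{name}: {index:.2f} x dairy")
```

Indexed to dairy, oat and soy come out above 1 while dairy sits at exactly 1, which is what lets the redo color columns by whether they are above or below dairy.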

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values. 

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

Junkcharts_redo_msn_dairyplantmilks_1

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.


Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – it's a reading report that traces how one consumes a dataviz work from where your eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number being 74%, 2 percentage points lower. I couldn't explain how the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed labels (green number 60%, blue number 44%, performance gap -16%). This bar is much bigger than the one in the previous figure, even though 60% was less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference that would have been merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the vertical blue lines were roughly in proportion. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.
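The encoding I eventually decoded can be written down as a small Python sketch. The function and key names are made up for illustration; the numbers are the labels quoted above (76% and 29% in the first figure, 60% and 44% with a gap of -16% in the second).

```python
def bar_sections(should_pct, doing_well_pct):
    """Split one bar into its parts under the decoded encoding.

    The full bar starts at 0% and runs to the green number; the blue line
    sits at the blue number; the shaded right section is what remains.
    """
    return {
        "full_width": should_pct,                   # green number
        "left_width": doing_well_pct,               # blue number
        "right_width": should_pct - doing_well_pct, # shaded section
        "performance_gap": doing_well_pct - should_pct,  # label on the bar
    }


first_figure_row = bar_sections(76, 29)
second_figure_row = bar_sections(60, 44)

print(first_figure_row)
print(second_figure_row)
```

The right-section width and the performance gap are the same quantity with opposite signs, which is why the gap labels sit on the shaded section.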

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2

Howie also said:

I interpret the authors' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


What's a histogram?

Almost all graphing tools make histograms, and almost all dataviz books cover the subject. But I've always felt there are many unanswered questions. In my talk this Thursday in NYC, I'll provide some answers. You can reserve a spot here.

***

Here's the most generic histogram:

Salaries_count_histogram

Even Excel can make this kind of histogram. Notice that we have counts in the y-axis. Is this really a useful chart?

I haven't ever found this type of histogram useful, since I don't do analyses in which I need to know the exact count of something - when I analyze data, I'm generalizing from the observed sample to a larger group.

Speaking of Excel, I've always felt that the developers hate histograms. Why is it much harder to make histograms than other basic charts?

***

Another question. We often think of histograms as a crude approximation to a probability density function (PDF). An example of a PDF is the famous bell curve. Textbooks sometimes show the concept like this:

Histogram_normal_pdf

This is true of only some types of histograms (and not the one shown in the first section!). Instead, we often face the following situation:

Normals_histogram50_undercurve

This isn't a trick. The data in the histogram above were generated by sampling the pink bell curve.
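One way a histogram and its generating curve can end up on different scales is sketched below in Python. It is an illustration under stated assumptions, not a reconstruction of the chart above: the sample size of 50, the 16 bins between -4 and 4, and the helper names are all made up, and the post doesn't say which scaling the chart used.

```python
import math
import random


def histogram_counts(data, lo, hi, n_bins):
    """Count how many data points fall into each of n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        if lo <= x < hi:
            index = min(int((x - lo) / width), n_bins - 1)  # guard against rounding
            counts[index] += 1
    return counts, width


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


random.seed(0)                                    # arbitrary seed, for repeatability
sample = [random.gauss(0, 1) for _ in range(50)]  # assumed sample size
counts, width = histogram_counts(sample, -4, 4, 16)

# Count scale: bar heights add up to the number of points in range.
# Density scale: bar areas (height times bin width) add up to the in-range share.
densities = [c / (len(sample) * width) for c in counts]

print("tallest count bar:  ", max(counts))
print("tallest density bar:", round(max(densities), 3))
print("peak of the curve:  ", round(normal_pdf(0), 3))
```

Only the density-scaled bars are measured in the same units as the bell curve; count bars (and bars showing proportions) live on a different vertical scale unless the bin width happens to line them up.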

***

If you've used histograms, you probably also have run into strange issues. I haven't found much material out there to address these questions, and they have been lingering in my mind, hidden, for a long time.

My Thursday talk will hopefully fill in some of these gaps.


My talk next week on histograms

Next Thursday (March 14), I'll be presenting at the Data Visualization New York Meetup, hosted by Naomi and Cameron. The event is in-person at Datadog's office. You can reserve your spot here.

Kfung_dataviznewyorkmeetup_mar2024

This talk is brand new, based on some work inspired by a blog post by Andrew Gelman. One of Andrew's correspondents asked about a particular type of histogram. While exploring this topic, I filled some of my own gaps in knowledge about this deceptively simple chart form. I'll be sharing this story.

Bits and pieces have appeared before on my blog. See this, this, and this for background.

If you're attending the talk, come up and say hi.

To register, click here.


Lost in the middle class

Washington Post asks people what it means to be middle class in the U.S. (link; paywall)

The following graphic illustrates one type of definition, purely based on income ranges.

Wpost_middleclass

For me, this chart is more taxing to read than it appears.

It can be read column by column. Each column represents a hypothetical annual income for a family of four. People are asked whether they consider that family lower/working class, middle class or upper class. Be careful as the increments from column to column are not uniform.

Now, what's the question again? We're primarily interested in what incomes constitute middle class.

So, we should be looking at the deep green blocks that hang in the middle of each column. It's not easy to read the proportion of middle blocks in a stacked column chart.

***

I tried separating out the three perceived income classes, using a small-multiples design.

Junkcharts_redo_wpost_middleclass

One can more directly see what income ranges are most popularly perceived as being in each income class.

***

The article also goes into alternative definitions of middle class, using more qualitative metrics, such as "able to pay all bills on time without worry". That's a whole other post.

 


The art of making simple things harder

It's no longer a shock when a TV network such as MSNBC plays fast and loose with the scaling of the column heights, as in this recent example:

Rachelbitecofer_markp_2024candidatescashonhand

Hat tip to Mark P. for forwarding the image, and Rachel for the original tweet.

***

What's shocking is that the designer appears to believe that the column heights of a column chart can be determined without reference to the data.

There is not a single relationship that has been retained on this chart. The designer just picks whatever size column is desired.

One obvious distortion is between the Biden and Trump columns. Trump's number is about 1/3 of Biden's (40 vs 120), and yet the red column's height is 70% of the blue's.

Furthermore, amongst the red columns, the heights are also haphazard. Trump's number is almost 3 times Haley's, yet Trump's column is almost 4 times as tall. Haley's number is just a tad higher than DeSantis's and yet Haley's column is twice the height of DeSantis's.

Junkcharts_msnbc_candidatecash_analysis

***

There is a further, subtle distortion of the columns' widths. By curving the chart canvas, certain columns are widened more than others. The diagram above retains the distorted widths and you can see that the DeSantis column is wider than Haley's.

Here is what the undistorted column chart looks like:

Junkcharts_redo_msnbc_candidatecash

It's easy to make such a chart in Excel or any charting software, so it's a mystery why this type of distortion happens. Did the designer open up an empty canvas and start putting up columns of any size?


Elevator shoes for column charts

Continuing my review of some charts spammed to me, I wasn’t expecting to find any interest in the following:

Masterworks_chart4

It’s a column chart showing the number of years of data available for different asset classes. The color has little value other than to subtly draw the reader’s attention to the bar called “Art,” which is the focus of the marketing copy.

Do the column heights encode the data?

The answer is no.

***

Let’s take a little journey. First I notice there is a grid behind the column chart, hanging above the baseline.

Redo_masterworks4_grid

I marked out two columns with values 50 and 25, so the second column should be exactly half the height of the first. Each column consists of two parts: the first overlaps the grid while the second connects the bottom of the grid to the baseline. The second part is a constant for every column; I label this distance Y.

Against the grid, the column “50” spans 9 cells while the column “25” spans 4 cells. I label the height of one grid cell X. Now, if the first column is twice the height of the second, the equation: 9X + Y = 2*(4X+Y) should hold.

Expanding the right side gives 9X + Y = 8X + 2Y, so the only solution to this equation is X = Y. In other words, the distance from the bottom of the grid to the baseline must be exactly the height of one grid cell if the column heights were to faithfully represent the data. It’s obvious from the chart that the former is larger than the latter.
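The same check can be run numerically. A minimal Python sketch, with made-up function names: the 9 and 4 cell counts come from the grid, while the offset of 2 cell heights is a hypothetical value chosen only to show what happens when Y is larger than X.

```python
def column_height(cells, cell_height, offset):
    """Total column height: the part over the grid plus the constant offset Y."""
    return cells * cell_height + offset


def height_ratio(cell_height, offset):
    """Ratio of the "50" column (9 cells) to the "25" column (4 cells)."""
    tall = column_height(9, cell_height, offset)
    short = column_height(4, cell_height, offset)
    return tall / short


faithful = height_ratio(cell_height=1.0, offset=1.0)   # Y equals X
stretched = height_ratio(cell_height=1.0, offset=2.0)  # hypothetical: Y is twice X

print(faithful, round(stretched, 3))
```

When Y equals X the ratio is exactly 2, matching 50 versus 25. When Y exceeds X the ratio drops below 2, so every shorter column looks taller than it should relative to the taller ones.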

In the revision, I have chopped off the excess height by moving the baseline upwards.

Redo_masterworks4_corrected

That’s the mechanics. Now, figuring out the motivation is another matter.


Chartjunk as marketing copy

I got some spam marketing message last week. How exciting. They even use a subject line that has absolutely nothing to do with its content, baiting me to open it. And open I did, to some data graphics horrors.

The marketer promises a whole series of charts to prove that art is a great asset class for investment returns.

The very first chart already caught my full attention. It's this one:

Masterworks_chart1

It's a simple bar chart, with four values. Looks innocuous.

I'm unable to appreciate the recent trend to align bars in the middle, rather than at their bases. So I converted it to the canonical form:

Redo_masterworks_1_barchart

Do you see the problem?

The second value ($1.7 trillion) is exactly half the size of the first value ($3.4 trillion) and yet the second bar is two-thirds of the length of the first bar. So, the size of the second bar is exaggerated relative to its label – and that’s the bar displaying the market size for “art,” which is what the spammer is pitching.

The bottom pair of values share the same relationship: $0.8 trillion is exactly half of $1.6 trillion. Again, the relative lengths of those two bars are not 50% but slightly over 60%.
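To put a number on the exaggeration, one can divide the ratio of the bar lengths by the ratio of the labels. A minimal Python sketch: the function name is made up, and the 0.6 used for the bottom pair is a lower bound standing in for "slightly over 60%".

```python
def length_distortion(larger_value, smaller_value, length_ratio):
    """How much the smaller bar is inflated relative to the larger bar.

    length_ratio is the smaller bar's length divided by the larger bar's.
    A result of 1.0 means the bars are faithful to their labels.
    """
    value_ratio = smaller_value / larger_value
    return length_ratio / value_ratio


# Top pair: $3.4 trillion vs $1.7 trillion, bars drawn at roughly two-thirds.
top_pair = length_distortion(3.4, 1.7, 2 / 3)

# Bottom pair: $1.6 trillion vs $0.8 trillion, bars drawn at a bit over 60%.
bottom_pair = length_distortion(1.6, 0.8, 0.6)

print(round(top_pair, 2), round(bottom_pair, 2))
```

Under these readings, the "art" bar is drawn about a third longer than its label warrants, and the smaller bar of the bottom pair at least a fifth longer.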

Redo_masterworks_1_barchart_excess

Did the designer think that the bar lengths could be customized to whatever s/he desires? This one is hard to crack.

***

The sixth chart in the series is a different kind of puzzle:

Masterworks_chart6

All three lines have the exact same labels but show different values over time.

***

And they have pie charts, of course. Take a look:

Masterworks_chart

Something went wrong here too. I'll leave it to my readers who can certainly figure it out :)

***

These charts were probably spammed to thousands of people, at least.