One doesn't have to plot raw data

Visual Capitalist chose a treemap to show us where gold is produced (link):

Viscap_gold2023

The treemap is embedded into a brick of gold. Any treemap is difficult to read, mostly because some block are vertical, others horizontal. A rough understanding is nevertheless possible: the entire global production can be roughly divided into four parts: China plus three other Asian producers account for roughly (not quite) a quarter; "rest of the world" (i.e. all countries not individually listed) is a quarter; Russia and Australia together is again a bit less than a quarter.

***

When I look at datasets that rank countries by some metric, I'm hoping to present insights, rather than the raw data. Insights typically involve comparing countries, or sets of countries, or one country against a set of countries. So, I made the following chart that includes some of these insights I found in the gold production dataset:

Junkcharts_redo_viscap_gold2023

For example, the top 4 producers in Asia account for almost a quarter of the world's output; Canada, U.S. and Australia together also roughly produce a quarter; the rest of the world has a similar output. In Asia, China's output is about the sum of the next 3 producers, which is about the same as U.S. and Canada, which is about the same as the top 5 in Africa.

 


Lost in the middle class

Washington Post asks people what it means to be middle class in the U.S. (link; paywall)

The following graphic illustrates one type of definition, purely based on income ranges.

Wpost_middleclass

For me, this chart is more taxing to read than it appears.

It can be read column by column. Each column represents a hypotheticial annual income for a family of four. People are asked whether they consider that family lower/working class, middle class or upper class. Be careful as the increments from column to column are not uniform.

Now, what's the question again? We're primarily interested in what incomes constitute middle class.

So, we should be looking at the deep green blocks that hang in the middle of each column. It's not easy to read the proportion of middle blocks in a stacked column chart.

***

I tried separating out the three perceived income classes, using a small-multiples design.

Junkcharts_redo_wpost_middleclass

One can more directly see what income ranges are most popularly perceived as being in each income class.

***

The article also goes into alternative definitions of middle class, using more qualitative metrics, such as "able to pay all bills on time without worry". That's a whole other post.

 


Neither the forest nor the trees

On the NYT's twitter feed, they featured an article titled "These Seven Tech Stocks are Driving the Market". The first sentence of the article reads: "The S&P 500 is at an all-time high, and investors have just a handful of stocks to thank for it."

Without having seen any data, I'd surmise from that line that (a) the S&P 500 index has gone up recently, and (b) most if not all of the gain in the index can be attributed to gains in the tech stocks mentioned in the headline. (For purists, a handful is five, not seven.)

The chart accompanying the tweet is a treemap:

Nyt_magnificentseven

The treemap is possibly the most overhyped chart type of the modern era. Its use here is tangential to the story of surging market value. That's because the treemap presents a snapshot of the composition of the index, but contains nothing about the trend (change over time) of the average index value or of its components.

***

Even in representing composition, the treemap is inferior to, gasp, a pie chart. Of course, we can only use a pie chart for small numbers of components. The following illustration takes the data from the NYT chart on the Magnificent Seven tech stocks, and compares a treemap versus a pie chart side by side:

Junkcharts_redo_nyt_magnificent7

The reason why the treemap is worse is that both the width and the height of the boxes are changing while only the radius (or angle) of the pie slices is varying. (Not saying use a pie chart, just saying the treemap is worse.)

There is a reason why the designer appended data labels to each of the seven boxes. The effect of not having those labels is readily felt when our eyes reach the next set of stocks – which carry company names but not their market values. What is the market value of Berkshire Hathaway?

Even more so, what proportion of the total is the market value of Berkshire Hathaway? Indeed, if the designer did not write down 29%, it would take a bit of work to figure out the aggregate value of yellow boxes relative to the entire box!

This design sucessfully draws our attention to the structural importance of various components of the whole. There are three layers - the yellow boxes (Magnificent Seven), the gray boxes with company names, and the other gray boxes. I also like how they positioned the text on the right column.

***

Going inside the NYT article itself, we find two line charts that convey the story as told.

Here's the first one:

Nyt_magnificent7_linechart1

They are comparing the most recent stock prices with those from October 12 2022, which is identified as the previous "low". (I'm actually confused by how the most recent "low" is defined, but that's a different subject.)

This chart carries a lot of good information, even though it does not plot "all the data", as in each of the 500 S&P components individually. Over the period under analysis, the average index value has gone up about 35% while the Magnificent Seven's value have skyrocketed by 65% in aggregate. The latter accounted for 30% of the total value at the most recent time point.

If we set the S&P 500 index value in 2024 as 100, then the M7 value in 2024 is 30. After unwinding the 65% growth, the M7 value in October 2022 was 18; the S&P 500 in October 2022 was 74. Thus, the weight of M7 was 24% (18/74) in October 2022, compared to 30% now. Consequently, the weight of the other 473 stocks declined from 76% to 70%.

This isn't even the full story because most of the action within the M7 is in Nvidia, the stock most tightly associated with the current AI hype, as shown in the other line chart.

Nyt_magnificent7_linechart2

Nvidia's value jumped by 430% in that time window. From the treemap, the total current value of M7 is $12.3 b while Nvidia's value is $1.4 b, thus Nvidia is 11.4% of M7 currently. Since M7 is 29% of the total S&P 500, Nvidia is 11.4%*29% = 3% of the S&P. Thus, in 2024, against 100 for the S&P, Nvidia's share is 3. After unwinding the 430% growth, Nvidia's share in October 2022 was 0.6, about 0.8% of 74. Its weight tripled during this period of time.


The efficiency of visual communications

Visual Capitalist has this wonderful chart showing the gaps between the stock market returns expected by "investors" compared to "professionals".

Visualcapitalism_Global-Investor-Gap_11172023It's a model of clarity. The chart form is a dot plot.

The blue dots represent what investors (individuals?) expect to earn from investing in the stock market in the long run. The orange dots represent the professional viewpoint. Each row shows survey results in a different country.

At first glance, U.S. investors are vastly more optimistic than professionals. There is excess enthusiasm in most other countries as well.

The exceptions are Chile, Mexico and Singapore in which the two groups are almost perfectly aligned. The high degree of concordance in these countries makes me wonder if their investors are demographically similar to professionals.

***

Those are the first insights one can take from the dot plot, with almost no effort.

But there's more.

The global average is shown in the middle of the chart, allowing readers to compare each country against it. (It's not clear what average this represents though - maybe the average return expected by investors?)

There's more.

Junkcharts_redo_visualcapitalist_expectedreturnsgapsThe chart shows what's professional about professionals. Not only do professionals hold a much more pessimistic view of stock returns in general, they also exhibit a much lower variance in expectations.

This reflects that professionals adhere to an orthodoxy - they went to the same schools, were taught from the same textbooks, took the same professional exams, and live in their own echo chambers.

Chile, Mexico and Singapore, however, stick out. For a change, the professionals share the enthusiasm of investors.

***

This chart shows the power of data visualization. So much information can be conveyed in a small space, if one designs the visual well.


The choice to encode data using colors

NBC News published the following heatmap that shows inflation by product category in the last year or so:

Nbcnews_inflationtracker

The general story might be that inflation was rampant in airfare and electricity prices about a year ago but these prices have moderated recently, especially in airfare. Gas prices appear to have inflated far less than overall inflation during these months.

***

Now, if you're someone who cares about the magnitude of differences, not just the direction, then revisit the above statements, and you'll feel a sense of inadequacy.

When we choose to encode data in colors, we're giving up on showing magnitudes or precision. The color scale shown up top sends the message that the continuous nature of the number line is being displayed but it really isn't.

The largest value of the chart is found on the left side of the airfare row:

Nbcnews_inflationtracker_highest

The value is about 36% which strangely enough is far larger than the maximum value shown in the legend above. Even if those values align, it is still impossible to guess what values the different colors and shades in the cells map to from the legend.

***

The following small-multiples chart shows the underlying values more precisely:

Redo_junkcharts_nbcnewsinflation

I have transformed the data differently. In these line charts, the data are indexed to the first month (100) so each chart shows the cumulative change in prices from that month to the current month, for each category, compared to the overall.

The two most interesting categories are airfare and gas. Airfare has recently decreased quite drastically relative to September 2022, and thus the line is far below the overall inflation trend. Gas prices moved in reverse: they dropped in the last quarter of 2022 but have steadily risen over 2023, and in the most recent month, is tracking overall inflation.

 

 


Several tips for visualizing matrices

Continuing my review of charts that were spammed to my inbox, today I look at the following visualization of a matrix of numbers:

Masterworks_chart9

The matrix shows pairwise correlations between the returns of 16 investment asset classes. Correlation is a number between -1 and 1. It is a symmetric scale around 0. It embeds two dimensions: the magnitude of the correlation, and its direction (positive or negative).

The correlation matrix is a special type of matrix: a bit easier to deal with as the data already come “standardized”. As with the other charts in this series, there is a good number of errors in the chart's execution.

I’ll leave the details maybe for a future post. Just check two key properties of a correlation matrix: the diagonal consisting of self-correlations should contain all 1s; and the matrix should be symmetric across that diagonal.

***

For this post, I want to cover nuances of visualizing matrices. The chart designer knows exactly what the message of the chart is - that the asset class called "art" is attractive because it has little correlation with other popular asset classes. Regardless of the chart's errors, it’s hard for the reader to find the message in the matrix shown above.

That's because the specific data carrying the message sit in the bottom row (and the rightmost column). The cells in this row (and column) has a light purple color, which has been co-opted by the even lighter gray color used for the diagonal cells. These diagonal cells pop out of the chart despite being the least informative (they have the same values for all correlation matrices!)

***

Several tactics can be deployed to push the message to the fore.

First, let's bring the key data to the prime location on the chart - this is the top row and left column (for cultures which read top to bottom, left to right).

Redo_masterwork9_matrix_arttop

For all the drafts in this post, I have dropped the text descriptions of the asset classes, and replaced them with numbers so that it's easier to follow the changes. (For those who're paying attention, I also edited the data to make the matrix symmetric.)

Second, let's look at the color choice. Here, the designer made a wise choice of restricting the number of color levels to three (dark, medium and light). I retained that decision in the above revision - actually, I used four colors but there are no values in one of the four sections, therefore, effectively, only three colors appear. But let's look at what happens when the number of color levels is increased.

Redo_masterwork9_matrix_colors

The more levels of color, the more strain it puts on our processing... with little reward.

Third, and most importantly, the order of the categories affects perception majorly. I have no idea what the designer used as the sorting criterion. In step one of the fix, I moved the art category to the front but left all the other categories in the original order.

The next chart has the asset classes organized from lowest to highest average correlation. Conveniently, using this sorting metric leaves the art category in its prime spot.

Redo_masterwork9_matrix_orderbyavg

Notice that the appearance has completely changed. The new version brings out clusters in the data much more effectively. Most of the assets in the bottom of the chart have high correlation with each other.

Finally, because the correlation matrix is symmetric across the diagonal of self-correlations, the two halves are mirror images and thus redundant. The following removes one of the mirrored halves, and also removes the diagonal, leading to a much cleaner look.

Redo_masterwork9_matrix_orderbyavg_tri

Next time you visualize a matrix, think about how you sort the rows/columns, how you choose the color scale, and whether to plot the mirrored image and the diagonal.

 

 

 


Elevator shoes for column charts

Continuing my review of some charts spammed to me, I wasn’t expecting to find any interest in the following:

Masterworks_chart4

It’s a column chart showing the number of years of data available for different asset classes. The color has little value other than to subtly draw the reader’s attention to the bar called “Art,” which is the focus of the marketing copy.

Do the column heights encode the data?

The answer is no.

***

Let’s take a little journey. First I notice there is a grid behind the column chart, hanging above the baseline.

Redo_masterworks4_grid
I marked out two columns with values 50 and 25, so the second column should be exactly half the height of the first. Each column consists of two parts, the first overlapping the grid while the second connecting the bottom of the grid to the baseline. The second part is a constant for every column; I label this distance Y.  

Against the grid, the column “50” spans 9 cells while the column “25” spans 4 cells. I label the grid height X. Now, if the first column is twice the height of the second, the equation: 9X + Y = 2*(4X+Y) should hold.

The only solution to this equation is X = Y. In other words, the distance between the bottom of the grid to the baseline must be exactly the height of one grid cell if the column heights were to faithfully represent the data. Well – it’s obvious that the former is larger than the latter.

In the revision, I have chopped off the excess height by moving the baseline upwards.

Redo_masterworks4_corrected

That’s the mechanics. Now, figuring out the motivation is another matter.


Chartjunk as marketing copy

I got some spam marketing message last week. How exciting. They even use a subject line that has absolutely nothing to do with its content, baiting me to open it. And open I did, to some data graphics horrors.

The marketer promises a whole series of charts to prove that art is a great asset class for investment returns.

The very first chart already caught my full attention. It's this one:

Masterworks_chart1

It's a simple bar chart, with four values. Looks innocuous.

I'm unable to appreciate the recent trend to align bars in the middle, rather than at their bases. So I converted it to the canonical form:

Redo_masterworks_1_barchart

Do you see the problem?

The second value ($1.7 trillion) is exactly half the size of the first value ($3.4 trillion) and yet the second bar is two-thirds of the length of the first bar. So, the size of the second bar is exaggerated relative to its label – and that’s the bar displaying the market size for “art,” which is what the spammer is pitching.

The bottom pair of values share the same relationship: $0.8 trillion is exactly half of $1.6 trillion. Again, the relative lengths of those two bars are not 50% but slightly over 60%.

Redo_masterworks_1_barchart_excess

Did the designer think that the bar lengths could be customized to whatever s/he desires? This one is hard to crack.

***

The sixth chart in the series is a different kind of puzzle:

Masterworks_chart6

All three lines have the exact same labels but show different values over time.

***

And they have pie charts, of course. Take a look:

Masterworks_chart

Something went wrong here too. I'll leave it to my readers who can certainly figure it out :)

***

These charts were probably spammed to at least thousands.

 


Showing both absolute and relative values on the same chart 2

In the previous post, I looked at Visual Capitalist's visualization of the amount of uninsured deposits at U.S. banks. Using a stacked bar chart, I placed both absolute and relative values on the same chart.

In making that chart, I made these three tradeoffs.

First, I elevated absolute values (dollar amounts) over relative values (proportions). The original designer decided the opposite.

Second, I elevated the TBTF banks over the smaller banks. The original designer also decided the opposite.

Third, I elevated the total value over the disaggregated values (insured, uninsured). The original designer only visualized the uninsured values in the bars.

Which chart is better depends on what story one wants to tell.

***
For today's post, I'm showing another sketch of the same data, with the same goal of putting both absolute and relative values on the same chart.

Redo_visualcapitalist_uninsureddeposits_2b

The starting point of this sketch is the original chart - the stacked bar chart showing relative proportions. I added the insured portion so that it is on almost equal footing as the uninsured portion of the deposits. This edit is crucial to convey the impression of proportions.

My story hasn't changed; I still want to elevate the TBTF banks.

For this version, I try a different way of elevating TBTF banks. The key step is to encode data into the heights of the bars. I use these bar heights to convey the relative importance of banks, as reflected by total deposits.

The areas of the red blocks represent the uninsured amounts. That said, it's not easy to compare rectangular areas when both dimensions are different.

Comparing the total red area with the total yellow area, we learn that the majority of deposits in these banks are uninsured(!)

 


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K.  They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles lost most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank has multiples of the absolute dollars of uninsured deposits as the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out of the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allow readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.