## Aligning V and Q by way of D

##### Apr 08, 2024

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.
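To make the encoding concrete, here is a minimal Python sketch of how an index relative to the group maximum can be turned into a square's side length. The percentages are the ones I read off the chart above. The almond riboflavin value and the 100-pixel maximum side are placeholders I made up for illustration, and `square_side` is a hypothetical helper, not anything from the Post's graphics code.

```python
import math

# Index values relative to each category's maximum, in percent.
# Almond riboflavin is only described as negligible, so 2 is a placeholder.
index_to_max = {
    "potassium": {"dairy": 100, "soy": 72, "oat": 45, "almond": 45},
    "riboflavin": {"oat": 100, "soy": 80, "dairy": 60, "almond": 2},
}

MAX_SIDE = 100.0  # assumed side length (in pixels) of the largest square


def square_side(index_pct, max_side=MAX_SIDE):
    """Side length such that the square's AREA is proportional to the index."""
    return max_side * math.sqrt(index_pct / 100.0)


for vitamin, milks in index_to_max.items():
    for milk, pct in milks.items():
        print(f"{vitamin:10s} {milk:6s} index={pct:3d} side={square_side(pct):5.1f}")
```

Because the data sit in the areas, a value that is a quarter of the maximum gets a square whose side is half as long, which is part of why square and bubble plots are hard to read precisely.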

***

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts show this encoding:

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is virtually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values.
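The stop at the D corner amounts to a change of baseline. The sketch below, in Python, re-expresses the max-indexed values quoted earlier so that dairy equals 100. The numbers are the approximate percentages read off the original chart, and `reindex_to_baseline` is a name I made up for this illustration.

```python
# Index values relative to each category's maximum (percent), as read off the chart.
index_to_max = {
    "potassium": {"dairy": 100, "soy": 72, "oat": 45, "almond": 45},
    "riboflavin": {"dairy": 60, "soy": 80, "oat": 100},
}


def reindex_to_baseline(values, baseline="dairy"):
    """Rescale so the baseline equals 100; ratios between items are preserved."""
    base = values[baseline]
    return {name: 100.0 * v / base for name, v in values.items()}


index_to_dairy = {vit: reindex_to_baseline(v) for vit, v in index_to_max.items()}

for vitamin, milks in index_to_dairy.items():
    for milk, idx in milks.items():
        if idx > 100:
            direction = "above dairy"
        elif idx < 100:
            direction = "below dairy"
        else:
            direction = "baseline"
        print(f"{vitamin:10s} {milk:6s} {idx:6.1f} ({direction})")
```

Nothing about the raw data changes. Only the denominator does, and with dairy as the reference, values above and below 100 answer the question directly.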

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.

## The curse of dimensions

##### Mar 20, 2024

Usually the curse of dimensions concerns data with many dimensions. But today I want to talk about a different kind of curse. This is the curse of dimensions in mapping.

We are only talking about a few dimensions, typically between 3 and 6, so a small number of dimensions. And yet it's already a curse. Maps are typically drawn in two dimensions. Those two dimensions are usually spoken for: they show the x- and y-coordinates of space. If we want to include a third, fourth or fifth dimension of data on the map, we have to appeal to colors, shapes, and so on. Cartographers have long realized that adding dimensions involves tradeoffs.

***

Andrew featured some colored bubble maps in a recent post. Here is one example:

The above map shows the proportion of population in each U.S. county that is Hispanic. Each county is represented by a bubble pinned to the centroid of the county. The color of the bubble shows the data, divided into demi-deciles, so they are using an equal-width binning method. The size of a bubble indicates the size of a county.
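For readers unfamiliar with equal-width binning, here is a small Python sketch, assuming the demi-deciles are 20 bins each spanning five percentage points. The function name and the sample proportions are made up for illustration and are not taken from the map's data.

```python
def equal_width_bin(proportion, n_bins=20):
    """Return the 0-based bin index for a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    # min() keeps a proportion of exactly 1.0 inside the top bin.
    return min(int(proportion * n_bins), n_bins - 1)


# Hypothetical county proportions, for illustration only.
for p in (0.03, 0.07, 0.50, 1.00):
    b = equal_width_bin(p)
    lo, hi = b / 20, (b + 1) / 20
    print(f"{p:.2f} -> bin {b:2d} covering {lo:.0%} to {hi:.0%}")
```

With equal-width bins, the number of counties per color can be very uneven, which matters on a map where most counties sit at the low end of the scale.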

The map is sometimes called a "Dorling map" after its presumptive original designer.

I'm going to use this map to explore the curse of dimensions.

***

It's clear from the design that county-level details are regarded as extremely important. As there are about 3,000 counties in the U.S., I don't see how any visual design can satisfy this requirement without giving up clarity.

More details require more objects, which spread readers' attention. More details contain more stories, but that too dilutes their focus.

Another principle of this map is to not allow bubbles to overlap. Of course, having bubbles overlap or print on top of one another is a visual faux pas. But to prevent such behavior on this particular design means the precise locations are sacrificed. Consider the eastern seaboard where there are densely populated counties: they are not pinned to their centroids. Instead, the counties are pushed out of their normal positions, similar to making a cartogram.

I remarked at the start – erroneously but deliberately – that each bubble is centered at the centroid of each county. I wonder how many of you noticed the inaccuracy of that statement. If that rule were followed, then the bubbles in New England would have overlapped and overprinted.

This tradeoff affects how we perceive regional patterns, as all the densely populated regions are bent out of shape.

Another aspect of the data that the designer treats as important is county population, or rather relative county population. Relative – because bubble sizes don't portray absolutes, plus the designer didn't bother to provide a legend to decipher bubble sizes.

The tradeoff is location. The varying bubble sizes, coupled with the previous stipulation of no overlapping, push bubbles from their proper centroids. This forced displacement disproportionately affects larger counties.

***

What if we are willing to sacrifice county-level details?

In this setting, we are not obliged to show every single county. One alternative is to perform spatial smoothing. Intuitively, think about the following steps: plot all these bubbles in their precise locations, turn the colors slightly transparent, let them overlap, blend away the edges, and then we have a nice picture of where the Hispanic people are located.
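One way to make the smoothing idea concrete is a kernel-weighted average: at any point on the map, average the nearby counties' proportions, weighting by population and by closeness. The Python sketch below is only an illustration under those assumptions. The coordinates, populations, shares and bandwidth are invented, and this is not necessarily how any published map was produced.

```python
import math

# Hypothetical counties: (x, y, population, hispanic_share). Not real data.
counties = [
    (0.0, 0.0, 50_000, 0.60),
    (1.0, 0.0, 200_000, 0.40),
    (5.0, 5.0, 10_000, 0.05),
]


def smoothed_share(x, y, counties, bandwidth=1.0):
    """Population- and distance-weighted average share at the point (x, y)."""
    num = den = 0.0
    for cx, cy, pop, share in counties:
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        w = pop * math.exp(-d2 / (2 * bandwidth ** 2))  # Gaussian kernel weight
        num += w * share
        den += w
    return num / den if den > 0 else float("nan")


print(round(smoothed_share(0.0, 0.0, counties), 3))  # blend of the two nearby counties
print(round(smoothed_share(5.0, 5.0, counties), 3))  # dominated by the isolated county
```

The bandwidth plays the role of "blending away the edges": a larger value gives a smoother picture and hides more county-level detail.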

I have sacrificed the county-level details but the regional pattern becomes much clearer, and we don't need to deviate from the well-understood shape of the standard map.

This version reminds me of the language maps that Josh Katz made.

Here is an old post about these maps.

This map design only reduces but does not eliminate the geographical inaccuracy. It uses the same trick as the Dorling map: the "vertical" density of population has been turned into "horizontal" span. It's a bit better because the centroids are not displaced.

***

Which map is better depends on what tradeoffs one is making. In the above example, I'd have made different choices.

One final thing – it's minor but maybe not so minor. Most of the bubbles on the map, especially in the middle, are tiny; as most of them have Hispanic proportions that are on the left side of the scale, they should be showing light orange. However, all of them appear darker than they ought to be. That's because each bubble has a dark border. For small bubbles, the ink on the border is a high proportion of the ink for the entire object.
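A quick back-of-the-envelope calculation shows how fast the border takes over. This Python sketch assumes the border is drawn inside the bubble's edge with a fixed width. The radii and the width of 1 are arbitrary units chosen for illustration.

```python
def border_fraction(radius, border_width):
    """Share of a bubble's area taken up by a border drawn inside its edge."""
    if border_width >= radius:
        return 1.0  # the border swallows the whole bubble
    inner = radius - border_width
    return 1.0 - (inner / radius) ** 2


for r in (2, 5, 20):
    print(f"radius {r:2d}: border is {border_fraction(r, 1):.0%} of the area")
```

Under these assumptions, a bubble of radius 2 is three-quarters border while a bubble of radius 20 is roughly one-tenth border, so the same fill color reads much darker on the small bubbles.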

## Do you want a taste of the new hurricane cone?

##### Mar 05, 2024

The National Hurricane Center (NHC) put out a press release (link to PDF) to announce upcoming changes (in August 2024) to their "hurricane cone" map. This news was picked up by Miami Herald (link).

The above example is what the map looks like. (The data are probably fake since the new map is not yet implemented.)

The cone map has been a focus of research because experts like Alberto Cairo have been highly critical of its potential to mislead. Unfortunately, the more attention paid to it, the more complicated the map has become.

The latest version of this map comprises three layers.

The bottom layer is the so-called "cone". This is the white patch labeled below as the "potential track area (day 1-5)". Researchers dislike this element because they say readers tend to misinterpret the cone as predicting which areas would be damaged by hurricane winds when the cone is intended to depict the uncertainty about the path of the hurricane. Prior criticism has led the NHC to add the text at the top of the chart, saying "The cone contains the probable path of the storm center but does not show the size of the storm. Hazardous conditions can occur outside of the cone."

The middle layer consists of the multi-colored bits. Two of these show the areas for which the NHC has issued "watches" and "warnings". All of these color categories represent wind speeds at different times. Watches and warnings are forecasts while the other colors indicate "current" wind speeds.

The top layer consists of black dots. These provide a single forecast of the most likely position of the storm, with the S, H, M labels indicating the most likely range of wind speeds at forecast times.

***

Let's compare the new cone map to a real hurricane map from 2020. (This older map came from a prior piece also by NHC.)

Can we spot the differences?

To my surprise, the differences were minor, in spite of the pre-announced changes.

The first difference is a simplification. Instead of dividing the white cone (the bottom layer) into two patches -- a white patch for days 1-3, and a dotted transparent patch for days 4-5, the new map aggregates the two periods. Visually, simplifying makes the map less busy but loses the implicit acknowledgment found in the old map that forecasts further out are not as reliable.

The second point of departure is the addition of "inland" warnings and watches. Notice how the red and blue areas on the old map hugged the coastline while the red and blue areas on the new map reach inland.

Both changes push the bottom layer, i.e. the cone, deeper into the background. It's like a shrink-flation ice cream cone that has a tiny bit of ice cream stuffed deep in its base.

***

How might one improve the cone map? I'd start by dismantling the layers. The three layers present answers to different problems, albeit connected.

Let's begin with the hurricane forecasting problem. We have the current location of the storm, and current measurements of wind speeds around its center. As a first requirement, a forecasting model predicts the path of the storm in the near future. At any time, the storm isn't a point in space but a "cloud" around a center. The path of the storm traces how that cloud will move, including any expansion or contraction of its radius.

That's saying a lot. To start with, a forecasting model issues the predicted average path -- the expected path of the storm's center. This path is (not competently) indicated by the black dots in the top layer of the cone map. These dots offer only a sampled view of the average path.

Not surprisingly, there is quite a bit of uncertainty about the future path of any storm. Many models simulate future worlds, generating many predictions of the average paths. The envelope of the most probable set of paths is the "cone". The expanding width of the cone over time reflects the higher uncertainty of our predictions further into the future. Confusingly, this cone expansion does not depict spatial expansion of either the storm's size or the potential areas that may suffer the greatest damage. Both of those tend to shrink as hurricanes move inland.
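To illustrate why an envelope of simulated paths widens over time, here is a toy Python sketch. It treats each path's sideways deviation from the expected track as a random walk, which is purely my assumption for illustration. It is not the NHC's procedure, and the number of paths, the step size and the 5%/95% cutoffs are all arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_PATHS, N_STEPS = 1000, 5  # assumed: 1,000 simulated paths, 5 daily steps


def simulate_cross_track(n_steps):
    """One path's sideways deviation from the expected track, as a random walk."""
    pos, path = 0.0, []
    for _ in range(n_steps):
        pos += random.gauss(0.0, 1.0)  # arbitrary units per day
        path.append(pos)
    return path


paths = [simulate_cross_track(N_STEPS) for _ in range(N_PATHS)]

cone = []  # one (mean, lower, upper) triple per forecast day
for day in range(N_STEPS):
    positions = sorted(p[day] for p in paths)
    lower = positions[int(0.05 * N_PATHS)]      # roughly the 5th percentile
    upper = positions[int(0.95 * N_PATHS) - 1]  # roughly the 95th percentile
    cone.append((statistics.mean(positions), lower, upper))

widths = [upper - lower for _, lower, upper in cone]
for day, width in enumerate(widths, start=1):
    print(f"day {day}: envelope width {width:.2f}")
```

The envelope grows because errors accumulate, not because the simulated storm gets any bigger. That is the distinction readers of the cone map tend to miss.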

Nevertheless, the cone and the black dots are connected. The path drawn out by the black dots should be the average path of the center of the storm.

The forecasting model also generates estimates of wind speeds. Those are given as labels inside the black dots. The cone itself offers no information about wind speeds. The map portrays the uncertainty of the position of the storm's center but omits the uncertainty of the projected wind speeds.

The middle layer of colored patches also informs readers about model projections - but in an interpreted manner. The colors portray hurricane warnings and watches for specific areas, which are based on projected wind speeds from the same forecasting models described above. The colors represent NHC's interpretation of these model outputs. Each warning or watch simultaneously uses information on location, wind speed and time. The uncertainty of the projected values is suppressed.

I think it's better to use two focused maps instead of having one that captures a bit of this and a bit of that.

One map can present the interpreted data, and show the areas that have current warnings and watches. This map is about projected wind strength in the next 1-3 days. It isn't about the center of the storm, or its projected path. Uncertainty can be added by varying the tint of the colors, reflecting the confidence of the model's prediction.

Another map can show the projected path of the center of the storm, plus the cone of uncertainty around that expected path. I'd like to bring more attention to the times of forecasting, perhaps shading the cone day by day, if the underlying model has this level of precision.

***

Back in 2019, I wrote a pretty long post about these cone maps. Well worth revisiting today!

## Neither the forest nor the trees

##### Feb 15, 2024

On the NYT's twitter feed, they featured an article titled "These Seven Tech Stocks are Driving the Market". The first sentence of the article reads: "The S&P 500 is at an all-time high, and investors have just a handful of stocks to thank for it."

Without having seen any data, I'd surmise from that line that (a) the S&P 500 index has gone up recently, and (b) most if not all of the gain in the index can be attributed to gains in the tech stocks mentioned in the headline. (For purists, a handful is five, not seven.)

The chart accompanying the tweet is a treemap:

The treemap is possibly the most overhyped chart type of the modern era. Its use here is tangential to the story of surging market value. That's because the treemap presents a snapshot of the composition of the index, but contains nothing about the trend (change over time) of the average index value or of its components.

***

Even in representing composition, the treemap is inferior to, gasp, a pie chart. Of course, we can only use a pie chart for small numbers of components. The following illustration takes the data from the NYT chart on the Magnificent Seven tech stocks, and compares a treemap versus a pie chart side by side:

The reason why the treemap is worse is that both the width and the height of the boxes are changing while only the angle (or arc length) of the pie slices is varying. (Not saying use a pie chart, just saying the treemap is worse.)

There is a reason why the designer appended data labels to each of the seven boxes. The effect of not having those labels is readily felt when our eyes reach the next set of stocks – which carry company names but not their market values. What is the market value of Berkshire Hathaway?

Even more so, what proportion of the total is the market value of Berkshire Hathaway? Indeed, if the designer did not write down 29%, it would take a bit of work to figure out the aggregate value of yellow boxes relative to the entire box!

This design successfully draws our attention to the structural importance of various components of the whole. There are three layers - the yellow boxes (Magnificent Seven), the gray boxes with company names, and the other gray boxes. I also like how they positioned the text on the right column.

***

Going inside the NYT article itself, we find two line charts that convey the story as told.

Here's the first one:

They are comparing the most recent stock prices with those from October 12 2022, which is identified as the previous "low". (I'm actually confused by how the most recent "low" is defined, but that's a different subject.)

This chart carries a lot of good information, even though it does not plot "all the data", as in each of the 500 S&P components individually. Over the period under analysis, the average index value has gone up about 35% while the Magnificent Seven's value has skyrocketed by 65% in aggregate. The latter accounted for 30% of the total value at the most recent time point.

If we set the S&P 500 index value in 2024 as 100, then the M7 value in 2024 is 30. After unwinding the 65% growth, the M7 value in October 2022 was 18; the S&P 500 in October 2022 was 74. Thus, the weight of M7 was 24% (18/74) in October 2022, compared to 30% now. Consequently, the weight of the other 493 stocks declined from 76% to 70%.
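The unwinding can be checked in a few lines of Python. The inputs are the rounded figures quoted above (index at 100, M7 at 30, growth of 35% and 65%). The variable names are mine.

```python
sp_now = 100.0    # S&P 500 index value in 2024, set to 100
m7_now = 30.0     # M7 is about 30% of the index now
sp_growth = 0.35  # index up about 35% since October 2022
m7_growth = 0.65  # M7 up about 65% in aggregate

# "Unwinding" growth means dividing by (1 + growth rate).
sp_then = sp_now / (1 + sp_growth)
m7_then = m7_now / (1 + m7_growth)

m7_weight_then = m7_then / sp_then
others_weight_then = 1 - m7_weight_then

print(f"S&P 500 in Oct 2022: {sp_then:.1f}")
print(f"M7 in Oct 2022:      {m7_then:.1f}")
print(f"M7 weight then:      {m7_weight_then:.1%}")
print(f"Others' weight then: {others_weight_then:.1%}")
```

Carrying the decimals through gives an M7 weight between 24% and 25%, in line with the 18/74 figure computed from rounded values.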

This isn't even the full story because most of the action within the M7 is in Nvidia, the stock most tightly associated with the current AI hype, as shown in the other line chart.

Nvidia's value jumped by 430% in that time window. From the treemap, the total current value of M7 is \$12.3 trillion while Nvidia's value is \$1.4 trillion, thus Nvidia is 11.4% of M7 currently. Since M7 is 29% of the total S&P 500, Nvidia is 11.4%*29% = 3% of the S&P. Thus, in 2024, against 100 for the S&P, Nvidia's share is 3. After unwinding the 430% growth, Nvidia's share in October 2022 was 0.6, about 0.8% of 74. Its weight more than tripled during this period of time.
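Here is the same chain of arithmetic as a short Python sketch, using the figures quoted in this post. The units of the two market values cancel in the ratio, so only their relative size matters. The variable names are mine.

```python
m7_value = 12.3        # total M7 market value, as read from the treemap
nvda_value = 1.4       # Nvidia's market value, same units
m7_share_of_sp = 0.29  # M7 as a share of the S&P 500
nvda_growth = 4.30     # a 430% gain multiplies the value by 5.3
sp_then = 100 / 1.35   # S&P 500 in October 2022, with 2024 set to 100

nvda_share_of_m7 = nvda_value / m7_value
nvda_now = 100 * nvda_share_of_m7 * m7_share_of_sp  # against 100 for the S&P
nvda_then = nvda_now / (1 + nvda_growth)
nvda_weight_then = nvda_then / sp_then

print(f"Nvidia share of M7:      {nvda_share_of_m7:.1%}")
print(f"Nvidia vs S&P=100, now:  {nvda_now:.1f}")
print(f"Nvidia vs S&P=100, then: {nvda_then:.1f}")
print(f"Nvidia weight then:      {nvda_weight_then:.1%}")
```

Without rounding the intermediate steps, today's weight comes out between three and four times the October 2022 weight.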

## A nice plot of densities, but what's behind the colors?

##### Feb 08, 2024

I came across this chart by Planet Anomaly that compares air quality across the world's cities (link). The chart is in long form. The top part looks like this:

The bottom part looks like this:

You can go to the Visual Capitalist website to see the entire chart.

***

Plots of densities are relatively rare. The metric for air quality is micrograms of fine particulate matter (PM) per cubic meter, so showing densities is natural.

It's pretty clear the cities with the worst air quality at the bottom have a lot more PM in the air than the cleanest cities shown at the top.

This density chart plays looser with the data than our canonical chart types. The perceived densities of dots inside the squares do not represent the actual concentrations of PM. It's certainly not true that in New Delhi, the air is packed tightly with PM.

Further, a random number generator is required to scatter the red dots inside each square. Thus, different software or designers will make the same chart look a bit different - the densities will be the same but the locations of the dots will not be.

I don't have a problem with this. Do you?
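To see what that randomness amounts to, here is a minimal Python sketch that scatters a fixed number of dots uniformly inside a unit square. The dot count and the seeds are arbitrary, and `scatter_dots` is a hypothetical helper, not the designer's actual method.

```python
import random


def scatter_dots(n_dots, seed):
    """Place n_dots uniformly at random inside a unit square."""
    rng = random.Random(seed)  # a private generator, so seeds don't interfere
    return [(rng.random(), rng.random()) for _ in range(n_dots)]


a = scatter_dots(50, seed=1)
b = scatter_dots(50, seed=2)

print(len(a) == len(b))  # same density: both squares hold 50 dots
print(a == b)            # different seeds give different dot positions
```

Fixing the seed makes a given rendering reproducible. Two designers using different seeds would still show the same number of dots per square.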

***

Another notable feature of this chart is the double encoding. The same metric is not just presented as densities; it is also encoded in a color scale.

I don't think this adds much.

Both color and density are hard for humans to perceive precisely, so adding color does not convey precision to readers.

The color scale is gradated, so it effectively divides the cities into seven groups. But I don't attach particular significance to the classification. If that is important, it would be clearer to put boxes around the groups of plots. So I don't think the color scale conveys clustering to readers effectively.

There is one important grouping which is defined by WHO's safe limit of 5 μg per cubic meter. A few cities pass this test while almost every other place fails. But the design pays no attention to this test, as it uses the same hue on both sides, and even the same tint changes on either side of the limit.

***

Another notable project that shows densities as red dots is this emotional chart by Mona Chalabi about measles, which I wrote about in 2019.

## Messing with expectations

##### Jan 11, 2024

A co-worker sent me to the following map, found in Forbes:

It shows the amount of state tax surcharge per gallon of gas in the U.S. And it's got one of the most common issues found in choropleth maps - the color scheme runs opposite to reader expectations.

Typically, if we see a red-green color scale, we would expect red to represent large numbers and green, small numbers. This map reverses the typical setup: California, the state with the heftiest gas tax, is shown green.

I know, I know - if we apply the typical color scheme, California would bleed red, and it's a blue state, damn it.

The solution is to avoid the red color. Just don't use red or blue.

There is no need to use two colors either.

***

A few minor fixes. Given that all dollar amounts on the map are shown to two decimal places, the legend labels should also be shown to 2 decimal places, and with dollar signs.

The subtitle should read "Dollars per gallon" instead of "Cents per gallon". Alternatively, keep "Cents per gallon" but convert all data labels into cents.

Some of the states are missing data labels.

***

I recast this as a small-multiples by categorizing states into four subgroups.

With this change, one can almost justify using maps because there is sort of a spatial pattern.

## To a new year of pleasant surprises

##### Jan 01, 2024

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson of where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to comprehend what they could have meant by that line. I believe this is what that sentence means: Google is not the only search engine out there so by only researching Google search volume, they undercount the true search volume. How did they deal with the missing data problem? They "scaled up" so if Google is 80% of the search volume in a country, then they divide the Google volume by 80% to "scale up" to 100%.

Whenever we use heuristics like this, we should investigate their foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong assumptions – if true, the market would have no need for more than one search engine. Each search engine's audience is just a random sample from the population of all users.

Let's make up some numbers. Let's say Google has 80% share of search volume in Country A, and AI-related search is 10% of the overall Google search volume, or 8% of all search volume in the country. The remaining search engines have 20% share. Scaling up here means taking the 8% of Google AI-related search volume and dividing by 80%, which yields 10%. Since Google owns 8% of the 10%, the other search engines see 2% of overall search volume attributed to AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has 20% share of Country B's search volume. AI-related search on Google is 2%, which is 10% of its total. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume in the dominant search engines in Country B is also 10%.

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.
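The made-up numbers above can be run through a few lines of Python. This is my reading of the "scaled up" line in the Methodology, not the analysts' actual code, and `scale_up` is a hypothetical name.

```python
def scale_up(google_ai_volume, google_share):
    """Estimate total AI-related search volume from Google's volume alone."""
    if not 0 < google_share <= 1:
        raise ValueError("google_share must be in (0, 1]")
    return google_ai_volume / google_share


# Volumes are in percent of all search volume in the country (made-up numbers).
country_a = scale_up(google_ai_volume=8.0, google_share=0.80)
country_b = scale_up(google_ai_volume=2.0, google_share=0.20)

print(f"Country A: estimate {country_a:.1f}, inflation factor {1 / 0.80:.2f}")
print(f"Country B: estimate {country_b:.1f}, inflation factor {1 / 0.20:.2f}")
```

Both countries end up with the same estimate, but Country B's figure is observed data multiplied by five while Country A's is multiplied by 1.25.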

What's also true? The less dominant Google is, the smaller proportion of the total data the analysts are able to see, the lower the quality of the available information. So the heuristic is the most influential where it has the greatest uncertainty.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.

## Stranger things found on scatter plots

##### Nov 29, 2023

Washington Post published a nice scatter plot which deconstructs scores from the recent World Championships in Gymnastics. (link)

The chart presents the main message clearly - the winner Simone Biles scored the highest on both components of the score (difficulty and execution), by quite some margin.

What else can we learn from this chart?

***

Every athlete who qualified for the final scored at or above average on both components.

Scoring below average on either component is a death knell: no athlete scored enough on the other component to compensate. (The top left and bottom right quadrants would have had some yellow dots otherwise.)

Several athletes in the top right quadrant presumably scored enough to qualify but didn't. The footnote likely explains it: each country can send at most two athletes to the final. It may be useful to mark out these "unlucky" athletes using a third color.

Curiously, it's not easy to figure out who these unlucky athletes were from this chart alone. We need two pieces of data: the minimum qualifying score, and the total score for each athlete. The scatter plot isn't the best chart form to show totals, but qualification to the final is based on the sum of the difficulty and execution scores. (Note also, neither axis starts at zero, compounding the challenge.)
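The selection logic is simple enough to sketch in Python. The athletes, scores and the number of spots below are invented for illustration. Only the two rules (rank by the sum of the two scores, at most two athletes per country) come from the article.

```python
def total(athlete):
    """Total score is the sum of the difficulty and execution scores."""
    return athlete["difficulty"] + athlete["execution"]


def qualifiers(athletes, n_spots, max_per_country=2):
    """Fill the final in order of total score, capping each country's entries."""
    chosen, per_country = [], {}
    for a in sorted(athletes, key=total, reverse=True):
        if len(chosen) == n_spots:
            break
        if per_country.get(a["country"], 0) < max_per_country:
            chosen.append(a["name"])
            per_country[a["country"]] = per_country.get(a["country"], 0) + 1
    return chosen


# Hypothetical athletes, not the actual results.
athletes = [
    {"name": "A1", "country": "X", "difficulty": 6.0, "execution": 8.5},
    {"name": "A2", "country": "X", "difficulty": 5.8, "execution": 8.4},
    {"name": "A3", "country": "X", "difficulty": 5.6, "execution": 8.3},
    {"name": "B1", "country": "Y", "difficulty": 5.4, "execution": 8.2},
    {"name": "C1", "country": "Z", "difficulty": 5.0, "execution": 8.0},
]

finalists = qualifiers(athletes, n_spots=3)
cutoff = min(total(a) for a in athletes if a["name"] in finalists)
unlucky = [a["name"] for a in athletes
           if a["name"] not in finalists and total(a) >= cutoff]

print(finalists)  # two athletes from X, then the best from another country
print(unlucky)    # scored above the cutoff but blocked by the country cap
```

On the scatter plot, the cutoff corresponds to a downward-sloping diagonal line, which is why it is hard to judge by eye.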

***

This scatter plot is most memorable for shattering one of my expectations about risk and reward in sports.

I expect risk-seeking athletes to suffer from higher variance in performance. The tennis player who goes for big serves tends to also commit more double faults. The sluggers who hit home runs tend to strike out more often. Similarly, I expect gymnasts who attempt more difficult skills to receive lower execution scores.

Indeed, the headline writer seemed to agree, suggesting that Biles is special because she's both high in difficulty and strong in execution.

The scatter plot, however, sends the opposite message: Biles's combination of high difficulty and strong execution should not surprise. The entire field shows a curiously strong positive correlation between difficulty and execution scores. The more difficult the routine, the higher the execution score!

It's hard to explain such a pattern. My guesses are:

a) judges reward difficult routines, and subconsciously confound execution and difficulty scores. They use separate judges for execution and difficulty. Paradoxically, this arrangement may have caused separation anxiety - the judges for execution might just feel the urge to reward high difficulty.

b) those athletes who are skilled enough to attempt more difficult routines are also those who are more consistent in execution. This is a type of self-selection bias frequently found in observational data.

Regardless of the reasons for the strong correlation, the chart shows that these two components of the total score are not independent, i.e. the metrics have significant overlap in what they measure. Thus, one cannot really talk about a difficult routine without also noting that it's a well-executed routine, and vice versa. In an ideal scoring design, we'd like to have independent components.

## The choice to encode data using colors

##### Nov 20, 2023

NBC News published the following heatmap that shows inflation by product category in the last year or so:

The general story might be that inflation was rampant in airfare and electricity prices about a year ago but these prices have moderated recently, especially in airfare. Gas prices appear to have inflated far less than overall inflation during these months.

***

Now, if you're someone who cares about the magnitude of differences, not just the direction, then revisit the above statements, and you'll feel a sense of inadequacy.

When we choose to encode data in colors, we're giving up on showing magnitudes or precision. The color scale shown up top sends the message that the continuous nature of the number line is being displayed but it really isn't.

The largest value of the chart is found on the left side of the airfare row:

The value is about 36% which strangely enough is far larger than the maximum value shown in the legend above. Even if those values align, it is still impossible to guess what values the different colors and shades in the cells map to from the legend.

***

The following small-multiples chart shows the underlying values more precisely:

I have transformed the data differently. In these line charts, the data are indexed to the first month (100) so each chart shows the cumulative change in prices from that month to the current month, for each category, compared to the overall.
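The indexing transformation is a one-liner. Here is a Python sketch with invented price levels (not the NBC News data) to show what "indexed to the first month" means.

```python
def index_to_first(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    return [base * v / first for v in series]


# Hypothetical monthly price levels, for illustration only.
overall = [296.0, 298.0, 301.0, 305.0]
airfare = [280.0, 270.0, 262.0, 250.0]

overall_idx = index_to_first(overall)
airfare_idx = index_to_first(airfare)

for month, (o, a) in enumerate(zip(overall_idx, airfare_idx)):
    print(f"month {month}: overall {o:6.1f}  airfare {a:6.1f}  gap {a - o:+6.1f}")
```

Each indexed value minus 100 is the cumulative percentage change since the first month, so a category line can be read directly against the overall line.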

The two most interesting categories are airfare and gas. Airfare has recently decreased quite drastically relative to September 2022, and thus the line is far below the overall inflation trend. Gas prices moved in reverse: they dropped in the last quarter of 2022 but have steadily risen over 2023, and in the most recent month, are tracking overall inflation.

## Several tips for visualizing matrices

##### Nov 07, 2023

Continuing my review of charts that were spammed to my inbox, today I look at the following visualization of a matrix of numbers:

The matrix shows pairwise correlations between the returns of 16 investment asset classes. Correlation is a number between -1 and 1. It is a symmetric scale around 0. It embeds two dimensions: the magnitude of the correlation, and its direction (positive or negative).

The correlation matrix is a special type of matrix: a bit easier to deal with as the data already come “standardized”. As with the other charts in this series, there is a good number of errors in the chart's execution.

I’ll leave the details maybe for a future post. Just check two key properties of a correlation matrix: the diagonal consisting of self-correlations should contain all 1s; and the matrix should be symmetric across that diagonal.
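Those two checks are easy to automate. Below is a minimal Python sketch, assuming the matrix is stored as a square list of lists. The function name and the small example matrix (with two deliberate errors) are mine.

```python
def check_correlation_matrix(m, tol=1e-9):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    for i in range(len(m)):
        # Self-correlations on the diagonal must equal 1.
        if abs(m[i][i] - 1.0) > tol:
            problems.append(f"diagonal cell ({i},{i}) is {m[i][i]}, not 1")
        # Each cell below the diagonal must match its mirror image above it.
        for j in range(i):
            if abs(m[i][j] - m[j][i]) > tol:
                problems.append(f"cells ({i},{j}) and ({j},{i}) differ")
    return problems


# Hypothetical matrix with one bad diagonal cell and one asymmetric pair.
bad = [
    [1.0, 0.5, 0.4],
    [0.5, 0.9, 0.2],
    [0.3, 0.2, 1.0],
]

for problem in check_correlation_matrix(bad):
    print(problem)
```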

***

For this post, I want to cover nuances of visualizing matrices. The chart designer knows exactly what the message of the chart is - that the asset class called "art" is attractive because it has little correlation with other popular asset classes. Regardless of the chart's errors, it’s hard for the reader to find the message in the matrix shown above.

That's because the specific data carrying the message sit in the bottom row (and the rightmost column). The cells in this row (and column) have a light purple color, which has been co-opted by the even lighter gray color used for the diagonal cells. These diagonal cells pop out of the chart despite being the least informative (they have the same values for all correlation matrices!).

***

Several tactics can be deployed to push the message to the fore.

First, let's bring the key data to the prime location on the chart - this is the top row and left column (for cultures which read top to bottom, left to right).

For all the drafts in this post, I have dropped the text descriptions of the asset classes, and replaced them with numbers so that it's easier to follow the changes. (For those who're paying attention, I also edited the data to make the matrix symmetric.)

Second, let's look at the color choice. Here, the designer made a wise choice of restricting the number of color levels to three (dark, medium and light). I retained that decision in the above revision - actually, I used four colors but there are no values in one of the four sections, therefore, effectively, only three colors appear. But let's look at what happens when the number of color levels is increased.

The more levels of color, the more strain it puts on our processing... with little reward.

Third, and most importantly, the order of the categories affects perception majorly. I have no idea what the designer used as the sorting criterion. In step one of the fix, I moved the art category to the front but left all the other categories in the original order.

The next chart has the asset classes organized from lowest to highest average correlation. Conveniently, using this sorting metric leaves the art category in its prime spot.

Notice that the appearance has completely changed. The new version brings out clusters in the data much more effectively. Most of the assets in the bottom of the chart have high correlation with each other.
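The sorting step can be sketched in a few lines of Python. I assume here that "average correlation" means the mean of each row excluding the self-correlation. The 4x4 matrix is invented, with the last asset standing in for art.

```python
def order_by_mean_correlation(m):
    """Reorder rows and columns from lowest to highest average correlation."""
    n = len(m)
    # Average each row while skipping the diagonal (self-correlation) cell.
    means = [sum(m[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    order = sorted(range(n), key=lambda i: means[i])
    # Apply the same permutation to rows and columns to keep the matrix symmetric.
    reordered = [[m[i][j] for j in order] for i in order]
    return order, reordered


# Hypothetical correlations; asset 3 plays the role of the "art" category.
m = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.7, 0.0],
    [0.5, 0.7, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]

order, reordered = order_by_mean_correlation(m)
print(order)
for row in reordered:
    print(row)
```

Other sort keys (for example, the result of a clustering algorithm) would produce a different layout. Average correlation is simply the criterion used in the chart above.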

Finally, because the correlation matrix is symmetric across the diagonal of self-correlations, the two halves are mirror images and thus redundant. The following removes one of the mirrored halves, and also removes the diagonal, leading to a much cleaner look.
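Dropping the redundant half is a matter of keeping only the cells strictly below the diagonal. Here is a Python sketch with made-up labels and values.

```python
def lower_triangle(m, labels):
    """List (row_label, col_label, value) for cells strictly below the diagonal."""
    return [(labels[i], labels[j], m[i][j])
            for i in range(len(m))
            for j in range(i)]


# Hypothetical symmetric correlation matrix for four asset classes.
labels = ["1", "2", "3", "4"]
m = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.7, 0.0],
    [0.5, 0.7, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]

cells = lower_triangle(m, labels)
for row_label, col_label, value in cells:
    print(row_label, col_label, value)
```

For n asset classes this leaves n(n-1)/2 cells to plot instead of n squared, which is where the cleaner look comes from.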

Next time you visualize a matrix, think about how you sort the rows/columns, how you choose the color scale, and whether to plot the mirrored image and the diagonal.