Neither the forest nor the trees

On the NYT's twitter feed, they featured an article titled "These Seven Tech Stocks are Driving the Market". The first sentence of the article reads: "The S&P 500 is at an all-time high, and investors have just a handful of stocks to thank for it."

Without having seen any data, I'd surmise from that line that (a) the S&P 500 index has gone up recently, and (b) most if not all of the gain in the index can be attributed to gains in the tech stocks mentioned in the headline. (For purists, a handful is five, not seven.)

The chart accompanying the tweet is a treemap:

Nyt_magnificentseven

The treemap is possibly the most overhyped chart type of the modern era. Its use here is tangential to the story of surging market value. That's because the treemap presents a snapshot of the composition of the index, but contains nothing about the trend (change over time) of the average index value or of its components.

***

Even in representing composition, the treemap is inferior to, gasp, a pie chart. Of course, we can only use a pie chart for small numbers of components. The following illustration takes the data from the NYT chart on the Magnificent Seven tech stocks, and compares a treemap versus a pie chart side by side:

Junkcharts_redo_nyt_magnificent7

The reason why the treemap is worse is that both the width and the height of the boxes are changing while only the radius (or angle) of the pie slices is varying. (Not saying use a pie chart, just saying the treemap is worse.)

There is a reason why the designer appended data labels to each of the seven boxes. The effect of not having those labels is readily felt when our eyes reach the next set of stocks – which carry company names but not their market values. What is the market value of Berkshire Hathaway?

Even more so, what proportion of the total is the market value of Berkshire Hathaway? Indeed, if the designer did not write down 29%, it would take a bit of work to figure out the aggregate value of yellow boxes relative to the entire box!

This design sucessfully draws our attention to the structural importance of various components of the whole. There are three layers - the yellow boxes (Magnificent Seven), the gray boxes with company names, and the other gray boxes. I also like how they positioned the text on the right column.

***

Going inside the NYT article itself, we find two line charts that convey the story as told.

Here's the first one:

Nyt_magnificent7_linechart1

They are comparing the most recent stock prices with those from October 12 2022, which is identified as the previous "low". (I'm actually confused by how the most recent "low" is defined, but that's a different subject.)

This chart carries a lot of good information, even though it does not plot "all the data", as in each of the 500 S&P components individually. Over the period under analysis, the average index value has gone up about 35% while the Magnificent Seven's value have skyrocketed by 65% in aggregate. The latter accounted for 30% of the total value at the most recent time point.

If we set the S&P 500 index value in 2024 as 100, then the M7 value in 2024 is 30. After unwinding the 65% growth, the M7 value in October 2022 was 18; the S&P 500 in October 2022 was 74. Thus, the weight of M7 was 24% (18/74) in October 2022, compared to 30% now. Consequently, the weight of the other 473 stocks declined from 76% to 70%.

This isn't even the full story because most of the action within the M7 is in Nvidia, the stock most tightly associated with the current AI hype, as shown in the other line chart.

Nyt_magnificent7_linechart2

Nvidia's value jumped by 430% in that time window. From the treemap, the total current value of M7 is $12.3 b while Nvidia's value is $1.4 b, thus Nvidia is 11.4% of M7 currently. Since M7 is 29% of the total S&P 500, Nvidia is 11.4%*29% = 3% of the S&P. Thus, in 2024, against 100 for the S&P, Nvidia's share is 3. After unwinding the 430% growth, Nvidia's share in October 2022 was 0.6, about 0.8% of 74. Its weight tripled during this period of time.


To a new year of pleasant surprises

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

Visualcapitalist_01_Generative_AI_World_map sm

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson of where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to comprehend what they could have meant by that line. I believe this is what that sentence means: Google is not the only search engine out there so by only researching Google search volume, they undercount the true search volume. How did they deal with the missing data problem? They "scaled up" so if Google is 80% of the search volume in a country, then they divide the Google volume by 80% to "scale up" to 100%.

Whenever we use heuristics like this, we should investigate its foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong assumptions – if true, the market would have no need for more than one search engine. Each search engine's audience is just a random sample from the population of all users.

Let's make up some numbers. Let's say Google has 80% share of search volume in Country A, and AI-related search 10% of the overall Google search volume. The remaining search engines have 20% share. Scaling up here means taking the 8% of Google AI-related search volume, divide by 80%, which yields 10%. Since Google owns 8% of the 10%, the other search engines see 2% of overall search volume attributed to AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has 20% share of Country B's search volume. AI-related search on Google is 2%, which is 10% of its total. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume in the dominant search engines in Country B to be also 10%.

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.

What's also true? The less dominant Google is, the smaller proportion of the total data the analysts are able to see, the lower the quality of the available information. So the heuristic is the most influential where it has the greatest uncertainty.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.


Chartjunk as marketing copy

I got some spam marketing message last week. How exciting. They even use a subject line that has absolutely nothing to do with its content, baiting me to open it. And open I did, to some data graphics horrors.

The marketer promises a whole series of charts to prove that art is a great asset class for investment returns.

The very first chart already caught my full attention. It's this one:

Masterworks_chart1

It's a simple bar chart, with four values. Looks innocuous.

I'm unable to appreciate the recent trend to align bars in the middle, rather than at their bases. So I converted it to the canonical form:

Redo_masterworks_1_barchart

Do you see the problem?

The second value ($1.7 trillion) is exactly half the size of the first value ($3.4 trillion) and yet the second bar is two-thirds of the length of the first bar. So, the size of the second bar is exaggerated relative to its label – and that’s the bar displaying the market size for “art,” which is what the spammer is pitching.

The bottom pair of values share the same relationship: $0.8 trillion is exactly half of $1.6 trillion. Again, the relative lengths of those two bars are not 50% but slightly over 60%.

Redo_masterworks_1_barchart_excess

Did the designer think that the bar lengths could be customized to whatever s/he desires? This one is hard to crack.

***

The sixth chart in the series is a different kind of puzzle:

Masterworks_chart6

All three lines have the exact same labels but show different values over time.

***

And they have pie charts, of course. Take a look:

Masterworks_chart

Something went wrong here too. I'll leave it to my readers who can certainly figure it out :)

***

These charts were probably spammed to at least thousands.

 


Two metrics in-fighting

The Wall Street Journal shows the following chart which pits two metrics against each other:

Wsj_salaries25to29

The primary metric is the change in median yearly salary between the two periods of time. We presume it's primary because of its presence in the chart title, and the blue bars being more readable than the green bubbles. The secondary metric is the median yearly salary in the later period.

That, I believe, was the intended design. When I saw this chart, my eyes went to the numbers inside the green bubbles. Perhaps it's because I didn't read the chart title first, and the horizontal axis wasn't labelled so it wasn't obvious what the blue bars coded.

As with most bubble charts, the data labels exist to cover up the inadequacy of circular areas. The self-sufficiency test - removing the data labels - shows this well:

Redo_wsj_salaries25to29

It's simply impossible to know what values should be in each bubble, or to perceive the relative sizes of those bubbles.

***

Reversing the order of the blue bars also helps:

Redo_wsjsalaries25to29_2

The original order is one of the more annoying features in most visualization packages. Because internally, the categories are numbered 1, 2, 3, ..., and because the convention is to have values run higher as they run up the vertical axis, these packages would place the top-ranked item at the bottom of the chart.

Most people read top to bottom, which means that they read the least important item first, and the most important item last!

In most visualization packages, it takes only 1 click or 1 action to reverse the order of the items. Please do it!

***

For change over time, I like using a Bumps chart, otherwise called a slope graph:

Redo_wsjsalaries25to29_3


An elaborate data vessel

Visualcapitalist_globaloilproductionI recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out U.S., Russia, Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping is given second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I’d have not assigned a color to the non-OPEC countries, and just use the yellow and blue for OPEC and OPEC+. This is a little edit but makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realizing that they won’t fit together nicely, and so must now mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


What is the question is the question

I picked up a Fortune magazine while traveling, and saw this bag of bubbles chart.

Fortune_global500 copy

This chart is visually appealing, that must be said. Each circle represents the reported revenues of a corporation that belongs to the “Global 500 Companies” list. It is labeled by the location of the company’s headquarters. The largest bubble shows Beijing, the capital of China, indicating that companies based in Beijing count $6 trillion dollars of revenues amongst them. The color of the bubbles show large geographical units; the red bubbles are cities in Greater China.

I appreciate a couple of the design decisions. The chart title and legend are placed on the top, making it easy to find one’s bearing – effective while non-intrusive. The labeling signals a layering: the first and biggest group have icons; the second biggest group has both name and value inside the bubbles; the third group has values inside the bubbles but names outside; the smallest group contains no labels.

Note the judgement call the designer made. For cities that readers might not be familiar with, a country name (typically abbreviated) is added. This is a tough call since mileage varies.

***

As I discussed before (link), the bag of bubbles does not elevate comprehension. Just try answering any of the following questions, which any of us may have, using just the bag of bubbles:

  • What proportion of the total revenues are found in Beijing?
  • What proportion of the total revenues are found in Greater China?
  • What are the top 5 cities in Greater China?
  • What are the ranks of the six regions?

If we apply the self-sufficiency test and remove all the value labels, it’s even harder to figure out what’s what.

***

_trifectacheckup_image

Moving to the D corner of the Trifecta Checkup, we aren’t sure how to interpret this dataset. It’s unclear if these companies derive most of their revenues locally, or internationally. A company headquartered in Washington D.C. may earn most of its revenues in other places. Even if Beijing-based companies serve mostly Chinese customers, only a minority of revenues would be directly drawn from Beijing. Some U.S. corporations may choose its headquarters based on tax considerations. It’s a bit misleading to assign all revenues to one city.

As we explore this further, it becomes clear that the designer must establish a target – a strong idea of what question s/he wants to address. The Fortune piece comes with a paragraph. It appears that an important story is the spatial dispersion of corporate revenues in different countries. They point out that U.S. corporate HQs are more distributed geographically than Chinese corporate HQs, which tend to be found in the key cities.

There is a disconnect between the Question and the Data used to create the visualization. There is also a disconnect between the Question and the Visual display.


Deconstructing graphics as an analysis tool in dataviz

One of the useful exercises I like to do with charts is to "deconstruct" them. (This amounts to a deeper version of the self-sufficiency test.)

Here is a chart stripped down to just the main visual elements.

Junkcharts_cbcrevenues_deconstructed1

The game is to guess what is the structure of the data given these visual elements.

I guessed the following:

  • The data has a top-level split into two groups
  • Within each group, the data is further split into 3 parts, corresponding to the 3 columns
  • With each part, there are a variable number of subparts, each of which is given a unique color
  • The color legend suggests that each group's data are split into 7 subparts, so I'm guessing that the 7 subparts are aggregated into 3 parts
  • The core chart form is a stacked column chart with absolute values so relative proportions within each column (part) is important
  • Comparing across columns is not supported because each column has its own total value
  • Comparing same-color blocks across the two groups is meaningful. It's easier to compare their absolute values but harder to compare the relative values (proportions of total)

If I knew that the two groups are time periods, I'd also guess that the group on the left is the earlier time period, and the one on the right is the later time period. In addition to the usual left-to-right convention for time series, the columns are getting taller going left to right. Many things (not all, obviously) grow over time.

The color choice is a bit confusing because if the subparts are what I think they are, then it makes more sense to use one color and different shades within each column.

***

The above guesses are a mixed bag. What one learns from the exercise is what cues readers are receiving from the visual structure.

Here is the same chart with key contextual information added back:

Junkcharts_cbcrevenues_deconstructed2

Now I see that the chart concerns revenues of a business over two years.

My guess on the direction of time was wrong. The more recent year is placed on the left, counter to convention. This entity therefore suffered a loss of revenues from 2017-8 to 2018-9.

The entity receives substantial government funding. In 2017-8, it has 1 dollar of government funds for every 2 dollars of revenues. In 2018-9, it's roughly 2 dollars of government funds per every 3 dollars of revenues. Thus, the ratio of government funding to revenues has increased.

On closer inspection, the 7 colors do not represent 7 components of this entity's funding. The categories listed in the color legend overlap.

It's rather confusing but I missed one very important feature of the chart in my first assessment: the three columns within each year group are nested. The second column breaks down revenues into 3 parts while the third column subdivides advertising revenues into two parts.

What we've found is that this design does not offer any visual cues to help readers understand how the three columns within a year-group relates to each other. Adding guiding lines or changing the color scheme helps.

***

Next, I add back the data labels:

Cbc_revenues_original

The system of labeling can be described as: label everything that is not further broken down into parts on the chart.

Because of the nested structure, this means two of the column segments, which are the sums of subparts, are not labeled. This creates a very strange appearance: usually, the largest parts are split into subparts, so such a labeling system means the largest parts/subparts are not labeled while the smaller, less influential, subparts are labeled!

You may notice another oddity. The pink segment is well above $1 billion but it is roughly the size of the third column, which represents $250 million. Thus, these columns are not drawn to scale. What happened? Keep reading.

***

Here is the whole chart:

Cbc_revenues_original

A twitter follower sent me this chart. Elon Musk has been feuding with the Canadian broadcaster CBC.

Notice the scale of the vertical axis. It has a discontinuity between $700 million and $1.7 billion. In other words, the two pink sections are artificially shortened. The erased section contains $1 billion (!) Notice that the erased section is larger than the visible section.

The focus of Musk's feud with CBC is on what proportion of the company's funds come from the government. On this chart, the only way to figure that out is to copy out the data and divide. It's roughly 1.2/1.7 = 70% approx.

***

The exercise of deconstructing graphics helps us understand what parts are doing what, and it also reveals what cues certain parts send to readers.

In better dataviz, every part of the chart is doing something useful, it's free of redundant parts that take up processing time for no reason, and the cues to readers move them towards the intended message, not away from it.

***

A couple of additional comments:

I'm not sure why old data was cited because in the most recent accounting report, the proportion of government funding was around 65%.

Source of funding is not a useful measure of pro- or anti-government bias, especially in a democracy where different parties lead the government at different times. There are plenty of mouthpiece media that do not apparently receive government funding.


All about Connecticut

This dataviz project by CT Mirror is excellent. The project walks through key statistics of the state of Connecticut.

Here are a few charts I enjoyed.

The first one shows the industries employing the most CT residents. The left and right arrows are perfect, much better than the usual dot plots.

Ctmirror_growingindustries

The industries are sorted by decreasing size from top to bottom, based on employment in 2019. The chosen scale is absolute, showing the number of employees. The relative change is shown next to the arrow heads in percentages.

The inclusion of both absolute and relative scales may be a source of confusion as the lengths of the arrows encode the absolute differences, not the relative differences indicated by the data labels. This type of decision is always difficult for the designer. Selecting one of the two scales may improve clarity but induce loss aversion.

***

The next example is a bumps chart showing the growth in residents with at least a bachelor's degree.

Ctmirror_highered

This is more like a slopegraph as it appears to draw straight lines between two time points 9 years apart, omitting the intervening years. Each line represents a state. Connecticut's line is shown in red. The message is clear. Connecticut is among the most highly educated out of the 50 states. It maintained this advantage throughout the period.

I'd prefer to use solid lines for the background states, and the axis labels can be sparser.

It's a little odd that pretty much every line has the same slope. I'm suspecting that the numbers came out of a regression model, with varying slopes by state, but the inter-state variance is low.

In the online presentation, one can click on each line to see the values.

***

The final example is a two-sided bar chart:

Ctmirror_migration

This shows migration in and out of the state. The red bars represent the number of people who moved out, while the green bars represent those who moved into the state. The states are arranged from the most number of in-migrants to the least.

I have clipped the bottom of the chart as it extends to 50 states, and the bottom half is barely visible since the absolute numbers are so small.

I'd suggest showing the top 10 states. Then group the rest of the states by region, and plot them as regions. This change makes the chart more compact, as well as more useful.

***

There are many other charts, and I encourage you to visit and support this data journalism.

 

 

 


Lay off bubbles

Wall Street Journal says that the scale of layoffs in the tech industry recently is worse than those caused by the pandemic lockdown. Here is the chart:

Redo_wsj_tech_layoffs_sufficiency

It's the dreaded bubble chart, complete with overlapping circles. Each bubble represents the total number of employees laid off in the U.S. in a given month.

The above isn't really the chart you find in the Journal. I have removed the two data labels from the chart. Look at the highlighted months of April 2020 and November 2022. Can you guess how much larger is the number of laid-off employees in November 2022 relative to April 2020?

***

If you guessed it's 100% - that the larger bubble is twice the size of the smaller one, then you're much better than I at reading bubble charts. Here is the published chart with the data labels:

Wsj tech layoffs

I like to run this exercise - removing data labels - in order to reveal whether the graphical elements on the page are sufficient to convey the underlying data. Bubbles are typically not great at this. (This is what I call the self-sufficiency test.)

***

Another problem with bubble charts is that the sizes of the bubbles are arbitrary. This allows the designer to convey different messages with the same data.

Take a look at these two bubble charts:

Redo_wsj_layoff_bubbles

The first one has huge bubbles, and lots of overlapping while the second one is roughly the same as the WSJ chart (I pulled a different dataset so the numbers may not be exactly the same).

Both charts are made from exactly the same data! In the second chart, the smallest bubbles are made very small while in the first chart, the smallest bubbles are still quite large.

Think twice before you make a bubble chart.

 


Getting simple charts right

Ian K. submitted this chart on Twitter:

Iankos_chicagocops

The chart comes from a video embedded in this report (link) about Chicago cops leaving their jobs.

Let's start with the basics. This is an example of a simple line chart illustrating a time series of five observations. The vertical axis starts at 10,000 instead of 0. With this choice, the designer wants to focus on the point-to-point change in values, rather than its relation to the initial value.

Every graph has add-ons that assist cognition. On this chart, we have axis labels, gridlines and data labels. Every add-on increases reading time so we should be sparing.

First consider the gridlines. In the following chart, I conduct a self-sufficiency test by removing the data labels from the chart:

Redo_wgn9chicagocops_junkcharts_selfsufficiency

You can see that the last three values present no problems. The first two, especially the first value, are hard to read - because the top gridline is missing! The next chart restores the bounding gridline, so you can see the difference that one small detail can make:

Redo_wgn9chicagocops_junkcharts_addedgridline

***

Next, let's compare the following versions of the chart. The left one contains data labels without gridlines and axis labels. The right one has the gridlines and axis labels but no data labels.

Redo_wgn9chicagocops_gridlinesdatalabels

The left chart prints the entire dataset onto the chart. The reader in essence is reading the raw data. That appears to be the intention of the chart designer as the data labels are in large size, placed inside shiny white boxes. The level of the boxes determines the reader's perception as those catch more of our attention than the dots that actually represent the data.

The right chart highlights the dots and the lines between them. The gridlines are way too thick and heavy so as to distract rather than abet. This chart presumes that the reader isn't that interested in the precise numbers as she is in the trend.

***

As Ian pointed out, one of the biggest problems with this chart is the appearance of even time intervals when all except one of the date values are January. This seemingly innocent detail destroys the chart. The line segments of the chart encodes the pre-post change in the staffing numbers. For most of the line segments, the metric is year-on-year change but the last two line segments on the right show something else: a 19-month change, followed by a 5-month change.

I did the following analysis to understand how big of a staffing problem CPD faces.

Redo_wgn9chicagocops_trendanalysis
First I restored the January 2022 time value, while shifting the Aug 2022 value to its rightful place on the time axis. Next, I added the dashed brown line, which represents a linear extension of the trend seen between January 2020-2021, before the sudden dip. We don't know what the true January 2022 value is but the projected value based on past trend is around 12,200. By August, the projected value is around 11,923, about 300 above the actual value of 11,611. By January 2023, the projected value is almost exactly the same as the actual value.

This linear trending analysis is likely too simplistic but it offers a baseline to start thinking about what the story is. The long-term trend is still down but the apparent dip in 2022 may not be meaningful.