Color bomb

I found a snapshot of the following leaderboard (link) in a newsletter in my inbox.

Openrouter_leaderboard_stackedcolumns

This chart ranks different AIs (foundational models) by token usage (which is the unit by which AI companies charge users).

It's a standard stacked column chart, with data aggregated by week. The colors represent different foundational models.

In the original webpage, there is a table printed below, listing the top 20 model names, ordered from the most tokens used.

Openrouter_leaderboard_table

Certain AI models have come and gone (e.g. the yellow and blue ones at the bottom of the chart in the first half). The model in pink has been the front runner through all weeks.

Total usage has been rising, although it might be flattening, which is the point made by the newsletter publisher.

***

A curiosity is the gray shaded section on the far right - it represents the projected total token usage for the days that have not yet passed during the current week. This is one of those additions that I would like to see more often. If the developer had chosen to plot the raw data and nothing more, then they would have made the same chart except for the gray section. On that chart, the last column should not be compared to any other column as it is the only one that encodes a partial week.

This added gray section addresses the specific question: whether the total token usage for the current week is on pace with prior weeks, or faster or slower. (The accuracy of the projection is a different matter, which I won't discuss.)

This added gray section leaves another set of questions unanswered. The chart suggests that the total token usage is expected to exceed the values for the prior few weeks, at the time it was frozen. We naturally want to know which models are contributing to this projected growth (and which aren't). The current design cannot address this issue because the projected additional usage is aggregated, and not available at the model level.

While it "tops up" the weekly total usage using a projected value, the chart does not show how many days are remaining. That's an important piece of information for interpreting the projection.
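To see why the number of remaining days matters, here is a minimal sketch of a naive pro-rata projection. The leaderboard's actual projection method is not described, so the function `project_week_total` and all the numbers below are assumptions made purely for illustration.

```python
def project_week_total(usage_so_far, days_elapsed, days_in_week=7):
    """Naive pro-rata projection: assume the remaining days match the average pace so far."""
    if not 0 < days_elapsed <= days_in_week:
        raise ValueError("days_elapsed must be in (0, days_in_week]")
    return usage_so_far * days_in_week / days_elapsed


# Hypothetical usage figure, not taken from the leaderboard.
usage_so_far = 1.2  # e.g. trillions of tokens observed so far this week

for days_elapsed in (2, 5):
    projected = project_week_total(usage_so_far, days_elapsed)
    top_up = projected - usage_so_far  # this is what the gray section would encode
    print(f"{days_elapsed} days in: projected total {projected:.2f}, gray section {top_up:.2f}")
```

Under this assumed method, the same observed usage leads to a much larger gray section when only two days have passed than when five have, which is why the reader needs to know how much of the week is projected.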

***

Now, we come to the good part, for those of us who love details.

A major weakness of these stacked column charts is of course the dizzying set of colors required, one for each model. Some of the shades are so similar it's hard to tell whether colors are repeated. Are these two different blues or the same blue?

Openrouter_leaderboard_blues

Besides, the visualization software has a built-in feature that "softens" a color when it is clicked on. This feature introduces unpleasant surprises as that soft shade might have been used for another category.

Openrouter_aimodels_ranking_mutedcolors

It appears that the series is running sideways (following the superimposed gray line) when in fact the first section is a softened red associated with the series that went higher (following the white line).

It's near impossible to work with so many colors. If you extract the underlying data, you find that they show 10 values per day across 24 weeks. Because the AI companies are busy launching new models, the dataset contains 40 unique model names, which implies that they needed 40 different shades on this one chart. (Double that to 80 shades if we add the softened on-click variations.)

***

I hope some of you have noticed something else. Earlier, I mentioned the model in pink as the most popular AI model but if you take a closer look, this pink section actually represents a mostly useless catch-all category called "Others," that presumably aggregates the token usages of a range of less popular models. In this design, the Others category is catching an undeserved amount of attention.

It's unclear how the models are ordered within each column. The developer did not group together different generations of models by the same developer. Anthropic Claude has many entries: Sonnet 4 [green], Sonnet 3.5 [blue], Sonnet 3.5 (self-moderated) [yellow], Sonnet 3.7 (thinking) [pink], Sonnet 3.7 [violet], Sonnet 3.7 (self-moderated) [cyan], etc. The same for OpenAI, Google, etc.

This graphical decision may reflect how users of large language models evaluate performance. Perhaps at this time, there is no brand loyalty, or lock-in effect, and users see all these different models as direct substitutes. Therefore, our attention is focused on the larger number of individual models, rather than the smaller set of AI developers.

***

Before ending the post, I must point out that the publisher of this set of rankings offers a platform that allows users to switch between models. They are visualizing their internal data. This means the dataset only describes what customers of Openrouter.ai do on this platform. There should be no expectation that this company's user base is representative of all users of LLMs.


Students demonstrate how analytics underlie strong dataviz

In today's post, I'm delighted to feature work by several students of Ray Vella's data visualization class at NYU. They were asked to improve the following Economist chart entitled "The Rich Get Richer".

Economist_richgetricher

In my guest lecture to the class, I emphasized the importance of upfront analytics when constructing data visualizations.

One of the key messages is to pay attention to definitions. How does the Economist define "rich" and "poor"? (It's not what you think.) Instead of using percentiles (e.g. top 1% of the income distribution), they define "rich" as people living in the richest region by average GDP, and "poor" as people living in the poorest region by average GDP. Thus, the "gap" between the rich and the poor is measured by the difference in GDP between the average persons in those two regions.

I don't like this metric at all but we'll just have to accept that that's the data available for the class assignment.

***

Shulin Huang's work is notable in how she clarifies the underlying algebra.

Shulin_rvella_economist_richpoorgap

The middle section classifies the countries into two groups, those with widening vs narrowing gaps. The side panels show the two components of the gap change. The gap change is the sum of the change in the richest region and the change in the poorest region.

If we take the U.S. as an example, the gap increased by 1976 units. This is because the richest region gained 1777 while the poor region lost 199. Germany has a very different experience: the richest region regressed by 2215 while the poorest region improved by 424, leading to the gap narrowing by 2638.
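The arithmetic can be checked with a short sketch. It uses only the four component values quoted above; writing losses as negative numbers and the helper `gap_change` are choices made for this illustration. In signed terms, the gap change is the richest region's change minus the poorest region's change.

```python
# Component changes as quoted in the text (losses written as negative numbers).
changes = {
    "U.S.": {"richest": 1777, "poorest": -199},
    "Germany": {"richest": -2215, "poorest": 424},
}


def gap_change(richest_change, poorest_change):
    """Change in the rich-poor gap: richest region's change minus poorest region's change."""
    return richest_change - poorest_change


for country, c in changes.items():
    delta = gap_change(c["richest"], c["poorest"])
    direction = "widened" if delta > 0 else "narrowed"
    print(f"{country}: gap {direction} by {abs(delta)}")
```

The U.S. figure reproduces the 1976 quoted above. The German figure comes out one unit away from the quoted 2638, presumably because the components shown on the chart are rounded.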

Note how important it is to keep the order of the countries fixed across all three panels. I'm not sure how she decided the order of these countries, which is a small oversight in an otherwise excellent effort.

Shulin's text is very thoughtful throughout. The chart title clearly states "rich regions" rather than "the rich". Take a look at the bottom of the side panels. The label "national AVG" shows that the zero level is the national average. Then, the label "regions pulled further ahead" perfectly captures the positive direction.

Compared to the original, this chart is much more easily understood. The secret is the clarity of thought, the deep understanding of the nature of the data.

***

Michael Unger focuses his work on elucidating the indexing strategy employed by the Economist. In the original, each value of regional average GDP is indexed to the national average of the relevant year. A number like 150 means the region has an average GDP for the given year that is 50% higher than the national average. It's tough to explain how such indices work.

Michael's revision goes back to the raw data. He presents them in two panels. On the left, the absolute change over time in average GDP is presented for each of the richest and poorest regions, while on the right, the relative change is shown.

Mungar_rvella_economist_richpoorgap

(Some of the country labels are incorrect. I'll replace with a corrected version when I receive one.)

Presenting both sides is not redundant. In France, for example, the richest region improved by 17K while the poorest region went up by not quite 6K. But 6K on a much lower base represents a much higher proportional jump as the right side shows.

***

Related to Michael's work, but even simpler, is Debbie Hsieh's effort.

Debbiehsieh_rayvella_economist_richpoorgap

Debbie reduces the entire exercise to one message - the relative change over time in average GDP between the richest and poorest region in each country. In this simplest presentation, if both columns point up, then both the richest and the poorest region increased their average GDP; if both point down, then both regions suffered GDP drops.

If the GDP increased in the richest region while it decreased in the poorest region, then the gap widened by the most. This is represented by the blue column pointing up and the red column pointing down.

In some countries (e.g. Sweden), the poorest region (orange) got worse while the richest region (blue) improved slightly. In Italy and Spain, both the best and worst regions gained in average GDPs although the richest region attained a greater relative gain.

While Debbie's chart is simpler, it hides something that Michael's work shows more clearly. If both the richest and poorest regions increased GDP by the same percentage amount, the average person in the richest region actually experienced a higher absolute increase because the base of the percentage is higher.
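A small worked example makes the point concrete. The base values and the growth rate below are hypothetical, chosen only to keep the arithmetic easy; they do not come from the Economist dataset.

```python
# Hypothetical average GDP per person in each region.
base = {"richest region": 60_000, "poorest region": 20_000}
growth_pct = 10  # the same relative increase applied to both regions

absolute_gain = {region: value * growth_pct / 100 for region, value in base.items()}

gap_before = base["richest region"] - base["poorest region"]
gap_after = gap_before + absolute_gain["richest region"] - absolute_gain["poorest region"]

for region, gain in absolute_gain.items():
    print(f"{region}: +{growth_pct}% is an absolute gain of {gain:,.0f}")
print(f"gap before: {gap_before:,.0f}, gap after: {gap_after:,.0f}")
```

Both regions grow at the same rate, yet the absolute gap between them still widens, which is the information a chart of relative change alone cannot show.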

***

The numbers across these charts aren't necessarily well aligned. That's actually one of the challenges of this dataset. There are many ways to process the data, and small differences in how each student handles the data lead to differences in the derived values, resulting in differences in the visual effects.


Hammock plots

Prof. Matthias Schonlau gave a presentation about "hammock plots" in New York recently.

Here is an example of a hammock plot that shows the progression of different rounds of voting during the 1903 papal conclave. (The photos were taken at the event and are thus a little askew.)

Hammockplot_conclave

The chart shows how Cardinal Sarto beat the early favorite Rampolla during later rounds of voting. The chart traces the movement of votes from one round to the next. The Vatican normally destroys voting records but, apparently, records were unexpectedly retained for this particular conclave.

The dataset has several features that bring out the strengths of such a plot.

There is a fixed number of votes, and a fixed number of candidates. At each stage, the votes are distributed across the subset of candidates. From stage to stage, the support levels for the candidates shift. The chart brings out the evolution of the vote.

From the "marginals", i.e. the stacked columns shown at each time point, we learn the relative strengths of the candidates, as they evolve from vote to vote.

The links between the column blocks display the evolution of support from one vote to the next. We can see which candidate received more votes, as well as where the additional votes came from (or, to whom some voters have drifted).

The data are neatly arranged in successive stages, resulting in discrete time steps.

Because the total number of votes is fixed, the relative sizes of the marginals are nicely constrained.

The chart is made much more readable because of binning. Only the top three candidates are shown individually with all the others combined into a single category. This chart would have been quite a mess if it showed, say, 10 candidates.

How precisely we can show the movement between stages depends on how the data records were kept. If we have the votes for each person in each round, then it should be simple to execute the above! If we only have the marginals (the vote distribution by candidate) at each round, then we are forced to make some assumptions about which voters switched their votes. We'd likely have to rule out unlikely scenarios, such as that in which all of the previous voters for candidate X switched to other candidates while another set of voters switched their votes to candidate X.
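When per-voter records exist, both the marginals and the links fall out of simple counting. The sketch below uses made-up ballots for seven voters (the real conclave records are not reproduced here); only the candidate names Rampolla and Sarto come from the text, and "Other" stands in for the combined category.

```python
from collections import Counter

# Hypothetical ballots: one entry per voter, same voter order in both rounds.
round_1 = ["Rampolla", "Rampolla", "Rampolla", "Sarto", "Sarto", "Other", "Other"]
round_2 = ["Rampolla", "Rampolla", "Sarto", "Sarto", "Sarto", "Sarto", "Other"]

# Marginals: the stacked column drawn at each round.
marginals_1 = Counter(round_1)
marginals_2 = Counter(round_2)

# Flows: the links between the columns, keyed by (from, to).
flows = Counter(zip(round_1, round_2))

for (src, dst), n in sorted(flows.items()):
    print(f"{src:>8} -> {dst:<8} {n}")
```

With only `marginals_1` and `marginals_2` in hand, many different sets of flows would be consistent with the same columns, which is why assumptions are needed in that case.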

***

Matthias also showed examples of hammock plots applied to different types of datasets.

The following chart displays data from course evaluations. Unlike the conclave example, the variables tied to questions on the survey are neither ordered nor sequential. Therefore, there is no natural sorting available for the vertical axes.

Hammockplot_evals

Time is a highly useful organizing element for this type of chart. Without such an organizing element, the designer manually customizes an order.

The vertical axes correspond to specific questions on the course evaluation. Students are aggregated into groups based on the "profile" of grades given for the whole set of questions. It's quite easy to see that opinions are most aligned on the "workload" question while most of the scores are skewed high.

Missing values are handled by plotting them as a new category at the bottom of each vertical axis.

This example is similar to the conclave example in that each survey response is categorical, one of five values (plus missing). Matthias also showed examples of hammock plots in which some or all of the variables are numeric data.

***

Some of you will see some resemblance of the hammock plot with various similar charts, such as the profile chart, the alluvial chart, the parallel coordinates chart, and Sankey diagrams. Matthias discussed all those as well.

Matthias has a book out called "Applied Statistical Learning" (link).

Also, there is a Python package for the hammock plot on GitHub.


Scrambled egg

Let's take a look at the central message this chart is aiming to convey: "U.S. egg prices hit a 10-year high in 2025 after avian flu killed 30 million egg-laying birds." (The original is found on Visual Capitalist.)

Visualcapitalist_eggs

_trifectacheckup_image

Using the Trifecta Checkup framework (link), we ask how the data are aligned with this question. What do the data say?

The data give the average egg prices in 41 countries, sorted from highest to lowest, and arranged in a clockwise manner starting from the top.

The dataset does not address the question posed by the central message.

  • With no history, it cannot show that U.S. egg prices are at a 10-year high.
  • With no explanatory variables, it cannot say why egg prices have increased in 2025.
  • Without context, it cannot address the avian flu.
  • The U.S. does not even stand out.
  • It also does not show the extreme magnitude of the recent increase in egg price in the U.S.

Because of this mismatch, the graphic fails to deliver the intended message.

Notably, the dataset introduces the country dimension, which is unrelated to the central message, but nevertheless interesting. Yet the question of interest isn't the point-in-time comparison. I'd like to know if egg price inflation is a global trend, or an American exclusive. At some point, the inflation will flatten out, although the price of eggs would probably not return to the pre-inflation level. An international comparison across time would bring this insight out clearly.

***

Before ending, we'll make a quick stop at the Visual corner of the Trifecta Checkup. Since the designer uses an ellipse to represent the egg, the bars sticking out of the ellipse are somewhat distorted. Do the bar lengths encode the data accurately?

I looked at Brazil vs Italy. The price in Italy ($3.97) is basically twice that in Brazil ($1.99), but the BRA bar is only 40% as long as the ITA bar.

Italy and Belgium, shown side by side, have the same egg price to the second decimal place. The bar lengths are not the same.
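The Brazil vs Italy comparison can be written out as a quick check. The two prices and the 40% measurement are taken from the text; the assumption that bar length should be proportional to price, measured from a zero baseline, is mine.

```python
prices = {"ITA": 3.97, "BRA": 1.99}  # U.S. dollars, as printed on the chart
measured_length_ratio = 0.40  # BRA bar length divided by ITA bar length

# If length were proportional to price, the ratio of lengths would equal the ratio of prices.
expected_length_ratio = prices["BRA"] / prices["ITA"]
distortion = measured_length_ratio / expected_length_ratio

print(f"expected ratio {expected_length_ratio:.2f}, measured ratio {measured_length_ratio:.2f}")
print(f"the BRA bar is drawn at about {distortion:.0%} of its proportional length")
```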

This observation suggests that the chart fails my self-sufficiency test. If the entire dataset were not printed on the chart, the reader couldn't interpret the bars.


The message left the visual

The following chart showed up in Princeton Alumni Weekly, in a report about China's population:

Sciam_chinapop_19802020

This chart was one of several that appeared in a related Scientific American article.

The story itself is not surprising. As China develops, its birth rate declines while the death rate also falls; thus, the population ages. The same story has played out in all advanced economies.

***

From a Trifecta Checkup perspective, this chart suffers from several problems.

The text annotation on the top right suggests what message the authors intended to deliver. Pointing to the group of people aged between 30 and 59 in 2020, they remarked that this large cohort would likely cause "a crisis" when they age. There would be fewer youngsters to support them.

Unfortunately, the data and visual elements of the chart do not align with this message. Instead of looking forward in time, the chart compares the 2020 population pyramid with that from 1980, looking back 40 years. The chart shows an insight from the data, just not the right one.

A major feature of a population pyramid is the split by gender. The trouble is gender isn't part of the story here.

In terms of age groups, the chart treats each subgroup "fairly". As a result, the reader isn't shown which of the 22 subgroups to focus on. There are really 44 subgroups if we count each gender separately, and 88 subgroups if we include the year split.

***

The following redesign traces the "crisis" subgroup (those who were 30-59 in 2020) both backwards and forwards.

Junkcharts_redo_chinapopulationpyramids

The gender split has been removed; here, the columns show the total population. Color is used to focus attention to one cohort as it moves through time.

Notice I switched up the sample times. I pulled the population data for 1990 and 2060 (from this website). The original design used the population data from 1980 instead of 1990. However, this choice is at odds with the message. People who were 30 in 2020 were not yet born in 1980! They started showing up in the 1990 dataset.

At the other end of the "crisis" cohort, the oldest (59 years old in 2020) would have died by 2100, as 59+80 = 139. Even the youngest (30 in 2020) would be 110 by 2100, so almost everyone in the pink section of the 2020 chart would have fallen off the right side of the chart by 2100.
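The cohort arithmetic behind these choices of years can be laid out in a few lines. The function name `cohort_age_range` is just an illustrative helper; the ages and years come from the discussion above.

```python
COHORT_AGES_IN_2020 = (30, 59)  # youngest and oldest members of the "crisis" cohort


def cohort_age_range(year, ref_year=2020, ref_ages=COHORT_AGES_IN_2020):
    """Return (youngest, oldest) age of the cohort in `year`; a negative age means not yet born."""
    shift = year - ref_year
    return ref_ages[0] + shift, ref_ages[1] + shift


for year in (1980, 1990, 2020, 2060, 2100):
    youngest, oldest = cohort_age_range(year)
    print(f"{year}: youngest {youngest}, oldest {oldest}")
```

In 1980 part of the cohort has a negative age, while 1990 is the first of these years in which the whole cohort has been born.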

These design decisions insert a gap between the visual and the message.


The reckless practice of eyeballing trend lines

MSN showed this chart claiming a huge increase in the number of British children who believe they are born the wrong gender.

Msn_genderdysphoria

The graph has a number of defects, starting with drawing a red line that clearly isn’t the trend in the data.

To find the trend line, we have to draw a line that is closest to the top of every column. The true trend line is closer to the blue line drawn below:

Junkcharts_redo_msngenderdysphoria_1

The red line moves up one unit roughly every three years while the blue line does so every four years.
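Rather than eyeballing, one common way to find "the line closest to the top of every column" is an ordinary least-squares fit. The sketch below is only an illustration: the column heights are invented (they are not read off the MSN chart), and least squares is my choice of method, not necessarily how either line above was drawn.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical column heights: slow growth followed by a jump in the last year.
years = list(range(2011, 2022))
values = [1.0, 1.1, 1.2, 1.4, 1.5, 1.7, 1.9, 2.0, 2.2, 2.4, 4.5]

intercept, slope = least_squares_line(years, values)
# A line drawn from the top of the first column to the top of the last one.
endpoint_slope = (values[-1] - values[0]) / (years[-1] - years[0])

print(f"least-squares slope: {slope:.2f} per year (one unit every {1 / slope:.1f} years)")
print(f"endpoint slope: {endpoint_slope:.2f} per year (one unit every {1 / endpoint_slope:.1f} years)")
```

With data shaped like this, a line anchored on the final column is noticeably steeper than the fitted line, because a single late jump dominates it.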

Notice the dramatic jump in the last column of the chart. The observed trend is not a straight line, and therefore it is not appropriate to force a straight-line model. Instead, it makes more sense to divide the time line into three periods, with different rates of change.

Junkcharts_redo_msngenderdysphoria_2

Most of the growth during this 10-year period occurred in the last year, so one should check the data, and also check whether any accounting criterion changed that might explain this large unexpected jump.

***

The other curiosity about this chart is the scale of the vertical axis. Nowhere on the chart does it say which metric of gender dysphoria it is depicting. The title suggests they are counting the number of diagnoses but the axis labels that range from one to five point to some other metric.

From the article, we learn that the annual number of gender dysphoria diagnoses was about 10,000 in 2021, and that is encoded as 4.5 in the column chart. The sub-header of the chart indicates that the unit is number per 1,000 people. Ten thousand diagnoses, divided by the size of the under-18 population and multiplied by 1,000, should equal 4.5. This implies there were roughly 2.2 million people under 18 in the U.K. in 2021.

But according to these official statistics (link), there were about 13 million people aged 0-18 in just England and Wales in mid-2022, which is not in the right range. From a dataviz perspective, the designer needs to explain what the values on the vertical axis represent. Right now, I have no idea what they mean.
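The back-of-envelope check runs in both directions. All three inputs are the figures quoted above; the variable names are mine.

```python
diagnoses = 10_000                 # annual diagnoses in 2021, per the article
rate_per_1000 = 4.5                # height of the 2021 column
official_population = 13_000_000   # aged 0-18, England and Wales, mid-2022

# Population implied by the chart if the axis really is diagnoses per 1,000 people.
implied_population = diagnoses / rate_per_1000 * 1000

# Rate implied by the official population if the diagnosis count is right.
implied_rate = diagnoses / official_population * 1000

print(f"implied under-18 population: {implied_population:,.0f}")
print(f"implied rate per 1,000: {implied_rate:.2f}")
```

The implied population is about 2.2 million, and the implied rate is well under 1 per 1,000, so the axis values and the quoted figures cannot both be describing the same metric.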

***

Using the Trifecta Checkup framework, we say that the question addressed by the chart is clear but there are problems relating to data encoding as well as the trend-line visual.

_trifectacheckup_image


Making major things easy, and minor things hard

A recent issue of Significance magazine carried the following stacked column chart showing how the driver license status of men and women changes as they age. The data came from the U.K.

Siginificance_olddrivers_1

Quick question - what percentage of British men in their sixties hold full driver licenses?

***

I was just kidding. That question can't be quickly answered on a stacked column chart. That's because you have to find the axis, and then mentally invert the axis.

On that chart, larger values are shown pointing down (green) and also pointing up (blue), and ... well, I don't have words for the yellow. In fact, the yellow segments, showing people without licenses, are possibly the most important category for this report.

In making decisions about visualizing data, it's important to separate out the major things from the minor things.

***

Here is a reimagination of the chart using connected dots:

Junkcharts_redo_significanceolderdrivers

What is hard to do using this chart is to verify that the three proportions add to 100%. What is easy is to read off the proportion for any gender, age and license status subgroup.

It's really quite intricate how these researchers binned the age data. There are bins of size 1, 4, 5 and 10, plus the top group is 85 and above. The way I handled these is to turn everything into 1-year bins. I assume that in the wider bins, we don't have precise data for each age, and the bin value is the average across the bin, thus it is as if someone had drawn a horizontal line across the bin width. (I left the top bin alone as I don't know the maximum age of a person in this study.)
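A minimal sketch of that expansion follows. The bin boundaries and percentages are hypothetical (chosen to have widths 1, 4, 5 and 10), not the figures from Significance, and the open-ended top bin is left out, as described.

```python
def expand_to_single_years(bins):
    """Spread each (low, high, value) bin over single years of age, repeating the bin's value."""
    expanded = {}
    for low, high, value in bins:
        for age in range(low, high + 1):  # bounds are inclusive
            expanded[age] = value
    return expanded


# Hypothetical bins: (lowest age, highest age, percentage holding a full license).
binned = [
    (17, 17, 10.0),  # width 1
    (18, 21, 35.0),  # width 4
    (22, 26, 60.0),  # width 5
    (27, 36, 75.0),  # width 10
]

by_age = expand_to_single_years(binned)
for age in (17, 20, 24, 30):
    print(age, by_age[age])
```

Plotting `by_age` as connected dots gives flat runs across each original bin, which is the "horizontal line across the bin width" effect.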

***

Those of you who have laminated the flowchart of data visualization are probably irate. According to such a flowchart, one must use a column chart because the x variable (age band) has irregularly-sized discrete values, and one must use a stacked column chart because the y variable is a percentage, grouped by a third variable (license status).

Don't be mad, just ditch the flowchart.

 


Organizing time-stamped data

In a previous post, I looked at the Economist chart about Elon Musk's tweeting compulsion. It's a chart that contains lots of data, every tweet included, but one can't tell the number or frequency of tweets.

In today's post, I'll walk through a couple of sketches of other charts. I was able to find a dataset on GitHub that does not cover the same period of time but it's good enough for illustration purposes.

As discussed previously, I took cues from the Economist chart, in particular that the hours of the day should be divided up into four equal-width periods. One thing Musk is known for is tweeting at any hour of the day.

Junkcharts_redo_musktweets_columnsbyhourgroup

This is a small-multiples arrangement of column charts. Each column chart represents the tweets that were posted during a six-hour window, across all days in the dataset. A column covers half a year of tweets. We note that there were more tweets in the afternoon hours as he started tweeting more. In the first half of 2022, he sent roughly 750 tweets between 7 pm and midnight.

***

In this next sketch, I used small multiples of line charts. Each line chart represents tweets posted during a six-hour window, as before. Instead of counting tweets, here I "smoothed" the daily tweet count, so that each number is an average daily tweet count, with the average computed based on a rolling time window.

Junkcharts_redo_musktweets_sidebysidelines

***

Finally, let's cover a few details only people who make charts would care about. The time of day variable only makes sense if all times are expressed as "local time", i.e. the time at the location where Musk was tweeting from. This knowledge is not necessary to make a chart but it is essential to make the chart interpretable. A statement like Musk tweets a lot around midnight assumes that it was midnight where he was when he sent each tweet.

Since we don't have his travel schedule, we will definitely be wrong. In my charts, I assumed he is in the Pacific time zone, and never tweeted anywhere outside that time zone.

(Food for thought: the server that posts tweets certainly had the record of the time and time zone for each tweet. Typically, databases store these time stamps standardized to one time zone - call it Greenwich Mean Time. If you have all time stamps expressed in GMT, is it now possible to make a statement about midnight tweeting? Does standardizing to one time zone solve this problem?)
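As a way to explore that question, the sketch below assigns the same UTC time stamps to six-hour windows under two assumed offsets. The time stamps are invented, and the fixed offset of -8 hours is a simplification of Pacific time that ignores daylight saving.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def six_hour_window(ts_utc, utc_offset_hours):
    """Label the six-hour window of the local clock time under an assumed UTC offset."""
    local = ts_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    start = local.hour // 6 * 6
    return f"{start:02d}:00-{start + 6:02d}:00"


# Hypothetical tweet time stamps, stored in UTC.
tweets_utc = [
    datetime(2022, 3, 1, 7, 30, tzinfo=timezone.utc),
    datetime(2022, 3, 1, 8, 15, tzinfo=timezone.utc),
    datetime(2022, 3, 1, 20, 5, tzinfo=timezone.utc),
]

for label, offset in (("UTC", 0), ("assumed Pacific, fixed -8", -8)):
    counts = Counter(six_hour_window(ts, offset) for ts in tweets_utc)
    print(label, dict(counts))
```

The same three tweets land in different windows depending on the assumed offset, so a standardized time stamp alone does not say what time it was where the tweet was sent.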

In addition, I suspect that there may be problems with the function used to compute those rolling sums and averages, so take the actual numbers on those sketches with a grain of salt. Specifically, it's hard to tell on any of these charts but Musk did not tweet every single day so there are lots of holes in the time series.
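One plausible source of trouble (an assumption, not a diagnosis of the function actually used) is a rolling window that runs over rows of data rather than calendar days, so that days without tweets are skipped instead of counted as zero. Here is a minimal sketch that fills the holes before averaging; the counts and the helper `rolling_daily_average` are hypothetical.

```python
from datetime import date, timedelta


def rolling_daily_average(counts_by_day, window=7):
    """Trailing rolling mean of daily counts, treating days absent from the data as zero."""
    first, last = min(counts_by_day), max(counts_by_day)
    n_days = (last - first).days + 1
    days = [first + timedelta(days=i) for i in range(n_days)]
    filled = [counts_by_day.get(day, 0) for day in days]  # fill the holes with zeros
    averages = {}
    for i in range(window - 1, n_days):
        averages[days[i]] = sum(filled[i - window + 1 : i + 1]) / window
    return averages


# Hypothetical counts; days with no tweets are simply missing from the dict.
counts = {
    date(2022, 1, 1): 6,
    date(2022, 1, 2): 3,
    date(2022, 1, 5): 12,
}

smoothed = rolling_daily_average(counts, window=3)
for day, avg in smoothed.items():
    print(day, avg)
```

In this toy example, averaging the last three rows of data for January 5 would give (6 + 3 + 12) / 3 = 7, whereas the calendar-based window gives 4 because the two empty days count as zeros.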


Ranks, labels, metrics, data and alignment

A long-time reader Chris V. (since 2012!) sent me to this WSJ article on airline ratings (link).

The key chart form is this:

Wsj_airlines_overallranks

It's a rhombus shaped chart, really a bar chart rotated counter-clockwise by 45 degrees. Thus, all the text is at 45 degree angles. An airplane icon is imprinted on each bar.

There is also this cute interpretation of the white (non-data-ink) space as a symmetric reflection of the bars (with one missing element). On second thought, the decision to tilt the chart was probably made in service of this quasi-symmetry. If the data bars were horizontal, then the white space would have been sliced up into columns, which just doesn't hold the same appeal.

If we be Tuftian, all of these flourishes do not serve the data. But do they do much harm? This is a case that's harder to decide. The data consist of just a ranking of airlines. The message still comes across. The head must tilt, but the chart beguiles.

***

As the article progresses, the same chart form shows up again and again, with added layers of detail. I appreciate how the author has constructed the story. Subtly, the first chart teaches the readers how the graphic encodes the data, and fills in contextual information such as there being nine airlines in the ranking table.

In the second section, the same chart form is used, while the usage has evolved. There is now a pair of these rhombuses. Each rhombus shows the rankings of a single airline while each bar inside the rhombus shows the airline's ranking on a specific metric. Contrast this with the first chart, where each bar is an airline, and the ranking is the overall ranking on all metrics.

Wsj_airlines_deltasouthwestranks

You may notice that you've used a piece of knowledge picked up from the first chart - that on each of these metrics, each airline has been ranked against eight others. Without that knowledge, we don't know that being 4th is just better than the median. So, in a sense, this second section is dependent on the first chart.

There is a nice use of layering, which links up both charts. A dividing line is drawn between the first place (blue) and not being first (gray). This layering allows us to quickly see that Delta, the overall winner, came first in two of the seven metrics while Southwest, the second-place airline, came first in three of the seven (leaving two metrics for which neither of these airlines came first).

I'd be the first to admit that I have motion sickness. I wonder how many of you are starting to feel dizzy while you read the labels, heads tilted. Maybe you're trying, like me, to figure out the asterisks and daggers.

***

Ironically, but not surprisingly, the asterisks reveal a non-trivial matter. Asterisks direct readers to footnotes, which should be supplementary text that adds color to the main text without altering its core meaning. Nowadays, asterisks may hide information that changes how one interprets the main text, such as complications that muddy the main argument.

Here, the asterisks address a shortcoming of representing ranking using bars. By convention, a lower rank number indicates better performance, and most ranking schemes start counting from 1. If ranks are directly encoded in bars, then the best object is given the shortest bar. But that's not what we see on the chart. The bars actually encode the reverse ranking, so the longest bar represents first place.
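The reverse encoding can be written down in one line. The sketch assumes bar length is linear in the reversed rank, which is my reading of the chart rather than something the WSJ states; the count of nine airlines comes from the ranking table.

```python
N_AIRLINES = 9  # nine airlines in the ranking table


def bar_length(rank, n=N_AIRLINES):
    """Reverse-rank encoding: rank 1 gets the longest bar (n units), rank n the shortest (1 unit)."""
    if not 1 <= rank <= n:
        raise ValueError(f"rank must be between 1 and {n}")
    return n + 1 - rank


for rank in (1, 4, 9):
    print(f"rank {rank}: {'#' * bar_length(rank)}")
```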

That's level one of this complication. Level two is where these asterisks are at.

Notice that the second metric is called "Canceled flights". The asterisk quipped "fewest". The data collected is on the number of canceled flights but the performance metric for the ranking is really "fewest canceled flights". 

If we see a long bar labelled "1st" under "canceled flights", it causes a moment of pause. Is the airline ranked first because it had the most canceled flights? That would imply being first is worst for this category. It couldn't be that. So perhaps "1st" means having the fewest canceled flights but then it's just weird to show that using the longest bar. The designer correctly anticipates this moment of pause, and that's why the chart has those asterisks.

Unfortunately, six out of the seven metrics require asterisks. In almost every case, we have to think in reverse. "Extreme delays" really means "Least extreme delays"; "Mishandled baggage" really means "Less mishandled baggage"; etc. I'd spend some time renaming the metrics to fix this without resorting to footnotes. For example, saying "Baggage handling" instead of "Mishandled baggage" is sufficient.

***

The third section contains the greatest detail. Now, each chart prints the ranking of nine airlines for a particular metric.

Wsj_airlinerankings_bymetric

By now, the cuteness faded while the neck muscles paid. Those nice annotations, written horizontally, offered but a twee respite.


Simple presentations

In the previous post, I looked at this chart that shows the distributions of four subgroups found in a dataset:

Davidcurran_originenglishwords

This chart takes quite some effort to decipher, as does another version I featured.

The key messages appear to be: (i) most English words are of Germanic origin, (ii) the most popular English words are even more skewed towards Germanic origin, (iii) words of French origin started showing up around rank 50, those of Latin origin around rank 250.

***

If we are making a graphic for presentation, we can simplify the visual clutter tremendously by - hmmm - a set of pie charts.

Junkcharts_redo_originenglishwords_pies

For those allergic to pies, here's a stacked column chart:

Junkcharts_redo_originenglishwords_columns

Both of these can be thought of as "samples" from the original chart, selected to highlight shifts in the relative proportions.

Davidcurran_originenglishwords_sampled

I also reversed the direction of the horizontal axis as I think the story is better told starting from the whole dataset and homing in on subsets.

P.S. [1/10/2025] A reader who has expertise in this subject also suggested a stacked column chart with reversed axis in a comment, so my recommendation here is confirmed.