Color bomb
Jul 14, 2025
I found a snapshot of the following leaderboard (link) in a newsletter in my inbox.
This chart ranks different AIs (foundation models) by token usage, the unit by which AI companies charge users.
It's a standard stacked column chart, with data aggregated by week. The colors represent different foundation models.
In the original webpage, a table is printed below the chart, listing the top 20 model names, ordered from most to fewest tokens used.
Certain AI models have come and gone (e.g. the yellow and blue ones at the bottom of the chart in the first half). The model in pink has been the front-runner through all the weeks.
Total usage has been rising, although it might be flattening, which is the point made by the newsletter publisher.
***
A curiosity is the gray shaded section on the far right: it represents the projected total token usage for the days of the current week that have not yet passed. This is one of those additions I'd like to see more often. Had the developer plotted the raw data and nothing more, they would have produced the same chart minus the gray section; on that chart, the last column should not be compared to any other column, as it is the only one that encodes a partial week.
This added gray section addresses a specific question: is the total token usage for the current week on pace with prior weeks, or running faster or slower? (The accuracy of the projection is a different matter, which I won't discuss.)
This added gray section leaves another set of questions unanswered. At the time the chart was frozen, it suggested that total token usage would exceed the values of the prior few weeks. We naturally want to know which models are contributing to this projected growth (and which aren't). The current design cannot address this question because the projected additional usage appears only in aggregate, not at the model level.
Also, while the chart "tops up" the weekly total with a projected value, it does not show how many days remain in the week. That's an important piece of information for interpreting the projection.
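How might such a projection be computed? We don't know what the developer actually did, but a minimal pro-rata sketch looks like this (the function name and sample numbers are mine, and the even spread of usage across days is an assumption):

```python
from datetime import date

def project_week_total(tokens_so_far: float, today: date) -> float:
    """Naive pro-rata projection: scale the partial week's total by the
    fraction of the week that has elapsed (Monday-start weeks assumed).
    Real traffic is rarely spread evenly across days."""
    days_elapsed = today.weekday() + 1  # Monday=1 ... Sunday=7
    return tokens_so_far * 7 / days_elapsed

# Toy example: 3 days into the week, 1.2 trillion tokens logged so far.
actual = 1.2e12
projected = project_week_total(actual, date(2025, 7, 9))  # a Wednesday
gray_topup = projected - actual  # the height of the gray section
print(f"projected week total: {projected:.3g}; gray top-up: {gray_topup:.3g}")
```

Notice that any such computation depends on the number of days remaining, which is exactly the piece of information the chart withholds.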
***
Now, we come to the good part, for those of us who love details.
A major weakness of these stacked column charts is, of course, the dizzying set of colors required, one for each model. Some of the shades are so similar that it's hard to tell whether colors have been repeated. Are these two different blues, or the same blue?
Besides, the visualization software has a built-in feature that "softens" a color when it is clicked. This feature introduces unpleasant surprises, as the softened shade might already be in use for another category.
The series appears to run sideways (following the superimposed gray line) when, in fact, the first section is a softened red belonging to the series that climbed higher (following the white line).
It's near impossible to work with so many colors. If you extract the underlying data, you find 10 values per day across 24 weeks. Because the AI companies are busy launching new models, the dataset contains 40 unique model names, which implies 40 different shades on this one chart. (Double that to 80 shades if we count the on-click variations.)
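To see why the softening feature causes trouble, here is a sketch using matplotlib's tab20 palette as a stand-in (I don't know which palette or transform the actual software uses; blending toward white is my assumption). tab20 pairs each strong color with a pale sibling, so a softened strong color can land right next to another category's base shade:

```python
import numpy as np
import matplotlib
import matplotlib.colors as mcolors

def soften(color, amount=0.5):
    """Hypothetical on-click 'softening': blend the color toward white."""
    rgb = np.array(mcolors.to_rgb(color))
    return rgb + (1.0 - rgb) * amount

palette = list(matplotlib.colormaps["tab20"].colors)

# 40 models but only 20 distinct colors in tab20, so shades must repeat.
colors_for_40 = [palette[i % len(palette)] for i in range(40)]
print(f"{len(colors_for_40)} series, {len(set(colors_for_40))} distinct colors")

# Flag softened shades that sit close to some other category's base color.
for i, base in enumerate(palette):
    soft = soften(base)
    dists = [np.linalg.norm(soft - np.array(other))
             for j, other in enumerate(palette) if j != i]
    if min(dists) < 0.2:  # roughly indistinguishable at a glance
        print(f"color {i}: softened shade collides with another base color")
```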
***
I hope some of you have noticed something else. Earlier, I mentioned the model in pink as the most popular AI model, but if you take a closer look, this pink section actually represents a mostly useless catch-all category called "Others," which presumably aggregates the token usage of a range of less popular models. In this design, the Others category catches an undeserved amount of attention.
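For what it's worth, such a catch-all bucket takes only a few lines to build, which may explain its popularity. A sketch with made-up numbers (the model names and cutoff are mine):

```python
import pandas as pd

# Weekly token usage by model (made-up numbers).
usage = pd.Series({
    "Model A": 9.1e11, "Model B": 6.8e11, "Model C": 1.9e11,
    "Model D": 1.2e11, "Model E": 0.7e11, "Model F": 0.5e11,
})

N = 3  # keep the top N models named; the real chart keeps more
top = usage.nlargest(N)
others = usage.drop(top.index).sum()
collapsed = pd.concat([top, pd.Series({"Others": others})])
print(collapsed)
# If the tail is long enough, "Others" can out-rank every named model,
# which is what happens with the pink section of this chart.
```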
It's unclear how the models are ordered within each column. The developer did not group together different generations of models from the same developer. Anthropic's Claude alone has many entries: Sonnet 4 [green], Sonnet 3.5 [blue], Sonnet 3.5 (self-moderated) [yellow], Sonnet 3.7 (thinking) [pink], Sonnet 3.7 [violet], Sonnet 3.7 (self-moderated) [cyan], etc. The same goes for OpenAI, Google, and the rest.
This graphical decision may reflect how users of large language models evaluate performance. Perhaps at this time, there is no brand loyalty, or lock-in effect, and users see all these different models as direct substitutes. Therefore, our attention is focused on the larger number of individual models, rather than the smaller set of AI developers.
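Had the developer wanted the opposite reading, organized by developer rather than by model, the roll-up would be straightforward. A minimal sketch (the mapping rules, column names, and numbers are my assumptions, not the site's actual schema):

```python
import pandas as pd

# One row per model per week, in the shape of the extracted data (made up).
df = pd.DataFrame({
    "week":   ["2025-06-30"] * 4,
    "model":  ["Claude Sonnet 4", "Claude 3.7 Sonnet (thinking)",
               "GPT-4o mini", "Gemini 2.5 Flash"],
    "tokens": [9.1e11, 4.3e11, 6.8e11, 7.7e11],
})

def developer(model_name: str) -> str:
    """Map a model name to its developer; in practice, a lookup table."""
    for key, dev in [("Claude", "Anthropic"), ("GPT", "OpenAI"),
                     ("Gemini", "Google")]:
        if key in model_name:
            return dev
    return "Other developers"

rollup = (df.assign(developer=df["model"].map(developer))
            .groupby(["week", "developer"], as_index=False)["tokens"].sum())
print(rollup)  # a handful of series to color, instead of 40
```

With perhaps half a dozen developers, the color problem largely disappears, though at the cost of hiding the model-level churn.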
***
Before ending this post, I must point out that the publisher of these rankings, Openrouter.ai, offers a platform that allows users to switch between models; they are visualizing their own internal data. This means the dataset describes only what customers of Openrouter.ai do on this platform. There should be no expectation that this company's user base is representative of all users of LLMs.