Longest life, shortest length

Racetrack charts refuse to die. For old time's sake, here is a blog post from 2005 in which I explain why they don't make good dataviz.

Our latest example comes from Visual Capitalist (link), which publishes a fair share of nice dataviz. In this infographics, they feature a racetrack chart, just because the topic is the lifespan of cars.

Visualcapitalist_lifespan_cars_top

The whole infographic has four parts, each a racetrack chart. I'll focus on the first racetrack chart (shown above), which deals with the product category of sedans and hatchbacks.

The first thing I noticed is the reference value of 100,000 miles, which is described as the expected lifespan of a typical car made in the 1970s. This is of dubious value since the top of the page informs us the current relevant reference value is 200,000 miles, which is unlabeled. We surmise that 200,000 miles is indicated by the end of the grey sections of the racetrack. (This is eventually confirmed in the next racettrack chart for SUVs in the second sectiotn of the infographic.)

Now let's zoom in on the brown section of the track. Each of the four sections illustrates the same datum = 100,000 miles and yet they exhibit different lengths. From this, we learn that the data are not encoded in the lengths of these tracks -- but rather the data are to be found in the angle sustained at the centre of the concentric circles. The problem with racetrack charts is that readers are drawn to the lengths of the tracks rather than the angles at the center, which are not explicitly represented.

The Avalon model has the longest life span on this chart, and yet it is shown as the shortest curve.

***

The most baffling part of this chart is not the visual but the analysis methodology.

I quote:

iSeeCars analyzed over 2M used cars on the road between Jan. and Oct. 2022. Rankings are based on the mileage that the top 1% of cars within each model obtained.

According to this blurb, the 245,710 miles number for Avalon is the average mileage found in the top 1% of Avalons within the iSeeCars sample of 2M used cars.

The word "lifespan" strikes me as incorporating a date of death, and yet nothing in the above text indicates that any of the sampled cars are at end of life. The cars they really need are not found in their sample at all.

I suppose taking the top 1% is meant to exclude younger cars but why 1%? Also, this sample completely misses the cars that prematurely died, e.g. the cars that failed after 100,000 miles but before 200,000 miles. This filtering also ensures that newer models are excluded from the sample.

_trifectacheckup_imageIn the Trifecta Checkup, this qualifies as Type DV. The dataset does not answer the question of concern while the visual form distorts the data.


The blue mist

The New York Times printed several charts about Twitter "blue checks," and they aren't one of their best efforts (link).

Blue checks used to be credentials given to legitimate accounts, typically associated with media outlets, celebrities, brands, professors, etc. They are free but must be approved by Twitter. Since Elon Musk acquired Twitter, he turned blue checks into a revenue generator. Yet another subscription service (but you're buying "freedom"!). Anyone can get a blue check for US$8 per month.

[The charts shown here are scanned from the printed edition.]

Nyt_twitterblue_chart1

The first chart is a scatter plot showing the day of joining Twitter and the total number of followers the account has as of early November, 2022. Those are very strange things to pair up on a scatter plot but I get it: the designer could only work with the data that can be pulled down from Twitter's API.

What's wrong with the data? It would seem the interesting question is whether blue checks are associated with number of followers. The chart shows only Twitter Blue users so there is nothing to compare to. The day of joining Twitter is not the day of becoming "Twitter Blue", almost surely not for any user (Nevetheless, the former is not a standard data element released by Twitter). The chart has a built-in time bias since the longer an account exists, one would assume the higher the number of followers (assuming all else equal). Some kind of follower rate (e.g. number of followers per year of existence) might be more informative.

Still, it's hard to know what the chart is saying. That most Blue accounts have fewer than 5,000 followers? I also suspect that they chopped off the top of the chart (outliers) and forgot to mention it. Surely, some of the celebrity accounts have way over 150,000 followers. Another sign that the top of the chart was removed is that an expected funnel effect is not seen. Given the follower count is cumulative from the day of registration, we'd expect the accounts that started in the last few months should have markedly lower counts than those created years ago. (This is even more true if there is a survivorship bias - less successful accounts are more likely to be deleted over time.)

The designer arbitrarily labelled six specific accounts ("Crypto influencer", "HBO fan", etc.) but this feature risks sending readers the wrong message. There might be one HBO fan account that quickly grew to 150,000 followers in just a few months but does the data label suggest to readers that HBO fan accounts as a group tend to quickly attain high number of followers?

***

The second chart, which is an inset of the first, attempts to quantify the effect of the Musk acquisition on the number of "registrations and subscriptions". In the first chart, the story was described as "Elon Musk buys Twitter sparking waves of new users who later sign up for Twitter Blue".

Nyt_twitterblue_chart2

The second chart confuses me. I was trying to figure out what is counted in the vertical axis. This was before I noticed the inset in the first chart, easy to miss as it is tucked into the lower right corner. I had presumed that the axis would be the same as in the first chart since there weren't any specific labels. In that case, I am looking at accounts with 0 to 500 followers, pretty inconsequential accounts. Then, the chart title uses the words "registrations and subscriptions." If the blue dots on this chart also refer to blue-check accounts as in the first chart, then I fail to see how this chart conveys any information about registrations (wbich presumably would include free accounts). As before, new accounts that aren't blue checks won't appear.

Further, to the extent that this chart shows a surge in subscriptions, we are restricted to accounts with fewer than 500 followers, and it's really unclear what proportion of total subscribers is depicted. Nor is it possible to estimate the magnitude of this surge.

Besides, I'm seeing similar densities of the dots across the entire time window between October 2021 and 2022. Perhaps the entire surge is hidden behind the black lines indicating the specific days when Musk announced and completed the acquisition, respectively. If the surge is hiding behind the black vertical lines, then this design manages to block the precise spots readers are supposed to notice.

Here is where we can use the self-sufficiency test. Imagine the same chart without the text. What story would you have learned from the graphical elements themselves? Not much, in my view.

***

The third chart isn't more insightful. This chart purportedly shows suspended accounts, only among blue-check accounts.

Nyt_twitterblue_chart3

From what I could gather (and what I know about Twitter's API), the chart shows any Twitter Blue account that got suspended at any time. For example, all the black open circles occurring prior to October 27, 2022 represent suspensions by the previous management, and presumably have nothing to do with Elon Musk, or his decision to turn blue checks into a subscription product.

There appears to be a cluster of suspensions since Musk took over. I am not sure what that means. Certainly, it says he's not about "total freedom". Most of these suspended accounts have fewer than 50 followers, and only been around for a few weeks. And as before, I'm not sure why the analyst decided to focus on accounts with fewer than 500 followers.

What could have been? Given the number of suspended accounts are relatively small, an interesting analysis would be to form clusters of suspended accounts, and report on the change in what types of accounts got suspended before and after the change of management.

***

The online article (link) is longer, filling in some details missing from the printed edition.

There is one view that shows the larger accounts:

Nyt_twitterblue_largestaccounts

While more complete, this view isn't very helpful as the biggest accounts are located in the sparsest area of the chart. The data labels again pick out strange accounts like those of adult film stars and an Arabic news site. It's not clear if the designer is trying to tell us that most of Twitter Blue accounts belong to those categories.

***
See here for commentary on other New York Times graphics.

 

 

 

 


What does Elon Musk do every day?

The Wall Street Journal published a fun little piece about tweets by Elon Musk (link).

Here is an overview of every tweet he sent since he started using Twitter more than a decade ago.

Wsj_musk_tweets_alldaylong2
Apparently, he sent at least one tweet almost every day for the last four years. In addition, his tweets appear at all hours of the day. (Presumably, he is not the only one tweeting from his account.)

He doesn't just spend time writing tweets; he also reads other people's tweets. WSJ finds that up to 80% of his tweets include mentions of other users.

Wsj_musk_tweets_mentionsothers7

***

One problem with "big data" analytics is that they often don't answer interesting questions. Twitter is already one of the companies that put more of their data out there, but still, analysts are missing some of the most important variables.

We know that Musk has 93 million followers. We already know from recent news that a large proportion of such users may be spam/fake. It is frequently assumed in twitter analysis that any tweet he makes reaches 93 million accounts. That's actually far from correct. Twitter uses algorithms to decide what posts show up in each user's feed so we have no idea how many of the 93 million accounts are in fact exposed to any of Musk's tweets.

Further, not every user reads everything on their Twitter feed. I don't even check it every day. Because Twitter operates as a 'firehose" with ever-changing content as users send out short messages at all hours, what one sees depends on when one reads. If Musk tweets in the morning, the users who log on in the afternoon won't see it.

Let's say an analyst wants to learn how impactful Musk's tweets are. That's pretty difficult when one can't figure out which of the 93 million followers were shown these tweets, and who read them. The typical data used to measure response are retweets and likes. Those are convenient metrics because they are available. They are very limited in what they measure. There are lots of users who don't like or retweet at all.

***

The available data do make for some fun charts. This one gave me a big smile:

Wsj_musk_tweets_emojis9

Between writing tweets, reading tweets, and ROTFL, every hour of almost every day, Musk finds time to run his several companies. That's impressive.

 


Dots, lines, and 2D histograms

Daniel Z. tweeted about my post from last week. In particular, he took a deeper look at the chart of energy demand that put all hourly data onto the same plot, originally published at the StackOverflow blog:

Stackoverflow_variabilitychart

I noted that this is not a great chart particularly since what catches our eyes are not the key features of the underlying data. Daniel made a clearly better chart:

Danielzvinca_densitychart

This is a dot plot, rather than a line chart. The dots are painted in light gray, pushed to the background, because readers should be looking at the orange line. (I'm not sure what is going on with the horizontal scale as I could not get the peaks to line up on the two charts.)

What is this orange line? It's supposed to prove the point that the apparent dark band seen in the line chart does not represent the most frequently occurring values, as one might presume.

Looking closer, we see that the gray dots do not show all the hourly data but binned values.

Danielzvinca_densitychart_inset
We see vertical columns of dots, each representing a bin of values. The size of the dots represents the frequency of values of each bin. The orange line connects the bins with the highest number of values.

Daniel commented that

"The visual aggregation doesn't in fact map to the most frequently occurring values. That is because the ink of almost vertical lines fills in all the space between start and end."

Xan Gregg investigated further, and made a gif to show this effect better. Here is a screenshot of it (see this tweet):

Xangregg_dots_vs_line

The top chart is a true dot plot so that the darker areas are denser as the dots overlap. The bottom chart is the line chart that has the see-saw pattern. As Xan noted, the values shown are strangely very well behaved (aggregated? modeled?) - with each day, it appears that the values sweep up and down consistently.  This means the values are somewhat evenly spaced on the underlying trendline, so I think this dataset is not the best one to illustrate Daniel's excellent point.

It's usually not a good idea to connect lots of dots with a single line.

 

[P.S. 3/21/2022: Daniel clarified what the orange line shows: "In the posted chart, the orange line encodes the daily demand average (the mean of the daily distribution), rounded, for displaying purposes, to the closed bin. Bin size = 1000. Orange could have encode the daily median as well."]

 


Multicultural, multicolor, manufactured outrage

Twitter users were incensed by this chart:

Twitter_worstpiechart

It's being slammed as one of the most outrageous charts ever.

Mollywhite_twitter_outrageous

***

An image search reveals this chart form has international appeal.

In Kazakh:

Eurasianbank_piechart_kazakh

In Turkish:

Medirevogrupperformans_piechart_turkey

In Arabic, but the image source is a Spanish company:

Socialpubli_piechart_spain

In English, from an Indian source:

Panipatinstitute_piechart_india

In Russian:

Russian_piechart

***

Some people are calling this a pie chart.

But it isn't a pie chart since the slices clearly add up to more than one full circle.

It may be a graph template from an infographics website. You see people are applying data labels without changing the sizes or orientation or even colors of the slices. So the chart form is used as a container for data, rather than an encoder.

***

The Twitter user who called this "outrageous" appears to want to protect the designer, as the words have been deliberately snipped from the chart.

Mollywhite_twitter_outrageous_tweet

Nevertheless, Molly White coughed up the source in a subsequent tweet.

Mollywhite_twitter_outrageous_source

A bit strange, if you stop and think a little. Why would Molly shame the designer 20 hours later after she decided not to?

 

 

According to Molly, the chart appeared on the website of an NFT company. [P.S. See note below]

Here's the top of the page that Molly White linked to:

Mollywhite_twitter_outrageous_web3isgoinggreat

Notice the author of this page. That's "Molly White",  who is the owner of this NFT company! [See note below: she's the owner of a satire website who was calling out the owner of this company.]

Who's more outrageous?

Someone creating the most outrageous chart in order to get clout from outraged Twitter users and drive traffic to her new NFT venture? Or someone creating the template for the outrageous chart form, spawning an international collection?

 

[P.S. 3/17/2022 The answer is provided by other Twitter users, and the commentors. The people spreading this chart form is more ourageous. I now realized that Molly runs a sarcastic site. When she linked to the "source", she linked to her own website, which I interpreted as the source of the image. The page did contain that image, which added to the confusion. I must also add her work looks valuable, as it assesses some of the wild claims in Web3 land.

Mollywhite_site
]

[P.S. 3/17/2022 Molly also pointed out that her second tweet about the source came around 45 minutes after the first tweet. Twitter showed "20 hours" because it was 20 hours from the time I read the tweet.]


The what of visualization, beyond the how

A long-time reader sent me the following chart from a Nature article, pointing out that it is rather worthless.

Nautre_scihub

The simple bar chart plots the number of downloads, organized by country, from the website called Sci-Hub, which I've just learned is where one can download scientific articles for free - working around the exorbitant paywalls of scientific journals.

The bar chart is a good example of a Type D chart (Trifecta Checkup). There is nothing wrong with the purpose or visual design of the chart. Nevertheless, the chart paints a misleading picture. The Nature article addresses several shortcomings of the data.

The first - and perhaps most significant - problem is that many Sci-Hub users are expected to access the site via VPN servers that hide their true countries of origin. If the proportion of VPN users is high, the entire dataset is called into doubt. The data would contain both false positives (in countries with VPN servers) and false negatives (in countries with high numbers of VPN users). 

The second problem is seasonality. The dataset covered only one month. Many users are expected to be academics, and in the southern hemisphere, schools are on summer vacation in January and February. Thus, the data from those regions may convey the wrong picture.

Another problem, according to the Nature article, is that Sci-Hub has many competitors. "The figures include only downloads from original Sci-Hub websites, not any replica or ‘mirror’ site, which can have high traffic in places where the original domain is banned."

This mirror-site problem may be worse than it appears. Yes, downloads from Sci-Hub underestimate the entire market for "free" scientific articles. But these mirror sites also inflate Sci-Hub statistics. Presumably, these mirror sites obtain their inventory from Sci-Hub by setting up accounts, thus contributing lots of downloads.

***

Even if VPN and seasonality problems are resolved, the total number of downloads should be adjusted for population. The most appropriate adjustment factor is the population of scientists, but that statistic may be difficult to obtain. A useful proxy might be the number of STEM degrees by country - obtained from a UNESCO survey (link).

A metric of the type "number of Sci-Hub downloads per STEM degree" sounds odd and useless. I'd argue it's better than the unadjusted total number of Sci-Hub downloads. Just don't focus on the absolute values but the relative comparisons between countries. Even better, we can convert the absolute values into an index to focus attention on comparisons.

 


Improving simple bar charts

Here's another bar chart I came across recently. The chart - apparently published by Kaggle - appeared to present challenges data scientists face in industry:

Kaggle

This chart is pretty standard, and inoffensive. But we can still make it better.

Version 1

Redo_kaggle_nodecimals

I removed the decimals from the data labels.

Version 2

Redo_kaggle_noaxislabels

Since every bar is labelled, is anyone looking at the axis labels?

Version 3

Redo_kaggle_nodatalabels

You love axis labels. Then, let's drop the data labels.

Version 4

Redo_kaggle_categories

Ahh, so data scientists struggle with data problems, and people issues. They don't need better tools.


There's more to the composite rating chart

In my previous post, I sketched a set of charts to illustrate composite ratings of maps platforms (e.g. Google Maps, TomTom). Here is the sketch again:

Redo_mapsplatformsratings.002

For those readers who are interested in understanding these ratings beyond the obvious, this set of charts has more to offer.

Take a look first at the two charts on the left hand side.

Redo_junkcharts_autoevolution_ratings_left

Compare the patterns of dots between the two charts. You should note that the Maps Data ratings (blue dots) are less variable than the Platform ratings (green dots).

For Maps Data, the range is from 30 to 85 (out of 110) but the majority of the dots line up around 50.

For Platform, the range is 20 to 70 (out of 90) and the dots are quite spread out within this range.

This means competitiveness based on Platform is more differentiating among these brands than is Maps Data.

In the previous post, I already noted that the other key insight is that the Maps Data values hang quite closely to the overall average ratings while the Platform values are much less correlated.

***

Another informative observation can be found in the bottom row of charts.

The yellow dots (Developer Ecosystem) are mostly to the right of the overall ratings, meaning most of these brands were given scores on Developer Ecosystem that are higher than their average scores.

That is not the case with the green dots (Platform). For this sub-rating, most of the brands score lower than they do in the overall rating.

Redo_junkcharts_autoevolution_ratings_bottom

***

None of these insights are readily learned from the stacked column chart. A key skill in data visualization is whether one can pile on insights without overloading the chart.

 

 


Visualizing composite ratings

A twitter reader submitted the following chart from Autoevolution (link):

Google-maps-is-no-longer-the-top-app-for-navigation-and-offline-maps-179196_1

This is not a successful chart for the simple reason that readers want to look away from it. It's too busy. There is so much going on that one doesn't know where to look.

The underlying dataset is quite common in the marketing world. Through surveys, people are asked to rate some product along a number of dimensions (here, seven). Each dimension has a weight, and combined, the weighted sum becomes a composite ranking (shown here in gray).

Nothing in the chart stands out as particularly offensive even though the overall effect is repelling. Adding the overall rating on top of each column is not the best idea as it distorts the perception of the column heights. But with all these ingredients, the food comes out bland.

***

The key is editing. Find the stories you want to tell, and then deconstruct the chart to showcase them.

I start with a simple way to show the composite ranking, without any fuss:

Redo_junkcharts_autoevolution_top

[Since these are mockups, I have copied all of the data, just the top 11 items.]

Then, I want to know if individual products have particular strengths or weaknesses along specific dimensions. In a ranking like this, one should expect that some component ratings correlate highly with the overall rating while other components deviate from the overall average.

An example of correlated ratings is the Customers dimension.

Redo_junkcharts_autoevolution_customer

The general pattern of the red dots clings closely to that of the gray bars. The gray bars are the overall composite ratings (re-scaled to the rating range for the Customers dimension). This dimension does not tell us more than what we know from the composite rating.

By contrast, the Developers Ecosystem dimension provides additional information.

Redo_junkcharts_autoevolution_developer

Esri, AzureMaps and Mapbox performed much better on this dimension than on the average dimension. 

***

The following construction puts everything together in one package:

Redo_mapsplatformsratings.002


Speaking to the choir

A friend found the following chart about the "carbon cycle", and sent me an exasperated note, having given up on figuring it out. The chart came from a report, and was reprinted in Ars Technica (link).

Gcp_s09_2021_global_perturbation-800x371

The problem with the chart is that the designer is speaking to the choir. One must know a lot about the carbon cycle already to make sense of everything that's going on.

We see big and small arrows pointing up or down. Each arrow has a number attached to it, plus a range inside brackets. These numbers have no units, and it's not obvious what they are measuring.

The arrows come in a variety of colors. The colors are explained by labels but the labels dexcribe apparently unrelated concepts (e.g. fossil CO2 and land-use change).

Interspersed with the arrows is a singular dot. The dot also has a number attached to it. The number wears a plus sign, which signals it's being treated differently than the quantities with up arrows.

The singular dot is an outcast, ostracized from the community of dots in the bottom part of the chart. These dots have labels but no numbers. They come in different sizes but no scale is provided.

The background is divided into three parts, showing the atmosphere, the land mass, and the ocean. The placement of the arrows and dots suggests each measured quantity concerns one of these three parts. Well... except the dot labeled "surface sediments" that sit on the boundary of the land mass and the ocean.

The three-way classification is only one layer of the chart. A different classification is embedded in the color scheme. The gray, light green, and aquamarine arrows in the sky find their counterparts in the dots of the land mass, and the ocean.

What's more, the boundaries between land and sky, and between land and ocean are also painted with those colors. These boundary segments have been given different colors so that the lengths of these segments seem to contain data but we aren't sure what.

At this point, I noticed thin arrows which appear to depict back and forth flows. There may be two types of such exchanges, one indicated by a cycle, the other by two straight arrows in opposite directions. The cycles have no numbers while each pair of straight thin arrows gets two numbers, always identical.

At the bottom of the chart is a annotation in red: "Budget imbalance = -1.0". Presumably some formula ties the numbers shown above to this -1.0 result. We still don't know the units, and it's unclear if -1.0 is a bad number. A negative number shown in red typically indicates a bad number but how bad is it?

Finally, on the top right corner, I found a legend. It's not obvious at first because the legend symbols (arrows and dots) are shown in gray, a color not used elsewhere on the chart. It appears as if it represents another color category. The legend labels do little for me. What is an "anthropogenic flux"? What does the unit of "GtCO2" stand for? Other jargon includes "carbon cycling" and "stocks". The entire diagram is titled "carbon cycle" while the "carbon cycling" thin arrows are only a small part of the diagram.

The bottom line is I have no idea what this chart is saying to me, other than that the earth is a complex system, and that the designer has tried valiantly to impregnate the diagram with lots of information. If I am well read in environmental science, my experience is likely different.