What does Elon Musk do every day?

The Wall Street Journal published a fun little piece about tweets by Elon Musk (link).

Here is an overview of every tweet he sent since he started using Twitter more than a decade ago.

Wsj_musk_tweets_alldaylong2
Apparently, he sent at least one tweet almost every day for the last four years. In addition, his tweets appear at all hours of the day. (Presumably, he is not the only one tweeting from his account.)

He doesn't just spend time writing tweets; he also reads other people's tweets. WSJ finds that up to 80% of his tweets include mentions of other users.

Wsj_musk_tweets_mentionsothers7

***

One problem with "big data" analytics is that they often don't answer interesting questions. Twitter is already one of the companies that put more of their data out there, but still, analysts are missing some of the most important variables.

We know that Musk has 93 million followers. We already know from recent news that a large proportion of such users may be spam/fake. It is frequently assumed in twitter analysis that any tweet he makes reaches 93 million accounts. That's actually far from correct. Twitter uses algorithms to decide what posts show up in each user's feed so we have no idea how many of the 93 million accounts are in fact exposed to any of Musk's tweets.

Further, not every user reads everything on their Twitter feed. I don't even check it every day. Because Twitter operates as a 'firehose" with ever-changing content as users send out short messages at all hours, what one sees depends on when one reads. If Musk tweets in the morning, the users who log on in the afternoon won't see it.

Let's say an analyst wants to learn how impactful Musk's tweets are. That's pretty difficult when one can't figure out which of the 93 million followers were shown these tweets, and who read them. The typical data used to measure response are retweets and likes. Those are convenient metrics because they are available. They are very limited in what they measure. There are lots of users who don't like or retweet at all.

***

The available data do make for some fun charts. This one gave me a big smile:

Wsj_musk_tweets_emojis9

Between writing tweets, reading tweets, and ROTFL, every hour of almost every day, Musk finds time to run his several companies. That's impressive.

 


Dots, lines, and 2D histograms

Daniel Z. tweeted about my post from last week. In particular, he took a deeper look at the chart of energy demand that put all hourly data onto the same plot, originally published at the StackOverflow blog:

Stackoverflow_variabilitychart

I noted that this is not a great chart particularly since what catches our eyes are not the key features of the underlying data. Daniel made a clearly better chart:

Danielzvinca_densitychart

This is a dot plot, rather than a line chart. The dots are painted in light gray, pushed to the background, because readers should be looking at the orange line. (I'm not sure what is going on with the horizontal scale as I could not get the peaks to line up on the two charts.)

What is this orange line? It's supposed to prove the point that the apparent dark band seen in the line chart does not represent the most frequently occurring values, as one might presume.

Looking closer, we see that the gray dots do not show all the hourly data but binned values.

Danielzvinca_densitychart_inset
We see vertical columns of dots, each representing a bin of values. The size of the dots represents the frequency of values of each bin. The orange line connects the bins with the highest number of values.

Daniel commented that

"The visual aggregation doesn't in fact map to the most frequently occurring values. That is because the ink of almost vertical lines fills in all the space between start and end."

Xan Gregg investigated further, and made a gif to show this effect better. Here is a screenshot of it (see this tweet):

Xangregg_dots_vs_line

The top chart is a true dot plot so that the darker areas are denser as the dots overlap. The bottom chart is the line chart that has the see-saw pattern. As Xan noted, the values shown are strangely very well behaved (aggregated? modeled?) - with each day, it appears that the values sweep up and down consistently.  This means the values are somewhat evenly spaced on the underlying trendline, so I think this dataset is not the best one to illustrate Daniel's excellent point.

It's usually not a good idea to connect lots of dots with a single line.

 

[P.S. 3/21/2022: Daniel clarified what the orange line shows: "In the posted chart, the orange line encodes the daily demand average (the mean of the daily distribution), rounded, for displaying purposes, to the closed bin. Bin size = 1000. Orange could have encode the daily median as well."]

 


Multicultural, multicolor, manufactured outrage

Twitter users were incensed by this chart:

Twitter_worstpiechart

It's being slammed as one of the most outrageous charts ever.

Mollywhite_twitter_outrageous

***

An image search reveals this chart form has international appeal.

In Kazakh:

Eurasianbank_piechart_kazakh

In Turkish:

Medirevogrupperformans_piechart_turkey

In Arabic, but the image source is a Spanish company:

Socialpubli_piechart_spain

In English, from an Indian source:

Panipatinstitute_piechart_india

In Russian:

Russian_piechart

***

Some people are calling this a pie chart.

But it isn't a pie chart since the slices clearly add up to more than one full circle.

It may be a graph template from an infographics website. You see people are applying data labels without changing the sizes or orientation or even colors of the slices. So the chart form is used as a container for data, rather than an encoder.

***

The Twitter user who called this "outrageous" appears to want to protect the designer, as the words have been deliberately snipped from the chart.

Mollywhite_twitter_outrageous_tweet

Nevertheless, Molly White coughed up the source in a subsequent tweet.

Mollywhite_twitter_outrageous_source

A bit strange, if you stop and think a little. Why would Molly shame the designer 20 hours later after she decided not to?

 

 

According to Molly, the chart appeared on the website of an NFT company. [P.S. See note below]

Here's the top of the page that Molly White linked to:

Mollywhite_twitter_outrageous_web3isgoinggreat

Notice the author of this page. That's "Molly White",  who is the owner of this NFT company! [See note below: she's the owner of a satire website who was calling out the owner of this company.]

Who's more outrageous?

Someone creating the most outrageous chart in order to get clout from outraged Twitter users and drive traffic to her new NFT venture? Or someone creating the template for the outrageous chart form, spawning an international collection?

 

[P.S. 3/17/2022 The answer is provided by other Twitter users, and the commentors. The people spreading this chart form is more ourageous. I now realized that Molly runs a sarcastic site. When she linked to the "source", she linked to her own website, which I interpreted as the source of the image. The page did contain that image, which added to the confusion. I must also add her work looks valuable, as it assesses some of the wild claims in Web3 land.

Mollywhite_site
]

[P.S. 3/17/2022 Molly also pointed out that her second tweet about the source came around 45 minutes after the first tweet. Twitter showed "20 hours" because it was 20 hours from the time I read the tweet.]


The what of visualization, beyond the how

A long-time reader sent me the following chart from a Nature article, pointing out that it is rather worthless.

Nautre_scihub

The simple bar chart plots the number of downloads, organized by country, from the website called Sci-Hub, which I've just learned is where one can download scientific articles for free - working around the exorbitant paywalls of scientific journals.

The bar chart is a good example of a Type D chart (Trifecta Checkup). There is nothing wrong with the purpose or visual design of the chart. Nevertheless, the chart paints a misleading picture. The Nature article addresses several shortcomings of the data.

The first - and perhaps most significant - problem is that many Sci-Hub users are expected to access the site via VPN servers that hide their true countries of origin. If the proportion of VPN users is high, the entire dataset is called into doubt. The data would contain both false positives (in countries with VPN servers) and false negatives (in countries with high numbers of VPN users). 

The second problem is seasonality. The dataset covered only one month. Many users are expected to be academics, and in the southern hemisphere, schools are on summer vacation in January and February. Thus, the data from those regions may convey the wrong picture.

Another problem, according to the Nature article, is that Sci-Hub has many competitors. "The figures include only downloads from original Sci-Hub websites, not any replica or ‘mirror’ site, which can have high traffic in places where the original domain is banned."

This mirror-site problem may be worse than it appears. Yes, downloads from Sci-Hub underestimate the entire market for "free" scientific articles. But these mirror sites also inflate Sci-Hub statistics. Presumably, these mirror sites obtain their inventory from Sci-Hub by setting up accounts, thus contributing lots of downloads.

***

Even if VPN and seasonality problems are resolved, the total number of downloads should be adjusted for population. The most appropriate adjustment factor is the population of scientists, but that statistic may be difficult to obtain. A useful proxy might be the number of STEM degrees by country - obtained from a UNESCO survey (link).

A metric of the type "number of Sci-Hub downloads per STEM degree" sounds odd and useless. I'd argue it's better than the unadjusted total number of Sci-Hub downloads. Just don't focus on the absolute values but the relative comparisons between countries. Even better, we can convert the absolute values into an index to focus attention on comparisons.

 


Improving simple bar charts

Here's another bar chart I came across recently. The chart - apparently published by Kaggle - appeared to present challenges data scientists face in industry:

Kaggle

This chart is pretty standard, and inoffensive. But we can still make it better.

Version 1

Redo_kaggle_nodecimals

I removed the decimals from the data labels.

Version 2

Redo_kaggle_noaxislabels

Since every bar is labelled, is anyone looking at the axis labels?

Version 3

Redo_kaggle_nodatalabels

You love axis labels. Then, let's drop the data labels.

Version 4

Redo_kaggle_categories

Ahh, so data scientists struggle with data problems, and people issues. They don't need better tools.


There's more to the composite rating chart

In my previous post, I sketched a set of charts to illustrate composite ratings of maps platforms (e.g. Google Maps, TomTom). Here is the sketch again:

Redo_mapsplatformsratings.002

For those readers who are interested in understanding these ratings beyond the obvious, this set of charts has more to offer.

Take a look first at the two charts on the left hand side.

Redo_junkcharts_autoevolution_ratings_left

Compare the patterns of dots between the two charts. You should note that the Maps Data ratings (blue dots) are less variable than the Platform ratings (green dots).

For Maps Data, the range is from 30 to 85 (out of 110) but the majority of the dots line up around 50.

For Platform, the range is 20 to 70 (out of 90) and the dots are quite spread out within this range.

This means competitiveness based on Platform is more differentiating among these brands than is Maps Data.

In the previous post, I already noted that the other key insight is that the Maps Data values hang quite closely to the overall average ratings while the Platform values are much less correlated.

***

Another informative observation can be found in the bottom row of charts.

The yellow dots (Developer Ecosystem) are mostly to the right of the overall ratings, meaning most of these brands were given scores on Developer Ecosystem that are higher than their average scores.

That is not the case with the green dots (Platform). For this sub-rating, most of the brands score lower than they do in the overall rating.

Redo_junkcharts_autoevolution_ratings_bottom

***

None of these insights are readily learned from the stacked column chart. A key skill in data visualization is whether one can pile on insights without overloading the chart.

 

 


Visualizing composite ratings

A twitter reader submitted the following chart from Autoevolution (link):

Google-maps-is-no-longer-the-top-app-for-navigation-and-offline-maps-179196_1

This is not a successful chart for the simple reason that readers want to look away from it. It's too busy. There is so much going on that one doesn't know where to look.

The underlying dataset is quite common in the marketing world. Through surveys, people are asked to rate some product along a number of dimensions (here, seven). Each dimension has a weight, and combined, the weighted sum becomes a composite ranking (shown here in gray).

Nothing in the chart stands out as particularly offensive even though the overall effect is repelling. Adding the overall rating on top of each column is not the best idea as it distorts the perception of the column heights. But with all these ingredients, the food comes out bland.

***

The key is editing. Find the stories you want to tell, and then deconstruct the chart to showcase them.

I start with a simple way to show the composite ranking, without any fuss:

Redo_junkcharts_autoevolution_top

[Since these are mockups, I have copied all of the data, just the top 11 items.]

Then, I want to know if individual products have particular strengths or weaknesses along specific dimensions. In a ranking like this, one should expect that some component ratings correlate highly with the overall rating while other components deviate from the overall average.

An example of correlated ratings is the Customers dimension.

Redo_junkcharts_autoevolution_customer

The general pattern of the red dots clings closely to that of the gray bars. The gray bars are the overall composite ratings (re-scaled to the rating range for the Customers dimension). This dimension does not tell us more than what we know from the composite rating.

By contrast, the Developers Ecosystem dimension provides additional information.

Redo_junkcharts_autoevolution_developer

Esri, AzureMaps and Mapbox performed much better on this dimension than on the average dimension. 

***

The following construction puts everything together in one package:

Redo_mapsplatformsratings.002


Speaking to the choir

A friend found the following chart about the "carbon cycle", and sent me an exasperated note, having given up on figuring it out. The chart came from a report, and was reprinted in Ars Technica (link).

Gcp_s09_2021_global_perturbation-800x371

The problem with the chart is that the designer is speaking to the choir. One must know a lot about the carbon cycle already to make sense of everything that's going on.

We see big and small arrows pointing up or down. Each arrow has a number attached to it, plus a range inside brackets. These numbers have no units, and it's not obvious what they are measuring.

The arrows come in a variety of colors. The colors are explained by labels but the labels dexcribe apparently unrelated concepts (e.g. fossil CO2 and land-use change).

Interspersed with the arrows is a singular dot. The dot also has a number attached to it. The number wears a plus sign, which signals it's being treated differently than the quantities with up arrows.

The singular dot is an outcast, ostracized from the community of dots in the bottom part of the chart. These dots have labels but no numbers. They come in different sizes but no scale is provided.

The background is divided into three parts, showing the atmosphere, the land mass, and the ocean. The placement of the arrows and dots suggests each measured quantity concerns one of these three parts. Well... except the dot labeled "surface sediments" that sit on the boundary of the land mass and the ocean.

The three-way classification is only one layer of the chart. A different classification is embedded in the color scheme. The gray, light green, and aquamarine arrows in the sky find their counterparts in the dots of the land mass, and the ocean.

What's more, the boundaries between land and sky, and between land and ocean are also painted with those colors. These boundary segments have been given different colors so that the lengths of these segments seem to contain data but we aren't sure what.

At this point, I noticed thin arrows which appear to depict back and forth flows. There may be two types of such exchanges, one indicated by a cycle, the other by two straight arrows in opposite directions. The cycles have no numbers while each pair of straight thin arrows gets two numbers, always identical.

At the bottom of the chart is a annotation in red: "Budget imbalance = -1.0". Presumably some formula ties the numbers shown above to this -1.0 result. We still don't know the units, and it's unclear if -1.0 is a bad number. A negative number shown in red typically indicates a bad number but how bad is it?

Finally, on the top right corner, I found a legend. It's not obvious at first because the legend symbols (arrows and dots) are shown in gray, a color not used elsewhere on the chart. It appears as if it represents another color category. The legend labels do little for me. What is an "anthropogenic flux"? What does the unit of "GtCO2" stand for? Other jargon includes "carbon cycling" and "stocks". The entire diagram is titled "carbon cycle" while the "carbon cycling" thin arrows are only a small part of the diagram.

The bottom line is I have no idea what this chart is saying to me, other than that the earth is a complex system, and that the designer has tried valiantly to impregnate the diagram with lots of information. If I am well read in environmental science, my experience is likely different.

 

 

 

 

 


And you thought that pie chart was bad...

Vying for some of the worst charts of the year, Adobe came up with a few gems in its Digital Trends Survey. This was a tip from Nolan H. on Twitter.

There are many charts that should be featured; I'll focus on this one.

Digitaltrendssurvey2

This is one of those survey questions that allow each respondent to select multiple responses so that adding up the percentages exceeds 100%. The survey asks people which of these futuristic products do they think is most important. There were two separate groups of respondents, consumers (lighter red) and businesses (darker red).

If, like me, you are a left-to-right, top-to-bottom reader, you'd have consumed this graphic in the following way:

Junkcharts_adobedigitaltrends_left2right

The most important item is found in the lower bottom corner while the least important is placed first.

Here is a more sensible order of these objects:

Junkcharts_adobedigitaltrends_big2small

To follow this order, our eyes must do this:

Junkcharts_adobedigitaltrends_big2small_2

Now, let me say I like what they did with the top of the chart:

Junkcharts_adobedigitaltrends_subtitle

Put the legend above the chart because no one can understand it without first reading the legend.

***

Junkcharts_adobedigitaltrends_datadistortionData are embedded into part-circles (i.e. sectors)... but where do we find the data? The most obvious place to look for them is the areas of the sectors. But that's the wrong place. As I show in the explainer, the designer placed the data in the "height" - the distance from the peak point of the object to the horizontal baseline.

As a result of this choice, the areas of the sectors distort the data - they are proportional to the square of the data.

One simple way to figure out that your graphical objects have obscured the data is the self-sufficiency test. Remove all data labels from the chart, and ask if you still have something understandable.

Junkcharts_adobedigitaltrends_sufficiency

With these unusual shapes, it's not easy to judge how much larger is one object from the next. That's why the data labels were included - the readers are looking at the data values, rather than the graphical objects. That's sad, if you are the designer.

***

One last mystery. What decides the layering of the light vs dark red sectors?

Junkcharts_adobedigitaltrends_frontback

This design always places the smaller object in front of the larger object. Recall that the light red is for consumers and dark red for businesses. The comparison between these disjoint segments is not as interesting as the comparison of different ratings of technologies with each segment. So it's unfortunate that this aspect may get more attention than it deserves. It's also a consequence of the chart form. If the light red is always placed in front, then in some panels (such as the middle one shown above), the light red completely blocks the dark red.

 


Re-engineering #onelesspie

Marco tweeted the following pie chart to me (tip from Danilo), which is perfect since today is Pi Day, and I have to do my #onelesspie duty. This started a few years ago with Xan Gregg.

Onelesspie2021

This chart supposedly was published in an engineering journal. I don't have a clue what the question might be that this chart is purportedly answering. Maybe the reason for picking a cellphone?

The particular bits that make this chart hard to comprehend are these:

Junkcharts_onelesspie2021_problems

The chart also fails the ordering rule, as it spreads the largest pieces around.

It doesn't have to be so complicated.

Here is a primitive chart that doesn't even require a graphics software.

Junkcharts_redo_onelesspie2021_1color

Younger readers have not experienced the days (pre 2000) when color printing was at a premium, and most graphics were grayscale. Nevertheless, restrained use of color is recommended.

Junkcharts_redo_onelesspie2021_2colors

Happy Pi Day!