A German obstacle course

Tagesschau_original

A twitter user sent me this chart from Germany.

It came with a translation:

"Explanation: The chart says how many car drivers plan to purchase a new state-sponsored ticket for public transport. And of those who do, how many plan to use their car less often."

Because visual language should be universal, we shouldn't be deterred by not knowing German.

The structure of the data can be readily understood: we expect three values that add up to 100% from the pie chart. The largest category accounts for 58% of the data, followed by the blue category (40%). The last and smallest category therefore has 2% of the data.

The blue category is of the most interest, and the designer breaks that up into four sub-groups, three of which are roughly similarly popular.

The puzzle is the identities of these categories.

The sub-categories are directly labeled so these are easy for German speakers. From a handy online translator, these labels mean "definitely", "probably", "rather not", "definitely not". Well, that's not too helpful when we don't know what the survey question is.

According to our correspondent, the question should be "of those who plan to buy the new ticket, how many plan to use their car less often?"

I suppose the question is found above the column chart under the car icon. The translator dutifully outputs "Thus rarer (i.e. less) car use". There is no visual cue to let readers know we are supposed to read the right hand side as a single column. In fact, I was reading horizontally, row by row, from top to bottom.

Now, the two icons on the left and the middle of the top row should map to not buying and buying the ticket. The check mark and cross convey that message. But... what do these icons map to on the chart below? We get no clue.

In fact, the will-buy ticket group is the 40% blue category while the will-not group is the 58% light gray category.

What about the dark gray thin sector? Well, one needs to read the fine print. The footnote says "I don't know/ no response".

Since this group is small and uninformative, it's fine to push it into the footnote. However, the choice of a dark color, and placing it at the 12-o'clock angle of the pie chart run counter to de-emphasizing this category!

Another twitter user visually depicts the journey we take to understand this chart:

Tagesschau_reply

The structure of the data is revealed better with something like this:

Redo_tagesschau_newticket

The chart doesn't need this many colors but why not? It's summer.

 

 

 

 


What does Elon Musk do every day?

The Wall Street Journal published a fun little piece about tweets by Elon Musk (link).

Here is an overview of every tweet he sent since he started using Twitter more than a decade ago.

Wsj_musk_tweets_alldaylong2
Apparently, he sent at least one tweet almost every day for the last four years. In addition, his tweets appear at all hours of the day. (Presumably, he is not the only one tweeting from his account.)

He doesn't just spend time writing tweets; he also reads other people's tweets. WSJ finds that up to 80% of his tweets include mentions of other users.

Wsj_musk_tweets_mentionsothers7

***

One problem with "big data" analytics is that they often don't answer interesting questions. Twitter is already one of the companies that put more of their data out there, but still, analysts are missing some of the most important variables.

We know that Musk has 93 million followers. We already know from recent news that a large proportion of such users may be spam/fake. It is frequently assumed in twitter analysis that any tweet he makes reaches 93 million accounts. That's actually far from correct. Twitter uses algorithms to decide what posts show up in each user's feed so we have no idea how many of the 93 million accounts are in fact exposed to any of Musk's tweets.

Further, not every user reads everything on their Twitter feed. I don't even check it every day. Because Twitter operates as a "firehose" with ever-changing content as users send out short messages at all hours, what one sees depends on when one reads. If Musk tweets in the morning, the users who log on in the afternoon may not see it.

Let's say an analyst wants to learn how impactful Musk's tweets are. That's pretty difficult when one can't figure out which of the 93 million followers were shown these tweets, and who read them. The typical data used to measure response are retweets and likes. Those are convenient metrics because they are available. They are very limited in what they measure. There are lots of users who don't like or retweet at all.

***

The available data do make for some fun charts. This one gave me a big smile:

Wsj_musk_tweets_emojis9

Between writing tweets, reading tweets, and ROTFL, every hour of almost every day, Musk finds time to run his several companies. That's impressive.

 


The what of visualization, beyond the how

A long-time reader sent me the following chart from a Nature article, pointing out that it is rather worthless.

Nautre_scihub

The simple bar chart plots the number of downloads, organized by country, from the website called Sci-Hub, which I've just learned is where one can download scientific articles for free - working around the exorbitant paywalls of scientific journals.

The bar chart is a good example of a Type D chart (Trifecta Checkup). There is nothing wrong with the purpose or visual design of the chart. Nevertheless, the chart paints a misleading picture. The Nature article addresses several shortcomings of the data.

The first - and perhaps most significant - problem is that many Sci-Hub users are expected to access the site via VPN servers that hide their true countries of origin. If the proportion of VPN users is high, the entire dataset is called into doubt. The data would contain both false positives (in countries with VPN servers) and false negatives (in countries with high numbers of VPN users). 

The second problem is seasonality. The dataset covered only one month. Many users are expected to be academics, and in the southern hemisphere, schools are on summer vacation in January and February. Thus, the data from those regions may convey the wrong picture.

Another problem, according to the Nature article, is that Sci-Hub has many competitors. "The figures include only downloads from original Sci-Hub websites, not any replica or ‘mirror’ site, which can have high traffic in places where the original domain is banned."

This mirror-site problem may be worse than it appears. Yes, downloads from Sci-Hub underestimate the entire market for "free" scientific articles. But these mirror sites also inflate Sci-Hub statistics. Presumably, these mirror sites obtain their inventory from Sci-Hub by setting up accounts, thus contributing lots of downloads.

***

Even if VPN and seasonality problems are resolved, the total number of downloads should be adjusted for population. The most appropriate adjustment factor is the population of scientists, but that statistic may be difficult to obtain. A useful proxy might be the number of STEM degrees by country - obtained from a UNESCO survey (link).

A metric of the type "number of Sci-Hub downloads per STEM degree" sounds odd and useless. I'd argue it's better than the unadjusted total number of Sci-Hub downloads. Just don't focus on the absolute values but the relative comparisons between countries. Even better, we can convert the absolute values into an index to focus attention on comparisons.
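To make the adjustment concrete, here is a minimal sketch. The country names and all the numbers are hypothetical placeholders, not figures from Nature, Sci-Hub or UNESCO; the function name `downloads_index` is also my own invention. It divides downloads by STEM degrees, then rescales so that a chosen base country equals 100.

```python
# Hypothetical figures for illustration only; not real Sci-Hub or UNESCO data.
downloads = {"Country A": 9_000_000, "Country B": 2_500_000, "Country C": 600_000}
stem_degrees = {"Country A": 1_500_000, "Country B": 250_000, "Country C": 40_000}


def downloads_index(downloads, stem_degrees, base):
    """Downloads per STEM degree, rescaled so the base country equals 100."""
    rate = {c: downloads[c] / stem_degrees[c] for c in downloads}
    return {c: round(100 * rate[c] / rate[base], 1) for c in rate}


index = downloads_index(downloads, stem_degrees, base="Country A")

# Print countries from highest to lowest index value
for country, value in sorted(index.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {value}")
```

In this made-up example, the country with the most raw downloads ends up with the lowest index, which is the kind of reversal an unadjusted bar chart hides.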

 


The envelope of one's data

This post is the second post in response to a blog post at StackOverflow (link) in which the author discusses the "harm" of "aggregating away the signal" in your dataset. The first post appeared on my book blog earlier this week (link).

One stop in their exploratory data analysis journey was the following chart:

Stackoverflow_variabilitychart

This chart plots all the raw data, all 8,760 values of electricity consumption in California in 2020. Most analysts know this isn't a nice chart, and it's an abuse of ink. This chart is used as a contrast to the 4-week moving average, which was held up as an example of "over-aggregation".

Why is the above chart bad (aside from the waste of ink)? Think about how you consume the information. For me, I notice these features in the following order:

  1. I see the upper "envelope" of the data, i.e. the top values at each hour of each day throughout the year. This gives me the seasonal pattern with a peak in the summer months.
  2. I see the lower "envelope" of the data.
  3. I see the "height" of the data, which is, roughly speaking, the range of values within a day.
  4. If I squint hard enough, I see a darker band within the band, which roughly maps to the most frequently occurring values (this feature becomes more prominent if we select a lighter shade of gray)

The chart may not be as bad as it looks. The "moving average" is sort of visible. The variability of consumption is visible. The primary problem is it draws attention to the outliers, rather than the more common values.

The envelope of any dataset is composed of extreme values, by definition. For most analysis objectives, extreme values are "noise". In the chart above, it's hard to tell how common the maximum values are relative to other possible values but it's the upper envelope that captures my attention - simply because it's the easiest trend to make out.
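Here is a rough sketch of what the envelope is made of. The hourly series is synthetic (a seasonal swing plus a daily cycle plus noise), not the California data, and `daily_summary` is a hypothetical helper. For each day it returns the minimum (lower envelope), the median, and the maximum (upper envelope).

```python
import math
import random

random.seed(0)

# Synthetic hourly series (hypothetical, not the California data):
# seasonal swing + daily cycle + noise, 24 values per day for 365 days.
DAYS = 365
hourly = [
    30
    + 5 * math.sin(2 * math.pi * (d - 100) / 365)  # seasonal component
    + 4 * math.sin(2 * math.pi * (h - 6) / 24)     # daily cycle
    + random.gauss(0, 1)                           # noise
    for d in range(DAYS)
    for h in range(24)
]


def daily_summary(values, per_day=24):
    """For each day, return (lower envelope, median, upper envelope)."""
    out = []
    for start in range(0, len(values), per_day):
        day = sorted(values[start:start + per_day])
        # per_day is even, so the median is the mean of the two middle values
        mid = (day[per_day // 2 - 1] + day[per_day // 2]) / 2
        out.append((day[0], mid, day[-1]))
    return out


summary = daily_summary(hourly)
print(summary[0])    # first day of the year
print(summary[190])  # a day in the synthetic summer peak
```

Plotting the median alongside the two envelopes would pull the eye back toward the typical values, while still showing the range within each day.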

***

The same problem actually surfaces in the "improved" chart:

Stackoverflow_weekofyearchart

As explained in the preceding post, this chart rearranges the data. Instead of a single line, there are now 52 overlapping lines, one for each week of the year. So each line is much less dense and we can make out the hour of day/day of week pattern.

Notice that the author draws attention to the upper envelope of this chart. They notice the line(s) near the top are from the summer, and this further guides their next analysis.

The reason for focusing on the envelope is the same as in the other chart. Where the lines are dense, it's not easy to make out the pattern.

Even the envelope is not as clear as it seems! There is no reason why the highlighted week (August 16 to 23) should have the highest consumption value each hour of each day of the week. It's possible that the line dips into the middle of the range at various points along the line. In the following chart, I highlight two time points in which lines may or may not have crossed:

Junkcharts_stackoverflow_confusingenvelope

In an interactive chart, each line can be highlighted to resolve the confusion.

Note that the lower envelope is much harder to decipher, given the density of lines.

***
The author then pursues a hypothesis that there are lines (weeks) with one intra-day peak and there are those with two peaks.

I'd propose that those are not discrete states but continuous. The base pattern can be one with two peaks, a higher peak in the evening, and a lower peak in the morning. Now, if you imagine pushing up the evening peak while holding the lower peak at its height, you'd gradually "erase" the lower peak but it's just receded into the background.

Possibly the underlying driver is the total demand for energy. The higher the demand, the more likely it's concentrated in the evening, which causes the lower peak to recede. The lower the demand, the more likely we see both peaks.

In either case, the prior chart drives the direction of the next analysis.

 

 

 

 

 


Type D charts

A twitter follower sent the following chart:

China_military_spending

It's odd to place the focus on China when the U.S. line is much higher, and the growth in spending in the last few years in the U.S. is much higher than the growth rate in China.

_trifectacheckup_image

In the Trifecta Checkup, this chart is Type D (link): the data are at odds with the message of the chart. The intended message likely is China is building up its military in an alarming way. This dataset does not support such a conclusion.

The visual design of the chart can't be faulted though. It's clean, and restrained. It even places line labels at the end of each line. Also, the topic of the chart - the arms race - is unambiguous.

One fix is to change the message to bring it in line with the data. If the question being addressed is which country spends the most on the military, or which country has been raising spending at the fastest rate, then the above chart is appropriate.

If the question is about spending in China, then a different measure such as average annual spending increase may work.

Neither solution requires changing the visual form. That's why data visualization excellence is more than just selecting the right chart form.


Visual design is hard, brought to you by NYC subway

This poster showed up in a NY subway train recently.

Rootin-sm

Visual design is hard!

What is the message? The intention is, of course, to say Rootine is better than others. (That's the Q corner, if you're following the Trifecta Checkup.)

What is the visual telling us (V corner)? It says Rootine is yellow while Others are purple. What do these colors mean? There is no legend to help decipher it. And yellow-purple doesn't have a canonical interpretation (unlike say, red-green). In theory, purple can be better than yellow.

The other mystery is the black dot on the fifth item. (This is the NYC subway so the poster could have been vandalized.) It could mean "diet + lifestyle analyzed" is a unique feature of Rootine, not available on any other platform. That implies purple to mean available but not as effective, which significantly lessens the impact of the chart.

***

Finally, let's imagine the data that may exist to support this chart.

The aggregation of all competitors to "Others" imposes a major challenge. If yellow means yes, and purple means no, we'd expect few if any purple dots because across all competitors, there is a good chance that at least one of them has a particular feature.

Next, I'm dubious about the claim of "precision dosed, unique to you". I'm imagining they are selling some kind of medicine or health food, which can be "dosed". Predictive modelers like to market their models as "personalized," unique to each person but such a thing is impractical. Before you start using their products, they have no data on you, or your response to those products. How could the recommendation be "precision dosed, unique to you"?

Even if you've used the product for a while, it will be tough to achieve a good level of optimality with so little data. In fact, given that your past data are used to generate actions intended to improve your health - that is to say, to cause the future data to diverge from the past data, how do you know that any change you observe next period is caused by the actions you took? The pre-post difference is affected by both temporal shifts and the actions you've taken. If the next period's metric improves, you may want to believe that the actions worked. If the next period's metric declines, are you willing to conclude that the actions you took backfired?

"Formulas improve with you". This makes me more worried than relieved.

***

Problems like these can be solved by showing our work to others. Sometimes, we're so immersed in our own world that we don't see we have left off key information.

 

 


Check your presumptions while you're reading this chart about Israel's vaccination campaign

On July 30, Israel began administering third doses of mRNA vaccines to targeted groups of people. This decision was controversial since there is no science to support it. The policymakers do have educated guesses by experts based on best-available information. By science, I mean actual evidence. Since no one has previously been given three shots, there can be no data on which anyone can root such a decision. Nevertheless, the pandemic does not always give us time to collect relevant data, and so speculative analysis has found its calling.

Dvir Aran, at Technion, has been diligently tracking the situation in Israel on his Twitter. Ten days after July 30, he posted the following chart, which immediately led many commentators to bounce out of their seats crowning the third shot as a magic bullet. Notably, Dvir himself did not endorse such a claim. (See here to learn how other hasty conclusions by experts have fared.)

When you look at Dvir's chart, what do you see?

Dvir_aran_chart

Possibly one of the following two things, depending on what concern you have in your head.

1) The red line sits far above the other two lines, showing that unvaccinated people are much more likely to get infected.

2) The blue line diverges from the green line almost immediately after the 3rd shots started getting into arms, showing that the 3rd shot is super effective.

If you take another moment to look, you might start asking questions, as many in Twitter world did. Dvir was startlingly efficient at answering these queries.

A) Does the green line represent people with 2 or 3 doses, or is it strictly 2 doses? Aron asked this question and got the answer (the former):

AronBrand_israelcases_twoorthreedoses

It's time to check our presumptions. When you read that chart, did you presume it's exactly 2 doses or did you presume it's 2 or 3 doses? Or did you immediately spot the ambiguity? As I said in this article, graphs attain efficiency at communication because the designer leverages unspoken rules - the chart conveys certain information without explicitly placing it on the chart. But this can backfire. In this case, I presumed the three lines to display three non-overlapping groups of people, and thus the green line indicates those with 2 doses but not 3. That presumption led me to misinterpret what's on the chart.

B) What is the denominator of the case rates? Is it literal - by that I mean, all unvaccinated people for the red line, and all people with 3 doses for the blue line? Or is the denominator the population of Israel, the same number for all three lines? Lukas asked this question, and got the answer (the former).

Lukas_denominator

C) Since third shots are recommended for 60 year olds and over who were vaccinated at least 5 months ago, and most unvaccinated Israelis are below 60, this answer opens the possibility that the lines compare apples and oranges. Joe S. asked about this, and received an answer (all lines display only 60 year olds and over).

Joescholar_basepopulationquestion

Jason P. asked, and learned that the 5-month-out criterion is immaterial since 90% of the vaccinated have already reached that time point.

JasonPogue_5monthsout

D) We have even more presumptions. Like me, did you presume that the red line represents the "unvaccinated," meaning people who have not had any vaccine shots? If so, we may both be wrong about this. It has become the norm among vaccine researchers to lump "partially vaccinated" people with "unvaccinated", and call this combined group "unvaccinated". Here is an excerpt from a recent report from Public Health Ontario (link to PDF), which clearly states this unintuitive counting rule:

Ontario_case_definition

Notice that in this definition, someone who got infected within 14 days of the first shot is classified as an "unvaccinated" case and not a "partially vaccinated case".

In the following tweet, Dvir gave a hint of what he plotted:

Dvir_group_definition

In a previous analysis, he averaged the rates of people with 0 doses and 1 dose, which is equivalent to combining them and calling them unvaccinated. It's unclear to me what he did to the 1-dose subgroup in our featured chart - did it just vanish from the chart? (How people and cases are classified into these groups is a major factor in all vaccine effectiveness calculations - a topic I covered here. Unfortunately, most published reports do a poor job explaining what the analysts did).

E) Did you presume that all three lines are equally important? That's far from true. Since Israel is the world champion in vaccination, the bulk of the 60+ population form the green line. I asked Dvir and he responded that only 7.5%, or roughly 100K are unvaccinated.

DvirAran_proportionofunvaccinated

That means 1.2 million people are part of the green line, 12 times higher. There are roughly 50 cases per day among unvaccinated, and 370 daily cases among those with 2 or 3 doses. In other words, vaccinated people account for almost 90% of all cases.

Yes, this is inevitable when over 90% of the age group have been vaccinated (and it was predictable from the first day someone blasted everywhere that real-world VE is proved by the fact that almost all new cases were in the unvaccinated).

If your job is to minimize infections, you should be spending most of your time thinking about the 370 cases among vaccinated rather than the 50 cases among unvaccinated. If you halve the case rate, that would be a difference of 185 cases vs 25. In Israel, the vaccination campaign has already succeeded; it's time to look forward, which is exactly why they are re-focusing on the already vaccinated.
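For readers who want to check the arithmetic, here is a small sketch using only the rounded figures quoted above (7.5% or roughly 100K unvaccinated, about 50 and 370 daily cases). The variable names are mine, and the inputs are approximate, so treat the outputs as ballpark values.

```python
# Rounded inputs quoted in the text; all outputs are approximate.
unvax_share = 0.075        # share of the 60+ population that is unvaccinated
unvax_people = 100_000     # roughly 100K unvaccinated

total_60_plus = unvax_people / unvax_share    # about 1.33 million
vax_people = total_60_plus - unvax_people     # about 1.23 million (green line)

cases_unvax, cases_vax = 50, 370              # rough daily cases per group
vax_case_share = cases_vax / (cases_vax + cases_unvax)

# Halving each group's case rate removes half of that group's cases
saved_vax, saved_unvax = cases_vax / 2, cases_unvax / 2

print(f"vaccinated 60+: {vax_people:,.0f} ({vax_people / unvax_people:.1f}x the unvaccinated)")
print(f"share of cases among vaccinated: {vax_case_share:.0%}")
print(f"cases avoided by halving rates: {saved_vax:.0f} vs {saved_unvax:.0f}")
```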

***

If what you worry about most is the effectiveness of the original two-dose regimen, Dvir's chart raises a puzzle. Ignore the blue line, and remember that the green line already includes everybody represented by the blue line.

In the following chart, I removed the blue line, and added reference lines in dashed purple that correspond to 25%, 50% and 75% vaccine effectiveness. The data plotted on this chart are unadjusted case rates. A 75% effective vaccine cuts case rate by three quarters.
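The reference lines follow from the simplistic formula VE = 1 - (vaccinated rate / unvaccinated rate). As a sketch, the unvaccinated rate of 50 daily cases per 100K below is a round number derived from the rough figures quoted earlier (about 50 daily cases among roughly 100K unvaccinated), not a value read off the chart, and the function names are hypothetical.

```python
def implied_vaccinated_rate(unvax_rate, ve):
    """Case rate among the vaccinated implied by a given effectiveness."""
    return unvax_rate * (1 - ve)


def crude_ve(vax_rate, unvax_rate):
    """Unadjusted vaccine effectiveness from two case rates."""
    return 1 - vax_rate / unvax_rate


# Assumed round number: ~50 daily cases among ~100K unvaccinated people
unvax_rate = 50.0  # daily cases per 100K

# Heights of the 25%, 50% and 75% reference lines at that unvaccinated rate
reference = {ve: implied_vaccinated_rate(unvax_rate, ve) for ve in (0.25, 0.50, 0.75)}
for ve, rate in reference.items():
    print(f"VE {ve:.0%}: vaccinated rate of {rate} per 100K")
```

The further the green line sits above the 75% reference line, the lower the crude effectiveness this style of calculation produces.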

Junkcharts_dviraran_israel_threeshotschart

This chart shows the 2-dose mRNA vaccine was nowhere near 90% effective. (As regular readers know, I don't endorse this simplistic calculation and have outlined the problems here, but this style of calculation keeps getting published and passed around. Those who use it to claim real-world studies confirm prior clinical trial outcomes can either (a) insist on using it and retract their earlier conclusions, or (b) admit that such a calculation was, and is, a bad take.)

Also observe how the vaccinated (green) line is moving away from the unvaccinated (red) line. The vaccine apparently is becoming more effective, which runs counter to the trend used by the Israeli government to justify third doses. This improvement also precedes the start of the third-shot campaign. When the analytical method is bad, it generates all sorts of spurious findings.

***

As Dvir said, it is premature to comment on the third doses based on 10 days of data. For one thing, the vaccine developers insist that their vaccines must be given 14 days to work. In a typical calculation, all of the cases in the blue line fall outside the case-counting window. The effective number of cases that would be attributed to the 3-dose group right now is zero, and the vaccine effectiveness using the standard methodology is 100%, even better than shown in the chart.

There is an alternative interpretation of this graph. Statisticians call this the selection effect. On July 30, the blue line split out of the green: some people were selected to receive the 3rd dose - this includes an official selection (the government makes certain subgroups eligible) as well as a self-selection (within the eligible subgroup, certain people decide to get the 3rd shot earlier.) If those who are less exposed to the virus, or more risk averse, get the shots first, then all that is happening may be that we have split off a high VE subgroup from the green line. Even if the third shot were useless, the selection effect itself could explain the gap.

Statistics is about grays. It's not either-or. It's usually some of each. If you feel like Groundhog Day, you're getting the picture. When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then as more people got vaccinated, the effect washed away. The selection effect gradually disappears when vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough.


What metaphors give, they take away

Aleks pointed me to the following graphic making the rounds on Twitter:

Whyaxis_covid_men

It's being passed around as an example of great dataviz.

The entire attraction rests on a risque metaphor. The designer is illustrating a claim that Covid-19 causes erectile dysfunction in men.

That's a well-formed question so in using the Trifecta Checkup, that's a pass on the Q corner.

What about the visual metaphor? I advise people to think twice before using metaphors because these devices can give as they can take. This example is no exception. Some readers may pay attention to the orientation but other readers may focus on the size.

I pulled out the tape measure. Here's what I found.

Junkcharts_covid_eds

The angle is accurate on the first chart but the diameter has been exaggerated relative to the other. The angle is slightly magnified in the bottom chart which has a smaller circumference.

***

Let's look at the Data to round out our analysis. They come from a study from Italy (link), utilizing survey responses. There were 25 male respondents in the survey who self-reported having had Covid-19. Seven of these submitted answers to a set of five questions that were "suggestive of erectile dysfunction". (This isn't as arbitrary as it sounds - apparently it is an internationally accepted way of conducting research.) Seven out of 25 is 28 percent. Because the sample size is small, the 95% confidence range is 10% to 46%.

The researchers then used the propensity scoring method to find 3 matches for each infected person. Each match is a survey respondent who did not self-report having had Covid-19. See this post about a real-world vaccine study to learn more about propensity scoring. Among the 75 non-infected men, 7 were judged to have ED. The 95% range is 3% to 16%.
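Both quoted ranges are consistent with the simple normal-approximation (Wald) interval, p ± 1.96 × sqrt(p(1 - p)/n). That this particular method produced the ranges above is an assumption on my part; the sketch below merely shows that it reproduces them. The function name `wald_interval` is my own.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Proportion with a normal-approximation 95% interval, clipped to [0, 1]."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


for label, k, n in [("infected", 7, 25), ("matched, not infected", 7, 75)]:
    p, lo, hi = wald_interval(k, n)
    print(f"{label}: {p:.0%} (95% range {lo:.0%} to {hi:.0%})")
```

With samples this small, the normal approximation is itself rough, so these ranges should not be read too precisely.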

The difference between the two subgroups is quite large. The paper also includes other research that investigates the mechanisms that can explain the observed correlation. Nevertheless, the two proportions depicted in the chart have wide error bars around them.

I have always had a question about analysis using this type of survey data (including my own work). How do they know that ED follows infection rather than precedes it? One of the inviolable rules of causation is that the effect follows the cause. If it's a series of surveys, the sequencing may be measurable but a single survey presents challenges. 

The headline of the dataviz is "Get your vaccines". This comes from a "story time" moment in the paper. On page 1, under Discussion and conclusion, they inserted the sentence "Universal vaccination against COVID-19 and the personal protective equipment could possibly have the added benefit of preventing sexual dysfunctions." Nothing in the research actually supports this claim. The only time the word "vaccine" appears in the entire paper is on that first page.

"Story time" is the moment in a scientific paper when the researchers - after lulling readers to sleep over some interesting data - roll out statements that are not supported by the data presented before.

***

The graph succeeds in catching people's attention. The visual metaphor works in one sense but not in a different sense.

 

P.S. [8/6/2021] One final note for those who do care about the science: the internet survey not surprisingly has a youth bias. The median age of 25 infected people was 39, maxing out at 45 while the median of the 75 not infected was 42, maxing out at 49.


Did prices go up or down? Depends on how one looks at the data

The U.S. media have been flooded with reports of runaway inflation recently, and it's refreshing to see a nice article in the Wall Street Journal that takes a second look at the data. Because as my readers know, raw data can be incredibly deceptive.

Inflation typically describes the change in price level relative to the prior year. Comparing each month's price level to that of the same month a year earlier is a simple seasonal adjustment, used to remove the effect of seasonality that masks the true change in price levels. (See this explainer of seasonal adjustment.)

As the pandemic enters the second year, this methodology is comparing 2021 price levels to pandemic-impacted price levels of 2020. This produces a very confusing picture. As the WSJ article explains, prices can be lower than they were in 2019 (pre-pandemic) and yet substantially higher than they were in 2020 (during the pandemic). This happens in industry sectors that were heavily affected by the economic shutdown, e.g. hotels, travel, entertainment.

Wsj_pricechangehotels_20192021

Here is how they visualized this phenomenon. Amusingly, some algorithm estimated that it should take 5 minutes to read the entire article. It may take that much time to understand properly what this chart is showing.

Let me save you some time.

The chart shows monthly inflation rates of hotel price levels.

The pink horizontal stripes represent the official inflation numbers, which compare each month's hotel prices to those of a year prior. The most recent value for May of 2021 says hotel prices rose by 9% compared to May of 2020.

The blue horizontal stripes show an alternative calculation which compares each month's hotel prices to those of two years prior. Think of 2018-19 as "normal" years, pre-pandemic. Using this measure, we find that hotel prices for May of 2021 are about 4% lower than for May of 2019.

(This situation affects all of our economic statistics. We may see an expansion in employment levels from a year ago which still leaves us behind where we were before the pandemic.)
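A tiny sketch shows how both statements can be true at once. The index levels below are hypothetical, picked only so that the results land near the 9% and 4% figures quoted above; they are not the actual hotel price data.

```python
# Hypothetical May price index levels (not actual hotel price data)
price_index = {2019: 100.0, 2020: 88.0, 2021: 96.0}


def pct_change(index, year, lag):
    """Percent change in the index between `year` and `lag` years earlier."""
    return 100 * (index[year] / index[year - lag] - 1)


one_year = pct_change(price_index, 2021, lag=1)  # vs. the pandemic-hit 2020
two_year = pct_change(price_index, 2021, lag=2)  # vs. the pre-pandemic 2019

print(f"vs. one year ago:  {one_year:+.1f}%")
print(f"vs. two years ago: {two_year:+.1f}%")
```

The only thing that changes between the two lines is the base year, which is why the choice of reference period matters so much here.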

What confused me on the WSJ chart are the blocks of color. In a previous chart, the readers learn that solid colors mean inflation rose while diagonal lines mean inflation decreased. It turns out that these are month-over-month changes in inflation rates (notice that one end of the column for the previous month touches one end of the column of the next month).

The color patterns become the most dominant feature of this chart, and yet the month-over-month change in inflation rates isn't the crux of the story. The real star of the story should be the difference in inflation rates - for any given month - between two reference years.

***

In the following chart, I focus attention on the within-month, between-reference-years comparisons.

Junkcharts_redo_wsj_inflationbaserate

Because hotel prices dropped drastically during the pandemic, and have recovered quite well in recent months as the U.S. reopens the economy, the inflation rate of hotel prices is almost 10%. Nevertheless, the current price level is still 7% below the pre-pandemic level.

 



 


Start at zero improves this chart but only slightly

The following chart was forwarded to me recently:

Average_female_height

It's a good illustration of why the "start at zero" rule exists for column charts. The poor Indian lady looks extremely short in this women's club. Is the average Indian woman really half as tall as the average South African woman? (Surely not!)

Junkcharts_redo_womenheight_column

The problem is only superficially fixed by starting the vertical axis at zero. Doing so highlights the fact that the difference in average heights is but a fraction of the average heights themselves. The inter-country differences are squashed in such a representation - which works against the primary goal of the data visualization itself.

Recall the Trifecta Checkup. At the top of the trifecta is the Question. The designer obviously wants to focus our attention on the difference of the averages. A column chart showing average heights fails the job!

This "proper" column chart sends the message that the difference in average heights is noise, unworthy of our attention. But this is a bad take of the underlying data. The range of average heights across countries isn't that wide, by virtue of large population sizes.

According to Wikipedia, they range from 4 feet 10.5 to 5 feet 6 (I'm ignoring several entries in the table based on non-representative small samples). How do we know that the difference of 2 inches between averages of South Africa and India is actually a sizable difference? The Wikipedia table has the average heights for most of the world's countries. There are perhaps 200 values. These values are sprinkled inside the range of about 8 inches top to bottom. If we divide the full range into 10 equal bins, that's roughly 0.8 inches per bin. So if we have two numbers that are 2 inches apart, they sit about two and a half bins apart. If the data were evenly distributed, that's a huge shift.

(In reality, the data should be normally distributed, bell-shaped, with much more at the center than on the edges. That makes a difference of 2 inches even more significant if these are normal values near the center but less significant if these are extreme values on the tails. Stats students should be able to articulate why we are sure the data are normally distributed without having to plot the data.)
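The back-of-envelope binning can be written out as follows. This sketch uses the exact endpoints quoted above (4 feet 10.5 and 5 feet 6), so the range comes out at 7.5 inches rather than the rounded 8, and a 2-inch gap covers between two and three bins. The variable names are mine.

```python
# Endpoints of the range of national average heights, in inches
low_in = 4 * 12 + 10.5   # 4 ft 10.5 in
high_in = 5 * 12 + 6     # 5 ft 6 in

full_range = high_in - low_in   # spread between shortest and tallest averages
bin_width = full_range / 10     # ten equal bins
gap_in_bins = 2 / bin_width     # how many bins a 2-inch difference covers

print(f"range: {full_range} in, bin width: {bin_width} in")
print(f"a 2-inch gap covers about {gap_in_bins:.2f} bins out of 10")
```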

***

The original chart has further problems.

Another source of distortion comes from the scaling of the stick figures. The aspect ratio is being preserved, which means the area is being scaled. Given that the heights are scaled as per the data, the data are encoded twice, the second time in the widths. This means that the sizes of these figures grow at the rate of the square of the heights. (Contrast this with the scaling discussed in my earlier post this week which preserves the relative areas.)
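To see the double encoding, here is a minimal sketch. The drawn height ratio of 2 is my reading of the original chart, where one figure looks roughly twice as tall as another; the function names are hypothetical.

```python
import math


def area_ratio(height_ratio, preserve_aspect=True):
    """Ratio of figure areas when heights are scaled by `height_ratio`."""
    # Preserving the aspect ratio scales the width by the same factor as the
    # height, so the area grows with the square of the height ratio.
    return height_ratio ** 2 if preserve_aspect else height_ratio


def linear_scale_for_area(value_ratio):
    """Height (and width) scale factor that makes areas proportional to data."""
    return math.sqrt(value_ratio)


drawn_height_ratio = 2.0  # assumed from how the figures appear on the chart
print(area_ratio(drawn_height_ratio))                         # aspect ratio kept
print(area_ratio(drawn_height_ratio, preserve_aspect=False))  # width held fixed
print(linear_scale_for_area(4.0))                             # area-preserving scale
```

So a figure drawn twice as tall with its aspect ratio preserved occupies four times the area, on top of the distortion already introduced by the truncated axis.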

At the end of that last post, I discuss why adding colors to a chart when the colors do not encode any data is a distraction to the reader. And this average height chart is an example.

From the Data corner of the Trifecta Checkup, I'm intrigued by the choice of countries. Why is Scotland highlighted instead of the U.K.? Why Latvia? According to Wikipedia, the Latvia estimate is based on a 1% sample of only 19 year olds.

Some of the data appear to be incorrect (or the designer used a different data source). Wikipedia lists the average height of Latvian women as 5 ft 6.5 while the chart shows 5 ft 5 in. Peru's average height of females is listed as 4 ft 11.5 and of males as 5 ft 4.5. The chart shows 5 ft 4 in.

***

Lest we think only amateurs make this type of chart, here is an example of a similar chart in a scientific research journal:

Fnhum-14-00338-g007

(link to original)

I have seen many versions of the above column charts with error bars, and the vertical axes not starting at zero. In every case, the heights (and areas) of these columns do not scale with the underlying data.

***

I tried a variant of the stem-and-leaf plot:

Junkcharts_redo_womenheight_stemleaf

The scale is chosen to reflect the full range of average heights given in Wikipedia. The chart works better with more countries to fill out the distribution. It shows India is on the short end of the scale but not quite the lowest. (As mentioned above, Peru actually should be placed close to the lower edge.)