Check your presumptions while you're reading this chart about Israel's vaccination campaign

On July 30, Israel began administering third doses of mRNA vaccines to targeted groups of people. This decision was controversial since there is no science to support it. By science, I mean actual evidence; the policymakers do have educated guesses by experts based on the best available information. Since no one had previously been given three shots, there can be no data on which to root such a decision. Nevertheless, the pandemic does not always give us time to collect relevant data, and so speculative analysis has found its calling.

Dvir Aran, at Technion, has been diligently tracking the situation in Israel on his Twitter. Ten days after July 30, he posted the following chart, which immediately led many commentators to jump out of their seats, crowning the third shot as a magic bullet. Notably, Dvir himself did not endorse such a claim. (See here to learn how other hasty conclusions by experts have fared.)

When you look at Dvir's chart, what do you see?

Dvir_aran_chart

Possibly one of the following two things, depending on which concern is foremost in your mind.

1) The red line sits far above the other two lines, showing that unvaccinated people are much more likely to get infected.

2) The blue line diverges from the green line almost immediately after the 3rd shots started getting into arms, showing that the 3rd shot is super effective.

If you take another moment to look, you might start asking questions, as many in the Twitter world did. Dvir was startlingly efficient at answering these queries.

A) Does the green line represent people with 2 or 3 doses, or is it strictly 2 doses? Aron asked this question and got the answer (the former):

AronBrand_israelcases_twoorthreedoses

It's time to check our presumptions. When you read that chart, did you presume it's exactly 2 doses or did you presume it's 2 or 3 doses? Or did you immediately spot the ambiguity? As I said in this article, graphs attain efficiency at communication because the designer leverages unspoken rules - the chart conveys certain information without explicitly placing it on the chart. But this can backfire. In this case, I presumed the three lines to display three non-overlapping groups of people, and thus the green line indicates those with 2 doses but not 3. That presumption led me to misinterpret what's on the chart.

B) What is the denominator of the case rates? Is it literal - by that I mean, all unvaccinated people for the red line, and all people with 3 doses for the blue line? Or is the denominator the population of Israel, the same number for all three lines? Lukas asked this question, and got the answer (the former).

Lukas_denominator

C) Since third shots are recommended for 60-year-olds and over who were vaccinated at least 5 months ago, and most unvaccinated Israelis are below 60, this answer opens the possibility that the lines compare apples and oranges. Joe S. asked about this, and received an answer (all lines display only 60-year-olds and over).

Joescholar_basepopulationquestion

Jason P. asked, and learned that the 5-month-out criterion is immaterial since 90% of the vaccinated have already reached that time point.

JasonPogue_5monthsout

D) We have even more presumptions. Like me, did you presume that the red line represents the "unvaccinated," meaning people who have not had any vaccine shots? If so, we may both be wrong about this. It has become the norm for vaccine researchers to lump "partially vaccinated" people with the "unvaccinated", and call the combined group "unvaccinated". Here is an excerpt from a recent report from Public Health Ontario (link to PDF), which clearly states this unintuitive counting rule:

Ontario_case_definition

Notice that in this definition, someone who got infected within 14 days of the first shot is classified as an "unvaccinated" case, not a "partially vaccinated" case.
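
To see how this counting rule plays out, here is a minimal sketch in Python. The 14-day windows follow the Ontario-style definition described above; the function and field names are mine, and the dose-2 window is my assumption from the standard definition:

```python
from datetime import date

def vaccination_status(infection_date, dose1_date=None, dose2_date=None):
    """Classify a case under the counting rule described above.

    A case within 14 days of the first shot counts as "unvaccinated",
    not "partially vaccinated". (The 14-day window after dose 2 for
    "fully vaccinated" is my assumption.)
    """
    if dose1_date is None or (infection_date - dose1_date).days < 14:
        return "unvaccinated"
    if dose2_date is None or (infection_date - dose2_date).days < 14:
        return "partially vaccinated"
    return "fully vaccinated"

# A case 10 days after dose 1 is counted as unvaccinated:
print(vaccination_status(date(2021, 8, 10), dose1_date=date(2021, 7, 31)))
```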

In the following tweet, Dvir gave a hint of what he plotted:

Dvir_group_definition

In a previous analysis, he averaged the rates of people with 0 doses and 1 dose, which is equivalent to combining them and calling them unvaccinated. It's unclear to me what he did with the 1-dose subgroup in our featured chart - did it just vanish from the chart? (How people and cases are classified into these groups is a major factor in all vaccine effectiveness calculations - a topic I covered here. Unfortunately, most published reports do a poor job explaining what the analysts did.)

E) Did you presume that all three lines are equally important? That's far from true. Since Israel is the world champion in vaccination, the bulk of the 60+ population forms the green line. I asked Dvir, and he responded that only 7.5%, or roughly 100K, are unvaccinated.

DvirAran_proportionofunvaccinated

That means about 1.2 million people are part of the green line - 12 times as many. There are roughly 50 cases per day among the unvaccinated, and 370 daily cases among those with 2 or 3 doses. In other words, vaccinated people account for almost 90% of all cases.
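
The arithmetic behind these statements, using the rough counts quoted above:

```python
unvaxxed_pop = 100_000          # ~7.5% of the 60+ age group
vaxxed_pop = 1_200_000          # the other ~92.5%
unvaxxed_cases_per_day = 50
vaxxed_cases_per_day = 370

# Case rates per 100K (the scale of Dvir's chart)
print(unvaxxed_cases_per_day / unvaxxed_pop * 100_000)   # 50 per 100K
print(vaxxed_cases_per_day / vaxxed_pop * 100_000)       # ~31 per 100K

# Share of all cases occurring among the vaccinated
share = vaxxed_cases_per_day / (vaxxed_cases_per_day + unvaxxed_cases_per_day)
print(round(share, 2))   # 0.88, i.e. almost 90%
```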

Yes, this is inevitable when over 90% of the age group have been vaccinated. (And it was predictable from day one, back when some were blasting everywhere that real-world VE was proven by the fact that almost all new cases were among the unvaccinated.)

If your job is to minimize infections, you should be spending most of your time thinking about the 370 cases among the vaccinated rather than the 50 cases among the unvaccinated. If you halve each case rate, that would be a difference of 185 cases versus 25. In Israel, the vaccination campaign has already succeeded; it's time to look forward, which is exactly why they are re-focusing on the already vaccinated.

***

If what you worry about most is the effectiveness of the original two-dose regimen, Dvir's chart raises a puzzle. Ignore the blue line, and remember that the green line already includes everybody represented by the blue line.

In the following chart, I removed the blue line, and added reference lines in dashed purple that correspond to 25%, 50% and 75% vaccine effectiveness. The data plotted on this chart are unadjusted case rates. A 75% effective vaccine cuts case rate by three quarters.

Junkcharts_dviraran_israel_threeshotschart

This chart shows the 2-dose mRNA vaccine was nowhere near 90% effective. (As regular readers know, I don't endorse this simplistic calculation and have outlined the problems here, but this style of calculation keeps getting published and passed around. Those who used it to claim that real-world studies confirm prior clinical trial outcomes can either (a) keep using it, and retract their earlier conclusions, or (b) admit that such a calculation was, and is, a bad take.)
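
For those following along, the simplistic calculation behind those reference lines is just one minus the ratio of case rates. A quick sketch, with made-up rates for illustration:

```python
def naive_ve(rate_vaccinated, rate_unvaccinated):
    """The simplistic 'real-world VE': one minus the ratio of case rates.

    A 75% effective vaccine cuts the case rate by three quarters, so
    the vaccinated rate is one quarter of the unvaccinated rate.
    """
    return 1 - rate_vaccinated / rate_unvaccinated

# Hypothetical rates per 100K, for illustration only:
print(naive_ve(25, 100))   # 0.75 -> the 75% reference line
print(naive_ve(50, 100))   # 0.50 -> the 50% reference line
```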

Also observe how the vaccinated (green) line is moving away from the unvaccinated (red) line. The vaccine apparently is becoming more effective, which runs counter to the waning-effectiveness trend used by the Israeli government to justify third doses. This improvement also precedes the start of the third-shot campaign. When the analytical method is bad, it generates all sorts of spurious findings.

***

As Dvir said, it is premature to comment on the third doses based on 10 days of data. For one thing, the vaccine developers insist that their vaccines must be given 14 days to work. In a typical calculation, all of the cases in the blue line would fall outside the case-counting window. The effective number of cases attributed to the 3-dose group right now is zero, and the vaccine effectiveness using the standard methodology is 100%, even better than shown in the chart.

There is an alternative interpretation of this graph. Statisticians call this a selection effect. On July 30, the blue line split off from the green: some people were selected to receive the 3rd dose - this includes an official selection (the government makes certain subgroups eligible) as well as self-selection (within the eligible subgroup, certain people decide to get the 3rd shot earlier). If those who are less exposed to the virus, or more risk-averse, get the shots first, then all that is happening may be that we have split off a high-VE subgroup from the green line. Even if the third shot were useless, the selection effect by itself could explain the gap.
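
Here is a minimal simulation of the selection effect - every number in it is assumed for illustration, and only self-selection by exposure is modeled. The third shot does nothing in this toy world, yet splitting off the lower-exposure early takers opens a visible gap:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Assumed: half of the 2-dose group has low exposure (20 cases per 100K/day),
# the other half high exposure (40 per 100K/day).
risk = np.where(rng.random(n) < 0.5, 20e-5, 40e-5)
# Self-selection: low-exposure people are far more likely to take the 3rd dose early.
takes_third = rng.random(n) < np.where(risk < 30e-5, 0.6, 0.1)

# The third dose does nothing here; infection depends on exposure alone.
infected = rng.random(n) < risk

print(infected[takes_third].mean() * 1e5)    # "blue line": ~23 per 100K
print(infected[~takes_third].mean() * 1e5)   # rest of the green line: ~34 per 100K
```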

Statistics is about grays. It's not either-or; it's usually some of each. If you feel like it's Groundhog Day, you're getting the picture. When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then, as more people got vaccinated, the effect washed away. The selection effect gradually disappears as vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough.


What metaphors give, they take away

Aleks pointed me to the following graphic making the rounds on Twitter:

Whyaxis_covid_men

It's being passed around as an example of great dataviz.

The entire attraction rests on a risqué metaphor. The designer is illustrating a claim that Covid-19 causes erectile dysfunction in men.

That's a well-formed question, so applying the Trifecta Checkup, the chart passes on the Q corner.

What about the visual metaphor? I advise people to think twice before using metaphors because these devices can take away as much as they give. This example is no exception. Some readers may pay attention to the orientation, but other readers may focus on the size.

I pulled out the tape measure. Here's what I found.

Junkcharts_covid_eds

The angle is accurate on the first chart, but its diameter has been exaggerated relative to the other chart's. On the bottom chart, which has a smaller circumference, the angle is slightly magnified.

***

Let's look at the Data to round out our analysis. They come from a study from Italy (link), utilizing survey responses. There were 25 male respondents in the survey who self-reported having had Covid-19. Seven of these submitted answers to a set of five questions that were "suggestive of erectile dysfunction". (This isn't as arbitrary as it sounds - apparently it is an internationally accepted way of conducting research.) Seven out of 25 is 28 percent. Because the sample size is small, the 95% confidence range is 10% to 46%.

The researchers then used the propensity scoring method to find three matches for each infected person. Each match is a survey respondent who did not self-report having had Covid-19. See this post about a real-world vaccine study to learn more about propensity scoring. Among the 75 non-infected men, 7 were judged to have ED, or 9 percent. The 95% range is 3% to 16%.
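
Both of those 95% ranges can be reproduced with the usual normal approximation for a proportion (a sketch; the researchers may have used a different interval formula):

```python
import math

def wald_ci(successes, n, z=1.96):
    """95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

print(wald_ci(7, 25))   # about (0.10, 0.46) for the infected group
print(wald_ci(7, 75))   # about (0.03, 0.16) for the matched controls
```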

The difference between the two subgroups is quite large. The paper also includes other research that investigates the mechanisms that can explain the observed correlation. Nevertheless, the two proportions depicted in the chart have wide error bars around them.

I have always had a question about analyses using this type of survey data (including my own work). How do they know that the ED followed the infection rather than preceded it? One of the inviolable rules of causation is that the effect follows the cause. With a series of surveys, the sequencing may be measurable, but a single survey presents challenges.

The headline of the dataviz is "Get your vaccines". This comes from a "story time" moment in the paper. On page 1, under Discussion and conclusion, they inserted the sentence "Universal vaccination against COVID-19 and the personal protective equipment could possibly have the added benefit of preventing sexual dysfunctions." Nothing in the research actually supports this claim. The only time the word "vaccine" appears in the entire paper is on that first page.

"Story time" is the moment in a scientific paper when the researchers - after lulling readers to sleep over some interesting data - roll out statements that are not supported by the data presented before.

***

The graph succeeds in catching people's attention. The visual metaphor works in one sense but not in another.

P.S. [8/6/2021] One final note for those who do care about the science: the internet survey, not surprisingly, has a youth bias. The median age of the 25 infected men was 39, maxing out at 45, while the median of the 75 not infected was 42, maxing out at 49.


Did prices go up or down? Depends on how one looks at the data

The U.S. media have been flooded with reports of runaway inflation recently, and it's refreshing to see a nice article in the Wall Street Journal that takes a second look at the data. Because as my readers know, raw data can be incredibly deceptive.

Inflation typically describes the change in price level relative to the same month of the prior year. This year-over-year comparison acts as a simple seasonal adjustment, removing the effect of seasonality that masks the true change in price levels. (See this explainer of seasonal adjustment.)

As the pandemic enters its second year, this methodology compares 2021 price levels to the pandemic-impacted price levels of 2020. This produces a very confusing picture. As the WSJ article explains, prices can be lower than they were in 2019 (pre-pandemic) and yet substantially higher than they were in 2020 (during the pandemic). This happens in industry sectors that were heavily affected by the economic shutdown, e.g. hotels, travel, entertainment.

Wsj_pricechangehotels_20192021

Here is how they visualized this phenomenon. Amusingly, some algorithm estimated that it should take 5 minutes to read the entire article. It may take that much time to understand properly what this chart is showing.

Let me save you some time.

The chart shows monthly inflation rates of hotel price levels.

The pink horizontal stripes represent the official inflation numbers, which compare each month's hotel prices to those of a year prior. The most recent value for May of 2021 says hotel prices rose by 9% compared to May of 2020.

The blue horizontal stripes show an alternative calculation, which compares each month's hotel prices to those of two years prior. Think of 2018-19 as the "normal", pre-pandemic years. Using this measure, we find that hotel prices for May of 2021 are about 4% lower than for May of 2019.
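
To make the two calculations concrete, here is a sketch using hypothetical index levels invented to match the quoted percentages:

```python
# Hypothetical hotel price index levels (May of each year)
may_2019 = 100.0               # pre-pandemic "normal"
may_2021 = 96.0                # 4% below the 2019 level
may_2020 = may_2021 / 1.09     # implied by the +9% year-over-year figure

yoy = may_2021 / may_2020 - 1        # the official inflation number
two_year = may_2021 / may_2019 - 1   # the alternative base year

print(f"{yoy:+.0%}")       # +9% vs the pandemic-depressed 2020
print(f"{two_year:+.0%}")  # -4% vs the pre-pandemic 2019
```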

(This situation affects all of our economic statistics. We may see an expansion in employment levels from a year ago which still leaves us behind where we were before the pandemic.)

What confused me on the WSJ chart are the blocks of color. From a previous chart, readers learn that solid colors mean inflation rose while diagonal lines mean inflation decreased. It turns out that these encode month-over-month changes in inflation rates (notice that one end of each month's column touches one end of the next month's column).

The color patterns become the most dominant feature of this chart, and yet the month-over-month change in inflation rates isn't the crux of the story. The real star of the story should be the difference in inflation rates - for any given month - between two reference years.

***

In the following chart, I focus attention on the within-month, between-reference-years comparisons.

Junkcharts_redo_wsj_inflationbaserate

Because hotel prices dropped drastically during the pandemic, and have recovered quite well in recent months as the U.S. reopens the economy, the inflation rate of hotel prices is almost 10%. Nevertheless, the current price level is still 7% below the pre-pandemic level.


Start at zero improves this chart but only slightly

The following chart was forwarded to me recently:

Average_female_height

It's a good illustration of why the "start at zero" rule exists for column charts. The poor Indian lady looks extremely short in this women's club. Is the average Indian woman really half as tall as the average South African woman? (Surely not!)

Junkcharts_redo_womenheight_column

The problem is only superficially fixed by starting the vertical axis at zero. Doing so highlights the fact that the differences in average heights are but a fraction of the average heights themselves. The between-country differences are squashed in such a representation - which works against the primary goal of the data visualization itself.

Recall the Trifecta Checkup. At the top of the trifecta is the Question. The designer obviously wants to focus our attention on the difference of the averages. A column chart showing average heights fails the job!

This "proper" column chart sends the message that the differences in average heights are noise, unworthy of our attention. But this is a bad take on the underlying data. The range of average heights across countries isn't that wide, by virtue of large population sizes, so even a two-inch gap between averages is notable.

According to Wikipedia, the averages range from 4 ft 10.5 in to 5 ft 6 in (I'm ignoring several entries in the table based on non-representative small samples). How do we know that the difference of 2 inches between the averages of South Africa and India is actually a sizable difference? The Wikipedia table has the average heights for most of the world's countries - perhaps 200 values, sprinkled inside a range of about 8 inches top to bottom. If we divide the full range into 10 equal bins, that's roughly 0.8 inches per bin. So two numbers that are 2 inches apart span about two and a half bins. If the data were evenly distributed, that's a huge shift.

(In reality, the data should be normally distributed - bell-shaped, with much more in the center than on the edges. That makes a difference of 2 inches even more significant if these are values near the center, and less significant if these are extreme values on the tails. Stats students should be able to articulate why we can be sure the data are normally distributed without having to plot them.)
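
Here is one way to make that point concrete. Suppose the country averages are normal over the 8-inch range, with a standard deviation of about 2 inches (my assumption). The same 2-inch gap leapfrogs a third of all countries near the center, but almost none out on the tails:

```python
from statistics import NormalDist

# Assumed: country averages ~ normal, sd of 2 inches (range ~ 4 sd)
heights = NormalDist(mu=0, sigma=2)

# A 2-inch move starting at the center...
print(heights.cdf(2) - heights.cdf(0))   # ~0.34: passes a third of countries

# ...versus the same 2-inch move out on the tail
print(heights.cdf(6) - heights.cdf(4))   # ~0.02: passes almost no one
```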

***

The original chart has further problems.

Another source of distortion comes from the scaling of the stick figures. The aspect ratio is preserved, which means the area is scaled along with the height. Since the heights already encode the data, the data are encoded twice - the second time in the widths. As a result, the sizes of these figures grow with the square of the heights. (Contrast this with the scaling discussed in my earlier post this week, which preserves the relative areas.)
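
The double encoding is easy to quantify: preserving the aspect ratio means the area grows with the square of the height. Using the two ends of the Wikipedia range quoted above:

```python
h_tall, h_short = 66.0, 58.5   # heights in inches (5 ft 6 in vs 4 ft 10.5 in)

height_ratio = h_tall / h_short
area_ratio = height_ratio ** 2   # the width scales along with the height

print(round(height_ratio, 2))   # 1.13 -> 13% taller
print(round(area_ratio, 2))     # 1.27 -> but 27% more ink
```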

At the end of that last post, I discuss why adding colors to a chart when the colors do not encode any data is a distraction to the reader. And this average height chart is an example.

From the Data corner of the Trifecta Checkup, I'm intrigued by the choice of countries. Why is Scotland highlighted instead of the U.K.? Why Latvia? According to Wikipedia, the Latvia estimate is based on a 1% sample of only 19 year olds.

Some of the data appear to be incorrect (or the designer used a different data source). Wikipedia lists the average height of Latvian women as 5 ft 6.5 in while the chart shows 5 ft 5 in. Peru's average height is listed as 4 ft 11.5 in for females and 5 ft 4.5 in for males; the chart shows 5 ft 4 in.

***

Lest we think only amateurs make this type of chart, here is an example of a similar chart in a scientific research journal:

Fnhum-14-00338-g007

(link to original)

I have seen many versions of the above column charts with error bars, and the vertical axes not starting at zero. In every case, the heights (and areas) of these columns do not scale with the underlying data.

***

I tried a variant of the stem-and-leaf plot:

Junkcharts_redo_womenheight_stemleaf

The scale is chosen to reflect the full range of average heights given in Wikipedia. The chart works better with more countries to fill out the distribution. It shows India is on the short end of the scale but not quite the lowest. (As mentioned above, Peru actually should be placed close to the lower edge.)


Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players played better in 2020, when Covid-19 kept fans out of the stadiums, because they were not being distracted by racist abuse from the stands. Italy's Serie A is the case study.

Econ_seriea_racism

The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion rests primarily on the two thick lines, which show the average performance of white and non-white players with and without fans. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. All (presumably) players are ranked by their performance scores from lowest to highest, and placed into ten equally sized tiers (known as "deciles"). They are sorted by their 2019 performance, when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are re-rated in 2020, when stadiums were empty. For each tier, an average 2020 performance score is computed and compared to the 2019 average.
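
In code, the construction might look like the following - the column names are invented, since the Economist did not publish its pipeline:

```python
import pandas as pd

def decile_averages(df):
    """Rank players by 2019 score, cut into ten tiers, average both years."""
    df = df.copy()
    # Tiers are fixed by the 2019 (with fans) performance only
    df["tier"] = pd.qcut(df["score_2019"].rank(method="first"), 10, labels=False)
    return df.groupby("tier")[["score_2019", "score_2020"]].mean()

# Tier averages of score_2019 increase by construction; the 2020 averages
# need not, since the 2020 scores played no part in the ranking.
```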

The following chart reveals the structure of the data:

Junkcharts_redo_seriea_racism

The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know whether the blue and gray lines differ by chance or because fans were absent from the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of a given player's score across, say, the 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.

***

This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority, so the error bars around their averages are wider - another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players reflects a much larger sample size, thus its season-to-season fluctuations are much smaller (fans or no fans).


Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)

Wsj_migration

I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking about around 100,000 people moving, at a time when the death toll of Covid-19 is nearing 600,000. Some people have moved, but almost everyone else hasn't.

***

The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which says that the May 2018 data for "small/medium metro" were omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb - the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium-green sliver in the bottom half of the chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thought it might draw unnecessary attention to one data point that really doesn't warrant it.

***

See my collection of posts about Wall Street Journal graphics.


These are the top posts of 2020

It's always very interesting as a writer to look back at a year's worth of posts and find out which ones were most popular with my readers.

Here are the top posts on Junk Charts from 2020:

How to read this chart about coronavirus risk

This post about a New York Times scatter plot dates from February, a time when many Americans were debating whether Covid-19 was just the flu.

Proportions and rates: we are no dupes

This post about an ArsTechnica chart on the effects of Covid-19 by age is an example of designing the visual to reflect the structure of the data.

When the pie chart is more complex than the data

This post shows a 3D pie chart which is worse than a 2D pie chart.

Twitter people upset with that Covid symptoms diagram

This post discusses some complicated graphics designed to illustrate complicated datasets on Covid-19 symptoms.

Cornell must remove the logs before it reopens in the fall

This post is another warning to think twice before you use log scales.

What is the price of objectivity?

This post turns an "objective" data visualization into a piece of visual story-telling.

The snake pit chart is the best election graphic ever

This post introduces my favorite U.S. presidential election graphic, designed by the FiveThirtyEight team.

***

Here is a list of posts that deserve more attention:

Locating the political center

An example of bringing readers as close to the insights as possible

Visualizing change over time

An example of designing data visualization to reflect the structure of multivariate data

Bloomberg made me digest these graphics slowly

An example of simple and thoughtful graphics

The hidden bad assumption behind most dual-axis time-series charts

Read this before you make a dual-axis chart

Pie chart conventions

Read this before you make a pie chart

***
Looking forward to bringing you more content in 2021!

Happy new year.


Is this an example of good or bad dataviz?

This chart is giving me feelings:

Trump_mcconnell_chart

I first saw it on TV and then a reader submitted it.

Let's apply a Trifecta Checkup to the chart.

Starting at the Q corner, I can say the question it addresses is clear and relevant: the relationship between Trump and McConnell's re-election. The designer's intended message comes through strongly - the chart offers evidence that McConnell owes his re-election to Trump.

Visually, the graphic has elements of great story-telling. It presents a simple (others might say simplistic) view of the data - just the poll results of McConnell vs McGrath at various times, plus the election result. It then flags key events along the timeline, drawing the reader's attention to those.

The chart includes wise design choices, such as no gridlines, infusing the legend into the chart title, no decimals (except for the last pair of numbers, the intention of which escapes me), and leading with the key message.

I can nitpick a few things. Get rid of the vertical axis. Also, expand the scale so that the difference between 51%-40% and 58%-38% becomes more apparent. Space the time points in proportion to the dates. The box at the bottom is a confusing afterthought that reduces rather than assists the messaging.

But the designer got the key things right. The above suggestions would not alter the reader's experience all that much. It's a nice piece of visual story-telling, and from what I can see, has made a strong impact on the audience it is intended to influence.

_trifectacheckup_junkcharts

This chart is proof of why the Trifecta Checkup has three corners, plus linkages between them. If we just evaluate what the visual is conveying, this chart is clearly above average.

***

In the D corner, we ask: what are the Data saying?

This is where the chart runs into several problems. Let's focus on the last two sets of numbers: 51%-40% and 58%-38%. Add up each pair - do you notice something?

The last poll sums to 91%, which means up to 10% of likely voters responded "not sure" or named some other candidate. If these "shy" voters showed up at the polls as predicted by the pollsters, and if they voted just like the non-shy voters, then the election result would have been 56%-44%, not 51%-40%. So the 58%-38% result is within the margin of error of these polls. (If the "shy" voters broke for McConnell in a 75%-25% split, he gets his 58% of the total votes.)
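
Here is that poll arithmetic, spelled out:

```python
mcconnell_poll, mcgrath_poll = 0.51, 0.40
undecided = 1 - mcconnell_poll - mcgrath_poll   # ~9% not sure / others

# If the undecided split like everyone else:
print(round(mcconnell_poll / (mcconnell_poll + mcgrath_poll), 2))  # 0.56

# If the undecided break 75/25 for McConnell:
print(round(mcconnell_poll + 0.75 * undecided, 2))  # 0.58, matching the result
```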

So, the data behind the line chart aren't suggesting that the election outcome is anomalous. This presents a problem with the Q-D and D-V green arrows, as these pairs are not in sync.

***

In the D corner, we should consider the totality of the data available to the designer, not just what the designer chooses to utilize. The pivot of the chart is the flag annotating the "Trump robocall."

Here are some questions I'd ask the designer:

What else happened on October 31 in Kentucky?

What else happened on October 31, elsewhere in the country?

Was Trump featured in any other robocalls during the period portrayed?

How many robocalls were made by the campaign, and what other celebrities were featured?

Did any other campaign event or effort happen between the Trump robocall and election day?

Is there evidence that nothing else that happened after the robocall produced any value?

The chart commits the XYopia (i.e., X-Y myopia) fallacy of causal analysis. When the data analyst presents one cause and one effect, we are cued to think the cause explains the effect. But in every scenario that is not a designed experiment, there are multiple causes at play, and sometimes the more influential cause isn't the one shown on the chart.

***

Finally, let's draw out the connection between the last set of poll numbers and the election results. This shows why causal inference in observational data is such a beast.

Poll numbers are based on a small number of people (500-1,000 in the case of Kentucky polls) who respond to polling. Election results are based on actual voters (over 2 million). An assumption made by the designer is that these polls were properly conducted, and their results are credible.

The chart above makes the claim that Trump's robocall gave McConnell 7% more votes than expected. This implies the robocall influenced at least 140,000 voters. Each such voter must fit the following criteria:

  • Was targeted by the Trump robocall
  • Was reached by the Trump robocall (phone was on, etc.)
  • Responded to the Trump robocall, by either picking up the phone or listening to the voice recording or dialing a call-back number
  • Did not previously intend to vote for McConnell
  • If reached by a pollster, would refuse to respond, or say not sure, or voting for McGrath or a third candidate
  • Had no other reason to change his/her behavior

Just take the first bullet as an example. If we find a voter who switched to McConnell after October 31, and this person was not on the robocall list, then this voter contributes to the unexpected gain in McConnell votes but weakens the case that the robocall influenced the election.

As analysts, our job is to find data to investigate all of the above. Some of these are easier to investigate than others. The campaign knows, for example, how many people were on the target list, and how many listened to the voice recording.
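
For the record, here is the arithmetic behind the 140,000 figure, using the rough totals quoted above:

```python
total_votes = 2_000_000        # rough Kentucky turnout quoted above
expected_share = 0.51          # last poll
actual_share = 0.58            # election result

swing_voters = (actual_share - expected_share) * total_votes
print(int(swing_voters))   # 140,000 voters, each satisfying every criterion above
```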


Podcast highlights

Recently, I recorded a podcast with Ryan Ray, which you can access here. The link sends you to a 14-day free trial of his newsletter, which is where he publishes his podcasts.

Kaiserfung_warroommedia

Ryan contacted me after he read my book Numbers Rule Your World (link). I was happy to learn that he enjoyed the stories, and during the podcast, he gave an example of how he applied the statistical concepts to other situations.

During the podcast, you will hear:

  • I have a line in my course syllabus that reads "after you take this class, you will not be able to look at numbers (in the media) with a straight face ever again." That's a goal of mine, and it also applies to my books.
  • Why most statisticians are skeptics
  • Figuring out the statistical conclusions is the easy part; the hardest challenge is finding a way to communicate them to a non-technical audience. I went through many drafts before I landed on the precise language used in those stories.
  • Why "correlation is not causation" is not useful practical advice
  • You can't unsee something you've already seen, and this creates hindsight bias
  • The biggest bang for the buck when improving statistical models comes from improving data quality
  • Some models, such as polls and election forecasts, can be thought of as thermometers measuring the mood of the respondents at the time of polling.

***

To hear the podcast, visit Ryan Ray's website.


A testing mess: one chart, four numbers, four colors, three titles, wrong units, wrong lengths, wrong data

Twitterstan wanted to vote the following infographic off the island:

Tes_Alevelsresults

(The publisher's website is here but I can't find a direct link to this graphic.)

The mishap is particularly galling given the controversy swirling around this year's A-Level results in the U.K. For U.S. readers, you can think of A-Levels as SAT Subject Tests, which in the U.K. are required of all university applicants, and represent the most important, if not the sole, determinant of admissions decisions. Please see this post on my book blog for coverage of the brouhaha surrounding the statistical adjustments.

The first issue you may notice about the chart is that the bar lengths have no relationship with the numbers printed on them. Here is a scatter plot correlating the bar lengths and the data.

Junkcharts_redo_tes_alevels_scatter


As you can see, nothing.

Then, you may wonder what the numbers mean. The annotation at the bottom right says "Average number of A level qualifications per student". Wow, the British (in this case, English) education system is a genius factory - the average student masters close to three thousand subjects in secondary (high) school!

TES is the cool name for what used to be the Times Educational Supplement. I traced the data back to Ofqual, which is the British regulator for these examinations. This is the Ofqual version of the above chart:

Ofqual_threeAstar

The data match. You may notice that the header of the data table reads "Number of students in England getting 3 x A*". This is a completely different metric than the number of qualifications - in fact, this metric counts geniuses. "A*" is the U.K. equivalent of "A+". When I studied under the British system, there was no such grade. I guess grade inflation is happening all over the world: what used to be an A is now an A+, and what used to be a B is now an A. Scoring three A*s is tops - I wonder if this should say 3 or more, because I recall that you can take as many subjects as you desire, but most students max out at three (it may have been four).

The number of students attaining the highest achievement has increased in the last two years, compared to the two years before. We can't interpret these data unless we know whether the number of students also grew at similar rates.

The units on the Ofqual chart are students, while the units we expect from the TES graphic are subjects. Also, the Ofqual cutoff defines top students, while the TES graphic should connote minimum qualification, i.e. a passing grade.

***

Now, the next section of the Ofqual infographic resolves the mystery. Here is the chart:

Ofqual_Alevelquals

This dataset has the right units and measurement. There is almost no meaningful shift in the last four years - the average number of qualifications per student differs only in the second decimal place. Replacing the original data with this set removes the confusion.

Junkcharts_redo_tes_alevels_correctdata

While I was re-making this chart, I also cleaned out the headers and sub-headers. This is an example of software hegemony: the designer wouldn't have repeated the same information three times on a chart with four numbers if s/he wasn't prompted by software defaults.

***

The corrected chart violates one of the conventions I described in my tutorial for DataJournalism.com: color difference should reflect data difference.

In the following side-by-side comparison, you see that the use of multiple colors on the left chart signals different data - note especially the top and bottom bars, which carry the same number; our expectation is frustrated.

Junkcharts_redo_tes_alevels_sidebyside

***

[P.S. 8/25/2020. Dan V. pointed out another problem with these bar charts: the bars were truncated so that the bar lengths are not proportional to the data. The corrected chart is shown on the right below:

Junkcharts_redo_tes_alevels_barlengths

8/26/2020: added link to the related post on my book blog.]