Asymmetry and orientation

An author in Significance claims that a single season of Premier League football without live spectators is enough to prove that the so-called home field advantage is really a live-spectator advantage.

The following chart depicts the data going back many seasons:


I find this bar chart challenging.

It plots the ratio of home wins to away wins using an odds scale, which is not intuitive. The odds scale (probability of success divided by probability of failure) runs from 0 to positive infinity, with 1 being a special value indicating equal odds. But all the values for which away wins exceed home wins are squeezed into the interval between 0 and 1 while the values for which home wins exceed away wins are laid out between 1 and infinity. So it's an inherently asymmetric graphic for a symmetric formula.

The section labeled "more away wins than home wins" are filled with red bars for all those seasons with positive home field advantage while the most recent season, the outlier, has a shorter bar in that section than the rest.

Here's an alternative view:


I have incorporated dual axes here - but both axes are different only by scaling. There are 380 games in a Premier League season so the percentage scale is just a re-expression of the counts.



Plotting the signal or the noise

Antonio alerted me to the following graphic that appeared in the Economist. This is a playful (?) attempt to draw attention to racism in the game of football (soccer).

The analyst proposed that non-white players have played better in stadiums without fans due to Covid19 in 2020 because they have not been distracted by racist abuse from fans, using Italy's Serie A as the case study.


The chart struggles to bring out this finding. There are many lines that criss-cross. The conclusion is primarily based on the two thick lines - which show the average performance with and without fans of white and non-white players. The blue line (non-white) inched to the right (better performance) while the red line (white) shifted slightly to the left.

If the reader wants to understand the chart fully, there's a lot to take in. All (presumably) players are ranked by the performance score from lowest to highest into ten equally sized tiers (known as "deciles"). They are sorted by the 2019 performance when fans were in the stadiums. Each tier is represented by the average performance score of its members. These are the values shown on the top axis labeled "with fans".

Then, with the tiers fixed, the players are rated in 2020 when stadiums were empty. For each tier, an average 2020 performance score is computed, and compared to the 2019 performance score.

The following chart reveals the structure of the data:


The players are lined up from left to right, from the worst performers to the best. Each decile is one tenth of the players, and is represented by the average score within the tier. The vertical axis is the actual score while the horizontal axis is a relative ranking - so we expect a positive correlation.

The blue line shows the 2019 (with fans) data, which are used to determine tier membership. The gray dotted line is the 2020 (no fans) data - because they don't decide the ranking, it's possible that the average score of a lower tier (e.g. tier 3 for non-whites) is higher than the average score of a higher tier (e.g. tier 4 for non-whites).

What do we learn from the graphic?

It's very hard to know if the blue and gray lines are different by chance or by whether fans were in the stadium. The maximum gap between the lines is not quite 0.2 on the raw score scale, which is roughly a one-decile shift. It'd be interesting to know the variability of the score of a given player across say 5 seasons prior to 2019. I suspect it could be more than 0.2. In any case, the tiny shifts in the averages (around 0.05) can't be distinguished from noise.


This type of analysis is tough to do. Like other observational studies, there are multiple problems of biases and confounding. Fan attendance was not the only thing that changed between 2019 and 2020. The score used to rank players is a "Fantacalcio algorithmic match-level fantasy-football score." It's odd that real-life players should be judged by their fantasy scores rather than their on-the-field performance.

The causal model appears to assume that every non-white player gets racially abused. At least, the analyst didn't look at the curves above and conclude, post-hoc, that players in the third decile are most affected by racial abuse - which is exactly what has happened with the observational studies I have featured on the book blog recently.

Being a Serie A fan, I happen to know non-white players are a small minority so the error bars are wider, which is another issue to think about. I wonder if this factor by itself explains the shifts in those curves. The curve for white players has a much higher sample size thus season-to-season fluctuations are much smaller (regardless of fans or no fans).





The windy path to the Rugby World Cup

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently playing in Japan:


Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.


(This one hasn't been updated and still contains blank entries.)


What are some of the key messages one would want the dataviz to deliver?

  • For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
  • For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
  • What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
  • How many countries had a chance to win one of the eight spots?
  • Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Did the losers immediately disqualify, or were they offered another chance?

Here's my take on this chart:



Tennis greats at the top of their game

The following chart of world No. 1 tennis players looks pretty but the payoff of spending time to understand it isn't high enough. The light colors against the tennis net backdrop don't work as intended. The annotation is well done, and it's always neat to tug a legend inside the text.


The original is found at Tableau Public (link).

The topic of the analysis appears to be the ages at which tennis players attained world #1 ranking. Here are the male players visualized differently:


Some players like Jimmy Connors and Federer have second springs after dominating the game in their late twenties. It's relatively rare for players to get to #1 after 30.

This Wimbledon beauty will be ageless


This Financial Times chart paints the picture of the emerging trend in Wimbledon men’s tennis: the average age of players has been rising, and hits 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot  reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.


Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.


Finally, a key data processing decision is disclosed in chart header and sub-header. The chart only plots the players who reached the fourth round (16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data. The chart would have appeared more cluttered otherwise, in a negative way.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.


This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)

Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

Visualizingbaseball_coverThe best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one receives a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. A lesser number of pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has played out?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.


In the final chapters, Albert introduced the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation will have a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to make the prediction that the Red Sox has a say 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not the stated goal of Albert to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.


While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is first describe the analytical conclusion, then introduce a chart to illustrate that conclusion. The inverse would be: Start with the chart, and use the chart to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):


By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL. Or that the axis labels were “horizontal location” and “vertical location” (check) instead of px and pz. [Note: The chart above was taken from the book's github site; in the  Figure 5.8 in the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit” .

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.


Visualizing Baseball is not the book for readers who learn by running code as no code is included in the book. A github page by the author hosts the code, but only the R/ggplot2 code for generating the data visualization. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don’t recommend learning coding from copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.

The Bumps come to the NBA, courtesy of 538

The team at 538 did a post-mortem of their in-season forecasts of NBA playoffs, using Bumps charts. These charts have a long history and can be traced back to Cambridge rowing. I featured them in these posts from a long time ago (link 1, link 2). 

Here is the Bumps chart for the NBA West Conference showing all 15 teams, and their ranking by the 538 model throughout the season. 


The highlighted team is the Kings. It's a story of ascent especially in the second half of the season. It's also a story of close but no cigar. It knocked at the door for the last five weeks but failed to grab the last spot. The beauty of the Bumps chart is how easy it is to see this story.

Now, if you'd focus on the dotted line labeled "Makes playoffs," and note that beyond the half-way point (1/31), there are no further crossings. This means that the 538 model by that point has selected the eight playoff teams accurately.


Now what about NBA East?


This chart highlights the two top teams. This conference is pretty easy to predict at the top. 

What is interesting is the spaghetti around the playoff line. The playoff race was heart-stopping and it wasn't until the last couple of weeks that the teams were settled. 

Also worthy of attention are the bottom-dwellers. Note that the chart is disconnected in the last four rows (ranks 12 to 15). These four teams did not ever leave the cellar, and the model figured out the final rankings around February.

Using a similar analysis, you can see that the model found the top 5 teams by mid December in this Conference, as there are no further crossings beyond that point. 

Go check out the FiveThirtyEight article for their interpretation of these charts. 

While you're there, read the article about when to leave the stadium if you'd like to leave a baseball game early, work that came out of my collaboration with Pravin and Sriram.

Beauty is in the eyes of the fishes

Reader Patrick S. sent in this old gem from Germany.


He said:

It displays the change in numbers of visitors to public pools in the German city of Hanover. The invisible y-axis seems to be, um, nonlinear, but at least it's monotonic, in contrast to the invisible x-axis.

There's a nice touch, though: The eyes of the fish are pie charts. Black: outdoor pools, white: indoor pools (as explained in the bottom left corner).

It's taken from a 1960 publication of the city of Hanover called *Hannover: Die Stadt in der wir leben*.

This is the kind of chart that Ed Tufte made (in)famous. The visual elements do not serve the data at all, except for the eyeballs. The design becomes a mere vessel for the data table. The reader who wants to know the growth rate of swimmers has to do a tank of work.

The eyeballs though.

I like the fact that these pie charts do not come with data labels. This part of the chart passes the self-sufficiency test. In fact, the eyeballs contain the most interesting story in this chart. In those four years, the visitors to public pools switched from mostly indoor pools to mostly outdoor pools. These eyeballs show that pie charts can be effective in specific situations.

Now, Hanover fishes are quite lucky to have free admission to the public pools!

A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is, the "Happy 700" article by Washington Post is amazing.



When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":


The color scheme across all three maps shows a keen awareness of background/foreground concerns. 

Light entertainment: this looks like a bar chart

Long-time reader Daniel L. said this made him laugh.


This prompted me revive a feature I used to run on here called "Light entertainment." Dataviz work that are so easy to ridicule that one wonders if they weren't just made for the laughs. See all previous installments here.

Daniel also said it fails the Trifecta Checkup. What is the question the chart is addressing and what's the message? It's a bar chart, the axis not starting at zero, with multiple colors and Moire effects. and missing labels!