« May 2019 | Main | July 2019 »

Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

Visualizingbaseball_coverThe best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one receives a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. A lesser number of pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has played out?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.


In the final chapters, Albert introduced the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation will have a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to make the prediction that the Red Sox has a say 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not the stated goal of Albert to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.


While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is first describe the analytical conclusion, then introduce a chart to illustrate that conclusion. The inverse would be: Start with the chart, and use the chart to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):


By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL. Or that the axis labels were “horizontal location” and “vertical location” (check) instead of px and pz. [Note: The chart above was taken from the book's github site; in the  Figure 5.8 in the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit” .

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.


Visualizing Baseball is not the book for readers who learn by running code as no code is included in the book. A github page by the author hosts the code, but only the R/ggplot2 code for generating the data visualization. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don’t recommend learning coding from copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.

Inspiration from a waterfall of pie charts: illustrating hierarchies

Reader Antonio R. forwarded a tweet about the following "waterfall of pie charts" to me:


Maarten Lamberts loved these charts (source: here).

I am immediately attracted to the visual thinking behind this chart. The data are presented in a hierarchy with three levels. The levels are nested in the sense that the pieces in each pie chart add up to 100%. From the first level to the second, the category of freshwater is sub-divided into three parts. From the second level to the third, the "others" subgroup under freshwater is sub-divided into five further categories.

The designer faces a twofold challenge: presenting the proportions at each level, and integrating the three levels into one graphic. The second challenge is harder to master.

The solution here is quite ingenious. A waterfall/waterdrop metaphor is used to link each layer to the one below. It visually conveys the hierarchical structure.


There remains a little problem. There is a confusion related to the part and the whole. The link between levels should be that one part of the upper level becomes the whole of the lower level. Because of the color scheme, it appears that the part above does not account for the entirety of the pie below. For example, water in lakes is plotted on both the second and third layers while water in soil suddenly enters the diagram at the third level even though it should be part of the "drop" from the second layer.


I started playing around with various related forms. I like the concept of linking the layers and want to retain it. Here is one graphic inspired by the waterfall pies from above:



The Periodic Table, a challenge in information organization

Reader Chris P. points me to this article about the design of the Periodic Table. I then learned that 2019 is the “International Year of the Periodic Table,” according to the United Nations.

Here is the canonical design of the Periodic Table that science students are familiar with.


(Source: Wikipedia.)

The Periodic Table is an exercise of information organization and display. It's about adding structure to over 100 elements, so as to enhance comprehension and lookup. The canonical tabular design has columns and rows. The columns (Groups) impose a primary classification; the rows (Periods) provide a secondary classification. The elements also follow an aggregate order, which is traced by reading from top left to bottom right. The row structure makes clear the "periodicity" of the elements: the "period" of recurrence is not constant, tending to increase with the heavier elements at the bottom.

As with most complex datasets, these elements defy simple organization, due to a curse of dimensionality. The general goal is to put the similar elements closer together. Similarity can be defined in an infinite number of ways, such as chemical, physical or statistical properties. The canonical design, usually attributed to Russian chemist Mendeleev, attained its status because the community accepted his organizing principles, that is, his definitions of similarity (subsequently modified).


Of interest, there is a list of unsettled issues. According to Wikipedia, the most common arguments concern:

  • Hydrogen: typically shown as a member of Group 1 (first column), some argue that it doesn’t belong there since it is a gas not a metal. It is sometimes placed in Group 17 (halogens), where it forms a nice “triad” with fluorine and chlorine. Other designers just float hydrogen up top.
  • Helium: typically shown as a member of Group 18 (rightmost column), the  halogens noble gases, it may also be placed in Group 2.
  • Mercury: usually found in Group 12, some argue that it is not a metal like cadmium and zinc.
  • Group 3: other than the first two elements , there are various voices about how to place the other elements in Group 3. In particular, the pairs of lanthanum / actinium and lutetium / lawrencium are sometimes shown in the main table, sometimes shown in the ‘f-orbital’ sub-table usually placed below the main table.


Over the years, there have been numerous attempts to re-design the Periodic table. Some of these are featured in the article that Chris sent me (link).

I checked how these alternative designs deal with those unsettled issues. The short answer is they don't settle the issues.

Wide Table (Janet)

The key change is to remove the separation between the main table and the f-orbital (pink) section shown below, as a "footnote". This change clarifies the periodicity of the elements, especially the elongating periods as one moves down the table. This form is also called "long step".


As a tradeoff, this table requires more space and has an awkward aspect ratio.

In this version of the wide table, the designer chooses to stack lutetium / lawrencium in Group 3 as part of the main table. Other versions place lanthanum / actinium in Group 3 as part of the main table. There are even versions that leave Group 3 with two elements.

Hydrogen, helium and mercury retain their conventional positions.


Spiral Design (Hyde)

There are many attempts at spiral designs. Here is one I found on this tumblr:


The spiral leverages the correspondence between periodic and circular. It is visually more pleasing than a tabular arrangement. But there is a tradeoff. Because of the increasing "diameter" from inner to outer rings, the inner elements are visually constrained compared to the outer ones.

In these spiral diagrams, the designer solves the aspect-ratio problem by creating local loops, sometimes called peninsulas. This is analogous to the footnote table solution, and visually distorts the longer periodicity of the heavier elements.

For Hyde's diagram, hydrogen is floated, helium is assigned to Group 2, and mercury stays in Group 12.



I also found this design on the same tumblr, but unattributed. It may have come from Life magazine.


It's a variant of the spiral. Instead of peninsulas, the designer squeezes the f-orbital section under Group 3, so this is analogous to the wide table solution.

The circular diagrams convey the sense of periodic return but the wide table displays the magnitudes more clearly.

This designer places hydrogen in group 18 forming a triad with fluorine and chlorine. Helium is in Group 17 and mercury in the usual Group 12 .


Cartogram (Sheehan)

This version is different.


The designer chooses a statistical property (abundance) as the primary organizing principle. The key insight is that the lighter elements in the top few rows are generally more abundant - thus more important in a sense. The cartogram reveals a key weakness of the spiral diagrams that draw the reader's attention to the outer (heavier) elements.

Because of the distorted shapes, the cartogram form obscures much of the other data. In terms of the unsettled issues, hydrogen and helium are placed in Groups 1 and 2. Mercury is in Group 12. Group 3 is squeezed inside the main table rather than shown below.



The centerpiece of the article Chris sent me is a network graph.


This is a complete redesign, de-emphasizing the periodicity. It's a result of radically changing the definition of similarity between elements. One barrier when introducing entirely new displays is the tendency of readers to expect the familiar.


I found the following articles useful when researching this post:

The Conversation

Royal Chemistry Society


A chart makes an appearance in my new video

Been experimenting with short videos recently. My latest is a short explainer on why some parents are willing to spend over a million dollars to open back doors to college admissions. I even inserted a chart showing some statistics. Click here to see the video.


Also, subscribe to my channel to see future episodes of Inside the Black Box.


Here are a couple of recent posts related to college admissions.

  • About those so-called adversity scores (link)
  • A more detailed post on various college admissions statistics (link)

Wayward legend takes sides in a chart of two sides, plus data woes

Reader Chris P. submitted the following graph, found on Axios:


From a Trifecta Checkup perspective, the chart has a clear question: are consumers getting what they wanted to read in the news they are reading?

Nevertheless, the chart is a visual mess, and the underlying data analytics fail to convince. So, it’s a Type DV chart. (See this overview of the Trifecta Checkup for the taxonomy.)


The designer did something tricky with the axis but the trick went off the rails. The underlying data consist of two set of ranks, one for news people consumed and the other for news people wanted covered. With 14 topics included in the study, the two data series contain the same values, 1 to 14. The trick is to collapse both axes onto one. The trouble is that the same value occurs twice, and the reader must differentiate the plot symbols (triangle or circle) to figure out which is which.

It does not help that the lines look like arrows suggesting movement. Without first reading the text, readers may assume that topics change in rank between two periods of time. Some topics moved right, increasing in importance while others shifted left.

The design wisely separated the 14 topics into three logical groups. The blue group comprises news topics for which “want covered” ranking exceeds the “read” ranking. The orange group has the opposite disposition such that the data for “read” sit to the right side of the data for “want covered”. Unfortunately, the legend up top does more harm than good: it literally takes sides!


Here, I've put the data onto a scatter plot:


The two sets of ranks are basically uncorrelated, as the regression line is almost flat, with “R-squared” of 0.02.

The analyst tried to "rescue" the data in the following way. Draw the 45-degree line, and color the points above the diagonal blue, and those below the diagonal orange. Color the points on the line gray. Then, write stories about those three subgroups.


Further, the ranking of what was read came from Parse.ly, which appears to be surveillance data (“traffic analytics”) while the ranking of what people want covered came from an Axios/SurveyMonkey poll. As for as I could tell, there was no attempt to establish that the two populations are compatible and comparable.






Clarifying comparisons in censored cohort data: UK housing affordability

If you're pondering over the following chart for five minutes or more, don't be ashamed. I took longer than that.


The chart accompanied a Financial Times article about inter-generational fairness in the U.K. To cut to the chase, a recently released study found that younger generations are spending substantially higher proportions of their incomes to pay for housing costs. The FT article is here (behind paywall). FT actually slightly modified the original chart, which I pulled from the Home Affront report by the Intergenerational Commission.


One stumbling block is to figure out what is plotted on the horizontal axis. The label "Age" has gone missing. Even though I am familiar with cohort analysis (here, generational analysis), it took effort to understand why the lines are not uniformly growing in lengths. Typically, the older generation is observed for a longer period of time, and thus should have a longer line.

In particular, the orange line, representing people born before 1895 only shows up for a five-year range, from ages 70 to 75. This was confusing because surely these people have lived through ages 20 to 70. I'm assuming the "left censoring" (missing data on the left side) is because of non-existence of old records.

The dataset is also right-censored (missing data on the right side). This occurs with the younger generations (the top three lines) because those cohorts have not yet reached certain ages. The interpretation is further complicated by the range of birth years in each cohort but let me not go there.

TL;DR ... each line represents a generation of Britons, defined by their birth years. The generations are compared by how much of their incomes did they spend on housing costs. The twist is that we control for age, meaning that we compare these generations at the same age (i.e. at each life stage).


Here is my version of the same chart:


Here are some of the key edits:

  • Vertical blocks are introduced to break up the analysis by life stage. These guide readers to compare the lines vertically i.e. across generations
  • The generations are explicitly described as cohorts by birth years
  • The labels for the generations are placed next to the lines
  • Gridlines are pushed to the back
  • The age axis is explicitly labeled
  • Age labels are thinned
  • A hierarchy on colors
  • The line segments with incomplete records are dimmed

The harmful effect of colors can be seen below. This chart is the same as the one above, except for retaining the colors of the original chart:




Clearing a forest of labels

This chart by the Financial Times has a strong message, and I like a lot about it:


The countries are by and large aligned along a diagonal, with the poorer countries growing strongly between 2007-2019 while the richer countries suffered negative growth.

A small issue with the chart is the thick forest of text - redundant text. The sub-title, the axis titles, the quadrant labels, and the left-right-half labels all repeat the same things. In the following chart, I simplify the text:


Typically, I don't put axis titles as a sub-header (or, header of the graphic) but as this may be part of the FT style, I respected it.

Re-thinking a standard business chart of stock purchases and sales

Here is a typical business chart.


A possible story here: institutional investors are generally buying AMD stock, except in Q3 2018.

Let's give this chart a three-step treatment.

STEP 1: The Basics

Remove the data labels, which stand sideways awkwardly, and are redundant given the axis labels. If the audience includes people who want to take the underlying data, then supply a separate data table. It's easier to copy and paste from, and doing so removes clutter from the visual.

The value axis is probably created by an algorithm - hard to imagine someone deliberately placing axis labels  $262 million apart.

The gridlines are optional.


STEP 2: Intermediate

Simplify and re-organize the time axis labels; show the quarter and year structure. The years need not repeat.

Align the vocabulary on the chart. The legend mentions "inflows and outflows" while the chart title uses the words "buying and selling". Inflows is buying; outflows is selling.


STEP 3: Advanced

This type of data presents an interesting design challenge. Arguably the most important metric is the net purchases (or the net flow), i.e. inflows minus outflows. And yet, the chart form leaves this element in the gaps, visually.

The outflows are numerically opposite to inflows. The sign of the flow is encoded in the color scheme. An outflow still points upwards. This isn't a criticism, but rather a limitation of the chart form. If the red bars are made to point downwards to indicate negative flow, then the "net flow" is almost impossible to visually compute!

Putting the columns side by side allows the reader to visually compute the gap, but it is hard to visually compare gaps from quarter to quarter because each gap is hanging off a different baseline.

The following graphic solves this issue by focusing the chart on the net flows. The buying and selling are still plotted but are deliberately pushed to the side:


The structure of the data is such that the gray and pink sections are "symmetric" around the brown columns. A purist may consider removing one of these columns. In other words:


Here, the gray columns represent gross purchases while the brown columns display net purchases. The reader is then asked to infer the gross selling, which is the difference between the two column heights.

We are almost back to the original chart, except that the net buying is brought to the foreground while the gross selling is pushed to the background.