Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

The best feature of Albert's new volume is its brevity. For someone with a decent background in statistics (and a grasp of basic baseball jargon), it's a book that can be consumed within a week, after which one comes away with a good overview of baseball analytics, otherwise known as sabermetrics.

In fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. Fewer pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, though, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what's the value of telling a coach that a home run was the pivotal moment in a 1-0 game that has already played out?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.

***

In the final chapters, Albert introduces the simulation of “fake” seasons that underlies probabilistic forecasts. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation takes on a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that's how it is possible to predict that the Red Sox have, say, a 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is accepting the logic of this fake simulated world. Convincing readers of the statistical way of thinking is not Albert's stated goal – but you're not going to be convinced unless you think about why we do it this way.
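To make the logic concrete, here is a minimal sketch of this kind of simulation in R. It is not Albert's actual model: the team win probabilities are invented, and each game is treated as an independent coin flip.

# Minimal sketch (not the book's model): estimate a team's championship chances
# by simulating many fake seasons with assumed, made-up game-win probabilities.
set.seed(2018)

simulate_season <- function(p_win, n_games = 162) {
  # Wins in one fake season, assuming independent games with a constant win probability
  rbinom(1, size = n_games, prob = p_win)
}

# Hypothetical game-win probabilities for a four-team league (invented numbers)
teams <- c(BOS = 0.58, NYY = 0.55, HOU = 0.56, LAD = 0.54)

n_sims <- 1000
champions <- replicate(n_sims, {
  wins <- sapply(teams, simulate_season)
  names(which.max(wins))   # crown the team with the most simulated wins
})

# Share of fake seasons in which each team finishes on top
round(table(champions) / n_sims, 2)

The fraction of fake seasons in which a given team finishes on top is exactly the kind of "X% chance of winning the World Series" statement quoted above.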

***

While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With a few exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is to first state the analytical conclusion, then introduce a chart to illustrate it. The inverse would be to start with the chart, and use it to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one piece of software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software's designers. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):

Albert_visualizingbaseball_chart

By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL. Or if the axis labels were “horizontal location” and “vertical location” instead of px and pz. [Note: The chart above was taken from the book's github site; in Figure 5.8 of the printed book, the chart titles were edited as suggested.]

The chart analyzes the locations, relative to the strike zone, of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit”.
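Here is a rough sketch of that relabelling in ggplot2. The data frame "pitches" and its columns (px, pz, pitch_type, Miss) are my assumptions about the underlying PITCHf/x data, not code from the book:

library(ggplot2)
library(dplyr)

# Facet titles spelled out, legend values renamed, axis labels in plain English.
# The column names and the pitch-type mapping follow the suggestions in the text.
pitch_names <- c(CU = "Changeup", FF = "Fastball", SL = "Slider")

pitches %>%
  mutate(Result = ifelse(Miss, "Miss", "Hit")) %>%   # assumes Miss is a logical flag
  ggplot(aes(px, pz, color = Result)) +
  geom_point() +
  facet_wrap(~ pitch_type, labeller = labeller(pitch_type = pitch_names)) +
  labs(x = "Horizontal location", y = "Vertical location", color = NULL)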

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.
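Two quick remedies, sketched with the same assumed data frame: shrink and fade the dots, or replace the dots with binned counts.

library(ggplot2)

# 1. Smaller, semi-transparent dots: dense regions read as darker areas
ggplot(pitches, aes(px, pz, color = Miss)) +
  geom_point(size = 0.7, alpha = 0.3)

# 2. Aggregation: bin the locations and map counts to fill
ggplot(pitches, aes(px, pz)) +
  geom_bin2d(bins = 30) +
  facet_wrap(~ Miss)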

***

Visualizing Baseball is not the book for readers who learn by running code, as no code is included in the book. A github page by the author hosts the code, but only the R/ggplot2 code for generating the data visualizations. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don't recommend learning to code by copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who's beginning to look at baseball data. It may expand your appreciation of what can be done. For details and practical implications, look elsewhere.


Webinar Wednesday

Lyon_onlinestreaming


I'm delivering a quick-fire Webinar this Wednesday on how to make impactful data graphics for communication and persuasion. Registration is free, at this link.

***

In the meantime, I'm preparing a guest lecture for the Data Visualization class at Yeshiva University Sims School of Management. The goal of the lecture is to emphasize the importance of incorporating analytics into the data visualization process.

Here is the lesson plan:

  1. Introduce the Trifecta checkup (link), which is the general framework for effective data visualizations
  2. Provide examples of Type D data visualizations, i.e. graphics that have good production values but fail due to issues with the data or the analysis
  3. Hands-on demo of an end-to-end data visualization process
  4. Lessons from the demo, including the iterative nature of analytics and visualization, and the value of sketching
  5. Overview of basic statistics concepts useful to visual designers

 


Choosing the right metric reveals the story behind the subway mess in NYC

I forgot who sent this chart to me - it may have been a Twitter follower. The person complained that the following chart exaggerated how much trouble the New York mass transit system (MTA) has been facing in 2017, because of the choice of the vertical axis limits.

Streetsblog_mtatraffic

This chart is vintage Excel, using Excel defaults. I find this style ugly and uninviting. But the chart does contain some good analysis. The analyst made two smart moves: the chart controls for month-to-month seasonality by plotting the data for the same month over successive years; and the designation "12 month averages" really means moving averages with a window size of 12 months - this has the effect of smoothing out the short-term fluctuations to reveal the longer-term trend.
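For readers who want to reproduce the smoothing, here is a minimal sketch in R, assuming a numeric vector "monthly_riders" of monthly counts in chronological order (not the analyst's actual code):

# Trailing 12-month moving average of monthly ridership
ma_12 <- stats::filter(monthly_riders, rep(1 / 12, 12), sides = 1)

# Plotting the same calendar month across years controls for seasonality;
# the moving average irons out the remaining short-term noise.
plot(ma_12, type = "l", xlab = "Month", ylab = "12-month average ridership")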

The red line is very alarming as it depicts a sustained negative trend over the entire year of 2017, even though the actual decline is a small percentage.

If this chart showed up on a business dashboard, the CEO would be extremely unhappy. Slow but steady declines are the most difficult trends to deal with because they cannot be explained by one-time impacts. Until the analytics department figures out the underlying cause, the decline is very difficult to curtail, and with each monthly report, the sense of despair grows.

Because the base number of passengers in the New York transit system is so high, using percentages to think about the shift in volume underplays the message. It's better to use actual millions of passengers lost. That's what I did in my version of this chart:

Redo_jc_mtarevdecline

The quantity depicted is the unexpected loss of revenue passengers, measured against a forecast. The forecast I used is the average of the past two years' passenger counts. Above the zero line means outperforming the forecast; in this case, performance has dipped ever farther below the forecast since October 2016. By April 2017, the gap had widened to over 5 million passengers. That's a lot of lost customers and lost revenue, regardless of the percentage!
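Here is a sketch of that forecast logic in R. The data frame "mta" and its columns (year, month, riders) are assumed names, not the actual data set:

library(dplyr)

# For each calendar month, the forecast is the average of the same month in the
# two prior years; the gap is expressed in millions of passengers.
mta %>%
  arrange(month, year) %>%
  group_by(month) %>%
  mutate(
    forecast     = (lag(riders, 1) + lag(riders, 2)) / 2,  # average of past two years, same month
    gap_millions = (riders - forecast) / 1e6               # above 0 = beating the forecast
  ) %>%
  ungroup()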

The biggest headache is figuring out the cause of this decline. Most likely, it is a combination of factors.


Excel is the graveyard of charts, no!

It's true that Excel is responsible for a large number of horrible charts. I recently came across a typical example:

Ewolff_meanmedianincome

This figure comes from Edward Wolff's 2012 paper, "The Asset Price Meltdown and the Wealth of the Middle Class." It's got all the hallmarks of Excel defaults. It's not a pleasing object to look at.

However, it's also true that Excel can be used to make nice charts. Here is a remake:

Redo_meanmedianincome2

This chart is made almost entirely in Excel - the only edit I made outside Excel was to decompose the legend box.

It takes five minutes to make the first chart; it takes probably 30 minutes to make the second chart. That is the difference between good and bad graphics. Excel users: let that be your inspiration!


It's your fault when you use defaults

The following chart showed up on my Twitter feed last week. It's a cautionary tale for using software defaults.

Booksaleschart_sourceBISG_fromtwitter

At first glance, the stacking of years in a bar chart makes little sense. This is particularly so when there does not appear to be any interesting annual trend: the four segments seem to have roughly equal length almost everywhere.

This designer might be suffering from what I have called "loss aversion" (link). Loss aversion in data visualization is the fear of losing your data, which causes people to cling to every little bit of data they have.

Several of the chart's problems come from software defaults. The bars are ordered alphabetically, making it difficult to discern a trend. The horizontal axis labels are given in single dollars and units, yet the designer's intention is to use millions, as indicated in the chart titles.

The one horrifying feature of this chart is the 3D effect. The third dimension contains no information at all. In fact, it destroys information, as readers who use the vertical gridlines to estimate the lengths of the bars will be sadly misled. As shown below, readers must draw imaginary lines to figure out the horizontal values.

Twitter_booksalescategories_0

The Question of this chart is the distribution of book sales (revenues and units) across different genres. When the designer chose to stack the bars (i.e. sum the yearly data), he or she decided that the details of specific years are not as important as the total - this is the right call since the bar segments have similar lengths within each genre.

So let's pursue the revolution of averaging the data, plotting average yearly sales.

Redo_twitter_bookssalescategories

This chart shows that there are two major types of genres. In the education world, the unit prices of (text)books are very high; unit sales are relatively small, but the dollar revenues are high in aggregate. In the "adult" world, whether fiction or non-fiction, unit prices are low while the number of units is high, resulting in total dollar revenues similar to those of the education genres.
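For those who want to try the redo themselves, here is a sketch in R of the averaging and reordering. The names ("book_sales", with columns genre, year, revenue) are assumed, not the original data:

library(dplyr)
library(ggplot2)

# Average yearly revenue per genre, with bars reordered by size
# rather than alphabetically.
book_sales %>%
  group_by(genre) %>%
  summarise(avg_revenue_m = mean(revenue) / 1e6) %>%
  ggplot(aes(x = avg_revenue_m, y = reorder(genre, avg_revenue_m))) +
  geom_col() +
  labs(x = "Average yearly revenue ($ million)", y = NULL)

The same recipe applies to the unit counts; plotting the two panels side by side makes the education-versus-"adult" contrast immediate.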

***

Simple lesson here: learn to hate software defaults.


How to print cash, graphically

Twitter user @glennrice called out a "journalist" for producing the following chart:

Columbiaheartbeat_cashbalance

You can't say the Columbia Heartbeat site doesn't deserve a beating over this graph. I don't recognize the software, but my guess is it's one of those business intelligence (BI) tools that produce canned reports at the click of a button.

Until I read the article, I kept thinking that there were several overlapping lines being plotted. But it's really a 3D-plus-color effect!

Wait, there's more. This software treats years as categories rather than as a continuous quantity, so it made equal-sized intervals out of gaps of 2 years, 1 year, 2 years, and 8 years. I am still not sure how this happened, because the data set given at the bottom of the article contains annual data.
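A toy illustration of the categorical-year problem, in R. The balance figures are made up; only the gaps between years (2, 1, 2, and 8 years) mirror the chart above:

library(ggplot2)

cash <- data.frame(
  year    = c(2002, 2004, 2005, 2007, 2015),
  balance = c(20, 35, 50, 70, 120)   # invented values
)

# Years as categories: equal spacing, the 8-year gap is hidden
ggplot(cash, aes(factor(year), balance, group = 1)) + geom_line()

# Years as numbers: the gap is preserved and the final segment flattens out
ggplot(cash, aes(year, balance)) + geom_line()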

The y-axis labels, the gridlines, the acronym in the chart title, the unnecessary invocation of start-at-zero, etc. almost make this feel like a parody.

***

Aside from the visual design issues, I don't like the analysis either. The claim is that taxes have been increasing every year in Columbia, Missouri, and that the additional revenue ended up sitting in banks as cash.

We need to see a number of other data series in order to accept this conclusion. What was the growth in tax revenues relative to the increase in cash? What was the growth in population in Columbia during this period? Did the cash holding per capita increase or decrease? What were the changes in expenditure on schools, public works, etc.?

This is a Type DV chart. There is an interesting question being asked but the analysis must be sharpened and the graphing software must be upgraded asap.

PS. On second thought, I think the time axis might be deliberately distorted. Judging from the slope of the line, the cumulative increase over the last 8 years is about the same as the increase in each of the earlier two-year increments, so if a proper time scale were used, the line would flatten out significantly, demolishing the thesis of the article. Thus, it is a case of printing cash, graphically.


Statistics report raises mixed emotions

It's gratifying to live through the incredible rise of statistics as a discipline. In a recent report by the American Statistical Association (ASA), we learned that enrollment at all levels (bachelor, master and doctorate) has exploded in the last 5-10 years, as "Big Data" gather momentum.

But my sense of pride takes a hit while looking at the charts that appear in the report. These graphs demonstrate again the hegemony of Excel defaults in the world of data visualization.

Here are all five charts organized in a panel:

Asa_enrollment_panel

Chart #5 (bottom right) catches the eye because it is the only chart with two lines instead of three. You then flip to the prior page to find the legend. The legend tells you the red line is Bachelor and the green line is PhD. That seems wrong, unless biostats departments do not give out master's degrees.

This is confirmed by chart #2, where we find the blue line (Master) hugging zero.

Presumably the designer removed the blue line from chart #5 because the low counts mean that it fluctuates wildly between 0 and 100 percent and so disrupts the visual design. But the designer forgot to tell readers why the blue line is missing.

***

It turns out the article itself contradicts all of the above:

For biostatistics degrees, for which NCES started providing data specifically in 1992, master’s degrees track the overall increase from 2010–2014 at 47% ... The number of undergraduate degrees in biostatistics remains below 30.

Asa_enrollment_legend

In other words, the legend is mislabeled. The blue line represents Bachelor while the red line represents Master. (The error must have been noticed after the print edition went out, because the online version has the correct legend.)

***

There is another mystery. Charts #2, #3, and #5, all dealing with biostats, have time starting from 1992, while Charts #1 and #4 start from 1987. The charts aren't lined up in a way that would allow comparisons across time.

Similarly, the vertical scale of each chart is different (aside from Charts #3 and #4). This design choice impairs comparison across charts.

In the article, it is explained that 1992 was when the agency started collecting data about biostatistics degrees. Between 1987 and 1992, were there no biostatistics majors? Were biostatistics majors lumped into the counts of statistics majors? It's hard to tell.

***

While Excel is a powerful tool that has served our community well, its flexibility is often a source of errors. The remedy to this problem is to invest ample time in overriding pretty much every default decision in the system.

For example:

Redo_asa_enrollment

This chart, a reproduction of Chart #1 above, was entirely produced in Excel.


Me and Alberto Cairo in one room tomorrow

JMP_Logo

I have been a fan of Alberto Cairo for a while, and am slowly working my way through his great book, The Functional Art, which I will review soon.

Thanks to the folks at JMP, the two of us will be appearing together in the Analytically Speaking webcast, on Friday, 1-2 pm EST. Sign up here. We are both opinionated people, so the discussion will be lively. Come and ask us questions.

 


Learn EDA (exploratory data analysis) from the experts

The Facebook data science team has put together a great course on EDA at Udacity.

EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data. 

Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.

The course uses R and, in particular, Hadley Wickham's ggplot2 package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot2, which has quite a bit of its own syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading the supplementary materials and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
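To give a flavor of the graphical toolkit, here is a tiny sketch using ggplot2's built-in diamonds data. The course's own datasets and code may differ; GGally is just one option for the scatterplot matrix:

library(ggplot2)
library(GGally)   # provides ggpairs() for the scatterplot matrix

ggplot(diamonds, aes(price)) + geom_histogram(bins = 50)        # histogram of one variable
ggplot(diamonds, aes(cut, price)) + geom_boxplot()              # boxplots by group
ggplot(diamonds, aes(carat, price)) + geom_point(alpha = 0.1)   # scatter plot of two variables
ggpairs(diamonds[, c("carat", "depth", "price")])               # scatterplot matrix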

This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, not to a bunch of tuition-paying students in some remote classroom.

Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.


Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.

Windmap

Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?

***

The discussion was very successful; the most interesting points were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.

 

PS. Click here for class syllabus. Click here for first update.