Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

The best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and a grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one has a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. Fewer pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has already been played?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.

***

In the final chapters, Albert introduces the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation will have a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to make the prediction that the Red Sox have, say, a 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not the stated goal of Albert to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.
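To make the logic of the fake world concrete, here is a minimal sketch in Python. It is not Albert’s method: it shrinks each “season” down to a single best-of-seven final series, and the per-game win probability (0.55) is a number I made up for illustration. The forecast is simply the share of simulated series that the team wins.

```python
import random

def wins_series(p_game, rng, wins_needed=4):
    """Play one simulated best-of-seven series; True if our team reaches wins_needed first."""
    ours = theirs = 0
    while ours < wins_needed and theirs < wins_needed:
        if rng.random() < p_game:
            ours += 1
        else:
            theirs += 1
    return ours == wins_needed

def title_chance(p_game, n_seasons=1000, seed=2018):
    """Share of simulated series won, which is the probabilistic forecast."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    titles = sum(wins_series(p_game, rng) for _ in range(n_seasons))
    return titles / n_seasons

# 0.55 is an assumed per-game win probability, not an estimate from real data.
chance = title_chance(0.55)
print(f"Share of fake seasons won: {chance:.1%}")
```

With an assumed per-game edge of 0.55, the team wins roughly 60 percent of the fake series, which also means it loses the title in a large number of them. That is all a “60 percent chance” says.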

***

While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is to first describe the analytical conclusion, then introduce a chart to illustrate that conclusion. The inverse would be: Start with the chart, and use the chart to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):

[Figure: pitch-location chart from Visualizing Baseball, Chapter 5, Figure 5.8]

By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL. Or that the axis labels were “horizontal location” and “vertical location” (check) instead of px and pz. [Note: The chart above was taken from the book's GitHub site; in Figure 5.8 of the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit”.

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.
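As one possible form of aggregation, here is a rough Python sketch (the book’s own code is R/ggplot2, so this is not the author’s code). It counts pitches per grid cell instead of drawing every dot, and relabels the True/False values as “Miss” and “Hit” along the way. The function name, the bin width, and the sample points are all made up for illustration.

```python
import math
from collections import Counter

# Readable legend labels for the boolean "Miss" column, as suggested above.
LEGEND_LABELS = {True: "Miss", False: "Hit"}

def bin_pitches(pitches, bin_width=0.5):
    """Count pitches per square grid cell and outcome.

    Each pitch is (px, pz, miss); a cell is keyed by its lower-left corner.
    """
    counts = Counter()
    for px, pz, miss in pitches:
        cell_x = math.floor(px / bin_width) * bin_width
        cell_z = math.floor(pz / bin_width) * bin_width
        counts[(cell_x, cell_z, LEGEND_LABELS[miss])] += 1
    return counts

# Made-up pitch locations, for illustration only.
sample = [
    (0.10, 2.40, True),
    (0.20, 2.30, True),
    (0.30, 2.10, False),
    (-0.60, 1.20, False),
]
binned = bin_pitches(sample)
for (cell_x, cell_z, outcome), n in sorted(binned.items()):
    print(cell_x, cell_z, outcome, n)
```

The counts per cell can then be mapped to color or size, so the reader no longer has to guess how many dots are stacked on top of one another.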

***

Visualizing Baseball is not the book for readers who learn by running code, as no code is included in the book. A GitHub page by the author hosts the code, but only the R/ggplot2 code for generating the data visualizations. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the GitHub page is worth a visit. In any case, I don’t recommend learning coding from copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.


Book Preview: How Charts Lie, by Alberto Cairo

If you’re like me, your first exposure to data visualization was as a consumer. You may have run across a pie chart, or a bar chart, perhaps in a newspaper or a textbook. Thanks to the power of the visual language, you got the message quickly, and moved on. Few of us learned how to create charts from first principles. No one taught us about axes, tick marks, gridlines, or color coding in science or math class. There is a famous book in our field called The Grammar of Graphics, by Leland Wilkinson, but it’s not a For Dummies book. This void is now filled by Alberto Cairo’s soon-to-appear new book, titled How Charts Lie: Getting Smarter about Visual Information.

As a long-time fan of Cairo’s work, I was given a preview of the book, and I thoroughly enjoyed it and recommend it as an entry point to our vibrant discipline.

In the first few chapters of the book, Cairo describes how to read a chart. Some may feel that there is not much to it, but if you’re here at Junk Charts, you probably agree with Cairo’s goal. Indeed, it is easy to misread a chart. It’s also easy to miss the subtle and brilliant design decisions when one doesn’t pay close attention. These early chapters cover all the fundamentals needed to become a wiser consumer of data graphics.

***

How Charts Lie will open your eyes to how everyone uses visuals to push agendas. The book is an offshoot of a lecture tour Cairo took during the last year or so, which has drawn large crowds. He collected plenty of examples of politicians and others playing fast and loose with their visual designs. After reading this book, you can’t look at charts with a straight face!

***

In the second half of his book, Cairo moves beyond purely visual matters into analytical substance. In particular, I like the example on movie box office from Chapter 4, titled “How Charts Lie by Displaying Insufficient Data”. Visual analytics of box office receipts seems to be a perennial favorite of job-seekers in data-related fields.

The movie data is a great demonstration of why one needs to statistically adjust data. Cairo explains why Marvel’s Black Panther is not the third highest-grossing film of all time in the U.S., as reported in the media. That is because gross receipts should be inflation-adjusted. A ticket worth $15 today cost $5 some time ago.
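The arithmetic behind the adjustment is simple. Here is a small Python sketch that uses the ratio of ticket prices as the adjustment factor, which is only one of several ways to adjust for inflation and is not necessarily the method Cairo uses. The $200 million gross is a hypothetical number; the $5 and $15 ticket prices come from the example above.

```python
def adjust_for_inflation(nominal_gross, ticket_price_then, ticket_price_now):
    """Restate a past gross in today's dollars using the ratio of ticket prices."""
    if ticket_price_then <= 0:
        raise ValueError("ticket_price_then must be positive")
    return nominal_gross * ticket_price_now / ticket_price_then

# Hypothetical older film: $200 million gross (in millions) when a ticket cost $5.
adjusted = adjust_for_inflation(200.0, ticket_price_then=5.0, ticket_price_now=15.0)
print(adjusted)  # 600.0, i.e. $600 million in today's dollars
```

A film that looks modest in nominal dollars can leapfrog a recent blockbuster once both are expressed in the same dollars.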

This discussion features a nice-looking graphic, a staircase chart showing how long each #1 movie stayed in the top position before being replaced by the next higher-grossing film.

[Figure: staircase chart of top-grossing movies, from How Charts Lie]

Cairo’s discussion goes further, exploring the number of theaters as a “lurking” variable. For example, Jaws opened in about 400 theaters while Star Wars: The Force Awakens debuted in 10 times as many. A chart showing per-screen inflation-adjusted gross receipts looks much different from the original chart shown above.
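A quick Python sketch shows how much the theater count can matter. The theater counts follow the rough figures above (about 400 versus 10 times as many); the opening gross is a made-up number, deliberately set equal for both films so that only the theater count differs.

```python
def gross_per_theater(adjusted_gross, theaters):
    """Inflation-adjusted gross divided by the number of theaters."""
    if theaters <= 0:
        raise ValueError("theaters must be positive")
    return adjusted_gross / theaters

# Hypothetical: both films take the same inflation-adjusted opening gross.
same_gross = 100_000_000
narrow_release = gross_per_theater(same_gross, 400)   # Jaws-like release
wide_release = gross_per_theater(same_gross, 4000)    # 10 times as many theaters
print(narrow_release, wide_release, narrow_release / wide_release)
```

With identical receipts, the narrow release earns ten times as much per theater, which is why the ranking can change once this variable is taken into account.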

***

Another highlight is Cairo’s analysis of the “cone of uncertainty” chart frequently referenced in anticipation of impending hurricanes in Florida.

[Figure: hurricane “cone of uncertainty” map, from How Charts Lie]

Cairo and his colleagues have found that “nearly everybody who sees this map reads it wrongly.” The casual reader interprets the “cone” as a sphere of influence, showing which parts of the country will suffer damage from the impending hurricane. In other words, every part of the shaded cone will be impacted to a larger or smaller extent.

That isn’t the designer’s intention! The cone embodies uncertainty, showing which parts of the country have what chance of being hit by the impending hurricane. In the aftermath, the hurricane would have traced one specific path, and that path would have run through the cone if the predictive models were accurate. Most of the shaded cone would have escaped damage.

Even experienced data analysts are likely to misread this chart: as Cairo explains, the cone has a “confidence level” of 68%, not the more conventional 95%. Areas outside the cone still have a chance of being hit.
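To see what the gap between 68% and 95% means, here is a small Python sketch. It assumes, purely as an analogy, that forecast errors follow a normal distribution; how the forecasters actually construct the cone is not described here, so treat the numbers as an illustration of the two confidence levels and nothing more.

```python
from statistics import NormalDist

def coverage(z):
    """Probability that a normal variable falls within z standard deviations of its mean."""
    std_normal = NormalDist()
    return std_normal.cdf(z) - std_normal.cdf(-z)

one_sd = coverage(1.0)       # the familiar "68%" band
conventional = coverage(1.96)  # the familiar "95%" band
outside_68 = 1 - one_sd

print(f"within 1 sd:    {one_sd:.3f}")
print(f"within 1.96 sd: {conventional:.3f}")
print(f"outside 1 sd:   {outside_68:.3f}")
```

Under this analogy, about one outcome in three lands outside a 68% band, compared with one in twenty for a 95% band. An analyst who assumes the conventional level will badly underestimate the risk outside the cone.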

This map clinches the case for why you need to learn how to read charts. And Alberto Cairo, who is a master visual designer himself, is a sure-handed guide for the start of this rewarding journey.

***

Here is Alberto introducing his book.


Report from Data Visualization Meetup

On Monday, Principal Analytics Prep sponsored the Data Visualization Meetup, organized by the indefatigable Naomi Robbins. The keynote speaker was NYU professor Kristen Sosulski, who just published a book titled “Data Visualization Made Simple” (link).

At the Meetup, we announced a Part-Time Immersive Program. This allows the completion of the Certified Data Specialist program in three levels on a more relaxed, evening schedule. Level 1 will run two nights a week for 12 weeks, starting Spring 2019. For more details, contact us here.

***

Kristen, a professor in the Stern School, has an interesting take on the data visualization function – placing it within the larger enterprise. In the first part of her talk, she presents a number of real-world case examples of how data analysts used data visualization to create impact within an organization.

The end goal in each of these projects is a “business insight” that is delivered to decision-makers with the primary goal of persuasion – something I also emphasize in my own seminars. It’s not that data visualization isn’t used for analysis, exploration, story-telling (see postscript), and so on, but at the tail end of the process, the need to persuade becomes paramount.

***

For example, the graphic on the cover of her book is from a project undertaken by Jet.com, the online retailer purchased by Walmart. The managers are interested in the purchasing patterns of their customers, and generally view products as “consumables” or “durables,” the latter having lower purchasing frequencies. The nodes in the network graph are colored accordingly. Through the links between these nodes, the analyst concluded that certain products (an example given was batteries) are considered durables but have purchasing patterns that appear more like consumables.

Kristen’s message is about how the data turned into a business insight (the “story”), which impressed the managers enough that they took action by adjusting orders and inventories.

Kristen described other examples such as the use of salary data to place employees into bands, or the use of predictive models to predict which partners in a venture-capital firm will bring in more investment. Many of these examples make me believe that a course in causal reasoning should be required for all data analysts.

***

The second half of Kristen’s talk addresses how to raise the profile of data visualization within an enterprise. This is a clearly needed discussion. More and more industry jobs are created that are specific to data visualization so these new teams must establish themselves within the corporate culture. Kristen recommends a five-step process, starting with establishing a data practice and ending with measuring one’s impact.

In answering my question about evangelizing new visualization formats to replace inferior existing chart designs, she emphasizes the need to involve stakeholders early in the process. Don't surprise them with something novel during a meeting.

We were pleased that people braved the adverse weather to attend Kristen’s talk, and good pizza was served at the end of the evening.

 

P.S.

The word “story-telling” seems to have gone from hero to villain lately. Some commenters are thinking the word “story” implies made-up fiction, and thus oppose its use. A related complaint concerns the “subjectivity” of stories. Once you realize that most of our data sources are observational in nature, you will soon discover that causal reasoning entails the selection of the most plausible story among many. Statisticians and others have come up with causal models, which are sets of equations used to describe relationships between data, but all of these rely on causal assumptions. In essence, they are structured ways to select the most plausible story. It’s dangerous to see these models as “objective.”


Book review: The Truthful Art by Alberto Cairo, and the Enduring Problem of Statistical Illiteracy

I have been looking forward to reading Alberto Cairo’s new book since he started teasing about it last year. I enjoyed his first book, The Functional Art, mostly because we share the desire to bring the design and the analytical schools of data visualization closer together. His new book, The Truthful Art, represents another step in this ambitious project, and I found much to like about it.

The Truthful Art is really two books for the price of one. There is one book about analytical thinking, and interspersed between these chapters, there is a short book about graphical design. The chapters on analytical thinking take readers--presumably many will be journalists--through the standard diet of statistics, from summary statistics (ch. 6) to distributions (ch. 7) to correlations (ch. 9) to sampling theory (ch. 11) to time-series data (ch. 8). Cairo also devotes some delightful pages to cognitive biases (ch. 3) and research design (ch. 4).

Readers meet these analytical chapters as if they are enjoying a chef’s tasting menu at a top-line restaurant. Small delicious bites of knowledge are served quickly, with the expectation that readers will pursue the advanced reading list, curated by the author and printed at the end of each chapter. Cairo’s love of reading bursts through the pages.

At various points, Cairo delves into equations. In a wise move to balance the book, and to keep readers awake (we all know how boring statistics is), he weaves in several chapters on basic graphical design. There are materials on chart forms (ch. 5), on maps (ch. 10), and on visual design (ch. 2, a nice summary of his previous book). My favorite is chapter 12, which is a kaleidoscope of data visualization projects, at once celebrating the vitality of the field, and revealing its unruly, sometimes clashing, strands.

A good book is one that leaves the reader with lingering thoughts that transcend its pages. The Truthful Art succeeds in this respect. The purview of this book intersects directly with several lines of my own work: on the ethics of data analytics, and the teaching of statistical reasoning.

In chapter 11, Cairo tells a story familiar to anyone who is paying attention to the current U.S. presidential elections. In some past elections, El Pais, a Spanish newspaper, published a headline proclaiming that "Catalan public opinion swings to 'no' for independence," when the margin of difference was well within the margin of error of the survey. Cairo complains about the misleading headline and points out the need to look at the margin of error. It’s refreshing that a journalist points this out. I used to see this as a matter of statistical illiteracy, but now I see this as an ethical issue.

Let’s say a pollster runs a new poll every hour. Because the race is a dead heat, the hourly results would flip-flop, as if one were observing a sequence of coin flips. A journalist would report the sequence as A, A, B, A, B, B, A, … while a statistician would write down tie, tie, tie, tie, …. Notice how boring this last sequence is, despite it being more truthful. I don't believe that journalists report the horse-race because of ignorance of margins of error--they just choose to ignore them.
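The thought experiment is easy to run. Here is a Python sketch under assumptions of my own choosing: a two-candidate race with no undecided voters, a true split of exactly 50/50, 1,000 respondents per poll, and the usual conservative 95% margin of error. The journalist names whoever is ahead in the sample; the statistician names a leader only when the lead exceeds the margin of error.

```python
import math
import random

def run_poll(rng, n=1000, true_share_a=0.5):
    """Poll n voters in a two-candidate race; return (journalist_call, statistician_call)."""
    votes_a = sum(rng.random() < true_share_a for _ in range(n))
    share_a = votes_a / n
    # Conservative 95% margin of error for one candidate's share.
    margin = 1.96 * math.sqrt(0.25 / n)
    if share_a == 0.5:
        leader = "tie"
    else:
        leader = "A" if share_a > 0.5 else "B"
    # The statistician names a leader only if the share clears 50% by more than the margin.
    statistician = leader if abs(share_a - 0.5) > margin else "tie"
    return leader, statistician

rng = random.Random(11)  # arbitrary seed so the run is repeatable
calls = [run_poll(rng) for _ in range(24)]  # one poll per hour for a day
journalist = [j for j, _ in calls]
statistician = [s for _, s in calls]

print("journalist:  ", " ".join(journalist))
print("statistician:", " ".join(statistician))
```

The first line flips between A and B from hour to hour; the second is almost entirely ties. Both are computed from exactly the same polls.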

Readers of my blogs and books know what I am about to say about the teaching of statistics. Cairo follows a conventional approach, even including some equations, in an otherwise very readable account. The convention is the “how-to” approach that assumes learning comes from knowing formulas. It’s also an approach that has earned statistics departments at all universities a reputation of being uninspiring and obtuse.

Take hypothesis testing and p-values for example. Cairo’s account is more readable than most textbooks but at heart, it is a step-by-step manual for how to do hypothesis testing. To me, this method of instruction solves the wrong problem. The real issue is whether journalists are equipped to separate the wheat from the chaff when they read peer-reviewed journal articles, all of which use conventional hypothesis testing and attain p < 0.05. There are many things in life we learn to use without knowing any formulas. We learn to use a smartphone app without knowing how to code an app. We learn to ride a bike without having to learn mechanical engineering formulas.

These comments are not a specific criticism of Cairo’s project. I leave them here to encourage some creative thinking around the problem of statistical illiteracy that seems to never go away. I’m suggesting two shifts: from a set of formulas to a system of thinking; from imparting knowledge to promoting ethics.

***

Cairo’s book is an important contribution to bringing together the design and analytical perspectives on data visualization. He is an entertaining and lucid writer and thinker. Since he does not have mathematical training, he is able to explain the analytical materials in a way that would make sense to readers with non-technical backgrounds. So I highly recommend that you get a copy, get hooked, and do the advanced reading.


A letter to high-school students

Imagine Magazine, a youth-focused journal by Johns Hopkins's Center for Talented Youth, invited me to contribute an article in celebration of statistics. I tried to convey the fun and joy of working with numbers and charts.

You can read it here, starting on p. 22. While you're at it, the rest of the magazine is really great too, and I hope we're able to influence at least a few youngsters to take up math-stats as a career.

Please forward to kids and teachers!


Mark your calendar

I'll be speaking at the NYU Bookstore on Oct 8 (next Tuesday), 6-7:30 pm. See here.

On Oct 9 (Wed), I'll be speaking at the Princeton Tech Meetup. The meeting starts at 7; my talk starts at 8. Details here.


Lunch and talk Wednesday

I will be the luncheon speaker at INFORMS NYC on Wednesday. The talk will provide some context for my new book Numbersense (link), and discuss a few examples from the book. You can pre-register here.

INFORMS is the professional society for Operations Research and Management Science people. For some years, I have attended these luncheons regularly and learned a lot from other industry speakers.

If you decide at the last minute, you can pay the $5 extra fee on the day of the talk. Or register now.

***

Junk Charts is featured in an article in Harvard Business Review about data visualization. A few new reviews have appeared: CFA Institute, Flagstaff Business News.

***

I maintain a list of events on my book blog. Look to the right column.


Chance to ask me a question this Friday

I will be at Book Expo this Friday signing books at the McGraw-Hill booth. If you're in NYC, drop by and say hi between 11 and 12.

Yes, it's a new book!  The title is Numbersense: How to Use Big Data to Your Advantage (link). If you read my blogs, you already know where I'm going with this. How can we be smart consumers of data analyses in a world overflowing with data? It will be in stores in July. Between now and then, you can come back here to learn more.

Also, at 12:30, I'll be interviewed at the Shindig event by Peggy Sanservieri, who blogs at Huffington Post on book marketing. This is an online live chat event. Go to their site to register, and you'll have the opportunity to ask me questions.

(This is cross-posted on both blogs.)