
Book review: Getting (more out of) Graphics by Antony Unwin

Unwin_gettingmoreoutofgraphics_cover

Antony Unwin, a statistics professor at Augsburg, has published a new dataviz textbook called "Getting (more out of) Graphics", and he kindly sent me a review copy. (Amazon link)

I am - not surprisingly - in the prime audience for such a book. It covers some gaps in the market:
a) it emphasizes exploratory graphics rather than presentation graphics
b) it deals not just with designing graphics but also interpreting (i.e. reading) them
c) it covers data pre-processing and data visualization in a more balanced way
d) it develops full case studies involving multiple graphics from the same data sources

The book is divided into two parts: the first, which takes up about 75% of the material, details case studies, while the final quarter of the book offers "advice". The book has a GitHub page containing R code which, as I shall explain below, is indispensable to the serious reader.

Given the aforementioned design, the case studies in Unwin's book have a certain flavor: most of the data sets are relatively complex, with many variables, including a time component. The primary goal of Unwin's exploratory graphics can be stated as stimulating "entertaining discussions" about and "involvement" with the data. They are open-ended, and frequently inconclusive. This is a major departure from other data visualization textbooks on the market, and also from many of my own blog posts, where we focus on selecting a good graphic for presenting insights visually to an intended audience, without assuming domain expertise.

I particularly enjoyed the following sections: a discussion of building graphs via "layering" (starting on p. 326), an enumeration of iterative improvements to graphics (starting on p. 402), and several examples of data wrangling (e.g. p. 52).

Unwin_fig4.7

Unwin does not give "advice" in the typical style of do this, don't do that. His advice is fashioned in the style of an analyst. He frames and describes the issues, shows rather than tells. This paragraph from the section about grouping data is representative:

Sorting into groups gets complicated when there are several grouping variables. Variables may be nested in a hierarchy... or they may have no such structure... Groupings need to be found that reflect the aims of the study. (p. 371)

He writes down what he has done, may provide a reason for his choices, but is always understated. He sees no point in selling his reasoning.

The structure of the last part of the book, the "advice" chapters, is quite unusual. The chapter headers are: (data) provenance and quality; wrangling; colour; setting the scene (scaling, layout, etc.); ordering, sorting and arranging; what affects interpretation; and varieties of plots.

What you won't find are extended descriptions of chart forms, rules of visualization, or flowcharts tying data types to chart forms. Those are easily found online if you want them (you probably won't care if you're reading Unwin's book).

***

For the serious reader, the book should be consumed together with the code on GitHub. Find specific graphs from the case studies that interest you, open the code in your R editor, and follow how Unwin did it. The "advice" chapters highlight points of interest from the case studies presented earlier, so you may start there, cross-reference the case studies, then jump to the code.

Unfortunately, the code is sparsely commented. So also open up your favorite chatbot, which helps to explain the code, and annotate it yourself. Unwin uses R, and in particular, lives in the "tidyverse".

To understand the data manipulation bits, reviewing the code is essential. It's hard to grasp what is being done to the data without actually seeing the datasets. There are no visuals of the datasets in the book, as the text is primarily focused on the workflow leading to a graphic. The data processing can get quite involved, as in Chapter 16.

I'm glad Unwin has taken the time to write this book and publish the code. It rewards the serious reader with skills that are not commonly covered in other textbooks. For example, I was rather amazed to find this sentence (p. 366):

To ensure that a return to a particular ordering is always possible, it is essential to have a variable with a unique value for every case, possibly an ID variable constructed for just this reason. Being able to return to the initial order of a dataset is useful if something goes wrong (and something will).

Anyone who has analyzed real-world datasets would immediately recognize this as good advice, but who'd have thought to put it down in a book?
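The ID-variable advice can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R that Unwin uses; the toy dataset, the field name `row_id`, and the helpers `add_row_id` and `restore_order` are all made up for this example and do not come from the book or its code.

```python
import random

# Hypothetical toy dataset; the names and values are invented for illustration.
rows = [
    {"name": "delta", "value": 4.2},
    {"name": "alpha", "value": 1.7},
    {"name": "charlie", "value": 3.1},
    {"name": "bravo", "value": 1.7},
]


def add_row_id(rows, id_field="row_id"):
    """Return copies of the rows, each with a unique ID recording its initial position."""
    return [{**row, id_field: i} for i, row in enumerate(rows, start=1)]


def restore_order(rows, id_field="row_id"):
    """Return the rows sorted back into their initial order using the ID field."""
    return sorted(rows, key=lambda row: row[id_field])


# Attach the ID before any reordering happens.
with_ids = add_row_id(rows)

# Reorder freely: sort by a variable, or scramble the rows entirely.
by_value = sorted(with_ids, key=lambda row: row["value"])
shuffled = random.sample(with_ids, k=len(with_ids))

# The ID variable makes the initial order recoverable from either arrangement.
restored = restore_order(shuffled)
print([row["name"] for row in restored])
```

The point of the design is that the ID is created once, up front, and never modified, so any later sort, filter, or merge can be undone by sorting on it again. Without it, ties in the sort variable (the two rows with value 1.7 here) would leave no reliable way to recover the original arrangement.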


Visualizing extremes

The New York Times published this chart to illustrate the extreme ocean heat driving the formation of hurricane Milton (link):

Nyt_oceanheatmilton

The chart expertly shows layers of detail.

The red line tracks the current year's data on ocean heat content up to yesterday.

Meaning is added through the gray line that shows the average trajectory of the past decade. With the addition of this average line, we can readily see how different the current year's data is from the past. In particular, we see that the current season can be roughly divided into three parts: from May to mid June, and also from August to October, the ocean heat this year was quite a bit higher than the 10-year average, whereas from mid June to August, it was just above average.

Even more meaning is added when all the individual trajectories from the last decade are shown (in light gray lines). With this addition, we can readily learn how extreme this year's data is. For the post-August period, it's clear that the Gulf of Mexico is the hottest it's been in the past decade. Also, this extreme is not too far from the previous extreme. On the other hand, the extreme in late May to early June is rather more scary.