Criminal graphics graphical crime

One of my Twitter followers disliked the following chart showing FBI crime statistics for 2023 (link):

Cremieuxrecueil_homicide_age23_twitter

If read quickly, the clear message of the chart is that something spiked on the right side of the curve.

But that isn't the message of the chart. The originator applied this caption: "The age-crime curve last year looked pretty typical. How about this year? Same as always. Victims and offenders still have highly similar, relatively young ages."

So the intended message is that the blue and the red lines are more or less the same.

***

What about the spike on the far right? 

If read too quickly, one might think that the oldest segment of Americans went on a killing spree last year. One must read the axis label to learn that elders weren't committing more homicides; what spiked was the number of murderers of "unknown" age.

A quick fix of this is to plot the unknowns as a column chart on the right, disconnecting it from the age distribution. Like this:

Junkcharts_redo_fbicrimestats_0

***

This spike in unknowns appears consequential: the count is over 2,000, larger than the numbers for most age groups.

Curiously, unknowns in age spiked only for offenders but not victims. So perhaps those are unsolved cases, for which the offender's age is unknown but the victim's age is known.

If that hypothesis is correct, then the same pattern will be seen year upon year. I checked this in the FBI database, and found that every year about 2,000 offenders have unknown age.

In other words, the unknowns cannot be the main story here. Instead of dominating our attention, they should be pushed to the background, e.g. in a footnote.

***

Next, because the number of unknowns is so different between the offenders and victims, comparing two curves of counts is problematic. Such a comparison is based on the assumption that there are similar total numbers of offenders and victims. (There were in fact 5% more offenders than there were victims in 2023.)

The red and blue lines are not as similar as one might think.

Take the 40-49 age group. The blue value is 1,746 while the red value is 2,431, a difference of 685, which is 40 percent of 1,746! If we convert each to proportions, ignoring unknowns, the blue value is 12% compared to the red value of 15%, a difference of 3% which is a quarter of 12%.

By contrast, in the 10-19 age group, the blue value is 3,101 while the red value is 2,147, a difference of about 1,000, which is a third of 3,101. Converted to proportions, ignoring unknowns, the blue value is 21% compared to the red value of 13%, a difference of 8% which is almost 40% of 21%.
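
The arithmetic in the two paragraphs above can be checked with a short Python sketch. It uses only the counts and rounded percentages quoted in the text; the helper name `gap` is made up for illustration, and the totals behind the proportions are not reproduced here.

```python
def gap(blue: float, red: float) -> tuple[float, float]:
    """Return the absolute gap and that gap as a fraction of the blue value."""
    diff = abs(blue - red)
    return diff, diff / blue

# (blue, red) counts quoted in the text for two age groups.
counts = {"40-49": (1746, 2431), "10-19": (3101, 2147)}
# (blue, red) proportions in percent, ignoring unknowns, as quoted in the text.
shares = {"40-49": (12, 15), "10-19": (21, 13)}

for group in counts:
    count_diff, count_rel = gap(*counts[group])
    share_diff, share_rel = gap(*shares[group])
    print(f"{group}: counts differ by {count_diff} ({count_rel:.0%} of blue); "
          f"shares differ by {share_diff} points ({share_rel:.0%} of blue)")
```

Either way of measuring, the gap is a sizable fraction of the blue value in both age groups.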

It's really hard to argue that these age distributions are "similar".

Junkcharts_redo_fbicrimestats

As the chart above shows, offenders are much more likely to be younger (10-29 years old) than victims, and they are also much more likely to be 90+! Meanwhile, the victims are more likely to be 60-89.


Book review: Getting (more out of) Graphics by Antony Unwin

Unwin_gettingmoreoutofgraphics_cover

Antony Unwin, a statistics professor at Augsburg, has published a new dataviz textbook called "Getting (more out of) Graphics", and he kindly sent me a review copy. (Amazon link)

I am - not surprisingly - in the prime audience for such a book. It covers some gaps in the market:
a) it emphasizes exploratory graphics rather than presentation graphics
b) it deals not just with designing graphics but also interpreting (i.e. reading) them
c) it covers data pre-processing and data visualization in a more balanced way
d) it develops full case studies involving multiple graphics from the same data sources

The book is divided into two parts: the first, which covers 75% of the materials, details case studies, while the final quarter of the book offers "advice". The book has a github page containing R code which, as I shall explain below, is indispensable to the serious reader.

Given the aforementioned design, the case studies in Unwin's book have a certain flavor: most of the data sets are relatively complex, with many variables, including a time component. The primary goal of Unwin's exploratory graphics can be stated as stimulating "entertaining discussions" about and "involvement" with the data. They are open-ended, and frequently inconclusive. This is a major departure from other data visualization textbooks on the market, and also many of my own blog posts, where we focus on selecting a good graphic for presenting insights visually to an intended audience, without assuming domain expertise.

I particularly enjoyed the following sections: a discussion of building graphs via "layering" (starting on p. 326), enumeration of iterative improvement to graphics (starting on p. 402), and several examples of data wrangling (e.g. p. 52).

Unwin_fig4.7

Unwin does not give "advice" in the typical style of do this, don't do that. His advice is fashioned in the style of an analyst. He frames and describes the issues, shows rather than tells. This paragraph from the section about grouping data is representative:

Sorting into groups gets complicated when there are several grouping variables. Variables may be nested in a hierarchy... or they may have no such structure... Groupings need to be found that reflect the aims of the study. (p. 371)

He writes down what he has done, may provide a reason for his choices, but is always understated. He sees no point in selling his reasoning.

The structure of the last part of the book, the "advice" chapters, is quite unusual. The chapter headers are: (data) provenance and quality; wrangling; colour; setting the scene (scaling, layout, etc.); ordering, sorting and arranging; what affects interpretation; and varieties of plots.

What you won't find are extended descriptions of chart forms, rules of visualization, or flowcharts tying data types to chart forms. Those are easily found online if you want them (though you probably won't care if you're reading Unwin's book).

***

For the serious reader, the book should be consumed together with the code on github. Find specific graphs from the case studies that interest you, open the code in your R editor, and follow how Unwin did it. The "advice" chapters highlight points of interest from the case studies presented earlier so you may start there, cross-reference the case studies, then jump to the code.

Unfortunately, the code is sparsely commented. So also open up your favorite chatbot, which helps to explain the code, and annotate it yourself. Unwin uses R, and in particular, lives in the "tidyverse".

To understand the data manipulation bits, reviewing the code is essential. It's hard to grasp what is being done to the data without actually seeing the datasets. There are no visuals of the datasets in the book, as the text is primarily focused on the workflow leading to a graphic. The data processing can get quite involved, as in Chapter 16.

I'm glad Unwin has taken the time to write this book and publish the code. It rewards the serious reader with skills that are not commonly covered in other textbooks. For example, I was rather amazed to find this sentence (p. 366):

To ensure that a return to a particular ordering is always possible, it is essential to have a variable with a unique value for every case, possibly an ID variable constructed for just this reason. Being able to return to the initial order of a dataset is useful if something goes wrong (and something will).

Anyone who has analyzed real-world datasets would immediately recognize this as good advice, but who'd have thought to put it down in a book?
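
To make the quoted advice concrete, here is a minimal sketch in Python (the book's own code is in R, so this is not Unwin's code). The toy rows and the `row_id` field name are invented for illustration.

```python
# Toy dataset, invented for illustration; the order shown is the "initial order".
rows = [
    {"name": "c", "value": 3},
    {"name": "a", "value": 1},
    {"name": "b", "value": 2},
]

# Attach an ID that is unique for every case before any reordering happens.
for i, row in enumerate(rows):
    row["row_id"] = i

# Reorder freely during exploration...
by_value = sorted(rows, key=lambda r: r["value"])

# ...and return to the initial order whenever needed by sorting on the ID.
restored = sorted(by_value, key=lambda r: r["row_id"])

print([r["name"] for r in by_value])
print([r["name"] for r in restored])
```

The ID carries no meaning of its own; it only records where each case started, which is what makes the original ordering recoverable.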


Visualizing extremes

The New York Times published this chart to illustrate the extreme ocean heat driving the formation of hurricane Milton (link):

Nyt_oceanheatmilton

The chart expertly shows layers of details.

The red line tracks the current year's data on ocean heat content up to yesterday.

Meaning is added through the gray line that shows the average trajectory of the past decade. With the addition of this average line, we can readily see how different the current year's data is from the past. In particular, we see that the current season can be roughly divided into three parts: from May to mid June, and also from August to October, the ocean heat this year was quite a bit higher than the 10-year average, whereas from mid June to August, it was just above average.

Even more meaning is added when all the individual trajectories from the last decade are shown (in light gray lines). With this addition, we can readily learn how extreme this year's data is. For the post-August period, it's clear that the Gulf of Mexico is the hottest it's been in the past decade. Also, this extreme is not too far from the previous extreme. On the other hand, the extreme in late May-early June is rather more scary.


Aligning the visual and the message to hot things up

The headline of this NBC News chart (link) tells readers that Phoenix (Arizona) has been very, very hot this year. It has over 120 days in which the average temperature exceeded 100F (38 C).

Nbcnews_phoenix_tmax

It's not obvious how extreme this situation is. To help readers, it would be useful to add some kind of reference points.

A couple of possibilities come to mind:

First, how many days are depicted in the chart? Since there is one cell for each day of the year, and the day of week is plotted down the vertical axis, we just need to count the number of columns. There are 38 columns, but the first column has one missing cell while the last column has only 3 cells. Thus, the number of days depicted is (36*7)+6+3 = 261. So, the average temperature in Phoenix exceeded 100F on about 46% of the days of the year thus far.
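
The day count can be double-checked against the calendar. This is a small Python sketch. It takes the Sept 17, 2024 end date mentioned later in this post, and it uses 120 as the number of hot days even though the headline says "over 120", so the share is a lower bound.

```python
from datetime import date

start, end = date(2024, 1, 1), date(2024, 9, 17)

# Days from Jan 1 through Sept 17, counting both endpoints.
days_by_calendar = (end - start).days + 1

# Days counted from the grid: 36 full columns, plus 6 cells and 3 cells.
days_by_columns = 36 * 7 + 6 + 3

hot_days = 120  # "over 120" in the headline, so treat this as a floor
share = hot_days / days_by_calendar

print(days_by_calendar, days_by_columns, f"{share:.0%}")
```

Both ways of counting agree on 261 days, and 120 of 261 rounds to 46%.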

That sounds like a high number. For a better reference point, we'd also like to know the historical average. Is Phoenix just a very hot place? Is 2024 hotter than usual?

***

Let's walk through how one reads the Phoenix "heatmap".

We already figured out that each column represents a week of the year, and each row shows a cross-section of a given day of week throughout the year.

The first column starts on a Monday because the first day of 2024 falls on a Monday. The last column ends on a Tuesday, which corresponds to Sept 17, 2024, the last day of data when this chart was created.
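
The layout just described can be reproduced in a few lines of Python. This is a sketch under one assumption inferred from the description rather than stated by NBC: weeks run Sunday to Saturday, which is why the first column is missing one cell. The function names are invented for illustration.

```python
from datetime import date, timedelta

START, END = date(2024, 1, 1), date(2024, 9, 17)

def sunday_first_row(d: date) -> int:
    """Row index with Sunday=0 ... Saturday=6 (date.weekday() has Monday=0)."""
    return (d.weekday() + 1) % 7

def cell(d: date) -> tuple[int, int]:
    """Return (column, row) of a day in the week-by-weekday grid."""
    offset = sunday_first_row(START)  # empty cells before Jan 1 in column 0
    index = (d - START).days          # 0-based day of the year
    return (index + offset) // 7, sunday_first_row(d)

days = [START + timedelta(days=i) for i in range((END - START).days + 1)]
cells = [cell(d) for d in days]

n_columns = cells[-1][0] + 1
first_col_cells = sum(1 for col, _ in cells if col == 0)
last_col_cells = sum(1 for col, _ in cells if col == n_columns - 1)

print(n_columns, first_col_cells, last_col_cells)
```

Under that assumption the grid has 38 columns, with 6 cells in the first and 3 in the last, matching the count made earlier.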

The columns are grouped into months, although such division is complicated by the fact that the number of days in a month (except for February in a non-leap year) isn't ever divisible by seven. The designer subtly inserted a thicker border between months. This feature allows readers to comment on the average temperature in a given month. It also lets readers learn quickly that we are two weeks and three days into September.

The color legend explains that temperature readings range from yellow (lower) to red (higher). The range of average daily temperatures during 2024 was 54-118F (12-48C). The color scale is progressive.

Nbcnews_phoenix_colorlegend

Given that 100F is used as a threshold to define "hot days," it makes sense to accentuate this in the visual presentation. For example:

Junkcharts_redo_nbcnewsphoenixmaxtemp

Here, all days with maximum temperature at 100F or above have a red hue.
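
One way to express such a threshold-based scale is sketched below in Python. This is not how the chart above was produced. The function name `shade`, the two hue labels, and the linear intensity within each hue are all assumptions for illustration; the 54-118F range is the one quoted earlier.

```python
LOW_F, HIGH_F, THRESHOLD_F = 54.0, 118.0, 100.0

def shade(temp_f: float) -> tuple[str, float]:
    """Map a temperature to a (hue, intensity) pair split at the threshold.

    Days at or above the threshold get a red hue; all others get a yellow hue.
    Intensity runs from 0 at the threshold to 1 at the end of the range.
    """
    if temp_f >= THRESHOLD_F:
        return "red", (temp_f - THRESHOLD_F) / (HIGH_F - THRESHOLD_F)
    return "yellow", (THRESHOLD_F - temp_f) / (THRESHOLD_F - LOW_F)

for t in (54, 99, 100, 118):
    print(t, shade(t))
```

Splitting the scale at 100F means a reader can count hot days by hue alone, without consulting the legend.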