The gift of small edits and subtraction

While making the chart on fertility rates (link), I came across a problem that pops up quite often and is ignored by most software programs.

Here is an earlier version of the chart I later discarded:

Junkcharts_redofertilitychart_2

Compare this to the version I published in the blog post:

Junkcharts_redofertilitychart_1

Aside from adding the chart title, there is one major change. I removed the empty plots from the grid. This is a visualization trick that should be called adding by subtracting. The empty scaffolding on the first chart increases our cognitive load without yielding any benefit. The whitespace brings out the message that only countries in Asia and Africa have fertility rates above 5.0. 

This is a small edit. But small edits accumulate and deliver a big impact. Bear this in mind the next time you make a chart.

 

P.S.

(1) You'd have to use a lower-level coding language to execute this small edit; most software programs are quite rigid when it comes to making small-multiples (facet) charts. (See the sketch after these notes.)

(2) If there is a next iteration, I'd reverse the Asia and Oceania rows.
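To illustrate point (1): in ggplot2, for example, facet_grid() insists on drawing every cell of the grid, while facet_wrap() drops combinations with no data, though at the cost of the strict row-and-column alignment. A minimal sketch with made-up data:

```r
library(ggplot2)

# toy data: country-level fertility rates by continent and rate band
set.seed(1)
fert <- data.frame(
  country   = paste("Country", 1:40),
  continent = sample(c("Africa", "Asia", "Europe", "Americas"), 40, replace = TRUE),
  rate      = runif(40, 1, 8)
)
# force the high band to be empty outside Africa and Asia, mirroring the real data
eu_am <- fert$continent %in% c("Europe", "Americas")
fert$rate[eu_am] <- runif(sum(eu_am), 1, 4.5)
fert$band <- cut(fert$rate, breaks = c(0, 2, 5, 9),
                 labels = c("below 2", "2 to 5", "above 5"))

p <- ggplot(fert, aes(x = rate, y = country)) + geom_point()

# facet_grid keeps every continent x band cell, including the empty scaffolding
p + facet_grid(continent ~ band)

# facet_wrap drops combinations with no data (at the cost of strict grid alignment)
p + facet_wrap(vars(continent, band))
```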

 


Visually displaying multipliers

As I was preparing a blog post about another real-world study of Covid-19 vaccines, I came across the following chart (the chart title is mine).

React1_original

As background, this is the trend in Covid-19 cases in the U.K. in the last couple of months, courtesy of OurWorldinData.org.

Junkcharts_owid_uk_case_trend_july_august_2021

The React-1 Study sends swab kits to randomly selected people in England in order to assess the prevalence of Covid-19. Every month, there is a new round of returned swabs that are tested for Covid-19. This measurement method captures asymptomatic cases, although it probably misses severe and hospitalized cases. Despite these shortcomings, it is a far better way to measure cases than the hotchpotch assembly of variable-quality data submitted by different jurisdictions that has become the dominant source of our data.

Rounds 12 and 13 captured an inflection point in the pandemic in England. The period marked the beginning of the end of the belief that widespread vaccination would end the pandemic.

The chart I excerpted up top breaks the data down by age group. The column heights represent the estimated prevalence of Covid-19 during each round, described more precisely in the paper as "swab positivity." Based on the study's design, one may generalize the prevalence to the population at large. About 1.5% of those aged 13-24 in England were estimated to have Covid-19 around the time of Round 13 (roughly early July).

The researchers came to the following conclusion:

We show that the third wave of infections in England was being driven primarily by the Delta variant in younger, unvaccinated people. This focus of infection offers considerable scope for interventions to reduce transmission among younger people, with knock-on benefits across the entire population... In our data, the highest prevalence of infection was among 12 to 24 year olds, raising the prospect that vaccinating more of this group by extending the UK programme to those aged 12 to 17 years could substantially reduce transmission potential in the autumn when levels of social mixing increase

***

Raise your hand if the graphics software you prefer dictates at least one default behavior you can't stand. I'm sure most hands are up in the air. No matter how much you love the software, there is always something the developer likes that you don't.

The first thing I did with today's chart was to get rid of all such default details.

Redo_react1_cleanup

For me, the bottom chart is cleaner and more inviting.

***

The researchers wanted readers to think of Round 13 numbers as multiples of Round 12 numbers. In the text, they use statements such as:

weighted prevalence in round 13 was nine-fold higher in 13-17 year olds at 1.56% (1.25%, 1.95%) compared with 0.16% (0.08%, 0.31%) in round 12

It's not easy to perceive a nine-fold jump from the paired column chart, even though this chart form is better than several others. I added some subtle divisions inside each orange column in order to facilitate this task:

Redo_react1_multiples

I have recommended this trick before: it co-opts the idea of pictograms in constructing the column chart.
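Here is a minimal sketch of how such divided columns might be built in R/ggplot2: each Round 13 column is stacked out of blocks worth one Round 12 prevalence apiece, so the count of blocks reads as the multiplier. Only the 13-17 figures come from the quoted passage; the other numbers are made up.

```r
library(ggplot2)

# illustrative numbers (only the 13-17 figures are from the quoted passage)
df <- data.frame(
  age     = c("13-17", "18-24", "25-34"),
  round12 = c(0.16, 0.25, 0.20),
  round13 = c(1.56, 1.12, 0.40)
)

# break each round-13 column into blocks, each worth one round-12 prevalence
blocks <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  n <- ceiling(df$round13[i] / df$round12[i])
  data.frame(
    age    = df$age[i],
    height = pmin(df$round12[i],
                  df$round13[i] - (seq_len(n) - 1) * df$round12[i])
  )
}))

# geom_col stacks the blocks; white borders mark each round-12 multiple
ggplot(blocks, aes(x = age, y = height)) +
  geom_col(fill = "darkorange", colour = "white", linewidth = 0.3) +
  labs(x = NULL, y = "Swab positivity (%)")
```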

An alternative is to plot everything on an index scale, although one would then have to drop the prevalence numbers.

***

The chart requires an additional piece of context to interpret properly. I added each age group's share of the population below the chart - just to illustrate this point, not to recommend it as a best practice.

Redo_react1_multiples_popshare

The researchers concluded that their data supported vaccinating 13-17 year olds because that group experienced the highest multiplier from Round 12 to Round 13. Notice that the 13-17 year old age group represents only 6 percent of England's population, and is the least populous age group shown on the chart.

The neighboring 18-24 age group experienced a 4.5 times jump in prevalence in Round 13 so this age group is doing much better than 13-17 year olds, right? Not really.

While the same infection rate was found in both age groups during this period, the slightly older age group accounted for 50% more cases -- and that's due to its larger share of the population.

A similar calculation shows that while the infection rate of people under 24 is about 3 times higher than that of those 25 and over, both age groups suffered over 175,000 infections during the Round 13 time period (the difference between groups was < 4,000). So I don't agree that focusing on 13-17 year olds gives England the biggest bang for the buck: while they are the most likely to get infected, their cases account for only 14% of all infections. Almost half of the infections are in people 25 and over.
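A back-of-envelope calculation shows why a bigger multiplier in a smaller group can still mean fewer cases. The 6% share for 13-17 year olds comes from the chart; the 9% share for 18-24 year olds is my assumption, used purely for illustration:

```r
# cases scale with prevalence x population share; prevalence roughly equal in round 13
prevalence <- c(`13-17` = 0.0156, `18-24` = 0.0156)
pop_share  <- c(`13-17` = 0.06,   `18-24` = 0.09)   # 9% is an assumed figure for 18-24
relative_cases <- prevalence * pop_share
unname(relative_cases["18-24"] / relative_cases["13-17"])   # 1.5, i.e. ~50% more cases
```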

 


Stumped by the ATM

The neighborhood bank recently installed brand new ATMs, with tablet monitors and all that jazz. Then, I found myself staring at this screen:

Banknote_picker_us

I wanted to withdraw $100. I ordinarily love this banknote picker because I can get $5, $10, and $20 notes, instead of the $50 and $100 notes that come out of the slot when I don't specify my preference.

Something changed this time: I found myself wondering which row represents which note. For my non-U.S. readers: you may not know that all our notes are the same size and color. The screen resolution wasn't great, and I had to squint really hard to see the numbers on those banknote images.

I suppose if I had grown up here, I might be able to tell the note values from the portraits. This is an example of a visualization that makes my life harder!

***
I imagine the software developer might be a foreigner, perhaps one living in Europe. In that case, the developer might have this image in his or her head:

Banknote_picker_euro

Euro banknotes are heavily differentiated - by color, by image, by height and by width. The numeric value also occupies a larger proportion of the area. This makes a lot of sense.

I like designs to be adaptable. Switching data from one country to another should not alter the design. Switching data at different time scales should not affect the design. This banknote picker UI is not adaptable across countries.

***

Once I figured out the note values, I learned another reason why I couldn't tell which row was which note: one note is absent.

Banknote_us_2

Where is the $10 note? That and the twenty are probably the most frequently used. I am also surprised people want $1 notes from an ATM. But I assume the bank knows something I don't.


Book Review: Visualizing with Text by Richard Brath

Richardbarth_bookcover

The creative process is sometimes described in terms of diverge-converge cycles. The diverge step involves experimentation and rewards suspending disbelief, while excesses are curbed and concepts refined during the converge step. Richard Brath's just-released book Visualizing with Text is an important resource that expands our appreciation for the place of text in visual displays.

Books on data visualization fall into recognizable types, two popular ones being the style guide (e.g., those by Edward Tufte, Dona Wong, and Alberto Cairo) and the coding manual (e.g., those by Ben Fry (Processing) and Hadley Wickham (ggplot2, Shiny)). Brath's volume belongs to neither of those types - it reads more like an encyclopedic catalog of how text can be incorporated into charts and graphs. He challenges us to blow up our imaginative space for characters, words, sentences, paragraphs and prose. It is a valuable aid for the diverge step of our creative process.

In modern data visualization, text is treated as an accessory, frequently found in titles, labels, legends, footnotes or surrounding text. Brath wants us to elevate text to the starring attraction. Starting with baby steps, such as direct labeling of lines and objects, and coordinating colors between chart elements and words, he experiments with inserting text into unlikely crannies, not shying away from ideas that even he admits may be somewhat of a dead-end.

One of the more immediately useful examples is the use of text labels that hug the lines on a line chart, similar to how roads and rivers are labeled on maps. I wish all software developers implemented this feature without delay.

Barth_riverlabelsonlines
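For readers who want to try this today: to my knowledge, the geomtextpath package in R provides geoms that run a label along a line's path (check the package documentation for details). A minimal sketch with made-up data, my example rather than one from the book:

```r
library(ggplot2)
library(geomtextpath)   # assumed package: draws text that follows a line's path

set.seed(3)
df <- data.frame(
  year   = rep(2000:2020, 2),
  value  = c(cumsum(rnorm(21, 1)), cumsum(rnorm(21, 0.4))),
  series = rep(c("River A", "River B"), each = 21)
)

ggplot(df, aes(year, value, colour = series)) +
  geom_textline(aes(label = series), show.legend = FALSE) +   # label hugs each line
  theme_minimal()
```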

A more esoteric example is to replace these lines with small-size text, as Brath makes an analogy between sentences and lines.

Barth_textinlines

I am still deciding if this is a gold mine or a minefield. It is thought-provoking nonetheless.

Finally, the book includes some flights of fancy, like this one:

Barth_french_departments

The red superscripts are numeric codes for French departments (provinces), arranged in ascending order of a given metric, and placed at proportional distances within the prose!

The converge step is left to the reader, as Brath refrains from bullhorning his opinions about chart types, which is why readers should not expect a style guide. He includes many experimental graphics, and may provide the pros and cons of a form without registering a judgement.

Because many of these ideas have yet to enter the mainstream, we'd need to implement these ideas on our own, which is why readers will not find a coding manual. As mentioned above, even the simplest and least controversial tactic of directly labeling lines is not available in Excel, let alone text that hugs or replaces lines. (This proves Brath's point that our community has done text a disservice.) Other ideas explored in later chapters require such features as italicizing numeric proportions of a word, rather than the entire word.

Recently, text has become a mainstay of Big Data. Visualizing with Text is timely, relevant and provocative. It is also clearly written, and tightly organized. Chapter 13 neatly summarizes the key concepts that have appeared along the way. There are plenty of use cases, primarily derived from research or business. After reading this book, you'll revel in the new sandbox of text, and long to free yourself from the constraints of your tool.


***

I recommend that you get the paper copy of the book. I reviewed the electronic version, and what irony! As you may have guessed, the electronic version ruins the typesetting. On every page, certain paragraphs show up in a tiny font that resists all attempts to magnify, inadvertently making Brath's case that legibility is an important metric for text visualization. Some of the more unusual fonts are dropped, and the images are too small, even when popped up.

[P.S. Richard has a webpage where he included larger images and some code.]


Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

Redo_junkchats_fraziercornellreopeningsuccess2

This is an edited version of the chart used in Peter Frazier's presentation.

Pfrazier_cornellreopeningupdate

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships among the lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related, as they all come from the same underlying mathematical model; I gave them the same color in different shades.

Also, instead of a legend placed as far away from the data as possible, I moved the line labels right next to the lines.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level from the day names.

I removed the dots from the top three lines, but I might have retained them, perhaps with some transparency, had I spent more time on the edits. I'd definitely keep the last dot, to make it clear that the blue lines contain one extra data point.
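These edits are straightforward in a scripting tool. Below is a minimal ggplot2 sketch of the same ideas (related lines in shades of one hue, direct labels at the line ends, weekly two-level date labels), using made-up data and series names rather than Frazier's actual numbers:

```r
library(ggplot2)

# made-up data standing in for the Cornell series: one actual line plus three
# related lines from the same model (names and numbers are mine, not Frazier's)
set.seed(1)
dates  <- seq(as.Date("2020-08-30"), as.Date("2020-10-10"), by = "day")
series <- c("Actual", "Nominal projection", "Projection (low)", "Projection (high)")
df <- do.call(rbind, lapply(seq_along(series), function(i) {
  data.frame(date = dates, series = series[i],
             cases = cumsum(runif(length(dates), 0, i)))
}))

ends <- df[df$date == max(df$date), ]   # last point of each line, for direct labels

ggplot(df, aes(date, cases, colour = series)) +
  geom_line() +
  # the three model lines share one hue in different shades; the actual line stands apart
  scale_colour_manual(values = c("Actual" = "grey20",
                                 "Nominal projection" = "#2171b5",
                                 "Projection (low)"   = "#6baed6",
                                 "Projection (high)"  = "#bdd7e7"),
                      guide = "none") +
  # weekly ticks, with day and month on separate levels
  scale_x_date(date_breaks = "1 week", date_labels = "%d\n%b",
               expand = expansion(mult = c(0.02, 0.18))) +
  # labels sit next to the lines instead of in a distant legend box
  geom_text(data = ends, aes(label = series), hjust = -0.05, size = 3) +
  theme_minimal()
```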

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.

 

 


The why axis

A few weeks ago, I replied to a tweet by someone who was angered by the amount of bad graphics about coronavirus. I take a glass-half-full viewpoint: it's actually heart-warming for dataviz designers to realize that their graphics are being read! When someone critiques your work, it is proof that they cared enough to look at it. Worse is when you publish something, and no one reacts to it.

That said, I just wasted half an hour trying to get into the head of the person who made the following:

Fox31_co_newcases edited

Longtime reader Chris P. forwarded this tweet to me, and I saw that Andrew Gelman got sent this one, too.

The chart looks harmless until you check out the vertical axis labels. It's... um... most unusual. The best way to interpret what the designer did is to break up the chart into three components. Like this:

Redo_junkcharts_fox31cocases

The big mystery is why the designer spent the time and energy to make this mischief.

The usual suspect is fake news. The clearest sign of ill intent is the huge size of the dots: each dot spans almost the entire space between gridlines.

But there is almost no fake news here. The overall trend line is intact despite the attempted distortion. The following is a superposition of an unmanipulated line (yellow) on top of the manipulated one:

Redo_junkcharts_fox31cocases2

***

The next guess is incompetence. The evidence against this view is the amount of energy required to execute these changes. In Excel, it takes a lot of work. It's easier to do this in R or any programming language that lets you design your own axis.

Even for R coders, replicating the design is the easy part; coming up with the concept in the first place is the hard part!

You can't just stumble onto a design like this. So I am not convinced the designer is an idiot.

***

How much work? You have to create three separate charts, with three carefully chosen vertical scales, and then clip, merge, and sew up the seams. The weirdest bit is throwing away three of the twelve axis labels and writing in three fake numbers.

Here's the recipe: (if the gif doesn't load automatically, click on it)

Fox31_co_cases_B6

Help me, readers! I'm stumped. Why oh why did someone make this? What is the point?

 

P.S. [4/9/2020] A conversation with Carlos on Andrew's blog reveals another issue. I pointed out that the "Total cases" printed up top was not the sum of the 15 numbers on the chart. There was a gap of 184 cases. Carlos sent me a link showing a day on which the total cases in Colorado was 183 cases. I didn't quite get the point initially. He explained that it's 183 existing cases prior to the start of the period of this chart, plus the new cases during this period, leading to the "Total cases" as of the end of the period of this chart.

So, another mystery solved. This brings up an important point about making effective charts: confusion arises when two things in the visual seem to contradict each other. In most line charts, if there is a line and then a "total", the natural expectation is that the "total" is the sum of the data that make up the line - here, that would be the total of new cases during the time period depicted. But total new cases isn't the same as total cases counted from case #1, which is what the printed number shows.

It's clearer to say "Total Cases on 3/17 = 183; on 4/1 = 3342".

 


Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

Visualizingbaseball_cover

The best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one receives a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. Fewer pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has already been played?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.

***

In the final chapters, Albert introduces the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation will have a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to predict that the Red Sox have a, say, 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not Albert’s stated goal to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.
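The logic can be illustrated with a toy simulation (my sketch of the statistical idea, not Albert's actual model). Suppose, purely for illustration, that the Red Sox win any single World Series game with probability 0.55; simulating many best-of-seven series turns that per-game assumption into a series-level probability of roughly 60 percent:

```r
# toy illustration of simulation-based forecasting (assumed per-game probability)
set.seed(2018)
p_game <- 0.55
n_sims <- 10000

win_series <- function(p) sum(rbinom(7, 1, p)) >= 4   # best-of-seven: 4+ wins out of 7
mean(replicate(n_sims, win_series(p_game)))           # ~0.6, the "60% chance" style of claim
```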

***

While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is to first describe the analytical conclusion, then introduce a chart to illustrate it. The inverse would be to start with the chart, and use it to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to a single tool, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):

Albert_visualizingbaseball_chart

By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL, or if the axis labels were “horizontal location” and “vertical location” instead of px and pz. [Note: The chart above was taken from the book's github site; in Figure 5.8 in the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit”.

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.
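For what it's worth, all of these fixes take only a few lines of ggplot2. Here is a sketch using simulated data (the column names px, pz, pitch_type and Miss mirror those discussed above, but the values are made up):

```r
library(ggplot2)

# simulated stand-in for the pitch data (px, pz = plate location; Miss = swing-and-miss)
set.seed(5)
pitches <- data.frame(
  px = rnorm(600, 0, 0.8),
  pz = rnorm(600, 2.5, 0.7),
  pitch_type = sample(c("CU", "FF", "SL"), 600, replace = TRUE),
  Miss = sample(c(TRUE, FALSE), 600, replace = TRUE)
)

ggplot(pitches, aes(px, pz, colour = Miss)) +
  geom_point(size = 0.8, alpha = 0.4) +   # smaller, translucent dots ease over-plotting
  facet_wrap(~ pitch_type,                # readable panel titles instead of column codes
             labeller = as_labeller(c(CU = "Changeup", FF = "Fastball", SL = "Slider"))) +
  scale_colour_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey60"),
                      breaks = c("TRUE", "FALSE"),
                      labels = c("Miss", "Hit"),   # meaningful legend labels, not True/False
                      name = NULL) +
  labs(x = "horizontal location", y = "vertical location")
```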

***

Visualizing Baseball is not the book for readers who learn by running code, as no code is included in the book. A github page by the author hosts code, but only the R/ggplot2 code for generating the data visualizations. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don’t recommend learning to code by copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.


Webinar Wednesday

Lyon_onlinestreaming


I'm delivering a quick-fire Webinar this Wednesday on how to make impactful data graphics for communication and persuasion. Registration is free, at this link.

***

In the meantime, I'm preparing a guest lecture for the Data Visualization class at Yeshiva University Sims School of Management. The goal of the lecture is to emphasize the importance of incorporating analytics into the data visualization process.

Here is the lesson plan:

  1. Introduce the Trifecta checkup (link), which is the general framework for effective data visualizations
  2. Provide examples of Type D data visualizations, i.e. graphics that have good production values but fail due to issues with the data or the analysis
  3. Hands-on demo of an end-to-end data visualization process
  4. Lessons from the demo, including the iterative nature of analytics and visualization, and the value of sketching
  5. Overview of basic statistics concepts useful to visual designers

 


Choosing the right metric reveals the story behind the subway mess in NYC

I forgot who sent this chart to me - it may have been a Twitter follower. The person complained that the following chart exaggerated how much trouble the New York mass transit system (MTA) has been facing in 2017, because of the choice of the vertical axis limits.

Streetsblog_mtatraffic

This chart is vintage Excel, using Excel defaults. I find this style ugly and uninviting. But the chart does contain some good analysis. The analyst made two smart moves: the chart controls for month-to-month seasonality by plotting the data for the same month over successive years; and the designation "12 month averages" really means moving averages with a window size of 12 months - this has the effect of smoothing out the short-term fluctuations to reveal the longer-term trend.
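For readers who want to reproduce the smoothing, a 12-month trailing average takes only a line of R (illustrative numbers, not the MTA data):

```r
# 12-month trailing average: each point averages the current month and the 11 before it
set.seed(7)
riders <- 140 + rnorm(60, 0, 3)                              # hypothetical monthly ridership, millions
ma12   <- stats::filter(riders, rep(1/12, 12), sides = 1)    # NA for the first 11 months
```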

The red line is very alarming as it depicts a sustained negative trend over the entire year of 2017, even though the actual decline is a small percentage.

If this chart showed up on a business dashboard, the CEO would have been extremely unhappy. Slow but steady declines are the most difficult trends to deal with because they cannot be explained by one-time impacts. Until the analytics department figures out the underlying cause, the decline is very difficult to curtail, and with each monthly report, the sense of despair grows.

Because the base number of passengers in the New York transit system is so high, using percentages to think about the shift in volume underplays the message. It's better to use actual millions of passengers lost. That's what I did in my version of this chart:

Redo_jc_mtarevdecline

The quantity depicted is the unexpected loss of revenue passengers, measured against a forecast. The forecast I used is the average of the past two years' passenger counts. Above the zero line means out-performing the forecast, but in this case, since October 2016, performance has dipped ever farther below the forecast. By April 2017, the gap had widened to over 5 million passengers. That's a lot of lost customers and lost revenue, regardless of the percentage!
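Here's a minimal sketch of that calculation in R, under my reading that the forecast uses the same calendar month in each of the two prior years (hypothetical data; the column names are mine):

```r
library(dplyr)

# hypothetical monthly ridership (millions); a real MTA series would replace this
set.seed(42)
mta <- data.frame(
  month      = seq(as.Date("2014-01-01"), as.Date("2017-04-01"), by = "month"),
  passengers = 140 + rnorm(40, 0, 3)
)

mta <- mta %>%
  arrange(month) %>%
  mutate(
    # forecast = average of the same month in the two previous years
    forecast = (lag(passengers, 12) + lag(passengers, 24)) / 2,
    # unexpected gain (+) or loss (-) of passengers relative to the forecast
    gap = passengers - forecast
  )
```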

The biggest headache is investigating the cause of this decline. Most likely, it is a combination of factors.


Excel is the graveyard of charts, no!

It's true that Excel is responsible for large numbers of horrible charts. I just came across a typical example recently:

Ewolff_meanmedianincome

This figure comes from Edward Wolff's 2012 paper, "The Asset Price Meltdown and the Wealth of the Middle Class." It's got all the hallmarks of Excel defaults. It's not a pleasing object to look at.

However, it's also true that Excel can be used to make nice charts. Here is a remake:

Redo_meanmedianincome2

This chart is made almost entirely in Excel - the only edit I made outside Excel is to decompose the legend box.

It takes five minutes to make the first chart; it takes probably 30 minutes to make the second chart. That is the difference between good and bad graphics. Excel users: let that be your inspiration!