Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

Redo_junkchats_fraziercornellreopeningsuccess2

This is an edited version of the chart used in Peter Frazier's presentation.

Pfrazier_cornellreopeningupdate

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the two lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level than the day names.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.

 

 


Putting vaccine trials in boxes

Bloomberg Businessweek has a special edition about vaccines, and I found this chart on the print edition:

Bloombergbw_vaccinetrials_sm

The chart's got a lot of white space. Its structure is a series of simple "treemaps," one for each type of vaccine. Though simple, such a chart burns a few brain cells.

Here, I've extracted the largest block, which corresponds to vaccines that work with the virus's RNA/DNA. I applied a self-sufficiency test, removing the data from the boxes. 

Redo_junkcharts_bloombergbw_vaccinetrials_0

What proportion of these projects have moved from pre-clinical to Phase 1?  To answer this question, we have to understand the relative areas of boxes, since that's how the data are encoded. How many yellow boxes can fit into the gray box?

It's not intuitive. We'd need a ruler to do this task properly.

Then, we learn that the gray box is exactly 8 times the size of the yellow box (72 projects are pre-clinical while 9 are in Phase I). We can cram eight yellows into the gray box. Imagine doing that, and it's pretty clear the visual elements fail to convey the meaning of the data.

Self-sufficiency is the idea that a data graphic should not rely on printed data to convey its meaning; the visual elements of a data graphic should bear much of the burden. Otherwise, use a data table. To test for self-sufficiency, cover up the printed data and see if the chart still works.

***

A key decision for the designer is the relative importance of (a) the number of projects reaching Phase III, versus (b) the number of projects utilizing specific vaccine strategies.

This next chart emphasizes the clinical phases:

Redo_junkcharts_bloombergbw_vaccinetrials_2

 

Contrast this with the version shown in the online edition of Bloomberg (link), which emphasizes the vaccine strategies.

Bloombergbwonline_vaccinetrials

If any reader can figure out the logic of the ordering of the vaccine strategies, please leave a comment below.


Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.

Nyt_excessdeaths_south

Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as percent of cases. Also better than percent of the population.What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.

***

The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states  during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:

Nyt_excessdeaths_northeast3

 

New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.

***

One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.

***

The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.

 


Everything in Texas is big, but not this BIG

Long-time reader John forwarded the following chart via Twitter.

Covidtracking_texassquare

The chart shows the recent explosive growth in deaths due to Covid-19 in Texas. John flagged this graphic as yet another example in which the data are encoded to the lengths of the squares, not their areas.

Fixing this chart just requires fixing the length of one side of the square. I also flipped it to make a conventional column chart.

Redo_texasdeathsquares_process

The final product:

Redo_texasdeaths_columns

An important qualification lurks in the footnote; it is directly applied to the label of July.

How much visual distortion is created when data are encoded to the lengths and not the areas? The following chart shows what readers see, assuming they correctly perceive the areas of those squares. The value for March is held the same as above while the other months show the death counts implied by the relative areas of the squares.

Redo_texasdeaths_distortion

Owing to squaring, the smaller counts are artificially compressed while the big numbers are massively exaggerated.


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about the satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appears to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


The discontent of circular designs

You have two numbers +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1

But some find this graphic ugly. They don’t like straight lines, right angles and such. They prefer circles, and bends. Like PBS, who put out the following graphic that was forwarded to me by Fletcher D. on twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You may make a course with multiple racetracks. But anyone who runs track can tell you the outside lanes are not the same distance as the inside. I wrote about this issue in a long-ago post (see here).

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.


Presented without comment

Weekend assignment - which of these tells the story better?

Ourworldindata_cases_log

Or:

Ourworldindata_cases_linear

The cop-out answer is to say both. If you must pick one, which one?

***

When designing a data visualization as a living product (not static), you'd want a design that adapts as the data change.


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideway data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideway date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts Dow to reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else  "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer  is roughly but not precisely decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two spoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear time lag of reporting cases in the State of Georgia. This chart should always exclude the last few days. The case counts keep going up until it stabilizes. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when in fact, it reflects a lag in reporting.

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionableThis graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, it can eventually mean anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just  how to count things but also how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, ...., the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.

 


How Covid-19 deaths sneaked into Florida's statistics

Like many others, some Floridians are questioning their state's Covid statistics. It's clear there are numerous "degrees of freedom" for politicians to manipulate the numbers. What's not clear is who's influencing these decisions. Are they public-health experts, donors, voters, or whom?

A Twitter follower sent in the following chart, embedded in an informative article in Sun-Sentinel:

Sun-sentinel_pneumonia_percent_of_total

I like the visual design. It's clean, and conveys a moderately complex concept effectively. The reader may not immediately get what metrics are being plotted but the idea that the blue line should operate within the gray area.. until it doesn't is easily grasped. The range is technically an uncertainty band.

The metric is the proportion of total deaths (all causes) that are attributed to pneumonia and flu. Typical influenza deaths are found in that category. This chart investigates whether there were excess (unexplained) P&F deaths. The gray band measures the variability in the proportions of past years. When the blue line operates inside the band, the metric is normal. When it pierces the upper band, which happened here around week 25, a rare event has occurred.

The concern on Twitter was about the horizontal axis. Those integer labels can be confusing. The designer places a "how to read this" message in a footnote, explaining that week 1 is the first week of a typical flu season (which corresponds to late September 2019). This nugget of information helps a lot. We can see that the flu season peaks around week 20, and by the spring, it should be waning. Not so in 2020.

It's hard to escape the conclusion that deaths from Covid-19 are hiding inside the statistics of Pneumonia & Flu. As a statistician, I want to tell you Statistics Don't Lie! You can hide the data along one dimension, but they show up elsewhere. Misclassifying the deaths does buy someone some time. It takes a few weeks to compile all-cause mortality data (gasp, the CDC said mortality records are only 75 percent accurate after 8 weeks!)

The other small problem with the chart is the labeling. Neither axis has labels. The data label that shows up when you click on the line might be a default from the software that can't be turned off. It shows the two numbers being plotted without labels.

***

Here is a re-working of the chart that tells the story:

Redo_junkcharts_sunsentinelpneunominacovid19

The proportion of deaths attributed to P&F and Covid together is roughly double the upper end of what Florida should be seeing this time of the year (without Covid). Covid-19 accounts for half the gap. The other half are still being classified as P&F. However, I suspect CDC will adjust these numbers later to reflect the reality. (In making this chart, I also learned that Florida stopped including seasonal visitors in the death counts. This is egregious manipulation. If someone died while in Florida, they should be counted. I didn't investigate whether this counting rule applies only to Covid-19 deaths, or to deaths from all causes. If they had always done that, then I might give them a pass.)

On second thought, maybe not. The other egregious thing that appeared to have happened is that the Florida state health department unplugged their prior website (https://www.floridahealth.gov) so no one can cross-reference any prior documents. The only website I can access now for Florida state health is a Covid-specific site (https://floridahealthcovid19.gov).

Florida_state_health_websites

There must be something juicy on the previous influenza page, no?

***

Lastly, when you look at my chart, please pretend that the last week is not on there. In all likelihood, the "drop" is fake because the mortality data have not been fully updated. My chart contains one more week than the Sun Sentinel chart. So you can see that the drastic decline shown on their chart turned up a big uptick on mine (next to last week).

This is a common mistake on many charts I see these days. Half-baked numbers are shown next to fully-baked ones.