The discontent of circular designs

You have two numbers +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1

But some find this graphic ugly. They don’t like straight lines, right angles and such. They prefer circles, and bends. Like PBS, who put out the following graphic that was forwarded to me by Fletcher D. on twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You may make a course with multiple racetracks. But anyone who runs track can tell you the outside lanes are not the same distance as the inside. I wrote about this issue in a long-ago post (see here).

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.


Presented without comment

Weekend assignment - which of these tells the story better?

Ourworldindata_cases_log

Or:

Ourworldindata_cases_linear

The cop-out answer is to say both. If you must pick one, which one?

***

When designing a data visualization as a living product (not static), you'd want a design that adapts as the data change.


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideway data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideway date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts Dow to reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else  "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer  is roughly but not precisely decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two spoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear time lag of reporting cases in the State of Georgia. This chart should always exclude the last few days. The case counts keep going up until it stabilizes. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when in fact, it reflects a lag in reporting.

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionableThis graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, it can eventually mean anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just  how to count things but also how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, ...., the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.

 


How Covid-19 deaths sneaked into Florida's statistics

Like many others, some Floridians are questioning their state's Covid statistics. It's clear there are numerous "degrees of freedom" for politicians to manipulate the numbers. What's not clear is who's influencing these decisions. Are they public-health experts, donors, voters, or whom?

A Twitter follower sent in the following chart, embedded in an informative article in Sun-Sentinel:

Sun-sentinel_pneumonia_percent_of_total

I like the visual design. It's clean, and conveys a moderately complex concept effectively. The reader may not immediately get what metrics are being plotted but the idea that the blue line should operate within the gray area.. until it doesn't is easily grasped. The range is technically an uncertainty band.

The metric is the proportion of total deaths (all causes) that are attributed to pneumonia and flu. Typical influenza deaths are found in that category. This chart investigates whether there were excess (unexplained) P&F deaths. The gray band measures the variability in the proportions of past years. When the blue line operates inside the band, the metric is normal. When it pierces the upper band, which happened here around week 25, a rare event has occurred.

The concern on Twitter was about the horizontal axis. Those integer labels can be confusing. The designer places a "how to read this" message in a footnote, explaining that week 1 is the first week of a typical flu season (which corresponds to late September 2019). This nugget of information helps a lot. We can see that the flu season peaks around week 20, and by the spring, it should be waning. Not so in 2020.

It's hard to escape the conclusion that deaths from Covid-19 are hiding inside the statistics of Pneumonia & Flu. As a statistician, I want to tell you Statistics Don't Lie! You can hide the data along one dimension, but they show up elsewhere. Misclassifying the deaths does buy someone some time. It takes a few weeks to compile all-cause mortality data (gasp, the CDC said mortality records are only 75 percent accurate after 8 weeks!)

The other small problem with the chart is the labeling. Neither axis has labels. The data label that shows up when you click on the line might be a default from the software that can't be turned off. It shows the two numbers being plotted without labels.

***

Here is a re-working of the chart that tells the story:

Redo_junkcharts_sunsentinelpneunominacovid19

The proportion of deaths attributed to P&F and Covid together is roughly double the upper end of what Florida should be seeing this time of the year (without Covid). Covid-19 accounts for half the gap. The other half are still being classified as P&F. However, I suspect CDC will adjust these numbers later to reflect the reality. (In making this chart, I also learned that Florida stopped including seasonal visitors in the death counts. This is egregious manipulation. If someone died while in Florida, they should be counted. I didn't investigate whether this counting rule applies only to Covid-19 deaths, or to deaths from all causes. If they had always done that, then I might give them a pass.)

On second thought, maybe not. The other egregious thing that appeared to have happened is that the Florida state health department unplugged their prior website (https://www.floridahealth.gov) so no one can cross-reference any prior documents. The only website I can access now for Florida state health is a Covid-specific site (https://floridahealthcovid19.gov).

Florida_state_health_websites

There must be something juicy on the previous influenza page, no?

***

Lastly, when you look at my chart, please pretend that the last week is not on there. In all likelihood, the "drop" is fake because the mortality data have not been fully updated. My chart contains one more week than the Sun Sentinel chart. So you can see that the drastic decline shown on their chart turned up a big uptick on mine (next to last week).

This is a common mistake on many charts I see these days. Half-baked numbers are shown next to fully-baked ones.


Twitter people UpSet with that Covid symptoms diagram

Been busy with an exciting project, which I might talk about one day. But I promised some people I'll follow up on Covid symptoms data visualization, so here it is.

After I posted about the Venn diagram used to depict self-reported Covid-19 symptoms by users of the Covid Symptom Tracker app (reported by Nature), Xan and a few others alerted me to Twitter discussion about alternative visualizations that people have made after they suffered the indignity of trying to parse the Venn diagram.

To avoid triggering post-trauma, for those want to view the Venn diagram, please click here.

[In the Twitter links below, you almost always have to scroll one message down - saving tweets, linking to tweets, etc. are all stuff I haven't fully figured out.]

Start with the Questions

Xan’s final comment is especially appropriate: "There's an over-riding Type-Q issue: count charts answer the wrong question".

As dataviz designers, we frequently get locked into the mindset of “what is the best way to present this dataset?” This line of thinking leads to overloaded graphics that attempt to answer every possible question that may arise from the data in one panoptic chart, akin to juggling 10 balls at once.

For complex datasets, it is often helpful to narrow down the list of questions, and provide a series of charts, each addressing one or two questions. I’ll come back to this point. I want to first show some of the nicer visuals that others have produced, which brings out the structure and complexity of this dataset.

 

The UpSet chart

The primary contender is the “UpSet” chart form, as best exemplified by Bart’s effort

Upset_bartjutte

The centerpiece of this chart is the matrix of dots. The horizontal rows of dots represent the presence of specific symptoms such as cough and anosmia (loss of smell and taste). The vertical columns are intuitive, once you get it. They represent combinations of symptoms, and the fill/no-fill of the dots indicates which symptoms are being combined. For example, the first column counts people reporting fatigue plus anosmia (but nothing else).

The UpSet chart clearly communicates the structure of the data. In many survey questions (including this one conducted by the Symptom Tracker app), respondents are allowed to check/tick more than one answer choices. This creates a situation where the number of answers (here, symptoms) per respondent can be zero up to the total number of answer choices.

So far, we have built a structure like we have drawn country outlines on a map. There is no data yet. The data are primarily found in the sidebar histograms (column/bar charts). Reading horizontally to the right side, one learns that the most frequently reported symptom was fatigue, covering 88 percent of the users.* Reading vertically, one learns that the top combination of symptoms was fatigue plus anosmia, covering 16 percent of users.

***

Now come the divisive acts.

Act 1: Bart orders the columns in a particular way that meets his subjective view of how he wants readers to see the data. The columns are sorted from the most frequent combinations to the least. The histogram has a “long tail”, with most of the combinations receiving a small proportion of the total. The top five combinations is where the bulk of the data is – I’d have liked to see all five columns labeled, without decimal places.

This is a choice on the part of the designer. Nils, for example, made two versions of his UpSet charts. The second version arranges the combinations from singles to quintuples.

Nils Gehlenborg_upsetplot_sortedbynumberofsymptoms

 

Digression: The Visual in Data Visualization

The two rendering of “UpSet” charts, by Nils and Bart, is a perfect illustration of the Trifecta Checkup framework. Each corner of the Trifecta is an independent dimension, and yet all must sync. With the same data and the same question types, what differentiates the two versions is the visual design.

See how many differences you can find, and make your own design choices!

 

I place the digression here because Act 1 above has to do with the Q corner, and both visual designs can accommodate the sorting decisions. But Act 2 below pertains to the V corner.

Act 2: Bart applies a blue gradient to the matrix of dots that reinforces his subjective view about identifying frequent combinations of symptoms. Nils, by contrast, uses the matrix to show present/absent only.

I’m not sure about Act 2. I think the addition of the color gradient overloads the matrix in the chart. It has the nice effect of focusing the reader’s attention on the top 5 combinations but it also requires the reader to have understood the meaning of columns first. Perhaps applying the gradient to the histogram up top rather than the dots in the matrix can achieve the same goal with less confusion.

 

Getting Obtuse

For example, some readers (e.g. Robin) expressed confusion.

Robin is alleging something the chart doesn’t do. He pointed out (correctly) that while 16 percent experienced fatigue and anosmia only (without other symptoms), more than 50 percent reported fatigue and anosmia, plus other symptoms. That nugget of information is deeply buried inside Bart’s chart – it’s the sum of each column for which the first two dots are filled in. For example, the second column represents fatigue+anosmia+cough. So Robin wants to aggregate those up.

Robin’s critique arises from the Q(uestion) corner. If the designer wants to highlight specific combinations that occur most frequently in the data, then Bart’s encoding makes perfect sense. On the other hand, if the purpose is to highlight pairs of symptoms that occur most frequently together (disregarding symptoms outside each pair), then the data must be further aggregated. The switch in the Question requires more Data manipulation, which then affects the Visualization. That's the essence of the Trifecta Checkup framework.

Rest assured, the version that addresses Robin’s point will not give an easy answer to Bart’s question. In fact, Xan whipped up a bar chart in response:

Xan_symptomscombo_barchart

This is actually hard to comprehend because Robin’s question is even hard to state. The first bar shows 87 percent of users reported fatigue as a symptom, the same number that appeared on Bart’s version on the right side. Then, the darkened section of the bar indicates the proportion of users who reported only fatigue and nothing else, which appears to be about 10 percent. So 1 out of 9 reported just fatigue while 8 out of 9 who reported fatigue also experienced other symptoms.

 

Xan’s bar chart can be flipped 90 degrees and replace Bart’s histogram on top of the matrix. But you see, we end up with the same problem as I mentioned up top. By jamming more insights from more questions onto the same chart, we risk dropping the other balls that were already in the air.

So, my advice is always to first winnow down the list of questions you want to address. And don’t be afraid of making a series of charts instead of one panoptic chart.

***

Act 3: Bart decides to leave out labels for the columns.

This is a curious choice given the key storyline we’ve been working with so far (the Top 5 combinations of symptoms). But notice how annoying this problem is. Combinations require long text, which must be written vertically or slanted on this design. Transposing could help but not really. It’s just a limitation of this chart form. For me, reading the filled dots underneath the columns as column labels isn’t a show-stopper.

 

Histograms vs Bar Charts

It’s worth pointing out that the sidebar “histograms” are not both histograms. I tend to think of histograms as a specific type of bar (column) chart, in which the sum of the bars (columns) can be interpreted as a whole. So all histograms are bar charts but only some bar charts are histograms.

The column chart up top is a histogram. The combinations of symptoms are disjoint, and the total of the combinations should be the total number of answer choices selected by all respondents. The bar chart on the right side however is not a histogram. Each percentage is a proportion to the whole, and adding those percentages yields way above 100%.

I like the annotation on Bart’s chart a lot. They are succinct and they give just the right information to explain how to read the chart.

 

Limitations

I already mentioned the vertical labeling issue for UpSet charts. Here are two other considerations for you.

The majority of the plotting area is dedicated to the matrix of dots. The matrix contains merely labels for data. They are like country boundaries on a map. While it lays out the structure of data very clearly, the designer should ask whether it is essential for the readers to see the entire landscape.

In real-world data, the “long tail” phenomenon we saw earlier is very common. With six featured symptoms, there are 2^6 = 64 possible combinations of symptoms (minus 1 if they filtered out those not reporting symptoms*), almost all of which will be empty. Should the low-frequency columns be removed? This is not as controversial as you think, because implicitly both Bart and Nils already dropped all empty combinations!

 

Data and Code

Kieran Healy left a comment on the last post, and you can find both the data (thank you!) and some R code for UpSet charts at his blog.

Also, Nils has a Shiny app on Github.

 

(*) One must be very careful about what “users” are being represented. They form a tiny subset of users of the Symptom Tracker app, just those who have previously taken a diagnostic test and have self-reported at least one symptom. I have separately commented on the analyses of this dataset by the team behind the app. The first post discusses their analytical methods, the second post examines how they pre-processed the data, and a future post will describe the data collection practices. For the purpose of this blog post, I’ll ignore any data issues.

(#) Bart’s chart is conceptual because some of the columns of dots are repeated, and there is one column without fills, which should have been removed by a pre-processing step applied by the research team.


This exercise plan for your lock-down work-out is inspired by Venn

A twitter follower did not appreciate this chart from Nature showing the collection of flu-like symptoms that people reported they have to an UK tracking app. 

Nature tracking app venn diagram

It's a super-complicated Venn diagram. I have written about this type of chart before (see here); it appears to be somewhat popular in the medicine/biology field.

A Venn diagram is not a data visualization because it doesn't plot the data.

Notice that the different compartments of the Venn diagram do not have data encoded in the areas. 

The chart also fails the self-sufficiency test because if you remove the data from it, you end up with a data container - like a world map showing country boundaries and no data.

If you're new here: if a graphic requires the entire dataset to be printed on it for comprehension, then the visual elements of the graphic are not doing any work. The graphic cannot stand on its own.

When the Venn diagram gets complicated, teeming with many compartments, there will be quite a few empty compartments. If I have to make this chart, I'd be nervous about leaving out a number or two by accident. An empty cell can be truly empty or an oversight.

Another trap is that the total doesn't add up. The numbers on this graphic add to 1,764 whereas the study population in the preprint was 1,702. Interestingly, this diagram doesn't show up in the research paper. Given how they winnowed down the study population from all the app downloads, I'm sure there is an innocent explanation as to why those two numbers don't match.

***

The chart also strains the reader. Take the number 18, right in the middle. What combination of symptoms did these 18 people experience? You have to figure out the layers sitting beneath the number. You see dark blue, light blue, orange. If you blink, you might miss the gray at the bottom. Then you have to flip your eyes up to the legend to map these colors to diarrhoea, shortness of breath, anosmia, and fatigue. Oops, I missed the yellow, which is the cough. To be sure, you look at the remaining categories to see where they stand - I've named all of them except fever. The number 18 lies outside fever so this compartment represents everything except fever. 

What's even sadder is there is not much gain from having done it once. Try to interpret the number 50 now. Maybe I'm just slow but it doesn't get better the second or third time around. This graphic not only requires work but painstaking work!

Perhaps a more likely question is how many people who had a loss of smell also had fever. Now it's pretty easy to locate the part of the dark gray oval that overlaps with the orange oval. But now, I have to add all those numbers, 69+17+23+50+17+46 = 222. That's not enough. Next, I must find the total of all the numbers inside the orange oval, which is 222 plus what is inside the orange and outside the dark gray. That turns out to be 829. So among those who had lost smell, the proportion who also had fever is 222/(222+829) = 21 percent. 

How many people had three or more symptoms? I'll let you figure this one out!

 

 

 

 

 

 

 


Reviewing the charts in the Oxford Covid-19 study

On my sister (book) blog, I published a mega-post that examines the Oxford study that was cited two weeks ago as a counterpoint to the "doomsday" Imperial College model. These studies bring attention to the art of statistical modeling, and those six posts together are designed to give you a primer, and you don't need math to get a feel.

One aspect that didn't make it to the mega-post is the data visualization. Sad to say, the charts in the Oxford study (link) are uniformly terrible. Figure 3 is typical:

Oxford_covidmodel_fig3

There are numerous design decisions that frustrate readers.

a) The graphic contains two charts, one on top of the other. The left axis extends floor-to-ceiling, giving the false impression that it is relevant to both charts. In fact, the graphic uses dual axes. The bottom chart references the axis shown in the bottom right corner; the left axis is meaningless. The two charts should be drawn separately.

For those who have not read the mega-post about the Oxford models, let me give a brief description of what these charts are saying. The four colors refer to four different models - these models have the same structure but different settings. The top chart shows the proportion of the population that is still susceptible to infection by a certain date. In these models, no one can get re-infected, and so you see downward curves. The bottom chart displays the growth in deaths due to Covid-19. The first death in the UK was reported on March 5.  The black dots are the official fatalities.

b) The designer allocates two-thirds of the space to the top chart, which has a much simpler message. This causes the bottom chart to be compressed beyond cognition.

c) The top chart contains just five lines, smooth curves of the same shape but different slopes. The designer chose to use thick colored lines with black outlines. As a result, nothing precise can be read from the chart. When does the yellow line start dipping? When do the two orange lines start to separate?

d) The top chart should have included margins of error. These models are very imprecise due to the sparsity of data.

e) The bottom chart should be rejected by peer reviewers. We are supposed to judge how well each of the five models fits the cumulative death counts. But three design decisions conspire to prevent us from getting the answer: (i) the vertical axis is severely compressed by tucking this chart underneath the top chart (ii) the vertical axis uses a log scale which compresses large values and (iii) the larger-than-life dots.

As I demonstrated in this post also from the sister blog, many models especially those assuming an exponential growth rate has poor fits after the first few days. Charting in log scale hides the degree of error.

f) There is a third chart squeezed into the same canvass. Notice the four little overlapping hills located around Feb 1. These hills are probability distributions, which are presented without an appropriate vertical axis. Each hill represents a particular model's estimate of the date on which the novel coronavirus entered the UK. But that date is unknowable. So the model expresses this uncertainty using a probability distribution. The "peak" of the distribution is the most likely date. The spread of the hill gives the range of plausible dates, and the height at a given date indicates the chance that that is the date of introduction. The missing axis is a probability scale, which is neither the left nor the right axis.

***

The bottom chart shows up in a slightly different form as Figure 1(A).

Oxford_covidmodels_Fig1A

Here, the green, gray (blocked) and red thick lines correspond to the yellow/orange/red diamonds in Figure 3. The thin green and red lines show the margins of error I referred to above (these lines are not explicitly explained in the chart annotation.) The actual counts are shown as white rather than black diamonds.

Again, the thick lines and big diamonds conspire to swamp the gaps between model fit and actual data. Again, notice the use of a log scale. This means that the same amount of gap signifies much bigger errors as time moves to the right.

When using the log scale, we should label it using the original units. With a base 10 logarithm, the axis should have labels 1, 10, 100, 1000 instead of 0, 1, 2, 3. (This explains my previous point - why small gaps between a model line and a diamond can mean a big error as the counts go up.)

Also notice how the line of white diamonds makes it impossible to see what the models are doing prior to March 5, the date of the first reported death. The models apparently start showing fatalities prior to March 5. This is a key part of their conclusion - the Oxford team concluded that the coronavirus has been circulating in the U.K. even before the first infection was reported. The data visualization should therefore bring out the difference in timing.

I hope by the time the preprint is revised, the authors will have improved the data visualization.

 

 

 


The why axis

A few weeks ago, I replied to a tweet by someone who was angered by the amount of bad graphics about coronavirus. I take a glass-half-full viewpoint: it's actually heart-warming for  dataviz designers to realize that their graphics are being read! When someone critiques your work, it is proof that they cared enough to look at it. Worse is when you publish something, and no one reacts to it.

That said, I just wasted half an hour trying to get into the head of the person who made the following:

Fox31_co_newcases edited

Longtime reader Chris P. forwarded this tweet to me, and I saw that Andrew Gelman got sent this one, too.

The chart looked harmless until you check out the vertical axis labels. It's... um... the most unusual. The best way to interpret what the designer did is to break up the chart into three components. Like this:

Redo_junkcharts_fox31cocases

The big mystery is why the designer spent the time and energy to make this mischief.

The usual suspect is fake news. The clearest sign of malintent is the huge size of the dots. Each dot spans almost the entirety of the space between gridlines.

But there is almost no fake news here. The overall trend line is intact despite the attempted distortion. The following is a superposition of an unmanipulated line (yellow) on top of the manipulated:

Redo_junkcharts_fox31cocases2

***

The next guess is incompetence. The evidence against this view is the amount of energy required to execute these changes. In Excel, it takes a lot of work. It's easier to do this in R or any programming languages with which you can design your own axis.

Even for the R coders, the easy part is to replicate the design, but the hard part is to come up with the concept in the first place!

You can't just stumble onto a design like this. So I am not convinced the designer is an idiot.

***

How much work? You have to create three separate charts, with three carefully chosen vertical scales, and then clip, merge, and sew the seam. The weirdest bit is throwing away three of the twelve axis labels and writing in three fake numbers.

Here's the recipe: (if the gif doesn't load automatically, click on it)

Fox31_co_cases_B6

Help me readers! I'm stumped. Why oh why did someone make this? What is the point?

 

P.S. [4/9/2020] A conversation with Carlos on Andrew's blog reveals another issue. I pointed out that the "Total cases" printed up top was not the sum of the 15 numbers on the chart. There was a gap of 184 cases. Carlos sent me a link showing a day on which the total cases in Colorado was 183 cases. I didn't quite get the point initially. He explained that it's 183 existing cases prior to the start of the period of this chart, plus the new cases during this period, leading to the "Total cases" as of the end of the period of this chart.

So, another mystery solved. This brings up an important point about making effective charts: one way confusion arises is if there are two things from the visual that seem to contradict each other. In most line charts, if there is a line, and then a "total", the natural expectation is that the "total" is the sum of the data that make up the line. In this case, that "total" is the total new cases during the time period depicted. Total new cases isn't the same as total cases from case #1.

It's clearer to say "Total Cases on 3/17 = 183; on 4/1 = 3342".

 


Make your color legend better with one simple rule

The pie chart about COVID-19 worries illustrates why we should follow a basic rule of constructing color legends: order the categories in the way you expect readers to encounter them.

Here is the chart that I discussed the other day, with the data removed since they are not of concern in this post. (link)

Junkcharts_abccovidbiggestworries_sufficiency

First look at the pie chart. Like me, you probably looked at the orange or the yellow slice first, then we move clockwise around the pie.

Notice that the legend leads with the red square ("Getting It"), which is likely the last item you'll see on the chart.

This is the same chart with the legend re-ordered:

Redo_junkcharts_abcbiggestcovidworries_legend

***

Simple charts can be made better if we follow basic rules of construction. When used frequently, these rules can be made silent. I cover rules for legends as well as many other rules in this Long Read article titled "The Unspoken Conventions of Data Visualization" (link).