Tip of the day: transform data before plotting

The Financial Times called out a Twitter user for some graphical mischief. Here are the two charts illustrating the plunge in Bitcoin's price last week: (Hat tip to Mark P.)

Ft_tradingview_btcprices

There are some big differences between the two charts. The left chart depicts this month's price actions, drawing attention to the last week while the right chart shows a longer period of time, starting from 2012. The author of the tweet apparently wanted to say that the recent drop is nothing to worry about. 

The Financial Times reporter noted another subtle difference - the right chart uses a log scale while the left chart is linear. Specifically, it's a log 2 scale, which means that each step up is double the previous number (1, 2, 4, 8, etc.). The effect is to make large changes look smaller. Presumably most readers fail to notice the scale. Even if they do, it's not natural to assign different differences to the same physical distances.
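To see why a log 2 scale makes large changes look smaller, here is a minimal sketch. The helper `log2_axis_distance` and the prices are my own illustrations, not anything taken from the tweet or the FT chart. It measures how far apart two values sit on a log 2 axis, counted in doublings.

```python
import math

def log2_axis_distance(low, high):
    """Distance between two values on a log 2 axis, measured in doublings."""
    return math.log2(high) - math.log2(low)

# Illustrative prices only, not actual Bitcoin quotes.
small_move = log2_axis_distance(1, 2)            # a $1 change
large_move = log2_axis_distance(32_000, 64_000)  # a $32,000 change

print(f"$1 -> $2 spans {small_move:.2f} doublings")
print(f"$32,000 -> $64,000 spans {large_move:.2f} doublings")
```

Both moves occupy the same physical distance on the axis even though the dollar differences are wildly different.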

***

Junkcharts_redo_fttradingviewbitcoinpricechart

These price charts always miss the mark. That's because the current price is insufficient to capture whether a Bitcoin investor made money or lost money. If you purchased Bitcoins this month, you lost money. If your purchase was a year ago, you still made quite a bit of money despite the recent price plunge.

The following chart should not be read as a time series, even though the horizontal axis is time. Think date of Bitcoin purchase. This chart tells you how much $1 of Bitcoin was worth last week, based on what day the purchase was made.

Junkcharts_redo_fttradingviewbitcoinpricechart_2

People who bought this year have mostly been in the red. Those who purchased before October 2020 and held on are still very pleased with their decision.
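The transformation behind that chart is a one-liner: divide last week's price by the price on the purchase date. Here is a rough sketch, assuming a list of purchase-date prices; the function name and all the numbers are hypothetical, not real market data.

```python
def value_of_one_dollar(purchase_prices, reference_price):
    """What $1 of Bitcoin bought at each purchase price is worth at the reference price."""
    return [reference_price / price for price in purchase_prices]

# Hypothetical prices (USD per Bitcoin) on four purchase dates.
purchase_prices = [10_000.0, 20_000.0, 40_000.0, 50_000.0]
last_week_price = 40_000.0  # hypothetical reference price

values = value_of_one_dollar(purchase_prices, last_week_price)
for price, value in zip(purchase_prices, values):
    status = "gain" if value > 1 else "loss" if value < 1 else "break-even"
    print(f"bought at ${price:,.0f}: $1 is now worth ${value:.2f} ({status})")
```

Any value above 1 means the buyer is ahead; anything below 1 means the buyer is in the red.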

This example illustrates that simple transformations of the raw data yield graphics that are much more informative.

 


Losses trickle down while gains trickle up

In a rich dataset, it's hard to convey all the interesting insights on a single chart. Following up on the previous post, I looked further at the wealth distribution dataset. In the previous post, I showed this chart, which indicated that the relative wealth of the super-rich (top 1%) rose dramatically around 2011.

Redo_bihouseholdwealth_legend

As a couple of commenters noticed, that's relative wealth. I indexed everything to the Bottom 50%.

In this next chart, I apply a different index. Each income segment is set to 100 at the start of the time period under study (2000), and I track how each segment evolved in the last two decades.

Junkcharts_redo_bihouseholdwealth_2
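The re-indexing can be sketched as follows. This is only an illustration: `index_to_base` is a name I made up, and the wealth figures are invented, not the actual dataset.

```python
def index_to_base(series, base_position=0, base_value=100.0):
    """Rescale a series so that the value at base_position equals base_value."""
    base = series[base_position]
    return [base_value * x / base for x in series]

# Hypothetical wealth levels at three points in time (first entry = start year).
wealth = {
    "Top 1%": [10.0, 12.0, 15.0],
    "Bottom 50%": [2.0, 1.0, 1.8],
}

indexed = {segment: index_to_base(levels) for segment, levels in wealth.items()}
for segment, levels in indexed.items():
    print(segment, [round(x, 1) for x in levels])
```

Because every segment starts at 100, the lines show each segment's growth relative to its own starting point rather than relative to another segment.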

This chart offers many insights.

The Bottom 50% have been left far, far behind in the last 20 years. In fact, from 2000 to 2018, this segment's wealth never once reached the 2000 level. At its worst, around 2010, the Bottom 50% found themselves 80% poorer than they had been 10 years earlier!

In the meantime, the other half of the population has seen their wealth climb continuously through the 20 years. This is particularly odd because the major crisis of these two decades was the Too Big to Fail implosion of financial instruments, which the Bottom 50% almost surely did not play a part in. During that crisis, the top 50% were 30-60% better off than they were in 2000. Is this the "trickle-down" economy in which losses are passed down (but gains are passed up)?

The chart also shows how the recession hit the bottom 50% much deeper, and how the recovery took more than a decade. For the top half, the recovery came within 2 to 4 years.

It also appears that the top 10% are further peeling off from the rest of the population. Since 2009, the top 11-49% have been steadily losing ground relative to the top 10%, while the gap between them and the Bottom 50% has narrowed.

***

This second chart is not nearly as dramatic as the first one but it reveals much more about the data.

 


Finding the hidden information behind nice-looking charts

This chart from Business Insider caught my attention recently. (link)

Bi_householdwealthchart

There are various things they did that I like. The use of color draws a distinction between the top three lines and the line at the bottom, which tells the story that the bottom 50% has been left far behind. Labelling the lines directly is another nice touch. I usually like legends that sit atop the chart; in this case, I'd have just written the income groups into the line labels.

Take a closer look at the legend text, and you'd notice they struggled with describing the income percentiles.

Bi_householdwealth_legend

This is a common problem with this type of data. The top and bottom categories are easy, as it's most natural to say "top x%" and "bottom y%". By doing so, we establish two scales, one running from the top, and the other counting from the bottom - and it's a head scratcher which scale to use for the middle categories.

The designer decided to lose the "top" and "bottom" descriptors, and went with "50-90%" and "90-99%". Effectively, these follow the "bottom" scale. "50-90%" is the bottom 50 to 90 percent, which corresponds to the top 10 to 50 percent. "90-99%" is the bottom 90-99%, which corresponds to the top 1 to 10%. On this chart, since we're lumping the top three income groups, I'd go with "top 1-10%" and "top 10-50%".

***

The Business Insider chart is easy to misread. It appears that the second group from the top is the most well-off, and the wealth of the top group is almost 20 times that of the bottom group. Both of those statements are false. What's confusing us is that the lines represent very different numbers of people. The yellow line is 50% of the population while the "top 1%" line is 1% of the population. To see what's really going on, I look at a chart showing per-capita wealth. (Just divide the data of the yellow line by 50, etc.)

Redo_bihouseholdwealth_legend

For this chart, I switched to a relative scale, using the per-capita wealth of the Bottom 50% as the reference level (100). Also, I applied a 4-period moving average to smooth the line. The data actually show that the top 1% holds much more wealth per capita than all other income segments. Around 2011, the gap between the top 1% and the rest was at its widest - the average person in the top 1% is about 3,000 times wealthier than someone in the bottom 50%.
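The three steps described above (per-capita wealth, indexing to the Bottom 50%, and a 4-period moving average) can be sketched like this. The function names and the wealth totals are hypothetical; only the population shares (1 and 50) follow from the group definitions.

```python
def per_capita(totals, population_share_pct):
    """Divide a group's total wealth by its share of the population (in percent)."""
    return [t / population_share_pct for t in totals]

def relative_index(series, reference):
    """Express each value as an index where the reference series equals 100."""
    return [100.0 * s / r for s, r in zip(series, reference)]

def moving_average(series, window=4):
    """Trailing moving average; the first window-1 periods yield no value."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical total wealth over five periods (arbitrary units).
top1_total = [20.0, 24.0, 28.0, 32.0, 36.0]
bottom50_total = [10.0, 10.0, 10.0, 10.0, 10.0]

top1_pc = per_capita(top1_total, 1)            # top 1% is 1% of the population
bottom50_pc = per_capita(bottom50_total, 50)   # bottom 50% is 50% of the population

top1_relative = relative_index(top1_pc, bottom50_pc)
smoothed = moving_average(top1_relative, window=4)
print([round(x) for x in smoothed])
```

In these made-up numbers the top group holds only 2 to 3.6 times the total wealth of the bottom group, yet per capita the gap is 100 to 180 times, which is the kind of reversal the per-capita view exposes.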

This chart raises another question. What caused the sharp rise in the late 2000s and the subsequent decline? By 2020, the gap between the top and bottom groups is still double the size of the gap from 20 years ago. We'd need additional analyses and charts to answer this question.

***

If you are familiar with our Trifecta Checkup, the Business Insider chart is a Type D chart. The problem with it is in how the data was analyzed.


The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:

Washingtonpost_pediatricfludeaths

The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree, especially if we plot cumulative counts rather than weekly deaths. Here's a quick sketch of such a chart:

Junkcharts_redo_wppedflu_panel

(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)
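Turning weekly counts into cumulative counts is a running sum. A minimal sketch, using made-up weekly numbers rather than the actual flu surveillance data:

```python
from itertools import accumulate

# Hypothetical weekly pediatric flu deaths for two seasons (not real data).
weekly_deaths = {
    "2018-19": [1, 3, 5, 4, 2],
    "2020-21": [0, 0, 1, 0, 0],
}

# Running total per season: each entry is the count to date.
cumulative = {season: list(accumulate(counts))
              for season, counts in weekly_deaths.items()}

for season, totals in cumulative.items():
    print(season, totals)
```

A cumulative line can only go up or stay flat, so a season with almost no deaths shows up as a line hugging the horizontal axis.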

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who has seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:

Junkcharts_redo_wppeddeaths_superpose2


Vaccine researchers discard the start-at-zero rule

I struggled to decide on which blog to put this post. The reality is it bridges the graphical and analytical sides of me. But I ultimately placed it on the dataviz blog because that's where today's story starts.

Data visualization has few set-in-stone rules. If pressed for one, I'd likely cite the "start-at-zero" rule, which has featured regularly on Junk Charts (here, here, and here, for example). This rule only applies to a bar chart, where the heights (and thus, areas) of the bars should encode the data.

Here is a stacked column chart that earns boos from us:

Kfung_stackedcolumn_notstartingatzero_0

I made it so I'm downvoting myself. What's wrong with this chart? The vertical axis starts at 42 instead of zero. I've cropped out exactly 42 units from each column. Therefore, the column areas are no longer proportional to the data. Forty-two is 84% of column A while it is 19% of column B. By raising the baseline, I've made column B dwarf column A. For comparison, I added a second chart in which the vertical axis starts at zero.

Kfung_stackedcolumn_notstartatzero

On the right side, Column B is 22 times the height of column A. On the left side, it is 4 times as high. Both are really the same chart, except one has its legs chopped off.
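The arithmetic can be checked with a short sketch. The column totals below are back-computed from the percentages quoted above (42 is 84% of 50 and about 19% of 221), so treat them as approximations rather than the exact data; `drawn_height_ratio` is a hypothetical helper.

```python
def drawn_height_ratio(column_a, column_b, baseline=0.0):
    """Ratio of the visible heights of two columns when the axis starts at baseline."""
    return (column_b - baseline) / (column_a - baseline)

# Approximate totals inferred from the percentages in the text.
column_a = 50.0    # 42 / 50 = 84%
column_b = 221.0   # 42 / 221 is roughly 19%

honest = drawn_height_ratio(column_a, column_b, baseline=0.0)
chopped = drawn_height_ratio(column_a, column_b, baseline=42.0)

print(f"axis starts at 0:  B is {honest:.1f} times A")
print(f"axis starts at 42: B is {chopped:.1f} times A")
```

Cutting the same 42 units from both columns removes most of the short column and only a sliver of the tall one, which is why the visual ratio jumps from about 4 to about 22.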

***

Now, let me reveal the data behind the above chart. It is a re-imagination of the famous cumulative case curve from the Pfizer vaccine trial.

Pfizerfda_figure2_cumincidencecurves

I transferred the data to a stacked column chart. Each column block shows the incremental cases observed in a given week of the trial. All the blocks stacked together rise to the total number of cases observed by the time the interim analysis was presented to the FDA.

Observe that in the cumulative cases chart, the count starts at zero on Day 0 (first dose). This means the chart corresponds to the good stacked column chart, with the vertical axis starting from zero on Day 0.

Kfung_pfizercumcases_stackedcolumn

The Pfizer chart above is, however, disconnected from the oft-chanted 95% vaccine efficacy number. You can't find this number on there. Yes, everyone has been lying to you. In a previous post, I did the math, and if you trace the vaccine efficacy throughout the trial, you end up at about 80% toward the right, not 95%.

Pfizer_cumcases_ve_vsc_published

How can they conclude VE is 95% but show a chart that never reaches that level? The chart was created for a "secondary" analysis included in the report for completeness. The FDA and researchers decided long ago, before the trials started enrolling people, that they don't care about the cumulative case curve starting on Day 0. The "primary" analysis counts cases starting 7 days after the second shot, which means Day 29.

The first week that concerns the FDA is Days 29-35 (for Pfizer's vaccine). The vaccine arm saw 41 cases in the first 28 days of the trial. In effect, the experts chop the knees off the column chart. When they talk about 95% VE, they are looking at the column chart with the axis starting at 42.

Kfung_pfizercumcases_stackedcolumn_chopped

Yes, that deserves a boo.

***

It's actually even worse than that, if you could believe it.

The most commonly cited excuse for the knee-chop is that any vaccine is expected to be useless in the first X days (X being determined after the trial ends when they analyze the data). A recently published "real world" analysis of the situation in Israel contains a lengthy defense of this tactic, in which they state:

Strictly speaking, the vaccine effectiveness based on this risk ratio overestimates the overall vaccine effectiveness in our study because it does not include the early follow-up period during which the vaccine has no detectable effect (and thus during which the ratio is 1). [Appendix, Supplement 4]

Assuming VE = 0 prior to day X is equivalent to stipulating that the number of cases found in the vaccine arm is the same (within margin of error) as the number of cases in the placebo arm during the first X days.

That assumption is refuted by the Pfizer trial (and every other trial that has results so far).

The Pfizer/BioNTech vaccine was not useless during the first week. It's not 95% efficacious, more like 16%. In the second week, it improves to 33%, and so on. (See the VE curve I plotted above for the Pfizer trial.)

What happened was that all the weeks before the VE plateaued were dropped.

***

So I was simplifying the picture by chopping same-size blocks from both columns in the stacked column chart. Contrary to the no-effect assumption, the blocks at the bottom of each column are of different sizes. Much more was chopped from the placebo arm than from the vaccine arm.

You'd think that would unjustifiably favor the placebo. Not true! As almost all the cases on the vaccine arm were removed, the remaining cases on the placebo arm are now many multiples of those on the vaccine arm.

The following shows what VE would have been reported if they had started counting cases from day X. The first chart counts all cases from the first shot. The second chart removes the first two weeks of cases, corresponding to the analysis that other pharmas have done, namely, evaluating efficacy from 14 days after the first dose. The third chart removes even more cases, and represents what happens if the analysis is conducted from the second dose. The fourth chart is the official Pfizer analysis, which began 7 days after the second shot. Finally, the fifth chart shows analysis beginning from 14 days after the second shot, the window selected by Moderna and AstraZeneca.

Kfung_howvaccinetrialsanalyzethedata
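Here is a rough sketch of how the choice of starting week moves the reported VE. It assumes two arms of equal size, so that VE is one minus the ratio of case counts. The weekly counts are invented for illustration and are not the Pfizer trial data; `ve_from_week` is a hypothetical helper.

```python
def vaccine_efficacy(vaccine_cases, placebo_cases):
    """VE = 1 - ratio of case counts, assuming equal-sized arms."""
    return 1.0 - vaccine_cases / placebo_cases

def ve_from_week(vaccine_weekly, placebo_weekly, start_week):
    """VE computed only from cases occurring in start_week (0-based) or later."""
    return vaccine_efficacy(sum(vaccine_weekly[start_week:]),
                            sum(placebo_weekly[start_week:]))

# Hypothetical new cases per week in each arm (not real trial data).
placebo_weekly = [20, 20, 20, 20, 20, 20]
vaccine_weekly = [16, 12, 6, 2, 1, 1]

ve_by_start = [ve_from_week(vaccine_weekly, placebo_weekly, week)
               for week in (0, 2, 4)]
for week, ve in zip((0, 2, 4), ve_by_start):
    print(f"counting from week {week}: VE = {ve:.0%}")
```

In this toy example the same trial yields roughly 68%, 88% or 95% depending solely on when counting starts, because the dropped weeks contain most of the vaccine arm's cases.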

The premise that any vaccine is completely useless for a period after administration is refuted by the actual data. By starting analysis windows at some arbitrary time, the researchers make it unnecessarily difficult to compare trials. Selecting the time of analysis based on the results of a single trial is the kind of post-hoc analysis that statisticians have long warned leads to over-estimation. It's equivalent to making the vertical axis of a column chart start above zero in order to exaggerate the relative heights of the columns.

 

P.S. [3/1/2021] See comment below. I'm not suggesting vaccines are useless. They are still a miracle of science. I believe the desire to report a 90% VE number is counterproductive. I don't understand why a 70% or 80% effective vaccine is shameful. I really don't.


Reading an infographic about our climate crisis

Let's explore an infographic by SCMP, which draws attention to the alarming temperature recorded at Verkhoyansk in Russia on June 20, 2020. The original work was on the back page of the printed newspaper, referred to in this tweet.

This view of the globe brings out the two key pieces of evidence presented in the infographic: the rise in temperature in unexpected places, and the shrinkage of the Arctic ice.

Scmp_russianheat_1a

A notable design decision is to omit the color scale. On inspection, the scale is present - it was sewn into the graphic.

Scmp_russianheat_colorscale

I applaud this decision as it does not take the reader's eyes away from the graphic. Some information is lost as the scale isn't presented in full detail but I doubt many readers need those details.

A key takeaway is that the temperature in Verkhoyansk, which is on the edge of the Arctic Circle, was the same as in New Delhi in India on that day. We can see how the red was encroaching upon the Arctic Circle.

***

Scmp_russianheat_2a

Next, the rapid shrinkage of the Arctic ice is presented in two ways. First, a series of maps.

The annotations are pared to the minimum. The presentation is simple enough that we can visually judge that the amount of ice cover has roughly halved from 1980 to 2009.

A numerical measure of the drop is provided on the side.

Then, a line chart reinforces this message.

The line chart emphasizes change over time while the series of maps reveals change over space.

Scmp_russianheat_3a

This chart suggests that the year 2020 may break the record for the smallest ice cover since 1980. The maps of Australia and India provide context to interpret the size of the Arctic ice cover.

I'd suggest reversing the pink and black colors so as to refer back to the blue and pink lines in the globe above.

***

The final chart shows the average temperature worldwide and in the Arctic, relative to a reference period (1981-2000).

Scmp_russianheat_4

This one is tough. It looks like an area chart but it should be read as a line chart. The darker line is the anomaly of Arctic average temperature while the lighter line is the anomaly of the global average temperature. The two series are synced except for a brief period around 1940. Since 2000, the temperatures have been dramatically rising above that of the reference period.

If this is a stacked area chart, then we'd interpret the two data series as summable, with the sum of the data series signifying something interesting. For example, the market shares of different web browsers sum to the total size of the market.

But the chart above should not be read as a stacked area chart because the outside envelope isn't the sum of the two anomalies. The problem is revealed if we try to articulate what the color shades mean.

Scmp_russianheat_4_inset

On the far right, it seems like the dark shade is paired with the lighter line and represents global positive anomalies while the lighter shade shows Arctic's anomalies in excess of global. This interpretation only works if the Arctic line always sits above the global line. This pattern is broken in the late 1990s.

Around 1999, the Arctic's anomaly is negative while the global anomaly is positive. Here, the global anomaly gets the lighter shade while the Arctic one is blue.

One possible fix is to encode the size of the anomaly into the color of the line. The further away from zero, the darker the red/blue color.

 

 


A beautiful curve and its deadly misinterpretation

When the preliminary analyses of their Phase 3 trials came out, vaccine developers pleased their audience of scientists with the following data graphic:

Pfizerfda_cumcases

The above was lifted out of the FDA briefing document for the Pfizer / Biontech vaccine.

Some commentators have homed in on the blue line for the vaccinated arm of the Pfizer trial.

Junkcharts_pfizerfda_redo_vaccinecases

Since the vertical axis shows cumulative number of cases, it is noted that the vaccine reached peak efficacy after 14 days following the first dose. The second dose was administered around Day 21. At this point, the vaccine curve appeared almost flat. Thus, these commentators argued, we should make a big bet on the first dose.

***

The chart is indeed very beautiful. It's rare to see such a huge gap between the test group and the control group. Notice that I just described the gap between test and control. That's what a statistician is looking at in that chart - not the blue line, but the gap between the red and blue lines.

Imagine: if the curve for the placebo group looked the same as that for the vaccinated group, then the chart would lose all its luster. Screams of victory would be replaced by tears of sadness.

Here I bring back both lines, and you should focus on the gaps between the lines:

Junkcharts_pfizerfda_redo_twocumcases

Does the action stop around day 14? The answer is a resounding No! In fact, the red line keeps rising, so the vaccine's efficacy improves over time (since VE is a ratio between the two groups).

The following shows the vaccine efficacy curve:

Junkcharts_pfizerfda_redo_ve

Right before the second dose, VE is just below 50%. VE keeps rising and reaches 70% by day 50, which is about a month after the second dose.
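To make the ratio explicit, here is a simplified definition, assuming both arms have the same number of participants. The symbols $C_v(t)$ and $C_p(t)$ are my shorthand for the cumulative case counts in the vaccine and placebo arms up to day $t$, and the counts in the worked lines are hypothetical, chosen only to land near the levels quoted above.

```latex
\mathrm{VE}(t) = 1 - \frac{C_v(t)}{C_p(t)}

% Hypothetical counts: the vaccine curve barely moves, the placebo curve keeps climbing.
\mathrm{VE}(21) = 1 - \frac{39}{75} = 0.48
\qquad
\mathrm{VE}(50) = 1 - \frac{45}{150} = 0.70
```

Between the two dates the vaccine arm adds only 6 cases, so its curve looks flat, yet VE climbs from 48% to 70% because the denominator doubles.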

If the FDA briefing document had shown the VE curve, instead of the cumulative-cases curve, few would argue that you don't need the second dose!

***

What went wrong here? How could such a beautiful chart turn out to be lethal? (See this post on my book blog for reasons why I think forgoing or delaying the second dose will exacerbate the pandemic.)

It's a bit of bait and switch. The original chart plots cumulative case counts, separately for each treatment group. Cumulative case counts are inputs to computing vaccine efficacy. It is true that as the blue line for the vaccine flattens, VE would likely rise. But the case count for the vaccine group is an imperfect proxy for VE. As I showed above, the VE continues to gain strength long after the vaccine case count has levelled.

The important lesson for data visualization designers is: plot the metric that matters to decision-makers; avoid imperfect proxies.

 

P.S. [1/19/2021: For those who want to get behind the math of all this, the following several posts on my book blog will help.

One-dose Pfizer is not happening, and here's why

The case for one-dose vaccines is lacking key details

One-dose vaccine strategy elevates PR over science

]

[1/21/2021: The Guardian chimes in with "Single Covid vaccine dose in Israel 'less effective than we thought'" (link). "In remarks reported by Army Radio, Nachman Ash said a single dose appeared “less effective than we had thought”, and also lower than Pfizer had suggested." To their credit, Pfizer has never publicly recommended a one-dose treatment.]

[1/21/2021: For people in marketing or business, I wrote up a new post that expresses the one-dose vs two-dose problem in terms of optimizing an email drip campaign. It boils down to: do you accept the argument that you should get rid of your latter touches because the first email did all the work? Or do you want to run an experiment with just one email before you decide? You can read this on the book blog here.]


Is this an example of good or bad dataviz?

This chart is giving me feelings:

Trump_mcconnell_chart

I first saw it on TV and then a reader submitted it.

Let's apply a Trifecta Checkup to the chart.

Starting at the Q corner, I can say the question it's addressing is clear and relevant. It's the relationship between Trump and McConnell's re-election. The designer's intended message comes through strongly - the chart offers evidence that McConnell owes his re-election to Trump.

Visually, the graphic has elements of great story-telling. It presents a simple (others might say, simplistic) view of the data - just the poll results of McConnell vs McGrath at various times, and the election result. It then flags key events, drawing the reader's attention to those. These events are selected based on key points on the timeline.

The chart includes wise design choices, such as no gridlines, infusing the legend into the chart title, no decimals (except for the last pair of numbers, the intention of which I'm not getting), and leading with the key message.

I can nitpick a few things. Get rid of the vertical axis. Also, expand the scale so that the difference between 51%-40% and 58%-38% becomes more apparent. Space the time points in proportion to the dates. The box at the bottom is a confusing afterthought that reduces rather than assists the messaging.

But the designer got the key things right. The above suggestions do not alter the reader's experience that much. It's a nice piece of visual story-telling, and from what I can see, has made a strong impact with the audience it is intended to influence.

_trifectacheckup_junkcharts

This chart is proof of why the Trifecta Checkup has three corners, plus linkages between them. If we just evaluate what the visual is conveying, this chart is clearly above average.

***

In the D corner, we ask: what are the Data saying?

This is where the chart runs into several problems. Let's focus on the last two sets of numbers: 51%-40% and 58%-38%. Just add those numbers and do you notice something?

The last poll sums to 91%. This means that up to 10% of the likely voters responded "not sure" or some other candidate. If these "shy" voters show up at the polls as predicted by the pollsters, and if they voted just like the not shy voters, then the election result would have been 56%-44%, not 51%-40%. So, the 58%-38% result is within the margin of error of these polls. (If the "shy" voters break for McConnell in a 75%-25% split, then he gets 58% of the total votes.)
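The reallocation of the "not sure" respondents can be worked out in a few lines. The poll numbers (51%-40%) come from the chart; the function name and the way I parametrize the split are my own.

```python
def allocate_undecided(leader_pct, challenger_pct, leader_share_of_undecided):
    """Distribute the undecided percentage between two candidates."""
    undecided = 100.0 - leader_pct - challenger_pct
    leader = leader_pct + undecided * leader_share_of_undecided
    challenger = challenger_pct + undecided * (1.0 - leader_share_of_undecided)
    return leader, challenger

mcconnell_poll, mcgrath_poll = 51.0, 40.0

# Scenario 1: undecided voters split in the same proportion as decided voters.
proportional_share = mcconnell_poll / (mcconnell_poll + mcgrath_poll)
proportional = allocate_undecided(mcconnell_poll, mcgrath_poll, proportional_share)

# Scenario 2: undecided voters break 75%-25% for McConnell.
lopsided = allocate_undecided(mcconnell_poll, mcgrath_poll, 0.75)

print(f"proportional split: {proportional[0]:.0f}%-{proportional[1]:.0f}%")
print(f"75-25 split:        {lopsided[0]:.0f}%-{lopsided[1]:.0f}%")
```

The first scenario reproduces the 56%-44% figure, and the second gets McConnell to about 58%, matching the election result without any robocall effect.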

So, the data behind the line chart aren't suggesting that the election outcome is anomalous. This presents a problem with the Q-D and D-V green arrows as these pairs are not in sync.

***

In the D corner, we should consider the totality of the data available to the designer, not just what the designer chooses to utilize. The pivot of the chart is the flag annotating the "Trump robocall."

Here are some questions I'd ask the designer:

What else happened on October 31 in Kentucky?

What else happened on October 31, elsewhere in the country?

Was Trump featured in any other robocalls during the period portrayed?

How many robocalls were made by the campaign, and what other celebrities were featured?

Did any other campaign event or effort happen between the Trump robocall and election day?

Is there evidence that nothing else that happened after the robocall produced any value?

The chart commits the XYopia (i.e. X-Y myopia) fallacy of causal analysis. When the data analyst presents one cause and one effect, we are cued to think the cause explains the effect but in every scenario that is not a designed experiment, there are multiple causes at play. Sometimes, the more influential cause isn't the one shown in the chart.

***

Finally, let's draw out the connection between the last set of poll numbers and the election results. This shows why causal inference in observational data is such a beast.

Poll numbers are about a small number of people (500-1,000 in the case of Kentucky polls) who respond to polling. Election results are based on voters (> 2 million). An assumption made by the designer is that these polls are properly conducted, and their results are credible.

The chart above makes the claim that Trump's robocall gave McConnell 7% more votes than expected. This implies the robocall influenced at least 140,000 voters. Each such voter must fit the following criteria:

  • Was targeted by the Trump robocall
  • Was reached by the Trump robocall (phone was on, etc.)
  • Responded to the Trump robocall, by either picking up the phone or listening to the voice recording or dialing a call-back number
  • Did not previously intend to vote for McConnell
  • If reached by a pollster, would refuse to respond, or say not sure, or voting for McGrath or a third candidate
  • Had no other reason to change his/her behavior

Just take the first bullet for example. If we found a voter who switched to McConnell after October 31, and if this person was not on the robocall list, then this voter contributes to the unexpected gain in McConnell votes but weakens the case that the robocall influenced the election.

As analysts, our job is to find data to investigate all of the above. Some of these are easier to investigate. The campaign knows, for example, how many people were on the target list, and how many listened to the voice recording.


Book Review: Visualizing with Text by Richard Brath

Richardbarth_bookcover

The creative process is sometimes described in terms of diverge-converge cycles. The diverge step involves experimentation and rewards suspending disbelief, while excesses are curbed and concepts refined during the converge step. Richard Brath's just-released book Visualizing with Text is an important resource that expands our appreciation for the place of text in visual displays.

Books on data visualization fall into recognizable types, of which two popular ones are the style guide, such as those by Edward Tufte, Dona Wong, and Alberto Cairo, and the coding manual, such as those by Ben Fry (Processing) and Hadley Wickham (ggplot, Shiny). Brath's volume belongs to neither of those - it reads more like an encyclopedic catalog of how text can be incorporated into charts and graphs. He challenges us to blow up our imaginative space for characters, words, sentences, paragraphs and prose. It is a valuable aid for the diverge step of our creative process.

In modern data visualization, text is treated as an accessory, frequently found in titles, labels, legends, footnotes or surrounding text. Brath wants us to elevate text to the starring attraction. Starting with baby steps, such as direct labeling of lines and objects, and coordinating colors between chart elements and words, he experiments with inserting text into unlikely crannies, not shying away from ideas that even he admits may be somewhat of a dead-end.

One of the more immediately useful examples is the use of text labels that hug the lines on a line chart, similar to how roads and rivers are labeled on maps. I wish all software developers would implement this function without delay.

Barth_riverlabelsonlines

A more esoteric example is to replace these lines with small-size text, as Brath makes an analogy between sentences and lines.

Barth_textinlines

I am still deciding if this is a gold mine or a minefield. It is thought-provoking nonetheless.

Finally, the book includes some flights of fancy, like this one:

Barth_french_departments

The red superscripts are numeric codes for French departments (provinces), arranged in ascending order of a given metric, and placed in proportional distance within the prose!

The converge step is left to the reader, as Brath refrains from bullhorning his opinions about chart types, which is why readers should not expect a style guide. He includes many experimental graphics, and may provide the pros and cons of a form without registering a judgement.

Because many of these ideas have yet to enter the mainstream, we'd need to implement these ideas on our own, which is why readers will not find a coding manual. As mentioned above, even the simplest and least controversial tactic of directly labeling lines is not available in Excel, let alone text that hugs or replaces lines. (This proves Brath's point that our community has done text a disservice.) Other ideas explored in later chapters require such features as italicizing numeric proportions of a word, rather than the entire word.

Recently, text has become a mainstay of Big Data. Visualizing with Text is timely, relevant and provocative. It is also clearly written, and tightly organized. Chapter 13 neatly summarizes the key concepts that have appeared along the way. There are plenty of use cases, primarily derived from research or business. After reading this book, you'll revel in the new sandbox of text, and long to free yourself from the constraints of your tool.


***

I recommend that you get the paper copy of the book. I reviewed the electronic version, and what irony! As you may have guessed, the electronic version ruins the typesetting. On every page, certain paragraphs show up in a tiny font that resists all attempts to magnify, making Brath's case that legibility is an important metric for text visualization. Some of the more unusual fonts are dropped. The images are too small, even when popped up.

[P.S. Richard has a webpage where he included larger images and some code.]


Why you should expunge the defaults from Excel or (insert your favorite graphing program)

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

Redo_junkchats_fraziercornellreopeningsuccess2

This is an edited version of the chart used in Peter Frazier's presentation.

Pfrazier_cornellreopeningupdate

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the two lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level from the day numbers.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.