Locating the political center

I mentioned the September special edition of Bloomberg Businessweek on the election in this prior post. Today, I'm featuring another data visualization from the magazine.

Bloomberg_politicalcenter_print_sm

***

Here are the rightmost two charts.

Bloomberg_politicalcenter_rightside

Time runs from top to bottom, spanning four decades.

Each chart covers a political issue. These two charts concern abortion and marijuana.

The marijuana question (far right) has only two answers, legalize or don't legalize. The underlying data measure the proportions of people agreeing to each point of view. Roughly three-quarters of the population disagreed with legalization in 1980 while two-thirds agree with it in 2020.

Notice that there are no horizontal axis labels. This is a great editorial decision. Only coarse trends are of interest here. It's not hard to figure out the relative proportions. Adding labels would just clutter up the display.

By contrast, the abortion question has three answer choices. The middle option is "Sometimes," which is represented by a white color, with a dot pattern. This is an issue on which public opinion in aggregate has barely shifted over time.

The charts are organized in a small-multiples format. It's likely that readers are consuming each chart individually.

***

What about the dashed line that splits each chart in half? Why is it there?

The vertical line assists our perception of the proportions. Think of it as a single gridline.

In fact, this line is underplayed. The headline of the article is "tracking the political center." Where is the center?

Until now, we've paid attention to the boundaries between the differently colored areas. But those boundaries do not locate the political center!

The vertical dashed line is the political center; it represents the view of the median American. In 1980, the line sat inside the gray section, meaning the median American opposed legalizing marijuana. But the prevalent view was losing support over time and by 2010, there were more Americans wanting to legalize marijuana than not. This is when the vertical line crossed into the green zone.
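To make the idea concrete, here is a minimal sketch of how one might locate the category containing the median respondent. The function name and the left-to-right ordering of the answers are my assumptions, and the proportions are just the rounded figures quoted above, not the magazine's underlying data.

```python
def median_category(shares):
    """Return the label of the category that contains the 50% mark.

    `shares` is an ordered list of (label, proportion) pairs, laid out
    left to right as on the chart, with proportions summing to about 1.
    """
    cumulative = 0.0
    for label, share in shares:
        cumulative += share
        if cumulative >= 0.5:
            return label
    raise ValueError("proportions sum to less than 0.5")


# Rounded figures from the text; the ordering of the answers is assumed.
marijuana_1980 = [("legalize", 0.25), ("don't legalize", 0.75)]
marijuana_2020 = [("legalize", 0.67), ("don't legalize", 0.33)]

print(median_category(marijuana_1980))
print(median_category(marijuana_2020))
```

The boundary between colors tells you the split of opinion; this function tells you which side of the boundary the middle line sits on.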

The following charts draw attention to the middle line, instead of the color boundaries:

Junkcharts_redo_bloombergpoliticalcenterrightside

On these charts, as you glance down the middle line, you can see that for abortion, the political center has never exited the middle category while for marijuana, the median American didn't want to legalize it until an inflection point was reached around 2010.

I highlight these inflection points with yellow dots.

***

The effect on readers is entirely changed. The original charts draw attention to the areas first while the new charts pull your eyes to the vertical line.

 


Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.

Nyt_excessdeaths_south

Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as a percent of cases, and also better than deaths as a percent of the population. What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.

***

The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.
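A minimal sketch of that index calculation follows. The function names and the rows are hypothetical, made up for illustration; they are not the Times's code or data.

```python
def excess_death_index(actual, normal):
    """Deaths as a percent of normal: 100 means exactly the normal level."""
    if normal <= 0:
        raise ValueError("normal must be positive")
    return 100.0 * actual / normal


def percent_above_normal(actual, normal):
    """How far actual deaths sit above (or below) the normal level, in percent."""
    return excess_death_index(actual, normal) - 100.0


# Hypothetical rows: (state, week number, historical average deaths, actual deaths)
rows = [
    ("State A", 30, 4000, 6400),
    ("State B", 30, 500, 650),
]

for state, week, normal, actual in rows:
    print(state, week, round(percent_above_normal(actual, normal)))
```

Note how the large state and the small state land on the same scale once each is divided by its own normal level.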

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.
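The grouping and sorting step might look something like the sketch below. The records are hypothetical, and I am assuming the states are sorted from highest peak to lowest; the article does not spell out the direction.

```python
from collections import defaultdict

# Hypothetical records: (region, state, week number, excess death index)
records = [
    ("South", "State A", 28, 140),
    ("South", "State A", 30, 160),
    ("South", "State B", 31, 150),
    ("South", "State B", 29, 120),
    ("Northeast", "State C", 15, 300),
]


def order_states_by_peak(records):
    """Group states by region, then sort each region by peak index (descending)."""
    peaks = {}
    for region, state, week, index in records:
        key = (region, state)
        # Keep the week with the highest index seen so far for this state.
        if key not in peaks or index > peaks[key][1]:
            peaks[key] = (week, index)

    by_region = defaultdict(list)
    for (region, state), (week, index) in peaks.items():
        by_region[region].append((state, week, index))

    return {
        region: sorted(states, key=lambda s: s[2], reverse=True)
        for region, states in by_region.items()
    }


ordered = order_states_by_peak(records)
print(ordered["South"])
```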

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:

Nyt_excessdeaths_northeast3

 

New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.

***

One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.
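The cumulative view is a running total of the weekly gaps between actual and normal deaths. A small sketch, with made-up numbers and a hypothetical function name:

```python
from itertools import accumulate


def cumulative_excess(actual, normal):
    """Running total of weekly excess deaths (actual minus normal)."""
    weekly = [a - n for a, n in zip(actual, normal)]
    return list(accumulate(weekly))


# Hypothetical weekly deaths for one state.
actual = [1100, 1300, 1250, 1050]
normal = [1000, 1000, 1000, 1000]

print(cumulative_excess(actual, normal))
```

The weekly series rises and falls; the cumulative series keeps climbing for as long as deaths stay above normal.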

***

The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.

 


Working with multiple dimensions, an example from Germany

An anonymous reader submitted this mirrored bar chart about violent acts by extremists in the 16 German states.

Germanextremists_bars

At first glance, this looks like a standard design. On a second look, you might notice what the reader discovered: the chart uses two different scales, one for each side. The left side (red) depicting left-wing extremism is artificially compressed relative to the right side (blue). Not sure if this reflects the political bias of the publication - but in any case, this distortion means the only way to consume this chart is to read the numbers.

Even after fixing the scales, this design is challenging for the reader. It's unnatural to compare two years by looking first below then above. It's not simple to compare across states, and even harder to compare left- and right-wing extremism (due to mirroring).

The chart feels busy because the entire dataset is printed on it. I appreciate not including a redundant horizontal axis. (I wonder if the designer first removed the axis, then edited the scale on one side, not realizing the distortion.) Another nice touch, hidden in the legend, is the country totals.

I present two alternatives.

The first is a small-multiples "bumps chart".

Redo_junkcharts_germanextremists_sidebysidelines

Each plot presents the entire picture within a state. You can see the general level of violence, the level of left- and right-wing extremism, and their year-on-year change. States can be compared holistically.

Several German state names are rather long, so I explored a horizontal orientation. In this case, a connected dot plot may be more appropriate.

Redo_junkcharts_germanextremists_dots

The sign of a good multi-dimensional visual display is whether readers can easily learn complex relationships. Depending on the question of interest, the reader can mentally elevate parts of this chart. One can compare the set of blue arrows to the set of red arrows, or focus on just blue arrows pointing right, or red arrows pointing left, or all arrows for Berlin, etc.

 

[P.S. Anonymous reader said the original chart came from the Augsburger newspaper. This link in German contains more information.]


Cornell must remove the logs before it reopens the campus in the fall

Against all logic, Cornell announced last week it would re-open in the fall because a mathematical model under development by several faculty members and grad students predicts that a "full re-opening" would lead to 80 percent fewer infections than a scenario of full virtual instruction. That's what was reported by the media.

The model is complicated, with loads of assumptions, and the report is over 50 pages long. I will put up my notes on how they attained this counterintuitive result in the next few days. The bottom line is - and the research team would agree - it is misleading to describe the analysis as "full re-open" versus "no re-open". The so-called full re-open scenario assumes the entire community including students, faculty and staff submits to a full program of test-trace-isolate, including (mandatory) PCR diagnostic testing once every five days throughout the 16-week semester, and immediate quarantine and isolation of new positive cases, as well as those in contact with such persons, plus full compliance with this program. By contrast, it assumes students do not get tested in the online instruction scenario. In other words, the researchers expect Cornell to get done what U.S. governments at all levels have failed to do until now.

[7/8/2020: The post on the Cornell model is now up on the book blog. Here.]

The report takes us back to the good old days of best-base-worst-case analysis. There is no data for validating such predictions so they performed sensitivity analyses, defined as changing one factor at a time assuming all other factors are fixed at "nominal" (i.e. base case) values. In a large section of the report, they publish a series of charts of the following style:

Cornell_reopen_sensitivity

Each line here represents one of the best-base-worst cases (respectively, orange-blue-green). Every parameter except one is given the "nominal" value (which represents the base case). The parameter that is manipulated is shown on the horizontal axis, and for the above chart, the variable is the assumption of average number of daily contacts per person. The vertical axis shows the main outcome variable, which is the percentage of the community infected by the end of term.

The flatness of the lines in the above chart appears to say that the outcome is quite insensitive to the change in the average daily contact rate under all three scenarios - until the daily contact rate rises above 10 per person per day. It also appears to show that the blue line is roughly midway between the orange and the green, so the percent infected is slightly less than halved under the optimistic scenario, and a bit more than doubled under the pessimistic scenario, relative to the blue line.

Look again.

The vertical axis is presented in log scale, and only labeled at values 1% and 10%. About midway between 1 and 10 on the horizontal axis, the outcome value has already risen above 10%. Because of the log transformation, above 10%, each tick represents an increase of 10 percentage points. So, the top of the vertical axis indicates 80% of the community being infected! Nothing in the description or labeling of the vertical axis prepares the reader for this.
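For readers who want to check the tick arithmetic, here is a small sketch. It assumes the axis follows the usual convention of drawing unlabeled ticks at integer multiples within each decade; the function name is mine.

```python
import math


def decade_ticks(start):
    """Tick values a log axis conventionally draws within one decade."""
    return [start * k for k in range(1, 10)]


# Ticks from 10% upward: 10, 20, ..., 90 (percent of community infected).
ticks = decade_ticks(10)

for value in ticks:
    # Position is measured in decades above the 1% label.
    print(f"{value:>3}%  position = {math.log10(value):.3f}")
```

The ticks get closer together as the values grow, which is why the jump from 10% to 80% occupies less vertical space than the jump from 1% to 10%.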

The report assumes a fixed value for average daily contacts of 8 (I rounded the number for discussion), which is invariable across all three scenarios. Drawing a vertical line about eight-tenths of the way towards 10 appears to signal that this baseline daily contact rate places the outcome in the relatively flat part of the curve.

Look again.

The horizontal axis too is presented in log scale. To birth one log-scale may be regarded as a misfortune; to birth two log scales looks like carelessness. 

Since there exists exactly one tick beyond 10 on the horizontal axis, the right-most value is 20. The model has been run for values of average daily contacts from 1 to 20, with unit increases. I can think of no defensible reason why such a set of numbers should be expressed in a log scale.
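To see how much the horizontal log scale moves the baseline, one can compare where the value 8 lands on a linear axis and on a log axis, both running from 1 to 20. This is only a sketch; the function names are mine and the endpoints are as I read them off the chart.

```python
import math


def linear_position(x, lo, hi):
    """Fraction of the axis length at which x sits on a linear scale."""
    return (x - lo) / (hi - lo)


def log_position(x, lo, hi):
    """Fraction of the axis length at which x sits on a log scale."""
    return (math.log10(x) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


baseline = 8  # rounded baseline number of daily contacts
print(f"linear: {linear_position(baseline, 1, 20):.2f}")
print(f"log:    {log_position(baseline, 1, 20):.2f}")
```

On the linear axis, the baseline sits a bit over a third of the way across; on the log axis, it is pushed to roughly 70 percent of the way.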

For the vertical axis, the outcome is a proportion, which is confined to within 0 percent and 100 percent. It's not a number that can explode.

***

Every log scale on a chart is birthed by its designer. I know of no software that automatically performs log transforms on data without the user's direction. (I write this line with trepidation, hoping that I haven't planted a bad idea in some software developer's head.)

Here is what the shape of the original data looks like - without any transformation. All software (I'm using JMP here) produces something of this type:

Redo-cornellreopen-nolog

At the baseline daily contact rate value of 8, the model predicts that 3.5% of the Cornell community will get infected by the end of the semester (again, assuming strict test-trace-isolate is fully implemented and complied with). Under the pessimistic scenario, the proportion jumps to 14%, which is 4 or 5 times higher than the base case. In this worst-case scenario, if the daily contact rate were about twice the assumed value (just over 16), half of the community would be infected in 16 weeks!

I actually do not understand how there could only be 8 contacts per person per day when the entire student body has returned to 100% in-person instruction. (In the report, they even say the 8 contacts could include multiple contacts with the same person.) I imagine an undergrad student in a single classroom with 50 students. This assumption says the average student in this class only comes into contact with at most 8 of those. That's one class. How about other classes? small tutorials? dining halls? dorms? extracurricular activities? sports? parties? bars?

Back to graphics. Something about the canonical chart irked the report writers so they decided to try a log scale. Here is the same chart with the vertical axis in log scale:

Redo-cornellreopen-logy

The log transform produces a visual distortion. On the right side, where the three lines are diverging rapidly, the log transform pulls them together. On the left side, where the three lines are close together, the log transform pulls them apart.

Recall that on the log scale, a straight line is exponential growth. Look at the green line (worst case). That line is approximately linear so in the pessimistic scenario, despite assuming full compliance with a strict test-trace-isolate regimen, the cases are projected to grow exponentially.
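A quick numerical check of that rule, using invented series rather than the Cornell model's output: take the log of an exponential series and the step-to-step change is constant (a straight line), while the log of a linearly growing series flattens out.

```python
import math


def log_steps(values):
    """Differences between consecutive log10 values (the slope on a log axis)."""
    logs = [math.log10(v) for v in values]
    return [b - a for a, b in zip(logs, logs[1:])]


# Invented series: one grows by 50% per step, the other by a fixed 1.5 per step.
exponential = [1.0 * 1.5 ** x for x in range(6)]
linear = [1.0 + 1.5 * x for x in range(6)]

print([round(s, 3) for s in log_steps(exponential)])
print([round(s, 3) for s in log_steps(linear)])
```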

Something about that last chart still irked the report writers so they decided to birth a second log scale. Here is the chart they ultimately settled on:

Redo-cornellreopen-logylogx

As with the other axis, the effect of the log transform is to squeeze the larger values (on the right side) and spread out the smaller values (on the left side). After this cosmetic surgery, the left side looks relatively flat while the right side looks steep.

In the next version of the Cornell report, they should replace all these charts with ones using linear scales.

***

Upon discovering this graphical mischief, I wonder if the research team received a mandate that includes a desired outcome.

 

[P.S. 7/8/2020. For more on the Cornell model, see this post.]


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideways data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideways date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.
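The difference between the two metrics is easy to see in code. Below is a minimal sketch with invented daily counts; I am assuming a trailing window for the moving average, which is only one of several possible definitions.

```python
def moving_average(values, window=7):
    """Trailing moving average: one value per day once a full window exists."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def weekly_average(values, days_per_week=7):
    """One value per complete, non-overlapping week."""
    full = len(values) - len(values) % days_per_week
    return [
        sum(values[i : i + days_per_week]) / days_per_week
        for i in range(0, full, days_per_week)
    ]


# Invented daily admissions for two weeks.
daily = [10, 12, 9, 11, 13, 8, 14, 15, 17, 14, 16, 18, 13, 19]

print(moving_average(daily))  # eight overlapping averages
print(weekly_average(daily))  # two weekly averages
```

The moving average produces a new, slightly different number every day; the weekly average collapses the same data into one number per week.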

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts the Dow will reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th, 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


Consumption patterns during the pandemic

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. Spending is grouped by categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:

Visualcapitalist_spending_generalcommerce

The visual design is clean and efficient. Even too sparse because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom, and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.

Visualcapitalist_spending_fooddelivery

I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.

Visualcapitalist_spending_travel

I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more refunds than they are making new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.
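The arithmetic is worth spelling out. Here is a small sketch with invented dollar amounts; the function name is hypothetical and I am assuming the chart uses the standard year-on-year formula.

```python
def yoy_change(current, prior):
    """Year-on-year change in percent, relative to the prior-year amount."""
    return 100.0 * (current - prior) / prior


prior_year_spend = 200       # invented net cruise spending a year ago
new_bookings = 30            # invented new bookings this year
refunds = 60                 # invented refunds credited back to cards
net_spend = new_bookings - refunds  # negative: more refunded than booked

print(yoy_change(0, prior_year_spend))          # spending falls to zero
print(yoy_change(net_spend, prior_year_spend))  # refunds exceed bookings
```

Zero spending gives exactly -100 percent. The line can only go below that when net spending itself turns negative.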

***

Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit card. It's not likely the data vendor has a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing drives data analysts nuts: the spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.


The elusive meaning of black paintings and red blocks

Joe N, a longtime reader, tweeted about the following chart, by the People's Policy Project:

3p_oneyearinonemonth_laborflow

This is a simple column chart containing only two numbers, far exceeded by the count of labels and gridlines.

I look at charts like the lady staring at these Ad Reinhardts:

 

SUBJPREINHARDT2-videoSixteenByNine1050

My artist friends say the black squares are not the same, if you look hard enough.

Here is what I learned after one such sitting:

The tiny data labels sitting on the inside top edges of the columns hint that the right block is slightly larger than the left block.

The five labels of the vertical axis serve no purpose, nor the gridlines.

The horizontal axis for time is reversed, with 2019 appearing after 2020 (when read left to right).

The left block has one month while the right block has 12 months. This is further confused by the word "All" which shares the same starting and ending letters as "April".

As far as I can tell, the key message of this chart is that the month of April has the impact of a full year. It's like 12 months of outflows from employment hitting the economy in one month.

***

My first response is this chart:

Junkcharts_oneyearinonemonth_laborflow_1

Breaking the left block into 12 pieces, and color-coding the April piece brings out the comparison. You can also see that in 2019, the outflows from employment to unemployment were steady month to month.

Next, I want to see what happens if I restore the omitted months of Jan to March, 2020.

Junkcharts_oneyearinonemonth_laborflow_2

The story changes slightly. Now, the chart says that the first four months have already exceeded the full year of 2019.

Since the values hold steady month to month, with the exception of April 2020, I make a monthly view:

Junkcharts_oneyearinonemonth_laborflow_monthly_bar_1

You can see the slight nudge-up in March 2020 as well. This draws more attention to the break in pattern.

For time-series data, I prefer to look at line charts:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_1

As I explained in this post about employment statistics (or Chapter 6 of Numbersense (link)), the Bureau of Labor Statistics classifies people into three categories: Employed, Unemployed and Not in Labor Force. Exits from Employed to Unemployed status contribute to unemployment in the U.S. To depict a negative trend, it's often natural to use negative numbers:

Junkcharts_oneyearinonemonth_laborflow_monthly_line_neg_1

You may realize that this data series paints only a partial picture of the health of the labor market. While some people exit the Employed status each month, there are others who re-enter or enter the Employed status. We should really care about net flows.

Junkcharts_oneyearinonemonth_laborflow_net_lines

In all of 2019, there were more entrants than exits, leading to a slightly positive net inflow to the Employed status from Unemployed (blue line). In April 2020, the red line (exits) drags the blue line dramatically.
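Net flow is just entries minus exits, month by month. The sketch below uses invented numbers (in thousands of people), not the actual labor statistics, and the function name is mine.

```python
def net_flows(entries, exits):
    """Net inflow to Employed status per month: entries minus exits."""
    if len(entries) != len(exits):
        raise ValueError("series must cover the same months")
    return [e - x for e, x in zip(entries, exits)]


# Invented monthly flows between Unemployed and Employed, in thousands.
entries = [1800, 1750, 1700]   # Unemployed -> Employed
exits = [1600, 1900, 7500]     # Employed -> Unemployed

print(net_flows(entries, exits))
```

A small positive number means slightly more people found jobs than lost them; one enormous month of exits swamps everything else.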

Of course, even this chart is omitting important information. There are also flows from Employed to and from Not in Labor Force.

 

 

 

 

 


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer is roughly but not precisely decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.
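The two ordering rules are easy to enforce when preparing the data for a grouped column chart. Here is a minimal sketch in Python, assuming the data arrive as (date, county, new cases) records with ISO-formatted dates. The function name, the county labels "A" and "B", and the numbers are all made up for illustration; this is not the department's data or code.

```python
from collections import defaultdict

def build_grouped_columns(records, county_order):
    """Arrange records for a grouped column chart.

    Returns a list of (date, values) pairs, with dates in chronological
    order and values listed in the same county order for every date.
    A county with no record on a date gets None.
    """
    by_date = defaultdict(dict)
    for date, county, new_cases in records:
        by_date[date][county] = new_cases
    # ISO date strings (YYYY-MM-DD) sort chronologically as plain strings.
    return [
        (date, [by_date[date].get(county) for county in county_order])
        for date in sorted(by_date)
    ]

# Hypothetical input, deliberately out of order.
records = [
    ("2020-05-03", "B", 40),
    ("2020-05-02", "A", 55),
    ("2020-05-02", "B", 30),
    ("2020-05-03", "A", 25),
]
groups = build_grouped_columns(records, county_order=["A", "B"])
for date, values in groups:
    print(date, values)
```

The county order is fixed once, up front, rather than re-sorted by height within each day, so a reader can follow any one county across the groups.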

The chart is still very noisy, with no apparent message.

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear time lag in the reporting of cases in the State of Georgia. This chart should always exclude the last few days. The case counts keep going up until they stabilize. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when, in fact, the drop reflects a lag in reporting.
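The side-by-side comparison can be automated: take two snapshots of the same chart published on different days, and see which dates had their counts revised. A rough sketch in Python follows, assuming each snapshot is a dictionary from ISO date to reported count. The function names and all the numbers are hypothetical, not Georgia's actual figures.

```python
def revisions(earlier, later):
    """Return {date: (old_count, new_count)} for dates present in both
    snapshots whose counts differ."""
    return {
        date: (earlier[date], later[date])
        for date in sorted(earlier.keys() & later.keys())
        if earlier[date] != later[date]
    }

def first_unstable_date(earlier, later):
    """Earliest date whose count changed between snapshots, or None."""
    changed = revisions(earlier, later)
    return min(changed) if changed else None

# Hypothetical snapshots of daily new cases for one county.
snapshot_week1 = {"2020-05-06": 80, "2020-05-07": 62, "2020-05-08": 30, "2020-05-09": 5}
snapshot_week2 = {"2020-05-06": 80, "2020-05-07": 70, "2020-05-08": 66, "2020-05-09": 58}

changed = revisions(snapshot_week1, snapshot_week2)
cutoff = first_unstable_date(snapshot_week1, snapshot_week2)
print(changed)
print("counts still being revised from", cutoff)
```

Dates on or after the cutoff are the ones to exclude, or at least to mark as provisional, in the next version of the chart.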

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionable

This graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, the final numbers can eventually show anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.
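A toy simulation makes the downward bias concrete. The sketch below, in Python, assumes a "completeness curve": the share of a day's cases that have been reported k days after the fact. The curve and the flat series of 100 true cases per day are invented for illustration; real reporting lags would have to be estimated from successive data releases.

```python
def reported_so_far(true_counts, completeness):
    """Apply a reporting lag to a chronological list of true daily counts.

    completeness[k] is the assumed share of a day's cases already reported
    k days after that day (k = 0 is the most recent day). Days older than
    the curve are treated as fully reported.
    """
    n = len(true_counts)
    reported = []
    for i, true in enumerate(true_counts):
        age = n - 1 - i  # days between this date and the latest date
        share = completeness[age] if age < len(completeness) else 1.0
        reported.append(round(true * share))
    return reported

true_counts = [100] * 10              # flat: no real decline at all
completeness = [0.2, 0.5, 0.7, 0.9]   # hypothetical lag curve
reported = reported_so_far(true_counts, completeness)
print(reported)
```

Even though the true series is flat, the reported series slopes down over the last four days. Every reported number is at or below the truth, and the gap widens toward the right edge of the chart.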

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just for how to count things but also for how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, and so on, the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.

 


How Covid-19 deaths sneaked into Florida's statistics

Like many others, some Floridians are questioning their state's Covid statistics. It's clear there are numerous "degrees of freedom" for politicians to manipulate the numbers. What's not clear is who's influencing these decisions. Are they public-health experts, donors, voters, or whom?

A Twitter follower sent in the following chart, embedded in an informative article in the Sun Sentinel:

Sun-sentinel_pneumonia_percent_of_total

I like the visual design. It's clean, and conveys a moderately complex concept effectively. The reader may not immediately get what metrics are being plotted, but the idea that the blue line should operate within the gray area, until it doesn't, is easily grasped. The range is technically an uncertainty band.

The metric is the proportion of total deaths (all causes) that are attributed to pneumonia and flu. Typical influenza deaths are found in that category. This chart investigates whether there were excess (unexplained) P&F deaths. The gray band measures the variability in the proportions of past years. When the blue line operates inside the band, the metric is normal. When it pierces the upper edge of the band, which happened here around week 25, a rare event has occurred.
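The logic of the band can be sketched in a few lines of Python. The article does not say how the gray band was computed, so the rule below (mean plus or minus k standard deviations of the same week in past seasons) is my assumption, as are the function name and the made-up percentages.

```python
from statistics import mean, stdev

def weeks_above_band(current, past_seasons, k=2.0):
    """Return the 1-based week numbers in which the current season's value
    exceeds mean + k * sd of the same week across past seasons.

    Assumed band definition; each season is a list of weekly P&F shares
    (percent of all-cause deaths), aligned by flu-season week.
    """
    flagged = []
    for week, value in enumerate(current, start=1):
        history = [season[week - 1] for season in past_seasons]
        upper = mean(history) + k * stdev(history)
        if value > upper:
            flagged.append(week)
    return flagged

# Hypothetical percentages for three weeks of three past seasons.
past_seasons = [
    [6.0, 7.0, 6.5],
    [6.4, 7.4, 6.9],
    [6.2, 7.2, 6.7],
]
current = [6.3, 7.3, 9.0]
flagged = weeks_above_band(current, past_seasons)
print(flagged)
```

With these invented numbers, only the third week is flagged: the first two sit inside the band, while 9.0 is far above anything seen in that week of past seasons.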

The concern on Twitter was about the horizontal axis. Those integer labels can be confusing. The designer places a "how to read this" message in a footnote, explaining that week 1 is the first week of a typical flu season (which corresponds to late September 2019). This nugget of information helps a lot. We can see that the flu season peaks around week 20, and by the spring, it should be waning. Not so in 2020.

It's hard to escape the conclusion that deaths from Covid-19 are hiding inside the statistics of Pneumonia & Flu. As a statistician, I want to tell you Statistics Don't Lie! You can hide the data along one dimension, but they show up elsewhere. Misclassifying the deaths does buy someone some time. It takes a few weeks to compile all-cause mortality data (gasp, the CDC said mortality records are only 75 percent complete after 8 weeks!).

The other small problem with the chart is the labeling. Neither axis has labels. The data label that shows up when you click on the line might be a default from the software that can't be turned off. It shows the two numbers being plotted without labels.

***

Here is a re-working of the chart that tells the story:

Redo_junkcharts_sunsentinelpneunominacovid19

The proportion of deaths attributed to P&F and Covid together is roughly double the upper end of what Florida should be seeing this time of the year (without Covid). Covid-19 accounts for half the gap. The other half is still being classified as P&F. However, I suspect the CDC will adjust these numbers later to reflect the reality. (In making this chart, I also learned that Florida stopped including seasonal visitors in the death counts. This is egregious manipulation. If someone died while in Florida, they should be counted. I didn't investigate whether this counting rule applies only to Covid-19 deaths, or to deaths from all causes. If they had always done that, then I might give them a pass.)

On second thought, maybe not. The other egregious thing that appears to have happened is that the Florida state health department unplugged their prior website (https://www.floridahealth.gov) so no one can cross-reference any prior documents. The only website I can access now for Florida state health is a Covid-specific site (https://floridahealthcovid19.gov).

Florida_state_health_websites

There must be something juicy on the previous influenza page, no?

***

Lastly, when you look at my chart, please pretend that the last week is not on there. In all likelihood, the "drop" is fake because the mortality data have not been fully updated. My chart contains one more week than the Sun Sentinel chart. So you can see that the drastic decline shown on their chart turned into a big uptick on mine (next to last week).

This is a common mistake on many charts I see these days. Half-baked numbers are shown next to fully-baked ones.