Everything in Texas is big, but not this BIG

Long-time reader John forwarded the following chart via Twitter.

Covidtracking_texassquare

The chart shows the recent explosive growth in deaths due to Covid-19 in Texas. John flagged this graphic as yet another example in which the data are encoded to the lengths of the squares, not their areas.

Fixing this chart just requires fixing the length of one side of the square. I also flipped it to make a conventional column chart.

Redo_texasdeathsquares_process

The final product:

Redo_texasdeaths_columns

An important qualification lurks in the footnote; it is directly applied to the label of July.

How much visual distortion is created when data are encoded to the lengths and not the areas? The following chart shows what readers see, assuming they correctly perceive the areas of those squares. The value for March is held the same as above while the other months show the death counts implied by the relative areas of the squares.

Redo_texasdeaths_distortion

Owing to squaring, the smaller counts are artificially compressed while the big numbers are massively exaggerated.


This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):

Economist_covidpoll

The survey by Dalia Research asked people about the satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes", the proportion of people who rated their government response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.

Redo_economistcovidpolling_scatter

The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appears to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:

Redo_economistcovidpolling_deathspermillion_2

The right chart shows deaths per million while the left chart shows total deaths.

***

In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:

Redo_economistcovidpolling_swappedvar

Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!


Cornell must remove the logs before it reopens the campus in the fall

Against all logic, Cornell announced last week it would re-open in the fall because a mathematical model under development by several faculty members and grad students predicts that a "full re-opening" would lead to 80 percent fewer infections than a scenario of full virtual instruction. That's what was reported by the media.

The model is complicated, with loads of assumptions, and the report is over 50 pages long. I will put up my notes on how they attained this counterintuitive result in the next few days. The bottom line is - and the research team would agree - it is misleading to describe the analysis as "full re-open" versus "no re-open". The so-called full re-open scenario assumes the entire community including students, faculty and staff submit to a full program of test-trace-isolate, including (mandatory) PCR diagnostic testing once every five days throughout the 16-week semester, and immediate quarantine and isolation of new positive cases, as well as those in contact with such persons, plus full compliance with this program. By contrast, it assumes students do not get tested in the online instruction scenario. In other words, the researchers expect Cornell to get done what the U.S. governments at all levels failed to do until now.

[7/8/2020: The post on the Cornell model is now up on the book blog. Here.]

The report takes us back to the good old days of best-base-worst-case analysis. There is no data for validating such predictions so they performed sensitivity analyses, defined as changing one factor at a time assuming all other factors are fixed at "nominal" (i.e. base case) values. In a large section of the report, they publish a series of charts of the following style:

Cornell_reopen_sensitivity

Each line here represents one of the best-base-worst cases (respectively, orange-blue-green). Every parameter except one is given the "nominal" value (which represents the base case). The parameter that is manpulated is shown on the horizontal axis, and for the above chart, the variable is the assumption of average number of daily contacts per person. The vertical axis shows the main outcome variable, which is the percentage of the community infected by the end of term.

This flatness of the lines in the above chart appears to say that the outcome is quite insensitive to the change in the average daily contact rate under all three scenarios - until the daily contact rises above 10 per person per day. It also appears to show that the blue line is roughly midway between the orange and the green so the percent infected is slightly less-than halved under the optimistic scenario, and a bit more than doubled under the pessimistic scenario, relative to the blue line.

Look again.

The vertical axis is presented in log scale, and only labeled at values 1% and 10%. About midway between 1 and 10 on the horizontal axis, the outcome value has already risen above 10%. Because of the log transformation, above 10%, each tick represents an increase of 10% in proportion. So, the top of the vertical axis indicates 80% of the community being infected! Nothing in the description or labeling of the vertical axis prepares the reader for this.

The report assumes a fixed value for average daily contacts of 8 (I rounded the number for discussion), which is invariable across all three scenarios. Drawing a vertical line about eight-tenths of the way towards 10 appears to signal that this baseline daily contact rate places the outcome in the relatively flat part of the curve.

Look again.

The horizontal axis too is presented in log scale. To birth one log-scale may be regarded as a misfortune; to birth two log scales looks like carelessness. 

Since there exists exactly one tick beyond 10 on the horizontal axis, the right-most value is 20. The model has been run for values of average daily contacts from 1 to 20, with unit increases. I can think of no defensible reason why such a set of numbers should be expressed in a log scale.

For the vertical axis, the outcome is a proportion, which is confined to within 0 percent and 100 percent. It's not a number that can explode.

***

Every log scale on a chart is birthed by its designer. I know of no software that automatically performs log transforms on data without the user's direction. (I write this line with trepidation wishing that I haven't planted a bad idea in some software developer's head.)

Here is what the shape of the original data looks like - without any transformation. All software (I'm using JMP here) produces something of this type:

Redo-cornellreopen-nolog

At the baseline daily contact rate value of 8, the model predicts that 3.5% of the Cornell community will get infected by the end of the semester (again, assuming strict test-trace-isolate fully implemented and complied).  Under the pessimistic scenario, the proportion jumps to 14%, which is 4 or 5 times higher than the base case. In this worst-case scenario, if the daily contact rate were about twice the assumed value (just over 16), half of the community would be infected in 16 weeks!

I actually do not understand how there could only be 8 contacts per person per day when the entire student body has returned to 100% in-person instruction. (In the report, they even say the 8 contacts could include multiple contacts with the same person.) I imagine an undergrad student in a single classroom with 50 students. This assumption says the average student in this class only comes into contact with at most 8 of those. That's one class. How about other classes? small tutorials? dining halls? dorms? extracurricular activities? sports? parties? bars?

Back to graphics. Something about the canonical chart irked the report writers so they decided to try a log scale. Here is the same chart with the vertical axis in log scale:

Redo-cornellreopen-logy

The log transform produces a visual distortion. On the right side, where the three lines are diverging rapidly, the log transform pulls them together. On the left side, where the three lines are close together, the log transform pulls them apart.

Recall that on the log scale, a straight line is exponential growth. Look at the green line (worst case). That line is approximately linear so in the pessimistic scenario, despite assuming full compliance to a strict test-trace-isolate regimen, the cases are projected to grow exponentially.

Something about that last chart still irked the report writers so they decided to birth a second log scale. Here is the chart they ultimately settled on:

Redo-cornellreopen-logylogx

As with the other axis, the effect of the log transform is to squeeze the larger values (on the right side) and spread out the smaller values (on the left side). After this cosmetic surgery, the left side looks relatively flat while the right side looks steep.

In the next version of the Cornell report, they should replace all these charts with ones using linear scales.

***

Upon discovering this graphical mischief, I wonder if the research team received a mandate that includes a desired outcome.

 

[P.S. 7/8/2020. For more on the Cornell model, see this post.]


The discontent of circular designs

You have two numbers +84% and -25%.

The textbook method to visualize this pair is to plot two bars. One bar in the positive direction, the other in the negative direction. The chart is clear (more on the analysis later).

Redo_pbs_mask1

But some find this graphic ugly. They don’t like straight lines, right angles and such. They prefer circles, and bends. Like PBS, who put out the following graphic that was forwarded to me by Fletcher D. on twitter:

Maskwearing_racetrack

Bending the columns is not as simple as it seems. Notice that the designer adds red arrows pointing up and down. Because the circle rounds onto itself, the sense of direction is lost. Now, readers must pick up the magnitude and the direction separately. It doesn’t help that zero is placed at the bottom of the circle.

Can we treat direction like we would on a bar chart? Make counter-clockwise the negative direction. This is what it looks like:

Redo_pbsmaskwearing

But it’s confusing. I made the PBS design worse because now, the value of each position on the circle depends on knowing whether the arrow points up or down. So, we couldn’t remove those red arrows.

The limitations of the “racetrack” design reveal themselves in similar data that are just a shade different. Here are a couple of scenarios to ponder:

  1. You have growth exceeding 100%. This is a hard problem.
  2. You have three or more rates to compare. Making one circle for each rate quickly becomes cluttered. You may make a course with multiple racetracks. But anyone who runs track can tell you the outside lanes are not the same distance as the inside. I wrote about this issue in a long-ago post (see here).

***

For a Trifecta Checkup (link), I'd also have concerns about the analytics. There are so many differences between the states that have required masks and states that haven't - the implied causality is far from proven by this simple comparison. For example, it would be interesting to see the variability around these averages - by state or even by county.


Presented without comment

Weekend assignment - which of these tells the story better?

Ourworldindata_cases_log

Or:

Ourworldindata_cases_linear

The cop-out answer is to say both. If you must pick one, which one?

***

When designing a data visualization as a living product (not static), you'd want a design that adapts as the data change.


What is the price for objectivity

I knew I had to remake this chart.

TMC_hospitalizations

The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideway data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideway date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.

Redo_junkcharts_texas_covidhospitaladmissions_1

Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:

Redo_junkcharts_texas_covidhospitaladmissions_2

This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.

***

On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts Dow to reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else  "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.


Visualizing black unemployment in the U.S.

In a prior post, I explained how the aggregate unemployment rate paints a misleading picture of the employment situation in the United States. Even though the U3 unemployment rate in 2019 has returned to the lowest level we have seen in decades, the aggregate statistic hides some concerning trends. There is an alarming rise in the proportion of people considered "not in labor force" by the Bureau of Labor Statistics - these forgotten people are not counted as "employable": when a worker drops out of the labor force, the unemployment rate ironically improves.

In that post, I looked at the difference between men and women. This post will examine the racial divide, whites and blacks.

I did not anticipate how many obstacles I'd encounter. It's hard to locate a specific data series, and it's harder to know whether the lack of search results indicates the non-existence of the data, or the incompetence of the search engine. Race-related data tend not to be offered in as much granularity. I was only able to find quarterly data for the racial analysis while I had monthly data for the gender analysis. Also, I only have data from 2000, instead of 1990.

***

As before, I looked at the official unemployment rate first, this time presented by race. Because whites form the majority of the labor force, the overall unemployment rate (not shown) is roughly the same as that for whites, just pulled up slightly toward the line for blacks.

Jc_unemploybyrace

The racial divide is clear as day. Throughout the past two decades, black Americans are much more likely to be unemployed, and worse during recessions.

The above chart determines the color encoding for all the other graphics. Notice that the best employment situations occurred on either end of this period, right before the dotcom bust in 2000, and in 2019 before the Covid-19 pandemic. As explained before, despite the headline unemployment rate being the same in those years, the employment situation was not the same.

***

Here is the scatter plot for white Americans:

Jc_unemploybyrace_scatter_whites

Even though both ends of the trajectory are marked with the same shade of blue, indicating almost identical (low) rates of unemployment, we find that the trajectory has failed to return to its starting point after veering off course during the recession of the early 2010s. While the proportion of part-time workers (counted as employed) returned to 17.5% in 2019, as in 2000, about 15 percent more whites are now excluded from the unemployment rate calculation.

The experience of black Americans appears different:

Jc_unemploybyrace_scatter_blacks

During the first decade, the proportion of black Americans dropping out of the labor force accelerated while among those considered employed, the proportion holding part-time jobs kept increasing. As the U.S. recovered from the Great Recession, we've seen a boomerang pattern. By 2019, the situation was halfway back to 2000. The last available datum for the first quarter of 2020 is before Covid-19; it actually showed a halt of the boomerang.

If the pattern we saw in the prior post holds for the Covid-19 world, we would see a marked spike in the out-of-labor-force statistic, coupled with a drop in part-time employment. It appeared that employers were eliminating part-time workers first.

***

One reader asked about placing both patterns on the same chart. Here is an example of this:

Jc_unemploybyrace_scatter_both

This graphic turns out okay because the two strings of dots fit tightly into the grid while not overlapping. There is a lot going on here; I prefer a multi-step story than throwing everything on the wall.

There is one insight that this chart provides that is not easily observed in two separate plots. Over the two decades, the racial gap has narrowed in these two statistics. Both groups have traveled to the top right corner, which is the worst corner to reside -- where more people are classified as not employable, and more of the employed are part-time workers.

The biggest challenge with making this combined scatter plot is properly controlling the color. I want the color to represent the overall unemployment rate, which is a third data series. I don't want the line for blacks to be all red, and the line for whites to be all blue, just because black Americans face a tough labor market always. The color scheme here facilitates cross-referencing time between the two dot strings.


Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.

Nytimes_newyorkersleft_overallmap

The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.

Nytimes_newyorksleft_percentathome

We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.

Redo_nyt_newyorkerslefttown

The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.

Nyt_newyorkersleft_percenthomebyincome

The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th , 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]

 

 


Consumption patterns during the pandemic

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. The spending are grouped by categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:

Visualcapitalist_spending_generalcommerce

The visual design is clean and efficient. Even too sparse because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom, and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.

Visualcapitalist_spending_fooddelivery

I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.

Visualcapitalist_spending_travel

I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more refunds than are making new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.

***

Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit cards. It's not likely the data vendor have a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing that drives data analyst nuts. The spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear that what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.


Hope and reality in one Georgia chart

Over the weekend, Georgia's State Health Department agitated a lot of people when it published the following chart:

Georgia_top5counties_covid19

(This might have appeared a week ago as the last date on the chart is May 9 and the title refers to "past 15 days".)

They could have avoided the embarrassment if they had read my article at DataJournalism.com (link). In that article, I lay out a set of the "unspoken conventions," things that visual designers are, or should be, doing more or less in their sleep. Under the section titled "Order", I explain the following two "rules":

  • Place values in the natural order when it is available
  • Retain the same order across all plots in a panel of charts

In the chart above, the natural order for the horizontal (time) axis is time running left to right. The order chosen by the designer  is roughly but not precisely decreasing height of the tallest column in each daily group. Many observers suggested that the columns were arranged to give the appearance of cases dropping over time.

Within each day, the counties are ordered in decreasing number of new cases. The title of the chart reads "number of cases over time" which sounds like cumulative cases but it's not. The "lead" changed hands so many times over the 15 days, meaning the data sequence was extremely noisy, which would be unlikely for cumulative cases. There are thousands of cases in each of these counties by May. Switching the order of the columns within each daily group defeats the purpose of placing these groups side-by-side.

Responding to the bad press, the department changed the chart design for this week's version:

Georgia_top5counties_covid19_revised

This chart now conforms to the two spoken rules described above. The time axis runs left to right, and within each group of columns, the order of the counties is maintained.

The chart is still very noisy, with no apparent message.

***

Next, I'd like to draw your attention to a Data issue. Notice that the 15-day window has shifted. This revised chart runs from May 2 to May 16, which is this past Saturday. The previous chart ran from Apr 26 to May 9. 

Here's the data for May 8 and 9 placed side by side.

Junkcharts_georgia_covid19_cases

There is a clear time lag of reporting cases in the State of Georgia. This chart should always exclude the last few days. The case counts keep going up until it stabilizes. The same mistake occurs in the revised chart - the last two days appear as if new cases have dwindled toward zero when in fact, it reflects a lag in reporting.

The disconnect between the Question being posed and the quality of the Data available dooms this visualization. It is not possible to provide a reliable assessment of the "past 15 days" when during perhaps half of that period, the cases are under-counted.

***

Nyt_tryingtobefashionableThis graphical distortion due to "immature" data has become very commonplace in Covid-19 graphics. It's similar to placing partial-year data next to full-year results, without calling out the partial data.

The following post from the ancient past (2005!) about a New York Times graphic shows that calling out this data problem does not actually solve it. It's a less-bad kind of thing.

The coronavirus data present more headaches for graphic designers than the financial statistics. Because of accounting regulations, we know that only the current quarter's data are immature. For Covid-19 reporting, the numbers are being adjusted for days and weeks.

Practically all immature counts are under-estimates. Over time, more cases are reported. Thus, any plots over time - if unadjusted - paint a misleading picture of declining counts. The effect of the reporting lag is predictable, having a larger impact as we run from left to right in time. Thus, even if the most recent data show a downward trend, it can eventually mean anything: down, flat or up. This is not random noise though - we know for certain of the downward bias; we just don't know the magnitude of the distortion for a while.

Another issue that concerns coronavirus reporting but not financial reporting is inconsistent standards across counties. Within a business, if one were to break out statistics by county, the analysts would naturally apply the same counting rules. For Covid-19 data, each county follows its own set of rules, not just  how to count things but also how to conduct testing, and so on.

Finally, with the politics of re-opening, I find it hard to trust the data. Reported cases are human-driven data - by changing the number of tests, by testing different mixes of people, by delaying reporting, by timing the revision of older data, by explicit manipulation, ...., the numbers can be tortured into any shape. That's why it is extremely important that the bean-counters are civil servants, and that politicians are kept away. In the current political environment, that separation between politics and statistics has been breached.

***

Why do we have low-quality data? Human decisions, frequently political decisions, adulterate the data. Epidemiologists are then forced to use the bad data, because that's what they have. Bad data lead to bad predictions and bad decisions, or if the scientists account for the low quality, predictions with high levels of uncertainty. Then, the politicians complain that predictions are wrong, or too wide-ranging to be useful. If they really cared about those predictions, they could start by being more transparent about reporting and more proactive at discovering and removing bad accounting practices. The fact that they aren't focused on improving the data gives the game away. Here's a recent post on the politics of data.