Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked heights of each column should be in the hundreds since he has added 100 to each cell. But that’s not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the YouTube video. My source is here.]
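To see the mechanics, here is a minimal Python sketch (my illustration, not the Youtuber's workflow). It assumes a naive stacking rule in which each block simply sits on top of the previous one; Excel's actual treatment of negative values may differ. The function names `stacked_edges` and `to_ranges` are hypothetical.

```python
from itertools import accumulate

# Five average temperatures for London, Ontario (deg C), from the post.
values = [-3, 5, 9, 14, 24]

def stacked_edges(heights):
    """Top edge of each block when blocks are piled one on top of another."""
    return list(accumulate(heights))

def to_ranges(vals):
    """Convert values to 'ranges': the first value, then successive gaps."""
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

# Feeding the raw averages into the stack sums them cumulatively.
naive = stacked_edges(values)

# Feeding the gaps instead puts the block edges back on the data values.
ranges = to_ranges(values)
recovered = stacked_edges(ranges)

print(naive)
print(ranges)
print(recovered)
```

Under that assumption, only the version fed with gaps has block edges that land on the five data values, which is presumably why the video converts the values to ranges first.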

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 if he homes in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than daily fluctuations. Said differently, the effect of seasons on temperature is bigger than that of hour of day.
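The same arithmetic can be spelled out for both halves of the list. This is a small sketch of the additive model using only the five averages quoted above; the variable names are mine.

```python
# Average temperatures (deg C) from the post.
winter_night, night, overall, day, summer_day = -3, 5, 9, 14, 24

# Diurnal adjustments: moving from the overall average to day or night.
diurnal_up = day - overall          # 14 - 9
diurnal_down = night - overall      # 5 - 9

# Seasonal adjustments: moving further to summer days or winter nights.
seasonal_up = summer_day - day          # 24 - 14
seasonal_down = winter_night - night    # -3 - 5

# The additive model reassembles the two extreme averages.
assert overall + diurnal_up + seasonal_up == summer_day
assert overall + diurnal_down + seasonal_down == winter_night

print(diurnal_up, seasonal_up)
print(diurnal_down, seasonal_down)
```

On the cold side the gaps are 4 C and 8 C, so the seasonal gap is again twice the diurnal gap.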

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


Aligning the visual and the message to hot things up

The headline of this NBC News chart (link) tells readers that Phoenix (Arizona) has been very, very hot this year. It has over 120 days in which the average temperature exceeded 100F (38 C).

Nbcnews_phoenix_tmax

It's not obvious how extreme this situation is. To help readers, it would be useful to add some kind of reference points.

A couple of possibilities come to mind:

First, how many days are depicted in the chart? Since there is one cell for each day of the year, and the day of week is plotted down the vertical axis, we just need to count the number of columns. There are 38 columns, but the first column has one missing cell while the last column has only 3 cells. Thus, the number of days depicted is (36*7)+6+3 = 261. So, the average temperature in Phoenix exceeded 100F on about 46% of the days of the year thus far.
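The count can be double-checked against the calendar. The sketch below assumes each column runs Sunday to Saturday (consistent with a first column that is missing one cell when Jan 1 is a Monday) and treats "over 120 days" as exactly 120, so the share is a lower bound.

```python
from datetime import date, timedelta

start = date(2024, 1, 1)
end = date(2024, 9, 17)   # last day of data, per the post

n_days = (end - start).days + 1

def week_start(d):
    """Return the Sunday on or before date d (assumed first row of a column)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)

# Number of weekly columns needed to cover start..end.
n_columns = (week_start(end) - week_start(start)).days // 7 + 1

hot_days = 120  # "over 120 days" in the headline
share = hot_days / n_days

print(n_days, n_columns, round(share * 100))
```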

That sounds like a high number. For a better reference point, we'd also like to know the historical average. Is Phoenix just a very hot place? Is 2024 hotter than usual?

***

Let's walk through how one reads the Phoenix "heatmap".

We already figured out that each column represents a week of the year, and each row shows a cross-section of a given day of week throughout the year.

The first column starts on a Monday because the first day of 2024 falls on a Monday. The last column ends on a Tuesday, which corresponds to Sept 17, 2024, the last day of data when this chart was created.

The columns are grouped into months, although such division is complicated by the fact that the number of days in a month (except for February in a non-leap year) is never divisible by seven. The designer subtly inserted a thicker border between months. This feature allows readers to comment on the average temperature in a given month. It also lets readers learn quickly that we are two weeks and three days into September.

The color legend explains that temperature readings range from yellow (lower) to red (higher). The range of average daily temperatures during 2024 was 54-118F (12-48C). The color scale is progressive.

Nbcnews_phoenix_colorlegend

Given that 100F is used as a threshold to define "hot days," it makes sense to accentuate this in the visual presentation. For example:

Junkcharts_redo_nbcnewsphoenixmaxtemp

Here, all days with maximum temperature at 100F or above have a red hue.
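One way to build such a scale is to split it at the threshold and vary intensity within each side. The following is only a sketch of that idea, not the code behind my redo: the function name `shade`, the two hue families, and the use of the 54-118F range as end points are all assumptions for illustration.

```python
THRESHOLD_F = 100  # the "hot day" cutoff used in the headline

def shade(temp_f, lo=54, hi=118, threshold=THRESHOLD_F):
    """Return (hue family, intensity in [0, 1]) for a daily reading.

    Readings at or above the threshold get a red hue; the rest get a
    yellow hue. Intensity grows with distance from the threshold.
    """
    if temp_f >= threshold:
        return "red", (temp_f - threshold) / (hi - threshold)
    return "yellow", (threshold - temp_f) / (threshold - lo)

for reading in (54, 77, 100, 118):
    print(reading, shade(reading))
```

The point of the split is that the hue changes exactly at 100F, so hot days can be counted at a glance.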


Adjust, and adjust some more

This Financial Times report illustrates the reason why we should adjust data.

The story explores the trend in economic statistics during 14 years of governing by conservatives. One of those metrics is so-called council funding (local governments). The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows year-on-year change in funding. It is an index relative to the level in 2010. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding had recovered to the level of 2010, and it then expanded rapidly in recent years.

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart paints a completely different picture. The line dropped from 2010 to 2016 as before. Then, it went flat, and after 2021, it started rising, even though by 2024, the value is still 10 percent below the level in 2010.

What happened? The data journalist has taken the data from the first chart, and adjusted the values for inflation. Inflation was rampant in recent years, so some of the raw growth has been dampened. In economics, adjusting for inflation is also called expressing in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper, council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point at which, in real terms, council funding has fallen by 20 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means, they performed a second adjustment - for population change. It is a simple adjustment of dividing by the population. The numbers look worse probably because population has grown during these years. Thus, even if the amount of funding stayed the same, the money would have to be split amongst more people. The per-capita adjustment makes this point clear.
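Both adjustments are simple divisions. Here is a minimal sketch, assuming everything is expressed relative to the base year; the function name `adjust` and the input numbers are hypothetical and are not the FT's data.

```python
def adjust(nominal, price_level, population):
    """Return (real, real per capita) indices, base year = 100.

    nominal:      funding index, base year = 100
    price_level:  prices relative to the base year (1.0 = unchanged)
    population:   population relative to the base year (1.0 = unchanged)
    """
    real = nominal / price_level          # adjust for inflation
    real_per_capita = real / population   # then adjust for population
    return real, real_per_capita

# Hypothetical inputs, NOT the FT's data: funding up 25% on paper,
# prices up 45%, population up 8% since the base year.
real, per_capita = adjust(125.0, 1.45, 1.08)
print(round(real, 1), round(per_capita, 1))
```

With inputs like these, a nominal index above 100 turns into a real index below 100, and the per-capita index falls further still.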

***

The final story is much different from the initial one. Not only was the magnitude of change different but the direction of change reversed.

Whenever it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also subjective. Not adjusting is usually much worse.

 

 

 

 


Chart without an axis

When it comes to global warming, most reports cite a single number such as an average temperature rise of Y degrees by year X. Most reports also claim the existence of a consensus among scientists. The Guardian presented the following chart that shows the spread of opinions amongst the experts.

Guardian_globalwarming

Experts were asked how many degrees they expect average global temperature to increase by 2100. The estimates ranged from "below 1.5 degrees" to "5 degrees or more". The most popular answer was 2.5 degrees. Roughly three out of four respondents picked a number at 2.5 degrees or above. The distribution is close to symmetric around the middle.

***

What kind of chart is this?

It's a type of histogram, given that the horizontal axis shows binned ranges of temperature change while the vertical axis shows number of respondents (out of 380).

A (count) histogram typically encodes the count data in the vertical axis. Did you notice there isn't a vertical axis?

That's because the chart has an abnormal axis. Each of the 380 respondents is shown here as a cell. What looks like a "column" is actually two-dimensional. Each row of cells has 10 slots. To find out how many respondents chose the 2.5 celsius category, you count the number of rows and then the number of stray items on top. (It's 132.)

Only the top row of cells can be partially filled so the general shape of the distribution isn't affected much. However, the lack of axis labels makes it hard to learn the count of each column.

It's even harder to know the proportions of respondents, which should be the primary message of the chart. The proportion would have been possible to show if the maximum number of rows was set to 38. The maximum number of rows on the above chart is 22. Using 38 rows leads to a chart with a lot of white space as the tallest column (count of 132) is roughly 35% of the total response.

In the end, I'm not sure this variant of histogram beats the standard histogram.


Aligning V and Q by way of D

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

Vitamins

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.

***

_trifectacheckup_image

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts shows this encoding:

Junkcharts_redo_msn_dairyplantmilks_2

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is virtually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values. 
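The change of index can be sketched in a few lines of Python. The inputs are the approximate percentages I read off the original chart, as quoted earlier; almond's riboflavin is left out because I gave no figure for it, and the function name `reindex` is just for illustration.

```python
# Index values relative to each vitamin's maximum (percent), as quoted above.
relative_to_max = {
    "potassium": {"dairy": 100, "soy": 72, "oat": 45, "almond": 45},
    "riboflavin": {"oat": 100, "soy": 80, "dairy": 60},
}

def reindex(values, base="dairy"):
    """Re-express every value as a percentage of the base category."""
    return {k: 100 * v / values[base] for k, v in values.items()}

relative_to_dairy = {vit: reindex(vals) for vit, vals in relative_to_max.items()}

for vitamin, vals in relative_to_dairy.items():
    print(vitamin, {k: round(v) for k, v in vals.items()})
```

Potassium is unchanged because dairy was already the maximum, while for riboflavin the oat and soy values now exceed 100, which is exactly the comparison the question calls for.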

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

Junkcharts_redo_msn_dairyplantmilks_1

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.


The cult of raw unadjusted data

Long-time reader Aleks came across the following chart on Facebook:

Unadjusted temp data fgfU4-ia fb post from aleks

The author attached a message: "Let's look at raw, unadjusted temperature data from remote US thermometers. What story do they tell?"

I suppose this post came from a climate change skeptic, and the story we're expected to take away from the chart is that there is nothing to see here.

***

What are we looking at, really?

"Nothing to see" probably refers to the patch of blue squares that cover the entire plot area, as time runs left to right from the 1910s to the present.

But we can't really see what's going on in the middle of the patch. So, "nothing to see" is effectively only about the top-to-bottom range of roughly 29.8 to 82.0. What does that range signify?

The blue patch is subdivided into vertical lines consisting of blue squares. Each line is a year's worth of temperature measurements. Each square is the average temperature on a specific day. The vertical range is the difference between the maximum and minimum daily temperatures in a given year. These are extreme values that say almost nothing about the temperatures in the other ~363 days of the year.

We know quite a bit more about the density of squares along each vertical line. They are broken up roughly by seasons. Those values near the top came from summers while the values near the bottom came from winters. The density is the highest near the middle, where the overplotting is so severe that we can barely see anything.

Within each vertical line, the data are not ordered chronologically. This is a key observation. From left to right, the data are ordered from earliest to latest but not from top to bottom! Therefore, it is impossible for the human eye to trace the entire trajectory of the daily temperature readings from this chart. At best, you can trace the yearly average temperature – but only extremely roughly by eyeballing where the annual averages are inside the blue patch.

Indeed, there is "nothing to see" on this chart because its design has pulverized the data.

***

_numbersense_bookcover

In Numbersense (link), I wrote "not adjusting the raw data is to knowingly publish bad information. It is analogous to a restaurant's chef knowingly sending out spoilt fish."

It's a fallacy to think that "raw unadjusted" data are the best kind of data. It's actually the opposite. Adjustments are designed to correct biases or other problems in the data. Of course, adjustments can be subverted to introduce biases in the data as well. It is subversive to presume that all adjustments are of the subversive kind.

What kinds of adjustments are of interest in this temperature dataset?

Foremost is the seasonal adjustment. See my old post here. If we want to learn whether temperatures have risen over these decades, we can't do so without separating out the seasons.

The whole dataset can be simplified by drawing the smoothed annual average temperature grouped by season of the year, and when that is done, the trend of rising temperatures is obvious.

***

The following chart by the EPA roughly implements the above:

Epa-seasonal-temperature_2022

The original can be found here. They made one adjustment which isn't the one I expected.

Note the vertical scale is titled "temperature anomaly". So, they are not plotting the actual recorded average temperatures, but the "anomalies", i.e. the difference between the recorded temperatures and some kind of "expected" temperature. This is a type of data adjustment as well. The purpose is to focus attention on the relative rather than absolute values. Think of this formula: recorded value = expected value + anomaly. The chart shows how many degrees above or below expectation, rather than how many degrees.

For a chart like this, there should be a required footnote that defines what "anomaly" is. Specifically, the reader should know about the model behind the "expectation". Typically, it's a kind of long-term average value.
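As a sketch of the formula, here is how anomalies could be computed if the "expected" value is the mean over some baseline period. That is an assumption on my part; the chart does not say which baseline the EPA used, and the data below are made up.

```python
from statistics import mean

def anomalies(series, baseline_years):
    """Subtract the baseline-period mean from every yearly value.

    series:          dict mapping year -> recorded average temperature
    baseline_years:  years whose mean serves as the "expected" value
    """
    expected = mean(series[y] for y in baseline_years)
    return {year: value - expected for year, value in series.items()}

# Made-up summer averages (deg F), for illustration only.
summer = {2000: 71.0, 2001: 72.0, 2002: 73.0, 2003: 74.5}
result = anomalies(summer, baseline_years=[2000, 2001, 2002])
print(result)
```

Adding the expected value back to each anomaly returns the recorded value, which is the formula above read in reverse.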

For me, this adjustment is not necessary. Without the adjustment, the four panels can be combined into one panel with four lines. That's because the data nicely fit into four levels based on seasons.

The further adjustment I'd have liked to see is "smoothing". Each line above has a "smooth" trend, as well as some variability around this trend. The latter is not a big part of the story.
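There are many ways to smooth a series. A centered moving average is one of the simplest, sketched below with made-up numbers; I am not suggesting it is the best choice for this dataset, and the sketch assumes an odd window length.

```python
def moving_average(values, window=5):
    """Centered moving average; positions without a full window are dropped.

    Assumes `window` is odd so that each average is centered on a data point.
    """
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]

# Made-up series: an upward trend with year-to-year wiggles.
noisy = [1, 3, 2, 4, 3, 5, 4, 6]
smooth = moving_average(noisy, window=3)
print(smooth)
```

The smoothed series never decreases even though the raw values zigzag, which is the kind of separation between trend and variability I have in mind.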

***

It's weird to push back on climate change advocacy by attacking data adjustments. The more productive direction, in my view, is to ask whether the observed trend is caused by human activities or part of some long-term up-and-down cycle. That is a very challenging question to answer.


Several tips for visualizing matrices

Continuing my review of charts that were spammed to my inbox, today I look at the following visualization of a matrix of numbers:

Masterworks_chart9

The matrix shows pairwise correlations between the returns of 16 investment asset classes. Correlation is a number between -1 and 1. It is a symmetric scale around 0. It embeds two dimensions: the magnitude of the correlation, and its direction (positive or negative).

The correlation matrix is a special type of matrix: a bit easier to deal with as the data already come “standardized”. As with the other charts in this series, there is a good number of errors in the chart's execution.

I’ll leave the details maybe for a future post. Just check two key properties of a correlation matrix: the diagonal consisting of self-correlations should contain all 1s; and the matrix should be symmetric across that diagonal.
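Those two checks are easy to automate. A minimal sketch in plain Python follows, with the matrix held as a list of lists; the function name and the tiny test matrices are hypothetical.

```python
def check_correlation_matrix(m, tol=1e-9):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    n = len(m)
    for i in range(n):
        # Self-correlations on the diagonal must equal 1.
        if abs(m[i][i] - 1.0) > tol:
            problems.append(f"diagonal[{i}] = {m[i][i]}, expected 1")
        # Each cell above the diagonal must match its mirror image below.
        for j in range(i + 1, n):
            if abs(m[i][j] - m[j][i]) > tol:
                problems.append(f"asymmetric at ({i}, {j}): {m[i][j]} vs {m[j][i]}")
    return problems

good = [[1.0, 0.2], [0.2, 1.0]]
bad = [[1.0, 0.2], [0.3, 0.9]]

print(check_correlation_matrix(good))
print(check_correlation_matrix(bad))
```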

***

For this post, I want to cover nuances of visualizing matrices. The chart designer knows exactly what the message of the chart is - that the asset class called "art" is attractive because it has little correlation with other popular asset classes. Regardless of the chart's errors, it’s hard for the reader to find the message in the matrix shown above.

That's because the specific data carrying the message sit in the bottom row (and the rightmost column). The cells in this row (and column) have a light purple color, which has been co-opted by the even lighter gray color used for the diagonal cells. These diagonal cells pop out of the chart despite being the least informative (they have the same values for all correlation matrices!)

***

Several tactics can be deployed to push the message to the fore.

First, let's bring the key data to the prime location on the chart - this is the top row and left column (for cultures which read top to bottom, left to right).

Redo_masterwork9_matrix_arttop

For all the drafts in this post, I have dropped the text descriptions of the asset classes, and replaced them with numbers so that it's easier to follow the changes. (For those who're paying attention, I also edited the data to make the matrix symmetric.)

Second, let's look at the color choice. Here, the designer made a wise choice of restricting the number of color levels to three (dark, medium and light). I retained that decision in the above revision - actually, I used four colors but there are no values in one of the four sections, therefore, effectively, only three colors appear. But let's look at what happens when the number of color levels is increased.

Redo_masterwork9_matrix_colors

The more levels of color, the more strain it puts on our processing... with little reward.

Third, and most importantly, the order of the categories strongly affects perception. I have no idea what the designer used as the sorting criterion. In step one of the fix, I moved the art category to the front but left all the other categories in the original order.

The next chart has the asset classes organized from lowest to highest average correlation. Conveniently, using this sorting metric leaves the art category in its prime spot.

Redo_masterwork9_matrix_orderbyavg

Notice that the appearance has completely changed. The new version brings out clusters in the data much more effectively. Most of the assets in the bottom of the chart have high correlation with each other.

Finally, because the correlation matrix is symmetric across the diagonal of self-correlations, the two halves are mirror images and thus redundant. The following removes one of the mirrored halves, and also removes the diagonal, leading to a much cleaner look.

Redo_masterwork9_matrix_orderbyavg_tri

Next time you visualize a matrix, think about how you sort the rows/columns, how you choose the color scale, and whether to plot the mirrored image and the diagonal.
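For those who want to try the sorting and the triangle on their own data, here is a rough sketch in plain Python. The 3x3 matrix is made up and is not the data behind the chart; the function names are mine.

```python
def average_correlation(m, i):
    """Mean correlation of asset i with all other assets (diagonal excluded)."""
    return sum(m[i][j] for j in range(len(m)) if j != i) / (len(m) - 1)

def reorder(m, labels):
    """Sort rows and columns together, lowest average correlation first."""
    order = sorted(range(len(m)), key=lambda i: average_correlation(m, i))
    sorted_m = [[m[i][j] for j in order] for i in order]
    return sorted_m, [labels[i] for i in order]

def lower_triangle(m):
    """Keep only cells below the diagonal; row 0 comes back empty."""
    return [row[:i] for i, row in enumerate(m)]

# Made-up 3x3 example, not the data behind the chart.
labels = ["stocks", "bonds", "art"]
corr = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.0],
    [0.1, 0.0, 1.0],
]

sorted_corr, sorted_labels = reorder(corr, labels)
triangle = lower_triangle(sorted_corr)

print(sorted_labels)
print(triangle)
```

Rows and columns must be permuted together, otherwise the diagonal of self-correlations no longer lines up.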

 

 

 


Elevator shoes for column charts

Continuing my review of some charts spammed to me, I wasn’t expecting to find any interest in the following:

Masterworks_chart4

It’s a column chart showing the number of years of data available for different asset classes. The color has little value other than to subtly draw the reader’s attention to the bar called “Art,” which is the focus of the marketing copy.

Do the column heights encode the data?

The answer is no.

***

Let’s take a little journey. First I notice there is a grid behind the column chart, hanging above the baseline.

Redo_masterworks4_grid

I marked out two columns with values 50 and 25, so the second column should be exactly half the height of the first. Each column consists of two parts: the first overlaps the grid while the second connects the bottom of the grid to the baseline. The second part is a constant for every column; I label this distance Y.

Against the grid, the column “50” spans 9 cells while the column “25” spans 4 cells. I label the grid height X. Now, if the first column is twice the height of the second, the equation: 9X + Y = 2*(4X+Y) should hold.

The only solution to this equation is X = Y. In other words, the distance from the bottom of the grid to the baseline must be exactly the height of one grid cell if the column heights were to faithfully represent the data. Well – it’s obvious that the former is larger than the latter.
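A quick numeric check of that equation is sketched below. The function name is mine, and the second call uses a hypothetical gap of three grid cells purely to show the direction of the distortion; I did not measure the actual gap.

```python
def height_ratio(x, y, cells_tall=9, cells_short=4):
    """Ratio of the two column heights: (9X + Y) / (4X + Y)."""
    return (cells_tall * x + y) / (cells_short * x + y)

# When the gap equals one grid cell (X = Y), the ratio is the required 2.
print(height_ratio(x=1.0, y=1.0))

# Hypothetical taller gap (Y = 3X): the ratio shrinks below 2, so the
# "50" column no longer looks twice as tall as the "25" column.
print(height_ratio(x=1.0, y=3.0))
```

Any Y larger than X lifts every column by the same amount, which compresses the ratio between tall and short columns.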

In the revision, I have chopped off the excess height by moving the baseline upwards.

Redo_masterworks4_corrected

That’s the mechanics. Now, figuring out the motivation is another matter.


One bubble is a tragedy, and a bag of bubbles is...

From Kathleen Tyson's twitter account, I came across a graphic showing the destinations of Ukraine's grain exports since 2022 under the auspices of a UN deal. This graphic, made by AFP, uses one of the chart forms that baffle me - the bag of bubbles.

Ukraine_grains_bubbles

The first trouble with a bag of bubbles is the single bubble. The human brain is just not fit for comparing bubble sizes. The self-sufficiency test is my favorite device for demonstrating this weakness. The following is the European section of the above chart, with the data labels removed.

Redo_junkcharts_afp_ukrainegrains_europe_1

How much bigger is Spain than the Netherlands? What's the difference between Italy and the Netherlands? The answers don't come easily to mind. (The Netherlands is about 40% the size of Spain, and Italy is about 20% larger than the Netherlands.)
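Part of the difficulty is geometric. Assuming the chart encodes values in the areas of the circles (the usual convention, though I have not verified it for this graphic), the diameters scale with the square root of the values, as this small sketch shows.

```python
import math

def diameter_ratio(value_ratio):
    """If area encodes the value, diameters scale with the square root."""
    return math.sqrt(value_ratio)

# Ratios quoted above.
netherlands_vs_spain = diameter_ratio(0.40)
italy_vs_netherlands = diameter_ratio(1.20)

print(round(netherlands_vs_spain, 2))
print(round(italy_vs_netherlands, 2))
```

Under that assumption, a 20% difference in value shows up as a difference of less than 10% in diameter, so it is no wonder the eye struggles.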

While comparing relative circular areas is a struggle, figuring out the relative ranks is not. Sure, it gets tougher with small differences (Germany vs S. Korea, Belgium vs Portugal) but saying those pairs are tied isn't a tragedy.

***

Another issue with bubble charts is how difficult it is to assess absolute values. A circle on its own has no reference point. The designer needs to add data labels or a legend. Adding data labels is an act of giving up. The data labels become the primary instrument for communicating the data, not the visual construct. Adding one data label is not enough, as the following shows:

Redo_junkcharts_afpukrainegrains_2

Being told that Spain's value is 4.1 does little to help estimate the values for the non-labelled bubbles.

The chart does come with the following legend:

Afp_ukrianegrains_legend

For this legend to work, the sample bubble sizes should span the range of the data. Notice that it's difficult to extrapolate from the size of the 1-million-ton bubble to 2-million, 4-million, etc. The analogy is a column chart in which the vertical axis does not extend through the full range of the dataset.

The designer totally gets this. The chart therefore contains both selected data labels and the partial legend. Every bubble larger than 1 million tons has an explicit data label. That's one solution for the above problem.

Nevertheless, why not use another chart form that avoids these problems altogether?

***

In Tyson's tweet, she showed another chart that pretty much contains the same information, this one from TASS.

Ukraine_grains_flows

This chart uses the flow diagram concept - in an abstract way, as I explained in a previous post.

This chart form imposes structure on the data. The relative ranks of the countries within each region are listed from top to bottom. The relative amounts of grains are shown in black columns (and also in the thickness of the flows).

The aggregate value of movements within each region is called out in that middle section. It is impossible to learn this from the bag of bubbles version.

The designer did print the entire dataset onto this chart (except for the smallest countries grouped together as "other"). This decision takes away from the power of the underlying flow chart. Instead of thinking about the proportional representation of each country within its respective region, or the distribution of grains among regions, our eyes home in on the data labels.

This brings me back to the principle of self-sufficiency: if we expect readers to consume the data labels - which comprise the entire dataset - why not just print a data table? If we decide to visualize, make the visual elements count!


Why some dataviz fail

Maxim Lisnic's recent post should delight my readers (link). Thanks Alek for the tip. Maxim argues that charts "deceive" not merely by using visual tricks but by a variety of other non-visual means.

This is also the reasoning behind my Trifecta Checkup framework which looks at a data visualization project holistically. There are lots of charts that are well designed and constructed but fail for other reasons. So I am in agreement with Maxim.

He analyzed "10,000 Twitter posts with data visualizations about COVID-19", and found that 84% are "misleading" while only 11% of the 84% "violate common design guidelines". I presume he created some kind of computer program to evaluate these 10,000 charts, and he compiled some fixed set of guidelines that are regarded as "common" practice.

***

Let's review Maxim's examples in the context of the Trifecta Checkup.

_trifectacheckup_image

The first chart shows Covid cases in the U.S. in July and August of 2021 (presumably the time when the chart was published) compared to a year ago (prior to the vaccination campaign).

Maxim_section1

Maxim calls this cherry-picking. He's right - and this is a pet peeve of mine, even with all the peer-reviewed scientific research. In my paper on problems with observational studies (link), my coauthors and I call for a new way forward: researchers should put their model calculations up on a website which is updated as new data arrive, so that we can be sure that the conclusions they published apply generally to all periods of time, not just the time window chosen for the publication.

Looking at the pair of line charts, readers can quickly discover its purpose, so it does well on the Q(uestion) corner of the Trifecta. The cherry-picking relates to the link between the Question and the Data, showing that this chart suffers from subpar analysis.

In addition, I find that the chart also misleads visually - the two vertical scales are completely different: the scale on the left chart spans about 60,000 cases while on the right, it's double the amount.

Thus, I'd call this a Type DV chart, offering opportunities to improve in two of the three corners.

***

The second chart cited by Maxim plots a time series of all-cause mortality rates (per 100,000 people) from 1999 to 2020 as columns.

The designer does a good job drawing our attention to one part of the data - that the average increase in all-cause mortality rate in 2020 over the previous five years was 15%. I also like the use of a different color for the pandemic year.

Then, the designer lost the plot. Instead of drawing a conclusion based on the highlighted part of the data, s/he pushed a story that the 2020 rate was about the same as the 2003 rate. If that was the main message, then instead of computing a 15% increase relative to the past five years, s/he should have shown how the 2003 and 2020 levels are the same!

On a closer look, there is a dashed teal line on the chart but the red line and text completely dominate our attention.

This chart is also Type DV. The intention of the designer is clear: the question is to put the jump in all-cause mortality rate in a historical context. The problem lies again with subpar analysis. In fact, if we take the two insights from the data, they both show how serious a problem Covid was at the time.

When the rate returned to the level of 2003, we effectively gave up, in a few months, all the gains made over 17 years.

Besides, a jump of 15% from year to year is highly significant if we look at all the other year-to-year changes shown on the chart.

***

The next section concerns a common misuse of charts to suggest causality when the data could only indicate correlation (and where the causal interpretation appears to be dubious). I may write a separate post about this vast topic in the future. Today, I just want to point out that this problem is acute with any Covid-19 research, including official ones.

***

I find the fourth section of Maxim's post to be less convincing. In the following example, the tweet includes two charts, one showing proportion of people vaccinated, and the other showing the case rate, in Iceland and Nigeria.

Maxim_section4

This data visualization is poor even on the V(isual) corner. The first chart includes lots of countries that are irrelevant to the comparison. It includes the unnecessary detail of fully versus partially vaccinated, unnecessary because the two countries selected are at two ends of the scale. The color coding is out of sync between the two charts.

Maxim's critique is:

The user fails to account, however, for the fact that Iceland had a much higher testing rate—roughly 200 times as high at the time of posting—making it unreasonable to compare the two countries.

And the section is titled "Issues with Data Validity". It's really not that simple.

First, while the differential testing rate is one factor that should be considered, this factor alone does not account for the entire gap. Second, this issue alone does not disqualify the data. Third, if testing rate differences should be used to invalidate this set of data, then all of the analyses put out by official sources lauding the success of vaccination should also be thrown out since there are vast differences in testing rates across all countries (and also across different time periods for the same country).

One typical workaround for differential testing rates is to look at deaths rather than cases. For the period of time plotted on the case curve, Nigeria's cumulative deaths per million is about 1/8th that of Iceland. The real problem is again in the Data analysis, and it is about how to interpret this data causally.

This example is yet another Type DV chart. I'd classify it under problems with "Causal Inference". "Data Validity" is definitely a real concern; I just don't find this example convincing.

***

The next section, titled "Failure to account for statistical nuance," is a strange one. The example is a chart that the CDC puts out showing the emergence of cases in a specific county, with cases classified by vaccination status. The chart shows that the vast majority of cases were found in people who were fully vaccinated. The person who tweeted concluded that vaccinated people are the "superspreaders". Maxim's objection to this interpretation is that most cases are in the fully vaccinated because most people are fully vaccinated.

I don't think it's right to criticize the original tweeter in this case. If by superspreader, we mean people who are infected and out there spreading the virus to others through contacts, then what the data say is exactly that most such people are fully vaccinated. In fact, one should be very surprised if the opposite were true.

Indeed, this insight has major public health implications. If the vaccine is indeed 90% effective at stopping cases, we should not be seeing that level of cases. And if the vaccine is only moderately effective, then we may not be able to achieve "herd immunity" status, as was the plan originally.
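To make the arithmetic behind this argument concrete, here is a rough sketch in Python. It is not based on the CDC data for that county. The coverage and effectiveness values are hypothetical, and the function `vaccinated_share_of_cases` is my own illustration. It assumes vaccinated and unvaccinated people are equally exposed, and that "effectiveness" means a proportional reduction in the chance of becoming a case.

```python
def vaccinated_share_of_cases(coverage, effectiveness):
    """Share of all cases that occur among vaccinated people.

    coverage:      fraction of the population that is vaccinated (hypothetical input)
    effectiveness: proportional reduction in case risk due to the vaccine (hypothetical input)

    Assumes equal exposure in both groups, so the baseline risk cancels out.
    """
    vaccinated_cases = coverage * (1.0 - effectiveness)  # relative case count, vaccinated
    unvaccinated_cases = 1.0 - coverage                  # relative case count, unvaccinated
    total = vaccinated_cases + unvaccinated_cases
    if total == 0:
        raise ValueError("no cases expected when everyone is vaccinated and the vaccine is 100% effective")
    return vaccinated_cases / total


if __name__ == "__main__":
    # Hypothetical scenarios only; none of these numbers come from the chart.
    for coverage in (0.5, 0.7, 0.9):
        for effectiveness in (0.5, 0.9):
            share = vaccinated_share_of_cases(coverage, effectiveness)
            print(f"coverage={coverage:.0%} effectiveness={effectiveness:.0%} "
                  f"-> vaccinated share of cases={share:.0%}")
```

Under these assumptions, both readings hold at once. High coverage by itself pushes up the vaccinated share of cases, and yet a highly effective vaccine keeps that share well below the coverage rate. With 90% coverage, a 90% effective vaccine puts just under half of the cases among the vaccinated, while a 50% effective vaccine puts about four-fifths of them there.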

I'd be reluctant to complain about this specific data visualization. It seems that the data allow different interpretations - some of which are contradictory but all of which are needed to draw a measured conclusion.

***

The last section on "misrepresentation of scientific results" could use a better example. I certainly agree with the message: that people have confirmation bias. I have been calling this "story-first thinking": people with a set story visualize only the data that support their preconception.

However, the example given is not that. The example shows a tweet that contains a chart from a scientific paper that apparently concludes that hydroxychloroquine helps treat Covid-19. Maxim adds that this study was subsequently retracted. If the tweet was sent prior to the retraction, then I don't think we can grumble about someone citing a peer-reviewed study published in The Lancet.

***

Overall, I like Maxim's message. In some cases, I think there are better examples.