## Tile maps on a trip

##### Jun 21, 2023

My friend Ray sent me to a recent blog post about tile maps. Typical tile maps use squares or hexagons, although in theory many other shapes will do. Unsurprisingly, the field follows the latest developments from math researchers who study the space packing problem, which concerns how to pack a space with objects. The study of tessellations is about covering a space with one or a few shapes.

It was an open question until recently whether there exists an "aperiodic monotile," that is to say, a single shape that can cover space in a non-repeating manner. We all know that we can use squares to cover a space, which creates the familiar grid of squares, but in that case, a pattern repeats itself all over the space.

Now, some researchers have found an elusive aperiodic monotile, which they dubbed the Einstein monotile. Below is a tessellation using these tiles:

Within this design, one cannot find a set of contiguous tiles that repeats itself.

The blogger then made a tile map using this new tessellation. Here's one:

It doesn't matter what this is illustrating. The blog author cites a coworker, who said: "I can think of no proper cartographic use for Penrose binning, but it’s fun to look at, and so that’s good enough for me." Penrose tiles are another mathematical invention that can be used in a tessellation. The story is still the same: there is no benefit from using these strange-looking shapes, other than the curiosity factor.

***

Let's review the pros and cons of using tile maps.

Compare a typical choropleth map of the United States (by state) and a tile map by state. The former has the well-known problem that states with the largest areas usually have the lowest population densities, and thus, if we plot demographic data on such maps, the states that catch the most attention are the ones that don't weigh as much - by contrast, the densely populated states in New England barely show up.

The tile map removes this area bias, thus resolving this problem. Every state is represented by equal area.

While the tessellated design is frequently better, it's not always. In many data visualizations, we do intend to convey the message that not all states are equal!

The grid arrangement of the state tiles also makes it easier to find regional patterns. A regional pattern is defined here as a set of neighboring states that share similar data (encoded in the color of the tiles). Note that the area of each state is of zero interest here, and thus, the accurate depiction of relative areas found on the usual map is a distraction.

However, on the tile map, these regional patterns are conceptual. One must not read anything into the shape of the aggregated region, or its boundaries. Indeed, if we use strange-looking shapes like Einstein tiles, the boundaries are completely meaningless, and even misleading.

There also usually is some distortion of the spatial coordinates on a tile map because we'd like to pack the squares or hexagons into a lattice-like structure.

Lastly, the tile map is not scalable. We haven't seen a tile map of the U.S. by county or precinct but we have enjoyed many choropleth maps displaying county- or precinct-level data, e.g. the famous Purple Map of America. There is a reason for this.

***

Here is an old post that contains links to various other posts I've written about tile maps.

## Why some dataviz fail

##### Jun 16, 2023

Maxim Lisnic's recent post should delight my readers (link). Thanks Alek for the tip. Maxim argues that charts "deceive" not merely by using visual tricks but by a variety of other non-visual means.

This is also the reasoning behind my Trifecta Checkup framework which looks at a data visualization project holistically. There are lots of charts that are well designed and constructed but fail for other reasons. So I am in agreement with Maxim.

He analyzed "10,000 Twitter posts with data visualizations about COVID-19", and found that 84% are "misleading" while only 11% of the 84% "violate common design guidelines". I presume he created some kind of computer program to evaluate these 10,000 charts, and he compiled some fixed set of guidelines that are regarded as "common" practice.

***

Let's review Maxim's examples in the context of the Trifecta Checkup.

The first chart shows Covid cases in the U.S. in July and August of 2021 (presumably the time when the chart was published) compared to the same months a year earlier (prior to the vaccination campaign).

Maxim calls this cherry-picking. He's right - and this is a pet peeve of mine, even with all the peer-reviewed scientific research. In my paper on problems with observational studies (link), my coauthors and I call for a new way forward: researchers should put their model calculations up on a website which is updated as new data arrive, so that we can be sure that the conclusions they published apply generally to all periods of time, not just the time window chosen for the publication.

Looking at the pair of line charts, readers can quickly discover its purpose, so it does well on the Q(uestion) corner of the Trifecta. The cherry-picking relates to the link between the Question and the Data, showing that this chart suffers from subpar analysis.

In addition, I find that the chart also misleads visually - the two vertical scales are completely different: the scale on the left chart spans about 60,000 cases while on the right, it's double the amount.

Thus, I'd call this a Type DV chart, offering opportunities to improve in two of the three corners.

***

The second chart cited by Maxim plots a time series of all-cause mortality rates (per 100,000 people) from 1999 to 2020 as columns.

The designer does a good job drawing our attention to one part of the data - that the average increase in all-cause mortality rate in 2020 over the previous five years was 15%. I also like the use of a different color for the pandemic year.

Then, the designer lost the plot. Instead of drawing a conclusion based on the highlighted part of the data, s/he pushed a story that the 2020 rate was about the same as the 2003 rate. If that was the main message, then instead of computing a 15% increase relative to the past five years, s/he should have shown how the 2003 and 2020 levels are the same!

On a closer look, there is a dashed teal line on the chart but the red line and text completely dominate our attention.

This chart is also Type DV. The intention of the designer is clear: the question is to put the jump in all-cause mortality rate in a historical context. The problem lies again with subpar analysis. In fact, if we take the two insights from the data, they both show how serious a problem Covid was at the time.

When the rate returned to the level of 2003, we had effectively given up all the gains made over 17 years in a few months.

Besides, a jump of 15% from year to year is highly significant if we look at all other year-to-year changes shown on the chart.

***

The next section concerns a common misuse of charts to suggest causality when the data could only indicate correlation (and where the causal interpretation appears to be dubious). I may write a separate post about this vast topic in the future. Today, I just want to point out that this problem is acute with any Covid-19 research, including official ones.

***

I find the fourth section of Maxim's post to be less convincing. In the following example, the tweet includes two charts, one showing proportion of people vaccinated, and the other showing the case rate, in Iceland and Nigeria.

This data visualization is poor even on the V(isual) corner. The first chart includes lots of countries that are irrelevant to the comparison. It includes the unnecessary detail of fully versus partially vaccinated, unnecessary because the two countries selected are at two ends of the scale. The color coding is out of sync between the two charts.

Maxim's critique is:

The user fails to account, however, for the fact that Iceland had a much higher testing rate—roughly 200 times as high at the time of posting—making it unreasonable to compare the two countries.

And the section is titled "Issues with Data Validity". It's really not that simple.

First, while the differential testing rate is one factor that should be considered, this factor alone does not account for the entire gap. Second, this issue alone does not disqualify the data. Third, if testing rate differences should be used to invalidate this set of data, then all of the analyses put out by official sources lauding the success of vaccination should also be thrown out since there are vast differences in testing rates across all countries (and also across different time periods for the same country).

One typical workaround for differential testing rates is to look at deaths rather than cases. For the period of time plotted on the case curve, Nigeria's cumulative deaths per million are about 1/8th those of Iceland. The real problem is again in the Data analysis, and it is about how to interpret this data causally.

This example is yet another Type DV chart. I'd classify it under problems with "Causal Inference". "Data Validity" is definitely a real concern; I just don't find this example convincing.

***

The next section, titled "Failure to account for statistical nuance," is a strange one. The example is a chart that the CDC puts out showing the emergence of cases in a specific county, with cases classified by vaccination status. The chart shows that the vast majority of cases were found in people who were fully vaccinated. The person who tweeted concluded that vaccinated people are the "superspreaders". Maxim's objection to this interpretation is that most cases are in the fully vaccinated because most people are fully vaccinated.

I don't think it's right to criticize the original tweeter in this case. If by superspreader, we mean people who are infected and out there spreading the virus to others through contacts, then what the data say is exactly that most such people are fully vaccinated. In fact, one should be very surprised if the opposite were true.

Indeed, this insight has major public health implications. If the vaccine is indeed 90% effective at stopping cases, we should not be seeing that level of cases. And if the vaccine is only moderately effective, then we may not be able to achieve "herd immunity" status, as was the plan originally.
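The arithmetic behind both readings can be sketched in a few lines. This is a simplified illustration, not the CDC's calculation: it assumes vaccinated and unvaccinated people have equal exposure, and the coverage and effectiveness values are hypothetical, not the county's actual figures. The function name `vaccinated_share_of_cases` is my own.

```python
def vaccinated_share_of_cases(coverage: float, effectiveness: float) -> float:
    """Expected fraction of cases that occur among vaccinated people.

    Assumes (hypothetically) equal exposure in both groups, so the only
    difference is that a vaccinated person's risk is scaled by
    (1 - effectiveness).
    """
    vaccinated_cases = coverage * (1.0 - effectiveness)
    unvaccinated_cases = 1.0 - coverage
    total = vaccinated_cases + unvaccinated_cases
    if total == 0:
        raise ValueError("no cases expected under these inputs")
    return vaccinated_cases / total


# Hypothetical scenarios: same coverage, different vaccine effectiveness.
for coverage, effectiveness in [(0.7, 0.9), (0.7, 0.5)]:
    share = vaccinated_share_of_cases(coverage, effectiveness)
    print(f"coverage={coverage:.0%}, effectiveness={effectiveness:.0%}: "
          f"{share:.0%} of cases are among the vaccinated")
```

Under these assumptions, the vaccinated share of cases is driven by both inputs at once: how many people are vaccinated, and how well the vaccine stops cases. Reading the chart requires an opinion on both.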

I'd be reluctant to complain about this specific data visualization. It seems that the data allow different interpretations - some of which are contradictory but all of which are needed to draw a measured conclusion.

***

The last section on "misrepresentation of scientific results" could use a better example. I certainly agree with the message: that people have confirmation bias. I have been calling this "story-first thinking": people with a set story visualize only the data that support their preconception.

However, the example given is not that. The example shows a tweet that contains a chart from a scientific paper that apparently concludes that hydroxychloroquine helps treat Covid-19. Maxim adds that this study was subsequently retracted. If the tweet was sent prior to the retraction, then I don't think we can grumble about someone citing a peer-reviewed study published in The Lancet.

***

Overall, I like Maxim's message. In some cases, I think there are better examples.

## Flowing to nowhere

##### Jun 09, 2023

The New York Times printed the following flow chart about water usage of the Colorado River (link).

The Colorado River provides water to more than 10% of the U.S. population. About half is used to feed livestock, another quarter for agriculture, which leaves a quarter to residential and other uses.

***

This type of flow chart, in which the widths of the flows encode relative flow volumes, is sometimes called a "Sankey diagram."

The most famous Sankey diagram of all time may be Minard's depiction of Napoleon's campaign in Russia.

In Minard's map, the flows represent movement of troops. The brown color shows advance and the black color shows retreat. The power of this graphic is found in how it depicts the attrition of troops over the course of the campaign - on both spatial and temporal dimensions.

Of interest is the choice to make the outflows disappear. For most flows, the ending width is smaller than the starting width, the difference being the attrition, which is not drawn as a flow of its own. On many flow charts, the design imposes a principle of conservation - total outflows equal total inflows - but not here.

For me, the canonical flow chart describes the physical structure of rivers.

Flow is conserved here (well, if we ignore evaporation, and absorption into ground water).

Most flow charts we see these days are not faithful to reality - they present abstract concepts.

***

The Colorado River flow chart is an example of an abstract flow chart.

What's depicted cannot be reality. The water from the Colorado River does not all tumble out of a single huge reservoir, and there isn't some gigantic pipeline that takes out half of the water and sends it to agricultural users, etc. All the flows on the chart are abstract, not physical in nature.

A conservation principle is enforced at all junctions, so that the sum of the inflows is always the sum of the outflows. In this sense, the chart visually depicts composition (and decomposition). The NYT flow chart shows two ways to decompose water usage at the Colorado River. One decomposition breaks usage down into agriculture, residential, commercial, and power generation. That's an 80/20 split. A second decomposition breaks agriculture into two parts (livestock and crops) while it aggregates the smaller categories into a single "other".

***

The Colorado River flow chart can be produced without knowing a single physical flow from the river basin to an end-user. The designer only requires total water usage, and water usage by subgroup of users.
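To make this concrete, here is a minimal sketch of how the "flows" can be derived from nothing but usage totals by subgroup. The numbers are illustrative placeholders out of 100 units, not the NYT's figures, and the node labels and function names (`build_links`, `is_conserved`) are my own.

```python
from collections import defaultdict

ROOT = "Colorado River"

# Hypothetical usage by group and subgroup, in arbitrary units.
usage = {
    "Agriculture": {"Livestock feed": 50.0, "Crops": 25.0},
    "Other": {"Residential, commercial, power": 25.0},
}


def build_links(usage_by_group):
    """Turn subgroup totals into (source, target, value) flow-chart links."""
    links = []
    for group, subgroups in usage_by_group.items():
        # The group-level "flow" is implied: it is just the sum of its parts.
        links.append((ROOT, group, sum(subgroups.values())))
        for subgroup, value in subgroups.items():
            links.append((group, subgroup, value))
    return links


def is_conserved(links, tol=1e-9):
    """Check that inflow equals outflow at every intermediate junction."""
    inflow, outflow = defaultdict(float), defaultdict(float)
    for source, target, value in links:
        outflow[source] += value
        inflow[target] += value
    junctions = [node for node in inflow if node in outflow]
    return all(abs(inflow[n] - outflow[n]) <= tol for n in junctions)


links = build_links(usage)
print(links)
print("conserved at every junction:", is_conserved(links))
```

Conservation holds by construction here, because every link value is computed by summing the subgroup totals. No physical flow was ever measured.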

For most readers, this may seem like a piece of trivia - for data analysts, it's really important to know whether these "flows" are measured data, or implied data.