The canonical U.S. political map

The previous posts feature the canonical political map of U.S. presidential elections, the vote margin shift map. The following realization of it, made by NBC News (link), drills down to the counties with the largest Asian-American populations:

[Image: NBC News vote margin shift map of counties with the largest Asian-American populations]

How does this map form encode the data?

***

The key visual element is the arrow. Each arrow has a color, a length, and an angle.

The color scheme is fixed to the canonical red-blue palette attached to America's two major political parties.

The angle of the arrow, as seen in the legend, appears to carry no data at all: every arrow is slanted at the same angle. Not quite; the political party is partially encoded in the angle, as the red arrows slant one way while the blue arrows slant the other. The degree of slant, though, is constant everywhere.

So only the lengths of the arrows contain the vote margin gain/loss data. The legend shows arrows of just two lengths, but vote margins have not been reduced to two values: as is evident on the map, the arrow lengths are continuous.

The designer has a choice when it comes to assigning colors to these arrows. The colors on the map above depict the direction of the vote margin shift, so red arrows indicate counties in which the Republicans gained share. (The same color encoding is used by the New York Times.)

Note that a blue county could have shifted to the right, and would therefore show a red arrow even though the county voted for Kamala Harris in 2024. Alternatively, the designer could have encoded the 2024 vote margin in the arrow color. While this adds more data to the map, it could wreak havoc with our perception, as all four combinations become possible: red, pointing left; red, pointing right; blue, pointing left; and blue, pointing right.

***

To sum this all up, the whole map is built from a single data series, the vote margin shift expressed as a positive or negative percentage, in which a positive number indicates Republicans increased the margin. The magnitude of this data is encoded in the arrow length, ignoring the sign. The sign (direction) of the data, a binary value, is encoded into the arrow color as well as the direction of the arrow.
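The encoding described above boils down to a tiny function. The following is just a sketch with made-up scale factors (`max_shift`, `max_len`), not anything from the actual NBC implementation:

```python
def arrow_for(shift_pct, max_shift=30.0, max_len=40.0):
    """Map a vote margin shift (positive = Republican gain) to arrow attributes.

    The magnitude of the shift sets the arrow length; the sign sets both
    the color and the slant direction. Scale factors are invented.
    """
    length = min(abs(shift_pct), max_shift) / max_shift * max_len
    if shift_pct >= 0:
        return {"length": length, "color": "red", "slant": "right"}
    return {"length": length, "color": "blue", "slant": "left"}
```

One number goes in; three visual attributes come out, but only the length carries the magnitude, while the color and the slant redundantly encode the sign.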

In other words, it's a proportional symbol map in which each geographical region is represented by a symbol (typically a bubble), and a single numeric measure is encoded in the size of the symbol. In many situations, the symbol's color is used to display a classification of the geographical regions.

The symbols used on the "wind map" are these slanted arrows. The following map, pulled from CNN (link), makes it clear that the arrows play only the role of a metaphor, evoking the left-right axis of political attitude.

[Image: CNN vote margin shift map using triangles]

This map is essentially the same as the "wind map" used by the New York Times and NBC News; the key difference is that the symbol is a triangle instead of an arrow. On proportional triangle maps, the data are usually encoded in the heights of the triangles, so that the triangles can be interpreted as "hills". Thus, the arrow length in the wind map becomes the hill height in the triangle map. The only thing left behind is the left-right metaphor.

The CNN map adds a detail: some counties have a dark gray color. These are "flipped". A flip is defined as a change in the "sign" of the vote margin from 2020 to 2024. A flipped county can exhibit either a blue or a red hill, and the direction of the flip is constrained by the hill color. If a red hill sits on a dark gray county, we know there was a shift toward Republicans and that the county flipped; it must be that Democrats won the county in 2020 and Republicans won it in 2024. Similarly, if a blue hill sits on a dark gray county, the county must have gone for Republicans in 2020 and flipped to Democrats in 2024.
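That deduction can be written out directly. Assuming we can read only two cues off the map, the hill color and whether the county is shaded dark gray, the winners follow:

```python
def infer_winners(hill_color, flipped):
    """Deduce (2020 winner, 2024 winner) from the CNN map's two cues.

    hill_color: "red" (shift toward Republicans) or "blue" (toward Democrats).
    flipped: True if the county is dark gray, i.e. its winner changed.
    For unflipped counties, these two cues alone don't reveal the winner.
    """
    if not flipped:
        return (None, None)  # winner not determinable from these cues
    if hill_color == "red":
        # Shift toward Republicans plus a flip: Democrats must have held it in 2020.
        return ("Democrats", "Republicans")
    return ("Republicans", "Democrats")
```

Note that for the (far more numerous) unflipped counties, the hill color alone says nothing about who won in either year.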

 


The epidemic of simple comparisons

Another day, another Twitter user sent in a sloppy chart featured on TV news. This CNN graphic comes from Hugo K. by way of Kevin T.

And it's another opportunity to apply the self-sufficiency test.

[Image: CNN chart of Covid-19 case counts, with the data labels removed]

Like before, I removed the data printed on the graphic. In reading this chart, we'd like to know the number of U.S. reported cases of coronavirus relative to China, and Italy relative to the U.S.

So, our eyes trace these invisible lines:

[Image: the same chart with lines tracing the relative column heights]

U.S. cases appear to be roughly two-thirds of China's, while Italian cases appear to be 90% of the U.S.'s.

That's what the visual elements, the columns, are telling us. But it's fake news. Here is the chart with the data:

[Image: the original CNN chart, with the case counts shown]

The counts of reported cases in all three countries were neck and neck around this time.

What this quick exercise shows is that anyone who correctly reads this chart is reading the data off the chart, and ignoring the contradictory message sent by the relative column heights. Thus, the visual elements are not self-sufficient in conveying the message.

***

In a Trifecta Checkup, I'd be most concerned about the D corner. The naive comparison of these case counts is an epidemic of its own. It sometimes leads to poor decisions that can exacerbate public-health problems. See this post on my sister blog.

The difference in case counts between different countries (or regions or cities or locales) is not a direct measure of the difference in coronavirus spread in these places! This is because there are many often-unobserved factors that will explain most if not all of the differences.

After a lot of work by epidemiologists, medical researchers, statisticians and the like, we now realize that different places conduct different numbers of tests. No test, no positive. The U.S. has been slow to ramp up testing.

Less understood is the effect of testing selection. Consider the U.S., where it is still hard to get tested: only those who meet a list of criteria are eligible. Imagine an alternative reality in which the U.S. conducted the same number of tests, but instead of selecting the people most likely to be infected, we tested a random sample. The incidence of the virus in a random sample is much lower than among the likely infected; therefore, in this new reality, the number of positives would be lower despite an equal number of tests.
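A toy simulation makes the point. The prevalence numbers below are invented purely for illustration: people meeting the testing criteria are assumed far more likely to be infected than a random member of the population.

```python
import random

def count_positives(n_tests, prevalence):
    """Simulate testing n_tests people from a group with the given infection rate."""
    return sum(random.random() < prevalence for _ in range(n_tests))

random.seed(2020)
n_tests = 10_000
targeted = count_positives(n_tests, prevalence=0.20)       # triage: test the likely infected
random_sample = count_positives(n_tests, prevalence=0.01)  # random sampling
print(targeted, random_sample)  # same number of tests, very different case counts
```

The reported counts differ by an order of magnitude even though the number of tests, and the underlying epidemic, are identical in the two scenarios.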

That's for an equal number of tests. If test kits are readily available, then a targeted (triage) testing strategy will under-count cases, since mild or asymptomatic infections escape attention. (See my Wired column for problems with triage.)

To complicate things even more, in most countries, the number of tests and the testing selection have changed over time so a cumulative count statistic obscures those differences.

Besides testing, there is a host of other factors that affect reported case counts. These are less talked about now but eventually will be.

Different places have different population densities. A lot of cases in a big city and an equal number of cases in a small town do not signify equal severity; clearly, the situation in the latter is more serious.

Because the virus affects age groups differently, a direct comparison of the case counts without adjusting for age is also misleading. The number of deaths of 80-year-olds in a college town is low not because the chance of dying from COVID-19 is lower there than in a retirement community; it's low because 80-year-olds are a small proportion of the population.
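The arithmetic behind that point is simple. The populations below are made up, and the death risk for infected 80-year-olds is assumed identical in both places:

```python
# Hypothetical numbers: same per-person risk, very different age mixes.
death_risk_80s = 0.05  # assumed death risk for an infected 80-year-old

places = {
    "college town":         {"pop": 50_000, "share_80s": 0.01},
    "retirement community": {"pop": 50_000, "share_80s": 0.30},
}

for name, p in places.items():
    deaths = p["pop"] * p["share_80s"] * death_risk_80s
    print(f"{name}: {deaths:.0f} expected deaths among 80-year-olds")
# college town: 25, retirement community: 750 -- same risk, 30x the raw count
```

The raw counts differ by a factor of 30 purely because of the age mix, not because the virus is any deadlier in one place.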

Next, the cumulative counts ignore which stage of the "epi curve" these countries are at. The following chart can replace most of the charts you're inundated with by the media:

[Image: an epidemic curve, showing the stages of a disease outbreak]

(I found the chart here.)

An epi curve traces the timeline of a disease outbreak. Every location is expected to move through the stages, with cases reaching a peak, and eventually the number of newly recovered exceeding the number of newly infected.

Notice that China, Italy and the U.S. occupy different stages of this curve. It's only proper to compare the U.S. to China and Italy as those countries were at a similar early phase of their respective epi curves.

In addition, any cross-location comparison should account for how reliable the data sources are, and the different definitions of a "case" in different locations.

***

Finally, let's consider the Question posed by the graphic designer. It is the morbid question: which country is hit the worst by coronavirus?

This is a Type DV chart. It's got a reasonable question, but the data require a lot more work to adjust for the list of biases. The visual design is hampered by the common mistake of not starting columns at zero.

 


If we report it, it's a fact

David Leonhardt wrote in the NYT of a shocking incident of statistical abuse committed by Lou Dobbs and the CNN crew.

On several recent occasions, while commenting on the red-hot immigration issue, Lou and company remarked that "there had been 7,000 cases of leprosy in this country over the previous three years, far more than in the past".  (Leprosy is a flesh-eating disease prevalent among immigrants, particularly of Asian or Latin American origin.)

[Image: NYT reproduction of the government's leprosy time-series chart]

When asked about fact-checking, Lou reportedly said: "If we reported it, it's a fact."  A quick visit to the government's leprosy program web-site immediately reveals the time-series chart, shown on the left.  With annual rates at about 150 in the last 5 years or so, one is hard-pressed to find the 7,000 alleged cases!

Furthermore, because this chart lacks comparability, we fail to see that 150 cases out of a population of 300 million represent a minuscule risk.
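For scale, using the round numbers from the chart:

```python
cases_per_year = 150
population = 300_000_000
rate_per_million = cases_per_year * 1_000_000 / population
print(rate_per_million)  # 0.5 -- about one case per two million people per year
```

That is the kind of per-capita comparison the original chart fails to support.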

A slight downward trend is evident in the last 20 years or so; this record is even more impressive when we realize the population grew during this period.  These points can be made clearer in multivariate plots.

Source: "Truth, Fiction and Lou Dobbs", New York Times, May 30, 2007; U.S. National Hansen's Disease web-site.

 


Finding dots

Erik W. alerted me to this CNN map that shows FBI statistics about the safety of American cities.  As Erik pointed out, this is prototypical chartjunk a la Tufte.  A lot of ink is used to depict 12 points of data (top 3 cities in safety, crime, improvement and decline).

[Image: CNN map of U.S. city safety rankings]

Imagine the reader trying to find the 3rd most improved city.  She either has to find all the blue dots and then figure out which is #3; or she needs to find all the #3 dots and figure out which is blue.  As they say, it's "hard work".  In fact, finding the dots among the forest of large text is hard work by itself!

How would I re-make this chart?

  • Highlight only the states containing data (California, Michigan, Missouri, Ohio, Georgia, New Jersey, New York); gray out all other states and their boundaries
  • Separate the states from the cities; only write the State name once for each State; reduce the font size
  • Instead of dots, use numbers.  So the most dangerous city (St Louis) gets a red "1", Oakland gets a purple "3", etc.
  • Remove Mexico, Canada and water from the map

The map gives the false impression that crime is relevant only along the coasts and the lakes, when in fact, the map is just saying that most cities in the U.S. are located along the coasts and the lakes.  Using such a map to depict city-level statistics creates distortion because cities are not evenly distributed across America.

Beyond that, what is the point of this map?  Is it merely a geography class telling us where each city is located?  How is it better than a simple table listing the cities in order?   

Reference: "U.S. City Safety Rankings", CNN, 2006.