The curse of dimensions

Usually the curse of dimensions concerns data with many dimensions. But today I want to talk about a different kind of curse. This is the curse of dimensions in mapping.

We are only talking about a few dimensions, typically between 3 and 6, so small number of dimensions. And yet it's already a curse. Maps are typically drawn in two dimensions. Those two dimensions are usually spoken for: they show the x- and y-coordinate of space. If we want to include a third, fourth or fifth dimension of data on the map, we have to appeal to colors, shapes, and so on. Cartographers have long realized that adding dimensions involves tradeoffs.

***

Andrew featured some colored bubble maps in a recent post. Here is one example:

Dorlingmap_percenthispanic

The above map shows the proportion of population in each U.S. county that is Hispanic. Each county is represented by a bubble pinned to the centroid of the county. The color of the bubble shows the data, divided into demi-deciles so they are using a equal-width binning method. The size of a bubble indicates the size of a county.

The map is sometimes called a "Dorling map" after its presumptive original designer.

I'm going to use this map to explore the curse of dimensions.

***

It's clear from the design that county-level details are regarded as extremely important. As there are about 3,000 counties in the U.S., I don't see how any visual design can satisfy this requirement without giving up clarity.

More details require more objects, which spread readers' attention. More details contain more stories, but that too dilutes their focus.

Another principle of this map is to not allow bubbles to overlap. Of course, having bubbles overlap or print on top of one another is a visual faux pas. But to prevent such behavior on this particular design means the precise locations are sacrificed. Consider the eastern seaboard where there are densely populated counties: they are not pinned to their centroids. Instead, the counties are pushed out of their normal positions, similar to making a cartogram.

I remarked at the start – erroneously but deliberately – that each bubble is centered at the centroid of each county. I wonder how many of you noticed the inaccuracy of that statement. If that rule were followed, then the bubbles in New England would have overlapped and overprinted. 

This tradeoff affects how we perceive regional patterns, as all the densely populated regions are bent out of shape.

Another aspect of the data that the designer treats as important is county population, or rather relative county population. Relative – because bubble size don't portray absolutes, plus the designer didn't bother to provide a legend to decipher bubble sizes.

The tradeoff is location. The varying bubble sizes, coupled with the previous stipulation of no overlapping, push bubbles from their proper centroids. This forced displacement disproportionately affects larger counties.

***

What if we are willing to sacrifice county-level details?

In this setting, we are not obliged to show every single county. One alternative is to perform spatial smoothing. Intuitively, think about the following steps: plot all these bubbles in their precise locations, turn the colors slightly transparent, let them overlap, blend away the edges, and then we have a nice picture of where the Hispanic people are located.

I have sacrificed the county-level details but the regional pattern becomes much clearer, and we don't need to deviate from the well-understood shape of the standard map.

This version reminds me of the language maps that Josh Katz made.

Joshkatz_languagemap

Here is an old post about these maps.

This map design only reduces but does not eliminate the geographical inaccuracy. It uses the same trick as the Dorling map: the "vertical" density of population has been turned into "horizontal" span. It's a bit better because the centroids are not displaced.

***

Which map is better depends on what tradeoffs one is making. In the above example, I'd have made different choices.

 

One final thing – it's minor but maybe not so minor. Most of the bubbles on the map especially in the middle are tiny; as most of them have Hispanic proportions that are on the left side of the scale, they should be showing light orange. However, all of them appear darker than they ought to be. That's because each bubble has a dark border. For small bubbles, the ratio of ink on the border is a high proportion of the ink for the entire object.


Do you want a taste of the new hurricane cone?

The National Hurricane Center (NHC) put out a press release (link to PDF) to announce upcoming changes (in August 2024) to their "hurricane cone" map. This news was picked up by Miami Herald (link).

New_hurricane_map_2024

The above example is what the map looks like. (The data are probably fake since the new map is not yet implemented.)

The cone map has been a focus of research because experts like Alberto Cairo have been highly critical of its potential to mislead. Unfortunately, the more attention paid to it, the more complicated the map has become.

The latest version of this map comprises three layers.

The bottom layer is the so-called "cone". This is the white patch labeled below as the "potential track area (day 1-5)".  Researchers dislike this element because they say readers tend to misinterpret the cone as predicting which areas would be damaged by hurricane winds when the cone is intended to depict the uncertainty about the path of the hurricane. Prior criticism has led the NHC to add the text at the top of the chart, saying "The cone contains the probable path of the storm center but does not show the size of the storm. Hazardous conditions can occur outside of the cone."

The middle layer are the multi-colored bits. Two of these show the areas for which the NHC has issued "watches" and "warnings". All of these color categories represent wind speeds at different times. Watches and warnings are forecasts while the other colors indicate "current" wind speeds. 

The top layer consists of black dots. These provide a single forecast of the most likely position of the storm, with the S, H, M labels indicating the most likely range of wind speeds at forecast times.

***

Let's compare the new cone map to a real hurricane map from 2020. (This older map came from a prior piece also by NHC.)

Old_hurricane_map_2020

Can we spot the differences?

To my surprise, the differences were minor, in spite of the pre-announced changes.

The first difference is a simplification. Instead of dividing the white cone (the bottom layer) into two patches -- a white patch for days 1-3, and a dotted transparent patch for days 4-5, the new map aggregates the two periods. Visually, simplifying makes the map less busy but loses the implicit acknowledge found in the old map that forecasts further out are not as reliable.

The second point of departure is the addition of "inland" warnings and watches. Notice how the red and blue areas on the old map hugged the coastline while the red and blue areas on the new map reach inland.

Both changes push the bottom layer, i.e. the cone, deeper into the background. It's like a shrink-flation ice cream cone that has a tiny bit of ice cream stuffed deep in its base.

***

How might one improve the cone map? I'd start by dismantling the layers. The three layers present answers to different problems, albeit connected.

Let's begin with the hurricane forecasting problem. We have the current location of the storm, and current measurements of wind speeds around its center. As a first requirement, a forecasting model predicts the path of the storm in the near future. At any time, the storm isn't a point in space but a "cloud" around a center. The path of the storm traces how that cloud will move, including any expansion or contraction of its radius.

That's saying a lot. To start with, a forecasting model issues the predicted average path -- the expected path of the storm's center. This path is (not competently) indicated by the black dots in the top layer of the cone map. These dots offer only a sampled view of the average path.

Not surprisingly, there is quite a bit of uncertainty about the future path of any storm. Many models simulate future worlds, generating many predictions of the average paths. The envelope of the most probable set of paths is the "cone". The expanding width of the cone over time reflects the higher uncertainty of our predictions further into the future. Confusingly, this cone expansion does not depict spatial expansion of either the storm's size or the potential areas that may suffer the greatest damage. Both of those tend to shrink as hurricanes move inland.

Nevertheless, the cone and the black dots are connected. The path drawn out by the black dots should be the average path of the center of the storm.

The forecasting model also generates estimates of wind speeds. Those are given as labels inside the black dots. The cone itself offers no information about wind speeds. The map portrays the uncertainty of the position of the storm's center but omits the uncertainty of the projected wind speeds.

The middle layer of colored patches also inform readers about model projections - but in an interpreted manner. The colors portray hurricane warnings and watches for specific areas, which are based on projected wind speeds from the same forecasting models described above. The colors represent NHC's interpretation of these model outputs. Each warning or watch simultaneously uses information on location, wind speed and time. The uncertainty of the projected values is suppressed.

I think it's better to use two focused maps instead of having one that captures a bit of this and a bit of that.

One map can present the interpreted data, and show the areas that have current warnings and watches. This map is about projected wind strength in the next 1-3 days. It isn't about the center of the storm, or its projected path. Uncertainty can be added by varying the tint of the colors, reflecting the confidence of the model's prediction.

Another map can show the projected path of the center of the storm, plus the cone of uncertainty around that expected path. I'd like to bring more attention to the times of forecasting, perhaps shading the cone day by day, if the underlying model has this level of precision.

***

Back in 2019, I wrote a pretty long post about these cone maps. Well worth revisiting today!


Messing with expectations

A co-worker sent me to the following map, found in Forbes:

Forbes_gastaxmap

It shows the amount of state tax surcharge per gallon of gas in the U.S. And it's got one of the most common issues found in choropleth maps - the color scheme runs opposite to reader expectations.

Typically, if we see a red-green color scale, we would expect red to represent large numbers and green, small numbers. This map reverses the typical setup: California, the state with the heftiest gas tax, is shown green.

I know, I know - if we apply the typical color scheme, California would bleed red, and it's a blue state, damn it.

The solution is to avoid the red color. Just don't use red or blue.

Junkcharts_redo_forbes_gastaxmap_green

There is no need to use two colors either.

***

A few minor fixes. Given that all dollar amounts on the map are shown to two decimal places, the legend labels should also be shown to 2 decimal places, and with dollar signs.

Forbes_gastaxmap_legend

The subtitle should read "Dollars per gallon" instead of "Cents per gallon". Alternatively, keep "Cents per gallon" but convert all data labels into cents.

Some of the states are missing data labels.

***

I recast this as a small-multiples by categorizing states into four subgroups.

Junkcharts_redo_forbes_gastaxmap_split

With this change, one can almost justify using maps because there is sort of a spatial pattern.

 

 


To a new year of pleasant surprises

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

Visualcapitalist_01_Generative_AI_World_map sm

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson of where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to comprehend what they could have meant by that line. I believe this is what that sentence means: Google is not the only search engine out there so by only researching Google search volume, they undercount the true search volume. How did they deal with the missing data problem? They "scaled up" so if Google is 80% of the search volume in a country, then they divide the Google volume by 80% to "scale up" to 100%.

Whenever we use heuristics like this, we should investigate its foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong assumptions – if true, the market would have no need for more than one search engine. Each search engine's audience is just a random sample from the population of all users.

Let's make up some numbers. Let's say Google has 80% share of search volume in Country A, and AI-related search 10% of the overall Google search volume. The remaining search engines have 20% share. Scaling up here means taking the 8% of Google AI-related search volume, divide by 80%, which yields 10%. Since Google owns 8% of the 10%, the other search engines see 2% of overall search volume attributed to AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has 20% share of Country B's search volume. AI-related search on Google is 2%, which is 10% of its total. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume in the dominant search engines in Country B to be also 10%.

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.

What's also true? The less dominant Google is, the smaller proportion of the total data the analysts are able to see, the lower the quality of the available information. So the heuristic is the most influential where it has the greatest uncertainty.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.


Partition of Europe

A long-time reader sent me the following map via twitter:

Europeelects_map

This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, take your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples set up of the data. In a small multiples, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are cramped into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K)  and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially loading up the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of each country form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of a small multiples facilitates comparisons.

***

Here is something I sketched using a tile map.

Eu parties print sm

First, I have to create a tile map of European countries. Some parts, e.g. western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries referred to prior. The tile map imposes an order to the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.

The legend is itself a chart that shows the aggregate statistics for all 27 countries.


Tile maps on a trip

My friend Ray sent me to a recent blog about tile maps. Typical tile maps use squares or hexagons, although in theory many other shapes will do. Unsurprisingly, the field follows the latest development of math researchers who study the space packing problem. The space packing problem concerns how to pack a space with objects. The study of tesselations is to pack space with one or a few shapes.

It was an open question until recently whether there exists an "aperiodic monotile," that is to say, a single shape that can cover space in a non-repeating manner. We all know that we can use squares to cover a space, which creates the familiar grid of squares, but in that case, a pattern repeats itself all over the space.

Now, some researchers have found an elusive aperiodic monotile, which they dubbed the Einstein monotile. Below is a tesselation using these tiles:

Einsteintiles

Within this design, one cannot find a set of contiguous tiles that repeats itself.

The blogger then made a tile map using this new tesselation. Here's one:

Gravitywitheinsteintiles

It doesn't matter what this is illustrating. The blog author cites a coworker, who said: "I can think of no proper cartographic use for Penrose binning, but it’s fun to look at, and so that’s good enough for me." Penrose tiles is another mathematical invention that can be used in a tesselation. The story is still the same: there is no benefit from using these strange-looking shapes. Other than the curiosity factor.

***

Let's review the pros and cons of using tile maps.

Compare a typical choropleth map of the United States (by state) and a tile map by state. The former has the well-known problem that states with the largest areas usually have the lowest population densities, and thus, if we plot demographic data on such maps, the states that catch the most attention are the ones that don't weigh as much - by contrast, the densely populated states in New England barely show up.

The tile map removes this area bias, thus resolving this problem. Every state is represented by equal area.

While the tesselated design is frequently better, it's not always. In many data visualization, we do intend to convey the message that not all states are equal!

The grid arrangement of the state tiles also makes it easier to find regional patterns. A regional pattern is defined here as a set of neighboring states that share similar data (encoded in the color of the tiles). Note that the area of each state is of zero interest here, and thus, the accurate descriptions of relative areas found on the usual map is a distractor.

However, on the tile map, these regional patterns are conceptual. One must not read anything into the shape of the aggregated region, or its boundaries. Indeed, if we use strange-looking shapes like Einstein tiles, the boundaries are completely meaningless, and even misleading.

There also usually is some distortion of the spatial coordinates on a tile map because we'd like to pack the squares or hexagons into a lattice-like structure.

Lastly, the tile map is not scalable. We haven't seen a tile map of the U.S. by county or precinct but we have enjoyed many choropleth maps displaying county- or precinct-level data, e.g. the famous Purple Map of America. There is a reason for this.

***

Here is an old post that contains links to various other posts I've written about tile maps.


Flowing to nowhere

Nyt_colorado_riverThe New York Times printed the following flow chart about water usage of the Colorado River (link).

The Colorado River provides water to more than 10% of the U.S. population. About half is used to feed livestock, another quarter for agriculture, which leaves a quarter to residential and other uses.

***

This type of flow chart in which the widths of the flows encode relative flow volumes is sometimes called a "sankey diagram." 

The most famous sankey diagram of all time may be Minard's depiction of Napoleon's campaign in Russia.

Minards_sankey

In Minard's map, the flows represent movement of troops. The brown color shows advance and the black color shows retreat. The power of this graphic is found how it depicts the attrition of troops over the course of the campaign - on both spatial and temporal dimensions.

Of interest is the choice to disappear these outflows. For most flows, the ending width is smaller than the starting width, the difference being the attrition. On many flow charts, the design imposes a principle of conservation - total outflows equal total inflows, but not here.

Junkcharts_flowchart_conservation

For me, the canonical flow chart describes the physical structure of rivers.

Riverbasinflowdiagram

Flow is conserved here (well, if we ignore evaporation, and absorption into ground water).

Most flow charts we see these days are not faithful to reality - they present abstract concepts.

***

The Colorado River flow chart is an example of an abstract flow chart.

What's depicted cannot be reality. All the water from the Colorado River do not tumble out of a single huge reservoir, there isn't some gigantic pipeline that takes out half of the water and sends them to agricultural users, etc. All the flows on the chart are abstract, not physical in nature.

A conservation principle is enforced at all junctions, so that the sum of the inflows is always the sum of the outflows. In this sense, the chart visually depicts composition (and decomposition). The NYT flow chart shows two ways to decompose water usage at the Colorado River. One decomposition breaks usage down into agriculture, residential, commercial, and power generation. That's an 80/20 split. A second decomposition breaks agriculture into two parts (livestock and crops) while it aggregates the smaller categories into a single "other".

***

The Colorado River flow chart can be produced without knowing a single physical flow from the river basin to an end-user. The designer only requires total water usage, and water usage by subgroup of users.

For most readers, this may seem like a piece of trivia - for data analysts, it's really important to know whether these "flows" are measured data, or implied data.

 

 


Visual story-telling: do you know or do you think?

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.

***

Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with stuff they think they know, but they don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.

Latimes_lifeexpectancy_postcovid

The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?

***

The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map has heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to its politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.

Junkcharts_lifeexpectancy_elections

Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it isn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When placed next to each other, it's obvious that politics don't explain the variance in life expectancy well. The Midwest is deep red and yet they have above average life expectancies. I have circled out various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.

***

The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.

Junkcharts_lifeexpectancy_income

A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.

***

The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.

Junkcharts_lifeexpectancy_covidcases

The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.

***

What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.

 


Bivariate choropleths

A reader submitted a link to Joshua Stephen's post about bivariate choropleths, which is the technical term for the map that FiveThirtyEight printed on abortion bans, discussed here. Joshua advocates greater usage of maps with two-dimensional color scales.

As a reminder, the fundamental building block is expressed in this bivariate color legend:

Fivethirtyeight_abortionmap_colorlegend

Counties are classified into one of these nine groups, based on low/middle/high ratings on two dimensions, distance and congestion.

The nine groups are given nine colors, built from superimposing shades of green and pink. All nine colors are printed on the same map.

Joshuastephens_singlemap

Without a doubt, using these nine related colors are better than nine arbitrary colors. But is this a good data visualization?

Specifically, is the above map better than the pair of maps below?

Joshuastephens_twomaps

The split map is produced by Josh to explain that the bivariate choropleth is just the superposition of two univariate choropleths. I much prefer the split map to the superimposed one.

***

Think about what the reader goes through when comparing two counties.

Junkcharts_bivariatechoropleths

Superimposing the two univariate maps solves one problem: it removes the need to scan back and forth between two maps, looking for the same locations, something that is imprecise. (Unless, the map is interactive, and highlighting one county highlights the same county in the other map.)

For me, that's a small price to pay for quicker translation of color into information.

 

 


Area chart is not the solution

A reader left a link to a Wiki chart, which is ghastly:

House_Seats_by_State_1789-2020_Census

This chart concerns the trend of relative proportions of House representatives in the U.S. Congress by state, and can be found at this Wikipedia entry. The U.S. House is composed of Representatives, and the number of representatives is roughly proportional to each state's population. This scheme actually gives small states disporportional representation, since the lowest number of representatives is 1 while the total number of representatives is fixed at 435.

We can do a quick calculation: 1/435 = 0.23% so any state that has less than 0.23% of the population is over-represented in the House. Alaska, Vermont and Wyoming are all close to that level. The primary way in which small states get larger representation is via the Senate, which sits two senators per state no matter the size. (If you've wondered about Nate Silver's website: 435 Representatives + 100 Senators + 3 for DC = 538 electoral votes for U.S. Presidental elections.)

***

So many things have gone wrong with this chart. There are 50 colors for 50 states. The legend arranges the states by the appropriate metric (good) but in ascending order (bad). This is a stacked area chart, which makes it very hard to figure out the values other than the few at the bottom of the chart.

A nice way to plot this data is a tile map with line charts. I found a nice example that my friend Xan put together in 2018:

Xang_cdcflu_tilemap_lines

A tile map is a conceptual representation of the U.S. map in which each state is represented by equal-sized squares. The coordinates of the states are distorted in order to line up the tiles. A tile map is a small-multiples setup in which each square contains a chart of the same design to faciliate inter-state comparisons.

In the above map, Xan also takes advantage of the foregrounding concept. Each chart actually contains all 50 lines for every state, all shown in gray while the line for the specific state is bolded and shown in red.

***

A chart with 50 lines looks very different from one with 50 areas stacked on each other. California, the most populous state, has 12% of the total population so the line chart has 50 lines that will look like spaghetti. Thus, the fore/backgrounding is important to make sure it's readable.

I suspect that the designer chose a stacked area chart because the line chart looked like spaghetti. But that's the wrong solution. While the lines no longer overlap each other, it is a real challenge to figure out the state-level trends - one has to focus on the heights of the areas, rather than the boundary lines.

[P.S. 2/27/2023] As we like to say, a picture is worth a thousand words. Twitter reader with the handle LHZGJG made the tile map I described above. It looks like this:

Lhzgjg_redo_houseapportionment

You can pick out the states with the key changes really fast. California, Texas, Florida on the upswing, and New York, Pennsylvania going down. I like the fact that the state names are spelled out. Little tweaks are possible but this is a great starting point. Thanks LHZGJG! ]