Appreciating population mountains

Tim Harford tweeted about a nice project visualizing of the world's distribution of population, and wondered why he likes it so much. 

That's the question we'd love to answer on this blog! Charts make us emotional - some we love, some we hate. We like to think that designers can control those emotions, via design choices.

I also happen to like the "Population Mountains" project as well. It fits nicely into a geography class.

1. Chart Form

The key feature is to adopt a 3D column chart form, instead of the more conventional choropleth or dot density. The use of columns is particularly effective here because it is natural - cities do tend to expand vertically upwards when ever more people cramp into the same amount of surface area. 

Jc_popmount

Imagine the same chart form is used to plot the number of swimming pools per square meter. It just doesn't make the same impact. 

2. Color Scale

The designer also made judicious choices on the color scale. The discrete, 5-color scheme is a clear winner over the more conventional, continuous color scale. The designer made a deliberate choice because most software by default uses a continuous color scale for continuous data (population density per square meter).

Jc_popmount_colorscales

Also, notice that the color intervals in 5-color scale is not set uniformly because there is a power law in effect - the dense areas are orders of magnitude denser than the sparsely populated areas, and most locations are low-density. 

These decisions have a strong influence on the perception of the information: it affects the heights of the peaks, the contrasts between the highs and lows, etc. It also injects a degree of subjectivity into the data visualization exercise that some find offensive.

3. Background

The background map is stripped of unnecessary details so that the attention is focused on these "population mountains". No unnecessary labels, roads, relief, etc. This demonstrates an acute awareness of foreground/background issues.

4. Insights on the "shape" of the data 

The article makes the following comment:

What stands out is each city’s form, a unique mountain that might be like the steep peaks of lower Manhattan or the sprawling hills of suburban Atlanta. When I first saw a city in 3D, I had a feel for its population size that I had never experienced before.

I'd strike out population size and replace with population density. In theory, the sum of the areas of the columns in any given surface area gives you the "population size" but given the fluctuating heights of these columns, and the different surface areas (sprawls) of different cities, it is an Olympian task to estimate the volumes of the population mountains!

The more salient features of these mountains, most easily felt by readers, are the heights of the peak columns, the sprawl of the cities, and the general form of the mass of columns. The volume of the mountain is one of the tougher things to see. Similarly, the taller 3D columns hide what's behind them, and you'd need to spin and rotate the map to really get a good feel.

Here is the contrast between Paris and London, with comparable population sizes. You can see that the population in Paris (and by extension, France) is much more concentrated than in the U.K. This difference is a surprise to me.

Jc_popmount_parislondon

5. Sourcing

Some of the other mountains, especially those in India and China, look a bit odd to me, which leads me to wonder about the source of the data. This project has a very great set of footnotes that not only point to the source of the data but also a discussion of its limitations, including the possibility of inaccuracies in places like India and China. 

***

Check out Population Mountains!

 

 

 

 

 


Education deserts: places without schools still serve pies and story time

I very much enjoyed reading The Chronicle's article on "education deserts" in the U.S., defined as places where there are no public colleges within reach of potential students.

In particular, the data visualization deployed to illustrate the story is superb. For example, this map shows 1,500 colleges and their "catchment areas" defined as places within 60 minutes' drive.

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 2

It does a great job walking through the logic of the analysis (even if the logic may not totally convince - more below). The areas not within reach of these 1,500 colleges are labeled "deserts". They then take Census data and look at the adult population in those deserts:

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 4

This leads to an analysis of the racial composition of the people living in these "deserts". We now arrive at the only chart in the sequence that disappoints. It is a pair of pie charts:

Chronicle_edudesserts_pie

 The color scheme makes it hard to pair up the pie slices. The focus of the chart should be on the over or under representation of races in education deserts relative to the U.S. average. The challenge of this dataset is the coexistence of one large number, and many small numbers.

Here is one solution:

Redo_jc_chronedudesserts

***

The Chronicle made a commendable effort to describe this social issue. But the analysis has a lot of built-in assumptions. Readers should look at the following list and see if you agree with the assumptions:

  • Only public colleges are considered. This restriction requires the assumption that the private colleges pretty much serve the same areas as public colleges.
  • Only non-competitive colleges are included. Precisely, the acceptance rate must be higher than 30 percent. The underlying assumption is that the "local students" won't be interested in selective colleges. It's not clear how the 30 percent threshold was decided.
  • Colleges that are more than 60 minutes' driving distance away are considered unreachable. So the assumption is that "local students" are unwilling to drive more than 60 minutes to attend college. This raises a couple other questions: are we only looking at commuter colleges with no dormitories? Is the 60 minutes driving distance based on actual roads and traffic speeds, or some kind of simple model with stylized geometries and fixed speeds?
  • The demographic analysis is based on all adults living in the Census "blocks" that are not within 60 minutes' drive of one of those colleges. But if we are calling them "education deserts" focusing on the availability of colleges, why consider all adults, and not just adults in the college age group? One further hidden assumption here is that the lack of colleges in those regions has not caused young generations to move to areas closer to colleges. I think a map of the age distribution in the "education deserts" will be quite telling.
  • Not surprisingly, the areas classified as "education deserts" lag the rest of the nation on several key socio-economic metrics, like median income, and proportion living under the poverty line. This means those same areas could be labeled income deserts, or job deserts.

At the end of the piece, the author creates a "story time" moment. Story time is when you are served a bunch of data or analyses, and then when you are about to doze off, the analyst calls story time, and starts making conclusions that stray from the data just served!

Story time starts with the following sentence: "What would it take to make sure that distance doesn’t prevent students from obtaining a college degree? "

The analysis provided has nowhere shown that distance has prevented students from obtaining a college degree. We haven't seen anything that says that people living in the "education deserts" have fewer college degrees. We don't know that distance is the reason why people in those areas don't go to college (if true) - what about poverty? We don't know if 60 minutes is the hurdle that causes people not to go to college (if true).We know the number of adults living in those neighborhoods but not the number of potential students.

The data only showed two things: 1) which areas of the country are not within 60 minutes' driving of the subset of public colleges under consideration, 2) the number of adults living in those Census blocks.

***

So we have a case where the analysis is incomplete but the visualization of the analysis is superb. So in our Trifecta analysis, this chart poses a nice question and has nice graphics but the use of data can be improved. (Type QV)

 

 

 


Environmental science can use better graphics

Mike A. pointed me to two animated maps made by Caltech researchers published in LiveScience (here).

The first map animation shows the rise and fall of water levels in a part of California over time. It's an impressive feat of stitching together satellite images. Click here to play the video.

Caltech_groundwater_map1

The animation grabs your attention. I'm not convinced by the right side of the color scale in which the white comes after the red. I'd want the white in the middle then the yellow and finally the red.

In order to understand this map and the other map in the article, the reader has to bring a lot of domain knowledge. This visualization isn't easy to decipher for a layperson.

Here I put the two animations side by side:

Caltech_groundwater_side

The area being depicted is the same. One map shows "ground deformation" while the other shows "subsidence". Are they the same? What's the connection between the two concepts (if any)?  On a further look, one notices that the time window for the two charts differ: the right map is clearly labeled 1995 to 2003 but there is no corresponding label on the left map. To find the time window of the left map, the reader must inspect the little graph on the top right (1996 to 2000).

This means the time window of the left map is a subset of the time window of the right map. The left map shows a sinusoidal curve that moves up and down rhythmically as the ground shifts. How should I interpret the right map? The periodicity is no longer there despite this map illustrating a longer time window. The scale on the right map is twice the magnitude of the left map. Maybe on average the ground level is collapsing? If that were true, shouldn't the sinusoidal curve drift downward over time?

Caltech_groundwater_sineThe chart on the top right of the left map is a bit ugly. The year labels are given in decimals e.g. 1997.5. In R, this can be fixed by customizing the axis labels.

I also wonder how this curve is related to the map it accompanies. The curve looks like a model - perfect oscillations of a fixed period and amplitude. But one suppose the amount of fluctuation should vary by location, based on geographical features and human activities.

The author of the article points to both natural and human impacts on the ground level. Humans affect this by water usage and also by management policies dictated by law. It would be very helpful to have a map that sheds light on the causes of the movements.


Visualizing the Thai cave rescue operation

The Thai cave rescue was a great story with a happy ending. It's also one that lends itself to visualization. A good visualization can explain the rescue operation more efficiently than mere words.

A good visual should bring out the most salient features of the story, such as:

  • Why the operation was so daunting?
  • What were the tactics used to overcome those challenges?
  • How long did it take?
  • What were the specific local challenges that must be overcome?
  • Were there any surprises?

In terms of what made the rescue challenging, some of the following are pertinent:

  • How far in they were?
  • How deep were they trapped?
  • How much of the caves were flooded? Why couldn't they come out by themselves?
  • How much headroom was there in different sections of the cave "tunnel"?

There were many attempts at visualizing the Thai cave rescue operation. The best ones I saw were: BBC (here, here), The New York Times (here), South China Morning Post (here) and Straits Times (here). It turns out each of these efforts focuses on some of the aspects above, and you have to look at all of them to get the full picture.

***

BBC's coverage began with a top-down view of the route of the rescue, which seems to be the most popular view adopted by news organizations. This is easily understood because of the standard map aesthetic.

Bbc_102494059_caves_976

The BBC map is missing a smaller map of Thailand to place this in a geographical context.

While this map provides basic information, it doesn't address many of the elements that make the Thai cave rescue story compelling. In particular, human beings are missing from this visualization. The focus is on the actions ("diving", "standing"). This perspective also does not address the water level, the key underlying environmental factor.

***

Another popular perspective is the sideway cross-section. The Straits Times has one:

Straittimes_thai rescue_part

The excerpt of the infographic presents a nice collection of data that show the effort of the rescue. The sideway cross-sectional section shows the distance and the up-and-down nature of the journey, the level of flooding along the route, plus a bit about the headroom available at different points. Most of these diagrams bring out the "horizontal" distance but somehow ignore the "vertical" distance. One possibility is that the real trajectory is curvy - but if we can straighten out the horizontal, we should be able to straighten out the vertical too.

The NYT article gives a more detailed view of the same perspective, with annotations that describe key moments along the rescue route.

Nyt_detailed_thairescueroute

If, like me, you like to place humans into this picture, then you have to go back to the Straits Times, where they have an expanded version of the sideway cross-section.

  Straitstimes_riskyroute_thairescue

This is probably my most favorite single visualization of the rescue operation.

There are better cartoons of the specific diving actions, though. For example, the BBC has this visual that shows the particularly narrow part of the route, corresponding to the circular inset in the Straits Times version above.

Bbc_thairescue_tightspace

The drama!

NYT also has a set of cartoons. Here's one:

Nyt_thairescue_divers

***

There is one perspective that curiously has been underserved in all of the visualizations - this is the first-person perspective. Imagine the rescuer (or the kids) navigating the rescue route. It's a cross-section from the front, not from the side.

Various publications try to address this by augmenting the top-down route view with sporadic cross-sectional diagrams. Recall the first map we showed from the BBC. On the right column are little annotations of this type (here):

Bbc_thaicaverescue_crosssection

I picked out this part of the map because it shows that the little human figure serves two potentially conflicting purposes. In the bottom diagram, the figurine shows that there is limited headroom in this part of the cave, plus the actual position of the figurine on the ledge conveys information about where the kids were. However, on the top cross-section, the location of the figure conveys no information; the only purpose of the human figure is to show how tall the cave is at that site.

The South China Morning Post (here - site appears to be down when I wrote this) has this wonderful animation of how the shape of the headroom changed as they navigated the route. Please visit their page to see the full animation. Here are two screenshots:

Scmp_caveshape_1

Scmp_caveshape_2

This little clip adds a lot to the story! It'd be even better if the horizontal timeline at the bottom is replaced by the top-down route map.

Thank you all the various dataviz teams for these great efforts.

 

 

 


Checking the scale on a chart

Dot maps, and by extension, bubble maps are popular options for spatial data; but the scale of these maps can be deceiving. Here is an example of a poorly-scaled dot map:

Farm-Dot Density

The U.S. was primarily an agrarian economy in 1997, if you believe your eyes.

Here is a poorly-scaled bubble map:

image from junkcharts.typepad.com

New Yorkers have all become Citibikers, if you believe what you see.

Last week, I saw a nice dot map embedded inside this New York Times Graphics feature on the destruction of the Syrian city of Raqqa.

Nyt_raqqa_dotmap

Before I conclude that the destruction was broadly felt, I'd like to check the scale on the map to make sure it doesn't have the problem seen above. What is helpful here is the scale provided on the map itself.

Nty_raqqa_scale

That line segment representing a quarter mile fits about 15 dots side by side. Then, I found out that a Manhattan avenue (longer) block is roughly a quarter mile. That means the map places about 15 buildings to an avenue block. In my experience, that sounds about right: I'd imagine 15-20 buildings per block.

So I'm convinced that the designer chose an appropriate scale to display the data. It is actually true that the entire city of Raqqa was pretty much annihilated by U.S. bombs.


Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growthSan Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.

***

In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?" and with respect to the question being asked.

The question is how much has the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

The number of complaints about human waste has increased from 2011 to 2014 to 2017.

The areas where there are complaints about human waste expanded.

The worst areas are around downtown, and that has not changed during this period of time.

***

Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does no green dot indicate no complaints or below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost all no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!

 

As a reference, I found this map of the population density in San Francisco (link):

SFO_Population_Density

 


Two thousand five hundred ways to say the same thing

Wallethub published a credit card debt study, which includes the following map:

Wallethub_creditcardpaydownbyCity

Let's describe what's going on here.

The map plots cities (N = 2,562) in the U.S. Each city is represented by a bubble. The color of the bubble ranges from purple to green, encoding the percentile ranking based on the amount of credit card debt that was paid down by consumers. Purple represents 1st percentile, the lowest amount of paydown while green represents 99th percentile, the highest amount of paydown.

The bubble size is encoding exactly the same data, apparently in a coarser gradation. The more purple the color, the smaller the bubble. The more green the color, the larger the bubble.

***

The design decisions are baffling.

Purple is more noticeable than the green, but signifies the less important cities, with the lesser paydowns.

With over 2,500 bubbles crowding onto the map, over-plotting is inevitable. The purple bubbles are printed last, dominating the attention but those are the least important cities (1st percentile). The green bubbles, despite being larger, lie underneath the smaller, purple bubbles.

What might be the message of this chart? Our best guess is: the map explores the regional variation in the paydown rate of credit card debt.

The analyst provides all the data beneath the map. 

Wallethub_paydownbyCity_data

From this table, we learn that the ranking is not based on total amount of debt paydown, but the amount of paydown per household in each city (last column). That makes sense.

Shouldn't it be ranked by the paydown rate instead of the per-household number? Divide the "Total Credit Card Paydown by City" by "Total Credit Card Debt Q1 2018" should yield the paydown rate. Surprise! This formula yields a column entirely consisting of 4.16%.

What does this mean? They applied the national paydown rate of 4.16% to every one of 2,562 cities in the country. If they had plotted the paydown rate, every city would attain the same color. To create "variability," they plotted the per-household debt paydown amount. Said differently, the color scale encodes not credit card paydown as asserted but amount of credit card debt per household by city.

Here is a scatter plot of the credit card amount against the paydown amount.

Redo_creditcardpaydown_scatter

A perfect alignment!

This credit card debt paydown map is an example of a QDV chart, in which there isn't a clear question, there is almost no data, and the visual contains several flaws. (See our Trifecta checkup guide.) We are presented 2,562 ways of saying the same thing: 4.16%.

 

P.S. [6/22/2018] Added scatter plot, and cleaned up some language.

 

 

 


Digital revolution in China: two visual takes

The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as making a Silicon Valley. 

Economist_digitalsilkroad

The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:

  • GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
  • The visual design favors GDP per capita which is encoded as the shade of color of each province. The Internet penetration data appeared added on as an afterthought.

If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!

***

If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.

Redo_jc_china_digitalsilkroad2

(Click on the image to see it in full.) Here is the top of the graphic:

Redo_jc_china_digitalskilkroad_detail

The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because regional pattern isn't salient here.

The Internet penetration data were grouped into five categories as well, putting it on equal footing as GDP per capita.

 


This map steals story-telling from the designer

Stolen drugs is a problem at federal VA hospitals according to the following map.

Hospitals_losing_drugs

***

Let's evaluate this map from a Trifecta Checkup perspective.

VISUAL - Pass by a whisker. The chosen visual form of a map is standard for geographic data although the map snatches story-telling from our claws, just as people steal drugs from hospitals. Looking at the map, it's not clear what the message is. Is there one?

The 50 states plus DC are placed into five groups based on the reported number of incidents of theft. From the headline, it appears that the journalist conducted a Top 2 Box analysis, defining "significant" losses of drugs as 300 incidents or more. The visual design ignores this definition of "significance."

DATA - Fail. The map tells us where the VA hospitals are located. It doesn't tell us which states are most egregious in drug theft. To learn that, we need to compute a rate, based on the number of hospitals or patients or the amount of spending on drugs.

Looking more carefully, it's not clear they used a Top 2 Box analysis either. I counted seven states with the highest level of theft, followed by another seven states with the second highest level of theft. So the cutoff of twelve states awkwardly lands in between the two levels.

QUESTION - Fail. Drug theft from hospitals is an interesting topic but the graphic does not provide a good answer to the question.

***

Even if we don't have data to compute a rate, the chart is a bit better if proportions are emphasized, rather than counts.

Redo_hospitaldrugloss
 

The proportions are most easily understood from the base of four quarters making the whole. The first group is just over a quarter; the second group is exactly a quarter. The third group plus the first group roughly make up a half. The fourth and fifth groups together almost fills out a quarter.

In the original map, we are told about at least 400 incidents of theft in Texas but given no context to interpret this statistic. What proportion of the total thefts occur in Texas?

 

 


Hog wild about dot maps

Reader Chris P. sent me this chart.

This was meant to be "light entertainment." See the Twitter discussion below.

9gag_hogsmap

***

Let's think a bit about the dot map as a data graphic.

Dot maps are one dimensional. The dot's location is used to indicate the latitude and longitude and therefore the x,y coordinates cannot encode any other data. If we have basically a black/white chart, as in this hog map, the dot can only encode binary data (yes/no).

The legend says "each dot represents 5,000 hogs." Think about how that statement applies to these scenarios:

  • Do you expect to see something different between the dot representing 4,200 and the one showing 4,900?
  • Do you expect to see something different between the dot representing 400 and 4,000?
  • Do you expect to see something different between the location with 4,800 hogs and 9,600 hogs?


Based on the legend, the designer would need two dots to represent 10,000 hogs. But those two dots pertain to the same location. Sometimes, "jitter" is added, and the two dots are placed side by side. However, with the scale of the map of the U.S., and the dots representing seemingly small neighborhoods, jitter creates more confusion than anything. Also, what about 3, 4, 5, .. dots in the same location?

 9gag_hogmap_inset

Looking at the details above, are the dots jittered or do they represent neighboring locations?

Sometimes, colors are used to encode data on a dot map. But each dot can only contain one color, so it only typically shows the top category in each location.

Dot maps are very limited. Think before you use them.