Checking the scale on a chart

Dot maps and, by extension, bubble maps are popular options for spatial data, but the scale of these maps can be deceiving. Here is an example of a poorly-scaled dot map:

Farm-Dot Density

The U.S. was primarily an agrarian economy in 1997, if you believe your eyes.

Here is a poorly-scaled bubble map:

image from junkcharts.typepad.com

New Yorkers have all become Citibikers, if you believe what you see.

Last week, I saw a nice dot map embedded inside this New York Times Graphics feature on the destruction of the Syrian city of Raqqa.

Nyt_raqqa_dotmap

Before I conclude that the destruction was broadly felt, I'd like to check the scale on the map to make sure it doesn't have the problem seen above. What is helpful here is the scale provided on the map itself.

Nty_raqqa_scale

That line segment representing a quarter mile fits about 15 dots side by side. Then, I found out that a Manhattan avenue (longer) block is roughly a quarter mile. That means the map places about 15 buildings to an avenue block. In my experience, that sounds about right: I'd imagine 15-20 buildings per block.

So I'm convinced that the designer chose an appropriate scale to display the data. It is actually true that the entire city of Raqqa was pretty much annihilated by U.S. bombs.
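The back-of-envelope check above can be written out in a few lines. This is just a sketch of the arithmetic: the 15 dots and the quarter-mile scale bar come from my reading of the map, and the function name is my own invention.

```python
FEET_PER_MILE = 5280

def feet_per_dot(scale_miles, dots_along_scale):
    """Ground distance covered by one dot, given how many dots fit along the scale bar."""
    return scale_miles * FEET_PER_MILE / dots_along_scale

# About 15 dots fit side by side along the quarter-mile scale bar.
spacing = feet_per_dot(0.25, 15)
print(f"One dot spans about {spacing:.0f} feet")

# If an avenue block (roughly a quarter mile) holds 15 to 20 buildings,
# each building takes up between these two widths.
narrowest = feet_per_dot(0.25, 20)
widest = feet_per_dot(0.25, 15)
print(f"Expected building width: {narrowest:.0f} to {widest:.0f} feet")
```

The dot spacing falls within the range implied by 15 to 20 buildings per block, which is why the scale looks plausible to me.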


Two nice examples of interactivity

Janie on Twitter pointed me to this South China Morning Post graphic showing off the mighty train line just launched between north China and London (!)

Scmp_chinalondonrail

Scrolling down the page simulates the train ride from origin to destination. Pictures of key regions are shown on the left column, as well as some statistics and other related information.

The interactivity has a clear purpose: facilitating cross-reference between two chart forms.

The graphic contains a little oversight ... The label for the key city of Xian, referenced on the map, is missing from the elevation chart on the left here:

Scmp_chinalondonrail_xian

 ***

I also like the way the New York Times handled interactivity in this chart showing the rise in global surface temperature since the 1900s. The accompanying article is here.

Nyt_surfacetemp

When the graph is loaded, the dots get printed from left to right. That's an attention grabber.

Further, when the dots settle, some years sink into the background, leaving the orange dots that show the years without the El Nino effect. The reader can use the toggle under the chart title to view all of the years.

This configuration is unusual. It's more common to show all the data, and allow readers to toggle between subsets of the data. Because the designer inverted this convention, it's likely that few readers need to hit that toggle. The key message of the story concerns the years without El Nino, and that's where the graphic stands.

This is interactivity that succeeds by not getting in the way. 

 

 

 


A look at how the New York Times readers look at the others

Nyt_taxcutmiddleclass

The above chart, when it was unveiled at the end of November last year, got some mileage on my Twitter feed. A reader, Eric N., didn't like it at all, and I think he has a point.

Here are several debatable design decisions.

The chart uses an inverted axis. A tax cut (negative growth) is shown on the right while a tax increase is shown on the left. This type of inversion has gotten others in trouble before, namely, the controversy over the gun deaths chart (link). The green/red color coding is used to signal the polarity although some will argue this is bad for color-blind readers. The annotation below the axis is probably the reason why I wasn't confused in the first place but the other charts further down the page do not repeat the annotation, and that's where the interpretation of -$2,000 as a tax increase is unnatural!

The chart does not aggregate the data. It plots 25,000 households with 25,000 points. Because of the variance of the data, it's hard to judge trends. It's easy enough to see that there are more green dots than red but how many more? 10 percent, 20 percent, 40 percent? It's also hard to answer any specific questions, say, about households with a certain range of incomes. There are various ways to aggregate the data, such as heatmaps, histograms, and so on.

For those used to looking at scientific charts, the x- and y-axes are reversed. By convention, we'd have put the income ranges on the horizontal axis and the tax changes (the "outcome" variable) on the vertical axis.
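To make the aggregation point concrete, here is a minimal sketch of one option: bin the households by income and report the share with a tax cut in each bin. The data below are randomly generated stand-ins, not the Times's 25,000 households, and the bin width and function name are my own assumptions.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible synthetic data

# Hypothetical households: (income in dollars, tax change in dollars).
# A negative tax change is a tax cut. These numbers are made up for illustration.
households = [
    (random.uniform(40_000, 200_000), random.gauss(-800, 1_000))
    for _ in range(25_000)
]

BIN_WIDTH = 40_000  # assumed bin width, chosen only for readability

def share_with_cut_by_income(rows, bin_width=BIN_WIDTH):
    """Return {bin_start: (household count, share with a tax cut)} by income bin."""
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [households, households with a cut]
    for income, tax_change in rows:
        bin_start = int(income // bin_width) * bin_width
        counts[bin_start][0] += 1
        if tax_change < 0:
            counts[bin_start][1] += 1
    return {b: (n, cut / n) for b, (n, cut) in sorted(counts.items())}

summary = share_with_cut_by_income(households)
for bin_start, (n, share) in summary.items():
    print(f"${bin_start:,}-${bin_start + BIN_WIDTH:,}: {n} households, {share:.0%} with a tax cut")
```

A table or bar chart of these shares answers the "how many more?" question directly, at the cost of hiding the household-level spread.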

***

The text labels do not describe the data patterns on the chart so much as they offer additional information. To see this, remove the labels as I have done below. Try adding the labels based on what is shown on the chart.

Nyt_taxcutmiddleclass_2

Perhaps it's possible to illustrate those insights with a set of charts.

***

While reading this chart, I kept wondering how those 25,000 households were chosen. This is a sample of households. The methodology is explained in a footnote, which describes the definition of "middle class" but unfortunately, they forgot to tell us how the 25,000 households were chosen from all such middle-class households.

Nyt_taxcutmiddleclass_footnote

The decision to omit the households with income below $40,000 needs more explanation as it usurps the household-size adjustment. Also, it's not clear that the impact of the tax bill on the households with incomes between $20-40K can be assumed the same as for those above $40K.

Are the 25,000 households a simple random sample of all "middle class" households, or are they chosen in some way to represent the relative counts? It's also useful to know if they applied the $40K cutoff before or after selecting the 25,000 households.

Ironically, the media kit of the Times discloses an affluent readership with median household income of almost $190K so it appears that the majority of readers are not represented in the graphic at all!

 


Storm story, a masterpiece

The visual story published by the New York Times on hurricane Irma is a masterpiece. See the presentation here.

The story starts with the standard presentation of the trajectories of past hurricanes on a map:

Nyt_irma_map

Maps are great at conveying location and direction but much is lost in this rendering - wind speeds, time, strength, energy, to name but a few.

The Times then switches to other chart forms to convey some of the other data. A line chart is used to convey the strength of wind speeds as the storms shake through the Atlantic. Some kind of approximation is used to straighten the trajectories along an east-west orientation.

Nyt_irma_notime

The key insight here is how strong Irma was pretty far out in the Atlantic. The lines in the background can be brought to life by clicking on them. This view omits some details - the passage of time is ignored, and location has been reduced to one dimension.

The display then switches again, and this time it shows time and wind speed.

Nyt_irma_nolocation

This shows Irma's strength, sustaining Category 5 level winds for three days. This line chart ignores location completely.

Finally, a composite metric called cyclone energy is introduced.

Nyt_irma_energy

This chart also ignores location. It does show Irma as a special storm. The storm that has reached the maximum energy by far is Ivan. Will Irma beat that standard? I am not so sure.

Each chart form has limitations. The use of multiple charts helps convey a story from multiple perspectives. A very nice example indeed.

 


Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization are here.)

Nyt_goodschoolsaffordablehomes_nyc

This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart about New York City. Further, the colors of the dots represent the average commute times, which are divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measure the population of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.

***

One positive about this chart is the designer has a very focused question in mind - the choice between living in the central city or living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of the line is, nominally, the trade-off between home price and school quality. The slope is the change in home prices for each unit shift in school quality. But these lines don't really measure that tradeoff because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.

***

The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:

BA_NYC_Flight_Map

The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.

***

In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.

***

A more subtle issue can be seen when comparing San Francisco and Boston regions:

Nyt_goodschoolsaffordablehomes_sfobos

One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.
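Here is a small sketch of why the slopes may not be comparable. The numbers are hypothetical, and the "stretch" factor is my assumption about what a region-specific horizontal scale could involve. A pure shift of the horizontal axis leaves the slope alone, while any region-specific stretch changes it.

```python
def slope(origin, point):
    """Change in home price per unit change in school quality between two points."""
    (x0, y0), (x1, y1) = origin, point
    return (y1 - y0) / (x1 - x0)

def rescale_x(point, center, stretch):
    """Re-express the horizontal value relative to a regional center and stretch factor."""
    x, y = point
    return ((x - center) * stretch, y)

# Hypothetical (grade level, home price in $K) for a central city and one suburb.
city = (-1.0, 500.0)
suburb = (1.0, 700.0)

raw = slope(city, suburb)

# Centering on the city only shifts both points; the slope is unchanged.
shifted = slope(rescale_x(city, -1.0, 1.0), rescale_x(suburb, -1.0, 1.0))

# Compressing the horizontal axis by half doubles the slope.
stretched = slope(rescale_x(city, -1.0, 0.5), rescale_x(suburb, -1.0, 0.5))

print(raw, shifted, stretched)
```

So if the grade-level axis is treated differently from region to region while the price axis is held fixed, a steeper line in one region need not mean a steeper trade-off.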

***

Finally, I'd recommend aggregating the data, and not plotting individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable).

***

If you haven't already, see this related post on my sister blog for a discussion of the data analysis.

 

 

 

 


Attractive, interactive graphic challenges lazy readers

The New York Times spent a lot of effort making a nice interactive graphical feature to accompany their story about Uber's attempt to manipulate its drivers. The article is here. Below is a static screenshot of one of the graphics.

Nytimes_uber_simulation

The illustrative map at the bottom is exquisite. It has Uber cars driving around, it has passengers waiting at street corners, the cars pick up passengers, new passengers appear, etc. There are also certain oddities: all the cars go at the same speed, some strange things happen when cars visually run into each other, etc.

This interactive feature is mostly concerned with entertainment. I don't think it is possible to infer either of the two metrics listed above the chart by staring at the moving Uber cars. The metrics are the percentage of Uber drivers who are idle and the average number of minutes that a passenger waits. Those two metrics are crucial to understanding the operational problem facing Uber planners. You can increase the number of Uber cars on the road to reduce average waiting time but the trade-off is a higher idle rate among drivers.

***

One of the key trends in interactive graphics at the Times is simplification. While a lot of things are happening behind the scenes, there is only one interactive control. The only thing the reader can control is the number of drivers in the grid.

The Times is one of the greatest producers of interactive graphics, so I trust that they know what they are doing. In fact, this article describes some comments made by Gregor Aisch, who works at the Times. The gist is: very few readers play with their interactive graphics. Someone else said, "If you make a tooltip or rollover, assume no one will ever see it." I also have heard someone say (hope this is not merely a voice in my own head): "Every extra button or knob you place on the graphic, you lose another batch of readers." This might be called the law of the interactive knob, analogous to the law of the printed equation in the realm of popular book publishing, which stipulates that for every additional equation you print in a book, you lose another batch of readers.

(Note, however, that we are talking about graphics for communications here, not exploratory graphics.)

***

Several years ago, I introduced the concept of "return on effort" in this blog post. Most interactive graphics are high effort to produce. The question is whether there is enough reward for the readers. 

Junkcharts_return_on_effort_matrix


Political winds and hair styling

Washington Post (link) and New York Times (link) published dueling charts last week, showing the swing-swang of the political winds in the U.S. Of course, you know that the pendulum has shifted riotously rightward towards Republican red in this election.

The Post focused its graphic on the urban / not urban division within the country:

Wp_trollhair

Over Twitter, Lazaro Gamio told me they are calling these troll-hair charts. You certainly can see the imagery of hair blowing with the wind. In small counties (right), the wind is strongly to the right. In urban counties (left), the straight hair style has been in vogue since 2008. The numbers at the bottom of the chart drive home the story.

Previously, I discussed the Two Americas map by the NY Times, which covers a similar subject. The Times version emphasizes the geography, and is a snapshot while the Post graphic reveals longer trends.

Meanwhile, the Times published its version of a hair chart.

Nyt_hair_election

This particular graphic highlights the movement among the swing states. (Time moves bottom to top in this chart.) These states shifted left for Obama and marched right for Trump.

The two sets of charts have many similarities. They both use curvy lines (hair) as the main aesthetic feature. The left-right dimension is the anchor of both charts, and sways to the left or right are important tropes. In both presentations, the charts provide visual aid, and are nicely embedded within the story. Neither is intended as exploratory graphics.

But the designers diverged on many decisions, mostly in the D(ata) or V(isual) corner of the Trifecta framework.

***

The Times chart is at the state level while the Post uses county-level data.

The Times plots absolute values while the Post focuses on relative values (cumulative swing from the 2004 position). In the Times version, the reader can see the popular vote margin for any state in any election. The middle vertical line is keyed to the electoral vote (plurality of the popular vote in most states). It is easy to find the crossover states and times.

The Post's designer did some data transformations. Everything is indexed to 2004. Each number in the chart is the county's current leaning relative to 2004. Thus, left of vertical means said county has shifted more blue compared to 2004. The numbers are cumulative moving top to bottom. If a county is 10% left of center in the 2016 election, this effect may have come about this year, or 4 years ago, or 8 years ago, or some combination of the above. Again, left of center does not mean the county voted Democratic in that election. So, the chart must be read with some care.

One complaint about anchoring the data is the arbitrary choice of the starting year. Indeed, the Times chart goes back to 2000, another arbitrary choice. But clearly, the two teams were aiming to address slightly different variations of the key question.

There is a design advantage to anchoring the data. The Times chart is noticeably more entangled than the Post chart. There is far more criss-crossing. This is particularly glaring given that the Times chart contains many fewer lines than the Post chart, due to state versus county.

Anchoring the data to a starting year has the effect of combing one's unruly hair. Mathematically, they are just shifting the lines so that they start at the same location, without altering the curvature. Of course, this is double-edged: the re-centering means the left-blue / right-red interpretation is co-opted.
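The combing effect can be sketched in a few lines. The margins below are hypothetical (Republican minus Democratic, in percentage points, for 2004 through 2016), and the function name is mine; this is not the Post's actual code or data.

```python
def anchor_to_first(series):
    """Shift a series so it starts at zero; the election-to-election changes are untouched."""
    baseline = series[0]
    return [value - baseline for value in series]

def changes(series):
    """Election-to-election differences, which determine the shape of the line."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical margins for two counties: positive means Republican lead.
county_a = [10, 4, 6, 15]        # Republican-leaning throughout
county_b = [-20, -28, -27, -18]  # Democratic-leaning throughout

anchored_a = anchor_to_first(county_a)
anchored_b = anchor_to_first(county_b)

print(anchored_a)  # both lines now start at the same point
print(anchored_b)

# The shape of each line is preserved by the shift.
print(changes(county_a) == changes(anchored_a))
print(changes(county_b) == changes(anchored_b))
```

Notice that the second county ends up right of center after anchoring even though it voted Democratic in every election. That is the double edge: position now encodes the shift since 2004, not who won.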

On the Times chart, they used a different coping strategy. Each version of their charts has a filter: they highlight the set of lines to demonstrate different vignettes: the swing states moved slightly to the right, the Republican states marched right, and the Democratic states also moved right. Without these filters, the readers would be winking at the Times's bad-hair day.

***

Another decision worth noting: the direction of time. The Post's choice of top to bottom seems more natural to me than the Times's reverse order but I am guessing some of you may have different inclinations.

Finally, what about the thickness of the lines? The Post encoded population (voter) size while the Times used electoral votes. This decision is partly driven by the choice of state versus county level data.

One can consider electoral votes as a kind of log transformation. The effect of electorizing the popular vote is to pull the extreme values to the center. This significantly simplifies the designer's life. To wit, in the Post chart (shown below), they have to apply a filter to highlight key counties, and you notice that those lines are so thick that all the other counties become barely visible.

  Wp_trollhair_texas

 


Mapping the two Americas

If you type "two Americas map" into Google image search, you get the following top results:

Google_twoAmericasmaps

Designers overwhelmingly pick the choropleth map as the way to depict the two nations.

Now, look at these maps from the New York Times (link):


Nytimes_election2016_mapDem

and this:

Nytimes_election2016_mapRep

I believe the background is a relief map. I would like to see one where the color is based on the strength of support for Democrats or Republicans.

The pair of maps is extremely effective at bringing out the story about the splitting of the U.S. population. From a design standpoint, I really like it.

I love, love, love the cute annotations everywhere on the page. I imagine the designer had fun coming up with them.

Nytimes_election2016_mapRep_inset

Pittsburgh Puddle, Cleveland Cove, Cincinnati Slough, ...

***

There is an artistic (or data journalistic) license behind the way the data are processed. Most likely, a 50% cutoff is applied to determine which map a county sits atop. The analysis is at the county level so there is necessarily some simplification... in fact, this aggregation is needed to make the "islands" and other features contiguous.

I am a bit sad that at this moment, we are so focused on what sets us apart, and not what binds us together as a nation.

 

PS. Via twitter, Maciej reacted negatively to these maps: "Horribly tendentious map visualization from the NYT makes the candidate who won more votes look like a tiny minority."

This is a good illustration of selecting the chart form to bring out one's message. If the goal of the chart is to show that Clinton has more votes, I agree that these maps fail to convey that message.

What I believe the NYT designer wants to point out is that the supporters of Clinton are clustered into these densely populated urban areas, leaving the Republicans with most of the land mass. (Like I said above, because of the 50% cutoff criterion, we are over-simplifying the picture. There are definitely Democrats living somewhere in Trump's nation, and likewise Republicans residing in Clinton strongholds.)


How will the Times show election results next week? Will they give us a cliffhanger?

I don't know for sure how the New York Times will present election results next week; it's going to be as hard to predict as the outcome of the election!

The Times just published a wonderful article describing all the different ways election results have been displayed in the past.

tl;dr: The designer has to make hard choices. Some graphics are better at one thing but worse at another. If the designer can prioritize the Qs, then the choice will come naturally. This is why the Q corner is at the top of the Trifecta framework (link).

Nytimes_election_2000

I particularly like the non-map shown at right, published in 2000.

This chart doesn't answer every question you want. But it gives a sense of how the candidates built their path to victory.

The imagery of a building works well here. The foundation of a building is its bottom, consisting of states which lean heavily to one party or the other. These foundational blocks scale with either the skew of the support or the number of electoral votes. The lower down in the building, the more solid is the bloc, which makes a lot of sense.

The three-tier color scheme helpfully separates partisan states, competitive states and swing states.

It's not easy to learn the exact vote totals for each state but the vertical axis is pure Tufte and sufficient for most readers.

All in all, this graphic is top-notch. It takes a little time to perfect but not too much. It has clear takeaways and I feel like I learned much more from this chart than I could in a "purple map" type of rendition.

***

There is a little room for augmentation. It's how they handled the "undecided" states. For me, that is the suspense of this graphic. It's the cliffhanger.

Staring at the chart for the first time, I find that it doesn't address the question of the night: who won? Neither of the "buildings" hits the 270 level required to win the election. Also, there isn't a current vote count so readers have to figure out how many votes are required to win. That's frustrating.

There is an annotation in the middle right, explaining that three states with 37 votes have not yet issued results. That text is better placed near the peaks of the buildings next to the gap where the undecided states would eventually show up.

Also, it is interesting to expand the graphic a bit to address the question of who's likely to win and how. With three states remaining that can go either way, there are eight possible scenarios. It turns out that everything comes down to Florida. Whoever wins Florida wins the election. The other two contests don't matter! (Florida has 25 votes, New Mexico 7, Oregon 5. Gore needs 16 more votes, and Bush needs 24.)
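The arithmetic behind that claim can be checked by enumerating the eight scenarios. This is a minimal sketch using only the vote counts quoted above; the variable and function names are my own.

```python
from itertools import product

# Electoral votes of the three undecided states, as quoted in the text.
undecided = {"Florida": 25, "New Mexico": 7, "Oregon": 5}

# Additional votes each candidate needs to win, as quoted in the text.
needs = {"Gore": 16, "Bush": 24}

def winner(assignment):
    """Return the candidate who reaches the required votes, or None if neither does."""
    gained = {candidate: 0 for candidate in needs}
    for state, candidate in assignment.items():
        gained[candidate] += undecided[state]
    for candidate, required in needs.items():
        if gained[candidate] >= required:
            return candidate
    return None

# Every way of handing the three states to the two candidates: 2 x 2 x 2 = 8 scenarios.
scenarios = [
    dict(zip(undecided, picks))
    for picks in product(needs, repeat=len(undecided))
]

for scenario in scenarios:
    print(scenario, "->", winner(scenario))

# In every scenario, the overall winner is whoever takes Florida.
florida_decides = all(winner(s) == s["Florida"] for s in scenarios)
print("Florida decides every scenario:", florida_decides)
```

New Mexico and Oregon together carry only 12 votes, short of what either candidate needs, which is why the other two contests don't matter.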

Here is one way to present these scenarios. A little bit of hover-over effect will help here, to provide some details of each scenario.

Redo_electoralvotes

 

 

 


Lining up the dopers and their medals

The Times did a great job making this graphic (this snapshot is just the top half):

Nyt_olympicdopers_top

A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wai, the Malaysian badminton silver medalist, was suspended for doping for a short time during 2015, and he was second twice before the doping incident.

They sorted the athletes according to the recency of the latest suspension. This is very smart as it helps make the chart readable. Other common orderings, such as alphabetically by last name, by sport, by age, or by number of medals, would result in a bit of a mess.

I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.