Graphical forms impose assumptions on the data

In a comment to my previous post, reader Chris P. pointed me to the following set of maps, also from the New York Times crew, on the legalization of gay marriage in the U.S. (link)

 

Nyt_squaremaps_marriage

(For those who did not click through, the orange colors represent two types of bans while the dark gray color indicates legalization.)

These maps are pleasing to the eye for sure. By portraying every state as a same-sized square, the presentation avoids the usual areal distortion introduced by the map.

But not so fast. Note that each presentation makes its own assumption about the relative importance of states. The typical map weights states by geographical area while this presentation assumes that every state has equal weight. Another common cartographic display uses squares of different sizes, based on the population of each state.

The locations of the states are necessarily distorted. One way to remedy this is to have hover-over state labels. In a browser, such interactivity works better than having to scroll to the top, where a larger map doubles as the legend.

It would also be interesting to learn about the future. Is there any legislation in the pipeline either to legalize gay marriage in the remaining orange states or to overturn the legalization laws in the gray states?

 

PS. [5/6/2015] Here is an alternative presentation of this data by David Mendoza.


Observing Rosling’s Current Visual Style

On the sister blog, I wrote about Hans Rosling’s recent presentation in New York (link). I noted that Rosling has apparently simplified his visual palette.

Rosling is best known as the developer of the Gapminder tool, used to visualize global social statistics data collected by national statistical agencies. I wrote favorably about this tool in a series of posts (link). Gapminder made the moving bubble chart popular, although that is not the only graphical form the tool offers.

Gapminder_screengrab

These animated bubble charts also made Rosling a YouTube star (See here.)

***

In last week’s presentation, Rosling showed only one moving bubble chart. The rest of his graphics were noticeably simpler, something that anyone can produce in Excel or PowerPoint. Here is one example:

Image1
 

I’m particularly impressed by a simple sequence of charts in which Rosling explains the demographic changes the world is expecting to see in the next 50 to 100 years.

  Image2

This is an enhanced area chart. Each slice of area is subdivided into stick figures so that an axis for population counts becomes unnecessary.

Instead, the reader sees two useful dimensions: region of the world, and age group.

How the population ages as it grows is the feature story and the effect of aging is ingeniously portrayed as layers. This becomes apparent as Rosling lets time roll forward, and the layers literally walk off the page. (Unfortunately, I couldn't capture each step fast enough.)

Image3

 (This photo courtesy of Daniel Vadnais.)

When Rosling showed the 2085 projection, we found that the entire rectangle had filled up, so the world population has definitely grown, roughly by 30 percent. The growth happens by filling up with adults; the total number of children has not changed. This is one of the key insights from recent demographic data. The first photo above shows something remarkable: the fertility rate in Asian countries has already plunged to about the same level as that of developed countries.

***

This set of charts is unusually effective. It represents another level of simplification in visual means. At the same time, the message is sharpened.

As I reported the other day (link), Rosling does not believe modern tools have improved data analysis. This talk, which used simple tools, is a good demonstration of his point.


Tricky boy William

Last week, I was quite bothered by this chart I produced using the Baby Name Voyager tool.

Bnv_william

According to this chart, William has drastically declined in popularity over time. The name was 7 times more popular back in the 1880s compared to the 2010s. And yet, when I hovered over the chart, the rank of William in 2013 was 3. Apparently, William was the 3rd most popular boy name in 2013.

I wrote the nice people at the website and asked if there might be a data quality issue, and their response was:

The data in our Name Voyager tool is correct. While it may be puzzling, there are definitely less Williams in the recent years than there were in the past (1880s). Although the name is still widely popular, there are plenty of other baby names that parents are using. In the past, there were a limited amount of names that parents would choose, therefore more children had the same name.

What bothered me was that the rate has declined drastically while the number of births was increasing. So, I was expecting William to drop in rank as well. But their explanation makes a lot of sense: if there is a much wider spread of names in recent times, the rank could indeed remain top. It was very nice of them to respond.

***

There are three ways to present this data series, as shown below. One can show the raw counts of William babies (orange line). One can show the popularity against total births (what Baby Name Wizard shows, blue line). One can show the rank of William relative to all other male baby names (green line). Consider how different these three lines look!

Jc_william_3metrics

The rate metric (per million births) adjusts for growth in total births. But the blue line is difficult to interpret in the face of the orange line. In the period 1900 to 1950, the actual number of William babies went up but the blue line came down. The rank is also tough to interpret, especially in the 1970-2000 period when it took a dive, a trend not visible in either the raw counts or the adjusted counts.
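To make the three metrics concrete, here is a minimal Python sketch. The counts, totals and the function name `three_metrics` are invented for illustration; they are not the actual baby-name data. The toy numbers are chosen so that the raw count rises while the rate falls, the same pattern described for 1900 to 1950.

```python
def three_metrics(counts_by_name, total_births, name):
    """Return (raw count, rate per million births, rank) for one name in one year."""
    count = counts_by_name[name]
    # Rate adjusts the raw count for the size of the birth cohort.
    rate_per_million = count * 1_000_000 / total_births
    # Rank 1 is the most common name; a name's rank is one more than
    # the number of names with strictly larger counts.
    rank = 1 + sum(1 for c in counts_by_name.values() if c > count)
    return count, rate_per_million, rank


# Hypothetical data: a small early cohort and a much larger later cohort.
early_counts = {"William": 9_000, "John": 10_000, "James": 6_000}
early_total = 150_000

later_counts = {"William": 20_000, "James": 30_000, "Robert": 28_000, "John": 25_000}
later_total = 1_000_000

for label, counts, total in [("early", early_counts, early_total),
                             ("later", later_counts, later_total)]:
    count, rate, rank = three_metrics(counts, total, "William")
    print(f"{label}: count={count}, rate per million={rate:.0f}, rank={rank}")
```

In this toy setup the count more than doubles while the rate drops to a third, so the orange and blue lines would move in opposite directions even though nothing is wrong with the data.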

Adding to the difficulty is the use of the per-million metric. In the following chart, I show three different scales for popularity: per million, per 100,000, and per 100 (i.e. proportion). The raw count is shown up top.

Jc_william_4scale

All three blue lines are essentially the same but how readers interpret the scales is quite another matter. The per-million births metric is the worst of the lot. The chart shows values in the 20,000-25,000 range in the 1910s but the actual number of William babies was below 20,000 for a number of years. Switching to per-100K helps but in this case, using the standard proportion (the bottom chart) is more natural.
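The mismatch between the per-million values and the raw counts is simple arithmetic: whenever total births in a year are below one million, the per-million figure is larger than the actual count. A small sketch, using a hypothetical year with 18,000 Williams among 800,000 male births (both numbers are assumptions, not the real data):

```python
def popularity_scales(count, total_births):
    """Express the same share of births on three different scales."""
    return {
        "per_million": count * 1_000_000 / total_births,
        "per_100k": count * 100_000 / total_births,
        "percent": count * 100 / total_births,
    }


# Hypothetical year: fewer than one million births in total.
count, total_births = 18_000, 800_000
scales = popularity_scales(count, total_births)

for name, value in scales.items():
    print(f"{name}: {value}")

# The per-million value exceeds the raw count because total_births < 1,000,000.
print(scales["per_million"] > count)
```

The three values differ only by powers of ten, which is why the blue lines have the same shape; only the percent scale can never be mistaken for a count of babies.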

***

The following scatter plot shows the strange relationship between the rate of births and the rank over time for William babies.

Jc_william_rank_prop

Up to the 1990s, there is an intuitive relationship: as the proportion of Williams among male babies declined, so did the rank of William. Then in the 1990s and beyond, the relationship flipped. The proportion of Williams among male babies continued to drop but the rank of William actually recovered!
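The flip is consistent with the website's explanation: if names become more dispersed, the shares of the top names can fall faster than William's share. A toy sketch follows; the two sets of shares and the helper `rank_of` are made up purely to illustrate the mechanism, and the other names are placeholders.

```python
def rank_of(shares, name):
    """Rank 1 is the name with the largest share of births."""
    return 1 + sum(1 for s in shares.values() if s > shares[name])


# Hypothetical era in which a few names dominate.
concentrated = {"Michael": 0.040, "Jason": 0.035, "David": 0.030,
                "James": 0.025, "William": 0.020}

# Hypothetical later era in which parents spread out over many more names,
# so every top name has a smaller share.
dispersed = {"Jacob": 0.012, "Noah": 0.011, "William": 0.010,
             "Liam": 0.009, "Mason": 0.008}

for label, shares in [("concentrated", concentrated), ("dispersed", dispersed)]:
    print(f"{label}: share={shares['William']:.3f}, rank={rank_of(shares, 'William')}")
```

William's share halves between the two toy eras, yet its rank improves, because rank depends only on how a name compares with the others and not on the level of its share.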

 

 


Minimalism as a form of abuse

With each succeeding year, I get more and more frustrated with "minimalist" designs that have little respect for users.

This Christmas, I received a portable cellphone charger as a gift. A thoughtful gift. I have heard of these devices but have never touched one. Until a few weeks ago (when I wrote this post).

This is the packaging.

Package_contents

The Phunkee Juice Box is a square cylinder. It has no buttons, and no obvious signals. The only other thing I found in the box was a multi-headed wire. This is as minimal as you can get. Even the brand's name is taped on, as if to say "You don't even have to advertise our name if you don't like it".

I needed to get some power into this battery first. I was in a computer lab with many power outlets but the cord in the box had no plugs. I looked for instructions. This is the back cover:

Package_instructions

So how do I use this thing? There's a note at the bottom: Please see detailed instructions inside.

Package_backbottom

Amusingly, there wasn't anything inside the box that resembled instructions (see the first photo).

***

Perhaps I could connect the device to one of the lab computers and power it up that way. Instinctively, I inserted the USB connector into the device. Then I realized none of the three remaining connectors could fit into the computer.

Device_plugs

The device has two sockets, so I reversed the wire.

Device_replugged

Now the USB connector went into the desktop computer while the mini-USB plug went into the Juice Box.

Device_charging

A red light appeared around the neck of the device. It was a persistent light, not blinking, not changing colors. There was only one light on the Juice Box so how much charge did it have?

***

Then I started having doubts. Was I sure power was flowing from the computer to the Juice Box? Couldn't power be moving from the Juice Box to the computer? What I think caused this confusion was the reversing of the wire. The USB connector was first inserted in the device, then flipped over to the computer. Cords are typically uni-directional but this one might be bi-directional.

An hour later, I didn't see any change. The red light was still on. Someone told me I should use my iPhone plug and connect the Juice Box directly to the socket on the wall. This device made me feel dumb.

Again, the red light came on, and again no other signal was forthcoming. Eventually, after three hours or so, the light turned blue. Finally, I learned that the light turns from red to blue on a full charge. I still have no idea how much charge is in the device at any time.

I left the fully charged device on my desk. One day later, my phone was out of power and I connected the Juice Box the only way I could: the mini-USB plug into the phone, the USB plug into the Juice Box. I had reversed the direction of the cord again. Presumably power was flowing from the battery into my phone. I wasn't sure since the one and only light was completely extinguished. (PS. Turned out no power was moving across. Perhaps the device was defective. Perhaps the power dissipated during those 24 hours of idleness.)

***
You know I will get to visualization eventually. The current trend of hiding labels and text is irritating. The new interface of Google Maps is more confusing to use than the previous interface, not least because of de-cluttering and replacing text with symbols. To read many of today's graphics, stumbling readers must hover over or click on the chart surface--these interactions add nothing to the experience.

Minimalism is taking away unnecessary things. It isn't taking away everything. Please stop torturing users.


Cloudy and red

Note: I'm traveling during the holidays so updates will be infrequent.

 

Reader Daniel L. pointed me to a blog post discussing the following weather map:

Vane_temp_anomaly

The author claimed that many readers misinterpreted the red color as meaning high temperatures when he intended to show higher-than-normal temperatures. In other words, the readers did not recognize that a relative scale is in play.

That is a minor issue that can be fixed by placing a label on the map.

There are several more irritants, starting with the abundance of what Ed Tufte calls chartjunk. The county boundaries do not serve a purpose, nor is it necessary to show so many place names. The state boundaries, too, are too imposing. The legend fails to explain what the patch of green in Florida means.

The article itself links to a different view of this data on a newly launched site called Climate Prediction Center, by the National Oceanic and Atmospheric Administration (link). Here is a screenshot of the continental U.S.

Cpc_temp_anomaly

This chart is the other extreme, bordering on too simple.

I'd suggest adding a little bit of interactivity to this chart, such as:

  • Hiding the state boundaries and showing them on hover only
  • Selectively printing the names of major cities to help readers orient themselves
  • Selectively printing the names of larger cities around the color boundaries
  • Using a different background map that focuses on the U.S. rather than the entire North American continent

This is a Type V chart.


An uninformative end state

This chart cited by ZeroHedge feels like a parody. It's a bar chart that doesn't utilize the length of bars. It's a dot plot that doesn't utilize the position of dots. The range of commute times (between city centers and airports) from 18 to 111 minutes is compressed into red/yellow/green levels.

20141124_Air4

ZeroHedge got this from Bloomberg Businessweek, which has a data visualization group so this seems strange. The project called "The Airport Frustration Index" is here.

It turns out the above chart is a byproduct of interactivity. The designer illustrates the passage of time by letting lines run across the page. The imagery is that of a horse race. This experiment reminds me of the audible chart by the New York Times (link).

The trick works better when the scale is in seconds, thus real time, as in the NYT chart. On the Businessweek chart, three different scales are simultaneously in motion: real time, elapsed time of the interactive element, and length of the line. Take any two airports: the amount of elapsed time between one "horse" and the other "horse" reaching the right side is not equal to the extra time needed but a fraction of it--obviously, the designer can't have readers wait, say, 10 minutes if that was the real difference in commute times!

Besides, the interactive component is responsible for the uninformative end state shown above.

***

Now, let's take a spin around the Trifecta Checkup. The question being asked is how "painful" is the commute from the city center to the airport. The data used:

Bw_commuteairport_def

Here are some issues with the data that are worth a moment of your time:

In Chapter 1 of Numbers Rule Your World (link), I review some key concepts in analyzing waiting times. The most important concept is the psychology of waiting time. Specifically, not all waiting time is created equal. Some minutes are just more painful than others.

As a simple example, there are two main reasons why Google Maps says it takes longer to get to Airport A than Airport B--distance between the city center and the airport; and congestion on the roads. If in getting to A, the car is constantly moving while in getting to B, half of the time is spent stuck in jams, then the average commuter considers the commute to B much more painful even if the two trips take the same number of physical minutes.

Thus, it is not clear that Google driving time is the right way to measure pain. One quick but incomplete fix is to introduce distance into the metric, which means looking at speed rather than time.
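As a rough illustration of the speed idea, here is a minimal sketch. The distances, times and function name are hypothetical and are not taken from the Businessweek data; average speed is only a crude proxy for congestion, in keeping with the "quick but incomplete" caveat.

```python
def average_speed_mph(distance_miles, minutes):
    """Average speed over the whole trip, in miles per hour."""
    return distance_miles * 60 / minutes


# Two hypothetical airports with identical driving times.
trips = {
    "Airport A": {"distance_miles": 30, "minutes": 45},  # far away, free-flowing
    "Airport B": {"distance_miles": 15, "minutes": 45},  # close by, congested
}

for airport, trip in trips.items():
    speed = average_speed_mph(trip["distance_miles"], trip["minutes"])
    print(f"{airport}: {trip['minutes']} min, {speed:.0f} mph")
```

A metric based on driving time alone treats the two trips as equally painful; the speed column separates them, although it still says nothing about how the slow minutes are distributed along the route.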

Another consideration is whether the "center" of all business trips coincides with the city center. In New York, for instance, I'm not sure what should be considered the "city center". If all five boroughs are considered, I heard that the geographical center is in Brooklyn. If I type "New York, NY" into Google Maps, it shows up at the World Trade Center. During rush hour, the 111 minutes for JFK would be underestimated for most commuters who are located above Canal Street.

I'd consider this effort a Type DV.

 


Circular but insufficient

One of my students analyzed the following Economist chart for her homework.

Economist_book_sales_printversion

I was looking for it online, and found an interactive version that is a bit different (link). Here are three screen shots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.

  Economist_booksales_all

The online version is the self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much in sales is represented by each sector, nor can they reliably estimate the relative scales of print versus ebook (pink/red vs yellow/orange) or year-to-year growth rates.

As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.

The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.

In the Trifecta Checkup, this is a Type V chart.

***

This particular dataset is made for the bumps-style chart:

Redo_economistbooksales

 

 

 


An infographic showing up here for the right reason

Infographics do not have to be "data ornaments" (link). Once in a blue moon, someone finds the right balance of pictures and data. Here is a nice example from the Wall Street Journal, via ThumbsUpViz.

 

Thumbsupviz_wsj_footballinjuries

 

Link to the image

 

What makes this work is that the picture of the running back serves a purpose here, in organizing the data.  Contrast this to the airplane from Consumer Reports (link), which did a poor job of providing structure. An alternative of using a bar chart is clearly inferior and much less engaging.

Redowsjinjuries_bar

***

I went ahead and experimented with it:

Redo_wsj_nflinjuries

 

I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.

Here are three temptations that I did not implement:

  • Omitting the legend
  • Omitting the text labels, which are rendered redundant by the brilliant idea of using the running guy
  • Hiding the bar charts behind a mouseover effect

 


Relevance, to you or me: a response to Cairo

Alberto Cairo discussed a graphic by the New York Times on the slowing growth of Medicare spending (link).

Medicarespend_combined

The chart on the top is published, depicting the quite dramatic flattening of the growth in average spending over the last years--average being the total spend divided by the number of Medicare recipients. The other point of the story is that the decline is unexpected, in the literal sense that the Congressional Budget Office planners did not project its magnitude. (The planners did take the projections down over time so they did project the direction correctly.)

Meanwhile, Cairo asked for a chart of total spend, and Kevin Quealy obliged with the chart shown at the bottom. It shows almost straight line growth.

Cairo's point is that the average does not give the full picture, and we should aim to "show all the relevant data".

***

I want to follow that line of thinking further.

My first reaction is Cairo did not say "show all the data", he said "show the relevant data".  That is a crucial difference. For complex social problems like Medicare, and in general, for "Big Data", it is not wise to show all the data. Pick out the data of interest, and focus on those.

A second reaction. How can "relevance" be defined? Doesn't it depend on what the question is? Doesn't it depend on the interests and persuasion of the chart designer (or reader)? One of the key messages I wish to impart in my book Numbersense (link) is that reasonable people using uncontroversial statistical methods to analyze the same dataset can come to different, even opposite, conclusions. 

Statistical analysis is concerned with figuring out what is relevant and what isn't. This is no different from Nate Silver's choice of signal versus noise. Noise is not just what is bad but also what is irrelevant.

In practice, you present what is relevant to your story. Someone else will do the same. The particular parts of the data that support each story may be different. The two sides have to engage each other, and debate which story has a greater chance of being close to the truth. If the "truth" can be verified in the future, the debate is more easily settled.

Unfortunately, there is no universal standard of relevance.

***

Going back to the NYT story. The chart on total Medicare spending is not as useful as it may seem. This is because an aggregate metric like this for a social phenomenon is influenced by a multitude of factors. Clearly, population growth is a notable factor here. When they use the word "real", I don't know if this means actualized (as opposed to projected), or "in real terms" (that is, inflation adjusted). If not the latter, the value of money would be another factor affecting our interpretation of the lines.

Without some reference levels for population and value of money, it is hard to interpret whether the straight-line growth implies higher or lower spending intensity. For the second chart, I suggest plotting the growth in the number of Medicare recipients. I believe one of the goals of the Affordable Care Act is to reduce the ranks of the uninsured so a direct depiction of this result is interesting.

The average spend can be thought of as population-adjusted. It is a more interpretable number -- but as Cairo pointed out, it is also narrow in scope. This is a tradeoff inherent in all of statistics. To grow understanding, we narrow the scope; but as we focus, we lose the big picture. So, we compile a set of focal points to paint a fuller picture.
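The link between the two charts is the identity total spend = number of recipients × average spend. The sketch below uses round, invented figures (not actual Medicare numbers) to show how the total can keep climbing while the average stays flat, purely because the number of recipients grows.

```python
def average_spend(total_spend, recipients):
    """Population-adjusted spending: total divided by number of recipients."""
    return total_spend / recipients


def growth(old, new):
    """Fractional growth from old to new."""
    return new / old - 1


# Hypothetical figures for two years; NOT actual Medicare data.
year1 = {"recipients": 40_000_000, "total_spend": 400e9}
year2 = {"recipients": 44_000_000, "total_spend": 440e9}

avg1 = average_spend(year1["total_spend"], year1["recipients"])
avg2 = average_spend(year2["total_spend"], year2["recipients"])

g_total = growth(year1["total_spend"], year2["total_spend"])
g_recipients = growth(year1["recipients"], year2["recipients"])
g_average = growth(avg1, avg2)

print(f"total growth:      {g_total:.1%}")
print(f"recipient growth:  {g_recipients:.1%}")
print(f"average growth:    {g_average:.1%}")

# The growth factors multiply: (1 + total) = (1 + recipients) * (1 + average).
print(abs((1 + g_total) - (1 + g_recipients) * (1 + g_average)) < 1e-9)
```

Read this way, the straight-line growth in the total chart cannot be interpreted without the recipient count next to it, which is the reason for suggesting a plot of the number of recipients.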