Surveys generate a lot of data. And, if you have used a survey vendor, you know they generate a ton of charts.
I was in Germany to attend the Data Meets Viz workshop organized by Antony Unwin. Paul and Sascha from Zeit Online presented some of their work at the German publication, and I was highly impressed by this effort to visualize survey results. (I hope the link works for you. I found that the "scroll" fails on some platforms.)
The survey questions attempted to assess the gap between West and East Germans 25 years after reunification.
The best feature of this presentation is the maintenance of one chart form throughout. This is the general format:
The survey asks whether working mothers is a good thing or not. They choose to plot how the percent agreeing that working mothers is good changes over time. The blue line represents the East German average and the yellow line the West German average. There is a big gap in attitude between the two sides on this issue although both regions have experienced an increase in acceptance of working mothers over time.
All the other lines in the background indicate different subgroups of interest. These subgroups are accessible via the tabs on top. They include gender, education level, and age.
The little red "i" conceals some text explaining the insight from this chart.
Hovering over the "Men" tab leads to the following visual:
Both lines for men sit under the respective average but the shape is roughly the same. (Clicking on the tab highlights the two lines for men while moving the aggregate lines to the background.)
The Zeit team really does an amazing job keeping this chart clean while still answering a variety of questions.
They did make an important choice: not to put every number on this chart. We don't see the percent disagreeing or those who are ambivalent or chose not to answer the question.
Like I said before, what makes this set of charts work is the seamless transition from one question to the next. Every question is given the same graphical treatment. This eliminates learning time going from one chart to the next.
Here is one using a Likert scale, and accordingly, the vertical axis goes from 1 to 7. They plotted the average score within each subgroup and the overall average:
Here is one where they combined the top categories into a "Top 2 Box" type metric:
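For readers who have not met the metric, a "Top 2 Box" score is simply the share of respondents who picked one of the two highest categories. Here is a minimal sketch, assuming a 7-point scale; the function name and the responses are made up for illustration and are not the Zeit data:

```python
def top_two_box(responses, scale_max=7):
    """Share of responses that fall in the two highest categories."""
    if not responses:
        raise ValueError("need at least one response")
    top = sum(1 for r in responses if r >= scale_max - 1)
    return top / len(responses)


# Hypothetical answers on a 1-to-7 scale, for illustration only.
sample = [7, 6, 5, 3, 6, 2, 7, 4]
share = top_two_box(sample)  # four of the eight answers are a 6 or a 7
```

Unlike the average score, this collapses the scale to a single percentage, which is why it plots comfortably in the same chart form as the agree/disagree questions.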
Finally, I appreciate the nice touch of adding tooltips to the series of dots used to aid navigation.
The theme of the workshop was interactive graphics. This effort by the Zeit team is one of the best I have seen. Market researchers take note!
First, I saw Alberto tweet his design for the Wall Street Journal (below is the English version):
The yellow space is the size of the smallest "livable" apartment in Hong Kong, known as the "mosquito" apartment. Livability is defined by the real estate developers.
If you've lived in a tropical area like Hong Kong, you'll understand the obsession with mosquitoes. The itching for days! The sneaky little things that suck your blood!
In Manhattan, it seems like we prefer saying the shoebox apartment. By comparison, it's not that scary. It's larger in size too.
The graphic is fantastic as it offers various comparisons with everyday spaces, like a NYC parking space and a basketball court, whose proportions many Americans have some sense of.
This chart led me down an unexpected path. I found a set of very powerful photos, commissioned by a humanitarian association in Hong Kong. Overwhelming. Here's one:
Yes, that is the entire living space for this family. All of forty square feet.
This article describes the project, as well as links to a number of other equally astounding photos.
These photos are unfair competition for any graphic designer.
Finally, I came across an inspiring, ingenious design. Gary Chang, who is an architect in Hong Kong, created his own apartment (344 square feet, almost 10 times larger than that in the photo, and twice as large as the mosquito apartment) in this amazing, space-saving design.
Through a series of movable walls and beds, his apartment can be configured in 24 different ways. This is a small multiples layout!
Here is an article about his achievement, together with a video tour of his home. Not to be missed. It defines making something out of nothing.
Here is a little graphic describing certain transformations:
Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.
In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.
The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.
This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere. This difficulty, sad to say, is introduced by the designer. It arises from using overly granular data. In this case, the proportions are recorded to one decimal place. This means that a city with 10% is shown separate from one with 10.1%. The effect is jittering the dots, which muddies up densities.
One way to solve this problem is to use a density chart (heatmap).
You no longer have every city plotted but you have a better view of the landscape. You learn that most of the action occurs on the top row, especially on the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces. This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.
For the question the reporter is addressing, the subgroup of cities with 100% white police forces is trivially important. Most of these places have at least 60% white residents, frequently much higher. But if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:
Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.
But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.
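The collapsing step can be sketched in a few lines. This is only an illustration under my own assumptions: the function name, the 10-point cell width and the five cities are all made up, and the real analysis used the Washington Post data.

```python
from collections import Counter


def bin_cities(points, cell=10.0):
    """Count cities per grid cell; each cell is `cell` percentage points wide."""
    last = int(100 // cell) - 1  # index of the top cell on either axis
    counts = Counter()
    for residents_pct, police_pct in points:
        # min() keeps a value of exactly 100% inside the top cell.
        col = min(int(residents_pct // cell), last)
        row = min(int(police_pct // cell), last)
        counts[(col, row)] += 1
    return counts


# Hypothetical (white residents %, white officers %) pairs, for illustration only.
cities = [(10.0, 12.5), (10.1, 18.0), (85.3, 100.0), (92.0, 100.0), (45.0, 40.2)]

# Set aside the all-white police forces, as in the text, then bin the rest.
mixed = [c for c in cities if c[1] < 100.0]
coarse = bin_cities(mixed, cell=10.0)
```

With 10-point cells, the cities at 10% and 10.1% white residents land in the same cell instead of showing up as two separate dots, which is the whole point of trading granularity for a readable density.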
More of the data lie above the bottom-left-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.
The point indicated by the circle is the average city indicated by relative proportions of zero and zero. Notice that now, the densest regions are clustered around the 45-degree dotted diagonal.
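One plausible way to compute such relative proportions is to subtract the average city from every city, so that the average sits at zero and zero. I am assuming simple mean-centering here; the sketch and its two cities are hypothetical:

```python
def center_on_average(points):
    """Shift every (residents %, police %) pair so the average city is (0, 0)."""
    n = len(points)
    mean_residents = sum(r for r, _ in points) / n
    mean_police = sum(p for _, p in points) / n
    return [(r - mean_residents, p - mean_police) for r, p in points]


# Two made-up cities; both have police forces 20 points whiter than their residents.
cities = [(50.0, 70.0), (70.0, 90.0)]
centered = center_on_average(cities)
```

In this toy example both cities carry the same 20-point skew, so after centering they fall exactly on the 45-degree line through the origin. Only cities that deviate from the national pattern move away from it.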
To conclude, the Washington Post data appear to show these insights:
There is a national bias of whites being more likely to be in the police force.
In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)
Most cities conform to the national bias, within an acceptable margin of error.
There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.
Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.
On the sister blog, I wrote about Hans Rosling’s recent presentation in New York (link). I noted that Rosling has apparently simplified his visual palette.
Rosling is best known as the developer of the Gapminder tool, used to visualize global social statistics data collected by national statistical agencies. I wrote favorably about this tool in a series of posts (link). Gapminder made popular the moving bubble chart, although that is not the only graphical form present in the tool.
These animated bubble charts also made Rosling a YouTube star (See here.)
In last week’s presentation, Rosling only showed one moving bubble chart. The rest of his graphics were noticeably simpler, something that anyone can produce in Excel or PowerPoint. Here is one example:
I’m particularly impressed by a simple sequence of charts in which Rosling explains the demographic changes the world is expecting to see in the next 50 to 100 years.
This is an enhanced area chart. Each slice of area is subdivided into stick figures so that an axis for population counts becomes unnecessary.
Instead, the reader sees two useful dimensions: region of the world, and age group.
How the population ages as it grows is the feature story and the effect of aging is ingeniously portrayed as layers. This becomes apparent as Rosling lets time roll forward, and the layers literally walk off the page. (Unfortunately, I couldn't capture each step fast enough.)
(This photo courtesy of Daniel Vadnais.)
When Rosling showed the 2085 projection, we found that the entire rectangle had filled up, so the world population will definitely have grown, roughly by 30 percent. The growth comes from the adult layers filling up; the total number of children has not changed. This is one of the key insights from recent demographic data. The first photo above shows something remarkable: the fertility rate in Asian countries has already plunged to about the same level as that of developed countries.
This set of charts is unusually effective. It represents another level of simplification in visual means. At the same time, the message is sharpened.
As I reported the other day (link), Rosling does not believe modern tools have improved data analysis. This talk, which utilized simple tools, is a good demonstration of his point.
With each succeeding year, I get more and more frustrated with "minimalist" designs that have little respect for users.
This Christmas, I received a portable cellphone charger as a gift. A thoughtful gift. I have heard of these devices but have never touched one. Until a few weeks ago (when I wrote this post).
This is the packaging.
The Phunkee Juice Box is a square cylinder. It has no buttons, and no obvious signals. The only other thing I found in the box was a multi-headed wire. This is as minimal as you can get. Even the brand's name is taped on, as if to say "You don't even have to advertise our name if you don't like it".
I needed to get some power into this battery first. I was in a computer lab with many power outlets but the cord in the box had no plugs. I looked for instructions. This is the back cover:
So how do I use this thing? There's a note at the bottom: Please see detailed instructions inside.
Amusingly, there wasn't anything inside the box that resembled instructions (see the first photo).
Perhaps I could connect the device to one of the lab computers and power it up that way. Instinctively, I inserted the USB connector into the device. Then I realized none of the three remaining connectors could fit into the computer.
The device has two sockets, so I reversed the wire.
Now the USB connector went into the desktop computer while the mini-USB plug went into the Juice Box.
A red light appeared around the neck of the device. It was a persistent light, not blinking, not changing colors. There was only one light on the Juice Box so how much charge did it have?
Then I started having doubts. Was I sure power was flowing from the computer to the Juice Box? Couldn't power be moving from the Juice Box to the computer? What I think caused this confusion was the reversing of the wire. The USB connector was first inserted in the device, then flipped over to the computer. Cords are typically uni-directional but this one might be bi-directional.
An hour later, I didn't see any change. The red light was still on. Someone told me I should use my iPhone plug and insert the Juice Box directly to the socket on the wall. This device made me feel dumb.
Again, the red light came on, and again no other signal was forthcoming. Eventually, after three hours or so, the light turned blue. Finally, I learned that the light turns from red to blue on a full charge. I still have no idea how much charge is in the device at any time.
I left the fully charged device on my desk. One day later, my phone was out of power and I connected the Juice Box the only way I could: the mini-USB plug into the phone, the USB plug into the Juice Box. I had reversed the direction of the cord again. Presumably power was flowing from the battery into my phone. I wasn't sure since the one and only light was completely extinguished. (PS. Turned out no power was moving across. Perhaps the device was defective. Perhaps the power dissipated during those 24 hours of idleness.)
***
You know I will get to visualization eventually. The current trend of hiding labels and text is irritating. The new interface of Google Maps is more confusing to use than the previous interface, not least because of de-cluttering and replacing text with symbols. To read many of today's graphics, stumbling readers must hover over or click on the chart surface--these interactions add nothing to the experience.
Minimalism is taking away unnecessary things. It isn't taking away everything. Please stop torturing users.
I'd like to start 2015 on a happy note. I enjoyed reading the piece by Steven Rattner in the New York Times called "The Year in Charts". (link)
I particularly like the crisp headers, and unfussy language, placing the charts at the center. The components of the story flow nicely.
Here are my notes on some of the charts:
This chart is missing context, which is performance against population growth or potential. Changing the context also changes the implicit yardstick. The implied metric here is more-than-zero growth or continued growth.
It took me a while to find the titles to know what each section depicts. I'd prefer to put the titles back to the top or the top left corner. The "information in my head" is making me look at the "wrong" places. But otherwise, this is Tufte goodness.
This innocent thing prompts a host of questions. First, how could a "median" be found to have so many values within one population? It would appear that this is an exercise in isolating each quintile (decile in the case of the top 20%) and computing the median within each segment. In other words, the data represent these income percentiles: 95th, 85th, 70th, 50th, 30th and 10th. Given that the income data have already been grouped, computing group averages makes more sense than calculating group medians. This is especially so when comparing changes over time. The robust median suppresses changes.
The bucketing of income presents another challenge. All buckets except at the very top are essentially bounded. All the central buckets have minimum and maximum values. The bottom bucket is bounded below by zero. The top bucket, however, is basically unbounded, so important features of this data could be lost by summarizing the top bucket by its median.
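To see which overall percentiles these group medians correspond to, note that the median of the segment between two percentile cutoffs sits halfway between them. A small sketch, assuming the buckets are the bottom four quintiles plus the top quintile split into two deciles (my reading of the chart; the function name is mine):

```python
def bucket_median_percentiles(edges):
    """Overall percentile that the median of each bucket corresponds to."""
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


# Assumed bucket boundaries, in percentiles of the income distribution.
edges = [0, 20, 40, 60, 80, 90, 100]
percentiles = bucket_median_percentiles(edges)
```

Under that assumption the six "medians" are the 10th, 30th, 50th, 70th, 85th and 95th percentiles of the whole population, so only one of them is the median in the usual sense.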
A third problem surfaces if one were to inquire how the survey collects its data. According to the Federal Reserve description, the data concern "usual income" as opposed to "actual income". Respondents are told to ignore "temporary" conditions in describing their "usual incomes". It is likely the case that people think income increases are permanent while getting laid off is temporary so while usual income solves one problem (the long-term planner's problem), it creates a different problem (short-term bias). I particularly don't think it is a good metric for assessing changes around a recession/recovery.
I also wonder about the imputation of missing data. I'd assume that possibly there is a preponderance of missing values for unemployed people. If the imputation cannot predict the employment status of those people, then it would surely have inflated incomes.
I wonder if any of my readers knows details about some of these potential problems. Would love to hear how the Fed's statisticians deal with these issues.
On this chart, the author has found an excellent story, and the graphic is effective. I prefer to see the horizontal axis labelled "More Unequal" as opposed to "Less Equal" because of the convention that "more" is usually placed to the right of "less" on the horizontal axis. Here is a scatter plot version of the data:
It shows the U.S. is a bit more extreme than all others.
This is another great chart. I like the imagery of the emptying middle. I find the labels a bit too long and requiring too much interpreting. I prefer this:
I have yet to understand why the vertical axis of the top chart keeps changing scales over time. The white dot labelled "Peak 1982" (70 million) is barely above the other white dot for "2007" (38 million). This chart hides a clear trend: the population of sheep in New Zealand has plunged by 45% over 25 years.
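The size of the drop follows directly from the two labelled values. A quick check of the arithmetic, using only the 70 million and 38 million printed on the chart (the helper name is mine):

```python
def percent_change(old, new):
    """Change from `old` to `new`, as a percentage of `old`."""
    return (new - old) / old * 100


# Sheep population in millions: the 1982 peak versus 2007, as labelled on the chart.
change = percent_change(70, 38)
```

The result is a decline of a little under 46 percent, so a dot sitting at roughly the same height as the peak cannot be drawn on a uniform scale.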
To address the question of sheep versus human, one should plot the ratio of sheep-to-human directly. In this case, the designer probably faced a problem: because of the plunging population of sheep, the ratio has plunged steeply in 25 years. To make a point that "people are outnumbered more than 9 to 1", the designer didn't want to show a plunging trend. (Could this be the reason why the human population in 1982 was not printed?)
This is a case of too many details. Instead of manipulating the scale to distort the data, one can simply show the current ratio, or the average ratio in the last five years.
As the reader scans to the bottom set of charts, a cognitive wedge is encountered, as the curved scale of the New Zealand chart gave way to the normal uniform scale. These smaller charts are no less confusing, however.
The two lines on these two charts appear almost the same and yet, the Australian chart (on the left) shows a ratio of 4 to 1 while the Icelandic chart (on the right) shows a ratio of 1.5 times. Makes you wonder if each one of the small multiples has a dual axis.
Again, I'm not convinced that the time series adds anything to the message.
Note: I'm traveling during the holidays so updates will be infrequent.
Reader Daniel L. pointed me to a blog post discussing the following weather map:
The author claimed that many readers misinterpreted the red color as meaning high temperatures when he intended to show higher-than-normal temperatures. In other words, the readers did not recognize a relative scale is in play.
That is a minor issue that can be fixed by placing a label on the map.
There are several more irritants, starting with the abundance of what Ed Tufte calls chartjunk. The county boundaries do not serve a purpose, nor is it necessary to place so many place names. State boundaries too are too imposing. The legend fails to explain what the patch of green in Florida means.
The article itself links to a different view of this data on a newly launched site called Climate Prediction Center, by the National Oceanic and Atmospheric Administration (link). Here is a screenshot of the continental U.S.
This chart is the other extreme, bordering on too simple.
I'd suggest adding a little bit of interactivity to this chart, such as:
Hiding the state boundaries and showing them on hover only
Selectively printing the names of major cities to help readers orient themselves
Selectively printing the names of larger cities around the color boundaries
Using a different background map that focuses on the U.S. rather than the entire North American continent
This NYT graphic published on the eve of the Senate elections represents the best of data visualization: it carries its message with a punch.
The link to the web page is here. The graphic proudly occupied the front page of the print edition on Tuesday.
This graphic is not cliched. The typical consequence of such a statement is that it has to come with a reader's manual. The beauty of this beauty is that the required manual is compact:
The rectangular areas indicate the lack of competitiveness in each race. The extremes are: the entirely filled rectangle is a lock from start to finish; and the completely blank rectangle is a 50/50 tossup from start to finish. The more color, the less competitive the race.
Red implies the Republican candidate is projected to be leading at that moment; Blue, the Democrat; and Green, an independent. (The juxtaposition of red and green is one of the few mis-steps here.)
If you stick to the above, you will do fine.
If you start thinking the height of the area is the chance of winning, you run into trouble.
***
Here is a more conventional way to show time-series projections. It is a mirrored line chart, in which one of the two lines is redundant. (This chart shows up elsewhere on the NYT site.)
To turn this into the other style, draw a line through the 50-percent level, erase everything below 50, and then switch from line to area.
On the far right, where it says 75%, you can see that it is precisely half-way between 50 and 100 percent. So the new chart breaks the start-at-zero rule for area charts.
Except... this is an ingenious violation of that rule. Like I said, if you are able to get your head around to thinking that the area maps to lack of competitiveness (or, the amount of lead the leader has, regardless of who's leading), and suppress the urge to interpret the areas as the chance of winning, then the axis starting at 50-percent is not a problem. (I'm assuming that most of these races are in essence two-horse races. If there are more than two viable candidates, this particular chart form doesn't work.)
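The transformation from the mirrored line chart to the filled rectangles can be written down explicitly. This is a sketch of my reading of the chart, not the NYT's code, and it assumes a two-horse race so that the two probabilities add up to one:

```python
def leader_and_height(p_dem):
    """Map the Democrat's win probability to (leader, area height above 50%)."""
    p_rep = 1.0 - p_dem  # two-candidate assumption
    if p_dem == p_rep:
        return None, 0.0  # a pure tossup leaves the rectangle blank
    if p_dem > p_rep:
        return "D", p_dem - 0.5
    return "R", p_rep - 0.5


# Hypothetical probabilities over time, for illustration only.
series = [0.75, 0.5, 0.25, 1.0]
areas = [leader_and_height(p) for p in series]
```

The height runs from 0 (tossup) to 0.5 (lock), so a 75 percent chance fills exactly half of the rectangle. The color carries who is leading, and the height carries only how safe the lead is.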
The payoff is a very compact chart that shows a lot of data in a small space. The NH race was a lock for the Democrats at the start but the lead kept dwindling so that on the eve of the election, the lead had been cut in half. But the halved chance is still 75 percent in favor of the Dems.
Iowa and Colorado both flipped from Democratic to Republican lead around the middle of September.
When the visualization is driven well, the readers have an effortless ride.