New York/Tri-State residents: Meet me at NYU Bookstore tonight, 6-7:30 pm. (link)
When I wrote about the graphic showing the vote distribution around Syria in Congress a few posts ago (link), readers offered opinions about what a better graphic might look like. Having considered these submissions, I came up with a new visualization.
This graphic is one that facilitates an assessment of the prospect of the Syria resolution passing, given the known and leaning votes. It addresses various scenarios of how the undecided votes would break out. It also considers the extreme -- and unlikely -- case in which all leaning yes votes are sustained, all leaning no votes reverse, and all undecided vote yes. In that scenario, the President would have 131% of the votes needed for passing the resolution.
In this graphic, the real story of the data is revealed: based on the then known and leaning votes, the President would face certain defeat. Even if all the undecided broke in his favor, he would still only get to 86% of the votes needed to pass.
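The scenario arithmetic behind the graphic can be sketched as follows. The whip counts below are hypothetical placeholders, not the actual tallies from the graphic; only the calculation (each yes tally expressed as a percentage of the votes needed to pass) is the real point.

```python
# Scenario arithmetic for the Syria vote graphic.
# All whip counts are invented placeholders; the real tallies are in the chart.
GOAL = 271  # the vote goal cited in the post

firm_yes, lean_yes = 50, 30    # hypothetical
lean_no, undecided = 40, 165   # hypothetical

def pct_of_goal(votes, goal=GOAL):
    """Express a yes tally as a percentage of the votes needed to pass."""
    return round(100 * votes / goal, 1)

floor = pct_of_goal(firm_yes)                           # only firm yes holds
likely = pct_of_goal(firm_yes + lean_yes)               # leaners hold
best = pct_of_goal(firm_yes + lean_yes + undecided)     # undecided all break yes
extreme = pct_of_goal(firm_yes + lean_yes + undecided + lean_no)  # lean-no flip too
```

Plotting these percentages against the 100% line is what lets the chart answer the "will it pass?" question directly.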
The top bar, showing composition, is a concession to those who wanted to understand how each party is voting under each scenario. It's a minor concern here.
Comparison to the original chart, reproduced below, is almost unfair. What is the prospect of the resolution passing? It's impossible to tell.
My graphic exposes less data, hides all No and Leaning No votes, displays no vote totals, and focuses on a computed metric, the proportional progress towards the 271 vote goal.
Kevin Drum shows the following graphic (link) to illustrate where the House stood on authorizing force in Syria.
What interests me is whether the semi-circle concept adds to the chart. It evokes the physical appearance of a chamber, presumably where such a debate has taken place -- although most televised hearings tend to exhibit lots of empty seats.
The half-filled circles in particular don't sit well with me.
Reader Steph G. didn't like the effort by WRAL (North Carolina) to visualize the demographics of protestors in Raleigh. It sounds like the citizens of NC are making their voices heard. Maybe my friends in Raleigh can give us some background.
There are definitely problems with the choice of charts. But I rate this effort a solid B. In the Trifecta Checkup, they did a good job describing the central question and compiled an appropriate dataset. I love it when people go out to collect the right data rather than use whatever they can grab. The issue was the execution of the charts.
The first was a map showing where the arrested protestors came from.
Maps are typically used to show geographical distribution. The chosen color scheme (two levels of green and gray) compresses the data so much that we learn almost nothing about distribution. I clicked on Wake County to learn that there were 178 arrests there. The neighboring Randolph County had only 1 arrest but you can't tell from the colors.
The next chart shows the trend of arrests over time. I like the general appearance (except for the shadows). The problem is the even spacing of the columns when the gaps between the arrests are uneven.
Here's a quick redo, with proper spacing:
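The spacing fix amounts to placing each column at its true date position instead of at even intervals. A minimal sketch, with invented dates and counts standing in for the WRAL arrest data:

```python
# Place columns at their true date positions instead of evenly spaced slots.
# Dates and counts below are invented stand-ins for the WRAL arrest data.
from datetime import date

arrests = [(date(2013, 4, 29), 17),
           (date(2013, 5, 6), 30),
           (date(2013, 5, 13), 49),
           (date(2013, 6, 3), 84)]

origin = arrests[0][0]
positions = [(d - origin).days for d, _ in arrests]  # days since the first arrest
# positions reflects the real gaps: the three-week lull before the last event
# is visible, whereas even spacing renders all gaps identically.
```

Feeding these positions (rather than 0, 1, 2, 3) to the x-axis of any charting tool produces the properly spaced version.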
The final set of charts is inspired. They compare the demographics of those arrested protestors against the average North Carolina resident. For example:
For categories like Age with quite a few levels, the pie chart isn't a good choice. It's also hard to compare across pie charts. A column or dot chart works better.
The New York Times brought attention to the Bronx courtrooms this weekend. (link) The following small-multiples chart effectively illustrates how the Bronx system is uniquely unproductive, compared to the other boroughs:
The above chart shows the outcomes. The next chart shows the possible cause.
It appeared that at any time of the day, at least one-third of the courtrooms were not actively conducting business. In fact, outside of the period between 10:30 and 12:30, and around 2:30, fewer than half of the courtrooms had a judge present.
I want to draw your attention to the caption below the chart. It said: "The Times visited all 47 courtrooms at the Bronx County Hall of Justice in 30-minute intervals, totaling how many were open and actively in session, ..."
Too often, we analyze and plot whatever data has been collected conveniently by some machine. Such data frequently do not address the questions we'd like to answer. We let the data dictate our research question.
Most great work in statistics comes from people who put in the effort to define their research goals first, and then manually collect the specific data needed to accomplish those goals.
Felix linked to a set of charts about guns in the U.S. (and elsewhere). The original charts, by Liz Fosslien, are found here.
I like the clean style used by Fosslien. Some of the charts are thought-provoking. Many of them may raise more questions than they answer. Here are a few that caught my eye.
A simplistic interpretation would claim that banning handguns is futile, and may even have an adverse impact on the murder rate. However, this chart does not reveal the direction of causality. Did some countries ban handguns in reaction to higher violence? If so, this chart is confirming that the countries with handgun bans are a self-selected group.
The U.S. is an outlier, both in terms of firearm ownership and firearm homicides. This makes the analysis much harder because the U.S. is really in a class of its own. It's not at all clear whether there is a positive correlation in the cluster below, and even if there is, whether we can draw a straight line up to the U.S. dot is also dubious.
Fosslien is being cheeky in denying us the identity of the other outlier, the country with few firearms but an even higher death rate from intentional homicide. By the way, these scatter plots are great for showing bivariate distributions.
I'd still prefer a line chart for this type of data, but this particular paired bar chart works for me as well. The contents of this chart are a shock to me.
Xan G. has a must-read post comparing different ways of showing the electoral map. See here.
The key learning is something I often point out on this blog: geographical data can have a greater impact when it is unshackled from the map.
Xan pointed to a series of ideas that are improvements upon the map.
Here's an attempt to portray the election night as a horse race. This borrows an idea from the sports world where a baseball game can be portrayed with such a chart.
I love this sort of presentation. Similar to a baseball game, someone can look at this chart after the fact and experience the ups and downs of an Obama/Romney supporter without actually being there.
Then Xan spoils some of the fun by transforming the above into the following chart, which portrays Obama's win as a rout. All the suspense is gone!
As Xan explains it, he took Nate Silver's predictions of "sure wins" and plotted those first. Thus, Obama started the night at almost 200 while Romney started with about 170.
While indeed the fun is gone, this is a more accurate view of the just-concluded election. I was a spoilsport myself that night, as I kept telling my friends that the only reason Romney seemed close at the start was that the Red States generally have smaller populations, and thus take less time to count their votes. In addition, the Red States tend to favor Republican candidates by very large margins, so the winner can be called early without counting most of the votes.
I have other thoughts on the state of reporting on polls, which I'll cover in a later post.
The November issue of Bloomberg Markets published the following pair of pyramid charts:
This chart fails a number of tests:
Tufte's data-ink ratio test
There are a total of six data points in the entire graphic. A mathematician would say only four data points, since the "no opinion" category is just the remainder. The designer lavishes this tiny data set with a variety of effects: colors, triangles, fonts of different tints, fonts of different sizes, solid and striped backgrounds, and legends, making something that is simple much more complex than necessary. The extra stuff impedes rather than improves understanding. In fact, there were so many parts that the designer even forgot to add little squares on the right panel beside the category labels.
Junk Charts's Self-sufficiency test
The data are encoded in the heights of the pyramids, not the areas. The shapes are also inconsistent, which makes the chart impossible to decipher: the way it is set up, one must compare the green, striped triangle with two trapezoids. This is when a designer realizes that he or she must print the data labels onto the chart as well. That's when self-sufficiency is violated. Cover up the data labels, and the graphical elements themselves no longer convey the data to the readers. More posts about self-sufficiency here.
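The height-versus-area problem can be made concrete with a little geometry. A minimal sketch: if the data are encoded in the heights but the ink sits in the areas of similar triangles, a doubled value shows up as quadrupled ink.

```python
# Why pyramid (triangle) encodings mislead: area grows as the square of height.

def triangle_area(height, aspect=1.0):
    base = aspect * height  # similar triangles: base scales with height
    return 0.5 * base * height

height_ratio = 2.0
area_ratio = triangle_area(2.0) / triangle_area(1.0)
# a 2x difference in the data is drawn as a 4x difference in ink
```

This is the standard argument against triangle and pyramid encodings generally, not something specific to the Bloomberg chart.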
Junk Charts's Trifecta checkup
The juxtaposition of the two candidates' positions on two entirely different issues does not yield much insight. One is an economic issue; the other is military in nature. Is this a commentary on the general credibility of the candidates? Or their credibility on specific issues? Or the investors' attitudes toward the issues? Once the pertinent question is clarified, the journalist needs to find the right data to address it. More posts about the Trifecta checkup here.
Minimum Reporting Requirements for polls
Any pollster who doesn't report the sample size and/or the margin of error is not to be taken seriously. In addition, we should want to know how the sample was selected. What is meant by "global investors"? Did the journalist randomly sample some investors? Or did investors happen to fill out a survey that was served up somewhere?
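To see why the sample size matters, here is the textbook margin-of-error calculation for a proportion, assuming a simple random sample and the worst case p = 0.5 (the survey's actual design is unknown, which is exactly the complaint):

```python
# 95% margin of error for a sampled proportion, worst case p = 0.5.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# A sample of 1,000 gives roughly +/- 3 percentage points. Without knowing n,
# a reader cannot judge whether a 52/48 split is signal or noise.
```

Quadrupling the sample only halves the margin of error, which is why the sample size has to be disclosed rather than guessed at.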
The following bar charts, while not innovative, speak louder.
I enjoy looking at the New York Times' summation of National Convention speeches via visualization. (link)
It's a disguised word cloud combined with a bubble chart with a little bar chart thrown in for good measure.
The size of the bubble is the total number of mentions of a particular word or phrase. So the bubbles tell us the importance of specific concepts, aggregated across the two parties.
It's the split within each bubble that represents the relative emphasis by party. Helpfully, the bubbles are sorted from left to right with the most Democratic words on the left. This splitting uses a bar chart paradigm. The diameter of the bubble is being partitioned, not the areas of the segments.
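The encoding as described can be sketched in a few lines. This is my reading of the chart, with illustrative numbers: area encodes total mentions (so radius goes as the square root of the count), and the party split partitions the diameter, not the area.

```python
# Sketch of the NYT bubble encoding as I read it: area ~ total mentions,
# party split partitions the diameter. Numbers are illustrative.
import math

def bubble(total_mentions, dem_share, scale=1.0):
    radius = scale * math.sqrt(total_mentions)
    split = -radius + 2 * radius * dem_share  # divider position along the diameter
    return radius, split

r, x = bubble(400, 0.75)
# r == 20.0; the divider sits at x == 10.0, three-quarters of the way across
```

Note the mismatch this creates: a divider three-quarters of the way along the diameter does not enclose three-quarters of the circle's area, which is why reading the split as an area proportion would mislead.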
I wanted to see this as a straight-out word cloud. In the following, I use the red-blue-purple color gradient to indicate the Republican-Democratic bias, and the size of the words to indicate the number of mentions.
This word cloud is created using the Wordle tool, advanced options. My colleague John helped me pick the colors. (By the way, I don't like the insertion of small words within large letters, like what happened here inside the O in Obama.)
Also, I'd line the colors up so that the red words are on one side, blue on the other and purple in the middle. I'd need a different tool to be able to exercise this type of control.
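The red-purple-blue mapping itself is simple to express in code. A minimal sketch: linearly interpolate between a Republican red and a Democratic blue according to each word's Democratic share of mentions. The endpoint colors here are stand-ins I chose, not Wordle's actual palette.

```python
# Map a word's Democratic share of mentions to a red-purple-blue gradient.
# Endpoint colors are illustrative choices, not the palette used in the post.

def party_color(dem_share):
    """dem_share in [0, 1]: 0 -> red, 0.5 -> purplish, 1 -> blue."""
    red, blue = (214, 39, 40), (31, 119, 180)
    rgb = tuple(round(r + (b - r) * dem_share) for r, b in zip(red, blue))
    return "#%02x%02x%02x" % rgb
```

Sorting the words by `dem_share` before layout would also give the left-red, right-blue arrangement I wished for, if the tool allowed positional control.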
The article talks about the effect of early voting during Presidential elections in the States. People are allowed to mail in their votes as early as 2 months before the November 6 election.
The chart on the right identifies all the states that allow early voting, and in particular, it highlights (in orange) the seven battleground states that allow early voting. This shows that the designer is keenly aware of what's important and what's not on the chart. The states are ordered by the first date of voting, rather than alphabetically. (I do have a question about why several of the gray lines toward the bottom of the chart do not reach November 6. Probably because mail-in voting closes prior to Election Day in some states...)
If the data were available, a nice addition to this chart would be the distribution of early votes over time. It would be useful to see whether North Carolina voters spread their mail-in votes evenly over the two-month period, whether most ballots are sent close to Election Day, or some other pattern. Changing the bar chart to a dot plot and using the density of dots to indicate frequency would work fine here.
Instead of the first date of voting, the chart would be more informative if it plots the average date of voting (among mail-in voters). This is because the first date of voting is an extreme value and there may be few voters who vote on that day. If we have to pick one number to represent all early voters, we should pick the one with the average (or median) voting time. Again, this is constrained by whether such data is publicly released.
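The suggested summary statistic is easy to compute if ballot-level dates were released. A sketch with invented data, showing why the first legal date can be a poor representative when most ballots arrive late:

```python
# Average and median voting date among early voters, vs. the first legal date.
# All dates and counts below are invented for illustration.
from datetime import date, timedelta
from statistics import median

votes = ([date(2012, 10, 20)] * 2 + [date(2012, 10, 30)] * 5
         + [date(2012, 11, 3)] * 13)
origin = date(2012, 9, 7)  # hypothetical first day of early voting

offsets = [(v - origin).days for v in votes]
mean_date = origin + timedelta(days=round(sum(offsets) / len(offsets)))
median_date = origin + timedelta(days=median(offsets))
# In this made-up example the median early voter mails in on Nov 3,
# nearly two months after the first legal date of Sep 7.
```

Plotting the mean (or median) date per state would change several states' positions considerably relative to an ordering by first date.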
The chart on the left is also well executed. The title should include the additional fact that only battleground states are depicted. I'd also extend the vertical axis to 100% since the data are proportions. The beauty of this presentation is that it functions on several levels, whether you are interested in knowing that not much changed in Iowa from 2004 to 2008, or the fact that almost 8 of 10 mail-in votes in Colorado were early votes, or that in both Colorado and North Carolina, the proportion of mail-in votes more than doubled between 2004 and 2008.
Neither of these charts is fancy, but they pack quite a bit of useful information.