In case you don't see my other blog, the most recent post should have been posted here: The unkind fate of data graphics in the media.
Making data graphics interactive should improve the user experience. In practice, interactivity too often becomes overhead, making it harder for users to understand the data on the graph.
Reader Joe D. (via Twitter) admires the statistical sophistication behind this graphic about home runs in Major League Baseball. This graphic does present interesting analyses, as opposed to acting as a container for data.
For example, one can compare the angle and distance of the home runs hit by different players:
One can observe patterns as most of these highlighted players have more home runs on the left side than the right side. However, for this chart to be more telling, additional information should be provided. Knowing whether the hitter is left- or right-handed or a switch hitter would be key to understanding the angles. Also, information about the home ballpark, and indeed differentiating between home and away home runs, are also critical to making sense of this data. (One strange feature of baseball fields is that they all have different dimensions and shapes.)
But back to my point about interactivity. The original chart does not present the data in small multiples. Instead, the user must "interact" with the chart by clicking successively on each player (listed above the graphic).
Given that the graphic only shows one player at a time, the user must use his or her memory to make the comparison between one player and the next.
The chosen visual form discourages readers from making such comparisons, which defeats one of the primary goals of the chart.
It's a nice small-multiples setup with two tabs, one showing the states in order of descending spend and the other, alphabetical.
In the article itself, they excerpt the top of the chart containing the states that have suspiciously high per-patient spend.
Several types of comparisons are facilitated: comparison over time within each state, comparison of each state against the national average, comparison of trend across states, and comparison of state to state given the year.
The first comparison is simple as it happens inside each chart component.
The second type of comparison is enabled by the orange line being replicated on every component. (I'd have removed the columns from the first component as it is both redundant and potentially confusing, although I suspect that the designer may need it for technical reasons.)
The third type of comparison is also relatively easy. Just look at the shape of the columns from one component to the next.
The fourth type of comparison is where the challenge lies for any small-multiples construction. This is also a secret of this chart. If you mouse over any year on any component, every component now highlights that particular year's data so that one can easily make state by state comparisons. Like this for 2008:
You see that every chart now shows 2008 on the horizontal axis and the data label is the amount for 2008. The respective columns are given a different color. Of course, if this is the most important comparison, then the dimensions should be switched around so that this particular set of comparisons occurs within a chart component--but obviously, this is a minor comparison so it gets minor billing.
I love to see this type of thoughtfulness! This is an example of using interactivity in a smart way, to enhance the user experience.
The Boston subway charts I featured before also introduce interactivity in a smart way. Make sure you read that post.
Also, I have a few comments about the data analysis on the sister blog.
Reader Joe D. tipped me about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).
The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:
I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I use the word correlated because the distances between stations are not constant as portrayed in this chart). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:
Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.
Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I have clicked on before.
This is a master class in linking multiple charts and using interactivity wisely.
You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that project designers have blessed with the top position of their Github page.
However, the image above allows us to see why the Marey diagram is the far superior representation of the data.
What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)
Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.
It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.
Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.
On that last question, the project designers offer up an alternative Marey. Think of this as an indiced view. Each trip is indiced to its starting point. The following setting shows the morning rush hour compared to the rest of the day:
I think they can utilize this display better if they did not show every single schedule but show the hourly average. Instead of letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which according to the text, are the morning rush, afternoon rush, midday lull and evening lull.
The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.
There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.
Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.
Darin Myers at PGi was kind enough to send over an analysis of a chart using the Trifecta Checkup framework. I'm reproducing the critique in full, with a comment at the end.
At first glance this looks like a valid question, with good data, presented poorly (Type V). Checking the fine print (glad it’s included), the data falls apart.
It’s a good question…What device are we using the most? With so much digital entertainment being published every day, it pays to know what your audience is using to access your content. The problem is this data doesn’t really answer that question conclusively.
This was based on Survey data asking respondents “Roughly how long did you spend yesterday…watching television (not online) / using the internet on a laptop or PC / on a smartphone / on a tablet? Survey respondents were limited to those who owned or had access to a TV and a smartphone and/or tablet.
In fact the Council for Research Excellence found that self-reported screen time does not correlate with actual screen time. “Some media tend to be over-reported whereas others tend to be under-reported – sometimes to an alarming extent.” -Mike Bloxham, director of insight and research for Ball State
The visual has the usual problems with stacked bar charts where it is easy to see the first bar and the total, but not to judge the other values. This may not be an issue based on the question, but the presentation is focusing on an individual piece of tech (smartphones), so the design should focus on smartphones. At the very least, smartphones should be the first column in the chart and it should be sorted by smartphone usage.
My implementation is simply to compare the smartphone usage to the usage of the next highest device. Overall 53% of the time people are using a smartphone compared to something else. I went back and forth on whether I should keep the Tablet category in the Key though it was not the first or second used device. In the end, I decided to keep it to parallel the source visual.
Despite the data problems, I was really interested in seeing the breakdowns in each country by device, so I built the chart below with rank added (in bold). I also built some simple interaction to sort by column when you click the header [Ed: I did not attach the interactive excel sheet that came with the submission]. As a final touch, I displayed the color corresponding to the highest usage as a box to the left of the country name. It’s easy to see that the vast majority of countries use smartphones the most.
Hope you enjoyed Darin's analysis and revamp of the chart. The diagnosis is spot on. I like the second revision of the chart, especially for analysts who really want to know the exact numbers. The first redo has the benefit of greater simplicity--it can be a tough sell to an audience, especially when using color to indicate the second most popular device while disassociating the color and the length of the bar.
The biggest problem in the original treatment is the misalignment of the data with the question being asked. In addition to the points made by Darin, the glaring issue relates to the responder population. The analysis only includes people who have at least a smartphone or a tablet. But many people in lesser developed countries do not have either device. In those countries, it is likely that the TV screen time has been strongly underestimated. People who watch TV but do not own a smartphone or tablet are simply dropped from consideration.
For this same reason, the other footnoted comment claiming that the sampling frame accounts for ~70 percent of the global population is an irrelevance.
Josh Katz, who did the dialect maps I featured recently, is at it again. He's one of the co-authors of a series of maps (link) published by the New York Times about the fan territorities of major league baseball teams.
Similar to the dialect maps, these are very pleasing to look at, and also statistically interesting. The authors correctly point out that the primary points of interest are at the boundaries, and provide fourteen insets on particular regions. This small gesture represents a major shift from years past, when designers would have just printed an interactive map, letting readers figure out where the interesting stuff is.
The other interesting areas are the "no-man lands", the areas in which there are no local teams. The map uses the same kind of spatial averaging technology that blends the colors. The challenge here would be the larger number of colors.
I'd have preferred that they have given distinct colors to the teams like the Yankees and the Red Sox that have broader appeal. Maybe the Yankees is the only national team they discovered, since it does have the unique gray color which is very subtle.
I also think it is smart to hide the political boundaries of state, zip, etc. in the maps (unless you click on them).
I'd like to see a separate series of maps: small multiples by team, showing the geographical extent of each team. This is a solution to the domination issue to be addressed below.
The issue of co-dominant groups I discussed in the dialect maps also shows up here. Notably, in New York, the Mets are invisible, and in the Bay Area, the Oakland As similarly do not appear on the map.
Recall that the each zip code is represented by the team with the highest absolute proportion of fans. It may be true that the Mets are perennial #2 in all relevant zip codes. Zooming into the Yankee territory, I didn't see any zip code in which Mets fans are more numerous. So this may be the perfect example of what falls through the cracks when the algorithm just drops everything but the top level.
Now, in the Trifecta checkup, we want to understand what the data is saying. I have to say this is a bit challenging. The core dataset contains Facebook Likes (aggregated to the zip-code level). It is not even clear what the base of those proportions are. Is it the total population in a zip code? the total Facebook users? the total potential baseball fans?
As I have said elsewhere, Facebook data is often taken to be "N=All". This is an assumption, not a fact of the data. Different baseball teams may have different social-media/Facebook strategies. Different teams may have different types of fans, who are more/less likely to be on Facebook. This is particularly true of cross-town rivals.
Apart from the obvious problem with brands buying or otherwise managing Likes, "Like" is a binary metric that doesn't measure fan fervor. It is a static measure as I don't believe Facebook users manage their list of Likes actively (please correct me if I am wrong about this behavior.)
We are not provided any real numbers, and none of the maps have scales. Unless we see some absolute counts, it is hard to know if the data make sense relative to other measures of fandom, like merchandise and ticket sales. With Facebook data, it is sometimes possible to have too much--in other words, you might find there are more team fans than potential baseball fans or even population in a specific zip code.
It is very likely that Facebook, which is the source of the aggregated data, did not want to have raw counts published. This is par for the course for the Internet giants, and also something I find completely baffling. Here are the evangelizers of privacy is dead, and they stockpile our data, and yet they lock the data up in their data centers, away from our reach. Does that make any sense?
Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:
These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)
The entire set of maps can be found here.
What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!
Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:
And this is the "caramel" question:
The set of maps referred to in the 2009 post can be found here.
Now, the maps on the left is more truthful to the data (at the zip code level) while Katz applies smoothing liberally to achieve the pleasing effect.
Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lost the granularity of how strongly the responses are biased toward that answer. This may be the reason why in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.
Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.
Reading Alberto Cairo’s fabulous book, The Functional Art, feels like reading my own work. It’s staggering how closely aligned our sensibilities are, notwithstanding our disparate backgrounds, he a data journlist by training, and I a statistician. We probably can finish each other’s sentences—and did at this recent Analytically Speaking webcast (link to clip).
Cairo currently teaches data visualization at the University of Miami; this is after a distinguished career as a data/visual journalist, having won many awards.
The Functional Art is divided into halves, which can be read independently.
The front part is a terrific overview of data visualization concepts. Cairo’s interest is in principles, rather than recipes. The field of data visualization has developed separately under three academic disciplines: design, computer science, and statistics. Inevitably, the work products contain contradictions and much re-invention. Cairo achieves a synthesis of these schools of thought, and this book is the clarion call for more work on unifying the key intellectual threads of the field.
The second half contains a series of interviews with industry luminaries. This section is a unique contribution to the literature, glancing at behind-the-scenes of the craft. Practitioners will find these short pieces illuminating and profitable. It is often a long journey to arrive at the graphic in print. The selection of designers emphasizes mainstream media outlets although the interviewees have wide-ranging views.
Included in these pages are plenty of published data graphics, frequently work that Cairo produced while working for the Brazilian publication, Epoca. These graphics are elaborate and ambitious, and nicely reproduced in color images. They reward detailed study, with attention to composition, narrative structure, chart types, selection of statistics, etc.
There are plenty of books on the market about how to do graphics (Dona Wong, Naomi Robbins, Nathan Yau come to mind.) Cairo’s book is not about doing, but about thinking about charts. Trust me, time spent thinking about charts will make your charts much improved.
I will now describe some sections of the book that particularly hold my interest:
In Chapter 3, Cairo explains the “visualization wheel,” a nice way to visualize the decisions that designers make when creating charts. Each decision is presented as a trade-off between two extremes. For example, a chart can be “light” or “dense.” This axis evokes Tufte’s data-ink ratio. Devices such as this wheel are useful for integrating the diverse viewpoints that coexist in our field. Frequently, these trade-off decisions are made implicitly—but they can really benefit from explicit consideration.
Figure 4.11 is one of the Epoca charts narrating a Brazilian election. Just recently, I linked to Cairo’s blog post about a similar chart. In both, a spider (radar) plot features prominently. On the same chart, you’ll find a nice demonstration of the small-multiples principle. I applaud the publisher of Epoca for supporting such deep data graphics.
Chapter 8 is invaluable in documenting the chart-making process. Trial and error is a key element of this process. Here, Cairo shows some of the earlier drafts of projects that eventually went to publication. This material is similar to what Kevin Quealy shows at his ChartNThings blog about New York Times graphics.
Chapter 9 is one of the more mature discussions of interactive graphics I have seen. Too often, interactivity is reduced to a feature that is layered onto any dataset. It should rightfully be seen as a problem of design.
Figure 10.1 is not strictly speaking a “data” graphic but I love John Grimwade’s visual explanation of the “transatlantic superhighway”.
Cairo also writes a blog.
One of the dangers of "Big Data" is the temptation to get lost in the details. You become so absorbed in the peeling of the onion that you don't realize your tear glands have dried up.
Hans Rosling linked to a visualization of tobacco use around the world from Twitter (link to original). The setup is quite nice for exploration. I'd call this a "tool" rather than a visual.
I appreciate the designer's concept -- the typical visualization of this type of data is looking at relative rates, which obscures the fact that China and India have far and away the most smokers even if their rates are middling (24% and 13% respectively).
This circular chart is supposed to show the absolute distribution of smokers across so-called "super-regions" of the world.
Unfortunately, the designer decided to pile on additional details. The concentric circles present a geography lesson, in effect. For example, high-income super-region is composed of high-income North America, Western Europe, high-income Asia Pacific, etc. and then high-income North America is composed of USA, Canada, etc.
Notice something odd? The further out you go, the larger the circular segments but the smaller the amount of people they represent! There are more people in the super-region of high-income worldwide than in high-income North America and in turn, there are more people in the high-income North American region than in USA. But the size of the graphical elements is reversed.
In principle, the "bumps"-like chart used to show the evolution of tobacco prevalence in individual countries make for a nice visual. In fact, Rosling marvelled that the global rate of consumption has fallen in recent years.
However, I'm often irritated when the designer pays no attention to what not to show. There are probably well above 200 lines densely packed into this chart. It is almost for sure that over-plotting will cause some of these lines to literally never see the light of day. Try hovering over these lines and see for yourself.
The same chart with say 10 judiciously chosen lines (countries or regions) provides the reader with a lot more profit.
The discerning reader figures out that the best visual actually does not even show up on the dashboard. Go ahead, and click on the tab called "Data" on top of the page. You now see a presentation of each country's "data" by age group and by gender. This is where you can really come up with stories for what is going on in different countries.
For example, the British have really done extremly well in reducing tobacco use. Look at how steep the declines are across the board for British men (in most parts of the world, the prevalence of smoking is much higher among men than women.)
Bulgaria on the other hand shows a rather odd pattern. It is one of the few countries in the bumps chart that showed a climb in smoking rates, at least in the early 2000s. Here the data for men is broken down into age groups.
This chart exposes a weakness of the underlying data. The error bars indicate to us that what is being plotted is not actual data but modeled data. The error bars here are enormous. With the average at about 40% to 50% for many age groups, the confidence interval is also 40% wide. Further, note that there were only three or four observations (purple dots) and curves are being fitted to these three or four dots, plus extrapolation outside the window of observation. The end result is that the apparent uplift in smoking in the early 2000s is probably a figment of the modeler's imagination. You'd want to understand if there are changes in methodologies around that time.
As a responsible designer of data graphics, you should focus less on comprehensiveness and focus more on highlighting the good data. I'm a firm believer of "no data is better than bad data".
There is no denying that this chart is beautiful to look at. But what is its message? That there are people migrating from and to every state? (assuming all fifty states are present)
Daily Kos describes how one can hover over any state to see its individual patterns. Something like this:
This is a great way, perhaps the only way, to consume the chart. Essentially, the reader is asked to generate a small-multiples panel of charts. The chart does a better job at showing the pairs of states between which people migrate than at showing the relative size of the flows. The size of the flows is coded in the width of the arcs. The widths are too similar to tell apart; and it doesn't help that no legend is provided.
The choice of color is curious. Each region of the country is its own color, in a "nominal" way. It is a design decision to emphasize regions.
Another decision is to hide information on the distances of the migrations. Evidently, the designer sacrificed that information in order to create the neat circular arrangement of states.
A shortcoming of this representation is one missing dimension: the direction of the flow. I'm not sure given any pair of states A and B, whether the net migration is into A or into B.
On this map, when you hover over a particular state, it highlights all other states for which there are migrations flows into or out of that state. For color, use a blue-white-red scheme with blue indicating net inflow, red indicating net outflow, and white for near-zero flows. Include a legend.
Another important decision for the designer is absolute versus relative scales. In an absolute scheme, you rank the entire set of flows for all pairs of states; obviously, the resulting colors would be influenced by the state populations. Alternatively, you rank the flow sizes within each state; in this case, the smaller states will feel exaggerated.
The map has the additional advantage of showing the approximate distance (and direction) moved, which, for me, is a useful piece of information.