On Twitter, Joe D. disliked the following chart on the Information is Beautiful blog:
The chart carries a long list of flaws.
The column labeled "%" is probably the most jarring. The meaning of these numbers changes with the color. When pink, they give the proportion of females; when blue, the proportion of males. As the stated purpose of the chart is to explore the male-female balance at different websites, it is a bad decision to fold two dimensions into one. While you're thinking about what I just said, what do you think the percentages in gray mean? Your guess is as good as mine.
Now, I appreciate that the designer implicitly allowed for a margin of error, and separated these three sites as representing "equality", even though only one of them has an exact 50/50 split.
Wait, for Orkut (second row), it's 51 percent female, and for Foursquare, it's 52 percent male. The gender is coded in the figurines. You can check that with your magnifying glass.
It gets better.
The list of websites is ordered by increasing polarity but only within the three sections. Logically, the three "equality" sites should sit between the "matriarchy" and the "patriarchy". Pinterest and Reddit, the two most polarized sites, should stand on the edges. On the diagram shown right, I simulated a reader who wants to scan through the list of websites from the most female-oriented (Pinterest) to the most male-oriented (Reddit). It's quite the obstacle course.
Let's get to Joe D.'s issue with the chart. How many people does each figurine represent? It's quite a mouthful. Each figurine represents one percent of the unique visitors at the specific website, but only those in excess of fifty percent. In effect, a Facebook figurine represents a huge number of people compared to a figurine for a less popular website like Tagged. The designer did not explain the inclusion criteria for websites.
If you didn't get that definition, just ignore the figurines and think of this chart as a bar chart in which the bars start at 50 percent (rather than zero as it should). A standard population pyramid appears to do a better job - just add bars to the left of the diagram and properly align the male and female sections.
As I said before, read the fine print.
Here's the fine print:
If I am not mistaken, the designer applied the gender proportions to the traffic totals to obtain the rightmost column, labeled "million more monthly female or male visitors". The trouble is one number pertains to U.S. visitors while the other pertains to worldwide traffic. By multiplying them, the designer makes an assumption: that gender ratio is equivalent inside and outside the U.S., for every website.
Just to give you a sense of scale, according to this chart, Facebook has an excess of 155 million female visitors per month. According to Comscore, the key provider of such data, Facebook had about 145 million total U.S. visitors in June 2013. It's not a small deal to mix up the geographies.
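To make the arithmetic concrete, here is a minimal Python sketch of the calculation I suspect the designer performed: apply a gender share to a traffic total and report the surplus of the majority gender. The function names and all input numbers below are hypothetical, chosen only for illustration; they are not taken from the chart or from Comscore.

```python
def excess_visitors(total_millions, majority_share):
    """Surplus of the majority gender over the minority, in millions.

    With a majority share p, the surplus is total * (p - (1 - p)).
    """
    return total_millions * (2 * majority_share - 1)


def blended_excess(us_total, us_share, other_total, other_share):
    """Same surplus, but allowing a different share outside the U.S."""
    return (excess_visitors(us_total, us_share)
            + excess_visitors(other_total, other_share))


# Hypothetical inputs: 1,000 million worldwide visitors, 58% female in the U.S.
naive = excess_visitors(1000, 0.58)

# Hypothetical alternative: 150 million U.S. visitors at 58% female,
# 850 million elsewhere at 50% female.
split = blended_excess(150, 0.58, 850, 0.50)

print(f"U.S. share applied worldwide: {naive:.0f} million")
print(f"Separate shares by geography: {split:.0f} million")
```

Under these made-up inputs, the answer swings from roughly 160 million to roughly 24 million, which is why the assumption of identical gender ratios everywhere matters so much.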
This example illustrates what I call "use at your own peril". It's like the Surgeon General's warning posted in restaurants in the U.S.: we warn you that drinking alcohol while pregnant could lead to birth defects, but you are free to do whatever you want with this information.
As of this writing, the original chart has thousands of Facebook likes, hundreds of shares on LinkedIn and Pinterest, etc.
It appears that a lot of people are enjoying the chart more than Joe and I do.
Finally, here is a sketch of how I would plot this type of data. (U.S. traffic data from Comscore, various months of 2012, where I can find them. Comscore is a fee-based service so it is not easy to find data for the smaller sites unless you have a subscription.)
One piece of advice I give for those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.
Here are the highlights of his piece.
He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.
Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.
The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.
Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).
This version is considerably cleaner than the original.
I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.
Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.
Robert Kosara takes us back to the 1940s, and an incredible "infographics" project by the Lawrence Livermore Laboratory. (link) Here is one of the designs:
When did information graphics turn into ‘infographics,’ and when did we lose the meticulous, well-researched, information-rich graphics for the sad waste of pixels that calls itself infographic today?
I think one of the key missing pieces is analytics. Most of today's infographics seemingly are a result of treating data as flowers to be arranged. There is little analytical thinking behind what the data mean. Incidentally, that is why the new NYU certificate is not called Certificate in Data Visualization--we wanted to emphasize the importance of analytics next to datavis.
Also, we have an elective designed for people interested in content marketing. The Livermore Lab project would fall into this category. So do annual reports for corporations, fundraising prospectuses for non-profit organizations, magazines whether commercial or membership, content for web marketing, etc.
***
The other problem is a kind of perversion of measurement. Because so much of this stuff is online, so many pieces are judged by click rates or bounce rates or time on page. The problem with click rates is well known. Headlines of so many online articles are written solely to create clicks. It's gotten to the point that we feel duped by the headlines.
The design may have originated in print, but in all likelihood, it is also uploaded to the Web; the interaction of readers with the online version is much easier to track than the effect of print, leading to the lazy generalization that the Web response would be "similar to" the print response. This is one of my pet peeves: bad data is worse than no data.
Rick (via Twitter) tells me he is baffled by this chart that showed up in Financial Review:
I'm baffled as well. What might the designer have in mind?
Based on the cues such as length of the curves, one would expect the US, Singapore, Japan, etc. to be leaders and India and China to be laggards. But what is being plotted on the vertical axis? It's not explained.
The title of the chart seems to indicate there is a time dimension but it's not on the horizontal axis where you'd expect it. The vertical axis does not appear to be time either, as it runs negative. The length of the lines could encode time but it is counterintuitive since China's line should then be much longer than that of the U.S., given its history.
Finally, how does one explain the placement of the callout box noting China's GDP per capita? It literally points to nowhere.
One of my summer projects is to develop the curriculum for a new Certificate in Analytics and Data Visualization, offered at NYU (link). (If you are interested in teaching these courses, please contact me.) The program aims to give students a balanced training, covering datavis from the perspectives of statistics, graphical design and computer science.
Nathan Yau's new book, Data Points, landed on my desk at just the right time. It is a nice overview of the subject of data visualization, and it can serve nicely in our introductory course. The book sits closer to the statistical and design perspectives. Instructors will need to supplement the computer science topics such as interactivity, networks, and online graphics. It is of course difficult to teach interactive graphics from a static textbook. (Yau's previous book, Visualize This, has detailed tutorials of most of these techniques. My issue with that book is that it tries to be too many things at once.)
Data Points is a concepts and examples book. It's not a how-to book. There are figures on almost every page, and unlike Visualize This, most figures are actual published data visualization projects.
Just for fun, I classified the figures and plotted the result. (Some purely instructive figures are skipped.)
Running from left to right is the order of appearance of the chart within the book. I classified a total of 135 charts. For each chart, I considered whether one or more of 12 adjectives apply. I labeled about 40 charts "useful", "banal", "silly", and/or "engaging".
You can see from this graph that I enjoy the charts in the initial chapters. Up till chart number 50 or so, I find few "banal" charts, and many "engaging" or "amusing" or "artistic" charts. In the second part of the book, there are not many "surprising" or "amusing" charts.
As for "silly" and "baffling" charts, they appear at an even clip throughout. But that represents just my own bias. I also find "useful" charts throughout the book.
PS. I received a review copy of Data Points. Nathan's blog is Flowing Data.
On Twitter, Andy C. (@AnkoNako) asked me to look at this pretty creation at NFL.com (link).
There is a reason why you don't read much about spider charts (web charts, radar charts, etc.) here. While this chart is beautifully constructed, and fun to play with, it just doesn't work as a vehicle for communication.
The example above allows us to compare four players (here, quarterbacks) on eight metrics. Each white polygon represents one player, and the orange outline represents the league average quarterback.
What are some of the questions one might have about comparing quarterbacks?
Who is the best quarterback, and who is the worst?
Who is the better passer? (ignoring other skills, like rushing ability)
Is each quarterback better or worse than the average quarterback?
How will you figure these out from the spider chart?
Not sure. The relative value of the quarterbacks is definitely not encoded in the shape of the polygon, nor the area. To really figure this out, you'd need to look at each of the eight spokes independently, and then aggregate the comparisons in your head. Unless... you are willing to ignore seven of the eight metrics, and just look at passer rating (below right).
Focusing on passing only means focusing on five of the eight metrics, from pass attempts to interceptions. How you combine five metrics into one evaluation is anyone's guess.
One can tell that Joe Flacco is basically the average quarterback as his contour is almost exactly that of the average (orange outline). Are the others better or worse than average? Hard to tell at first glance.
First, the chart invites users to place equal emphasis on each of the eight dimensions. (There is a control to remove dimensions.) But the metrics are clearly not equally important. You certainly should value passing yards more than rushing yards, for example.
Second, the chart ignores the correlation between these eight metrics. The easiest way to see this is the "Passer Rating", which is a formula comprising the Passing Attempts, Passing Completions, Interceptions, Touchdown Passes, and Passing Yards. Yes, all those five components have been separately plotted. Another easy way to see the problem is that Passing Yards are highly correlated with Passing Attempts or Passing Completions.
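To see how tightly Passer Rating is tied to the other five spokes, here is a minimal Python sketch of the NFL passer rating formula as I understand it. The function name and argument names are my own, and I am assuming the chart's "Passer Rating" follows this standard formula; the site may compute or round it differently.

```python
def passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """NFL-style passer rating built from the five passing metrics.

    Each of the four components is clamped to the range [0, 2.375],
    so the rating runs from 0 to about 158.3.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    def clamp(x):
        return max(0.0, min(x, 2.375))

    a = clamp((completions / attempts - 0.3) * 5)       # completion rate
    b = clamp((yards / attempts - 3) * 0.25)            # yards per attempt
    c = clamp((touchdowns / attempts) * 20)             # touchdown rate
    d = clamp(2.375 - (interceptions / attempts) * 25)  # interception rate
    return (a + b + c + d) / 6 * 100


# Hypothetical stat line, for illustration only.
print(round(passer_rating(20, 12, 140, 1, 1), 1))
```

Every input to the rating is already drawn as its own spoke, so the rating spoke adds no independent information to the polygon.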
Third, the chart fails to account for different types of quarterbacks. I deliberately chose these four because Joe Flacco was a starter, Tyrod Taylor was a backup who almost never played, while at San Francisco, Alex Smith and Colin Kaepernick shared the starting duties. So for Passing Yards, the numbers were 3817, 179, 1737 and 1814 respectively. Those numbers should not be directly compared. Better statistics are something like yards per minute played, yards per offensive series, yards per play executed, etc. The way that this data is used here, all the second- and third-string quarterbacks will be below average and most of the starters will be above average.
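As a rough illustration of what a rate statistic does, here is a small Python sketch that converts the passing yards quoted above into yards per attempt. The yardage totals come from this post; the attempt counts are hypothetical round numbers that I made up to show the mechanics, not real statistics, and the function name is my own.

```python
def yards_per_attempt(yards, attempts):
    """Passing yards divided by pass attempts; None if there were no attempts."""
    if attempts == 0:
        return None
    return yards / attempts


# Passing yards as quoted in the post.
passing_yards = {
    "Joe Flacco": 3817,
    "Tyrod Taylor": 179,
    "Alex Smith": 1737,
    "Colin Kaepernick": 1814,
}

# HYPOTHETICAL attempt counts, for illustration only (not real data).
attempts = {
    "Joe Flacco": 500,
    "Tyrod Taylor": 30,
    "Alex Smith": 220,
    "Colin Kaepernick": 220,
}

for player, yards in passing_yards.items():
    rate = yards_per_attempt(yards, attempts[player])
    print(f"{player}: {yards} yards, {rate:.1f} yards per attempt")
```

Once the totals are divided by a measure of opportunity, a backup with very few attempts is no longer automatically placed at the bottom, although a rate based on so few plays is noisy.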
From a design perspective, there are a small number of misses.
Mysteriously, the legend always has only two colors no matter how many players are being compared. The orange is labeled Average while the white is labeled "Leader". I have no idea why any of the players should be considered the "Leader".
The only way to know which white polygon represents which player is to hover on the polygon itself. You'll notice that in my example, several of those polygons overlap substantially so sometimes, hovering is not a task easily accomplished.
The last issue is scale. Turns out that some of the metrics like interceptions, touchdown passes, rushing yards, etc. can be zeroes. Take a look at this subset of the chart where I hovered on Tyrod Taylor.
Do you see the problem? The zero point is definitely not the center of the circle. This problem exists for any circular chart, such as a bubble chart.
Now look at Interceptions. Because the scale is reversed (lower is better), the zero point of this metric will lie on the outer edge of the circle. This is a vexing issue because the radius is open-ended on the outside but closed-ended on the inside.
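A minimal Python sketch of one plausible radial scale makes the zero-point problem easier to see. I do not know how NFL.com actually maps values to radii; the linear mapping, the inner radius of 0.2, and the value range below are all assumptions for illustration.

```python
def radial_position(value, vmin, vmax, r_inner, r_outer, reverse=False):
    """Map a metric value linearly onto a spoke between r_inner and r_outer.

    With reverse=True (lower is better), small values land near the rim.
    """
    frac = (value - vmin) / (vmax - vmin)
    if reverse:
        frac = 1.0 - frac
    return r_inner + frac * (r_outer - r_inner)


# Assumed scale: values from 0 to 20, spoke drawn from radius 0.2 to 1.0.
zero_touchdowns = radial_position(0, 0, 20, 0.2, 1.0)
zero_interceptions = radial_position(0, 0, 20, 0.2, 1.0, reverse=True)

print(f"zero on a regular spoke:  radius {zero_touchdowns:.1f}")
print(f"zero on a reversed spoke: radius {zero_interceptions:.1f}")
```

Under these assumptions, a zero on a regular spoke sits at radius 0.2 rather than at the center, and a zero on a reversed spoke sits on the rim at radius 1.0, so two zeroes end up at opposite ends of their spokes.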
In the next post, I will discuss some alternative presentations of this data.