
Highlight the right elements of a chart

The big news in the tech world is Steve Ballmer's retirement announcement. Andrew Sullivan cites this chart by Derek Thompson as a reason for Ballmer's departure: (original article)


How about this version?



What makes this version better?

  • Having the Microsoft/Wintel area at the bottom means the boundary of the area traces its rise and fall
  • Choosing a heavy color for Microsoft/Wintel draws attention to the main stage
  • Focus numerical labels on the particular items that convey the story, i.e. the numbers highlighted at the top of the original chart in red
  • Subtle and sparse gridlines tied to the key message
  • Tilt labels to fit inside areas
  • Place data labels inside chart next to the highlighted features
  • Draw attention to the boundary of the Microsoft/Wintel area


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.


He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.



Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.


The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.


Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.


I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.


Kosara wants to rescue infographics

Robert Kosara takes us back to the 1940s, and an incredible "infographics" project by the Lawrence Livermore Laboratory. (link) Here is one of the designs:


Kosara laments:

When did information graphics turn into ‘infographics,’ and when did we lose the meticulous, well-researched, information-rich graphics for the sad waste of pixels that calls itself infographic today?

I think one of the key missing pieces is analytics. Most of today's infographics seemingly are a result of treating data as flowers to be arranged. There is little analytical thinking behind what the data mean. Incidentally, that is why the new NYU certificate is not called Certificate in Data Visualization--we wanted to emphasize the importance of analytics next to datavis.

Also, we have an elective designed for people interested in content marketing. The Livermore Lab project would fall into this category. So do annual reports for corporations, fundraising prospectuses for non-profit organizations, magazines whether commercial or membership, content for web marketing, etc.

The other problem is a kind of perversion of measurement. Because so much of this stuff is online, so many pieces are judged by click rates or bounce rates or time on page. The problem with click rates is well known. Headlines of so many online articles are written solely to create clicks. It's gotten to the point that we feel duped by the headlines.

The design may have originated in print, but in all likelihood, it is also uploaded to the Web; the interaction of readers with the online version is much easier to track than the effect of print, leading to the lazy generalization that the Web response would be "similar to" the print response. This is one of my pet peeves: bad data is worse than no data.


Light entertainment: Hidden time, and shifted label

Rick (via Twitter) tells me he is baffled by this chart that showed up in Financial Review:



I'm baffled as well. What might the designer have in mind?

Based on the cues such as length of the curves, one would expect the US, Singapore, Japan, etc. to be leaders and India and China to be laggards. But what is being plotted on the vertical axis? It's not explained.

The title of the chart seems to indicate there is a time dimension but it's not on the horizontal axis where you'd expect it. The vertical axis does not appear to be time either, as it runs negative. The length of the lines could encode time but it is counterintuitive since China's line should then be much longer than that of the U.S., given its history.

Finally, how does one explain the placement of the callout box, noting China's GDP per capita? It literally points to nowhere.


Book review: Data Points by Nathan Yau

One of my summer projects is to develop the curriculum for a new Certificate in Analytics and Data Visualization, offered at NYU (link). (If you are interested in teaching these courses, please contact me.) The program aims to give students a balanced training, covering datavis from the perspectives of statistics, graphical design and computer science.

Nathan Yau's new book, Data Points, landed on my desk at just the right time. It is a nice overview of the subject of data visualization, and it can serve nicely in our introductory course. The book sits closer to the statistical and design perspectives. Instructors will need to supplement the computer science topics such as interactivity, networks, and online graphics. It is of course difficult to teach interactive graphics from a static textbook. (Yau's previous book, Visualize This, has detailed tutorials of most of these techniques. My issue with that book is trying to be too many things at once.)

Data Points is a concepts and examples book. It's not a how-to book. There are figures on almost every page, and unlike Visualize This, most figures are actual published data visualization projects.

Just for fun, I classified the figures and plotted the result. (Some purely instructive figures are skipped.)


Running from left to right is the order of appearance of the chart within the book. I classified a total of 135 charts. For each chart, I considered whether one or more of 12 adjectives apply. I labeled about 40 charts "useful", "banal", "silly", and/or "engaging".

You can see from this graph that I enjoy the charts in the initial chapters. Up till chart number 50 or so, I find few "banal" charts, and many "engaging" or "amusing" or "artistic" charts. In the second part of the book, there are not many "surprising" or "amusing" charts.

As for "silly" and "baffling" charts, they appear at an even clip throughout. But that represents just my own bias. I also find "useful" charts throughout the book.


PS. I received a review copy of Data Points. Nathan's blog is Flowing Data.

Beautiful spider loses its way

On Twitter, Andy C. (@AnkoNako) asked me to look at this pretty creation at NFL.com (link).


There is a reason why you don't read much about spider charts (web charts, radar charts, etc.) here. While this chart is beautifully constructed, and fun to play with, it just doesn't work as a vehicle for communication.

This example above allows us to compare four players (here, quarterbacks) on eight metrics. Each white polygon represents one player, and the orange outline represents the league average quarterback. 

What are some of the questions one might have about comparing quarterbacks?

  • Who is the best quarterback, and who is the worst?
  • Who is the better passer? (ignoring other skills, like rushing ability)
  • Is each quarterback better or worse than the average quarterback?

How will you figure these out from the spider chart?

  • Not sure. The relative value of the quarterbacks is definitely not encoded in the shape of the polygon, nor the area. To really figure this out, you'd need to look at each of the eight spokes independently, and then aggregate the comparisons in your head. Unless... you are willing to ignore seven of the eight metrics, and just look at passer rating (below right).
  • Focusing on passing only means focusing on five of the eight metrics, from pass attempts to interceptions. How to combine five metrics into one evaluation is anyone's guess.
  • One can tell that Joe Flacco is basically the average quarterback as his contour is almost exactly that of the average (orange outline). Are the others better or worse than average? Hard to tell at first glance.


There are a number of statistical points worth noting.

First, the chart invites users to place equal emphasis on each of the eight dimensions. (There is a control to remove dimensions.) But the metrics are clearly not equally important. You certainly should value passing yards more than rushing yards, for example.

Second, the chart ignores the correlation between these eight metrics. The easiest way to see this is the "Passer Rating", which is a formula comprising the Passing Attempts, Passing Completions, Interceptions, Touchdown Passes, and Passing Yards. Yes, all those five components have been separately plotted. Another easy way to see the problem is that Passing Yards are highly correlated with Passing Attempts or Passing Completions.
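For context, the standard NFL passer rating formula (as commonly published) combines exactly these five passing statistics into a single number, which is why plotting all five components alongside the rating double-counts the information. A minimal sketch in Python, with illustrative inputs:

```python
def passer_rating(att, comp, yds, td, ints):
    """NFL passer rating: a fixed formula over five passing statistics.

    Each component is clamped to [0, 2.375]; a perfect rating is 158.3.
    """
    clamp = lambda x: max(0.0, min(x, 2.375))
    a = clamp((comp / att - 0.3) * 5)     # completion percentage
    b = clamp((yds / att - 3) * 0.25)     # yards per attempt
    c = clamp((td / att) * 20)            # touchdown rate
    d = clamp(2.375 - (ints / att) * 25)  # interception rate (reversed: fewer is better)
    return (a + b + c + d) / 6 * 100

# Illustrative (hypothetical) stat line: 10 attempts, 8 completions,
# 125 yards, 2 touchdowns, 0 interceptions -> the maximum, 158.3.
print(round(passer_rating(10, 8, 125, 2, 0), 1))
```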

Third, the chart fails to account for different types of quarterbacks. I deliberately chose these four because Joe Flacco was a starter, Tyrod Taylor was a backup who almost never played, while at San Francisco, Alex Smith and Colin Kaepernick shared the starting duties. So for Passing Yards, the numbers were 3817, 179, 1737 and 1814 respectively. Those numbers should not be directly compared. Better statistics are something like yards per minute played, yards per offensive series, yards per play executed, etc. The way the data is used here, all the second- and third-string quarterbacks will be below average and most of the starters will be above average.


From a design perspective, there are a small number of misses.

Mysteriously, the legend always has only two colors no matter how many players are being compared. The orange is labeled Average while the white is labeled "Leader". I have no idea why any of the players should be considered the "Leader".

The only way to know which white polygon represents which player is to hover over the polygon itself. You'll notice that in my example, several of those polygons overlap substantially, so hovering is not always an easy task.

The last issue is scale. Turns out that some of the metrics like interceptions, touchdown passes, rushing yards, etc. can be zeroes. Take a look at this subset of the chart where I hovered on Tyrod Taylor.

Do you see the problem? The zero point is definitely not the center of the circle. This problem exists for any circular chart, such as bubble charts.

Now look at Interceptions. Because the scale is reversed (lower is better), the zero point of this metric will lie on the outer edge of the circle. This is a vexing issue because the radius is open-ended on the outside but closed-ended on the inside.


In the next post, I will discuss some alternative presentations of this data.

Beautiful spider loses its way 2

A double post today.

In the previous post, I talked about NFL.com's visualization of football player statistics. In this post, I offer a few different views of the same data.


The first is a dot plot arranged in small multiples.


Notice that I have indexed every metric against the league average. This is shown in the first panel. I use a red dot to warn readers that the direction of this metric is opposite to the others (left of center is a good thing!)
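The indexing step is just division by the league average. A sketch with made-up numbers (the metric names and values here are hypothetical, not the NFL data):

```python
# Hypothetical league averages and one player's totals.
league_avg = {"passing_yards": 3000.0, "interceptions": 12.0}
player = {"passing_yards": 3817.0, "interceptions": 10.0}

# Index each metric to the league average: 1.0 means exactly average.
indexed = {m: player[m] / league_avg[m] for m in league_avg}

# For reversed metrics (lower is better), an index below 1.0 is good,
# hence the red dot warning readers about the flipped direction.
print(indexed)
```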

You can immediately make a bunch of observations:

  • Alex Smith was quite poor, except for interceptions.
  • Colin Kaepernick had passing statistics similar to Smith's. His only advantage over Smith was rushing.
  • Joe Flacco, as we noted before, is as average as it goes (except for rushing yards).
  • Tyrod Taylor is here to remind us that we have to be careful about backup players being included in the same analysis.


The second version is a heatmap.

This takes inspiration from the fact that any serious reader of the spider chart will be reading the eight spokes (dimensions) separately. Why not plot these neatly in columns and use color to help us find the best and worst?


Imagine this to be a large table with as many rows as there are quarterbacks. You will be able to locate the red (hot) zones quickly. You can also scan across a row to understand that player's performance relative to the average, on every metric.

I like this visualization best, primarily because it scales beautifully.


The final version is a profile chart, sometimes called a parallel coordinates plot. While I am an advocate of profile charts, they really only work when you have a small number of things to compare.





Various ways to show variability

Reader Doeke W. sends me to this chart.


I like many aspects of this exercise. This chart displays the results of an experiment conducted by a computer games company to show that the new build ("249") renders frames faster than the older build ("248"). The messages of the chart are clear: the 249 build (blue bars) is substantially faster (over 80% of the frames render in 7 milliseconds or less under 249, compared with less than 40% under 248), and, less obviously, the variance of frame times is also significantly smaller.

The slight problem is that readers probably have to read the text to grasp most of the above.


Using lines (or areas) improves the readability.


In the text, the author explains how to turn time per frame into frame per second, the more common way of measuring rendering speed. The formula is 1000 divided by time per frame. Wouldn't it be better if the chart plots fps directly?
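The conversion the author describes is just a reciprocal; a one-line sketch:

```python
def fps(ms_per_frame):
    """Convert render time per frame (in milliseconds) to frames per second."""
    return 1000.0 / ms_per_frame

# The article's thresholds of 7 ms and 10.5 ms per frame
# correspond to roughly 143 fps and 95 fps.
print(round(fps(7)), round(fps(10.5)))
```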


When it comes to presenting distributions (or variability), the cumulative chart is more useful but it also is harder for readers to comprehend. For example:


The beauty of this chart is that one can take any point on the vertical axis, say, the 80% level, and read off the comparative values of 7 milliseconds for the blue line (249) and 10.5 ms for the red (248). That means 80% of the 249 frames were rendered in 7 ms or less, compared with 10.5 ms for the 248 frames.

Alternatively, taking a point on the horizontal axis, say 5 milliseconds, one can see that about 8% of 248 frames met that threshold, compared with 30% of 249 frames.

The steeper the ascent of the S-curve, the more efficient is the rendering.
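The two read-offs described above (fix a percentile and read a time; fix a time and read a percentile) can be sketched against any list of per-frame render times. The data below is illustrative, not the company's benchmark:

```python
def fraction_under(times_ms, threshold_ms):
    """Share of frames rendered in threshold_ms or less (horizontal read-off)."""
    return sum(t <= threshold_ms for t in times_ms) / len(times_ms)

def time_at_percentile(times_ms, pct):
    """Render time below which a pct fraction of frames fall (vertical read-off)."""
    s = sorted(times_ms)
    # Simple index into the sorted times, no interpolation.
    k = max(0, int(pct * len(s)) - 1)
    return s[k]

# Illustrative data: ten frames with render times 1..10 ms.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(fraction_under(times, 5))        # half the frames render in 5 ms or less
print(time_at_percentile(times, 0.8))  # 80% of frames render in this time or less
```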

Stutter steps, and functional legends

Dona Wong asked me to comment on a project by the New York Fed visualizing funding and expenditure at NY and NJ schools. The link to the charts is here. You have to click through to see the animation.


Here are my comments:

  • I like the "Takeaways" section up front, which uses words to tell readers what to look for in the charts to follow.
  • I like the stutter steps that are inserted into the animation. This gives me time to process the data. The point of these dynamic maps is to showcase the changes in the data over time.
  • I really, really want to click on the green boxes (the legend) and have the corresponding school districts highlighted. In other words, turning the legend into something functional. Tool developers, please take notes!
  • The other options on the map are federal, state and local shares of funding, given in proportions. These are controlled by the three buttons above. This is a design decision that privileges showing how federal funds are distributed across districts and across time. The tradeoff is that it's harder to comprehend the mix of sources of funds within each district over time.
  • I usually like to flip back and forth between actual values and relative values. I find that both perspectives provide information. Here, I'd like to see dollars and proportions.

I also find the line charts to be much clearer but the maps are more engaging. Here is an example of the line chart: (the blue dashed line is the New York state average)


After looking at these charts, I also want to see a bivariate analysis. How are funding per student and expenditure per student related?

Do you have any feedback for Dona?