Robert Kosara takes us back to the 1940s, and an incredible "infographics" project by the Lawrence Livermore Laboratory. (link) Here is one of the designs:
When did information graphics turn into ‘infographics,’ and when did we lose the meticulous, well-researched, information-rich graphics for the sad waste of pixels that calls itself infographic today?
I think one of the key missing pieces is analytics. Most of today's infographics seem to be the result of treating data as flowers to be arranged. There is little analytical thinking behind what the data mean. Incidentally, that is why the new NYU certificate is not called Certificate in Data Visualization--we wanted to emphasize the importance of analytics next to datavis.
Also, we have an elective designed for people interested in content marketing. The Livermore Lab project would fall into this category. So do annual reports for corporations, fundraising prospectuses for non-profit organizations, magazines (whether commercial or membership-based), content for web marketing, etc.
*** The other problem is a kind of perversion of measurement. Because so much of this stuff is online, so many pieces are judged by click rates or bounce rates or time on page. The problem with click rates is well known. Headlines of so many online articles are written solely to create clicks. It's gotten to the point that we feel duped by the headlines.
The design may have originated in print, but in all likelihood, it is also uploaded to the Web; the interaction of readers with the online version is much easier to track than the effect of print, leading to the lazy generalization that the Web response would be "similar to" the print response. This is one of my pet peeves: bad data is worse than no data.
One of my summer projects is to develop the curriculum for a new Certificate in Analytics and Data Visualization, offered at NYU (link). (If you are interested in teaching these courses, please contact me.) The program aims to give students a balanced training, covering datavis from the perspectives of statistics, graphical design and computer science.
Nathan Yau's new book, Data Points, landed on my desk at just the right time. It is a nice overview of the subject of data visualization, and it can serve nicely in our introductory course. The book sits closer to the statistical and design perspectives. Instructors will need to supplement the computer science topics such as interactivity, networks, and online graphics. It is of course difficult to teach interactive graphics from a static textbook. (Yau's previous book, Visualize This, has detailed tutorials for most of these techniques. My issue with that book is that it tries to be too many things at once.)
Data Points is a concepts and examples book. It's not a how-to book. There are figures on almost every page, and unlike Visualize This, most figures are actual published data visualization projects.
Just for fun, I classified the figures and plotted the result. (Some purely instructive figures are skipped.)
Running from left to right is the order of appearance of the chart within the book. I classified a total of 135 charts. For each chart, I considered whether one or more of 12 adjectives apply. I labeled about 40 charts "useful", "banal", "silly", and/or "engaging".
You can see from this graph that I enjoy the charts in the initial chapters. Up till chart number 50 or so, I find few "banal" charts, and many "engaging" or "amusing" or "artistic" charts. In the second part of the book, there are not many "surprising" or "amusing" charts.
As for "silly" and "baffling" charts, they appear at an even clip throughout. But that represents just my own bias. I also find "useful" charts throughout the book.
PS. I received a review copy of Data Points. Nathan's blog is Flowing Data.
I like many aspects of this exercise. This chart displays the results of an experiment conducted by a computer games company to show that the new build ("249") renders frames faster than the older build ("248"). The messages of the chart are clear: the 249 build (blue bars) is substantially faster, with over 80% of the frames rendering in 7 milliseconds or fewer under 249 compared to less than 40% under 248; and, less obviously, the variance of frame times is also significantly smaller.
The slight problem is that readers probably have to read the text to grasp most of the above.
In the text, the author explains how to turn time per frame into frames per second, the more common way of measuring rendering speed. The formula is 1000 divided by the time per frame in milliseconds. Wouldn't it be better if the chart plotted fps directly?
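The conversion is simple enough to sketch in a few lines of Python; the frame times below are made-up values for illustration, not the chart's data:

```python
def frame_time_to_fps(ms_per_frame):
    """Convert milliseconds per frame to frames per second."""
    return 1000.0 / ms_per_frame

# Hypothetical frame times (ms), chosen only to illustrate the conversion
for ms in [7, 10.5, 16.7]:
    print(f"{ms} ms/frame -> {frame_time_to_fps(ms):.1f} fps")
# 7 ms/frame -> 142.9 fps
```

Plotting fps directly would save readers from doing this arithmetic in their heads.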
When it comes to presenting distributions (or variability), the cumulative chart is more useful but it also is harder for readers to comprehend. For example:
The beauty of this chart is that one can take any point on the vertical axis, say, the 80% level, and read off the comparative values of 7 milliseconds for the blue line (249) and 10.5 ms for the red (248). That means 80% of the 249 frames were rendered in 7 ms or fewer, compared with 10.5 ms for the 248 frames.
Alternatively, taking a point on the horizontal axis, say 5 milliseconds, one can see that about 8% of 248 frames would reach that threshold but 30% of 249 frames did.
The steeper the ascent of the S-curve, the more efficient the rendering.
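The way one reads a cumulative chart can be mimicked numerically. Here is a minimal sketch of an empirical CDF lookup; the frame-time samples are invented for illustration and are not the company's data:

```python
def ecdf_at(times_ms, threshold_ms):
    """Fraction of frames rendered in at most threshold_ms milliseconds."""
    count = sum(1 for t in times_ms if t <= threshold_ms)
    return count / len(times_ms)

# Hypothetical frame-time samples (ms) for the two builds
build_249 = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
build_248 = [5, 7, 8, 9, 9, 10, 10, 11, 12, 14]

print(ecdf_at(build_249, 7))  # 0.8 -> 80% of frames within 7 ms
print(ecdf_at(build_248, 7))  # 0.2 -> only 20% within 7 ms
```

Reading a point on the horizontal axis of the chart is exactly this calculation; reading a point on the vertical axis is its inverse.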
Robert Kosara has a great summary of the "banking to 45 degrees" practice first proposed by Bill Cleveland (link). Roughly speaking, the idea is that the slope of a line chart should be close to 45 degrees for the best perception. It's not a rule that you see much on Junk Charts because it's one of those rules about which I don't hold a strong opinion.
Here are the examples given by Kosara:
The same data is presented three ways. The slope is a reflection of the scales used on the two axes.
*** Well, I lied when I said I didn't care. Look at this particular chart below:
Some of you may recognize this style... I'm imitating Google Analytics charts. Several of the other Web charting tools also seem to come up with gems like this. Pretty much every chart you see in the Google Analytics interface looks like a flat line. The chart above looks like nothing more than noisy data from week to week.
But then look at the scale! The leftmost part of the line shows a rise over two weeks. The actual rise was 50%, or 300,000: an earth-shattering change.
If you use Google Analytics, you are better off downloading the data to Excel and drawing your own charts.
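A back-of-envelope calculation shows why such charts look flat. The axis ranges and pixel height below are assumptions for illustration, not Google Analytics' actual settings:

```python
def pixel_height_of_change(delta, axis_min, axis_max, plot_height_px):
    """Vertical pixels a data change occupies, given the axis range."""
    return delta / (axis_max - axis_min) * plot_height_px

# Hypothetical: a rise of 300,000 (50% on a 600,000 base)
rise = 300_000
# On an axis padded out to 0..5,000,000 in a 100-pixel-tall widget:
print(pixel_height_of_change(rise, 0, 5_000_000, 100))  # 6.0 px -- looks flat
# On an axis fitted to the data, say 500,000..1,000,000:
print(pixel_height_of_change(rise, 500_000, 1_000_000, 100))  # 60.0 px
```

The same 50% jump occupies six pixels under the padded scale and sixty under the fitted one; the data did not change, only the scale did.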
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto; these teams spread the wealth among the D, F, and M players while not spending much on the goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.) DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
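For readers curious about the grouping step, here is a minimal k-means sketch in plain Python. The cap proportions are invented placeholders, and the original analysis may well have used a different clustering method or tool:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; naive init from the first k points (fine for a toy sketch)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each profile to the nearest center
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # recompute each center as the mean of its cluster
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical cap proportions (D, F, M, GK, other) for six teams
profiles = [
    [0.30, 0.30, 0.30, 0.05, 0.05],   # spread-the-wealth style
    [0.28, 0.32, 0.30, 0.05, 0.05],
    [0.20, 0.25, 0.45, 0.05, 0.05],   # midfield-heavy style
    [0.18, 0.27, 0.45, 0.05, 0.05],
    [0.25, 0.20, 0.15, 0.05, 0.35],   # heavy on "others"
    [0.24, 0.21, 0.16, 0.05, 0.34],
]
labels = kmeans(profiles, 3)
print(labels)
```

Teams with similar spending profiles land in the same cluster, which is all the profile chart needs as input.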
I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.
In modern software (I'm using JMP's Graph Builder here), it takes only one click to go from line to bar, and one more to go to pie.
There is a tendency when producing dashboards to go for the cutesy-cutesy. Reader Daniel L. came across an attempt by Facebook to document its data center metrics (link). They chose this circular, spiraling design:
Notice that on a circular plot, the lines of equal value are the concentric circles. Thus, when the designers connect different points in a continuous way, as if it were a standard line chart, the line segments between data points are distorted. The diagram below shows the problem:
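The distortion can also be verified with a little arithmetic. In the sketch below, the mapping of hour to angle and value to radius, and the two readings, are all assumptions for illustration:

```python
import math

def polar_to_xy(value, hour, max_value=100.0):
    """Map a reading onto the circle: hour -> angle, value -> radius."""
    angle = 2 * math.pi * hour / 24
    r = value / max_value
    return (r * math.cos(angle), r * math.sin(angle))

# Two hypothetical readings three hours apart
p1 = polar_to_xy(40, 3)   # value 40 at 3:00
p2 = polar_to_xy(80, 9)   # value 80 at 9:00

# Midpoint of the straight chord the chart draws between them
mx, my = (p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2
chord_mid_value = math.hypot(mx, my) * 100

# The value a faithful interpolation would show at 6:00
true_mid_value = (40 + 80) / 2

print(round(chord_mid_value, 1), true_mid_value)  # 44.7 vs 60.0
```

The straight chord passes well inside the radius a faithful interpolation would use, so the segments between data points systematically understate the in-between values.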
One potential advantage (though not a worthwhile one) of wrapping the data into a circle is that the 24 hours become a continuous line. Except that isn't the case here! Weirdly, the purple and blue lines show a huge discontinuity at the ray that points vertically upwards from the origin. This leads to an even more fascinating find.
The circle actually rotates! It's like a rotating restaurant. The time shown vertically pointing upwards keeps changing as I write this post. This makes the discontinuity even more baffling. You'd think the previous data point just shifts anti-clockwise but apparently not. If any of you can figure this out, please leave a comment.
As Daniel pointed out, the traditional line charts shown in the bottom half of the page would have done the job with less fuss. Not as eye-catching, but not as baffling either.
One innovation of online charts is the replacement of axis labels with mouse-over effects. Mousing over the chart here produces the underlying data values. This is elegance.
One horrible trend with online charts is the horrendous choice of scale. Look at the top two charts, especially the orange line chart about power usage. It makes no sense to choose a scale that completely annihilates the underlying fluctuations.
I have found the same problems with many Google charts. It looks as if nothing is happening except when you look more closely, you learn that a tiny distance represents a big percentage shift in the underlying data.
When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.
When graphs are not done right, sometimes they manage to obscure the information.
Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.
The first reaction is that maybe the researchers are telling us there is no information here. The most important variable on this chart is what they call "datanum", and it goes from left to right across the page. A casual glance across the page gives the idea that nothing much is going on.
Then you look at the row labels, and realize that this dataset is very well structured. The target variable (AP Position Error) is compared along four dimensions: datanum, the algorithm (WCL, or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).
When the data has a nice structure, there should be better ways to visualize it.
John submitted a much improved version, which he created using ggplot2.
This is essentially a small multiples chart. The key differences between the two charts are:
Giving more dimensions a chance to shine
Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
Using a profile chart, which also allows the y-axis to start from 2
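The reshaping behind such a small-multiples chart is worth making explicit: each observation becomes one row, with the grouping dimensions as columns (the "long" layout that ggplot2's facetting expects). The error values below are invented placeholders, not the paper's numbers:

```python
# Hypothetical error readings keyed by (algorithm, location) then datanum;
# the real values would come from the paper's data table.
wide = {
    ("WCL", "inner"):        {100: 3.1, 200: 3.0, 400: 3.0},
    ("WCL", "boundary"):     {100: 6.2, 200: 6.1, 400: 6.2},
    ("GPR+WCL", "inner"):    {100: 4.5, 200: 4.0, 400: 3.6},
    ("GPR+WCL", "boundary"): {100: 5.8, 200: 5.6, 400: 5.5},
}

# Flatten to long format: one row per observation, ready for faceting
long_rows = [
    {"algorithm": algo, "location": loc, "datanum": n, "error": err}
    for (algo, loc), series in wide.items()
    for n, err in series.items()
]

print(len(long_rows))  # 12 rows = 4 panels x 3 sample sizes
```

Once the data is in this shape, each of the four dimensions can be assigned to a panel, a line, a color, or an axis, which is exactly what the improved chart does.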
When you read this chart, you finally realize that the experiment has yielded several insights:
Increasing sample size does not affect aggregate WCL error rate but it does reduce the error rate of aggregate GPR+WCL.
The improvement of GPR+WCL comes only from the inner access points.
The WCL algorithm performs really well in inner access points but poorly in outer access points.
The addition of GPR to the WCL algorithm improves the performance on outer access points but degrades the performance on inner access points. (In aggregate, it improves the performance, but only because there are almost two outer access points for every inner one.)
Now, I don't know anything about this position estimation problem. The chart leaves me wondering why they don't just use WCL on inner access points. The performance under that setting is far and away the best of all the tested settings.
The researchers described their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that, because when I see 95% confidence, I expect to see confidence bands around the point estimates shown above.
And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.
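To make the point concrete, here is a normal-approximation 95% interval with entirely hypothetical numbers (the paper does not report its standard deviations or sample sizes):

```python
import math

def ci95(mean, sd, n):
    """Normal-approximation 95% confidence interval for a mean."""
    half = 1.96 * sd / math.sqrt(n)
    return (mean - half, mean + half)

# Hypothetical: a position error estimated at 3.47 m from 30 trials, sd 1.2 m
lo, hi = ci95(3.47, 1.2, 30)
print(f"3.47 m, 95% CI ({lo:.2f}, {hi:.2f})")  # CI roughly (3.04, 3.90)
```

Against an interval nearly a meter wide, reporting 3.47 rather than 3.5 conveys no extra information.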
Readers Fausto and Jeruza have a question for us. In the following official schedule (link) for the upcoming London Olympics, what do the colors signify?
The blue seems to signify aquatics (diving, rowing, sailing, etc.) except that at the bottom of the chart (clipped), weightlifting has the same blue. The four types of cycling come in three colors. A legend would be a very useful thing here. Like F&J, I kept staring at the chart hoping for inspiration but nothing is forthcoming.
I would also prefer to see the sports listed in order of earliest start date, as opposed to alphabetically.
I find other aspects of this chart attractive. There is an impressive amount of detail, such as which days the event finals are held. Mousing over the medal symbols produces some useful data:
The daily view provides even more details:
The details are nicely wrapped in layers.
Taking a step back, it would be interesting to understand whom the designer had in mind when creating this chart. I could imagine a journalist trying to get a quick overview of the day's events, but the chart doesn't provide the venues. An avid fan might want to figure out what's on TV, but consulting a true TV schedule would be better, since no station is likely to show everything, and they can't show simultaneous competitions anyway.