The Giants QB Eli Manning is in the news for the wrong reasons this season. His hometown paper, the New York Times, looked the other way, focusing on one metric at which he still excels: longevity. He is, in this respect, football's Cal Ripken. The graphic (link), though, is fun to look at while managing to put Eli's streak in context. It is a great illustration of handling foreground/background issues. (I had to snip the bottom of the chart.)
After playing around with this graphic, please go read Kevin Quealy's behind-the-scenes description of the various looks that were discarded (link). He shows 19 sketches of the data. The importance of sketching cannot be stressed enough. If you don't have discarded sketches, you don't have a great chart.
Pay attention to tradeoffs that are being made along the way. For example, one of the sketches showed the proportion of possible games started:
I like this chart quite a bit. The final selection arranges the data by team rather than by player so necessarily, the information about proportion of possible games started fell by the wayside.
(Disclosure: I'm on Team Philip. Good to see that he is right there with Eli even on this metric.)
Sometimes I wonder if I should just become a chart doctor. Andrew recently wrote that journals should have graphical editors. Businesses need them too, judging from this submission via Twitter (@francesdonald). Link is here.
You don't know whether to laugh or cry at this pie chart:
The author of the article complains that tall buildings around the world are cheating: vanity height is defined as the height above which the floors are unoccupied. But the sample proportions aren't that different between countries, ranging from 13% to 19% of total height. Why are they added together to make a whole?
The following boxplot illustrates both the average and the variation in vanity heights by region, and tells a more interesting story:
Recall that in a boxplot, the gray box contains the middle 50% of the data and the white line inside the box indicates the median value. The UAE tends to inflate heights more, while the other three regions are not much different from one another.
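For readers who want to see exactly what a boxplot encodes, here is a minimal Python sketch; the vanity-height percentages below are made-up placeholders, not the chart's data:

```python
import numpy as np

def box_stats(values):
    """The numbers a boxplot encodes: the quartiles (gray box) and the median (white line)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {"q1": q1, "median": median, "q3": q3}

# Hypothetical vanity-height percentages for buildings in one region (made up)
region = [39, 29, 19, 16, 12]
print(box_stats(region))
```

The middle 50% of the data falls between `q1` and `q3`, which is what the gray box spans.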
The other graphic included in the same article is only marginally better, despite a much more attractive exterior:
This chart misrepresents the actual heights of the buildings. At first glance, I thought there must be a physical limit to the number of occupied floors, since the grayed-out sections appear to be of equal height. If the decision has been made to focus on the vanity height, then just don't show the rest of the buildings.
Also, it's okay to assume a minimum of intelligence on the part of readers: is there really a need to repeat the "non-occupiable height" label 10 times? Similarly, the use of 10 sets of double asterisks is rather extravagant.
On Twitter, Joe D. disliked the following chart on the Information is Beautiful blog:
The chart carries a long list of flaws.
The column labeled "%" is probably the most jarring. The meaning of these numbers changes with the color. When pink, they give the proportion of females; when blue, the proportion of males. As the stated purpose of the chart is to explore the male-female balance at different websites, it is a bad decision to fold two dimensions into one. While you're thinking about what I just said, what do you think the percentages in gray mean? Your guess is as good as mine.
Now, I appreciate that the designer implicitly uses a margin of error, separating out these three sites as representing "equality", even though only one of them has an exact 50/50 split.
Wait, for Orkut (second row), it's 51 percent female, and for Foursquare, it's 52 percent male. The gender is coded in the figurines. You can check that with your magnifying glass.
It gets better.
The list of websites is ordered by increasing polarity but only within the three sections. Logically, the three "equality" sites should sit between the "matriarchy" and the "patriarchy". Pinterest and Reddit, the two most polarized sites, should stand on the edges. On the diagram shown right, I simulated a reader who wants to scan through the list of websites from the most female-oriented (Pinterest) to the most male-oriented (Reddit). It's quite the obstacle course.
Let's get to Joe D.'s issue with the chart. How many people does each figurine represent? The answer is quite a mouthful: each figurine represents one percent of the unique visitors at the specific website, but only the percentage in excess of fifty percent. In effect, the Facebook figurine represents a huge number of people compared to the figurine of a less popular website like Tagged. The designer also did not explain the inclusion criteria for websites.
If you didn't get that definition, just ignore the figurines and think of this chart as a bar chart in which the bars start at 50 percent (rather than at zero, as they should). A standard population pyramid appears to do a better job: just add bars to the left of the diagram and properly align the male and female sections.
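A population pyramid of the kind suggested here can be sketched in a few lines of Python; the site names and splits below are hypothetical placeholders, not the chart's data:

```python
def pyramid_bars(pct_female):
    """Bar lengths for a population pyramid: females plotted to the left of a
    center axis (as negative values), males to the right, both starting at zero."""
    pct_male = 100 - pct_female
    return (-pct_female, pct_male)

# Hypothetical gender splits (% female) for three made-up sites
sites = {"site_a": 58, "site_b": 50, "site_c": 44}
bars = {name: pyramid_bars(p) for name, p in sites.items()}
print(bars["site_a"])  # (-58, 42)
```

Because both bars start at zero, the full male and female shares are visible, not just the excess over 50 percent.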
As I said before, read the fine print.
Here's the fine print:
If I am not mistaken, the designer applied the gender proportions to the traffic totals to obtain the rightmost column, labeled "million more monthly female or male visitors". The trouble is that one number pertains to U.S. visitors while the other pertains to worldwide traffic. By multiplying them, the designer assumes that the gender ratio is the same inside and outside the U.S., for every website.
Just to give you a sense of scale, according to this chart, Facebook has an excess of 155 million female visitors per month. According to Comscore, the key provider of such data, Facebook has about 145 million total U.S. visitors in June 2013. It's no small matter to mix up the geographies.
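To see why the geographies matter, here is a Python sketch of the computation the designer appears to have made; the 58% split below is a made-up illustration, while the 145 million figure is the Comscore U.S. total cited above:

```python
def excess_visitors(total_visitors_m, pct_female):
    """Excess female visitors (millions) implied by applying a gender split
    to a traffic total. Negative values would mean an excess of males."""
    pct_male = 100 - pct_female
    return total_visitors_m * (pct_female - pct_male) / 100

# Hypothetical: a 58/42 split applied to 100M visitors yields a 16M excess
print(excess_visitors(100, 58))  # 16.0

# Even a (fictional) 100% female U.S. audience caps the excess at the U.S.
# total of ~145M, still short of the chart's claimed 155M excess.
print(excess_visitors(145, 100))  # 145.0
```

The only way to reach a 155M excess is to apply the U.S. gender split to the much larger worldwide total, which is exactly the mix-up described above.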
This example illustrates what I call "use at your own peril". It's like the Surgeon General's warning posted in restaurants in the U.S.: we warn you that drinking alcohol while pregnant could lead to birth defects, but you are free to do whatever you want with this information.
As of this writing, the original chart has thousands of Facebook likes, hundreds of shares on Linkedin and Pinterest, etc.
It appears that a lot of people enjoy the chart more than Joe and I do.
Finally, here is a sketch of how I would plot this type of data. (U.S. traffic data from Comscore, various months of 2012, where I can find them. Comscore is a fee-based service so it is not easy to find data for the smaller sites unless you have a subscription.)
I like many aspects of this exercise. The chart displays the results of an experiment conducted by a computer games company to show that the new build ("249") renders frames faster than the older build ("248"). The messages of the chart are clear: the 249 build (blue bars) is substantially faster, with over 80% of frames rendering in 7 milliseconds or fewer under 249 compared to less than 40% under 248; and, less obviously, the variance of frame times is also significantly smaller.
The slight problem is that readers probably have to read the text to grasp most of the above.
In the text, the author explains how to convert time per frame into frames per second (fps), the more common measure of rendering speed. The formula is 1000 divided by the time per frame in milliseconds. Wouldn't it be better if the chart plotted fps directly?
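The conversion is a one-liner; a Python sketch, using the two values read off the chart:

```python
def ms_to_fps(ms_per_frame):
    """Convert render time (milliseconds per frame) to frames per second."""
    return 1000.0 / ms_per_frame

# 7 ms/frame (build 249) vs 10.5 ms/frame (build 248)
print(round(ms_to_fps(7), 1), round(ms_to_fps(10.5), 1))  # 142.9 95.2
```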
When it comes to presenting distributions (or variability), the cumulative chart is more useful but it also is harder for readers to comprehend. For example:
The beauty of this chart is that one can take any point on the vertical axis, say the 80% level, and read off the comparative values: 7 milliseconds for the blue line (249) and 10.5 ms for the red (248). That means 80% of the 249 frames were rendered in fewer than 7 ms, versus 10.5 ms for the 248 frames.
Alternatively, taking a point on the horizontal axis, say 5 milliseconds, one can see that only about 8% of 248 frames were rendered within that threshold, compared to about 30% of 249 frames.
The steeper the ascent of the S-curve, the more efficient the rendering.
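For those who want to play with cumulative curves, here is a minimal Python sketch of the two ways of reading them; the frame times below are made up for illustration and do not come from the experiment:

```python
import math

def ecdf(sample, x):
    """Fraction of observations at or below x: one point on the cumulative curve."""
    return sum(v <= x for v in sample) / len(sample)

def quantile(sample, q):
    """Smallest value v such that at least a fraction q of the sample is <= v."""
    s = sorted(sample)
    idx = max(math.ceil(q * len(s)) - 1, 0)
    return s[idx]

# Hypothetical frame times in ms (made up; the real data is in the chart)
build_249 = [4, 5, 5, 6, 6, 7, 7, 8, 9, 12]
build_248 = [6, 8, 9, 9, 10, 10, 11, 12, 14, 18]

# Reading the curves horizontally, at the 80% level:
print(quantile(build_249, 0.8), quantile(build_248, 0.8))
# Reading the curves vertically, at the 5 ms threshold:
print(ecdf(build_249, 5), ecdf(build_248, 5))
```

The horizontal reading compares speeds at a fixed percentile; the vertical reading compares percentiles at a fixed speed.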
Luck is not easy to nail down in a number. For my fantasy football league, I have a way of looking at it. One aspect of luck is which team you are matched up with in any given week. There is the matter of facing a stronger or weaker opponent, and the separate matter of whether you face a given opponent on his or her hot or cold day. It is sort of like whether a hitter faces a pitcher on his good or bad day.
As noted before, each FFL owner picks nine players out of 14 every week, and those nine earn points. There are typically 200-300 possible squads of nine players. So we can measure how well any FFL owner performs by comparing the points total of the activated squad against the whole distribution of 200-300 options. This was the topic of my earlier post.
Now, if I am lucky, then I tend to face opponents in the weeks in which they perform poorly. And the following chart shows this measure from week to week:
In Week 1, this owner was rather unlucky, in the sense that his opponent fielded pretty much the best possible squad. On the other hand, in Week 4, his opponent (a different team) played a weak hand, something close to the median squad. (In addition, the entire histogram sits on the left side of the chart, meaning that even his opponent's best possible squad that week would have been easy to beat.)
Luck can be measured over the course of the 13 weeks. If the vertical lines tend to show up in the right tails of these histograms, then this owner isn't lucky. On the other hand, if the lines show up mostly in the left halves of the histograms, then this owner is lucky.
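This luck measure boils down to a percentile rank. A Python sketch, with a made-up and much smaller distribution of possible squads than the real 200-300:

```python
def percentile_rank(actual_points, possible_totals):
    """Where the actual squad's points fall within the distribution of all
    possible squads: near 0 = weak hand played, near 1 = best possible squad."""
    below = sum(t < actual_points for t in possible_totals)
    return below / len(possible_totals)

# Hypothetical points totals for an opponent's possible squads one week
possible = [70, 75, 80, 82, 85, 88, 90, 95, 100, 110]
print(percentile_rank(85, possible))  # 0.4: a middling hand
```

Averaging this rank over your opponents' actual squads across the 13 weeks gives one number for how lucky your schedule was: low average ranks mean opponents kept playing weak hands against you.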
In Chapter 8 of Numbersense, I use such an analysis to figure out the role of luck. This luck factor turned out to be even more important than the owner's own skills!
Special for Junk Charts readers: here is an excerpt from Chapter 8 (link).
The second book giveaway contest is under way on the sister blog. Enter the contest here.
I generated a big data set when writing Chapter 8 of Numbersense. This chapter discusses the question of how to measure your skills in managing/coaching a fantasy sports team. The general statistical question is how to separately measure two factors that both contribute to a single outcome.
In fantasy football (NFL), there is a matchup every week. Each week, you pick nine players from a roster of 14 players (rules vary by league). These nine players will score points for your team, based on how those players actually perform in real-life NFL games that week. You notch a win that week if your team scores more points than your opponent's team.
There are many ways to pick 9 players out of 14. In fact, in any given week, there are 200-300 eligible squads, of which only one is fielded. My big data set consists of all possible squads for every week for every team in the league. This data set contains rich information; the challenge is how to surface the information.
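The unconstrained count is easy to verify in Python:

```python
import math

# Unconstrained, there are C(14, 9) ways to choose 9 players from a 14-man roster
print(math.comb(14, 9))  # 2002

# Positional rules (a fixed number of QBs, RBs, WRs, etc. per squad) cut this
# down to the 200-300 eligible squads mentioned above.
```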
Visualization comes to the rescue. I'll be posting a series of charts here. Today's is the first one.
There are 13 plots, each of which represents a week of the season. The 13 plots trace the decisions of a single team over the course of the season. In each plot, the vertical line indicates the points total for the 9-player squad that was actually fielded by the team owner.
The histogram shows the range of choices the team owner could have made each week. Recall there are 200-300 possible squads of nine players from which the owner selected one. For example, in week 1, the owner didn't choose very well; there are many other sets of 9 players he could have chosen that would have scored him more points (the area to the right of the vertical line).
In Week 4, though, the owner could not have done much better. There were very few changes he could have made that would have increased his points total. Similarly, in Weeks 5 and 8.
You can also see that in Week 7, the 14 players he owned all tanked (in real life). The entire histogram is on the left side, meaning the points totals are horrible. Contrast this with Week 13, when the histogram is located on the right side of the chart, implying that this team owner would score pretty high no matter which 9 players he fielded.
You can get a copy of Numbersense here. Or enter the book giveaway quiz to try your luck.
Abhinav asks me to check out his blog post on a chart on global warming (I prefer the term climate change) featured on Wonkblog. The chart is sourced to a report by the World Meteorological Organization (link to PDF).
Hello, start the axis at zero whenever you are plotting columns. That's as fundamental as plotting only proportions on a pie chart.
There is a reason why the designer didn't want to start the axis at zero. It is this (Abhinav helpfully made all of these charts):
The trouble is that for this data set (on global average temperature), the area below 13 is completely useless. It's like plotting body temperature on a scale of 0-100 Celsius when all feasible values fall into a tight range, maybe 35-38 Celsius. I recount a similar situation, one that led a college president to say something stupid, in Chapter 1 of my new book, Numbersense. (Information on the book is here.)
So we understand the desire to get rid of the irrelevant white space. This is accomplished by using a line chart. (I'd prefer to omit the data values, and rely on the axis.)
Abhinav then created various versions of this by compressing and expanding the vertical scales. I don't think there is anything wrong with the above scale. As I mentioned, the scale should focus on the range of values that are feasible.
After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.
The contentious issue is that X and Y might appear correlated when in fact both data series are strongly correlated with time (e.g., population almost always grows with time), while X and Y may not be correlated with each other.
Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.
The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.
The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X against Y and ignoring time, we make time an omitted variable, and it may still be driving both series.
The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.
This is because time is still the invisible hand: it still runs roughly from left to right across the chart. This pattern becomes visible if we connect the data points with line segments in temporal order, as in the chart below.
One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.
Here is the result (right). We now have a random scatter of points averaging around zero. If anything, there may be a slight negative correlation, meaning that when the labor force participation rate is above trend, per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.
What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.
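The de-trending procedure can be sketched in a few lines of Python with numpy. The two series below are simulated, not the real driving and labor force data: they share nothing except a strong time trend, yet their raw correlation is enormous while their residual correlation is near zero.

```python
import numpy as np

def detrend(t, y):
    """Fit a linear trend in time and return the residuals (above/below trend)."""
    slope, intercept = np.polyfit(t, y, 1)
    return y - (slope * t + intercept)

rng = np.random.default_rng(0)
t = np.arange(100.0)
# Two simulated series whose only common driver is time
x = 2.0 * t + rng.normal(0, 5, 100)
y = -1.0 * t + rng.normal(0, 5, 100)

raw_corr = np.corrcoef(x, y)[0, 1]      # large in magnitude: time drives both
resid_corr = np.corrcoef(detrend(t, x), detrend(t, y))[0, 1]  # near zero
```

Plotting the residuals against each other, rather than the raw series, is what produces the random scatter described above.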
Robert Kosara has a great summary of the "banking to 45 degrees" practice first proposed by Bill Cleveland (link). Roughly speaking, the idea is that the slope of a line chart should be close to 45 degrees for the best perception. It's not a rule that you see much on Junk Charts because it's one of those rules about which I don't hold a strong opinion.
Here are the examples given by Kosara:
The same data is presented three ways. The slope is a reflection of the scales used on the two axes.
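One way to operationalize Cleveland's idea is median-absolute-slope banking: choose the aspect ratio that draws the median segment slope at 45 degrees. A Python sketch, assuming the x values are strictly increasing:

```python
def banking_aspect(xs, ys):
    """Aspect ratio (height/width) that banks the median absolute segment slope
    to 45 degrees, following Cleveland's median-absolute-slope heuristic.
    Assumes xs is strictly increasing."""
    # Normalize both axes to [0, 1] so slopes are in plot units
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    xn = [(x - x0) / (x1 - x0) for x in xs]
    yn = [(y - y0) / (y1 - y0) for y in ys]
    slopes = sorted(abs((yn[i + 1] - yn[i]) / (xn[i + 1] - xn[i]))
                    for i in range(len(xn) - 1))
    mid = len(slopes) // 2
    median = slopes[mid] if len(slopes) % 2 else (slopes[mid - 1] + slopes[mid]) / 2
    # Scale the height so the median slope is drawn at 45 degrees
    return 1.0 / median if median else 1.0
```

For a straight diagonal line the function returns 1.0 (a square plot); data dominated by shallow segments gets a taller plot, and data dominated by steep segments a flatter one.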
***

Well, I lied when I said I didn't care. Look at this particular chart below:
Some of you may recognize this style... I'm imitating Google Analytics charts. Several other Web charting tools also seem to produce gems like this. Pretty much every chart you see in the Google Analytics interface looks like a flat line. The chart above looks like nothing more than noisy data from week to week.
But then look at the scale! The leftmost part of the line is a rise over two weeks. The actual rise was 50%, or 300,000, which is an earth-shattering change.
If you use Google Analytics, you are better off downloading the data to Excel and drawing your own charts.