On Twitter, Andy C. (@AnkoNako) asked me to look at this pretty creation at NFL.com (link).
There is a reason why you don't read much about spider charts (web charts, radar charts, etc.) here. While this chart is beautifully constructed, and fun to play with, it just doesn't work as a vehicle for communication.
The example above allows us to compare four players (here, quarterbacks) on eight metrics. Each white polygon represents one player, and the orange outline represents the league average quarterback.
What are some of the questions one might have about comparing quarterbacks?
Who is the best quarterback, and who is the worst?
Who is the better passer? (ignoring other skills, like rushing ability)
Is each quarterback better or worse than the average quarterback?
How will you figure these out from the spider chart?
Not sure. The relative value of the quarterbacks is definitely not encoded in the shape of the polygon, nor the area. To really figure this out, you'd need to look at each of the eight spokes independently, and then aggregate the comparisons in your head. Unless... you are willing to ignore seven of the eight metrics, and just look at passer rating (below right).
Focusing on passing only means focusing on five of the eight metrics, from pass attempts to interceptions. How you combine five metrics into one evaluation is anyone's guess.
One can tell that Joe Flacco is basically the average quarterback as his contour is almost exactly that of the average (orange outline). Are the others better or worse than average? Hard to tell at first glance.
First, the chart invites users to place equal emphasis on each of the eight dimensions. (There is a control to remove dimensions.) But the metrics are clearly not equally important. You certainly should value passing yards more than rushing yards, for example.
Second, the chart ignores the correlation between these eight metrics. The easiest way to see this is the "Passer Rating", which is a formula comprising the Passing Attempts, Passing Completions, Interceptions, Touchdown Passes, and Passing Yards. Yes, all those five components have been separately plotted. Another easy way to see the problem is that Passing Yards are highly correlated with Passing Attempts or Passing Completions.
Third, the chart fails to account for different types of quarterbacks. I deliberately chose these four because Joe Flacco was a starter, Tyrod Taylor was a backup who almost never played, while at San Francisco, Alex Smith and Colin Kaepernick shared the starting duties. So for Passing Yards, the numbers were 3817, 179, 1737 and 1814 respectively. Those numbers should not be directly compared. Better statistics would be something like yards per minute played, yards per offensive series, yards per play executed, etc. The way that this data is used here, all the second- and third-string quarterbacks will be below average and most of the starters will be above average.
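To make the redundancy behind the second point concrete, here is a minimal sketch of the passer rating calculation. It uses the standard NFL formula as commonly published, not anything taken from the NFL.com chart, and the function and argument names are illustrative. Everything it consumes is already on the spider chart as a separate spoke.

```python
def passer_rating(attempts, completions, yards, touchdowns, interceptions):
    """NFL passer rating computed from the five raw passing counts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    def clamp(x):
        # Each of the four components is bounded between 0 and 2.375.
        return max(0.0, min(2.375, x))

    a = clamp((completions / attempts - 0.3) * 5)        # completion percentage
    b = clamp((yards / attempts - 3) * 0.25)             # yards per attempt
    c = clamp((touchdowns / attempts) * 20)              # touchdown rate
    d = clamp(2.375 - (interceptions / attempts) * 25)   # interception rate
    return (a + b + c + d) / 6 * 100
```

Note that the formula works on per-attempt rates, which also speaks to the third point: the rating is already normalized for playing time while the raw counts are not.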
From a design perspective, there are a small number of misses.
Mysteriously, the legend always has only two colors no matter how many players are being compared. The orange is labeled Average while the white is labeled "Leader". I have no idea why any of the players should be considered the "Leader".
The only way to know which white polygon represents which player is to hover on the polygon itself. You'll notice that in my example, several of those polygons overlap substantially so sometimes, hovering is not a task easily accomplished.
The last issue is scale. It turns out that some of the metrics, like interceptions, touchdown passes, rushing yards, etc., can be zero. Take a look at this subset of the chart where I hovered on Tyrod Taylor.
Do you see the problem? The zero point is definitely not the center of the circle. This problem exists for any circular charts like bubble charts.
Now look at Interceptions. Because the scale is reversed (lower is better), the zero point of this metric will lie on the outer edge of the circle. This is a vexing issue because the radius is open-ended on the outside but closed-ended on the inside.
In the next post, I will discuss some alternative presentation of this data.
I like many aspects of this exercise. This chart displays the results of an experiment conducted by a computer games company to show that the new build ("249") renders frames faster than the older build ("248"). The messages of the chart are clear: the 249 build (blue bars) is substantially faster; over 80% of the frames render in 7 milliseconds or fewer under 249, compared to less than 40% under 248; and, less obviously, the variance of frame times is also significantly smaller.
The slight problem is that readers probably have to read the text to grasp most of the above.
In the text, the author explains how to turn time per frame into frames per second, the more common way of measuring rendering speed. The formula is 1000 divided by time per frame (in milliseconds). Wouldn't it be better if the chart plotted fps directly?
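The conversion is a one-liner. A minimal sketch, assuming frame times are given in milliseconds (the function name is illustrative):

```python
def ms_to_fps(frame_time_ms):
    """Convert time per frame (milliseconds) to frames per second."""
    if frame_time_ms <= 0:
        raise ValueError("frame time must be positive")
    return 1000.0 / frame_time_ms


for ms in (5.0, 7.0, 10.5):
    print(f"{ms} ms per frame -> {ms_to_fps(ms):.1f} fps")
```

Because the conversion is a reciprocal, it is not linear: the average of the fps values is generally not the fps of the average frame time, so the conversion should be applied frame by frame before summarizing.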
When it comes to presenting distributions (or variability), the cumulative chart is more useful but it also is harder for readers to comprehend. For example:
The beauty of this chart is that one can take any point on the vertical axis, say, the 80% level, and read off the comparative values of 7 milliseconds for the blue line (249) and 10.5 ms for the red (248). That means 80% of the 249 frames were rendered in fewer than 7 ms, relative to 10.5 ms for 248 frames.
Alternatively, taking a point on the horizontal axis, say 5 milliseconds, one can see that about 8% of 248 frames would reach that threshold but 30% of 249 frames did.
The steeper the ascent of the S-curve, the more efficient is the rendering.
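The two ways of reading the cumulative chart correspond to two small functions. This is a rough sketch on made-up frame times, not the data behind the chart; the function names are illustrative.

```python
from bisect import bisect_right
from math import ceil


def fraction_at_or_below(frame_times_ms, threshold_ms):
    """Read the chart vertically: share of frames rendered within threshold_ms."""
    ordered = sorted(frame_times_ms)
    return bisect_right(ordered, threshold_ms) / len(ordered)


def time_at_fraction(frame_times_ms, fraction):
    """Read the chart horizontally: smallest time covering `fraction` of frames."""
    ordered = sorted(frame_times_ms)
    k = max(1, ceil(fraction * len(ordered)))
    return ordered[k - 1]


# Made-up frame times in milliseconds, NOT the data behind the chart.
build_248 = [4.5, 6.0, 8.0, 9.0, 9.5, 10.0, 10.5, 12.0]
build_249 = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 9.0]

for name, times in (("248", build_248), ("249", build_249)):
    print(name, fraction_at_or_below(times, 5.0), time_at_fraction(times, 0.5))
```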
Note: The winner of the Book Quiz Round 2 was announced on my book blog. Congratulations to the winners. You can get your own copy of Numbersense here.
A common piece of advice for anyone living in the U.S. is "read the fine print." If you receive a notice or see an ad, and there is an asterisk or some copy in almost invisible font located at the bottom of the page, you'd better pull out your magnifying glass.
If you are a data analyst, you'd better have a magnifying glass in your pocket at all times. One of the recurring themes in Numbersense is that details matter... a lot. This is particularly relevant to Chapters 6 and 7 on economic data.
Last week, on the first Friday of the month, the jobs report came out. For the best reporting on the data itself, with succinct commentary but no hand-waving, I go to Calculated Risk blog.
One of the charts highlighted (in this post) is the unemployment rate by educational attainment. This is the chart that leads to horribly misleading statements saying that the solution to the unemployment crisis is more education. I ranted about this before--see here and here.
Taking this chart at face value, you'd say that the unemployment rate is lower, the more education one has. One can also say that the unemployment rate is less volatile, the more education one has.
Bill makes two succinct comments, basically letting his readers know this chart is next to worthless.
1. Although education matters for the unemployment rate, it doesn't appear
to matter as far as finding new employment - and the unemployment rate
is moving sideways for those with a college degree!
The issue behind this is the "cohort effect". The chart above aggregates everyone from 25 years old and over. This means it treats equally people who just graduated from college last year and people who got their degrees thirty years ago. Why does this matter? A jobs recession hits certain types of people harder than others, and one important determinant is work experience (another would be the industry one works in). The low unemployment rate for all college graduates masks the challenging job market for recent college graduates. The misinterpretation of this chart leads to wrongheaded policies such as producing more college graduates.
2. This says nothing about the quality of jobs - as an example, a college
graduate working at minimum wage would be considered "employed".
This is where the magnifying glass is critical. You should not assume that your idea of "employed" is the same as the official definition of "employed". Bill raised the issue of minimum wage. Elsewhere, other commentators noted the issue of "part-timers". Part-time employment is not distinguished from full-time employment in the official aggregate statistics.
Taking this further, isn't it plausible that unemployment "trickles down"? As the college graduates grab whatever job they can find, including the minimum-wage ones, they push the high-school graduates out of jobs.
In data, there is often no fine print to be found. In Big Data, this problem is aggravated a thousandfold. Unfortunately, a magnified blank is still blank. So, having the magnifying glass is not enough.
The solution then is to create your own fine print. Spend inordinate amounts of time understanding how data is collected. Dig deeply into how data is defined.
No, this work is not sexy. (PS. If you can't stand it, you really shouldn't be in data science.)
In Chapter 6 of Numbersense, I did this work for you as it relates to jobs data. What I show there is that there is no "right" way to measure employment--it's not as clearcut as you'd like to think. If you were to put forth your definition of "employed" for comment, your definition will absolutely get criticized, just the same way you're criticizing the government's definition.
PS. Larry at Good Stats, Bad Stats pulled out his magnifying glass and wrote a series of posts about education, employment and income. He mildly disagrees with me.
Luck is not easy to nail down in a number. For the fantasy football league, I have a way of looking at luck. One aspect of luck is which team you are matched up with in any given week. There is a matter of facing a stronger or a weaker opponent. There is a different matter of whether you face a given opponent on his/her hot or cold day. Sort of like whether a hitter faces a pitcher on his good or bad day.
As noted before, each FFL player picks nine players out of 14 every week, and those nine earn points. There are typically 200-300 possible choices of nine players. So we can measure how well any FFL owner performs by looking at the points total of the activated squad against the whole distribution of 200-300 options. This was the topic of my earlier post.
Now, if I am lucky, then I tend to face opponents in the weeks in which they perform poorly. And the following chart shows this measure from week to week:
In Week 1, this owner was rather unlucky, in the sense that his opponent pretty much used his best possible squad. On the other hand, in Week 4, his opponent (a different team) played a weak hand, something close to the median squad. (In addition, the entire histogram sits on the left side of the chart, meaning that even his opponent's best possible squad this week would have been easy to beat.)
Luck can be measured over the course of the 13 weeks. If the vertical lines tend to show up on the right tails of these histograms, then this owner isn't lucky. On the other hand, if the lines show up mostly on the left half of the histograms, then this owner is lucky.
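One way to boil the 13 histograms down to a single number is to average where the opponent's fielded squad falls within that opponent's own distribution. This is only an illustrative sketch, not necessarily the calculation used in the book; the function names and the sample numbers are made up.

```python
def percentile_of_fielded(all_squad_points, fielded_points):
    """Share of the possible squads scoring at or below the fielded squad."""
    at_or_below = sum(1 for p in all_squad_points if p <= fielded_points)
    return at_or_below / len(all_squad_points)


def luck_score(opponent_weeks):
    """Average opponent percentile over the season; lower means luckier.

    opponent_weeks is a list of (all_squad_points, fielded_points) pairs,
    one pair per week, describing the opponent faced that week.
    """
    pcts = [percentile_of_fielded(points, fielded)
            for points, fielded in opponent_weeks]
    return sum(pcts) / len(pcts)


# Two hypothetical weeks: the opponent fields his best squad, then a median one.
season = [([60, 70, 80, 90], 90), ([60, 70, 80, 90], 70)]
print(luck_score(season))
```

A score near 1 means opponents kept fielding close to their best possible squads (the vertical lines sit in the right tails); a score well below 0.5 means they mostly played weak hands.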
In Chapter 8 of Numbersense, I use such an analysis to figure out the role of luck. This luck factor turned out to be even more important than the owner's own skills!
Special for Junk Charts readers: here is an excerpt from Chapter 8 (link).
The second book giveaway contest is under way on the sister blog. Enter the contest here.
The winner of the Numbersense Book Quiz has been announced. See here.
GOOD NEWS: McGraw-Hill is sponsoring another quiz. Same format. Another chance to win a signed book. Click here to go directly to the quiz.
I did a little digging around the quiz data. The first thing I'd like to know is when people sent in responses.
This is shown on the right. Not surprisingly, Monday and Tuesday were the most popular days, combining for 70 percent of all entries. The contest was announced on Monday so this is to be expected.
There was a slight bump on Friday, the last day of the contest.
I'm at a loss to explain the few stray entries on Saturday. This is very typical of real-world data; strange things just happen. In the software, I set the stop date to be Saturday, 12:00 AM, and I was advised that they abide by Pacific Standard Time. This doesn't seem to be the case, unless... the database itself is configured to a different time standard!
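The last entry was around 7 am on Saturday. Pacific Standard Time is 8 hours behind Greenwich Mean Time, which is effectively UTC, the time standard used by a lot of web servers. A timestamp of 7 am Saturday in UTC would be 11 pm Friday in Pacific Standard Time, which is inside the contest window.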
That's my best guess. I can't spend any more time on this investigation.
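The guess is easy to check with a quick conversion. A minimal sketch, assuming the stored timestamps are in UTC and using a fixed UTC-8 offset for Pacific Standard Time (during daylight saving the offset would be 7 hours); the date below is an arbitrary Saturday, not the actual contest date.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset Pacific Standard Time; this deliberately ignores daylight saving.
PST = timezone(timedelta(hours=-8), "PST")


def utc_to_pst(stamp_utc):
    """Convert a timezone-aware UTC timestamp to Pacific Standard Time."""
    return stamp_utc.astimezone(PST)


# Hypothetical last entry: 7 am on a Saturday, as recorded by a UTC database.
last_entry_utc = datetime(2013, 1, 5, 7, 0, tzinfo=timezone.utc)
last_entry_pst = utc_to_pst(last_entry_utc)
print(last_entry_pst.isoformat())
```

Under these assumptions the entry lands on Friday evening Pacific time, before the Saturday 12:00 AM cutoff.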
The next question that bugs me is how only about 80% of the entries could have contained 3 correct answers. The quiz was designed to pose as low a barrier as possible, and I know based on interactions on the blog that the IQ of my readers is well above average.
I start with a hypothesis. Perhaps the odds of winning the book are rather low (even though they are much higher than in any lottery), and some people are just not willing to invest the time to answer 3 questions, so they randomly guessed. What would the data say?
Haha, these people are caught red-handed. The boxplots (on the left) show the time spent completing the quiz.
Those who have one or more wrong answers are labelled "eligible = 0" and those who have all 3 answers correct are labelled "eligible = 1".
There is very strong evidence that those who have wrong answers spent significantly less time doing the quiz. In fact, 50 percent of these people sent in their response less than 1 minute after starting the quiz! (In a boxplot, the white line inside the box indicates the median.)
Also, almost everyone who has one or more wrong answers spent less time filling out the quiz than the 25th-percentile person who has three correct answers.
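The numbers read off the boxplots can also be computed directly. Here is a small sketch using made-up completion times in seconds (not the quiz data); the function names are illustrative, and the "inclusive" quartile method is just one of several conventions.

```python
from statistics import quantiles


def quartiles(times_sec):
    """Return (25th percentile, median, 75th percentile) of completion times."""
    q1, med, q3 = quantiles(times_sec, n=4, method="inclusive")
    return q1, med, q3


def share_below(times_sec, cutoff_sec):
    """Share of entries completed in less than cutoff_sec seconds."""
    return sum(1 for t in times_sec if t < cutoff_sec) / len(times_sec)


# Hypothetical completion times in seconds, for illustration only.
ineligible = [20, 35, 45, 50, 70, 90, 150]       # one or more wrong answers
eligible = [120, 150, 180, 200, 240, 300, 420]   # all three answers correct

q1_eligible, _, _ = quartiles(eligible)
print("median time, ineligible:", quartiles(ineligible)[1])
print("share of ineligible faster than eligible Q1:",
      share_below(ineligible, q1_eligible))
```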
As with any data analysis, one must be careful drawing conclusions. While I think these readers were unwilling to invest the time, perhaps just checking off the answers at random, there are other reasons for not having three correct answers. Abandonment is one: maybe those readers were distracted in the middle of the quiz. Or maybe the system went down in the middle of the process. (I'm not saying this happened; it's just a possibility.)
Finally, among those who got at least one answer wrong, were they more likely to enter the quiz at the start of the week or at the end?
There is weak evidence that those who failed to answer all 3 questions correctly were more likely to enter the contest on Friday (last day of the quiz) while those who entered on Wednesday or Thursday (the lowest response days of the week) were more likely to have 3 correct answers. It makes sense that those readers were more serious about wanting the book.
Now, hope you have better luck in round 2 of the Numbersense book quiz. Enter the quiz here.
I generated a big data set when writing Chapter 8 of Numbersense. This chapter discusses the question of how to measure your skills in managing/coaching a fantasy sports team. The general statistical question is how to separately measure two factors that both contribute to a single outcome.
In fantasy football (NFL), there is a matchup every week. Each week, you pick nine players from a roster of 14 players (rules vary by league). These nine players will score points for your team, based on how those players actually perform in real-life NFL games that week. You notch a win that week if your team scores more points than your opponent's team.
There are many ways to pick 9 players out of 14. In fact, in any given week, there are 200-300 eligible squads, of which only one is fielded. My big data set consists of all possible squads for every week for every team in the league. This data set contains rich information; the challenge is how to surface the information.
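To see why the number of eligible squads is far smaller than the number of ways to choose any 9 of 14 players, here is a sketch that enumerates squads under positional constraints. The lineup slots and the roster below are assumptions made up for illustration; they are not the rules of the league analyzed in the book, so the count differs from the 200-300 quoted above.

```python
from itertools import combinations, product
from math import comb

# Hypothetical lineup rules: 9 starters split by position.
SLOTS = {"QB": 1, "RB": 2, "WR": 3, "TE": 1, "K": 1, "DEF": 1}

# Hypothetical 14-player roster, grouped by position.
ROSTER = {
    "QB": ["QB1", "QB2"],
    "RB": ["RB1", "RB2", "RB3", "RB4"],
    "WR": ["WR1", "WR2", "WR3", "WR4"],
    "TE": ["TE1", "TE2"],
    "K": ["K1"],
    "DEF": ["DEF1"],
}


def eligible_squads(roster, slots):
    """Yield every squad that fills each positional slot from the roster."""
    per_position = [list(combinations(roster[pos], n)) for pos, n in slots.items()]
    for choice in product(*per_position):
        # Flatten the per-position picks into one 9-player squad.
        yield tuple(player for group in choice for player in group)


squads = list(eligible_squads(ROSTER, SLOTS))
print("eligible squads:", len(squads))
print("unconstrained choices of 9 from 14:", comb(14, 9))
```

With each squad's weekly points attached, this list is what the histograms in the following charts summarize.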
Visualization comes to the rescue. I'll be posting a series of charts here. Today's is the first one.
There are 13 plots, each of which represents a week of the season. The 13 plots trace the decisions of a single team over the course of the season. In each plot, the vertical line indicates the points total for the 9-player squad that was actually fielded by the team owner.
The histogram shows the range of choices the team owner could have made each week. Recall there are 200-300 possible squads of nine players from which the owner selected one. For example, in week 1, the owner didn't choose very well; there are many other sets of 9 players he could have chosen that would have scored him more points (the area to the right of the vertical line).
In Week 4, though, the owner could not have done much better. There were very few changes he could have made that would have increased his points total. Similarly, in Weeks 5 and 8.
You can also see that in Week 7, the 14 players he owned all tanked (in real life). The entire histogram is on the left side, meaning the points totals are horrible. Contrast this with Week 13, when the histogram is located on the right side of the chart, implying that this team owner would score pretty high no matter which 9 players he fielded.
You can get a copy of Numbersense here. Or enter the book giveaway quiz to try your luck.
One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is, then we find ways to reduce the noise, which has the effect of surfacing the signal.
The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are the data after removing the trend.
If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.
Our purpose is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.
Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
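A decomposition like this can be sketched in a few lines. The smoother used for the charts is not specified here, so the centered moving average below is an assumption, as are the function names and the window size.

```python
def moving_average_trend(values, window=5):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    trend = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        trend.append(sum(values[lo:hi]) / (hi - lo))
    return trend


def decompose(values, window=5):
    """Split a series into (trend, residuals), where residual = data - trend."""
    trend = moving_average_trend(values, window)
    resid = [v - t for v, t in zip(values, trend)]
    return trend, resid


# Made-up series for illustration; positive residuals are above trend.
series = [66.0, 66.2, 66.1, 66.4, 66.3, 66.0, 65.8, 65.9, 65.5, 65.4]
trend, resid = decompose(series, window=3)
print([round(r, 2) for r in resid])
```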
After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.
The lack of correlation is also obvious in the plot below. You can see that the periods when one series of residuals went above trend were not well correlated with the other series being above trend (or below trend).
After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.
The contentious issue is that X and Y might appear correlated but in fact, what we are observing is that both data series are strongly correlated with time (e.g. population almost always grows with time), and X and Y may not be correlated with each other.
Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.
The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.
The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X and Y and ignoring time, we introduce time as an omitted variable, which can still be controlling both X and Y series.
The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.
This is because time is still the invisible hand. Time is still running from left to right on the chart. This pattern is visible if we have line segments connecting the data in temporal order, as in the chart below.
One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.
Here is the result (right). We now have a random scatter of points that average about zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, the per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.
What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.
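The whole argument can be reproduced on synthetic data. In the sketch below, two series are generated that both grow with time but have independent noise; the raw series look strongly correlated, while the de-trended residuals do not. The straight-line trend, the function names, and the simulated numbers are all assumptions for illustration, not the miles-driven or labor-force data.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def detrend(values):
    """Residuals after removing a least-squares straight line in time."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    stt = sum((t - mean_t) ** 2 for t in range(n))
    stv = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = stv / stt
    return [v - (mean_v + slope * (t - mean_t)) for t, v in enumerate(values)]


rng = random.Random(0)  # fixed seed so the run is repeatable
months = range(60)
x = [0.5 * t + rng.gauss(0, 1) for t in months]  # trends up, independent noise
y = [0.3 * t + rng.gauss(0, 1) for t in months]  # trends up, independent noise

raw_r = pearson(x, y)
resid_r = pearson(detrend(x), detrend(y))
print(f"raw correlation:      {raw_r:.2f}")
print(f"residual correlation: {resid_r:.2f}")
```

Knowing that one simulated series is above its trend says little about the other, which is the same reading as the residual scatter plot.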