Luck is not easy to nail down in a number. For the fantasy football league, I have a way of looking at it. One aspect of luck is which team you are matched up with in any given week. Facing a stronger or a weaker opponent is one matter; whether you catch a given opponent on his or her hot or cold day is another, much like whether a hitter faces a pitcher on his good or bad day.
As noted before, each FFL player picks nine players out of 14 every week, and those nine earn points. There are typically 200-300 possible choices of nine players. So we can measure how well any FFL owner performs by looking at the points total of the activated squad against the whole distribution of 200-300 options. This was the topic of my earlier post.
Now, if I am lucky, then I tend to face opponents in the weeks in which they perform poorly. And the following chart shows this measure from week to week:
In Week 1, this owner was rather unlucky, in the sense that his opponent pretty much used his best possible squad. On the other hand, in Week 4, his opponent (a different team) played a weak hand, something close to the median squad. (In addition, the entire histogram sits on the left side of the chart, meaning that even his opponent's best possible squad this week would have been easy to beat.)
Luck can be measured over the course of the 13 weeks. If the vertical lines tend to show up on the right tails of these histograms, then this owner isn't lucky. On the other hand, if the lines show up mostly on the left half of the histograms, then this owner is lucky.
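For the technically inclined, here is a minimal sketch of how such a luck score could be computed. The inputs (a dict of the opponent's player scores for one week, and the points total of the squad he actually fielded) are hypothetical stand-ins for the real league data, and positional eligibility rules are ignored, which is why the sketch enumerates all C(14,9) = 2002 combinations rather than the 200-300 eligible squads.

```python
from itertools import combinations

def luck_score(opponent_points, opponent_actual_total, squad_size=9):
    """Fraction of the opponent's possible squads that would have beaten the
    squad he actually fielded. Near 1 means the opponent played close to his
    worst lineup (lucky for you); near 0 means he played close to his best."""
    totals = [sum(opponent_points[p] for p in squad)
              for squad in combinations(opponent_points, squad_size)]
    return sum(t > opponent_actual_total for t in totals) / len(totals)

# Averaging this score over the 13 weekly opponents gives one rough season-long luck measure.
```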
In Chapter 8 of Numbersense, I use such an analysis to figure out the role of luck. This luck factor turned out to be even more important than the owner's own skills!
Special for Junk Charts readers: here is an excerpt from Chapter 8 (link).
The second book giveaway contest is under way on the sister blog. Enter the contest here.
The winner of the Numbersense Book Quiz has been announced. See here.
GOOD NEWS: McGraw-Hill is sponsoring another quiz. Same format. Another chance to win a signed book. Click here to go directly to the quiz.
I did a little digging around the quiz data. The first thing I'd like to know is when people sent in responses.
This is shown on the right. Not surprisingly, Monday and Tuesday were the most popular days, combining for 70 percent of all entries. The contest was announced on Monday so this is to be expected.
There was a slight bump on Friday, the last day of the contest.
I'm at a loss to explain the few stray entries on Saturday. This is very typical of real-world data; strange things just happen. In the software, I set the stop date to be Saturday, 12:00 AM, and I was advised that they abide by Pacific Standard Time. This doesn't seem to be the case, unless... the database itself is configured to a different time standard!
The last entry was around 7 am on Saturday. Pacific Time is about 8 hours behind Greenwich Mean Time, which is essentially UTC, the time standard used by a lot of web servers.
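A quick sanity check on that guess, using a made-up timestamp (the actual entry times aren't reproduced here): if the database stores times in UTC while the cutoff is enforced in Pacific Standard Time, a "7 am Saturday" record is in fact a late-Friday-night entry.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))   # Pacific Standard Time, as the vendor advised

# Hypothetical: the last entry as time-stamped in the export, assumed to be UTC.
last_entry = datetime(2013, 6, 1, 7, 0, tzinfo=timezone.utc)   # "7 am Saturday"

print(last_entry.astimezone(PST))
# 2013-05-31 23:00:00-08:00 -- 11 pm Friday Pacific, still before the Saturday 12:00 AM cutoff
```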
That's my best guess. I can't spend any more time on this investigation.
The next question that bugs me is why only about 80 percent of the entries contained 3 correct answers. The quiz was designed to pose as low a barrier as possible, and I know from interactions on the blog that the IQ of my readers is well above average.
I start with a hypothesis. Perhaps the odds of winning the book are rather low (even though they are much higher than any lottery's), so some people are just not willing to invest the time to answer 3 questions, and instead guess at random. What would the data say?
Haha, these people are caught red-handed. The boxplots (on the left) show the time spent completing the quiz.
Those who got one or more answers wrong are labelled "eligible = 0" and those who got all 3 answers correct are labelled "eligible = 1".
There is very strong evidence that those who have wrong answers spent significantly less time doing the quiz. In fact, 50 percent of these people sent in their response less than 1 minute after starting the quiz! (In a boxplot, the white line inside the box indicates the median.)
Also, almost everyone who got one or more answers wrong spent less time filling out the quiz than the 25th-percentile person among those with three correct answers.
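For anyone who wants to redo this comparison, a minimal sketch, assuming the entries sit in a table with (hypothetical) columns `eligible` and `minutes_to_complete`:

```python
import matplotlib.pyplot as plt
import pandas as pd

entries = pd.read_csv("quiz_entries.csv")   # hypothetical export: one row per entry

# Side-by-side boxplots of completion time for the two groups.
entries.boxplot(column="minutes_to_complete", by="eligible")
plt.ylabel("Minutes from starting the quiz to submission")
plt.show()

# The group medians tell the same story as the chart.
print(entries.groupby("eligible")["minutes_to_complete"].median())
```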
As with any data analysis, one must be careful drawing conclusions. While I think these readers were unwilling to invest the time, perhaps just checking off answers at random, there are other reasons for not having three correct answers. Abandonment is one: maybe those readers were distracted in the middle of the quiz. Maybe the system went down in the middle of the process (I'm not saying this happened; it's just a possibility).
Finally, among those who got at least one answer wrong, were they more likely to enter the quiz at the start of the week or at the end?
There is weak evidence that those who failed to answer all 3 questions correctly were more likely to enter the contest on Friday (last day of the quiz) while those who entered on Wednesday or Thursday (the lowest response days of the week) were more likely to have 3 correct answers. It makes sense that those readers were more serious about wanting the book.
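That comparison is just a two-way table; with the same hypothetical data as above (plus a `day_of_week` column):

```python
import pandas as pd

entries = pd.read_csv("quiz_entries.csv")   # hypothetical export, as before

# Share of fully correct (eligible = 1) vs. not (eligible = 0) entries, by day of entry.
print(pd.crosstab(entries["day_of_week"], entries["eligible"], normalize="index"))
```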
Now, I hope you have better luck in round 2 of the Numbersense book quiz. Enter the quiz here.
I generated a big data set when writing Chapter 8 of Numbersense. This chapter discusses the question of how to measure your skills in managing/coaching a fantasy sports team. The general statistical question is how to separately measure two factors that both contribute to a single outcome.
In fantasy football (NFL), there is a matchup every week. Each week, you pick nine players from a roster of 14 players (rules vary by league). These nine players will score points for your team, based on how those players actually perform in real-life NFL games that week. You notch a win that week if your team scores more points than your opponent's team.
There are many ways to pick 9 players out of 14. In fact, in any given week, there are 200-300 eligible squads, of which only one is fielded. My big data set consists of all possible squads for every week for every team in the league. This data set contains rich information; the challenge is how to surface the information.
Visualization comes to the rescue. I'll be posting a series of charts here. Today's is the first one.
There are 13 plots, each of which represents a week of the season. The 13 plots trace the decisions of a single team over the course of the season. In each plot, the vertical line indicates the points total for the 9-player squad that was actually fielded by the team owner.
The histogram shows the range of choices the team owner could have made each week. Recall there are 200-300 possible squads of nine players from which the owner selected one. For example, in week 1, the owner didn't choose very well; there are many other sets of 9 players he could have chosen that would have scored him more points (the area to the right of the vertical line).
In Week 4, though, the owner could not have done much better. There were very few changes he could have made that would have increased his points total. Similarly, in Weeks 5 and 8.
You can also see that in Week 7, the players he owned all tanked (in real life). The entire histogram is on the left side, meaning the points totals are horrible. Contrast this with Week 13, when the histogram is located on the right side of the chart, implying that this team owner would have scored pretty high no matter which 9 players he fielded.
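Each panel of that display can be sketched with a few lines of code. Here the inputs `roster_points` (a dict of one week's player scores) and `actual_total` (the points total of the nine players actually fielded) are assumptions standing in for the real league data, and positional eligibility rules are ignored.

```python
import matplotlib.pyplot as plt
from itertools import combinations

# roster_points: dict of player -> fantasy points for one week (hypothetical input)
# actual_total: points total of the 9 players actually fielded that week (hypothetical input)

totals = [sum(roster_points[p] for p in squad)
          for squad in combinations(roster_points, 9)]

plt.hist(totals, bins=20)                 # every squad the owner could have fielded
plt.axvline(actual_total, color="red")    # the squad he actually fielded
plt.xlabel("Points total of a 9-player squad")
plt.show()
```

Repeat this for each of the 13 weeks to get the small-multiples display above.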
You can get a copy of Numbersense here. Or enter the book giveaway quiz to try your luck.
One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is; then we find ways to reduce it, which has the effect of surfacing the signal.
The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are the data after removing the trend.
If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.
Our purpose is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.
Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
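As a concrete sketch of the decomposition, assuming the participation rate is available as a pandas Series indexed by month (a lowess smoother stands in for the trend here; it is only one of several reasonable choices):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# rate: pandas Series of the labor force participation rate, indexed by month (hypothetical input)
trend = pd.Series(
    sm.nonparametric.lowess(rate.values, np.arange(len(rate)), frac=0.2, return_sorted=False),
    index=rate.index,
)
residuals = rate - trend   # positive = above trend, negative = below trend
```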
After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.
The lack of correlation is also obvious in the plot below. The periods when one series of residuals was above trend do not line up well with the periods when the other series was above (or below) trend.
After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.
The contentious issue is that X and Y might appear correlated but, in fact, what we are observing is that both data series are strongly correlated with time (e.g. population almost always grows with time); X and Y may not be correlated with each other.
Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.
The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.
The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X against Y and ignoring time, we turn time into an omitted variable, which can still be driving both series.
The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.
This is because time is still the invisible hand: it still runs roughly from left to right across the chart. This pattern becomes visible if we connect the points in temporal order, as in the chart below.
One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.
Here is the result (right). We now have a random scatter of points averaging around zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.
What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.
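Putting the whole de-trend-then-scatter step into one sketch: both series are assumed to be pandas Series on the same monthly index, and the lowess smoother and the variable names are my choices, not necessarily what was used for the original charts.

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

def detrend(series, frac=0.2):
    # Residuals after removing a lowess trend fitted over time.
    trend = sm.nonparametric.lowess(series.values, np.arange(len(series)),
                                    frac=frac, return_sorted=False)
    return series - trend

# participation, miles: hypothetical pandas Series aligned on the same monthly index
x = detrend(participation)
y = detrend(miles)

plt.scatter(x, y)
plt.xlabel("Labor force participation rate, residual from trend")
plt.ylabel("Miles driven per capita, residual from trend")
plt.show()

print(x.corr(y))   # close to zero once the trends are removed
```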
I will be at Book Expo this Friday signing books at the McGraw-Hill booth. If you're in NYC, drop by and say hi between 11 and 12.
Yes, it's a new book! The title is Numbersense: How to Use Big Data to Your Advantage (link).
If you read my blogs, you already know where I'm going with this. How can we be smart consumers of data analyses in a world overflowing with data? It will be in stores in July. Between now and then, you can come back here to learn more.
Also, at 12:30, I'll be interviewed at the Shindig event by Peggy Sanservieri, who blogs at Huffington Post on book marketing. This is an online live chat event. Go to their site to register, and you'll have the opportunity to ask me questions.
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto; these teams spread the wealth among the D, F, and M players while not spending much on the goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.) DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.
In modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.
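For those without JMP, here is a rough sketch of the same workflow in Python: convert each team's cap spending into proportions by position, run k-means with five clusters, and draw the profile lines. The file name, column names, and the choice of five clusters are all my assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical input: one row per team, salary-cap dollars by position group (GK, D, M, F, Other).
salaries = pd.read_csv("mls_salaries_by_position.csv", index_col="team")
shares = salaries.div(salaries.sum(axis=1), axis=0)    # proportion of each team's cap

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(shares)

# Profile (line) chart: one line per team, colored by cluster.
for i, (team, row) in enumerate(shares.iterrows()):
    plt.plot(row.index, row.values, color=f"C{clusters[i]}")
plt.ylabel("Share of salary cap")
plt.show()
```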
This post is long overdue. I have been meaning to write about this blog for a long time but never got around to it. It's like the email response you keep postponing because you want to think before you fire it off. But I received two mentions of it within the last few days, which reminded me that I have to get to work on this one.
One of the best blogs to read - that is similar in spirit to Junk Charts - is ChartNThings. This is the behind-the-scenes blog of the venerable New York Times graphics department. They talk about the considerations that go into making specific charts that subsequently showed up in the newspaper. You get to see their sketches. Kind of like my posts here, except with the graphics professional's perspective.
As Andrew Gelman said in his annotated blog roll (link), ChartNThings is "the ultimate graphics blog. The New York Times graphics team presents some great data visualizations along with the stories behind them. I love this sort of insider's perspective."
The other mention is from a friend who reviewed something I wrote about fantasy football. He pointed me to this particular post from the ChartNThings blog that talks about luck and skill in NFL.
They have a perfect illustration of how statistics can help make charts better.
Start with the following chart, which shows the value of the players picked, organized by the round in which they were picked.
Think of this as plotting the raw data. A pattern is already apparent: on average, the players picked in earlier rounds (on the left) have produced higher value for their clubs. However, there is quite a bit of noise on the page. One problem with dot plots is over-plotting when the density of points is high, as is the case here. Our eyes cannot judge density properly, especially in the presence of over-plotting.
What the NYT team did next was to take the average value of all players picked in each round in each year, and plot those instead. This drastically reduces the number of dots per round, and cleans up the canvas a great deal.
It's amazing how much more powerful this chart is than the previous one. Instead of the average value, one could also use the median value, or plot percentiles to showcase the distribution. (They later offered a side-by-side box plot, which is also an excellent idea.)
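The key data step here is just a group-by. A sketch, assuming one row per drafted player with (hypothetical) columns `year`, `round`, and `value`:

```python
import matplotlib.pyplot as plt
import pandas as pd

picks = pd.read_csv("draft_picks.csv")   # hypothetical file: one row per player picked

# One dot per round per year instead of one dot per player.
means = picks.groupby(["round", "year"], as_index=False)["value"].mean()

plt.scatter(means["round"], means["value"])
plt.xlabel("Draft round")
plt.ylabel("Average value of players picked")
plt.show()
```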
The post then goes into exploring a paper by some economists who wanted to ignore the average and focus on the noise. I'll make some comments on that analysis on my other blog. (The post is now live.)
One behind-the-scenes thing I'd add about this behind-the-scenes blog is that the authors must have spent quite a bit of time organizing the materials and creating the streamlined stories for us to savor. Graphical creation involves a lot of sketching and exploration, so there are lots of dead ends, backtracking, stuff you throw away. There will be lots of charts with little flaws that you didn't care to correct because it's not your final version. There will be lots of charts which will only be intelligible to the creator since they are missing labels, scales, etc., again because those were supposed to be sketch work. There will even be charts that the creator can't make sense of because the train of thought has been lost by the end of the project.
So we should applaud what the team has done here for the graphics community.
When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.
When graphs are not done right, sometimes they manage to obscure the information.
Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.
The first reaction is that maybe the researchers are telling us there is no information here. The most important variable on this chart is what they call "datanum", and it runs from left to right across the page. A casual glance across the page gives the impression that nothing much is going on.
Then you look at the row labels, and realize that this dataset is very well structured. The target variable (AP position error) is compared along four dimensions: datanum, the algorithm (WCL or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).
When the data has a nice structure, there should be better ways to visualize it.
John submitted a much improved version, which he created using ggplot2.
This is essentially a small multiples chart; a rough sketch of the same layout follows the list. The key differences between the two charts are:
Giving more dimensions a chance to shine
Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
Using a profile chart, which also allows the y-axis to start from 2
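John's chart was built in ggplot2; an equivalent small-multiples layout can be sketched in Python with seaborn, where the file and column names below are guesses at how the data might be laid out (one row per experimental cell).

```python
import pandas as pd
import seaborn as sns

results = pd.read_csv("ap_position_error.csv")   # hypothetical tidy table of the paper's results

# One panel per algorithm and number of access points; one line per access-point location.
sns.relplot(
    data=results, x="datanum", y="error",
    hue="location", col="algorithm", row="num_access_points",
    kind="line", marker="o",
)
```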
When you read this chart, you finally realize that the experiment has yielded several insights:
Increasing the sample size does not affect the aggregate WCL error rate, but it does reduce the aggregate GPR+WCL error rate.
The improvement of GPR+WCL comes only from the inner access points.
The WCL algorithm performs really well at inner access points but poorly at outer access points.
The addition of GPR to the WCL algorithm improves the performance at outer access points but degrades the performance at inner access points. (In aggregate, it improves performance... but only because there are almost two outer access points for every inner access point.)
Now, I don't know anything about this position estimation problem. The chart leaves me wondering why they don't just use WCL at inner access points. The performance under that setting is far and away the best of all the tested settings.
The researchers described their metric as AP position error (2drms, 95% confidence). I'm not sure what they mean by that, because when I see 95% confidence, I expect to see a confidence band around the point estimates shown above.
And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.