
Upcoming talk in Chicago

I'm busy preparing for my talk tomorrow in Chicago. The topic is Big Data and Marketing, which is central to Numbersense.

The event is free, and you can register here. It's hosted by BIGfrontier's Steve Lundin.

Luck in sports visualized

Luck is not easy to nail down in a number. For the fantasy football league, I have a way of looking at luck. One aspect of luck is which team you are matched up with in any given week. There is a matter of facing a stronger or a weaker opponent. There is a different matter of whether you face a given opponent on his/her hot or cold day. Sort of like whether a hitter faces a pitcher on his good or bad day.

As noted before, each FFL owner picks nine players out of a roster of 14 every week, and those nine earn points. There are typically 200-300 possible choices of nine players. So we can measure how well any FFL owner performs by looking at the points total of the activated squad against the whole distribution of 200-300 options. This was the topic of my earlier post.

Now, if I am lucky, then I tend to face opponents in the weeks in which they perform poorly. And the following chart shows this measure from week to week:


In Week 1, this owner was rather unlucky, in the sense that his opponent pretty much used his best possible squad. On the other hand, in Week 4, his opponent (a different team) played a weak hand, something close to the median squad. (In addition, the entire histogram sits on the left side of the chart, meaning that even his opponent's best possible squad this week would have been easy to beat.)

Luck can be measured over the course of the 13 weeks. If the vertical lines tend to show up on the right tails of these histograms, then this owner isn't lucky. On the other hand, if the lines show up mostly on the left half of the histograms, then this owner is lucky.
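One way to turn this reading of the chart into a number is sketched below. It is only an illustration, not the analysis in the book: for each week it asks what share of the opponent's possible squads scored at or below the squad the opponent actually fielded, then averages over the weeks. The function name `percentile_rank` and all points totals are made up.

```python
from statistics import mean

def percentile_rank(squad_totals, fielded_total):
    """Share of the opponent's possible squads scoring at or below the fielded squad."""
    return sum(1 for t in squad_totals if t <= fielded_total) / len(squad_totals)

# Made-up data for three weeks:
# (points totals of the opponent's possible squads, total of the squad the opponent fielded)
weeks = [
    ([60, 70, 80, 90, 100], 100),  # opponent fielded the best squad: bad luck
    ([50, 55, 60, 65, 70], 60),    # opponent fielded the median squad
    ([70, 80, 90, 100, 110], 70),  # opponent fielded the worst squad: good luck
]

ranks = [percentile_rank(totals, fielded) for totals, fielded in weeks]
print(ranks)  # [1.0, 0.6, 0.2]

# A season average well above 0.5 means the vertical lines sit mostly on the right
# side of the histograms (unlucky); well below 0.5 means mostly on the left (lucky).
season_luck = mean(ranks)
print(season_luck)
```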

In Chapter 8 of Numbersense, I use such an analysis to figure out the role of luck. This luck factor turned out to be even more important than the owner's own skills!

Special for Junk Charts readers: here is an excerpt from Chapter 8 (link).


The second book giveaway contest is under way on the sister blog. Enter the contest here.


Book quiz data geekery, plus another free book

The winner of the Numbersense Book Quiz has been announced. See here.

GOOD NEWS: McGraw-Hill is sponsoring another quiz. Same format. Another chance to win a signed book. Click here to go directly to the quiz.


I did a little digging around the quiz data. The first thing I'd like to know is when people sent in responses.

This is shown on the right. Not surprisingly, Monday and Tuesday were the most popular days, combining for 70 percent of all entries. The contest was announced on Monday so this is to be expected.

There was a slight bump on Friday, the last day of the contest.

I'm at a loss to explain the few stray entries on Saturday. This is very typical of real-world data; strange things just happen. In the software, I set the stop date to be Saturday, 12:00 AM, and I was advised that they abide by Pacific Standard Time. This doesn't seem to be the case, unless... the database itself is configured to a different time standard!

The last entry was around 7 am on Saturday. Pacific Time is 7 or 8 hours behind Greenwich Mean Time (7 during daylight saving time, which applies in July), and GMT is effectively UTC, the time standard used by a lot of web servers.

That's my best guess. I can't spend any more time on this investigation.
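The offset arithmetic behind this guess can be checked with a short sketch. It assumes, purely as a hypothesis, that the cutoff was applied in Pacific time while the timestamps were stored in UTC; the fixed offsets and the helper name `cutoff_in_utc` are mine, not anything taken from the quiz software.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PST = timezone(timedelta(hours=-8), "PST")  # Pacific Standard Time
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time (in effect in July)

def cutoff_in_utc(tz):
    """Midnight at the start of Saturday, July 20, 2013 in `tz`, expressed in UTC."""
    return datetime(2013, 7, 20, 0, 0, tzinfo=tz).astimezone(UTC)

for tz in (PST, PDT):
    print(tz.tzname(None), "cutoff in UTC:", cutoff_in_utc(tz).isoformat())
# PST cutoff in UTC: 2013-07-20T08:00:00+00:00
# PDT cutoff in UTC: 2013-07-20T07:00:00+00:00
```

Under that assumption, an entry submitted just before the deadline would carry a timestamp of about 7 am or 8 am on Saturday.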


The next question that bugs me is why only about 80% of the entries contained 3 correct answers. The quiz was designed to pose as low a barrier as possible, and I know based on interactions on the blog that the IQ of my readers is well above average.

I start with a hypothesis. Perhaps the odds of winning the book are rather low (even though they're much higher than those of any lottery), and some people are just not willing to invest the time to answer 3 questions, so they guessed at random. What would the data say?

Haha, these people are caught red-handed. The boxplots (on the left) show the time spent completing the quiz.

Those who have one or more wrong answers are labelled "eligible = 0" and those who have all 3 answers correct are labelled "eligible = 1".

There is very strong evidence that those who have wrong answers spent significantly less time doing the quiz. In fact, 50 percent of these people sent in their response less than 1 minute after starting the quiz! (In a boxplot, the white line inside the box indicates the median.)

Also, almost everyone who has one or more wrong answers spent less time filling out the quiz than the 25th-percentile person who has three correct answers.
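For readers who want to run the same comparison on their own data, here is a rough sketch of the two statistics read off the boxplots: the median time of the wrong-answer group, and the share of that group who finished faster than the 25th percentile of the all-correct group. The completion times below are invented for illustration and are not the quiz data.

```python
from statistics import median, quantiles

# Made-up completion times in seconds; NOT the actual quiz data.
durations = {
    0: [20, 35, 40, 55, 58, 70, 95],         # eligible = 0: one or more wrong answers
    1: [120, 150, 180, 240, 300, 420, 600],  # eligible = 1: all 3 answers correct
}

median_wrong = median(durations[0])
q1_correct = quantiles(durations[1], n=4)[0]  # 25th percentile of the all-correct group
share_faster = sum(t < q1_correct for t in durations[0]) / len(durations[0])

print(median_wrong)  # 55
print(share_faster)  # 1.0
```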

As with any data analysis, one must be careful drawing conclusions. While I think these readers are unwilling to invest the time, perhaps just checking off the answers at random, there are other reasons for not having three correct answers. Abandonment is one: maybe those readers were distracted in the middle of the quiz. Maybe the system went down in the middle of the process. (I'm not saying this happened; it's just a possibility.)


Finally, among those who got at least one answer wrong, were they more likely to enter the quiz at the start of the week or at the end?

There is weak evidence that those who failed to answer all 3 questions correctly were more likely to enter the contest on Friday (last day of the quiz) while those who entered on Wednesday or Thursday (the lowest response days of the week) were more likely to have 3 correct answers. It makes sense that those readers were more serious about wanting the book.


Now, hope you have better luck in round 2 of the Numbersense book quiz. Enter the quiz here.




Just one change evokes an entirely new world

Before I get to normal programming, please note that today (Friday) is the last day to enter the contest to win my new book. Only three easy questions, and you may get a nice summer read, with my autograph. Enter here.

New: the sample pages are now on Slideshare as well, so no need to download PDF.


Yesterday, I showed a chart of alternative outcomes for a fantasy football team for each week of the season. I looked at whether the team owner activated the right set of 9 players, from a roster of 14. The area of the histogram to the right of the vertical line is the probability of scoring more points than the squad that was activated. The smaller the area, the better was the owner's performance.

In the chart today, I switched one thing... what the vertical lines represent. In the following chart, the (pink) line represents the score compiled by one's opponent during each week. (The opponent changes each week but the schedule is fixed at the start of the season.)


This is a completely different chart. This chart tells you a little about the luck of the draw.

Look at Week 4. The opponent activated a really wretched squad. No matter what this team owner does, he/she will score more points than the opponent.

Now look at Week 3. No matter what this owner does, he/she is bound to lose because the opponent scored more points than his/her best possible squad.

The area to the right of the line is the probability of beating your opponent.

More to come.

Visualizing alternative outcomes in fantasy football

I generated a big data set when writing Chapter 8 of Numbersense. This chapter discusses the question of how to measure your skills in managing/coaching a fantasy sports team. The general statistical question is how to separately measure two factors that both contribute to a single outcome.

In fantasy football (NFL), there is a matchup every week. Each week, you pick nine players from a roster of 14 players (rules vary by league). These nine players will score points for your team, based on how those players actually perform in real-life NFL games that week. You notch a win that week if your team scores more points than your opponent's team. 

There are many ways to pick 9 players out of 14. In fact, in any given week, there are 200-300 eligible squads, of which only one is fielded. My big data set consists of all possible squads for every week for every team in the league. This data set contains rich information; the challenge is how to surface the information.
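To see where a number like 200-300 can come from, here is a small sketch that enumerates squads. Unrestricted, there are 2,002 ways to pick 9 out of 14; lineup rules cut that down. The roster composition and the lineup rule (1 QB, 2 RB, 2 WR, 1 TE, 1 flex, 1 K, 1 DEF) are my assumptions for illustration, since rules vary by league, and they are not necessarily the ones behind the book's data set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical 14-player roster as (player id, position) pairs.
ROSTER = (
    [(f"QB{i}", "QB") for i in range(1, 3)]
    + [(f"RB{i}", "RB") for i in range(1, 5)]
    + [(f"WR{i}", "WR") for i in range(1, 5)]
    + [(f"TE{i}", "TE") for i in range(1, 3)]
    + [("K1", "K"), ("DEF1", "DEF")]
)

def is_eligible(squad):
    """Assumed lineup: 1 QB, 2 RB, 2 WR, 1 TE, 1 flex (RB/WR/TE), 1 K, 1 DEF."""
    n = Counter(position for _, position in squad)
    # With exactly one QB, K and DEF, the other six players are RB/WR/TE,
    # so meeting the minimums leaves exactly one flex slot.
    return (
        n["QB"] == 1 and n["K"] == 1 and n["DEF"] == 1
        and n["RB"] >= 2 and n["WR"] >= 2 and n["TE"] >= 1
    )

all_squads = list(combinations(ROSTER, 9))
eligible_squads = [s for s in all_squads if is_eligible(s)]

print(len(all_squads))       # 2002 unrestricted ways to pick 9 of 14
print(len(eligible_squads))  # 264 squads satisfy the assumed lineup rules
```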

Visualization comes to the rescue. I'll be posting a series of charts here. Today's is the first one.



There are 13 plots, each of which represents a week of the season. The 13 plots trace the decisions of a single team over the course of the season. In each plot, the vertical line indicates the points total for the 9-player squad that was actually fielded by the team owner.

The histogram shows the range of choices the team owner could have made each week. Recall there are 200-300 possible squads of nine players from which the owner selected one. For example, in week 1, the owner didn't choose very well; there are many other sets of 9 players he could have chosen that would have scored him more points (the area to the right of the vertical line).
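The area to the right of the vertical line is simple to compute once all the possible squads are scored. A minimal sketch, with made-up points totals and a hypothetical function name:

```python
def share_outscoring(squad_totals, fielded_total):
    """Fraction of possible squads whose points total beats the fielded squad's."""
    better = sum(1 for total in squad_totals if total > fielded_total)
    return better / len(squad_totals)

# Made-up points totals for ten possible squads in one week.
week_totals = [62, 70, 75, 81, 84, 88, 90, 95, 101, 110]

print(share_outscoring(week_totals, fielded_total=84))   # 0.5
print(share_outscoring(week_totals, fielded_total=101))  # 0.1
```

The smaller the result, the fewer alternatives the owner left on the table that week.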

In Week 4, though, the owner could not have done much better. There were very few changes he could have made that would have increased his points total. Similarly, in Weeks 5 and 8.

You can also see that in Week 7, the 14 players he owned all tanked (in real life). The entire histogram is on the left side, meaning the points totals are horrible. Contrast this with Week 13, when the histogram is located on the right side of the chart, implying that this team owner would score pretty high no matter which 9 players he fielded.


You can get a copy of Numbersense here. Or enter the book giveaway quiz to try your luck.

Light entertainment: a ribbon chart

A reader sent in this amusement. See if you can figure out the chart:


The article is here. It then goes into a lot of numbers about 200 accidents. I didn't pay much attention after that first paragraph, where it said 16% of the accidents were in one year, with 84% in the next year. That implied growth of more than five times from one year to the next (84/16 = 5.25). Seems to me an issue with data collection. The author then goes on to aggregate the two years, and reports dozens of findings.

Anyway, what is the point of the ribbon chart?


Reminder: Contest to win a book is open till Friday. Enter through here.

Win a Signed Copy of my New Book

This is cross-posted on my two blogs.

For my fans on either of my two blogs, I'm giving away a free signed copy of my new book, Numbersense. (See my book announcement.)

All you have to do is to answer 3 questions, based on a few sample pages (see the PDF or Slideshare). Click on the quiz to enter.

The contest is open until Friday, July 19, 2013 (11:59 PM PST).

This is an open anything quiz, although as I like to tell my students, if you are working too hard, you're probably missing the point.


I will draw the winner among those who answer all 3 questions correctly.

Please provide a valid email address so I can get in touch with you to send you the book. 


Climate change and duelling charts

Abhinav asks me to check out his blog post on a chart on global warming (I prefer the term climate change) featured on Wonkblog. The chart is sourced to a report by the World Meteorological Organization (link to PDF).


Hello, start the axis at zero whenever you are plotting columns. That's as fundamental as plotting only proportions on a pie chart.

There is a reason why the designer didn't like to start the axis at zero. It is this (Abhinav helpfully made all these charts):


The trouble is that for this data set (on global average temperature), the area below 13 is completely useless. It's like plotting body temperature on a scale of 0 - 100 Celsius when all feasible values fall into a tight range, maybe 35-38 Celsius. I recount a similar situation that led to a college president saying something stupid in Chapter 1 of my new book, Numbersense. (Information on the book is here.)
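To put a number on the wasted space, here is a tiny sketch that computes what fraction of the axis the data actually occupies. The function name is made up; the inputs are the body temperature figures above.

```python
def axis_share_used(data_min, data_max, axis_min, axis_max):
    """Fraction of the axis length actually spanned by the data."""
    return (data_max - data_min) / (axis_max - axis_min)

# Body temperature: feasible values of 35-38 on a 0-100 Celsius axis.
print(axis_share_used(35, 38, 0, 100))  # 0.03

# The same data on an axis trimmed to the feasible range.
print(axis_share_used(35, 38, 35, 38))  # 1.0
```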

So we understand the desire to get rid of the irrelevant white space. This is accomplished by using a line chart. (I'd prefer to omit the data values, and rely on the axis.)


Abhinav then created various versions of this by compressing and expanding the vertical scales. I don't think there is anything wrong with the above scale. As I mentioned, the scale should focus on the range of values that are feasible.

Nice work, Abhinav.

Avert your eyes

Reader omegatron came back with another shocking instance of a pie chart:


Here is the link to the AVERT organization in the U.K. that published the chart and several others.

For the umpteenth time, the pie chart plots proportions. All proportions are percentages but some percentages are not proportions. The data here would appear to be "rate of diagnosis" rather than proportion of diagnoses by age.

The data came from Table 3a of this CDC report (link), and they are clearly labelled "Rate". The footnote even disclosed that the "Rate" is measured per 100,000 people so they are being mislabeled as percentages.

Let's summarize. The percentages add up to much more than 100%, they are clearly not proportions, they are not even percentages, they are rates per 100,000.

omegatron even got confused by the colors. You'd think that the slices would be arranged by age group but no! The order of the slices is by size of the pie slices, with one exception--the lime green slice of 11.4%, which I cannot explain. In practice, this means the order goes from Under 13 to 13-14 to Over 65 to 60-64 to 50-54, etc.

A smarter use of color here would be to stick to one color while varying the tint according to the rate of diagnosis. Using 13 colors for 13 age groups is distracting.

Here is the same data using a column chart:



As a teacher, I find it shocking that such pie charts continue to see the light of day. It's very disappointing, as I'd assume every teacher who teaches the pie chart will have pointed out the pitfalls. Why is this happening?


With this chart, I'm mostly baffled by the top corner of the Trifecta Checkup. What is the point of this data? If I understand the "per 100,000 population" definition, these rates are computed as the number of diagnosed divided by the population in each age group. So the diagnosis rate is a function of how many people in each age group are actually infected, how effective the diagnosis procedures are, and whether that effectiveness varies with age. Plus, there is the completeness of reporting by age group. (The footnote acknowledged that the mathematical model does not account for incomplete reporting. To call a spade a spade, that means the model assumes complete reporting.)

The rate of diagnosis can be low because the rate of infection is low or because the proportion of the infected who get diagnosed is low. I just can't conceive of a use of data that confounds these factors.
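A toy calculation shows how the confounding plays out. The two age groups below are entirely hypothetical, as is the function name: one has many infections and poor detection, the other few infections and good detection, and both end up with the same diagnosis rate per 100,000.

```python
def diagnosis_rate_per_100k(population, infected, pct_diagnosed):
    """Diagnoses per 100,000 people, given infections and the percent of them diagnosed."""
    diagnosed = infected * pct_diagnosed / 100
    return diagnosed * 100_000 / population

# Hypothetical group A: many infections, only 20% of them diagnosed.
rate_a = diagnosis_rate_per_100k(population=1_000_000, infected=1_000, pct_diagnosed=20)

# Hypothetical group B: a quarter as many infections, 80% of them diagnosed.
rate_b = diagnosis_rate_per_100k(population=1_000_000, infected=250, pct_diagnosed=80)

print(rate_a, rate_b)  # 20.0 20.0
```

The published rate alone cannot tell these two situations apart.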

A time series treatment would be interesting, although that addresses a different question.