Lunch and talk Wednesday

I will be the luncheon speaker at INFORMS NYC on Wednesday. The talk will provide some context for my new book Numbersense (link) and discuss a few examples from the book. You can pre-register here.

INFORMS is the professional society for Operations Research and Management Science people. For some years, I have attended these regularly and learned a lot from other industry speakers.

If you decide at the last minute, you can pay the $5 extra fee on the day of the talk. Or register now.

***

Junk Charts is featured in an article in Harvard Business Review about data visualization. A few new reviews have appeared: CFA Institute, Flagstaff Business News.

***

I maintain a list of events on my book blog. Look to the right column.


Upcoming talk in Chicago

I'm busy preparing for my talk tomorrow in Chicago. The topic is Big Data and Marketing, which is central to Numbersense.

The event is free, and you can register here. It's hosted by BIGfrontier's Steve Lundin.


Luck in sports visualized

Luck is not easy to nail down in a number. For the fantasy football league, I have a way of looking at luck. One aspect of luck is which team you are matched up with in any given week. One matter is whether you face a stronger or a weaker opponent. A separate matter is whether you face a given opponent on his or her hot or cold day, much like whether a hitter faces a pitcher on the pitcher's good or bad day.

As noted before, each FFL owner activates nine players out of a roster of 14 every week, and those nine earn points. There are typically 200-300 possible choices of nine players. So we can measure how well any FFL owner performs by comparing the points total of the activated squad against the whole distribution of 200-300 options. This was the topic of my earlier post.
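To make the comparison concrete, here is a minimal sketch in Python. The roster, the points, and the position slots are all hypothetical (the actual league rules are not given here), so the number of legal squads it produces will differ from the 200-300 mentioned above.

```python
from itertools import combinations, product

# Hypothetical roster of 14: position -> points each player scored this week.
roster = {
    "QB": [18, 11],
    "RB": [14, 9, 6, 3],
    "WR": [16, 12, 8, 5, 2],
    "TE": [7],
    "K": [9],
    "DEF": [10],
}
# Hypothetical lineup rule: nine starters in fixed position slots.
slots = {"QB": 1, "RB": 2, "WR": 3, "TE": 1, "K": 1, "DEF": 1}


def squad_totals(roster, slots):
    """Return the points total of every legal nine-player squad."""
    per_position = [
        [sum(chosen) for chosen in combinations(roster[pos], k)]
        for pos, k in slots.items()
    ]
    # One choice per position, combined across positions.
    return [sum(parts) for parts in product(*per_position)]


def percentile_rank(actual, totals):
    """Share of possible squads scoring no more than the actual squad."""
    return sum(t <= actual for t in totals) / len(totals)


totals = squad_totals(roster, slots)
# The squad this hypothetical owner activated.
actual = 18 + (14 + 6) + (16 + 12 + 5) + 7 + 9 + 10
print(len(totals), min(totals), max(totals))
print(round(percentile_rank(actual, totals), 3))
```

A rank near 1 means the owner fielded close to the best available squad; a rank near 0.5 means a median squad.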

Now, if I am lucky, then I tend to face opponents in the weeks in which they perform poorly. And the following chart shows this measure from week to week:

Ad_opp_against_options

In Week 1, this owner was rather unlucky, in the sense that his opponent pretty much used his best possible squad. On the other hand, in Week 4, his opponent (a different team) played a weak hand, something close to the median squad. In addition, the entire histogram sits on the left side of the chart, meaning that even his opponent's best possible squad this week would have been easy to beat.

Luck can be measured over the course of the 13 weeks. If the vertical lines tend to show up on the right tails of these histograms, then this owner isn't lucky. On the other hand, if the lines show up mostly on the left half of the histograms, then this owner is lucky.
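One way to summarize the position of those vertical lines is sketched below, with made-up numbers. Each week's entry pairs the opponent's actual points with the totals of every squad that opponent could have fielded. Only four hypothetical weeks are shown rather than 13, and the 0.5 cutoff for a "lucky" week is an assumption for illustration.

```python
from statistics import mean


def percentile_rank(actual, possible):
    """Share of the opponent's possible squads scoring no more than the one played."""
    return sum(t <= actual for t in possible) / len(possible)


# Hypothetical data: week -> (opponent's actual points, totals of all possible squads).
weeks = {
    1: (95, [60, 70, 80, 90, 95]),
    2: (70, [55, 70, 85, 90, 100]),
    3: (88, [62, 75, 88, 93, 99]),
    4: (75, [60, 70, 75, 80, 90]),
}

ranks = {
    week: percentile_rank(actual, possible)
    for week, (actual, possible) in weeks.items()
}
avg_rank = mean(ranks.values())
# Assumed cutoff: the opponent played at or below the median of its own options.
lucky_weeks = [week for week, rank in ranks.items() if rank <= 0.5]
print(ranks, avg_rank, lucky_weeks)
```

A low average rank says the owner tended to catch opponents on their cold days; a high average rank says the opposite.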

In Chapter 8 of Numbersense, I use such an analysis to figure out the role of luck. This luck factor turned out to be even more important than the owner's own skills!

Special for Junk Charts readers: here is an excerpt from Chapter 8 (link).

***

The second book giveaway contest is under way on the sister blog. Enter the contest here.

 


Book quiz data geekery, plus another free book

The winner of the Numbersense Book Quiz has been announced. See here.

GOOD NEWS: McGraw-Hill is sponsoring another quiz. Same format. Another chance to win a signed book. Click here to go directly to the quiz.

***

I did a little digging around the quiz data. The first thing I'd like to know is when people sent in responses.

This is shown on the right. Not surprisingly, Monday and Tuesday were the most popular days, combining for 70 percent of all entries. The contest was announced on Monday so this is to be expected.

There was a slight bump on Friday, the last day of the contest.

I'm at a loss to explain the few stray entries on Saturday. This is very typical of real-world data; strange things just happen. In the software, I set the stop date to be Saturday, 12:00 AM, and I was advised that they abide by Pacific Standard Time. This doesn't seem to be the case, unless... the database itself is configured to a different time standard!

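The last entry was around 7 am on Saturday. Pacific Standard Time is 8 hours behind Greenwich Mean Time, which is effectively UTC, the time standard used by a lot of web servers. If the database stamps entries in GMT, then 7 am on Saturday is still late Friday evening in Pacific Time, before the cutoff.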

That's my best guess. I can't spend any more time on this investigation.
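The arithmetic behind this guess can be checked with a few lines of Python. The calendar date below is hypothetical (only the weekday and the hour are known), and the fixed 8-hour offset follows the Pacific Standard Time assumption; during daylight saving time the offset would be 7 hours.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Fixed UTC-8 offset, i.e. Pacific Standard Time (assumption; PDT would be UTC-7).
PST = timezone(timedelta(hours=-8), "PST")

# Hypothetical date that falls on a Saturday.
cutoff_pst = datetime(2013, 1, 5, 0, 0, tzinfo=PST)      # Saturday 12:00 AM Pacific
last_entry_utc = datetime(2013, 1, 5, 7, 0, tzinfo=UTC)  # stamped around 7 am Saturday

# The same two instants, each expressed in the other clock.
last_entry_pst = last_entry_utc.astimezone(PST)
cutoff_utc = cutoff_pst.astimezone(UTC)

print("last entry, Pacific clock:", last_entry_pst.isoformat())
print("cutoff, GMT clock:       ", cutoff_utc.isoformat())
print("entry before cutoff:     ", last_entry_utc < cutoff_pst)
```

Under these assumptions, an entry stamped 7 am Saturday on a GMT clock arrived an hour before the Pacific midnight cutoff, so the stray Saturday entries would not be late at all.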

***

The next question that bugs me is how only about 80% of the entries could have contained 3 correct answers. The quiz was designed to pose as low a barrier as possible, and I know based on interactions on the blog that the IQ of my readers is well above average.

I start with a hypothesis. Perhaps the odds of winning the book are rather low (even though they are much higher than those of any lottery), so some people are just not willing to invest the time to answer 3 questions, and they guess at random. What would the data say?

Haha, these people are caught red-handed. The boxplots (on the left) show the time spent completing the quiz.

Those who have one or more wrong answers are labelled "eligible = 0" and those who have all 3 answers correct are labelled "eligible = 1".

There is very strong evidence that those who have wrong answers spent significantly less time doing the quiz. In fact, 50 percent of these people sent in their response less than 1 minute after starting the quiz! (In a boxplot, the white line inside the box indicates the median.)

Also, almost everyone who has one or more wrong answers spent less time filling out the quiz than the 25th-percentile person among those who have three correct answers.
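The two comparisons read off the boxplots (the median of one group, and one group against the other's 25th percentile) can be computed directly. The sketch below uses made-up completion times in seconds, not the quiz data, and relies on the default "exclusive" quartile method of Python's statistics module, which may differ slightly from what a plotting package uses.

```python
from statistics import median, quantiles

# Hypothetical completion times in seconds, keyed by the "eligible" flag.
durations = {
    0: [20, 35, 45, 50, 55, 70, 150],             # one or more wrong answers
    1: [90, 120, 150, 180, 240, 300, 420, 600],   # all three answers correct
}

median_wrong = median(durations[0])
# quantiles(..., n=4) returns the three quartile cut points; the first is the 25th percentile.
q1_right = quantiles(durations[1], n=4)[0]
# Share of the wrong-answer group that finished faster than that 25th percentile.
share_faster = sum(d < q1_right for d in durations[0]) / len(durations[0])

print(median_wrong, q1_right, round(share_faster, 2))
```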

As with any data analysis, one must be careful drawing conclusions. While I think these readers are unwilling to invest the time, perhaps just checking off the answers at random, there are other reasons for not having three correct answers. Abandonment is one: maybe those readers were distracted in the middle of the quiz. Maybe the system went down in the middle of the process. (I'm not saying this happened; it's just a possibility.)

***

Finally, among those who got at least one answer wrong, were they more likely to enter the quiz at the start of the week or at the end?

There is weak evidence that those who failed to answer all 3 questions correctly were more likely to enter the contest on Friday (last day of the quiz) while those who entered on Wednesday or Thursday (the lowest response days of the week) were more likely to have 3 correct answers. It makes sense that those readers were more serious about wanting the book.

***

Now, hope you have better luck in round 2 of the Numbersense book quiz. Enter the quiz here.

 

 

 


Getting inside my head

[This is a cross-post from the sister blog, Numbers Rule Your World]

Some interviews with me or snippets of such have surfaced recently. Here is a list:

Kate Meersschaert interviewed me for New Learning Times (link; registration required). I talked about my teaching philosophy, and why I write books.

Jay Ulfelder, a political scientist who keeps an interesting blog, recommends Numbers Rule Your World, and a few other books for political scientists (link).

If you haven't heard already, 2013 is the International Year of Statistics. I was one of the talking heads here.

Here, I talk about the history of Junk Charts, and the new paradigm of interactively building graphs, as opposed to the template paradigm popularized by Excel.

 


Rank confusion

This chart, found in Princeton Alumni Weekly, only partially scanned here, supposedly gave reasons for "Princeton's top-rated [Ph.D.] programs" "to celebrate". My alma mater has outstanding academic departments, but it would be difficult to know from this chart!

Phdranks

Due to the color scheme, the numbers that jump out at you are the ones on the bright orange background, which refer to how many other departments are ranked equal to Princeton's in those subjects. It takes some effort to realize that the more zeroes there are in the top buckets (fading orange), the better.

The editor started with a nice idea, which is to convert raw rankings into clusters of rankings. She recognized that in this type of ranking (see a related post on my book blog here), it is meaningless to gloat about #1 versus #2 because they are probably statistically the same. For instance, in the ranking of Architecture departments (ARC), 37 schools (including Princeton) all belonged to the same cluster, judged to be a statistical tie.

One of the main reasons why this chart looks so confusing is that it fails the self-sufficiency test. It really is a disguised data table, with some colorful background and shadows; the graphical elements add nothing to the data at all. If one covered up all the data, there would be nothing left to see!

In the following rework, I emphasize the cluster structure. Each subject has three possible clusters: schools ranked above, equal to, and below Princeton. Instead of plotting raw numbers, the chart shows proportions of schools in each category. The order is roughly such that the departments with the relatively higher standing float to the top. Because a bar chart is used, the department names could be spelt out in their entirety and placed horizontally.

Jc_phdranks
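The data preparation behind a chart like this is simple, as the sketch below suggests. The counts are hypothetical and are not taken from the Princeton Alumni Weekly chart. The sort key (smallest share ranked above first, then smallest share tied) is just one assumed way of letting the stronger departments float to the top.

```python
# Hypothetical counts per department: (schools above, schools tied, schools below).
counts = {
    "Architecture": (0, 36, 10),
    "Economics": (0, 4, 100),
    "History": (2, 10, 120),
}


def proportions(above, tied, below):
    """Convert raw counts into shares of all ranked schools."""
    total = above + tied + below
    return (above / total, tied / total, below / total)


rows = sorted(
    ((dept, proportions(*c)) for dept, c in counts.items()),
    # Assumed ordering rule: fewer schools above first, then fewer schools tied.
    key=lambda row: (row[1][0], row[1][1]),
)

for dept, (above, tied, below) in rows:
    print(f"{dept:<14} above {above:6.1%}  tied {tied:6.1%}  below {below:6.1%}")
```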

If one has access to the raw data, it would be even better to reveal the entire cluster structure. It is quite possible that the clusters above and below Princeton can be further subdivided into more clusters. This will allow readers to understand better what the cluster ranks mean.

 


Head-shaking at the deep hole

Continuing to work through the pile of submissions, here is Jeannie C. recommending one of my favorite economics charts. The economics blogs are generating lots of charts, many of which are uninspiring and run-of-the-mill, but this one about the jobs picture, relative to past recessions, truly paints a harrowing story. (Looks like TPM took the chart from Business Insider, but this chart has appeared everywhere.)

Tpm_scaryjobs

The little dotted extension at the end of the current curve (red) indicates the jobs picture after removing Census jobs. I have already explained why this adjustment is necessary (here, and here).

Two small improvements I'd make to the chart:

  • Instead of a rainbow of colors, use a foreground-background concept: show all the past recessions in gray, and the current one in red. This change necessitates a change in curve labeling strategy: affix the year labels directly on the 0% line above the curves. Doing so eliminates the head shakes needed to find the year of the curve.
  • Smooth out some of the curves, which will help remove clutter without harming the central message of the chart.