Report from the NBA Hackathon 2017

Yesterday, I had the honor of being one of the judges at the NBA Hackathon. This is the second edition of the Hackathon, organized by the NBA League Office's analytics department in New York. Here is Director of Basketball Analytics, Jason Rosenfeld, speaking to the crowd:


The event was a huge draw - lots of mostly young basketball enthusiasts trying their hands at manipulating and analyzing data to solve interesting problems. I heard that over 50 teams showed up on "game day." Hundreds more applicants did not get "drafted." Many competitors came from out of town - amongst the finalists, there was a team from Toronto and one from Palo Alto.

The competition was divided into two tracks: basketball analytics, and business analytics. Those in the basketball track were challenged with problems of interest to coaches and managers. For example, they were asked to suggest a rule change that might increase excitement in the game, and support that recommendation using the voluminous spatial data. Some of these problems were hard: one involved projecting shot selection ten years out - surely fans want to know if the craze over 3-pointers will last. Nate Silver was one of the judges for the basketball analytics competition.

I was part of the business analytics judging panel, along with the fine folks shown below:


The business problems were challenging as well, and really tested the competitors' judgment, as the problems were open-ended and subjective. Technical skills were also required, as very wide-ranging datasets were made available. One problem asked contestants to combine a large number of datasets to derive a holistic way to measure the "entertainment value" of a game. The other problem was even more open: do something useful and interesting with our customer files.

I visited the venue the night before, when the teams were busy digging into the data. See the energy in the room here:


The competitors were given 24 hours to work on the datasets. This time included making a presentation to showcase what they had found. They were not allowed to use old code. I overheard several conversations between contestants and the coaches - it appeared that the datasets were in a relatively raw state, meaning quite a bit of time would have been spent organizing, exploring, cleaning and processing the data.

One of the finalists in the business competition started their presentation by telling the judges they had spent 12 hours processing their datasets. It does often seem that, as analysts, we are fighting with our data.


This team from Toronto wrestled with the various sets of customer-indexed data, and came up with a customer segmentation scheme. They utilized a variety of advanced modeling techniques.

The other two finalists in the business competition tackled the same problem: how to measure entertainment value of a game. Their approaches were broadly similar, with each team deploying a hierarchy of regression models. Each model measures a particular contributor to entertainment value, and contains a number of indicators to predict the contribution.

Pictured below is one of the finalists, who deployed Lasso regression, a modern technique to select a subset of important factors from a large number of possibilities. This team has a nice handle on the methods, and notably, was the only team that presented error bars, showing the degree of uncertainty in their results.
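For readers who have not met the Lasso before, here is a minimal sketch of how it selects factors, written in plain Python. This is not the finalists' code: the data are made up, the names `soft_threshold` and `lasso_fit` and the penalty value are my own choices for illustration, and the solver is a bare-bones coordinate descent that assumes no column is all zeros.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(rows, targets, penalty, sweeps=100):
    """Minimize (1/2n) * squared error + penalty * sum(|coef|) by coordinate descent."""
    n, p = len(rows), len(rows[0])
    coefs = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0    # correlation of factor j with the partial residual
            scale = 0.0  # sum of squares of factor j
            for row, target in zip(rows, targets):
                partial = sum(coefs[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (target - partial)
                scale += row[j] ** 2
            coefs[j] = soft_threshold(rho / n, penalty) / (scale / n)
    return coefs


# Hypothetical data: three candidate factors, only the first one matters much.
rows = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
targets = [2.1, -1.9, 1.9, -2.1]  # roughly 2 * factor 0 + 0.1 * factor 1

coefs = lasso_fit(rows, targets, penalty=0.5)
print(coefs)  # the weak factors are dropped (coefficient exactly 0)
```

The penalty does two things at once: it sets the coefficients of weak factors to exactly zero, and it shrinks the surviving ones. That is why the Lasso is described as a selection technique rather than only a fitting technique.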


The winning team in the business competition went a couple of steps beyond. First, they turned in a visual interface to a decision-making tool that scores every game according to their definition of entertainment value. I surmise that they also expressed these scores in a relative way, because some of their charts show positive and negative values. Second, this team from Princeton realized the importance of tying all their regression models together into a composite score. They even allow the decision makers to shift the component weights around. Congratulations to Data Buckets! Here is the pair presenting their decision-making tool:
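The idea of a composite score with adjustable weights can be sketched in a few lines of Python. The component names and numbers below are hypothetical, not taken from the Data Buckets tool; the only assumption is that each component model produces one score per game and that the decision maker supplies a weight for each component.

```python
def composite_score(component_scores, weights):
    """Weighted average of component scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * component_scores[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical components for one game, expressed relative to an average game.
game = {"closeness": 1.0, "star_power": -0.5}

# A decision maker who cares three times as much about close games.
print(composite_score(game, {"closeness": 3, "star_power": 1}))

# Shifting the weights changes the ranking without refitting any model.
print(composite_score(game, {"closeness": 1, "star_power": 3}))
```

Dividing by the total weight keeps the composite on the same scale as the components, so scores remain comparable when the weights are moved around.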


Mark Tatum, deputy commissioner of the NBA League Office, presented the award to Team Data Buckets:


These two are also bloggers. Look here.

After much deliberation, the basketball analytics judges liked the team representing the Stanford Sports Analytics Club.


These guys tackled the very complicated problem of forecasting future trends in shot selection, using historical data.

For many, maybe most, of the participants, this was their first exposure to real-world datasets, and to a short time window for delivering an end-product. They must also have learned quite a bit about collaboration.

The organizers should be congratulated for putting together a smoothly-run event. When you host a hackathon, you have to be around throughout the night as well. Also, the analytics department staff kindly simplified the lives of us judges by performing the first round of selection overnight.


Last but not least, I'd like to present the unofficial Best Data Graphics Award to the team known as Quire Sultans. They were a finalist in the basketball analytics contest. I am impressed with this display:


This team presented a new metric using data on passing. The three charts are linked. The first one shows passer-passee data within a specific game; the second shows locations on the court for which passes have more favorable outcomes; the third chart measures players' over/under performance against a model.

There were quite a few graphics presented at the competition. This is one of the few in which the labels were carefully chosen and easily understood, without requiring in-depth knowledge about their analysis.

Much more to do after selecting a chart form


I sketched out this blog post right before the Super Bowl - and was really worked up as I happened to be flying into Atlanta right after they won (well, according to any of our favorite "prediction engines," the Falcons had a 95%+ chance of winning it all a minute from the end of the 4th quarter!). What I'd give to be in the Super Bowl-winning city the day after the victory!

Maybe next year. I didn't feel like publishing about Super Bowl graphics when the wound was so very raw. But now is the moment.

The following chart came from the Orange County Register in the run-up to the Super Bowl. (The bobble-head quarterbacks also came from OCR.) The original article is here.


The choice of a set of dot plots is inspired. The dot plot is one of those under-utilized chart types - for comparing two or three objects along a series of metrics, it has to be one of the most effective charts.

To understand this type of design, readers have to collect three pieces of information: first is to recognize the dot symbols, which color or shape represents which object being compared; second is to understand the direction of the axis; third is to recognize that the distance between the paired dots encodes the amount of difference between the two objects.

The first task is easy enough here as red stands for Atlanta and blue for New England - those being the team colors.

The second task is deceptively simple. It appears that a ranking scale is used for all metrics with the top ("1st") shown on the left side and the bottom ("32nd") shown on the right. Thus, all 32 teams in the NFL are lined up left to right (i.e. best to worst).

Now, focus your attention on the "Interceptions Caught" metric, third row from the bottom. The designer indicated "Fewest" on the left and "Most" on the right. For those who don't know American football, an "interception caught" is a good defensive play; it means your defensive player grabs a ball thrown by the opposing team (usually their quarterback), causing a turnover. Therefore, the more interceptions caught, the better your defence is playing.

Glancing back at the chart, you learn that on the "Interceptions Caught" metric, the worst team is shown on the left while the best team is shown on the right. The same reversal happened with "Fumbles Lost" (fewest is best), "Penalties" (fewest is best), and "Points Allowed per Game" (fewest is best). For four of nine metrics, right is best while for the other five, left is best.

The third task is the most complicated. A ranking scale always has the weakness that a gap of one rank does not yield information on how important the gap is. It's a complicated decision to select what type of scale to use in a chart like this, and in this post, I shall ignore this issue, and focus on a visual makeover.


I find the nine arrays of 32 squares, essentially the grid system, much too insistent, elevating information that belongs to the background. So one of the first fixes is to soften the grid system, and the labeling of the axes.

In addition, given the meaningless nature of the rank number (as mentioned above), I removed those numbers and used team logos instead. The locations on the axes are sufficient to convey the relative ranks of the two teams against the field of 32.


Most importantly, the directions of all metrics are now oriented in such a way that moving left is always getting better.
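The re-orientation is simple arithmetic. Here is a sketch in Python, assuming each team's location on an original axis is recorded as a position from 1 (leftmost) to 32 (rightmost), plus a flag saying whether the right side was the good side on that axis. The function name and the example positions are made up for illustration.

```python
NUM_TEAMS = 32  # teams in the NFL


def orient_left_is_best(position, right_is_best):
    """Return the position on an axis where moving left always means better."""
    if right_is_best:
        # Mirror the axis: the rightmost slot (32) becomes the leftmost (1).
        return NUM_TEAMS + 1 - position
    return position


# Hypothetical positions: a metric drawn with the best team on the right...
print(orient_left_is_best(30, right_is_best=True))   # 3
# ...and a metric that was already drawn with the best team on the left.
print(orient_left_is_best(5, right_is_best=False))   # 5
```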


While using logos for sports teams is natural, I ended up replacing those, as the size of the dots is such that the logos are illegible anyway.

The above makeover retains the original order of metrics. But to help readers address the key question of this chart - which team is better, the designer should arrange the metrics in a more helpful way. For example, in the following version, the metrics are subdivided into three sections: the ones for which New England is significantly better, the ones for which Atlanta is much better, and the rest for which both teams are competitive with each other.


In the Trifecta checkup (link), I speak of the need to align your visual choices with the question you are trying to address with the chart. This is a nice case study of strengthening that Q-V alignment.







Lining up the dopers and their medals

The Times did a great job making this graphic (this snapshot is just the top half):


A lot of information is packed into a small space. It's easy to compose the story in our heads. For example, Lee Chong Wai, the Malaysian badminton silver medalist, was suspended for doping for a short time during 2015, and he was second twice before the doping incident.

They sorted the athletes according to the recency of the latest suspension. This is very smart as it helps make the chart readable. Other common orderings, such as alphabetically by last name, by sport, by age, or by number of medals, would result in a bit of a mess.

I'm curious about the athletes who also had doping suspensions but did not win any medals in 2016.

Counting the Olympic medals

Reader Conor H. sent in this daily medals table at the NBC website:


He commented that the bars are not quite the right lengths. So even though China and Russia both won five total medals that day, the bar for China is slightly shorter.

One issue with the stacked bar chart is that the reader's attention is drawn to the components rather than the whole. However, as in this case, the most important statistic is often the total number of medals.

Here is a different view of the data:




Various ways of showing distributions

The other day, a chart about the age distribution of Olympic athletes caught my attention. I found the chart on Google but didn't bookmark it, and now I can't retrieve it. In my mind's eye, the chart looks like this:


This chart has the form of a stacked bar chart but it really isn't. The data embedded in each bar segment aren't proportions; rather, they are counts of athletes along a standardized age scale. For example, the very long bar segment on the right side of the bar for alpine skiing does not indicate a large proportion of athletes in that 30-50 age group; it's the opposite: that part of the distribution is sparse, with an outlier at age 50.

The easiest way to understand this chart is to transform it to histograms.


In a histogram, the counts for different age groups are encoded in the heights of the columns. Instead, encode the counts in a color scale so that taller columns map to darker shades of blue. Then, collapse the columns to the same heights. Each stacked bar chart is really a collapsed histogram.
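The collapsing step can be sketched in Python. The ages below are made up, and the text characters stand in for shades of blue (a blank for an empty bin, `#` for the darkest shade); the function name, bin width and shade ramp are all my own assumptions for illustration.

```python
from collections import Counter


def collapsed_histogram(ages, bin_width=5, shades=" .:*#"):
    """Encode bin counts as shades in a single strip instead of column heights."""
    counts = Counter((age // bin_width) * bin_width for age in ages)
    first_bin, last_bin = min(counts), max(counts)
    peak = max(counts.values())
    cells = []
    for start in range(first_bin, last_bin + bin_width, bin_width):
        count = counts.get(start, 0)
        if count == 0:
            level = 0  # empty bin: the lightest "shade"
        else:
            # Map counts 1..peak onto shade levels 1..len(shades)-1.
            level = 1 + (count - 1) * (len(shades) - 2) // max(peak - 1, 1)
        cells.append(shades[level])
    return "".join(cells)


# Hypothetical ages for one sport, with a single outlier at age 50.
ages = [20, 21, 22, 23, 24, 25, 26, 31, 50]
print(repr(collapsed_histogram(ages)))
```

The long stretch of blank cells before the final mark is the point made above about alpine skiing: a long segment in the strip means a sparse part of the distribution, not a large share of athletes.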


The stacked bar chart reminds me of boxplots that are loved by statisticians.


In a boxplot, the box contains the middle 50% of the athletes in each sport (this directly maps to the dark blue bar segments from the chart above). Outlier values are plotted individually, which gives a bit more information about the sparsity of certain bar segments, such as the right side of alpine skiing.
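For reference, the quantities behind a boxplot can be computed with Python's standard library. The ages are hypothetical, and the 1.5 x IQR rule for flagging outliers is the usual boxplot convention, which I am assuming here rather than reading off the chart.

```python
import statistics


def box_summary(ages):
    """Quartiles of the ages, plus the values a boxplot would plot individually."""
    q1, median, q3 = statistics.quantiles(ages, n=4)
    iqr = q3 - q1  # the box spans the middle 50% of athletes
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [age for age in ages if age < low_fence or age > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical ages for one sport, including one much older athlete.
print(box_summary([20, 22, 23, 24, 25, 26, 27, 28, 50]))
```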

The stacked bar chart can be considered a nicer-looking version of the boxplot.



Super-informative ping-pong graphic

Via Twitter, Mike W. asked me to comment on this WSJ article about ping pong tables. According to the article, ping pong table sales track venture-capital deal flow:


This chart is super-informative. I learned a lot from this chart, including:

  • Very few VC-funded startups play ping pong, since the highlighted reference lines show 1000 deals and only 150 tables (!)
  • The one San Jose store interviewed for the article is the epicenter of ping-pong table sales, therefore they can use it as a proxy for all stores and all parts of the country
  • The San Jose store only does business with VC startups, which is why they attribute all ping-pong tables sold to these companies
  • Startups purchase ping-pong tables in the same quarter as their VC deals, which is why they focus only on within-quarter comparisons
  • Silicon Valley startups only source their office equipment from Silicon Valley retailers
  • VC deal flow has no seasonality
  • Ping-pong table sales have no seasonality either
  • It is possible to predict the past (VC deals made) by gathering data about the future (ping-pong tables sold)

Further, the chart proves that one can draw conclusions from a single observation. Here is what the same chart looks like after taking out the 2016 Q1 data point:


This revised chart is also quite informative. I learned:

  • At the same level of ping-pong-table sales (roughly 150 tables), the number of VC deals ranged from 920 to 1020, about one-third of the vertical range shown in the original chart
  • At the same level of VC deals (roughly 1000 deals), the number of ping-pong tables sold ranged from 150 to 230, about half of the horizontal range of the original chart

The many quotes in the WSJ article also tell us that people in Silicon Valley are no more data-driven than people in other parts of the country.

Football managers on the hot seat

Chris Y. asked how to read this BBC Sports graphic via Twitter:


These are managers of British football (i.e. soccer) teams. Listed are some of the worst tenures of some managers. But what do the numbers mean?

The character "V" holds the key. When I first read the chart title, I wondered why managers are opposed to win percentages. Also, the legend at the bottom right confused me. Did they mean "W" when they printed "V"? "Games W%" seems like a shorthand for winning percentage.

After looking up John Carver's not-so-impressive record, I learned that the left column is the total number of matches managed and the right column is the winning percentage expressed as a number between 0 and 100.

I think even the designer got confused by those scales. Witness the little bar charts in the middle:


The two numbers are treated as if they are on the same scale. The left column is assumed to be the number of matches won while the right column is treated as the number of matches lost (or vice versa). Under this interpretation, the bar charts would depict the winning percentages. Let me fix the data:
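Fixing the data means putting the two columns back on the same scale. A sketch in Python, using a made-up record rather than any manager's actual numbers:

```python
def split_record(matches, win_pct):
    """Turn (matches managed, winning percentage) into (matches won, matches not won)."""
    wins = round(matches * win_pct / 100)
    return wins, matches - wins


# Hypothetical manager: 20 matches in charge with a 15% winning percentage.
wins, not_won = split_record(20, 15.0)
print(wins, not_won)   # 3 17
print(wins / 20)       # share of the bar that should be shaded as wins
```

Note that in football "not won" lumps draws together with losses, so the second number is not strictly the number of matches lost.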


While these managers have compiled similar losing records on a relative basis, some of them lasted longer than others. The following chart brings out the difference in tenure while keeping the winning percentages: (I have re-sorted the managers.)


When they finally got the sack, they reached the end of the line.

Bewildering baseball math

Over Twitter, someone asked me about this chart:


It's called the MLB pipeline. The text at the top helpfully tells us what the chart is about: how the playoff teams in baseball are built. That's the good part.

It then took me half a day to understand what is going on below. There are four ways for a player to be on a team: homegrown (which covers both drafted players and international players), trades and free agents.

Each row is a type of player. You can look up which teams have exactly X players of a specific type. It gets harder if you want to know how many players team Y has of a given type. It is even harder if you don't know the logos of every team (e.g. Toronto Blue Jays).

Some fishy business is going on with the threesomes and foursomes. Here is the red threesome:


Didn't know baseball employs half a player. The green section has a different way to play threesomes:


The blue section takes inspiration from both and shows us a foursome:


I was stuck literally in the middle for quite a while:


Eventually, I realized that this is a summary of the first two sections on the page. I still don't understand why there is no gap between 11 and 14, or why the 14 and 15 arrows are twice as large as the 9, 10 and 11 arrows even though every arrow contains exactly one team.


The biggest problem in the above chart is the hidden base: each team's roster has a total of 25 players.

Here is a different view of the data:


With this chart, I want to emphasize two points: first, addressing the most interesting question of which team(s) emphasize which particular player acquisition tactic; second, providing the proper reference level to interpret the data.

Regarding the vertical reference lines: take the top left chart about players arriving through trade. If every team equally emphasized this tactic, then each team would have the same number of traded players on the 25-person roster. This would mean every team has approximately 11 traded players. This is clearly not the case. Several teams, especially the Cubs and Blue Jays, utilized trades more often than teams like the Mets and Royals.
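The reference level is just the average count across teams, and each team is then read as a gap from that level. Here is a sketch in Python with made-up team names and counts (not the actual rosters); the function names are mine.

```python
def reference_level(counts_by_team):
    """Count each team would have if all teams used the tactic equally."""
    return sum(counts_by_team.values()) / len(counts_by_team)


def gaps_from_reference(counts_by_team):
    """Positive gap: the team leans on this tactic more than the average team."""
    level = reference_level(counts_by_team)
    return {team: count - level for team, count in counts_by_team.items()}


# Hypothetical counts of traded players on three 25-person rosters.
traded = {"Team A": 15, "Team B": 11, "Team C": 7}
print(reference_level(traded))      # 11.0
print(gaps_from_reference(traded))
```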



Don't pick your tool before having your design

My talk at Parsons seemed like a success, based on the conversation it generated, and the fact that people stuck around till the end. One of my talking points is that one should not pick a tool before having a design.

Then, last night on Twitter, I found an example to illustrate this. Jim Fonseca tweeted about this chart from Business Insider: (link)


The style is clean and crisp, which I credit them for. Jim was not happy about the length of the columns. It seems that no matter how many times we repeat the start-at-zero rule, people continue to ignore it.

So here we go again. The 2015 column is about double the height of the 2013 column but 730 is nowhere near double the value of 617.
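A little arithmetic shows how far from zero the axis must start. If the taller column is r times the height of the shorter one, the baseline b satisfies (high - b) = r * (low - b). The sketch below takes "about double" to mean exactly 2, which is my approximation from eyeballing the chart.

```python
def implied_baseline(low_value, high_value, height_ratio):
    """Axis starting value that makes the taller column height_ratio times the shorter."""
    # Solve (high - b) = ratio * (low - b) for b; requires ratio != 1.
    return (height_ratio * low_value - high_value) / (height_ratio - 1)


print(implied_baseline(617, 730, 2.0))  # 504.0
print(round(730 / 617, 2))              # 1.18, the ratio an honest chart would show
```

Under that reading, the columns start at around 504 rather than zero, which turns an increase of about 18 percent into an apparent doubling.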

The standard remedy for this is to switch to a line chart, or a dot plot. Something like this can be quickly produced in any software:


Is this the best we can do?

Not if we are willing to free ourselves from the tool. Think about the message: NFL referees have been calling more penalties this year. Compared to what?

I want to leave readers no doubt as to what my message is. So I sketched this version:


This version cannot be produced directly from a tool (without contorting your body in various painful locations).

The lesson is: Make your design, then find a way to execute it.

Reimagining the league table

The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.

For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.


If you want to know weekly results, you have to scroll to each team's section, and look at this format:


For the graph that I envisioned for the talk, I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.


The graph that I ended up using is this one:


The teams are sorted by winning percentage. One thing should be pretty clear: the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play in determining whether the team wins or loses.

The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.

The same chart but with colored dots, and a legend:


Only one line of dots per team instead of two, and also requiring a legend:


Jittering is a popular solution to separating co-located dots but the effect isn't very pleasing to my eye:
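For those who have not used it, jittering just adds a small random offset to each dot so that identical values no longer sit on top of each other. A minimal sketch in Python, with made-up point totals; the function name, the spread and the fixed seed are my own choices.

```python
import random


def jitter(values, spread=0.3, seed=0):
    """Nudge each value by a random amount within +/- spread."""
    rng = random.Random(seed)  # fixed seed so the chart looks the same on every run
    return [value + rng.uniform(-spread, spread) for value in values]


# Hypothetical weekly Points Scored for one team; three games end on 21 points.
points = [21, 21, 21, 24]
print(jitter(points))
```

The trade-off is that the plotted positions are no longer the exact data values, and the random scatter adds visual noise of its own.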


Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.



As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.