I try hard not to hate all hover-overs. Here is one I love

One of the smart things Noah (at WNYC) showed to my class was his NFL fan map, based on Facebook data.

This is the "home" of the visualization:

Veltman_nfl_home

The fun starts by clicking around. Here are the Green Bay fans on Facebook:

Veltman_nfl_greenbay

Also, you can see these fans relative to other teams in the same division:

Veltman_nfl_afcnorth

A team like Jacksonville has a tiny footprint:

Veltman_nfl_jags

 

What makes this visualization work?

Notice the "home" image and those straight black lines. They mark the "natural" regions of influence, if you assume that all fans root for the team they are physically closest to.

To appreciate this, you have to look at a more generic NFL fan map (this is one from Deadspin):

Deadspin_nfl_fans

This map is informative but not as informative as it ought to be. The reference points provided here are the state boundaries, but we don't have one NFL team per state. Those "Voronoi" boundaries Noah added are more reasonable reference points against which to compare the Facebook fan data.
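Those Voronoi regions have a simple definition: every location belongs to the team whose stadium is nearest. A minimal sketch of that assignment rule, using made-up stadium coordinates (the team names are real, but the numbers below are only illustrative):

```python
import math

# Hypothetical stadium coordinates (longitude, latitude) -- illustrative only
stadiums = {
    "Green Bay": (-88.06, 44.50),
    "Chicago": (-87.62, 41.86),
    "Minnesota": (-93.26, 44.97),
}

def nearest_team(point, stadiums):
    """Assign a location to the team whose stadium is closest (Euclidean distance)."""
    return min(stadiums, key=lambda team: math.dist(point, stadiums[team]))

# A fan in Madison, WI falls inside Green Bay's "natural" region
print(nearest_team((-89.40, 43.07), stadiums))
```

Drawing the boundary lines between all such regions is exactly what a Voronoi diagram does; the interesting parts of the fan map are where the Facebook data disagree with this nearest-stadium rule.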

When looking at the fan map, the most important question is what each team's region of influence is. This work reminds me of what I wrote before about the Beer Map (link). Putting all beer labels (or NFL teams) onto the same map makes it hard to get quick answers to that question. A small-multiples presentation is more direct, as the reader can see the brands/teams one at a time.

Here, Noah makes use of interactivity to present these small multiples on the same surface. It's harder to compare multiple teams but that is a secondary question. He does have two additions in case readers want to compare multiple teams. If you click instead of mousing over a team, the team's area of influence sticks around. Also, he created tabs so you can compare teams within each division.

I usually hate hover-over effects. They often hide things that readers want (creating what Noah calls "scavenger hunts"). The hover-over effect is used masterfully here to organize the reader's consumption of the data.

***

Moving to the D corner of the Trifecta checkup. Here is Noah's comment on the data:

Facebook likes are far from a perfect method for measuring NFL fandom. In sparsely-populated areas of the country, counties are likely to have a very small sample size. People who like things on Facebook are also not a perfect cross-section of football fans (they probably skew younger, for example). Other data sources that could be used as proxies for fan interest (but are subject to their own biases) are things like: home game attendance, merchandise sales, TV ratings, or volume of tweets about a team.
 
 

 


An infographic showing up here for the right reason

Infographics do not have to be "data ornaments" (link). Once in a blue moon, someone finds the right balance of pictures and data. Here is a nice example from the Wall Street Journal, via ThumbsUpViz.

 

Thumbsupviz_wsj_footballinjuries

 

Link to the image

 

What makes this work is that the picture of the running back serves a purpose here, in organizing the data. Contrast this with the airplane from Consumer Reports (link), which did a poor job of providing structure. The alternative of using a bar chart is clearly inferior and much less engaging.

Redowsjinjuries_bar

***

I went ahead and experimented with it:

Redo_wsj_nflinjuries

 

I fixed the self-sufficiency issue, always present when using bubble charts. In this case, I don't think it matters whether the readers know the exact number of injuries so I removed all of the data from the chart.

Here are three temptations that I did not act on:

  • Omitting the legend
  • Omitting the text labels, which are rendered redundant by the brilliant idea of using the running guy
  • Hiding the bar charts behind a mouseover effect

 


Revisiting the home run data

Note to New York metro readers: I'm an invited speaker at NYU's "Art and Science of Brand Storytelling" summer course which starts tomorrow. I will be speaking on Thursday, 12-1 pm. You can still register here.

***

The home run data set, compiled by ESPN and visualized by Mode Analytics, is pretty rich. I took a quick look at one aspect of the data. The question I ask is what differences exist among the 10 hitters that are highlighted in the previous visualization. (I am not quite sure how those 10 were picked because they are not the Top 10 home run hitters in the dataset for the current season.)

The following chart focuses on two metrics: the total number of home runs by this point in the season; and the "true" distances of those home runs. I split the data by whether the home run was hit on a home field or an away stadium, on the hunch that we'd need to correct for such differences.

Jc_top10hitters_homeaway_splits

The hitters are sorted by total number of home runs. Because I am using a single season, my chart doesn't suffer from a cohort bias. If you go back to the original visualization, it is clear that some of these hitters are veterans with many seasons of baseball in them while others are newbies. This cohort bias explains the difference in dot densities of those plots.

Not having followed baseball recently, I don't recognize many of the names on the list. I had to look up Todd Frazier - does he play in a hitter-friendly ballpark? His home-to-away ratio is massive. Frazier plays for Cincinnati, at the Great American Ballpark. That ballpark has the third-highest number of home runs hit of all ballparks this season, although up till now, opponents have hit more home runs there than home players. For reference, Troy Tulowitzki's home field is Colorado's Coors Field, which is a hitter's paradise. Giancarlo Stanton, who also hits quite a few more home runs at home, plays for Miami at Marlins Park, which is below the median in terms of home run production; thus his achievement is probably the most impressive amongst those three.

Josh Donaldson is the odd man out, as he has hit more away home runs than home runs at home. His O.co Coliseum is middle-of-the-road in terms of home runs.

In terms of how far the home runs travel (bottom part of the chart), there are some interesting tidbits. Brian Dozier's home runs are generally the shortest, regardless of home or away. Yasiel Puig and Giancarlo Stanton generate deep home runs. Adam Jones, Josh Donaldson, and Yoenis Cespedes have hit the ball quite a bit deeper away from home. Giancarlo Stanton is one of the few who has hit the home-run ball deeper at his home stadium.

The baseball season is still young, and the sample sizes at the individual hitter's level are small (~15-30 total), thus the observed differences at the home/away level are mostly statistically insignificant.

The prior post on the original graphic can be found here.

 


Interactivity as overhead

Making data graphics interactive should improve the user experience. In practice, interactivity too often becomes overhead, making it harder for users to understand the data on the graph.

Reader Joe D. (via Twitter) admires the statistical sophistication behind this graphic about home runs in Major League Baseball. This graphic does present interesting analyses, as opposed to acting as a container for data.

For example, one can compare the angle and distance of the home runs hit by different players:

Redo_baseballhr

One can observe patterns: most of these highlighted players have more home runs on the left side than the right side. However, for this chart to be more telling, additional information should be provided. Knowing whether the hitter is left-handed, right-handed, or a switch hitter would be key to understanding the angles. Information about the home ballpark, and indeed differentiating between home and away home runs, is also critical to making sense of this data. (One strange feature of baseball fields is that they all have different dimensions and shapes.)

Mode_homeruns

But back to my point about interactivity. The original chart does not present the data in small multiples. Instead, the user must "interact" with the chart by clicking successively on each player (listed above the graphic).

Given that the graphic only shows one player at a time, the user must use his or her memory to make the comparison between one player and the next.

The chosen visual form discourages readers from making such comparisons, which defeats one of the primary goals of the chart.


The missing Brazil effect, and BYOC charts

Announcement: I'm giving a free public lecture on telling and finding stories via data visualization at NYU on 7/15/2014. More information and registration here.

***

The Economist states the obvious, that the current World Cup is atypically high-scoring (or poorly defended, for anyone who's never been bothered by the goal count). They dubiously dub it the Brazil effect (link).

Perhaps in a sly vote of dissent, the graphic designer came up with this effort:

Economist_worldcup

(Thanks to Arati for the tip.)

The list of problems with this chart is long but let's start with the absence of the host country and the absence of the current tournament, both conspiring against our ability to find an answer to the posed question: did Brazil make them do it?

***

It turns out that without 2014 on the chart, the only other year in which Brazil hosted a tournament was 1950. But 1950 is not even comparable to the modern era. In 1950, there was no knockout stage. There were four groups in the group stage, but of uneven sizes: two groups of four, one group of three, and one group of two. Then, four teams were selected to play a round-robin final stage. This format is so different from today's that I find it silly to try to place them on the same chart.

These data simply provide no clue as to whether there is a Brazil effect.

***

The chosen design is a homework assignment for the fastidious reader. The histogram plots the absolute number of drawn matches. The number of matches played has tripled from 16 to 48 over those years so the absolute counts are highly misleading. It's worse than nothing because the accompanying article wants to make the point that we are seeing fewer draws this World Cup compared to the past. The visual presents exactly the opposite message! (Hint: Trifecta Checkup)

Unless you realize this is a homework assignment. You can take the row of numbers listed below the Cup years and compute the proportion of draws yourself. BYOC (Bring Your Own Calculator). Now, pay attention because you want to use the numbers in parentheses (the number of matches), not the first number (that of teams).
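The homework is simple arithmetic; a sketch of it, where the match totals are the ones quoted in the article but the draw counts are hypothetical placeholders to be read off the chart:

```python
# Matches played per tournament, as printed in parentheses on the chart
matches = {1982: 52, 1994: 52}
# Hypothetical draw counts -- substitute the real figures from the chart
draws = {1982: 17, 1994: 15}

for year in sorted(matches):
    n = matches[year] - 1   # drop the third-place match, as the histogram does
    print(year, f"{draws[year] / n:.0%} of matches drawn")
```

It is this proportion, not the absolute count of draws, that supports any claim about draws becoming rarer.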

Further, don't get too distracted by the typos: in both 1982 and 1994, there were 24 teams playing, not 16 or 32. The number of matches (52 in each case) is correctly stated.

***

Wait, the designer provides the proportions at the bottom of the chart, via this device:

Econ_worldcup_sm

As usual, the bubble chart does a poor job conveying the data. I deliberately cropped out the data labels to demonstrate that the bubble element cannot stand on its own. This element fails my self-sufficiency test.

***

I find the legend challenging as well. The presentation should be flipped: look at the proportion of ties within each round, instead of looking at the overall proportion of ties and then breaking those ties down by round.

The so-called "knockout round" has had many formats over the years. In early years, there were often two round-robin stages, followed by a smaller knockout round. Presumably the second round-robin stage has been classified as the "knockout stage".

Also notice the footnote, stating that third-place games are excluded from the histogram. This is exactly how I would do it too, because the third-place match is a dead rubber, in which no rational team would want to play extra time and a penalty shootout.

The trouble is inconsistency. The number of matches shown underneath the chart includes that third-place match so the homework assignment above actually has a further wrinkle: subtract one from the numbers in parentheses. The designer gets caught in this booby trap. The computed proportion of draws displayed at the bottom of the chart includes the third-place match, at odds with the histogram.

***

Here is a revised version of the chart:

Redo_econ_worldcup1

Redo_econ_worldcup2

A few observations are in order:

  • The proportion of ties has been slowly declining over the last few Cups.
  • The drop in proportion of ties in 2014 is not drastic.
  • While the proportion of ties has dropped in the 2014 World Cup, the proportion of 0-0 ties has increased. (The gap between the two lines shows the ties with goals.)
  • In later rounds, since the 1980s, the proportion of ties has been fairly stable, between 20 and 35 percent.

Another reason for separate treatment is that the knockout stage had not yet started in 2014 when this chart was published. Instead of removing all of 2014, as the Economist did, I can include the group stage for 2014 but exclude 2014 from the knockout-round analysis.

In the Trifecta Checkup, this is Type DV. The data do not address the question being posed, and the visual conveys the wrong impression.

 ***

Finally, there is one glaring gap in all of this. Some time ago (the football fans can fill in the exact timing), FIFA decided to award three points for a win instead of two. This was a deliberate effort to increase the point differential between winning and drawing, supposedly to reduce the chance of ties. Any time-series exploration of the frequency of ties would clearly have to look into this issue.

 


On the cool maps about baseball fandom

Josh Katz, who did the dialect maps I featured recently, is at it again. He's one of the co-authors of a series of maps (link) published by the New York Times about the fan territories of major league baseball teams.

Nyt_baseballfandom

Similar to the dialect maps, these are very pleasing to look at, and also statistically interesting. The authors correctly point out that the primary points of interest are at the boundaries, and provide fourteen insets on particular regions. This small gesture represents a major shift from years past, when designers would have just published an interactive map, letting readers figure out where the interesting stuff is.

The other interesting areas are the "no-man's-lands", the areas in which there are no local teams. The map uses the same kind of spatial averaging technology that blends the colors. The challenge here would be the larger number of colors.

I'd have preferred that they give distinct colors to teams like the Yankees and the Red Sox that have broader appeal. Maybe the Yankees are the only national team they discovered, since they do get a unique gray color, which is very subtle.

I also think it is smart to hide the political boundaries of state, zip, etc. in the maps (unless you click on them).

I'd like to see a separate series of maps: small multiples by team, showing the geographical extent of each team. This is a solution to the domination issue to be addressed below.

***

Nyt_yankeesterritory

The issue of co-dominant groups I discussed in the dialect maps also shows up here. Notably, in New York, the Mets are invisible, and in the Bay Area, the Oakland A's similarly do not appear on the map.

Recall that each zip code is represented by the team with the highest absolute proportion of fans. It may be true that the Mets are the perennial #2 in all relevant zip codes. Zooming into the Yankee territory, I didn't see any zip code in which Mets fans are more numerous. So this may be the perfect example of what falls through the cracks when the algorithm drops everything but the top level.
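The winner-take-all step is easy to see in miniature. A sketch with made-up fan shares for a single zip code (the teams are real; the shares are invented for illustration):

```python
# Hypothetical fan shares within one zip code -- illustrative only
zip_fans = {"Yankees": 0.45, "Mets": 0.40, "Red Sox": 0.15}

# Coloring each zip by the argmax keeps only the top team;
# the runner-up's 40% share leaves no trace on the map
top_team = max(zip_fans, key=zip_fans.get)
print(top_team)
```

A team that finishes a close second everywhere ends up with zero territory, which is exactly the Mets' fate on this map.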

***

Now, in the Trifecta checkup, we want to understand what the data are saying. I have to say this is a bit challenging. The core dataset contains Facebook Likes (aggregated to the zip-code level). It is not even clear what the base of those proportions is. Is it the total population in a zip code? The total Facebook users? The total potential baseball fans?

As I have said elsewhere, Facebook data is often taken to be "N=All". This is an assumption, not a fact of the data. Different baseball teams may have different social-media/Facebook strategies. Different teams may have different types of fans, who are more/less likely to be on Facebook. This is particularly true of cross-town rivals.

Apart from the obvious problem of brands buying or otherwise managing Likes, a "Like" is a binary metric that doesn't measure fan fervor. It is also a static measure, as I don't believe Facebook users actively manage their lists of Likes (please correct me if I am wrong about this behavior).

We are not provided any real numbers, and none of the maps have scales. Unless we see some absolute counts, it is hard to know if the data make sense relative to other measures of fandom, like merchandise and ticket sales. With Facebook data, it is sometimes possible to have too much; in other words, you might find there are more team fans than potential baseball fans, or even more than the population, in a specific zip code.

It is very likely that Facebook, the source of the aggregated data, did not want raw counts published. This is par for the course for the Internet giants, and also something I find completely baffling. Here are the evangelizers of "privacy is dead", and they stockpile our data, and yet they lock the data up in their data centers, away from our reach. Does that make any sense?


The Numbers Guy went on vacation

Carl Bialik used to be the Numbers Guy at the Wall Street Journal - he's now with FiveThirtyEight. Apparently, he left a huge void. John Eppley sent me this set of charts via Twitter.

This chart about Citibike is very disappointing.

Ss_spincity

Using the Trifecta checkup, I first notice that it addresses a stale question and produces a stale answer. The caption below the chart says "the peak times ... seem to be around 9 am and 6 pm." What a shock!

I sense a degree of meekness in using "seem to be". There is not much to inspire confidence in the data: rather than the full statistics, which you'd think someone at Citibike has, the chart is based on "a two-day sample last autumn". The number of days is less concerning than the question of whether those two autumn days are representative of the year. Curious readers might want to know what data was collected, how it was collected, and the sample size.

Finally, the graph makes a mess of the data. While the black line appears to be data-rich, it is not. In fact, the blue dots might as well be randomly scattered and connected. As you can see from the annotations below, the scale of the chart makes no sense.

Jc_wsj_citibike

Plus, the execution is sloppy, with a missing data label.

***

 The next chart is not much better.

Wsj_babybumps

The biggest howler is the choice of pie charts to illustrate three numbers that are not that different.

But I have to say the chart raises more questions than it answers. I am not an expert in pregnancy, but doesn't a pregnant woman's weight include the weight of the baby she's carrying? So the more weight the woman gains, on average, the heavier her baby is. What a shock!

***

The last and maybe the least is this chart about basketball players in the playoffs.

Wsj_fabfive

It's the dreaded bubble chart. The players are arranged in a perplexing order. I wonder if there is a natural numbering system for basketball positions (center = #1, etc.), like there is in soccer. Even if there is such a natural numbering system, I still question the decision to confound that system with a complicated ranking of current-year playoff players against all-time players.

Above all, the question being asked is uninteresting, and so the chart is uninformative. A more interesting question to me is whether the best players are playing in this year's playoff. To answer this question, the designer should be comparing only currently active players, and showing the all-time ranks of those players who are playing in the playoffs versus those who aren't.

 


Law of small numbers, in action

Loyal reader John M. expressed dismay over Twitter about 538's excessive use of bubble charts. Here's the picture that pushed John over the edge:

538-morris-datalab-trout

The associated article is here.

The question on the table is motivated by the extraordinary performance of a young baseball player Mike Trout. The early success can be interpreted either as evidence of future potential or as evidence of a future drought. As an analogy, someone wins a lottery. You can argue that the odds are so low that winning again is impossible. Or you can argue that winning once indicates that this person is "lucky" and lucky people might win again.

The chart shows the proportion of players who performed even better after the initial success, given the age at which they first broke out. One way to read this chart is to mentally replace the bubbles with dots (or columns), and then interpret the size of the bubbles as the statistical significance of the corresponding probability estimate. The legend says number of players, which is the sample size, which governs the error bar associated with that particular number.

This bubble chart is no different from others: it is impossible to judge the relative sizes of bubbles. Even though the legend provides us two reference points (a nice enough idea on its own), it is still impossible to know, for example, what proportion of players did better later in life when they first peaked at age 24. The bubble for age 23 looks like it's exactly five players but I still cannot figure out how many players the adjacent bubble represents.

The designer should have just replaced each bubble with an error bar; the chart would instantly be more readable. (I have another version of this at the end of the post.)
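The error bar for a proportion follows directly from the sample size. A minimal sketch using the normal approximation (the counts below are invented; the point is how wide the bar gets when n is five):

```python
import math

def proportion_with_error(successes, n, z=1.96):
    """Point estimate and half-width of a 95% normal-approximation interval."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

# Hypothetical: 2 of 5 players improved after their early peak
p, hw = proportion_with_error(2, 5)
print(f"{p:.0%} +/- {hw:.0%}")
```

With samples this small, the half-width exceeds the estimate itself, which says visually what the bubble sizes only hint at.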

The rest of the design elements are clean and well done, particularly the use of notes to point out interesting aspects of the data.

***

From a Trifecta checkup perspective, I am uncertain about the nature of the data used to investigate the interesting question posed above.

Readers should note that the concepts of "early success" and "later success" are not universally defined. The author here selects two proxies. Reaching an early peak is equated with "batters first posting 15+ WAR over two seasons". Next, reversion to the mean is defined as not having a better two-year span subsequent to the aforementioned early peak.

Why two seasons? Why WAR and not a different metric? Why 15 as the cutoff? These are all design decisions made while working with the data.

One can make reasonable arguments to justify those design decisions. A bigger head-scratcher relates to the horizontal axis, which identifies the first time a player reaches his "early peak," as defined above. The way the chart is set up, it is almost preordained to exhibit a negative slope. The older the player is when he reaches his first peak, the fewer years remain in his playing career to emulate or surpass that feat.

This last point is nicely illustrated in the next chart of the article:

538-morris-datalab-trout2

This chart is excellent on many levels. It's not clear, though, whether it says anything other than that players age.

***

Near the end of the post, the author rightly pointed out that "there’s not really enough data to demonstrate this effect". Going back to the first chart, it appears that no single bubble contains a double-digit count of players. So every sample size is between one and, say, seven. We should be wary of conclusions based on so little data.

It's always fun to find examples of the Law of Small Numbers, courtesy of Kahneman & Tversky.

***

Here is a sketch of how I might re-make the first chart (I made up data; see the note below).

Redo_538_miketrout

While making this chart, I realized another issue with the original bubble chart. When the proportion of players improving on their early peak is zero percent, the number of players who did not make it is quite hidden. In the revised chart, this is clearly seen (look at age 22).

Note: I wonder if I totally missed the point of the original chart.... I actually had trouble eyeballing the data so I ended up making up numbers. The bubble at age 22 looks like it should stand for 5 players and yet it sits at precisely 50%, which would map to 2.5 players. If I assume the 22 bubble to be 4 players, then I don't know what the 26 bubble is. If it is 4 players also, then the minimum non-zero proportion should have been 1/4, but the bubble clearly lies below 25%. If it is 3 players, the minimum non-zero proportion is 1/3, which should be at 33%.

 


There's nothing wrong with Eli Manning on this chart

The Giants QB Eli Manning is in the news for the wrong reason this season. His hometown paper, the New York Times, looked the other way, focusing on one metric at which he still excels: longevity. He is like football's Cal Ripken. The graphic (link), though, is fun to look at while managing to put Eli's streak in context. It is a great illustration of handling foreground/background issues. (I had to snip the bottom of the chart.)

Nyt_elimanning

After playing around with this graphic, please go read Kevin Quealy's behind-the-scenes description of the various looks that were discarded (link). He showed 19 sketches of the data. The importance of sketching cannot be stressed enough. If you don't have discarded sketches, you don't have a great chart.

Pay attention to tradeoffs that are being made along the way. For example, one of the sketches showed the proportion of possible games started:

Cnt_elimanning

I like this chart quite a bit. The final selection arranges the data by team rather than by player so necessarily, the information about proportion of possible games started fell by the wayside.

(Disclosure: I'm on Team Philip. Good to see that he is right there with Eli even on this metric.)

 

 


Beautiful spider loses its way

On Twitter, Andy C. (@AnkoNako) asked me to look at this pretty creation at NFL.com (link).

Nfl_spiderweb

There is a reason why you don't read much about spider charts (web charts, radar charts, etc.) here. While this chart is beautifully constructed, and fun to play with, it just doesn't work as a vehicle for communication.

This example above allows us to compare four players (here, quarterbacks) on eight metrics. Each white polygon represents one player, and the orange outline represents the league average quarterback. 

What are some of the questions one might have about comparing quarterbacks?

  • Who is the best quarterback, and who is the worst?
  • Who is the better passer? (ignoring other skills, like rushing ability)
  • Is each quarterback better or worse than the average quarterback?

How will you figure these out from the spider chart?

  • Not sure. The relative value of the quarterbacks is definitely not encoded in the shape of the polygon, nor the area. To really figure this out, you'd need to look at each of the eight spokes independently, and then aggregate the comparisons in your head. Unless... you are willing to ignore seven of the eight metrics, and just look at passer rating (below right).
  • Focusing on passing only means focusing on five of the eight metrics, from pass attempts to interceptions. How to combine five metrics into one evaluation is your own guess.
  • One can tell that Joe Flacco is basically the average quarterback, as his contour is almost exactly that of the average (orange outline). Are the others better or worse than average? Hard to tell at first glance.

***

There are a number of statistical points worth noting.

First, the chart invites users to place equal emphasis on each of the eight dimensions. (There is a control to remove dimensions.) But the metrics are clearly not equally important. You certainly should value passing yards more than rushing yards, for example.

Second, the chart ignores the correlation between these eight metrics. The easiest way to see this is the "Passer Rating", which is a formula combining Passing Attempts, Passing Completions, Interceptions, Touchdown Passes, and Passing Yards. Yes, all five of those components have been separately plotted. Another easy way to see the problem is that Passing Yards are highly correlated with Passing Attempts and Passing Completions.
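To make the redundancy concrete, here is the standard NFL passer rating computation: four clamped components built from the same per-attempt statistics already occupying separate spokes (the sample stat line below is invented):

```python
def passer_rating(att, comp, yds, td, ints):
    """NFL passer rating: four components, each clamped to [0, 2.375],
    averaged and scaled to a 0-158.3 range."""
    clamp = lambda x: max(0.0, min(x, 2.375))
    a = clamp((comp / att - 0.3) * 5)       # completion percentage
    b = clamp((yds / att - 3) * 0.25)       # yards per attempt
    c = clamp(td / att * 20)                # touchdown rate
    d = clamp(2.375 - ints / att * 25)      # interception rate
    return (a + b + c + d) / 6 * 100

# A statistically perfect game maxes out the rating
print(round(passer_rating(att=40, comp=31, yds=500, td=5, ints=0), 1))
```

Since the rating is a deterministic function of five other spokes, plotting all six treats one quantity as if it were independent information.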

Third, the chart fails to account for different types of quarterbacks. I deliberately chose these four because Joe Flacco was a starter, Tyrod Taylor was a backup who almost never played, while at San Francisco, Alex Smith and Colin Kaepernick shared the starting duties. So for Passing Yards, the numbers were 3817, 179, 1737 and 1814 respectively. Those numbers should not be directly compared. Better statistics are something like yards per minute played, yards per offensive series, yards per plays executed, etc. The way that this data is used here, all the second- and third-string quarterbacks will be below average and most of the starters will be above average.

***

From a design perspective, there are a small number of misses.

Mysteriously, the legend always has only two colors no matter how many players are being compared. The orange is labeled Average while the white is labeled "Leader". I have no idea why any of the players should be considered the "Leader".

The only way to know which white polygon represents which player is to hover on the polygon itself. You'll notice that in my example, several of those polygons overlap substantially so sometimes, hovering is not a task easily accomplished.

The last issue is scale. It turns out that some of the metrics, like interceptions, touchdown passes, and rushing yards, can be zero. Take a look at this subset of the chart where I hovered on Tyrod Taylor.

Nfl_spider_zeroes

Do you see the problem? The zero point is definitely not the center of the circle. This problem exists for any circular chart, including bubble charts.

Now look at Interceptions. Because the scale is reversed (lower is better), the zero point of this metric lies on the outer edge of the circle. This is a vexing issue because the radius is open-ended on the outside but closed-ended on the inside.

***

In the next post, I will discuss some alternative presentation of this data.