
Football madness

The article about football (soccer) analytics discussed at the sister blog contains a few pictures.

Here is an example:


And here is the legend made legible:


Too much art, not enough science. (See this post.)

I wish the designer had lost some of the data. The graphic would stand a better chance of succeeding if the unimportant bits were not shown, or were faded out. Giving every piece of information equal status, whether it's a shot on goal or a dribble, is another way to distort information. It downplays the important information while overstressing the filler.

Colors shouldn't be assigned at random. They should surface patterns. Make the Barca data visually distinct from Man U's data. Similarly, unify the numerous statistics on goal-keeping.

A more subtle misstep is mixing up whites and blanks. According to the legend, blank means ball out of play while white means offside (for either team) but readers can't tell these two apart. The whitespace looks like gaps.

For me, this is a lost opportunity. Visual exploration of data is a very powerful concept; it can guide further analysis and even guide the construction of mathematical models. But the visual has to help organize the information. Here it didn't.



Return this plate, I want my pie chart

Reader Brad E. reminds me about the USDA's attempt to "improve" the visual presentation of dietary standards. As reported here, the food pyramid failed its mission and is retired. Here comes MyPlate!

According to this report, the government wants to impart these key points:

  • MyPlate offers a visual reminder to make healthy food choices when you choose your next meal.
  • It can help prioritize food choices and remind us to make fruits and vegetables half of our plates each meal.
  • On the other side of the plate – and beside it – we see the other important food groups for a healthy meal: whole grains, lean proteins, and low fat dairy.

We have been warned not to think of this as a pie chart.

What do we call a circular chart in which the area of the circle is partitioned into separate regions?


How does one say return MyPlate to the kitchen fast enough? The biggest problem here is that the key points are out of sync with the chart details. The MyPlate diet, as depicted, has less than the recommended amounts of fruits and vegetables! Since fruits and vegetables together take up only as much of the plate as grains and proteins, the presence of dairy beside the plate means that fruits and vegetables form less than half of this diet.
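The arithmetic behind that complaint can be sketched in a few lines of Python. The sizes below are hypothetical, since the chart gives no numbers; the only thing taken from MyPlate is that the fruit-and-vegetable half of the plate is drawn equal to the grain-and-protein half, with dairy added on the side.

```python
def fruit_veg_share(plate_half: float, dairy: float) -> float:
    """Share of the whole diet taken up by fruits and vegetables.

    plate_half: size of the fruit-and-vegetable half of the plate, which
                the chart draws as equal to the grain-and-protein half.
    dairy:      size of the dairy circle beside the plate (same units).
    """
    total = 2 * plate_half + dairy
    return plate_half / total

# Hypothetical sizes, for illustration only: the chart supplies no numbers.
share = fruit_veg_share(plate_half=1.0, dairy=0.25)
print(f"{share:.1%}")  # prints 44.4%
```

Whatever size one assumes for the dairy circle, as long as it is bigger than zero the share falls below 50 percent.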

The core message is that one should split one's diet in half, with fruits and veggies on the one side and grains, proteins and dairy on the other. If this is so, the following chart gets this point across with minimal probability of confusion:


If, on the other hand, the above chart is deemed too simple, and the message really does require proportions of each of the five food categories, then the sad truth is that a pie chart would have conveyed the message better.


MyPlate serves up strange portions that cannot be properly sized. How big is the "dairy" circle compared to any of the quadrants? How does one judge the irregularly-sized quadrants (grains and vegetables)? If there is any use for the pie chart, it is to display simple concepts with limited dimensions.

Want a signed book?

JMP is giving away signed copies of Numbers Rule Your World.  See details here.

JMP is a great piece of software for those who like to point and click, drag things around, and build models interactively. People I hire who are analytical but don't have proper statistical training seem to enjoy using it and produce good work from it. There is other similar software on the market; I haven't tried any of it so I don't know if it is better or worse, but I can say I have had a pleasant time with JMP.


Speaking of which, if you haven't already, do subscribe to my sister blog, where I discuss the  statistical thinking behind everything that's happening around us.

The RSS feed: here. The twitter feed combines the two blogs.


Drugged-up American graphic

Reader Chris P. found this chart on Visualizing.org, which is one of those sites that invite anyone to contribute graphics to it:


It looks like the designer has taken Tufte's advice of maximizing the data-ink ratio too literally. There are many, many things going on in a tight space, which leaves the reader feeling drugged-up and cloudy.

From a cosmetic standpoint, fixing the following would help a lot:

  • Make fonts 1-2 points larger in all cases, especially the text on the left hand side
  • Use colors judiciously to stress the key data. In this version, the trends, which are more interesting, are shown in pale gray while the raw data, which are not very exciting, are shown in loud red. Just flip the gray with the red. 
  • Rethink the American flag motif: is drug abuse a uniquely American phenomenon? Should data about the American people always be accompanied by the American flag?
  • Separately present in two charts the time-series data on total arrests, and the cross-sectional data (2008)

Also, realize that by forcing the data into the 50-star configuration, one arbitrarily decides that the data should be rounded to 2-percent buckets (see right).
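Here is a minimal Python sketch of what the 50-star constraint does to a number. The function names and the sample percentages are made up for illustration; only the count of 50 stars comes from the chart.

```python
import math

STARS = 50  # stars on the flag, so each star stands for 100 / 50 = 2 percent

def to_stars(percent: float) -> int:
    """Number of whole stars used to show `percent` (halves round up)."""
    return math.floor(percent * STARS / 100 + 0.5)

def shown_percent(percent: float) -> float:
    """The percentage a reader would read back off the stars."""
    return to_stars(percent) * 100 / STARS

# Hypothetical values, not taken from the chart.
for value in (37.3, 12.9):
    print(value, "->", to_stars(value), "stars, read back as", shown_percent(value))
```

Anything that is not a multiple of 2 percent gets pushed to the nearest one, so the read-back value can be off by up to a full percentage point.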

And always ask the fundamental question: what makes this data tick?


As I explored the data, I noticed various arithmetic problems. For example, the arrests by race analysis is itself split into two parts: White/black/Indian/Asian add up to 100 percent and then Hispanic Latino and Latino non Hispanic add up to 100 percent. In some surveys, Hispanics are counted within whites but that doesn't seem to be the case here. The numbers just don't add up.

Also, adding the types of drugs involved does not yield the total number of arrests. Perhaps the category of "others" has been omitted without comment. For now, I closed my eyes to these problems and proceeded to make a chart out of this.


The new version focuses on one insight: that certain races seem to get arrested for certain drugs. For any given drug, the relative incidence of arrests is not similar among the races. Asians and Native Americans appear to have higher proportions of people arrested for marijuana or meth while blacks are much more likely to be arrested for crack.


You're going to need to click on the chart for the large version to see the text.

Doing this chart gives me another chance to plug the Profile chart. We deliberately connect the categorical data with lines. The lines are not meant to mean anything; they are meant to guide our eyes towards the important features of the chart.

One can sometimes superimpose all the lines onto the same plot but the canvas clogs up quickly with more lines, and then a small-multiples presentation like this one is preferred.

We have a temptation to generalize arrest data to talk about drug habits by race but if you intend to do so, bear in mind that arrests need not correlate strongly with usage.

Small data sets present graphing challenges

Felix Salmon, a blogger and foodie, investigated whether a restaurant changes its pricing based on the number of stars it gets from Sam Sifton, the New York Times' food critic. His conclusion is that "price hikes happen all over the place, from the worst-reviewed restaurants to the best." This plot was used in the post.

His message doesn't jump out of his chart. We would have to recognize that it's the dark green pieces we should be focused on, and it's the relative heights of these pieces within each stacked column. I was also misdirected by the two axis labels: number of stars and number of reviews aren't the primary dimensions. So, I thought one could find a better alternative.


These data turn out to be harder to plot than expected. The problem is that the sample size is small, and because of this, the data have ragged edges. We are better at reading patterns from smooth objects.

Here is what I ended up with, a small multiples chart with grouped columns. I adopted Felix's color scheme although no differentiation of color is really necessary in this version. Relative percentages are plotted instead of raw number of reviews. Each set of four columns can be viewed as a histogram or probability distribution. (Again, with more samples, the histograms will look smoother, revealing the pattern more clearly.)
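For readers who want to try the same transformation, here is a rough Python sketch of turning raw review counts into relative percentages within each star rating. The counts and the four outcome names are invented for illustration; they are not the numbers behind Felix's chart.

```python
# Hypothetical review counts, invented for illustration. These are NOT
# the numbers behind the chart, and the outcome names are assumed too.
counts_by_stars = {
    "0 stars": {"raised": 1, "unchanged": 2, "lowered": 0, "closed": 1},
    "2 stars": {"raised": 8, "unchanged": 10, "lowered": 2, "closed": 0},
    "4 stars": {"raised": 2, "unchanged": 0, "lowered": 0, "closed": 0},
}

def to_percentages(counts):
    """Rescale one group's counts so that its columns sum to 100."""
    total = sum(counts.values())
    if total == 0:
        return {outcome: 0.0 for outcome in counts}
    return {outcome: 100 * n / total for outcome, n in counts.items()}

percent_by_stars = {
    stars: to_percentages(counts) for stars, counts in counts_by_stars.items()
}

for stars, percents in percent_by_stars.items():
    print(stars, {outcome: round(p, 1) for outcome, p in percents.items()})
```

Because every group is rescaled to 100, a rating with two reviews and a rating with twenty can be compared column by column, which is the point of plotting percentages. The price is that the small groups look just as solid as the large ones, so the sample sizes deserve a mention somewhere on the chart.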

I agree with Felix that there is not much correlation between star rating and pricing. However, this truly applies only to the middle three categories. At the edges, there are a couple of observations: all of the 4-star restaurants hiked their prices while the only restaurant that closed since it got reviewed received zero stars.

I'm a fan of annotating charts and so I'd recommend sticking a note on the 4 stars column, another note on the single gray column, and a third note bracketing the middle three categories, telling readers that there is nothing to see here.


Another reminder: most pie charts are unreadable

I'm outsourcing today's post to reader Joel D., who describes himself as a geographer, professional researcher and son of a graphics designer. He wrote a wonderful entry about the following pie chart:


 You know this "big graphic" is in big trouble when the caption tells you only 4 clubs made a profit and yet you'd be hard pressed to locate those 4 clubs from the chart itself!

The other telling sign is using blocks to represent profit/loss. This never works whether the blocks are pie sectors or stacked bars. Line charts are much better.

Joel has much more commentary, and an improved chart here.  Enjoy!

All potatoes are not born equal, says chart

Also from Consumer Strategist magazine comes the following chart about "PotatoPacks". (To their credit, the magazine uses a lot of charts, most of which are completely harmless.)


This is a good example of what Tufte calls a low data-ink ratio. There are exactly 10 pieces of data on this chart: the number of potato packs and the market share for each year of a five-year period.

Many resources have been thrown at the problem of showing growth: it's a surround-sound treatment with loudspeakers. The potatoes, the gridlines, the axis, the data labels. And yet, it's unclear what the message is.

According to the title, "both sales volume and market share have steadily increased since the introduction of the PotatoPack." It would have been a very nice touch to add a little arrow letting us know when PotatoPack was introduced. Was it in 2006, the starting point of the data set? Or was it in 2007, when the sales volume started to increase?

There is an unintended message. It's that all potatoes are not born equal!

Take a look at the two stacks labeled 572 and 493. How is it that 572 gets us 4 potatoes, and 493 gets us 3 potatoes? So for 2006, each potato is worth 143 packs while in 2007, it's worth 164 packs.
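The division is easy to check. A small Python sketch, using only the two labels and potato counts quoted above (the variable names are mine):

```python
# Label on each stack and the number of potato icons drawn for it,
# as read off the chart.
stacks = {2006: (572, 4), 2007: (493, 3)}

packs_per_potato = {
    year: packs / potatoes for year, (packs, potatoes) in stacks.items()
}

for year, value in packs_per_potato.items():
    print(year, round(value))  # prints "2006 143" then "2007 164"
```

If each potato icon stood for a fixed number of packs, the two ratios would be the same.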


For 2010, they plotted projected data, which is exactly how it should be done. 

The following chart shows the year-to-year growth rate of the PotatoPack sales relative to the growth rate of the entire market. This may be the more interesting aspect of this data set.