Erik left a comment on a different post, pointing us to this informative article about the "data revolution" in football (soccer). This is a topic I have written about before as I follow the sport (here and here).
Given the breadth of the article, I shall limit my comments to what I think are the most interesting points.
The example of baseball à la Moneyball has been cited often. More slowly, the same kind of data analytics has spread to basketball, which is regarded as an example of a "fluid game". Billy Beane, the hero of Moneyball, is cited as saying "If it can be done there, it can be done on the soccer field."
Billy, it ain't so. Statistics is, as Andrew Gelman likes to say, about methods. There is truly a plethora of statistical methods. The reason, strange as it may seem, is that one method may work superbly well on a particular problem but fail miserably when applied to another. Linear regression is probably one of the most widely applied methods, but even it does not solve all problems. Researchers have so far been stumped in learning why a given method succeeds or fails for a given problem.
I believe football analytics needs its own methods. Basketball is different from football in several important ways: the number of points scored is much higher than the number of goals scored; the number of games played is much higher than the number of matches played; and the number of players on the court is less than half the number of footballers on the pitch. In each case, the basketball problem is made easier by virtue of a larger sample size.
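A toy simulation makes the sample-size point concrete. This is only a sketch under made-up assumptions: each side gets 100 independent scoring chances per game, and the conversion probabilities are illustrative numbers, not estimates from real data. The stronger side is better by the same 4:3 ratio in both settings; only the volume of scoring differs.

```python
import random

def strong_side_win_rate(p_strong, p_weak, chances=100, n_games=5000, seed=0):
    """Share of games won outright by the stronger side.

    Each side gets `chances` independent scoring chances per game and
    converts each one with its own probability (a crude binomial model).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        strong = sum(rng.random() < p_strong for _ in range(chances))
        weak = sum(rng.random() < p_weak for _ in range(chances))
        if strong > weak:
            wins += 1
    return wins / n_games

# Hypothetical conversion rates, chosen only for illustration.
football = strong_side_win_rate(0.016, 0.012)   # about 1.6 vs 1.2 goals per match
basketball = strong_side_win_rate(0.44, 0.33)   # about 44 vs 33 baskets per game
print(f"football: {football:.2f}, basketball: {basketball:.2f}")
```

Under these assumptions the better side wins the high-scoring game far more reliably than the low-scoring one, so a single football result says much less about underlying quality.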
Similarly, baseball cannot be compared to football because it is a static game: baseball is as close as you can get to a series of somewhat independent trials, which is the kind of thing for which probability and statistics were originally developed.
I'm glad to hear that the budding football statisticians have realized:
"many of the stats they had been trusting for years were useless. In any industry, people use the data they have. The data companies had initially calculated passes, tackles and kilometres per player, and so the clubs had used these numbers to judge players. However, it was becoming clear that these raw stats... mean little."
I wish the engineers at Google, Yahoo!, AOL, Groupon, LinkedIn, Netflix, etc. were reading this. It's not about how "big" the data is, it's not about how fast the data can be processed, it's about how relevant the data is. It's about knowing what you want to measure, and going out to measure those things. See also my previous post on football statistics.
Correlation is not causation. All of the analytics in sports have to do with correlations. It is very easy to confuse the two. A coach was cited as saying "there is a correlation between the number of sprints and winning". This might lead one to think "let's get our players to sprint more". Think about that conclusion for a moment. One has moved from correlation to causation.
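To see how the sprint example can mislead, consider a small simulation. It is a sketch built on an assumption of my own choosing, not on anything in the article: a hidden "team quality" factor drives both the number of sprints and the points won, while sprints themselves have no effect on points. All coefficients are invented for illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_teams=2000, extra_sprints=0.0, seed=1):
    """Return (sprints, points) per team under the hidden-quality assumption."""
    rng = random.Random(seed)
    sprints, points = [], []
    for _ in range(n_teams):
        quality = rng.gauss(0, 1)  # hidden confounder
        sprints.append(100 + 10 * quality + rng.gauss(0, 5) + extra_sprints)
        # Points depend on quality only, never on sprints (by assumption).
        points.append(50 + 12 * quality + rng.gauss(0, 6))
    return sprints, points

sprints, points = simulate()
print(f"correlation between sprints and points: {pearson(sprints, points):.2f}")

# "Intervention": order every team to sprint 20 more times per match.
_, points_after = simulate(extra_sprints=20.0)
print(f"mean points before: {sum(points) / len(points):.2f}")
print(f"mean points after:  {sum(points_after) / len(points_after):.2f}")
```

In this made-up world the correlation is strong, yet ordering more sprints changes nothing about results. Real football may or may not work this way; the point is that the correlation alone cannot tell you.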
As I argued in Chapter 2 of Numbers Rule Your World, correlational models are fine in many real-world applications. So I'm not debunking the whole field. I'm just cautioning against using an assumption of causality without recognizing it or validating it.
The most promising anecdote in the article concerns analyzing "sociograms", which describe who passes the ball to whom, who tends to start dangerous attacks and so on. I truly believe that in football, it's the pattern of interactions that is the key. If only I had the time, I would surely delve into this stuff.
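For readers who want a feel for what a sociogram holds, here is a minimal sketch. The pass list and position names are hypothetical, and counting passes per pair is just one simple way to represent the network, not the method the clubs in the article use.

```python
from collections import Counter

# Hypothetical pass log: (passer, receiver) pairs in the order they happened.
passes = [
    ("Keeper", "Back"), ("Back", "Mid"), ("Mid", "Winger"),
    ("Winger", "Striker"), ("Back", "Mid"), ("Mid", "Striker"),
    ("Mid", "Winger"), ("Back", "Mid"),
]

def build_sociogram(pass_log):
    """Count how often each passer finds each receiver (a directed edge list)."""
    return Counter(pass_log)

def involvement(sociogram):
    """Total passes made plus passes received, per player."""
    totals = Counter()
    for (passer, receiver), count in sociogram.items():
        totals[passer] += count
        totals[receiver] += count
    return totals

sociogram = build_sociogram(passes)
strongest_link, link_count = sociogram.most_common(1)[0]
hub, hub_count = involvement(sociogram).most_common(1)[0]
print(f"most frequent link: {strongest_link[0]} -> {strongest_link[1]} ({link_count} passes)")
print(f"most involved player: {hub} ({hub_count} passes made or received)")
```

Even this crude count picks out the busiest link and the player the play runs through; a real analysis would need pass outcomes and pitch locations to say which patterns lead to dangerous attacks.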