A lot of Big Data analyses default to analyzing count data, e.g. number of searches for certain keywords, number of page views, number of clicks, number of complaints, etc. Doing so throws away much useful information, and frequently leads to bad analyses.
***
I was reminded of the limitation of count data when writing about the following chart, which I praised on my sister blog as a good example of infographics, a genre chock-full of deplorable things.
On the other blog, I explained why, from a dataviz perspective, I prefer to hide the actual numbers.
There is also a statistical reason for not drawing undue attention to the counts.
These counts do not indicate the severity of the injuries: some may have knocked the player out of the game, others may have been much milder. Nor do they account for exposure: first-team players spend a much longer time on the field than backups, and so accumulate more injuries even if their underlying injury rate is no higher.
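To make the exposure point concrete, here is a minimal sketch in Python; the roles, injury counts, and minutes played are entirely made up, not taken from the chart:

```python
# Made-up data: raw injury counts reflect playing time as much as risk.
players = [
    # (role, injuries, minutes played) -- all numbers hypothetical
    ("starter", 6, 3000),
    ("backup",  2,  600),
]

for role, injuries, minutes in players:
    rate = injuries / minutes * 90  # injuries per 90 minutes on the field
    print(f"{role}: {injuries} injuries, {rate:.2f} per 90 minutes")

# The starter has three times as many injuries, but the backup's
# per-minute rate is higher: the raw count mostly measures exposure.
```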
Another statistical consideration is heterogeneity. I'd like to see a small-multiples version of this chart, with the data split by position on the field, as sketched below. I suspect it would be quite telling to see which body parts get hurt more often depending on one's role in the game. Similarly, splitting by age, body size, and other factors will yield interesting insights.
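As a rough illustration, the split-by-position view amounts to a cross-tabulation; the records and column names below are hypothetical stand-ins for whatever the chart's underlying data look like:

```python
import pandas as pd

# Hypothetical injury records: one row per injury.
df = pd.DataFrame({
    "position":  ["goalkeeper", "defender", "defender", "forward",   "forward"],
    "body_part": ["shoulder",   "knee",     "ankle",    "hamstring", "knee"],
})

# One row of counts per position -- the tabular analogue of a
# small-multiples chart; each row could become its own panel.
print(pd.crosstab(df["position"], df["body_part"]))
```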
***
At about the same time, I was reading the July issue of Significance magazine (an RSS and ASA publication). Here is the link (not free).
In an article about assessing whether iceberg risk was particularly high in the year of the Titanic, the authors quantified the risk in terms of "number of icebergs crossing latitude 48 N each year". It'd seem worthwhile to ask whether the size distribution of those icebergs also matters: counting alone treats every iceberg as equally dangerous.
Then, in an article about "black box modeling" (i.e. data mining) by Max Kuhn and Kjell Johnson, they invoke the example of the FDA adverse event reporting database, an instance of "events data". Events data is everywhere these days, and the most popular analyses of such data revolve around counting the number of adverse events. The severity and type of the events are frequently ignored.
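Here is a minimal sketch of the difference a severity weighting can make; the drugs, event types, and numeric severity scale are all invented for illustration, not drawn from the FDA database:

```python
from collections import defaultdict

# Invented adverse-event records: a raw count treats a mild rash
# and a hospitalization as equivalent.
events = [
    {"drug": "A", "type": "rash",            "severity": 1},
    {"drug": "A", "type": "hospitalization", "severity": 5},
    {"drug": "B", "type": "rash",            "severity": 1},
    {"drug": "B", "type": "rash",            "severity": 1},
    {"drug": "B", "type": "rash",            "severity": 1},
]

counts, weighted = defaultdict(int), defaultdict(int)
for e in events:
    counts[e["drug"]] += 1
    weighted[e["drug"]] += e["severity"]  # severity-weighted tally

for drug in sorted(counts):
    print(drug, "raw count:", counts[drug], "weighted:", weighted[drug])

# Drug B looks worse by raw count (3 vs 2), but drug A looks worse
# once severity is taken into account (6 vs 3).
```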
P.S. In their otherwise gung-ho article, Kuhn and Johnson also point to one of the biggest challenges of OCCAM data: "If there is a systematic bias in a small data set, there will be a systematic bias in a larger data set, if the source is the same." If one is analyzing the FDA adverse events database, one presumably hopes to apply the learning to people who haven't yet had adverse events; but such an analysis would be flawed since the database doesn't have any controls, i.e. people who took the same drugs without suffering adverse reactions.