A postscript on the post about ProPublica's display of ambulance spending data on Junk Charts.
This chart (of which I excerpted the top) is used in support of an article exposing potential fraud by ambulance operators in New Jersey.
But the chart by itself is not convincing evidence of fraud. It presents a symptom, and that's really all exploratory analysis of observational data can realistically achieve.
The ProPublica investigation would not be complete without a reporter digging up the dirt on ambulance companies overcharging for rides, and literally taking people along for rides charged to insurers. The investigation focuses on New Jersey, perhaps the worst offender, but several other states also show suspicious statistics.
Statistical investigations are important but rarely conclusive. We can determine with high probability that something fishy has happened. Because the data is observational, we typically cannot pinpoint the cause, i.e. fraud. This is true whether you are trying to explain test scores or expose lottery fraud (something I cover in Numbers Rule Your World). So, some legwork, "shoe leather", is needed for confirmation.
One component is still missing from the visual analytics work by ProPublica. This is the first C in my OCCAM framework. The dataset has no controls, and as yet, there does not appear to be any attempt to create appropriate controls. This requires domain expertise, available through collaboration with people in the industry.
The issue is that we don't know whether the types and circumstances of diabetes patients are similar across states. We should verify that the demographics of patients and the characteristics of hospitals are indeed comparable between states. I could also imagine that the availability of other forms of transportation, the distance between patients and hospitals, and so on, are all potential factors explaining the variance between states.
The current analysis makes the implicit assumption that there are no meaningful differences of these types, thus attributing all observed variations to fraud.
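To make the point about controls concrete, here is a minimal sketch in Python. Every number in it is made up for illustration; none of it comes from ProPublica's data. The two states, the "near"/"far" distance bands, and the function names are all my own assumptions. It shows how two states could have identical spending per patient once you control for distance to the hospital, and still look wildly different in the uncontrolled comparison.

```python
# Hypothetical, made-up numbers for illustration only; NOT ProPublica's data.
# Each record: (state, distance_band, patients, total_ambulance_spending)
records = [
    ("State A", "near", 800, 80_000),
    ("State A", "far", 200, 100_000),
    ("State B", "near", 200, 20_000),
    ("State B", "far", 800, 400_000),
]


def spending_per_patient(rows):
    """Total spending divided by total patients over the given rows."""
    patients = sum(r[2] for r in rows)
    spending = sum(r[3] for r in rows)
    return spending / patients


def raw_rates(rows):
    """Uncontrolled comparison: one spending-per-patient figure per state."""
    states = sorted({r[0] for r in rows})
    return {s: spending_per_patient([r for r in rows if r[0] == s]) for s in states}


def stratified_rates(rows):
    """Controlled comparison: spending per patient within each distance band."""
    return {(state, band): spending / patients
            for state, band, patients, spending in rows}


raw = raw_rates(records)
strata = stratified_rates(records)

for state, rate in raw.items():
    print(f"{state}: {rate:.0f} per patient (no controls)")
for (state, band), rate in sorted(strata.items()):
    print(f"{state}, {band}: {rate:.0f} per patient")
```

In this toy example, State B appears to spend more than twice as much per patient as State A, yet within each distance band the two states are identical. The whole gap comes from State B having more patients who live far from a hospital. Real data would rarely be this clean, and the right stratifying variables are exactly what the domain experts would have to tell us.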
One of the biggest myths of Big Data is that data alone produce complete answers.