The most famous Excel spreadsheet error in recent memory is the Reinhart-Rogoff analysis asserting that countries with high debt-to-GDP ratios experience slow growth.
The latest entrant is the National Highway Traffic Safety Administration's (NHTSA) analysis, based on data supplied by Tesla, which concluded that the "Autosteer" feature reduces crash rates by 40%. NHTSA no longer stands behind that analysis, after it was debunked by a consulting firm called Quality Control Systems, which spent two years fighting to get the underlying data released.
The error can happen not just in Excel but in any analytical tool, because it concerns how missing data are treated. First, the analyst has to notice that values are missing. Then, the analyst has to recognize - through further analysis - that the records with missing values are not like the records without them. Finally, the analyst has to treat the missing values appropriately - sometimes they can be dropped; other times, imputed.
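As a minimal sketch of those three steps - in pandas, with made-up mileage numbers, not the Tesla data - the workflow might look like:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: pre-period exposure miles with some blanks.
df = pd.DataFrame({
    "pre_miles":  [1200.0, np.nan, 3500.0, np.nan, 800.0, np.nan],
    "post_miles": [2000.0, 4100.0, 1500.0, 3900.0, 2200.0, 5000.0],
})

# Step 1: notice that values are missing.
n_missing = df["pre_miles"].isna().sum()

# Step 2: check whether records with missing values look like the rest
# (here, by comparing their post-period mileage).
post_when_missing = df.loc[df["pre_miles"].isna(), "post_miles"].mean()
post_when_present = df.loc[df["pre_miles"].notna(), "post_miles"].mean()

# Step 3: choose a treatment - drop the rows, or impute
# (here, filling with the column mean).
imputed = df["pre_miles"].fillna(df["pre_miles"].mean())
```

The point of step 2 is that if the two groups differ systematically, neither dropping nor naive imputation is automatically safe.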
The original analysis rests on a simple before-and-after comparison. Autosteer hardware was included in Tesla cars from 2014, but the feature was not enabled until October 2015. So the analytical plan is to compare the same cars before and after the feature was enabled. (A car could have the feature enabled but not in use; that state is not distinguished in the analysis.) The outcome metric - crash rate - is a measure of airbag activation, based on what one sees in the spreadsheet. Notably, airbag activation is recorded as a TRUE/FALSE value, so a car whose airbags deployed more than once counts only once.
According to the original NHTSA report, the cars in the dataset had a crash rate of 1.3 per million miles before Autosteer and 0.8 per million miles after - a drop of roughly 40%, hence the headline claim.
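The headline figure is just arithmetic on the two reported rates:

```python
# Rates from the NHTSA report, in crashes per million miles.
pre_rate, post_rate = 1.3, 0.8

# 1 - 0.8/1.3 is about 0.385, which rounds up to the "40%" claim.
reduction = 1 - post_rate / pre_rate
```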
***
The consultant who obtained the dataset noticed two worrying problems in the data (supplied by Tesla):
- Two-thirds of the cars have empty cells for pre-Autosteer miles but non-empty cells for post-Autosteer miles. It's implausible that these cars were not driven at all until Autosteer was activated.
- 47% of the cars show different mileage readings just before and just after Autosteer activation. These two values should be identical if the moment of activation was clearly identified for each car. The gaps add up to 134 million miles that were driven but counted in neither the pre- nor the post-Autosteer totals.
The NHTSA analysis used zero imputation, meaning it assumed that those two-thirds of the cars had zero pre-Autosteer miles. Mean imputation would instead have added 129 million miles to the pre-Autosteer total (the average pre-Autosteer mileage among cars with data was 4,426 miles). Since all crashes were already counted, the additional exposure miles drop the crash rate from the reported 1.3 per million miles to 0.4 - wiping out the entire 40% reduction and reversing the direction of the effect!
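To see the mechanics, here is a toy calculation with invented crash and car counts (only the 4,426-mile average comes from the post); it shows how zero imputation shrinks the denominator and inflates the pre-period rate:

```python
# Hypothetical figures, NOT the actual Tesla data.
crashes = 60               # crashes in the pre-activation period (invented)
reported_miles = 45e6      # miles for the cars that have pre-period data (invented)
n_missing_cars = 20_000    # cars whose pre-activation miles are blank (invented)
mean_miles = 4426          # average pre-period miles among cars with data (from the post)

# Zero imputation: cars with blank cells contribute no exposure miles,
# so the denominator stays small and the rate comes out high.
rate_zero = crashes / (reported_miles / 1e6)

# Mean imputation: credit each missing car with the average mileage,
# growing the denominator and lowering the rate.
imputed_miles = reported_miles + n_missing_cars * mean_miles
rate_mean = crashes / (imputed_miles / 1e6)
```

With these invented inputs the rate falls from about 1.3 to about 0.45 per million miles, mirroring the direction (though not the exact magnitude) of the correction described above.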
Now turn to the cars that show a gap between the just-before and just-after Autosteer mileage readings. We can hypothesize that having such a gap is uncorrelated with crash rate: if we segment the cars into those with a gap and those without, the crash propensity should be roughly the same in both groups. The data say otherwise: cars with a gap accounted for 60% of the pre-Autosteer crashes but only 48% of the post-Autosteer crashes. So the cars with unexplained mileage gaps are pushing up the pre-Autosteer rate relative to the post-Autosteer rate. This calls for investigating why those gaps exist, and why they are associated with crash propensity.
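A sketch of that segmentation check, treating the quoted percentages as crash counts out of 100 in each period (the real counts aren't in this post):

```python
# Crash counts per 100 crashes in each period, taken from the quoted
# percentages; the actual raw counts are not given here.
pre_crashes  = {"gap": 60, "no_gap": 40}
post_crashes = {"gap": 48, "no_gap": 52}

def gap_share(crashes):
    """Fraction of crashes attributed to cars with a mileage gap."""
    return crashes["gap"] / sum(crashes.values())

pre_gap_share = gap_share(pre_crashes)    # 0.60
post_gap_share = gap_share(post_crashes)  # 0.48

# Under the no-correlation hypothesis this should be near zero;
# a 12-point swing is the red flag worth investigating.
discrepancy = pre_gap_share - post_gap_share
```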
We still can't decide where to place the 134 million miles of unaccounted-for driving. That requires understanding the data collection process - which leads us back to the realization that every number in the spreadsheet was supplied to NHTSA by Tesla and never audited.
I knew the news but I was unable to form an opinion. Thank you very much for your clear explanation!
Posted by: Antonio Rinaldi | 03/05/2019 at 04:43 AM
Big Data=I have all this data but most of it is wrong
Posted by: Ken | 03/09/2019 at 10:52 PM