Figuring out the location (of the data)
Apr 29, 2013
When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.
When graphs are not done right, sometimes they manage to obscure the information.
Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.
My first reaction is that maybe the researchers are telling us there is no information here. The most important variable on this chart is what they call "datanum", and it runs from left to right across the page. A casual glance across the page gives the impression that nothing much is going on.
Then you look at the row labels and realize that this dataset is very well structured. The target variable (AP Position Error) is compared along four dimensions: datanum, the algorithm (WCL, or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).
When the data has a nice structure, there should be better ways to visualize it.
John submitted a much improved version, which he created using ggplot2.
This is essentially a small multiples chart. The key differences between the two charts are:
- Giving more dimensions a chance to shine
- Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
- Using a profile chart, which also allows the y-axis to start from 2
- One color versus six colors, and no chartjunk
- Using fewer decimal places
When you read this chart, you finally realize that the experiment has yielded several insights:
- Increasing sample size does not affect aggregate WCL error rate but it does reduce the error rate of aggregate GPR+WCL.
- The improvement of GPR+WCL comes only from the inner access points.
- The WCL algorithm performs really well on inner access points but poorly on outer (boundary) access points.
- The addition of GPR to the WCL algorithm improves the performance of outer access points but deteriorates the performance of inner access points. (In aggregate, it improves the performance... this is only because there are almost two outer access points to every inner access point.)
Now, I don't know anything about this position estimation problem. The chart leaves me wondering why they don't just use WCL on inner access points. The performance under that setting is far and away the best of all the tested settings.
The researchers described their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that because when I see 95% confidence, I expect to see confidence bands around the point estimates shown above.
And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precisely you pin down an estimate, the less confident you can be that it is right.
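One possible reading, offered as an assumption rather than something the paper confirms: in positioning work, "2drms" usually stands for twice the distance root mean square of the horizontal error, a radius expected to contain roughly 95% of the position estimates. Under that reading, the "95%" describes the spread of the errors, not a confidence band around each plotted point. A minimal Python sketch of that calculation follows; the function name and the sample errors are made up for illustration and are not taken from the paper.

```python
import math


def two_drms(errors_xy):
    """Twice the distance root mean square of 2-D position errors.

    errors_xy: list of (dx, dy) offsets between estimated and true
    positions, all in the same unit. The result is in that unit too.
    """
    if not errors_xy:
        raise ValueError("need at least one (dx, dy) error")
    # Mean of the squared radial errors, dx^2 + dy^2.
    mean_sq = sum(dx * dx + dy * dy for dx, dy in errors_xy) / len(errors_xy)
    return 2.0 * math.sqrt(mean_sq)


# Hypothetical errors, each with a radial distance of exactly 5.
sample = [(3.0, 4.0), (-3.0, -4.0), (0.0, 5.0), (5.0, 0.0)]
print(two_drms(sample))  # 10.0
```

If this is what the researchers computed, each number in their table is already a summary of spread, which would explain why no separate interval is drawn around it.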
Hi Kaiser -- thanks for posting my submission, as well as for your follow up questions which prompted me to attempt the above re-make. Initially, it was just a horrible, horrible 3D bar chart. Actual issues with what the visualization portrayed became more clear with the re-plot.
I'd add that it wasn't initially clear that "All APs" was just a weighted average of the inner APs and outer APs. Trying a few points (8/25 * inner + 17/25 * outer) confirmed this. Initially, I thought they took a separate sampling from all APs for that data. Since this means they're really only looking at inner vs. outer and WCL vs. GPR+WCL, we could just list or plot the averages with a different color on the dot plot vs. them making separate bars for them.
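That spot check is easy to script. Here is a minimal Python sketch, assuming 8 inner and 17 outer APs as implied by the weights above. The error values are placeholders, not numbers from the paper's table, and the helper names are made up for illustration.

```python
import math

N_INNER, N_OUTER = 8, 17  # AP counts implied by the 8/25 and 17/25 weights


def weighted_all(inner_err, outer_err, n_inner=N_INNER, n_outer=N_OUTER):
    """Weighted average of the inner and outer errors by AP count."""
    total = n_inner + n_outer
    return (n_inner * inner_err + n_outer * outer_err) / total


def is_weighted_average(inner_err, outer_err, all_err, tol=0.01):
    """True if the reported 'All APs' value matches the weighted average.

    tol allows for values that were rounded to two decimal places.
    """
    return math.isclose(weighted_all(inner_err, outer_err), all_err, abs_tol=tol)


# Placeholder values, NOT taken from the paper.
inner, outer = 2.0, 5.0
print(round(weighted_all(inner, outer), 2))      # 4.04
print(is_weighted_average(inner, outer, 4.04))   # True
print(is_weighted_average(inner, outer, 3.5))    # False (3.5 is the unweighted mean)
```

Running the same comparison on every row of the table would show whether "All APs" ever carries information beyond the inner and outer values.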
Plus, with two less bars per grouping... they could have stayed away from plaid :)
Posted by: John Henderson | Apr 29, 2013 at 11:11 AM