In the comments to the previous post, a PhD student asked for general advice on testing data for irregularities. This topic merits a separate post, indeed multiple posts.
***
Here are some initial thoughts:
1. Your data is guilty until proven innocent
2. The top N rows of your data may be false friends
3. With experience, you develop an intuitive feel for the common types of problems to look for
4. Look for problems in slices of the data, because problems are not randomly distributed throughout your dataset
5. Avoid inferring metadata from the data - find the metadata or ask the data collector
6. Seek contradicting statistics: e.g. if A and B have these values, then it's impossible for C to have that value (see the sketch after this list)
7. Pushing bad data into your analysis pipeline and then fixing problems as they surface does not save time; on the contrary, it will cost you much more time
8. Many problems are caused by data collectors who have no idea how the collected data will later be used by data analysts
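To make points 4 and 6 concrete, here is a minimal sketch in pandas. The file name and the columns (region, respondent_id, age, start_date, end_date) are hypothetical; the point is the pattern of profiling by slice and hunting for impossible combinations, not these particular checks.

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
df = pd.read_csv("survey.csv", parse_dates=["start_date", "end_date"])

# Point 4: profile slices, not just the whole table.
# Duplicates and missing values are rarely spread evenly across groups.
by_region = df.groupby("region").agg(
    rows=("respondent_id", "size"),
    dup_ids=("respondent_id", lambda s: s.duplicated().sum()),
    pct_missing_age=("age", lambda s: s.isna().mean()),
)
print(by_region.sort_values("pct_missing_age", ascending=False))

# Point 6: seek contradicting statistics.
# If age is recorded in years, values above 120 or below 0 are impossible;
# an interview cannot end before it starts.
contradictions = df[
    (df["end_date"] < df["start_date"]) | (df["age"] > 120) | (df["age"] < 0)
]
print(f"{len(contradictions)} rows fail basic consistency checks")
```

Even a handful of checks like these, run slice by slice, tends to surface problems long before they reach the analysis pipeline (point 7).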
More initial thoughts:
9. Graph your data. Look for the unexpected
10. Consider re-expression to simplify structure: make distributions of individual variables more nearly symmetric, make scatterplots straighter, make tables more nearly additive
11. Then (AFTER re-expressing) look for and deal with outliers--both in each variable separately and possibly in pairs or multiple variables together (e.g. the tall, thin subject who is neither extraordinarily tall nor extraordinarily thin on their own, but whose combination is a medical outlier). Don't allow outliers to dominate your analysis (see the sketch after this comment)
12. Graph your data again, in at least one way different from what you did at step 9.
Posted by: Paul Velleman | 12/03/2021 at 02:14 PM
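To illustrate Paul's points 10 and 11, here is a minimal sketch with simulated data (a right-skewed variable standing in for something like income; the numbers are made up). A log re-expression makes the distribution more nearly symmetric, and outliers are then flagged on the transformed scale rather than allowed to dominate.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated right-skewed variable; in practice this would be a column of your data.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
income.plot.hist(bins=40, ax=axes[0], title="raw scale: skewed, extremes dominate")
np.log10(income).plot.hist(bins=40, ax=axes[1], title="log10 scale: more nearly symmetric")
plt.tight_layout()
plt.show()

# Point 11: after re-expressing, flag (don't silently delete) outliers,
# e.g. values more than 3 robust standard deviations from the median.
z = np.log10(income)
mad = (z - z.median()).abs().median()          # median absolute deviation
flagged = income[(z - z.median()).abs() > 3 * 1.4826 * mad]
print(f"{len(flagged)} candidate outliers to inspect")
```

Multivariate outliers of the kind Paul describes (unremarkable on each variable, odd in combination) need a joint view, e.g. a scatterplot of height against weight, which is also a natural way to satisfy point 12.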
PV: Thanks! (Enjoyed your books)
Along those lines, a different set of posts is needed on fixing problems, and on how not to make things worse.
Posted by: Kaiser | 12/03/2021 at 04:14 PM
This is great - thanks so much!
Posted by: Annony | 12/03/2021 at 07:03 PM
I strongly agree that graphing data in various ways can help. A great example of how your data can fool you is "Anscombe's quartet".
Very appealing dinosaur-based animated examples are given in the paper: "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" by Justin Matejka and George Fitzmaurice, available here: https://www.autodesk.com/research/publications/same-stats-different-graphs
Posted by: Aleksander B | 12/06/2021 at 02:40 AM
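AB's point is easy to verify. The sketch below uses the copy of Anscombe's quartet that ships with seaborn's example datasets (load_dataset fetches it from seaborn's data repository, so it needs an internet connection): the four datasets have nearly identical means, variances, and correlations, yet the scatterplots look nothing alike.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet: columns are "dataset", "x", "y".
df = sns.load_dataset("anscombe")

# Nearly identical summary statistics across the four datasets...
stats = df.groupby("dataset").agg(
    x_mean=("x", "mean"), x_var=("x", "var"),
    y_mean=("y", "mean"), y_var=("y", "var"),
)
stats["xy_corr"] = df.groupby("dataset")[["x", "y"]].apply(lambda g: g["x"].corr(g["y"]))
print(stats)

# ...but completely different structure once plotted.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, height=3)
plt.show()
```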
AB: Agreed. I do lots of boxplots, pdfs and cdfs because the visuals are clearly much more efficient and effective than staring at a list of summary statistics.
Posted by: Kaiser | 12/06/2021 at 10:43 AM
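A minimal sketch of the boxplot / pdf / cdf views Kaiser mentions, on simulated data with a deliberately planted second cluster:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated numeric column with a planted anomaly (a second cluster around 500).
rng = np.random.default_rng(1)
values = pd.Series(np.concatenate([rng.normal(50, 5, 950), rng.normal(500, 5, 50)]))

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
values.plot.box(ax=axes[0], title="boxplot")                        # extremes jump out
values.plot.hist(bins=60, density=True, ax=axes[1], title="pdf (histogram)")
axes[2].plot(np.sort(values), np.arange(1, len(values) + 1) / len(values))
axes[2].set_title("empirical cdf")
plt.tight_layout()
plt.show()
```

All three views expose the stray cluster at a glance, which a table of means and standard deviations would blur into one suspiciously large variance.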
Frequency histograms are also useful. I found one case where the data was given in two different units. Also, with consulting, I make it clear that the client won't receive anything until the data is clean, so they had better answer my e-mails.
Posted by: Ken | 12/12/2021 at 04:32 AM
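Ken's two-units case is worth simulating, because it shows how little effort the histogram check takes. The specifics below are invented (heights recorded partly in centimetres and partly in inches), but the signature is general: a bimodal histogram that no single unit can explain.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented example: a height column where some values were entered in inches.
rng = np.random.default_rng(2)
cm = rng.normal(170, 10, 700)                # heights entered in centimetres
inches = rng.normal(170, 10, 300) / 2.54     # the same kind of heights, but in inches
height = pd.Series(np.concatenate([cm, inches]), name="height")

# The frequency histogram shows two clusters, one near 170 and one near 67.
height.plot.hist(bins=60, title="height column with mixed units")
plt.xlabel("recorded height")
plt.show()
```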
Hey Kaiser,
One tip: I look for data that looks like it was created top-down.
Here is some ground-level Covid data; the picture built up from the lower levels should match it somehow:
https://doi.org/10.1007/s00134-020-06267-0
Posted by: A Palaz | 12/12/2021 at 11:40 AM
Hey,
Here is some new bottom-up data dumping. Same sources, up to date, but a bit of a mess.
ASD
https://www.icnarc.org/DataServices/Attachments/Download/e08134e0-c264-ec11-9139-00505601089b
Can you see anything interesting?
Posted by: A Palaz | 12/26/2021 at 05:34 PM