Comments

Paul Velleman

More initial thoughts:
9. Graph your data. Look for the unexpected.
10. Consider re-expression to simplify structure: make distributions of individual variables more nearly symmetric, make scatterplots straighter, make tables more nearly additive.
11. Then (AFTER re-expressing) look for and deal with outliers--both in each variable and possibly in pairs or multiple variables together (e.g., the tall, thin subject who is neither extraordinarily tall nor extraordinarily thin, but whose height and weight together make a medical outlier). Don't allow outliers to dominate your analysis.
12. Graph your data again in at least one other way than you did at step 9.
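
Steps 10 and 11 can be illustrated with a minimal Python sketch. The income, height, and weight figures are invented for illustration, and the helper names (`skewness`, `zscores`, `mahalanobis_2d`) are hypothetical. Taking logs is only one possible re-expression, and Mahalanobis distance is only one common way to score two variables jointly; neither is necessarily what the commenter had in mind.

```python
import math
import statistics as st

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = st.fmean(values), st.pstdev(values)
    return st.fmean([((v - m) / s) ** 3 for v in values])

def zscores(values):
    """Standardize using the population standard deviation."""
    m, s = st.fmean(values), st.pstdev(values)
    return [(v - m) / s for v in values]

def mahalanobis_2d(xs, ys):
    """Mahalanobis distance of each (x, y) pair from the joint mean.

    For two variables this reduces to a closed form in the z-scores
    and the Pearson correlation r.
    """
    zx, zy = zscores(xs), zscores(ys)
    r = st.fmean([a * b for a, b in zip(zx, zy)])  # Pearson correlation
    return [
        math.sqrt((a * a - 2 * r * a * b + b * b) / (1 - r * r))
        for a, b in zip(zx, zy)
    ]

# Step 10: re-express a right-skewed variable (hypothetical incomes).
incomes = [21, 25, 28, 30, 33, 37, 42, 50, 65, 90, 140, 400]
raw_skew = skewness(incomes)
log_skew = skewness([math.log(v) for v in incomes])
print(f"skewness raw: {raw_skew:.2f}, after log: {log_skew:.2f}")

# Step 11: look for joint outliers (hypothetical heights in cm, weights in kg).
# The last subject is tall-ish and light-ish, but not extreme on either alone.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 182]
weights = [52, 54, 61, 64, 71, 74, 81, 84, 91, 58]
z_height, z_weight = zscores(heights), zscores(weights)
distances = mahalanobis_2d(heights, weights)
suspect = max(range(len(distances)), key=distances.__getitem__)
print(
    f"subject {suspect}: z_height={z_height[suspect]:.2f}, "
    f"z_weight={z_weight[suspect]:.2f}, joint distance={distances[suspect]:.2f}"
)
```

In this made-up sample the log reduces the skewness, and the subject with the largest joint distance is less than one standard deviation from the mean on each variable taken separately.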

Kaiser

PV: Thanks! (Enjoyed your books)

Along those lines, a different set of posts is needed for fixing problems, and how not to make things worse.

Annony

This is great - thanks so much!

Aleksander B

I strongly agree that graphing data in various ways can help. A great example of how your data can fool you is "Anscombe's quartet".

Very appealing dinosaur-based animated examples are given in the paper: "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" by Justin Matejka and George Fitzmaurice, available here: https://www.autodesk.com/research/publications/same-stats-different-graphs
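
As a minimal sketch of the point, the following Python computes a few summary statistics for the four Anscombe datasets, using the values as they are commonly reproduced from Anscombe's 1973 paper. The helper names (`pearson_r`, `summarize`) are hypothetical, and the choice of which statistics to compare is an assumption made for illustration.

```python
import math
import statistics as st

# Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = st.fmean(xs), st.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def summarize(xs, ys):
    """Return (mean of x, mean of y, sample variance of x, correlation), rounded."""
    return (
        round(st.fmean(xs), 2),
        round(st.fmean(ys), 2),
        round(st.variance(xs), 2),
        round(pearson_r(xs, ys), 2),
    )

SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, summary in SUMMARIES.items():
    print(name, summary)
```

To two decimal places the four rows come out the same, even though scatterplots of the four datasets look nothing alike.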

Kaiser

AB: Agreed. I do lots of boxplots, pdfs and cdfs because the visuals are clearly much more efficient and effective than staring at a list of summary statistics.

Ken

Frequency histograms are also useful. One of them revealed that the data had been recorded in two different units. Also, when consulting, I make it clear that clients won't receive anything until the data is clean, so they had better answer my e-mails.
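
A rough sketch of how a frequency histogram can expose mixed units, in Python. The height values are invented (some entered in inches, some in centimetres), and the `bin_counts` helper and the bin width of 20 are assumptions chosen for illustration, not anything from the comment.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per fixed-width bin, keeping empty bins in between."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    return {
        start: counts.get(start, 0)
        for start in range(first, last + bin_width, bin_width)
    }

# Hypothetical heights: the first six look like inches, the rest like centimetres.
heights = [64, 66, 67, 69, 70, 72, 160, 165, 168, 170, 172, 175, 178, 180]

histogram = bin_counts(heights, 20)
for start, n in histogram.items():
    print(f"{start:>4} | {'#' * n}")

# A run of empty bins between two clusters is a hint worth following up.
empty_bins = [start for start, n in histogram.items() if n == 0]
print("empty bins:", empty_bins)
```

Two separate clumps with a stretch of empty bins between them would be hard to spot in a mean or standard deviation, but they are obvious in the printed bars.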

A Palaz

Hey Kaiser,

One tip: look for data that appears to have been created top-down.
Here is some ground-level data on COVID. A picture built up from the bottom should match it in some way.

https://doi.org/10.1007/s00134-020-06267-0

A Palaz

Hey,

Here is a new bottom-up data dump. Same sources, up to date, but a bit of a mess.
ASD

https://www.icnarc.org/DataServices/Attachments/Download/e08134e0-c264-ec11-9139-00505601089b

Can you see interesting things?

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.