
February talks, and exploratory data analysis using visuals

News:

In February, I am bringing my dataviz lecture to various cities: Atlanta (Feb 7), Austin (Feb 15), and Copenhagen (Feb 28). Click on the links for free registration.

I hope to meet some of you there.

***

On the sister blog about predictive models and Big Data, I have been discussing aspects of a dataset containing IMDB movie data. Here are previous posts (1, 2, 3).

The latest instalment contains the following chart:

[Chart: Redo_scorebytitleyear_ans]

The general idea is that the average rating of the average film on IMDB has declined from about 7.5 to 6.5... but this does not mean that IMDB users like oldies more than recent movies. The problem is a bias in the IMDB user base. Since IMDB's website launched only in 1990, users are much more likely to be reviewing movies released after 1990 than before. Further, if users are reviewing oldies, they are likely reviewing oldies that they like and go back to, rather than the horrible movie they watched 15 years ago.

Modelers should be exploring and investigating their datasets before building their models. Same thing for anyone doing data visualization! You need to understand the origin of the data, and its biases, in order to tell the proper story.
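As a rough sketch of the kind of exploratory check meant here, one could count films and votes by decade before trusting any average rating. The records below are made-up toy numbers, not IMDB data, and the function name summarize_by_decade is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy records: (title year, average rating, number of votes).
# These are NOT real IMDB figures; they only illustrate the check.
films = [
    (1955, 8.1, 12000),
    (1962, 7.8, 9000),
    (1978, 7.4, 30000),
    (1994, 6.9, 250000),
    (1999, 6.2, 180000),
    (2008, 6.6, 400000),
    (2012, 5.9, 320000),
]

def summarize_by_decade(records):
    """Group films by decade; report film count, mean rating and total votes."""
    groups = defaultdict(list)
    for year, rating, votes in records:
        groups[year // 10 * 10].append((rating, votes))
    summary = {}
    for decade in sorted(groups):
        rows = groups[decade]
        summary[decade] = {
            "films": len(rows),
            "mean_rating": mean(r for r, _ in rows),
            "total_votes": sum(v for _, v in rows),
        }
    return summary

for decade, stats in summarize_by_decade(films).items():
    print(decade, stats)
```

If the older decades show few films and few votes next to high mean ratings, that is a hint that the ratings for oldies come from a self-selected group of fans rather than from the full user base.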

Click here to read the full post.

 

 


Lines that delight, lines that blight

This WSJ graphic caught my eye. The accompanying article is here.

[Chart: Wsj_ipo_dealdrought_full]

The article (judging from the sub-header) makes two separate points, one about the total amount of money raised in IPOs in a year, and the other about the change in market value of those newly-public companies one year from the IPO date.

The first metric is shown by the size of the bubbles while the second metric is displayed as distances from the horizontal axis. (The second metric is further embedded, in a simplified, binary manner, in the colors of the bubbles.)

The designer has decided that the second metric, performance after IPO, is more important. Therefore, it is much easier for readers to see how each annual cohort of IPOs has performed. The use of color to map to the second metric (and not the first) also helps to emphasize the second metric.

There are details on this chart that I admire. The general tidiness of it. The restraint on the gridlines, especially the horizontal ones. The spatial balance. The annotation.

And ah, turning those bubbles into lollipops. Yummy! Those dotted lines allow readers to find the center of each bubble, which is where the values of the second metric lie. Frequently, these bubble charts are presented without those guiding lines, and it is often hard to find the circles' anchors.

That leaves one inexplicable decision - why did they place two vertical gridlines in the middle of two arbitrary years?


Race to the top, Erasmus edition

(This is a submission from reader Lawrence Mayes. Thank you Lawrence!)

I came across this unusual graphical representation of the destinations of scholarship students:

[Chart: Erasmus_destinations]

[Kaiser here: The charts are hidden inside an annoying Flash app and it seems that the bottom half of the chart is cropped out.]

(The original can be seen here: http://www.ibercampus.eu/-270-000-students-benefitted-from-eu-grants-to-study-or-2076.htm)

The question is: which parameter is used to illustrate the figures, line length or angle?

The answer is line length. But the eye is likely to use the angle as the measure, and this is where an error may arise. It's almost an optical illusion: the smaller numbers of students lie on the circumferences of smaller circles, where a given length goes further around the circle. Thus, for example, Turkey attracts about one fifth as many students as Germany, but it looks like it's nearer half (45 degrees vs 90 degrees).
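The arithmetic behind the illusion can be sketched in a few lines: the angle swept by an arc is its length divided by its radius, so a short bar on a small circle can sweep a surprisingly large angle. The lengths and radii below are assumptions chosen to reproduce the 45 vs 90 degrees described above (in particular, the 2.5 ratio between the two radii is not measured from the chart).

```python
import math

def swept_angle_deg(arc_length: float, radius: float) -> float:
    """Angle in degrees swept by an arc of the given length on a circle of the given radius."""
    return math.degrees(arc_length / radius)

# Assumed values for illustration only, in arbitrary units.
germany_length = 5.0   # Germany's bar is five times as long as Turkey's
turkey_length = 1.0

# Radius picked so that Germany's bar sweeps a quarter circle (90 degrees).
germany_radius = germany_length / (math.pi / 2)
# Assumption: Turkey's bar sits on a circle 2.5 times smaller.
turkey_radius = germany_radius / 2.5

germany_angle = swept_angle_deg(germany_length, germany_radius)
turkey_angle = swept_angle_deg(turkey_length, turkey_radius)

print(f"length ratio (Turkey/Germany): {turkey_length / germany_length:.2f}")
print(f"angle ratio  (Turkey/Germany): {turkey_angle / germany_angle:.2f}")
```

Under these assumptions the length ratio is one fifth while the angle ratio is one half, which is the gap between what the chart encodes and what the eye reads.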

(The case of Spain is really bizarre - it looks like it's gone round the circle by over 280 degrees but actually what they've done is to break off the line at 90 degrees and stick the bit they broke off back on the diagram at the left.)

I have never seen this type of 'bar chart' before but it is really misleading.

***

Long-time readers may remember my discussion of the "race-track graph" (here). The "optical illusion" Lawrence mentions above is well known to any track runner. The inside lanes are shorter than the outside lanes, so the starting positions are staggered.


Happy new year (Junk Charts)

[Image: Happy2017]

I am on vacation for another week, and posting will gradually resume.

If you discovered the blog last year, I hope you will find much to enjoy in the coming year.

If you are a long-time reader, thank you for giving me a bit of your time, once in a while.

I wish everyone a productive, peaceful new year!

Kaiser