The scatter-plot matrix: a great tool

Jun 17, 2010

The scatter-plot matrix is one of the lesser-known graphical tools beloved by statisticians. A scatter plot displays the relationship between a pair of variables. Given a set of n variables, there are n-choose-2 pairs of variables, and thus the same number of scatter plots. These scatter plots can be organized into a matrix, making it easy to look at all pairwise correlations in one place.
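To make the n-choose-2 count concrete, here is a minimal Python sketch that enumerates the pairs for a handful of variables. The four names are only an illustrative subset; nothing here depends on the real data.

```python
from itertools import combinations
from math import comb

# A small illustrative subset of variables.
factors = ["housing affordability", "transit", "green space", "nightlife"]

# Every unordered pair of variables gets one scatter plot.
pairs = list(combinations(factors, 2))
for x, y in pairs:
    print(f"{x} vs. {y}")

# The number of pairs matches n-choose-2: 6 for 4 variables.
print(len(pairs), comb(len(factors), 2))

# The count grows quadratically: 12 variables already need 66 panels.
print(comb(12, 2))
```

The quadratic growth is why the matrix becomes hard to read once the number of variables gets large.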

***

Since Nate Silver's feature article about New York neighborhoods came out, I have been working on capturing the data because so much was left unsaid in that article.  His ranking formula takes 12 factors (housing affordability, transit, green space, nightlife, etc.) and combines individual scores into an overall score based on chosen weights (e.g. housing affordability counted for 25%). Scores are then converted to ranks.
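As a rough illustration of this kind of formula, the sketch below combines per-factor scores into a weighted overall score and then converts the totals to ranks. Only the 25% housing weight comes from the article; the other weights, the neighborhood names, and all the scores are invented for the example, and I am assuming that a higher score is better.

```python
# Hypothetical weights: only the 25% for housing affordability is from the article.
weights = {"housing affordability": 0.25, "transit": 0.40, "green space": 0.35}

# Made-up scores for three made-up neighborhoods.
scores = {
    "A": {"housing affordability": 80, "transit": 50, "green space": 60},
    "B": {"housing affordability": 40, "transit": 90, "green space": 70},
    "C": {"housing affordability": 60, "transit": 60, "green space": 40},
}

def overall_score(factor_scores, weights):
    """Weighted sum of the individual factor scores."""
    return sum(weights[f] * factor_scores[f] for f in weights)

def rank_neighborhoods(scores, weights):
    """Map each neighborhood to its rank (1 = highest overall score)."""
    totals = {name: overall_score(s, weights) for name, s in scores.items()}
    ordered = sorted(totals, key=totals.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

ranks = rank_neighborhoods(scores, weights)
print(ranks)
```

Note that converting scores to ranks throws away the size of the gaps between neighborhoods, which is one reason to go back to the underlying factor scores.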

Silver's discussion focuses on explaining which factors caused which neighborhoods to be ranked high (or low). I'm interested in whether the individual factors are correlated. For example, do neighborhoods with more expensive housing also tend to have higher-quality housing? What about better schools? Are more diverse neighborhoods also more creative? And so on. There is really a treasure trove of information locked up in this data.

***

A scatter-plot matrix neatly organizes all of the pairwise correlation information.  See below.

Each small chart shows the correlation between the given pair of variables (one listed on the right, the other listed below). The dots represent the neighborhoods. The pink patch contains the "middle 75%" of the neighborhoods, and we can use the orientation of these patches to get a sense of whether the two variables are positively correlated, negatively correlated, or uncorrelated.

There is a lot to see in this chart. I just picked a few things at random for illustration:

• In the top left corner, the slant shows that the more affordable the homes are, the worse the transit.
• The better the shopping, the better the dining.
• Interestingly, more diversity seems to mean lower creative capital (although the correlation is only moderate).
• Wellness scores fall within a rather narrow range compared to other categories, and they seem to be almost completely unrelated to any of the other factors.
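Readings like these can be cross-checked numerically by computing a correlation coefficient for every pair of variables. The sketch below does this in plain Python with the Pearson correlation. The scores are invented to mimic the patterns described above; they are not the New York magazine data, and the variable names are placeholders.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (fails on constant input)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented scores for five hypothetical neighborhoods.
data = {
    "affordability": [90, 70, 50, 30, 10],
    "transit": [20, 35, 55, 70, 95],
    "shopping": [15, 40, 50, 75, 90],
    "dining": [10, 45, 55, 70, 95],
}

# One coefficient per pair, mirroring one panel of the matrix each.
correlations = {
    (a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)
}

for (a, b), r in correlations.items():
    print(f"{a:>13} vs. {b:<9} r = {r:+.2f}")
```

A single coefficient only summarizes the linear trend, so it complements the plots rather than replacing them; a narrow range or a few outliers shows up in the scatter plot but not in r.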

***

(Note: I used JMP to generate this matrix. Excel unfortunately does not make scatter-plot matrices natively. JMP is great for such exploration... if the developers are reading this, please make it easier to man-handle the category labels! I made a mess of rotating the text on the right.)

P.S. I had an adventure processing the data from New York magazine. There appear to have been quite a few typos. For more, see my writeup on the book blog.


Sweet!

Have you tried doing this in Protovis?
http://vis.stanford.edu/protovis/ex/brush.html

Nice post, some points to add:

1. With "only" 11 variables and some dozens of observations, the SPLOM still works reasonably well. For 20 variables and some hundreds of cases, this plot will fail.
2. The ellipses help a lot in judging the correlations, but do we need a plot if this is essentially all we look at?
3. Linking cases across the scatterplots will take us to even higher dimensional insights than just 2-d.
4. Is the data available somewhere? I am keen on looking at the data in Mondrian.
5. If there is a geographical reference in the data, i.e., the neighborhoods, we should link the map with the data. This will be far more powerful than any analysis which ignores this aspect.
But the important point is that you actually collected real data and addressed a real problem!

Check out the "ezCor" function in the "ez" package for R. It plots something similar, but with additional features such as univariate densities, correlation coefficients, etc.


For a really great-looking and versatile scatterplot matrix, check out RegressIt, a free PC Excel add-in: http://regressit.com. Each element in the matrix is a separate native Excel chart, fully labeled and intelligently scaled and already formatted for presentation. It can be further edited with any of the usual charting tools and it can be live-linked to Powerpoint documents. The individual charts may optionally include regression lines and center-of-mass points. Axes are scaled to the minimum and maximum values of the variables, and the chart title includes the correlation and either its square or the slope coefficient. You can produce either a full square matrix or else a column of plots which all have a specified variable on either the X or Y axis (e.g., the dependent variable for a regression model). An example can be found at the bottom of this page: http://regressit.com/descriptive-data-analysis.html. RegressIt also produces many other well-designed charts, e.g., parallel time series plots of many variables and 7 different types of charts for regression models.
