Degrees of likeness 2
Aug 05, 2009
We left off the other day with an interactive graphic that lets readers peer into subgroups. This feature implicitly assumes that the overall average obscures differences among subgroups. What statisticians do with this type of data is compare the subgroups and identify the factors that make someone different from the average.
For example, there is a clear distinction between the employed and the unemployed in how they spend the day (not surprising).
This happens to be what the NYT printed in the paper edition that day. (Note, though, that the graphic loses quite a bit without the interactivity.)
On the other hand, there appears to be little differentiation between men and women.
Nor is there much difference between blacks and whites.
One factor that matters is age. Older people are not exactly like the young. A lot of these factors (for example, age and employment status) are correlated, by the way.
I showed all of these in order to talk about the statistical concept of "aggregation". We noted that the distribution of time use of the employed is different from that of the unemployed. Thus, we cannot use the "average" distribution to describe both groups, and so we show the data in disaggregated form. The same goes for time use and age.
But there is not much gain in disaggregating race and gender: the "average" is representative of the subgroups for these two factors. This is one distinction I see between information graphics and statistical graphics: the former typically shows all possible subgroups while in the latter, the designer zooms in on the factors that matter.
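To make the aggregation point concrete, here is a rough sketch of how one might check whether an "average" distribution is representative of its subgroups. Everything in it is an assumption for illustration: the time shares are invented (they are not the survey figures behind the NYT graphic), the function names are hypothetical, and the 0.05 cutoff is arbitrary. It measures how far each subgroup sits from the population-weighted average using total variation distance, which is just one reasonable choice of distance.

```python
# Hypothetical shares of the day per activity (each list sums to 1).
# Invented for illustration; NOT the data behind the NYT graphic.
ACTIVITIES = ["sleep", "work", "leisure", "other"]

# Each subgroup maps to (population weight, distribution over ACTIVITIES).
by_employment = {
    "employed": (0.6, [0.35, 0.33, 0.17, 0.15]),
    "unemployed": (0.4, [0.38, 0.02, 0.35, 0.25]),
}
by_gender = {
    "men": (0.5, [0.36, 0.21, 0.25, 0.18]),
    "women": (0.5, [0.36, 0.20, 0.24, 0.20]),
}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def weighted_average(groups):
    """Aggregate subgroup distributions, weighting each by its population share."""
    total = sum(weight for weight, _ in groups.values())
    n = len(ACTIVITIES)
    return [
        sum(weight * dist[i] for weight, dist in groups.values()) / total
        for i in range(n)
    ]


def worth_disaggregating(groups, threshold=0.05):
    """True if any subgroup is farther than `threshold` from the aggregate.

    The threshold is an arbitrary illustrative cutoff, not a statistical test.
    """
    average = weighted_average(groups)
    return any(
        total_variation(dist, average) > threshold for _, dist in groups.values()
    )


if __name__ == "__main__":
    print("employment:", worth_disaggregating(by_employment))
    print("gender:", worth_disaggregating(by_gender))
```

With these made-up numbers, the employment split crosses the cutoff while the gender split does not, which mirrors the pattern described above. A real analysis would also account for sampling error before declaring a factor unimportant.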
On the topic of aggregation, a neat story in the psychology of learning can be found in the "power law" of practice. It turns out that practice actually follows an exponential function, but naive aggregation of individual data yields a (misleading) power function (http://en.wikipedia.org/wiki/Power_Law_of_Practice).
Posted by: Mike Lawrence | Aug 06, 2009 at 01:31 PM
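The commenter's point about naive aggregation can be sketched numerically. The setup below is an assumption chosen for convenience, not the analysis from the literature: each hypothetical learner's error decays exponentially with practice, and the learners' rates are spread as evenly spaced quantiles of an exponential distribution with mean THETA. Under that particular assumption the average over learners has a known closed form, 1 / (1 + THETA * t), which falls off like the power function 1/t for large t even though no individual curve does.

```python
import math

N_LEARNERS = 1000  # hypothetical number of learners
THETA = 0.5        # hypothetical mean learning rate (an assumption)

# Assumed learning rates: midpoint quantiles of an exponential distribution
# with mean THETA. Chosen only because the average then has a closed form.
rates = [
    -THETA * math.log(1 - (i + 0.5) / N_LEARNERS) for i in range(N_LEARNERS)
]


def individual_error(rate, t):
    """One learner's error after t trials: a pure exponential decay."""
    return math.exp(-rate * t)


def averaged_error(t):
    """Naive aggregate: the mean error across all learners at trial t."""
    return sum(individual_error(r, t) for r in rates) / len(rates)


def power_curve(t):
    """Closed form of the average under the assumed rate distribution."""
    return 1.0 / (1.0 + THETA * t)


def local_decay_rate(curve, t):
    """Drop in log error between trial t and trial t + 1."""
    return math.log(curve(t)) - math.log(curve(t + 1))


if __name__ == "__main__":
    for t in (1, 10, 100):
        print(t, round(averaged_error(t), 4), round(power_curve(t), 4))
```

For a single learner, local_decay_rate is the same at every trial, which is what exponential decay means. For the averaged curve it shrinks as practice accumulates, because the fast learners drop out of the average early and the slow learners dominate later. That slowing is what makes the aggregate look like a power function.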