More prevalent versus more likely
Jul 12, 2007
Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line. This is a pretty chart that does an admirable job with a difficult data set.
The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense. So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line. In addition, the total of each column can be much more than 100% because multiple responses were allowed.
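To see why a column can add up to more than 100%, here is a small sketch with made-up respondents from a single age group. The four respondents and their answers are hypothetical, not taken from the survey; only the activity labels come from the chart.

```python
from collections import Counter

# Hypothetical respondents within ONE age group (one column of the chart).
# Each respondent may report several activities, as the survey allowed.
responses = [
    {"Creators", "Joiners", "Spectators"},
    {"Spectators"},
    {"Joiners", "Spectators"},
    {"Inactives"},
]

# Count how many respondents mentioned each activity.
counts = Counter(activity for r in responses for activity in r)

# Column percentage: share of this age group's respondents per activity.
pct = {activity: 100 * n / len(responses) for activity, n in counts.items()}

# The column total exceeds 100 because one person can count several times.
column_total = sum(pct.values())

for activity, value in sorted(pct.items()):
    print(f"{activity}: {value:.0f}%")
print(f"Column total: {column_total:.0f}%")
```

Each percentage is still a valid share of the age group; it is only the sum down the column that has no natural ceiling.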
Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people. A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers". But this is wrong because the chart hides the age distribution. While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives". This is the difference between prevalence and incidence rate. (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
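A rough numerical sketch of the same point: the 70% inactive rate for "Seniors" is from the chart, but the population shares and the 40% inactive rate for everyone else are assumptions invented for illustration, since the chart does not show the age distribution.

```python
def share_of_inactives(pop_share, inactive_rate):
    """Turn P(inactive | group) into P(group | inactive) via Bayes' rule."""
    # Joint probability of belonging to a group AND being inactive.
    joint = {g: pop_share[g] * inactive_rate[g] for g in pop_share}
    total_inactive = sum(joint.values())
    return {g: joint[g] / total_inactive for g in joint}


# ASSUMED age mix (hypothetical; hidden by the chart).
pop_share = {"Seniors": 0.15, "Everyone else": 0.85}

# 0.70 for Seniors is read off the chart; 0.40 is a made-up placeholder.
inactive_rate = {"Seniors": 0.70, "Everyone else": 0.40}

result = share_of_inactives(pop_share, inactive_rate)
for group, share in result.items():
    print(f"{group}: {share:.0%} of Inactives")
```

Under these assumed numbers, "Seniors" are far more likely than others to be inactive, yet they make up well under half of all "Inactives". A different age mix would give a different answer, which is exactly why the row cannot be read on its own.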
The construct of the square grids is less damaging than it seems. In effect, the data has been rescaled by dividing by 10. The reader is then forced to apply "rounding". If you are someone who sees $19.95 as $19, then you'd round down the partial rows. If you see $19.95 as $20, you'd round up the partial rows. So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.
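The rounding behaviour can be sketched in a few lines. This assumes each row of the grid holds ten squares worth 1% each, which is my reading of the chart rather than something the designer states; the function name and the habit labels are made up for the example.

```python
import math


def rows_perceived(pct, habit):
    """Whole rows of ten squares a reader takes away from a percentage."""
    rows = pct / 10  # rescale: one full row is assumed to equal 10%
    if habit == "round_down":  # the "$19.95 is $19" reader
        return math.floor(rows)
    if habit == "round_up":  # the "$19.95 is $20" reader
        return math.ceil(rows)
    raise ValueError(f"unknown habit: {habit}")


# 34% and 70% are figures quoted from the chart.
for pct in (34, 70):
    down = rows_perceived(pct, "round_down")
    up = rows_perceived(pct, "round_up")
    print(f"{pct}% reads as {down} or {up} rows")
```

Either way the reader ends up with a whole number between 0 and 10, and the two habits differ by at most one row.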
Here's another example where the profile chart shines. Because the percentages don't sum to 100%, alternatives like stacked bar charts and "Marimekko"/mosaic charts don't work. (Prior discussion of this issue here.)
This version gives a column view of the data, with lines linking the percentages of each age group performing each on-line activity. The profiles cluster nicely into three groups. The younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives". We also see that the likelihood of being "Collectors" has little to do with age.
Source: "Inside Innovation -- In Data", Business Week, June 11, 2007.
Mosaic plots will still work - they'll be dividing up something other than 100% but that's ok.
Posted by: Hadley Wickham | Jul 12, 2007 at 02:03 AM
I think it's a nice touch that they align the final, incomplete row of filled squares according to the neighbour that is higher where possible, giving a smoother profile.
Posted by: derek | Jul 12, 2007 at 06:27 AM
Nice post! But I'm wondering if one can do a better job ordering the activities in your profile chart? For example, I'd put Collectors just before Inactives, and Joiners at the beginning. Moreover, I'd use hue/saturation among the greens and blues to help trace back the exact age category.
Posted by: Aleks | Jul 12, 2007 at 10:15 AM
Another chart of interest:
Posted by: | Jul 13, 2007 at 09:07 PM
Wouldn't the problems be largely solved if, in the original chart, each "block" stood for a certain number of people rather than a certain column percentage? That way the relative distribution could be read easily across either columns or rows.
This is a common problem when percentages are used in tables as well: are the percentages of the columns? of the rows? of the total? A sloppy table won't tell you.
Posted by: zbicyclist | Jul 14, 2007 at 11:52 AM
Aleks: I like both suggestions.
zbicyclist: Using raw numbers sometimes helps. In this case, though, the raw numbers are counts of survey respondents, which are not useful in themselves. For surveys, the only relevant data are the percentages.
Posted by: Kaiser | Jul 16, 2007 at 12:54 AM
Both are pretty, but why don't "Marimekko"/mosaic charts work?
Posted by: Fun Dates | Oct 22, 2009 at 12:09 PM