Here is a close-up of California:
Anytime someone expands the possibilities of a chart type, like the word cloud, it's a commendable project. So I'm quite enthusiastic about what they tried to do here. Not every new feature is successful, though.
These are the things I like:
- Using colors that mean something: they use different colors to indicate different countries of origin of particular surnames. Good idea. I would prefer one color per continent, with a different shade for each country of origin.
- For once, the data being depicted is not a speech or a piece of text; it's a set of surnames.
- This chart (or map) is multi-variate: it tries to address deeper questions such as the correlation between geography and origin of popular names, and the correlation between geography and popularity of names, etc. This is an important advance from all those word clouds out there that tell us nothing but the frequency of words in a document. In general, statistical clustering methods can be combined with text mining methods to develop multivariate word clouds.
- The designers realize it's a futile -- as well as ill-advised -- task to try to print every name on the map so they only include the top 25 names in each state. As I explain below, I'm not happy with this inclusion/exclusion criterion but the key point is that, by taking out the minor bits of data ("noise"), the chart is better able to draw our attention to the more interesting parts.
These are things I don't like:
- They really ought to have used relative popularity rather than absolute popularity. This is another area of improvement for all word clouds. Today, word clouds plot the number of times a specific word appears in a piece of text. We often try to compare several word clouds against each other; and when we do that, the only sensible measure is the proportion (relative frequency) of times a specific word appears. Say, one compares Obama and McCain speeches by comparing two word clouds. If these two speeches differ significantly in length, then comparing the number of times each candidate uses "education" words is silly -- we have to compare the number of times relative to the length of each speech.
- The cutoff of top 25 names in each state suffers from a similar problem. The 26th most popular name in California, a populous state, is of more interest than, say, the 15th most popular name in Montana (or insert your favorite small state). Instead, a more sensible cutoff would be to include names that account for at least 2 percent (say) of a state's population. By doing this, the more populated states would have more entries than the less populated states.
- Given the above bullets, it is not surprising that the word-size scale has serious problems. Because it is an absolute number and not relative to each state's population, the big words can only show up in populous states. In other words, the size of the words tells us about the geographical distribution of the U.S. population. As I mentioned before (such as here), this insight is available on pretty much every map used to plot data that has ever been produced. The one thing that all these maps never fail to tell us is the fact that most of the U.S. population is bi-coastal. Unfortunately, the real message of the map -- in this case, the geography of surnames -- is subsumed.
- And then, the map invents false data. Notice that there are 1,250 geographic sites on the map (25 names times 50 states). This is a visually prominent feature of the map, and yet there is no rhyme or reason as to where the names are placed, with the exception of respecting state boundaries. The casual reader may think that the appearance of the Chinese name "Lee" in the inner, central part of California implies that Lee-named Chinese-Americans aggregate in those parts of California. Far from the truth!
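To make the first two complaints concrete, here is a minimal Python sketch of relative frequency and of a share-based cutoff. Everything in it is an assumption for illustration: the function names `relative_frequencies` and `names_above_share` are hypothetical, the speeches and surname counts are made up, and the 2 percent default is just the "say" figure from above, not a recommendation.

```python
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all the words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


def names_above_share(name_counts, population, min_share=0.02):
    """Keep surnames whose share of the population is at least min_share."""
    shares = {name: n / population for name, n in name_counts.items()}
    return {name: share for name, share in shares.items() if share >= min_share}


# Two made-up "speeches" of very different lengths.
short_speech = "education matters education works"   # 4 words, 2 mentions
long_speech = "education " * 3 + "economy " * 9      # 12 words, 3 mentions

# Raw counts favor the long speech (3 vs. 2), but the rates say the opposite.
print(relative_frequencies(short_speech)["education"])  # 0.5
print(relative_frequencies(long_speech)["education"])   # 0.25

# Made-up surname counts for a large and a small "state".
large_state = names_above_share({"Garcia": 30, "Lee": 25, "Smith": 10}, 1000)
small_state = names_above_share({"Smith": 3, "Olson": 1}, 100)
print(sorted(large_state))  # ['Garcia', 'Lee']
print(sorted(small_state))  # ['Smith']
```

The same shares could drive the word-size scale, so that a big word would mean a name is common within its state rather than that the state is populous.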
So, I think they did a reasonable job in rethinking the possibilities of word clouds. It's well intentioned and there is room for improvement.
Lastly, they might get some ideas from the Baby Names navigator.