A smarter word cloud: likes and not likes
Jan 24, 2011
Martha left a comment on my previous post asking for my thoughts on this National Geographic word cloud map of surnames in the U.S. (Click on the link to look at the interactive map.)
Here is a close-up of California:
Anytime someone expands the possibilities of a chart type, like the word cloud, it's a commendable project. So I'm quite enthusiastic about what they tried to do here. Not every new feature is successful, though.
These are the things I like:
- Using colors that mean something: they use different colors to indicate different countries of origin of particular surnames. Good idea. I would prefer one color per continent, with different shades for the countries within it.
- For once, the data being depicted is not a speech or a piece of text; it's a set of surnames.
- This chart (or map) is multi-variate: it tries to address deeper questions such as the correlation between geography and origin of popular names, and the correlation between geography and popularity of names, etc. This is an important advance from all those word clouds out there that tell us nothing but the frequency of words in a document. In general, statistical clustering methods can be combined with text mining methods to develop multivariate word clouds.
- The designers realize it's a futile -- as well as ill-advised -- task to try to print every name on the map, so they only include the top 25 names in each state. As I explain below, I'm not happy with this inclusion/exclusion criterion, but the key point is that by taking out the minor bits of data ("noise"), the chart is better able to draw our attention to the more interesting parts.
These are things I don't like:
- They really ought to have used relative popularity rather than absolute popularity. This is another area of improvement for all word clouds. Today, word clouds plot the number of times a specific word appears in a piece of text. We often try to compare several word clouds against each other; and when we do that, the only sensible measure is the proportion (relative frequency) of times a specific word appears. Say one compares Obama's and McCain's speeches by comparing two word clouds. If these two speeches differ significantly in length, then comparing the number of times each candidate uses "education" words is silly -- we have to compare the number of times per unit length of the speech.
- The cutoff of the top 25 names in each state suffers from a similar problem. The 26th most popular name in California, a populous state, is of more interest than, say, the 15th most popular name in Montana (or insert your favorite small state). Instead, a more sensible cutoff would be to include names that account for at least 2 percent (say) of a state's population. By doing this, the more populated states would have more entries than the less populated states.
- Given the above bullets, it is not surprising that the word-size scale has serious problems. Because it is an absolute number and not relative to each state's population, the big words can only show up in populous states. In other words, the size of the words tells us about the geographical distribution of the U.S. population. As I mentioned before (such as here), this insight is available on pretty much every map used to plot data that has ever been produced. The one thing that all these maps never fail to tell us is the fact that most of the U.S. population is bi-coastal. Unfortunately, the real message of the map -- in this case, the geography of surnames -- is subsumed.
- And then, the map invents false data. Notice that there are 1,250 geographic sites on the map (25 names times 50 states). This is a visually prominent feature of the map, and yet there is no rhyme or reason as to where the names are placed, with the exception of respecting state boundaries. The casual reader may think that the appearance of the Chinese name "Lee" in the inner, central part of California implies that Lee-named Chinese-Americans aggregate in those parts of California. Far from the truth!
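
The first two complaints in this list (relative rather than absolute popularity, and a share-based cutoff instead of a fixed top 25) can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how the map was built: the function names `relative_frequencies` and `names_above_share` are hypothetical, the sample texts and surname counts are invented, and the 2 percent threshold is just the "say" figure from above, left as a parameter.

```python
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of the total word count in `text`."""
    words = text.lower().split()
    total = len(words)
    counts = Counter(words)
    # If the text is empty, counts is empty and no division happens.
    return {word: n / total for word, n in counts.items()}


def names_above_share(name_counts, population, min_share=0.02):
    """Keep surnames whose share of `population` is at least `min_share`.

    `name_counts` maps surname -> number of people with that surname.
    Returns surname -> share of the population.
    """
    shares = {name: count / population for name, count in name_counts.items()}
    return {name: share for name, share in shares.items() if share >= min_share}


# Invented toy "speeches": both mention "education" twice,
# but the speeches differ in length, so the shares differ.
short_speech = "education jobs education"
long_speech = "education jobs taxes war peace education"
short_rel = relative_frequencies(short_speech)
long_rel = relative_frequencies(long_speech)

# Invented surname counts for a hypothetical state of 1,000 people.
toy_counts = {"Smith": 30, "Lee": 25, "Jones": 10}
kept = names_above_share(toy_counts, population=1000, min_share=0.02)
```

In the toy speeches the raw count of "education" is identical, yet the word takes up two-thirds of the short speech and one-third of the long one, which is the comparison that matters. With the share-based cutoff, the number of names kept per state is no longer fixed at 25; it depends on how the names are distributed in that state's data.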
So, I think they did a reasonable job in rethinking the possibilities of word clouds. It's well intentioned and there is room for improvement.
Lastly, they might get some ideas from the Baby Names navigator.
This surname map of London is similar and addresses some of the issues you have ...
... each surname is positioned on top of an area of roughly equal population.
The presentation method seems to work better at city scale where population is - very roughly - evenly distributed at least by comparison with the whole of the US.
Posted by: Tomp | Jan 25, 2011 at 06:14 AM
What I would want to do in the interactive version is "hide/highlight all names that are shared across more than xx% of all states." Ideally be able to specify xx as well.
This would let you focus on the names that are distinctive, or that are found in lots of places. What state doesn't have Smith & Jones? That fact is on the chart but impossible to find. Also would think that state boundaries would be useful at some level... since the data is cut that way.
Posted by: Gary | Jan 25, 2011 at 04:53 PM
I'm confused by your point #2 in the needs to improve list. I'd think that, if surnames have anything like a power law distribution, using a percent cutoff wouldn't affect the number of surnames very much.
Posted by: John Roth | Jan 26, 2011 at 10:00 AM
I provided the data and helped create the map. I have posted some of my thoughts on the above suggestions here: http://spatial.ly/fG8axi
Posted by: Spatialanalysis | Jan 31, 2011 at 07:19 AM