Two challenging charts showing group distributions
Jan 09, 2025
Long-time reader Georgette A. found this chart from a LinkedIn post by David Curran:
She found it hard to understand. Me too.
It's one of those charts that require some time to digest. And when you've figured it out, you don't get the satisfaction of time well spent.
***
If I have to write a reading guide for this chart, I'd start from the right edge. The dataset consists of the top 2000 English words, ranked by popularity. The right edge of the chart says that roughly two-thirds of these 2000 words are of Germanic origin, followed by 20% French origin, 10% Latin origin, and 3% "others".
Now, look at the middle of the chart, where the 1000 gridline lies. The analyst did the same analysis but using just the top 1000 words, instead of the top 2000 words. Not surprisingly, Germanic words predominate. In fact, Germanic words account for an even higher percentage of the total, roughly three-quarters. French words are at 16% (relative to 20%), and Latin at 7% (compared to 10%).
The trend is this: the more we restrict the word list to fewer, more popular words, the more Germanic words dominate. Of the top 50 words, all but one are of Germanic origin. (You can't tell that directly from the chart but you can figure it out if you measure it and do some calculations.)
Said differently, there are some non-Germanic words in the English language but they tend not to be used particularly often.
As we move our eyes from left to right on this chart, we are analyzing more words but the newly added words are less popular than those included prior. The distribution of words by origin is cumulative.
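To make the cumulative idea concrete, here is a minimal Python sketch. The ten-word list and the function name `composition_of_top` are made up for illustration; they are not the data or code behind the chart. Each vertical slice of the area chart corresponds to one call of this function with a different cutoff.

```python
from collections import Counter

def composition_of_top(origins, top_n):
    """Share of each origin among the top_n most popular words.

    `origins` lists the origin of each word, ordered from most to
    least popular. The shares for any given top_n sum to 1.
    """
    top = origins[:top_n]
    counts = Counter(top)
    return {origin: count / len(top) for origin, count in counts.items()}

# Hypothetical toy data: origins of 10 words, most popular first.
toy_origins = ["Germanic", "Germanic", "Germanic", "Germanic", "French",
               "Germanic", "Latin", "French", "Germanic", "Other"]

# Widening the cutoff adds less popular words and dilutes the Germanic share.
for n in (5, 10):
    print(n, composition_of_top(toy_origins, n))
```

In this toy list, the Germanic share falls as the cutoff widens, which is the same pattern the chart shows when read from left to right.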
The problem with this data visualization is that it doesn't "locate" where these non-Germanic words exist. It's focused on a cumulative metric so the reader has to figure out where the area has increased and where it has flat-lined. This task is quite challenging in an area chart.
***
The following chart showing the same information is more canonical in the scientific literature.
This chart also requires a reading guide for the uninitiated. (Therefore, I'm not saying it's better than the original.)
The chart shows how words of a specific origin accumulate over the top X most popular English words. Each line starts at 0% on the left and ends at 100% on the right.
Note that the "other" line hugs the zero level until X = 400, which means that there are no words of "other" origin in the top 400 list. We can see that words of "other" origin are mostly found between top 700-1000 and top 1700-2000, where the line is steepest. We can be even more precise: about 25% of these words are found in the top 700-1000 while 45% are found in the top 1700-2000.
In such a chart, the 45 degree line acts as a reference line. Any line that follows the 45 degree line indicates an even distribution: X% of the words of origin A are found in the top X% of the distribution. Origin A's words are not more or less popular than average anywhere in the distribution.
In this chart, nothing is on top of the 45 degree line. The Germanic line is everywhere above the 45 degree line. This means that on the left side, the line is steeper than 45 degrees while on the right side, its slope is less than 45 degrees. In other words, Germanic words are biased towards the left side, i.e. they are more likely to be popular words.
For example, the top 400 words (the first 20% of the word list) contain 27% of all the Germanic words, more than the 20% an even distribution would give.
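The comparison against the 45 degree line can be sketched in a few lines of Python. Again, the toy list and the function names `within_origin_share` and `gap_to_diagonal` are hypothetical, chosen only for illustration. A positive gap means the origin's line sits above the 45 degree line at that cutoff; a negative gap means it sits below.

```python
def within_origin_share(origins, origin, top_n):
    """Fraction of all words of `origin` that rank within the top_n."""
    total = origins.count(origin)
    if total == 0:
        raise ValueError(f"no words of origin {origin!r} in the list")
    return origins[:top_n].count(origin) / total

def gap_to_diagonal(origins, origin, top_n):
    """Height of the origin's cumulative line minus the 45 degree line.

    The 45 degree line at top_n equals top_n / len(origins): the share
    an evenly spread origin would have accumulated by that rank.
    """
    return within_origin_share(origins, origin, top_n) - top_n / len(origins)

# Hypothetical toy data: origins of 10 words, most popular first.
toy_origins = ["Germanic", "Germanic", "Germanic", "Germanic", "French",
               "Germanic", "Latin", "French", "Germanic", "Other"]

for origin in ("Germanic", "Other"):
    print(origin, round(gap_to_diagonal(toy_origins, origin, 5), 3))
```

Unlike the first chart, every origin reaches 100% at the right edge here, so this view says nothing about how large each origin is relative to the others.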
I can't imagine this chart is easy for anyone who hasn't seen it before; but if you are a scientist or economist, you might find this one easier to digest than the original.
Is it possible that some of the Germanic words are themselves of Latin origin?
Posted by: Richard Krablin | Jan 09, 2025 at 10:37 AM
Finally a chart that is right in my professional wheelhouse!
I grokked the first chart immediately. Because I understood the full argument before I looked at the chart, it made perfect sense. I have to admit that the line graph is a lot harder for me to make sense of; it seems to argue that 100% of the top 2000 words are simultaneously of Germanic, French, Latin, and "other" origin (which of course isn't the case).
In general, the form of the first chart would work for most philologists, but the labeling needs work. Also, I would reverse the x-axis, so that it started with the entire 2000-word corpus and gradually worked its way down to the 100 most common words.
A stacked column chart could work too, except that I freaking hate them on principle.
To answer Richard's question above: Your instincts are good -- in many cases there is that kind of multilingual interference. But not in this case; Latin influence on Germanic languages occurred after Old English developed.
Posted by: meg | Jan 09, 2025 at 11:16 AM
Meg: Great comment! So both these charts require learning how to read, and once one is familiar with the chart form, it's easy. The key difference is the first one sums to 100% vertically. But the second one sums to 100% within each subgroup (horizontally so to speak) and it doesn't provide information on the relative proportions across the subgroups.
As for the stacked column chart, you just previewed the next post.
Posted by: Kaiser | Jan 09, 2025 at 11:38 AM
These charts hurt. I think the problem lies with using a cumulative distribution. Bucketing the popularity of words into deciles, or whatever, and then doing a stacked bar would be clearer.
Posted by: Arthur | Jan 09, 2025 at 01:09 PM
@Richard Krablin: Wiktionary lists 30 Proto-Germanic words and about 400 Old English words loaned from Latin, the latter including a lot of religious terms and names. How these words are categorized would depend on the researchers' methodology.
Posted by: Jesse O. | Jan 11, 2025 at 12:04 PM