The World Economic Forum published this chart:
The "EF EPI Score" is a measure of English proficiency. So the evidence is clear as day: "Better English and Income Go Hand in Hand," as their headline blares.
Last time I was in the New York subway, the panhandler spoke good English.
What's a blogger to do? I pulled out the EPI scores from the EPI report, and downloaded the Gross National Income per Capita (PPP $) data from the World Bank. But I failed to reproduce the above scatter plot. The EPI scores match what's on the chart but I couldn't find the exact series of GNI data they used. I tried both 2012 and 2013 and neither matched. This is the closest I can get to the original chart:
Notice that the income of Singapore is over $70,000 on my chart but it is just above $60,000 on theirs. The general shape of the data is approximately preserved so what I discuss in this blog post should still hold.
But there is something else that bugs me a lot. To get to the above reproduction, I have to take out some of the data. To be precise, all the orange dots in the following chart were removed:
A clear effect of removing those orange dots is to steepen the slope of the regression line. In other words, the GNI data are more correlated with English proficiency in the first chart than in the second. Correspondingly, the "R-squared" (how well the line fits the dots) jumps from 6% to 42%.
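To make the mechanics concrete, here is a minimal sketch of how dropping a handful of dots can change both the slope and the R-squared of a fitted line. The numbers are made up for illustration and are not the EPI or World Bank data; the `removed` points are hypothetical stand-ins for the orange dots, and `fit_line` is just a helper name I chose.

```python
# Made-up (EPI score, GNI per capita in $1,000s) pairs, NOT the real data.
kept = [(45, 10), (50, 15), (55, 22), (60, 35), (65, 40), (70, 50)]
# Hypothetical stand-ins for the orange dots that were taken out.
removed = [(48, 45), (52, 70), (50, 55)]


def fit_line(points):
    """Return (slope, r_squared) of the least-squares line y = a + b*x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    # With a single predictor, R-squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


slope_all, r2_all = fit_line(kept + removed)
slope_kept, r2_kept = fit_line(kept)
print(f"all dots:         slope={slope_all:.2f}, R-squared={r2_all:.2f}")
print(f"orange dots gone: slope={slope_kept:.2f}, R-squared={r2_kept:.2f}")
```

In this toy example the fit looks far better once the inconvenient points are gone, even though nothing about the remaining countries has changed.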
***
I will return to the missing countries later. The other huge problem with this type of scatter plot is that it imposes a misleading analytical frame on the readers. Readers of such charts are led to believe that English proficiency is the only, or at least the most powerful, factor that explains a country's per-capita income.
To illustrate this point further, I found some data on urbanicity, the proportion of population living in urban areas, and added this factor to the regression. Here is the resulting model:
The first thing to note is that the slope is much steeper in the % Urban panel, meaning that GNI is much more correlated with % Urban than with EPI Score.
Even more telling is the following chart. Notice that the EPI regression line is now almost flat.
In the first panel plot, the EPI regression line assumes an average degree of urbanicity (in the 50s). In the second panel plot, I set the urbanicity to over 90%. For countries in which the vast majority lives in urban areas, EPI score is essentially uncorrelated with income levels.
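For readers who want to see how an EPI line can flatten as urbanicity rises, here is a minimal sketch. It assumes a two-factor model with an EPI-by-urbanicity interaction term, which is one way to get a slope that depends on the second factor; it is not necessarily the model behind the charts. The data are generated from a made-up formula (`made_up_gni`), and every name and coefficient below is hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def made_up_gni(epi, urban):
    """Invented formula (GNI in $1,000s); the coefficients are NOT estimates."""
    return -30 + 0.6 * epi + 0.8 * urban - 0.006 * epi * urban


# Synthetic grid of (EPI score, % urban) pairs, not real countries.
points = [(e, u) for e in (40, 50, 60, 70) for u in (30, 50, 70, 90)]
gni = [made_up_gni(e, u) for e, u in points]

# Center both predictors so the normal equations are well conditioned.
mean_epi = sum(e for e, _ in points) / len(points)
mean_urban = sum(u for _, u in points) / len(points)
rows = [
    [1.0, e - mean_epi, u - mean_urban, (e - mean_epi) * (u - mean_urban)]
    for e, u in points
]
b0, b_epi, b_urban, b_inter = ols(rows, gni)


def epi_slope_at(urban_pct):
    """Slope of the EPI line when % urban is held at urban_pct."""
    return b_epi + b_inter * (urban_pct - mean_urban)


print(f"EPI slope at 55% urban: {epi_slope_at(55):.3f}")
print(f"EPI slope at 95% urban: {epi_slope_at(95):.3f}")
```

Holding urbanicity at a chosen value and reading off the EPI slope is the same operation as drawing the EPI panel at that level of urbanicity. In a purely additive model the slope would be the same at every level, so the interaction term is what lets the line flatten.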
When trying to explain numbers like income levels, one must use complex models involving many variables.
***
Removing data is sometimes acceptable, if the analyst has a good reason to do it. Improving R-squared alone is not a good reason. So I am curious if there is a good reason to remove those orange dots (missing countries).
Let's see if there are any patterns.
The removed countries sit in the lower right corner of this scatter plot, meaning that they are countries with more urbanized populations but below-average English scores.
Several of these countries have very high incomes, as you can see in the GNI vs. EPI graph above. Those high-income, low-EPI countries are Kuwait, UAE and Saudi Arabia. These countries present challenges to the one-factor, EPI-only model as they contradict the hypothesis that speaking English well makes one rich.
You can also see why the two-factor model fits the data better. Those three Middle Eastern countries have high urban populations, which can explain the high per-capita income, even when the English factor can't.
Iran and Iraq were probably omitted because of unreliable GNI data. It seems that the authors excluded all, or almost all, of the Middle East, without explaining why. Egypt, however, is included, I believe.
Outside the three countries mentioned above, the other omissions do not affect the regression line much since other dots sit nearby. Many of those countries were probably left off inadvertently, lost in the merging of the datasets. For example, Venezuela is labeled "Venezuela, RB" in one of the data series, so this country may disappear unexpectedly when the datasets are combined. Similarly, South Korea is listed as "Korea, Rep." in one of the datasets. This one is especially tricky because the two names appear in very different places in the datasets.
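Here is a minimal sketch of how this kind of silent loss can happen, and one way to catch it. The values are placeholders rather than real scores or incomes, the assignment of labels to sources is an assumption, and `inner_join` and `aliases` are names I made up for the illustration.

```python
# Placeholder numbers; only the country labels matter here.
epi_by_country = {"Singapore": 60.0, "South Korea": 55.0, "Venezuela": 47.0}
gni_by_country = {"Singapore": 70.0, "Korea, Rep.": 33.0, "Venezuela, RB": 18.0}


def inner_join(left, right):
    """Keep only the names that appear, spelled identically, in both sources."""
    return {name: (left[name], right[name]) for name in left if name in right}


# A naive join on the raw labels quietly loses two of the three countries.
naive = inner_join(epi_by_country, gni_by_country)
dropped = sorted(set(epi_by_country) - set(naive))
print("dropped by naive join:", dropped)

# Map one source's labels onto the other's before joining.
aliases = {"Korea, Rep.": "South Korea", "Venezuela, RB": "Venezuela"}
gni_renamed = {aliases.get(name, name): value for name, value in gni_by_country.items()}
fixed = inner_join(epi_by_country, gni_renamed)
print("countries after renaming:", sorted(fixed))
```

Listing the rows that failed to match after every merge is a cheap habit that surfaces this problem before it reaches the chart.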
P.S.
(1) Even the charts I created do not contain all of the countries. They contain only countries that have valid EPI Scores as well as available GNI data. In general, there are many more countries for which we have income data than EPI Scores.
(2) There is no reason to stop at a two-factor model. There are many other factors that are correlated with GNI.