Turning in his grave 1
May 06, 2008
(Thanks to reader Josh R. for the tip.) The "plucky statisticians" at Urbanspoon decided to tackle the political hot potato: is Barack Obama an elitist? Scratch that -- what they actually did was determine whether Obama supporters were elitists (of course, Obama would then be guilty by association). Scratch that -- what they actually analyzed was whether the states in which Obama won Democratic primaries tended to have more Starbucks per capita.
Suffice it to say, even if it could be proven that states with high densities of Starbucks are more likely to have Democratic primary voters who prefer Obama to Clinton, that would be a far cry from proving Obama an elitist. But let's take the leap of faith and look at the evidence presented to us.
The star witness was this chart plotting the "vote spread" of Obama minus Clinton alongside the per-capita Starbucks density. The black line was a linear fit to the Starbucks data, shown as green dots. Since the black and blue lines both pointed roughly northeast, we were told: "States with more latte-purveying Starbucks stores are more likely to have gone for Obama." (So Obama is indeed an elitist.)
To cover all bases, the creator of this chart offered: "my statistics professor might be rolling over in his grave to hear me say it, but there's a mild but real correlation here!"
Mr. Urbanspoon, the statistics professor is here and he disapproves. As discussed before (and here), plotting two series of data on the same chart and applying two different scales is a recipe for disaster. Not reaching immediately for the scatter plot when one has two data series is another serious misstep. (Indeed, Josh sent the link in with a note wondering why "people dislike scatter plots so much".) So here is the appropriate graphic:
A quick first glance at the left chart indicates that any correlation, if it exists, is very weak indeed. A simple linear regression shows that Starbucks density explains only 14% of the variability in vote spread. Note especially the wide dispersion of dots around the line. Further, for the vast majority of the states (say, those with vote spreads between -20% and 40%), there appears to be no correlation at all, as seen in the right chart.
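(For anyone who wants to reproduce that 14% figure in Excel, a quick check along these lines would do; the worksheet ranges below are made up purely for illustration.)
Sub StarbucksRSquared()
    '// Sketch only: assumes the Starbucks metric is in B2:B51 and the
    '// renormalized vote spread is in C2:C51 (hypothetical ranges)
    Dim rsq As Double
    rsq = Application.WorksheetFunction.RSq(Range("C2:C51"), Range("B2:B51"))
    MsgBox "R-squared = " & Format(rsq, "0.00")   '// reported above as about 0.14
End Sub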
To the extent that there is a linear correlation, the extreme points (orange dots) would be the most influential. The top cluster included Alaska, Kansas, DC, Hawaii and Idaho, where Obama won by a large margin and the Starbucks density was above average. The bottom cluster included Arkansas and Oklahoma, where Obama was wiped out and where Starbucks density was the lowest. These two clusters alone explained the mild relationship; removing them wiped it out.
Following Nyhan, we should also remove some obvious outliers, such as Arkansas, Illinois and New York (home states), Michigan and Florida (disputed), and New Hampshire and Iowa (Edwards territory). The result is again only a mild correlation (R-sq = 0.075).
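(For the curious, here is one rough way to redo that check after dropping the outlier states; the layout, with state names in column A, is hypothetical.)
Sub RSquaredWithoutOutliers()
    '// Sketch only: state names in A2:A51, Starbucks metric in B2:B51,
    '// renormalized vote spread in C2:C51 (hypothetical layout)
    Dim excluded As Variant, xs() As Double, ys() As Double
    Dim i As Long, n As Long
    excluded = Array("Arkansas", "Illinois", "New York", "Michigan", _
                     "Florida", "New Hampshire", "Iowa")
    ReDim xs(1 To 50): ReDim ys(1 To 50)
    For i = 2 To 51
        If IsError(Application.Match(Cells(i, 1).Value, excluded, 0)) Then
            n = n + 1
            xs(n) = Cells(i, 2).Value
            ys(n) = Cells(i, 3).Value
        End If
    Next i
    ReDim Preserve xs(1 To n): ReDim Preserve ys(1 To n)
    MsgBox "R-squared without outliers: " & _
           Format(Application.WorksheetFunction.RSq(ys, xs), "0.000")
End Sub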
Till next post, when the professor rolls over again ...
Notice that I prefer the number-of-people-per-Starbucks metric to the number-of-Starbucks-per-thousand-people metric (see the prior discussion on Gelman's blog). The reason is that every number on the former metric is reality-based, while the latter metric produces make-believe numbers for small states, i.e. the imputed number of Starbucks stores is smaller than the number that actually exists!
Also note that I used a renormalized vote spread, so that the Obama proportion and the Clinton proportion add up to 100%. This assumes that Edwards and other voters would split between Obama and Clinton in the same proportions as those who explicitly voted for the two frontrunners.
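In code form, the renormalization is simply this (a sketch; the function name is mine):
Function RenormalizedSpread(obamaVotes As Double, clintonVotes As Double) As Double
    '// Obama share minus Clinton share, computed over the two-candidate total,
    '// so the two proportions add up to 100%
    RenormalizedSpread = (obamaVotes - clintonVotes) / (obamaVotes + clintonVotes)
End Function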
Speaking of regression, has anyone ever implemented lowess (locally-weighted regression), in VBA or otherwise, for Excel? I see Kelly O'Day has a note that he's looked and come up blank.
Posted by: derek | May 07, 2008 at 01:32 AM
Speaking of secondary axes, Stephen Few discussed them in "Dual-Scaled Axes in Graphs: Are They Ever the Best Solution?" (his answer is no), and I countered with a class of examples where they are acceptable in "Secondary Axes that Work - Proportional Scales".
Posted by: Jon Peltier | May 07, 2008 at 07:29 AM
Derek - After trying for quite a while to get Excel/VBA to handle lowess and panel charts, I've decided to move up to R.
There is a great Excel add-in for R (link to tutorial) that gives Excel users full access to R graphics. To me, this is the best of both worlds: Excel for data manipulation, R for statistical and graphical analysis.
Kaiser - great post. I look forward to seeing more from Mr. Urbanspoon.
Posted by: Kelly O'Day | May 07, 2008 at 09:35 AM
One problem is that true elitists wouldn't drink at Starbucks if they had a local roaster. Starbucks is pretty mainstream; I prefer my local house, and I'm basically a snob.
Posted by: Matt | May 07, 2008 at 10:35 AM
About dual-scaled axes and proportional scales: while I agree with Jon Peltier's examples, sometimes the opposite happens -- the same scale is used for two variables that have nothing to do with each other, for instance indices or percentages that happen to run from 0 to 100.
As in your (Kaiser's) example, they are often misused to suggest correlation when there is none.
Posted by: vozome | May 07, 2008 at 11:07 AM
I'd like to do a little analysis of my own on data like this. For example, how does the Obama-Clinton vote depend on the rural-urban ratio of a state (or the average population density), or the democrat-republican vote in the 2004 elections?
Can anyone recommend a good source for the raw data? I'd rather not type everything in by hand if I can avoid it. State by state would be fun, but county by county would be amazing.
Posted by: Aler | May 07, 2008 at 11:52 AM
Professor, I have missed you. Thanks for providing a more rigorous analysis! Indeed, the correlation is as weak as bad coffee. I'll try harder on my next statistical foray.
Oh, and Vozome -- you can find county level data, albeit not well organized, here:
http://www.cnn.com/ELECTION/2004/pages/results/president/
Posted by: Ethan | May 07, 2008 at 01:27 PM
(whoops -- that county-level data is for Aler)
Posted by: Ethan | May 07, 2008 at 01:28 PM
Does removing all the data that best fits the theory disprove the theory?
Or is this simply a case of cherry-picking -- choosing only the data that agrees with your theory and discarding the data that doesn't?
Posted by: Michael | May 07, 2008 at 10:09 PM
It's okay to remove data temporarily, to show how sensitive the line is to the data. The data that "best fits" the theory, in this case, only fits the ends. The bulk of the data doesn't back the theory up at all.
This is why I mentioned lowess. It's basically a fancy mathematical way of doing exactly what Kaiser did: see what happens to the line when you take the ends away. If the linear regression slope goes away, you know the line wasn't strong enough to bear the weight of the hypothesis laid on it.
The lowess version would show a wavy pattern that would be much less impressive as proof of a trend than a linear least-squares fit, which is the next closest thing to a simple mean (a mean is an aggregate measure boiling the data down to one number; a line boils it down to two).
Posted by: derek | May 08, 2008 at 07:36 AM
Thanks Derek. Both the original exercise and what I did fall into the realm of exploratory data analysis. The linear fit was poor and I tried to show why it was so. In general, though, Michael's point is valid.
Posted by: Kaiser | May 08, 2008 at 09:18 AM
Lowess (aka loess) is a locally weighted regression technique (link). Lowess uses all the data by fitting a series of "local" regressions, giving insight into an overall relationship that may vary over the range of x.
Rather than eliminate data, lowess actually uses all the data several times by computing weighted regressions in multiple subsets of the data.
Posted by: D Kelly O'Day | May 08, 2008 at 12:37 PM
I drank Starbucks once. Once. It is burnt, disgusting and way overpriced. I can buy more than enough Folgers for the price of that garbage they sell. Folgers, Dunkin, whatever -- anything is better than Starbucks. And if Obama and his followers like it, that's all the more reason to wonder about Starbucks.
Posted by: Mike | May 23, 2008 at 12:14 PM
Kelly -
I don't know. Dunkin used to be pretty good, but they've worried so much about their menu selections that they've forgotten how to make coffee. I can't drink it any more.
I never used to like Starbucks coffee either, but I find it doesn't seem so burnt the last few times I've tried it. It's a little pricey, but if you stick to regular coffee without any embellishments, it's not so bad.
Posted by: Jon Peltier | May 27, 2008 at 10:19 PM
I have implemented loess in VBA from scratch.
Posted by: nick | Oct 15, 2008 at 12:25 PM
nick, that's great to hear, and something I've long wanted to do myself. Is it in the form of a UDF with a format similar to FORECAST(), or some other method? And would you be willing to share it?
Posted by: derek | Oct 16, 2008 at 02:51 AM
Probably not the most elegant implementation, but it works. It's a UDF that takes as arguments the known X's (xRng), the known Y's (yRng), the number of points to use in each local regression (l), and the point at which to calculate the smoothed value (newX).
Function loess(xRng As Range, yRng As Range, l As Integer, newX)
'// late-bound Scripting.Dictionary objects (no reference to Microsoft Scripting Runtime needed)
Dim x As Object
Set x = CreateObject("Scripting.Dictionary")
Dim y As Object
Set y = CreateObject("Scripting.Dictionary")
Dim d As Object
Set d = CreateObject("Scripting.Dictionary")
Dim w As Object
Set w = CreateObject("Scripting.Dictionary")
'// convert ranges to dictionaries and fill in distances
For i = 1 To xRng.Count
x.Add i, xRng.Item(i)
y.Add i, yRng.Item(i)
d.Add i, Abs(xRng.Item(i) - newX)
Next i
'// loop, removing largest distance until we're down to the number of points we want
Do Until d.Count = l
maxDist = -1
For Each i In d.Keys
If d.Item(i) > maxDist Then
maxDist = d.Item(i)
mx = i
End If
Next i
d.Remove mx
Loop
'// Find max distance, then calc weights (w) using scaled distances
maxDist = -1
For Each i In d.Items
If i > maxDist Then maxDist = i
Next i
For Each i In d.Keys
w.Add i, (1 - (d.Item(i) / maxDist) ^ 3) ^ 3
Next i
'// accumulate the five weighted sums (one through five) and the denominator for the local linear regression
For Each i In w.Keys
one = one + w.Item(i)
two = two + x.Item(i) * w.Item(i)
three = three + x.Item(i) ^ 2 * w.Item(i)
four = four + y.Item(i) * w.Item(i)
five = five + x.Item(i) * y.Item(i) * w.Item(i)
Next i
denom = one * three - two ^ 2
'// calculate slope (slp), intercept(inter), and finally the loess value
slp = (one * five - two * four) / denom
inter = (three * four - two * five) / denom
loess = slp * newX + inter
'// empty the dictionaries
Set w = Nothing
Set d = Nothing
Set y = Nothing
Set x = Nothing
End Function
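Called from a worksheet, it would look something like this (the cell references are just an example): =loess($A$2:$A$51, $B$2:$B$51, 7, D2), with the known X's in column A, the known Y's in column B, and the X value at which to smooth in D2.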
Posted by: nick | Oct 16, 2008 at 09:16 AM
Ah, it uses a restricted number of points, each with 100% weighting. I can implement that without VBA.
What I'm looking for is a way to take all the points, but weight them according to a function, e.g. Gaussian, of distance from the newX point. I've always been hazy about the difference between Cleveland's cute names, loess and lowess; is that the difference?
Thanks for the reply.
Posted by: derek | Oct 16, 2008 at 11:19 AM
Sorry it's not what you were looking for. When I was writing it, I was working off information from
http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm
and used the example from:
http://www.itl.nist.gov/div898/handbook/pmd/section1/dep/dep144.htm
which my code matches exactly. The local points (l=7 in the example page) ARE weighted using the tri-cube weighting function (see in my code the "w" dictionary).
Although the NIST page says "At each point in the data set a low-degree polynomial is fit to a subset of the data...", the example uses a 1st-degree (i.e. linear) polynomial, so I also wrote my code to do a weighted linear regression on the local points.
Posted by: nick | Oct 16, 2008 at 12:34 PM
Of course, if you wanted to use all the data points, you could either a) set l equal to the number of data points or b) remove the "Do Until d.Count = l" loop from the VBA code. If you want a different weighting function than the tri-cube that I used, replace
"w.Add i, (1-(d.Item(i)/maxDist)^3)^3"
with whatever weighting function you prefer.
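For example, a Gaussian weight (just a sketch, using an arbitrary bandwidth of half the maximum distance) could be:
"w.Add i, Exp(-0.5 * (d.Item(i) / (0.5 * maxDist)) ^ 2)"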
Posted by: nick | Oct 16, 2008 at 12:40 PM
Oh, sorry, I didn't spot that bit. I thought you had written it local, but not weighted.
Posted by: derek | Oct 16, 2008 at 02:03 PM
I've made some refinements to Nick's LOESS function:
LOESS Smoothing in Excel
Posted by: Jon Peltier | Mar 09, 2009 at 07:05 AM
Jon, that's great. I didn't follow up on Nick's fine and generous code at first, or for months afterward, but when I finally did, I liked it a lot. I look forward to seeing what you've done with it.
Posted by: derek | Mar 09, 2009 at 03:32 PM