
TrackBack


Listed below are links to weblogs that reference Turning in his grave 1:

» Scatterplots from Customer Satisfaction Measurement - The Leadership Factor
The intermittently good blog Junk Charts has another post highlighting the danger of thinking that you can analyse the relationship between two variables simply by plotting them next to each other on a line chart. The problem is that line [Read More]

Comments

derek

Speaking of regression, has anyone ever implemented lowess (locally-weighted regression), in VBA or otherwise, for Excel? I see Kelly O'Day has a note that he's looked and come up blank.

Jon Peltier

Speaking of secondary axes, Stephen Few discussed them in "Dual-Scaled Axes in Graphs: Are They Ever the Best Solution?" (the answer is no), and I countered with a class of examples where they are acceptable in "Secondary Axes that Work - Proportional Scales".

Kelly O'Day

Derek - After trying to get Excel/VBA to handle lowess and panel charts for quite a while, I've decided to move up to R.

There is a great Excel add-in for R (link to tutorial) that gives Excel users full access to R graphics. To me, this is the best of both worlds: Excel for data manipulation, R for statistical and graphical analysis.

Kaiser - great post. I look forward to seeing more from Mr. Urbanspoon.

Matt

One problem is that true elitists wouldn't drink at Starbucks if they have a local roaster. Starbucks is pretty mainstream; I prefer my local house, and I'm basically a snob.

vozome

About dual-scaled axes and proportional scales... while I agree with Jon Peltier's examples, sometimes the opposite happens: the same scale is used for two variables that have nothing to do with each other, for instance indices or percentages that happen to go from 0 to 100.
Like in your (Kaiser's) example, they are often misused to invoke correlation when there is none.


Aler

I'd like to do a little analysis of my own on data like this. For example, how does the Obama-Clinton vote depend on the rural-urban ratio of a state (or the average population density), or on the Democrat-Republican vote in the 2004 election?

Can anyone recommend a good source for the raw data? I'd rather not type everything in by hand if I can avoid it. State by state would be fun, but county by county would be amazing.

Ethan

Professor, I have missed you. Thanks for providing a more rigorous analysis! Indeed, the correlation is as weak as bad coffee. I'll try harder on my next statistical foray.

Oh, and Vozome -- you can find county-level data, albeit not well organized, here:
http://www.cnn.com/ELECTION/2004/pages/results/president/

Ethan

(whoops -- that county-level data is for Aler)

Michael

Does removing all the data that best fits the theory disprove the theory?

Or is this simply a case of cherry-picking: only choosing the data that agrees with your theory and discarding that which doesn't?

derek

It's okay to remove data temporarily, to show how sensitive the line is to the data. The data that "best fits" the theory, in this case, only fits the ends. The bulk of the data doesn't back the theory up at all.

This is why I mentioned lowess. It's basically a fancy mathematical way of doing exactly what Kaiser did: see what happens to the line when you take the ends away. If the linear regression slope goes away, you know the line wasn't strong enough to bear the weight of the hypothesis laid on it.

The lowess version would have a wavy pattern that would be much less impressive as proof of a trend than a linear least-squares fit, which is the next closest thing to a simple mean (a mean is an aggregate measure boiling the data down to one number; a line boils it down to two numbers).

Kaiser

Thanks Derek. Both the original exercise and what I did fall into the realm of exploratory data analysis. The linear fit was poor and I tried to show why it was so. In general, though, Michael's point is valid.

D Kelly O'Day

Lowess (aka loess) is a locally weighted regression technique (link). Lowess uses all the data by fitting a series of "local" regressions, giving insight into an overall relationship that may vary over the range of x.

Rather than eliminating data, lowess actually uses all the data several times, by computing weighted regressions on multiple subsets of the data.
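
A compact way to write the local fits Kelly describes (standard degree-1 loess notation, not quoted from his comment): at each target point x0, the smoother picks the intercept a and slope b that minimize a distance-weighted sum of squares,

$$\min_{a,\,b}\;\sum_i w_i(x_0)\,\bigl(y_i - a - b\,x_i\bigr)^2, \qquad w_i(x_0) = W\!\left(\frac{|x_i - x_0|}{d_{\max}(x_0)}\right),$$

where W is a weight kernel (the tri-cube in the implementation discussed below), d_max(x0) is the largest distance within the local subset, and the smoothed value at x0 is the fitted a + b*x0.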


Mike

I drank Starbucks once. Once. It is burnt, disgusting, and way overpriced. I can purchase more than enough Folgers for the price of that garbage they sell. Folgers, Dunkin, whatever - anything is better than Starbucks. And if Obama and his followers like it, then that's all the more reason to wonder about Starbucks.

Jon Peltier

Kelly -

I don't know. Dunkin used to be pretty good, but they've worried so much about their menu selections that they've forgotten how to make coffee. I can't drink it any more.

I never used to like Starbucks coffee either, but I find it doesn't seem so burnt the last few times I've tried it. It's a little pricey, but if you stick to regular coffee without any embellishments, it's not so bad.

nick

I have implemented loess in VBA from scratch.

derek

nick, that's great to hear, and something I've long wanted to do myself. Is it in the form of a UDF with a format similar to FORECAST(), or some other method? And would you be willing to share it?

nick

Probably not the most elegant implementation, but it works. It's a UDF that takes as arguments the known X's (xRng), the known Y's (yRng), the number of points to use in the regression (l), and the point at which to calculate the smoothed value (newX).

Function loess(xRng As Range, yRng As Range, l As Integer, newX)
'// dictionaries for the x values, y values, distances, and weights;
'// late binding avoids needing a reference to Microsoft Scripting Runtime
Dim x As Object, y As Object, d As Object, w As Object
Set x = CreateObject("Scripting.Dictionary")
Set y = CreateObject("Scripting.Dictionary")
Set d = CreateObject("Scripting.Dictionary")
Set w = CreateObject("Scripting.Dictionary")

'// convert ranges to dictionaries and fill in distances
For i = 1 To xRng.Count
x.Add i, xRng.Item(i)
y.Add i, yRng.Item(i)
d.Add i, Abs(xRng.Item(i) - newX)
Next i

'// loop, removing largest distance until we're down to the number of points we want
Do Until d.Count = l
maxDist = -1
For Each i In d.Keys
If d.Item(i) > maxDist Then
maxDist = d.Item(i)
mx = i
End If
Next i
d.Remove mx
Loop

'// Find max distance, then calc weights (w) using scaled distances
maxDist = -1
For Each i In d.Items
If i > maxDist Then maxDist = i
Next i
For Each i In d.Keys
w.Add i, (1 - (d.Item(i) / maxDist) ^ 3) ^ 3
Next i

'// accumulate the weighted sums for the local linear regression:
'// one = sum(w), two = sum(w*x), three = sum(w*x^2), four = sum(w*y), five = sum(w*x*y)
For Each i In w.Keys
one = one + w.Item(i)
two = two + x.Item(i) * w.Item(i)
three = three + x.Item(i) ^ 2 * w.Item(i)
four = four + y.Item(i) * w.Item(i)
five = five + x.Item(i) * y.Item(i) * w.Item(i)
Next i
denom = one * three - two ^ 2

'// calculate slope (slp), intercept(inter), and finally the loess value
slp = (one * five - two * four) / denom
inter = (three * four - two * five) / denom
loess = slp * newX + inter

'// release the dictionary objects
Set w = Nothing
Set d = Nothing
Set y = Nothing
Set x = Nothing
End Function
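
A usage note and a bit of algebra for readers following along (the worksheet ranges are illustrative assumptions, not from nick's workbook): with x values in A2:A22, y values in B2:B22, and a target x in D2, the smoothed value from the 7 nearest points would be

=loess($A$2:$A$22, $B$2:$B$22, 7, D2)

The sums one..five are the pieces of the closed-form weighted least-squares line, which is what slp and inter compute:

$$\hat{b} = \frac{\sum w \sum wxy - \sum wx \sum wy}{\sum w \sum wx^2 - \left(\sum wx\right)^2}, \qquad \hat{a} = \frac{\sum wx^2 \sum wy - \sum wx \sum wxy}{\sum w \sum wx^2 - \left(\sum wx\right)^2}.$$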

derek

Ah, it uses a restricted number of points, each with 100% weighting. I can implement that without VBA.

What I'm looking for is a way to take all the points, but weight them according to a function, e.g. Gaussian, of distance from the newX point. I've always been hazy about the difference between Cleveland's cute names, loess and lowess; is that the difference?

Thanks for the reply.

nick

Sorry it's not what you were looking for. When I was writing it, I was working off of information from
http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm

and used the example from:
http://www.itl.nist.gov/div898/handbook/pmd/section1/dep/dep144.htm

which my code matches exactly. The local points (l=7 in the example page) ARE weighted using the tri-cube weighting function (see in my code the "w" dictionary).

Although the NIST page mentions "At each point in the data set a low-degree polynomial is fit to a subset of the data...", in the example they were using a 1st-degree (i.e. linear) polynomial, so I also wrote my code to do a weighted linear regression on the local points.
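
In symbols (a restatement of what the code does, not a quote from the NIST page): the weight the code assigns to a retained point at distance d_i from newX is

$$w_i = \left(1 - \left(\frac{d_i}{d_{\max}}\right)^{3}\right)^{3},$$

where d_max is the largest of the l retained distances, so the farthest retained point gets weight zero.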

nick

Of course, if you wanted to use all the data points, you could either a) set l equal to the number of data points or b) remove the "Do Until d.Count = l" loop from the VBA code. If you wanted a different weighting function than the tri-cube that I used, replace
"w.Add i, (1 - (d.Item(i) / maxDist) ^ 3) ^ 3"
with whatever weighting function you prefer.
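
One concrete swap, offered as a sketch rather than as part of nick's code: a Gaussian kernel in place of the tri-cube. The bandwidth sigma is an assumed extra input, not an argument of the original UDF; it would need to be passed in or derived, for example from maxDist.

'// hypothetical helper: Gaussian weight for a point at distance dist from newX;
'// sigma is an assumed bandwidth parameter, and larger sigma means heavier smoothing
Function GaussianWeight(dist As Double, sigma As Double) As Double
    GaussianWeight = Exp(-0.5 * (dist / sigma) ^ 2)
End Function

Inside the weighting loop, the tri-cube line would then read
w.Add i, GaussianWeight(d.Item(i), sigma)
and, because a Gaussian weight never reaches zero, it pairs naturally with option (a) or (b) above, i.e. keeping all the data points rather than trimming to l.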

derek

Oh, sorry, I didn't spot that bit. I thought you had written it local, but not weighted.

Jon Peltier

I've made some refinements to Nick's LOESS function:

LOESS Smoothing in Excel

derek

Jon, that's great. I initially, and for months afterward, didn't follow up on Nick's fine and generous code, but when I finally did, I liked it a lot. I look forward to seeing what you've done with it.
