## Following one's nose 1

##### Oct 27, 2009

Andrew Gelman has a great post about a so-called Immigrant paradox here, which should be interesting to our readers too.

He posed a set of sharp questions.  My read, in reverse order:

6. The graph is pretty effective, I agree.  This is known as an "interaction plot".  The message the authors were trying to send was that the gap between immigrants and U.S. born in terms of prevalence of mental illness is not constant across sub-groups of Latinos.  For example, the gap for Mexicans (light blue) is larger than the gap for Puerto Ricans (pink).  Thus, the authors concluded that one should be careful about speaking of an aggregate (average) gap.

The graph lays this out clearly.  The steeper the line, the bigger the gap between immigrants and non-immigrants.
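To make "steeper line, bigger gap" concrete, here is a minimal sketch in Python. The group names and rates are invented for illustration and are not the values from the chart; the only point is that with two positions on the horizontal axis, the slope of each line is the gap itself.

```python
# Hypothetical prevalence rates (%), invented purely for illustration.
# They are NOT the values from the chart discussed in this post.
rates = {
    "Group A": {"us_born": 30.0, "immigrant": 20.0},
    "Group B": {"us_born": 35.0, "immigrant": 32.0},
    "Group C": {"us_born": 25.0, "immigrant": 18.0},
}


def gap(group):
    """Gap in percentage points between U.S.-born and immigrant rates.

    On a two-point interaction plot with the two x positions one unit
    apart, this is the slope of the group's line (up to sign).
    """
    return group["us_born"] - group["immigrant"]


gaps = {name: gap(values) for name, values in rates.items()}

# List the groups from the steepest line to the flattest one.
for name, g in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: gap = {g:.1f} percentage points")
```

If the gaps were all equal, the lines would be parallel and an aggregate gap would be a fair summary; unequal gaps are what the plot is meant to expose.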

When Andrew showed this, I knew for sure someone would cry foul that a line is drawn between unrelated, discrete things.  Indeed, the very first commenter weighed in with this complaint.  In fact, whenever I show such charts to non-statisticians, a lot of people have this reaction.

So I'll take this as another chance to convince you to release interaction plots from jail.

Typically, a dissenter will offer up a dot plot as an alternative.  So let's look at the same chart without the lines.  Since the reader is supposed to figure out how the gap between U.S. born and immigrant groups varies across different subgroups of Latinos, the proverbial nose is tracing a line from a left dot to a right dot.  Thus, to follow one's nose is to mentally draw the lines I just removed.  The chart designer has done us a favor by making the lines explicit.

In addition, as Andrew pointed out, it is always better to try to get rid of the legend and put the line labels directly onto the chart.

One shortcoming of the interaction plot is that it does not disclose the relative importance of the different lines, which correspond to the relative proportions of people in these subgroups.  Without this information, the reader will likely assume the lines have equal weight.  This assumption, as I will explain in a future post, may be a problem.
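A rough sketch of why the equal-weight assumption matters, again in Python with made-up numbers (both the gaps and the subgroup sizes below are hypothetical, not taken from the study): the average gap a reader infers by treating every line equally can differ from the average weighted by subgroup size.

```python
# Hypothetical gaps (percentage points) and subgroup sizes, invented for
# illustration only. Neither comes from the study discussed in this post.
gaps = {"Group A": 10.0, "Group B": 3.0, "Group C": 7.0}
sizes = {"Group A": 600, "Group B": 100, "Group C": 300}


def equal_weight_mean(gaps):
    """Average gap if every line on the plot counts the same."""
    return sum(gaps.values()) / len(gaps)


def size_weighted_mean(gaps, sizes):
    """Average gap with each line weighted by its subgroup's size."""
    total = sum(sizes.values())
    return sum(gaps[name] * sizes[name] for name in gaps) / total


print(f"equal weights: {equal_weight_mean(gaps):.2f}")
print(f"size weighted: {size_weighted_mean(gaps, sizes):.2f}")
```

With these invented numbers the largest subgroup also has the largest gap, so the size-weighted average comes out higher than the equal-weight one. The plot alone gives the reader no way to know which situation applies.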

This post dealt with the graphical aspect.  I will have more to say about Andrew's other points on the statistics in a future post.

My issue with using lines to connect discrete things is that it can imply that there is something "half-way" between two discrete points, even when there is not.
However, when you have just two points on each 'line', it's a slightly different thing... you are comparing two things, which somehow feels different to comparing a range of things.
A possible alternative would be a bar chart, grouped by type of Latino, with a bar for US born and immigrant... then the bigger the difference in bar height, the bigger the difference between the two groups.

I'm trained as a social scientist, and of course we use interaction plots all the time. They show all the data, and emphasize the slopes and interactions. They quickly become unwieldy if you have confidence intervals and more than 3 lines, though. (This graph really should have confidence intervals!)

If all you care about is the slopes, you can always make a difference plot. Probably best shown as a (horizontal) bar graph or dot graph with confidence intervals, plotting differences in this case of (US - Immigrant) mental illness. You can get many more cases on the page with that sort of chart, since the lines won't overlap.

I'm on your side w.r.t. line graphs generally, but I think this is a bad particular example, since it's only got two labels on the "horizontal axis". So the obvious solution to everyone's objections is to transpose the graph so that the immigrant subgroups are labelled along the bottom and there are two series, US born and Immigrant.

Now you can have a much larger variety of graph types; two lines, two point series, two bars or a single floating bar or line with two ends.

The lines are helpful in that they look at the relative difference between the US born and the Immigrant population. Parallel lines mean the subgroups have similar relative proportions--and that is almost impossible to read from the dot version. Also the lines help with the overlap problem with Mexicans and Others in the Immigrant group.

One way of showing this would be to have another column of Immigrant/US Born, perhaps on the same graph but with a different scale (or the same range stretched to higher levels with a second label).

I would choose the lined version over alternatives.

You need to be mindful of the scale--you just dropped it by a factor of 100, with a label that says (%) while showing the decimal value.

Was anyone else disturbed by the fact that the native non-Latino whites are the most likely to have disorders? Perhaps a graph that breaks down the foreign born by % of life spent in the US would tell us if it was delivery here in the US or just kids' TV shows that drive the greater incidence. Probably not a discussion for this page.

Since no one bothered to say it at the other site, I'll mention the obvious: anyone born in Puerto Rico is US-born. Are immigrant Puerto Ricans those born in Mexico or China to Puerto Rican parents who chose not to take birthright US citizenship and later decided to be naturalized? That must be an awfully small sample.

These plots are also used all the time in biology, where we call the change the "norm of reaction". One item of great value is to be able to compare the amount of change between two conditions for two different groups to the difference in change (i.e., are the slopes of the connecting lines the same or different, regardless of the absolute value of the "dots").

I've had people, smart people, tell me that parallel coordinates plots are not valid, because the lines connect across discrete categories. They refuse to see that a line sometimes shows a continuous trend, but also connects points to help see other types of patterns.