## Missing data, mysterious order, reverse causation wipe out a simple theory

##### Jun 02, 2014

New York Times columnist Floyd Norris published a set of charts purporting to show that the housing market in the U.S. is on the mend. Not so fast, Floyd.

His theory - originating from an economist at Hanley Wood, a real estate research firm - is that in a recovering market, the share of new home sales by home builders should be higher than the share by banks, as the bank share is associated with foreclosed houses. The data offered are both in aggregate and by regions. I'm particularly interested in the regional chart from a design perspective.

The published chart is the one shown on the left below. I am not a fan of nested bar charts. I don't think there is any justification for treating two data series (here, share by banks and share by builders) differently. Which of the two series should one assign to the fatter bars?

If we slim the fat bars down, we retrieve the more conventional paired bar chart, shown on the right. Of the two, I prefer the paired version.

***

There is a weakness with both versions. The theory rests on the relative share, which is clearer in a stacked presentation as shown on the right.

This presentation also shines a light on a dark corner of Norris's analysis. In every city but Detroit, an unmentioned group of sellers accounts for the majority of home sales! Nowhere in the article did Norris tell readers who those sellers are, or why they are ignored.

In all these charts, I have kept the original order of cities. Before reading further, see if you can tease out the criterion for sorting the cities.

With some effort, you'll learn that the cities are arranged in the order of degree of housing recovery, which is measured by the difference in share: the cities at the top (Houston, Dallas, etc.) have a higher share of builders selling than banks selling.

Ironically, the difference in share is the least emphasized data in a nested bar chart. In fact, how you compute the difference depends on the relative share! When the olive bar is longer than the blue bar, the reader sizes up the white space between the edges of the bars; when the blue bar is longer, though, the reader must look inside the blue area, and compute the interior distance.

The reader can use some help here. Possible fixes include using a footnote, or adding a note informing readers that up implies stronger recovery, or creating a visual separation between those cities in which the share by builders exceeds that by banks, and vice versa.

Here is a dotplot with annotations. The separation between the dots is easily estimated.

***

Recall the theory that in recovering markets, banks account for a lesser share of home sales. The analyst turned this into a metric by subtracting the share by banks from the share by builders.

This metric is highly problematic. The first problem, already discussed, is that there exist more than these two types of sellers, and it is absolutely not the case that if the share by banks goes down, the share by builders goes up.
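To make the first problem concrete, here is a minimal sketch of the metric alongside the leftover share. The city names and percentages are hypothetical placeholders, not the figures from the Times charts, and the sign convention (builders minus banks, larger meaning "more recovered") is my reading of how the cities were sorted.

```python
# Hypothetical shares (percent of all home sales). NOT the article's data.
shares = {
    "City A": {"builders": 30.0, "banks": 10.0},
    "City B": {"builders": 15.0, "banks": 15.0},
    "City C": {"builders": 10.0, "banks": 25.0},
}


def recovery_metric(builders, banks):
    """Share by builders minus share by banks, in percentage points."""
    return builders - banks


def other_share(builders, banks):
    """Share left for sellers who are neither builders nor banks."""
    return 100.0 - builders - banks


# Sort cities the way the chart appears to: largest difference first.
rows = [
    (city, recovery_metric(**s), other_share(**s))
    for city, s in shares.items()
]
rows.sort(key=lambda r: r[1], reverse=True)

for city, metric, other in rows:
    print(f"{city}: builders - banks = {metric:+.1f} pts, other sellers = {other:.1f}%")

# Two shares that do not sum to 100% can move together: here the bank
# share falls by 5 points, the builder share also falls by 5 points,
# and the metric does not change at all.
before = {"builders": 20.0, "banks": 15.0}
after = {"builders": 15.0, "banks": 10.0}
print(recovery_metric(**before), recovery_metric(**after))
print(other_share(**before), other_share(**after))
```

In the made-up before/after pair, the bank share drops while the metric stays flat, because the third group of sellers absorbs the change. That is the sense in which a falling bank share does not imply a rising builder share.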

Another issue is that the structure of the housing market in different cities is probably different. The chart promotes the view that there is a general trend that extends to all markets. In fact, the variation over time within one city should be more telling than the variation across twenty cities at a single point in time.

And there is the third strike.

This is a confusion between forward and reverse causation (see Andrew's post here for a general discussion of this important practical issue). The Floyd Norris/Hanley Wood theory expresses a forward causation: if a housing market is recovering, then banks will work through their inventory of foreclosed homes, and account for a decreasing share of home sales.

The analysis addresses the reverse of this relationship. The analyst observes that banks (in some cities) are selling fewer homes, and concludes that the housing market is recovering. Notice that this is a problem of reverse causation: instead of cause -> effect, we have effect -> cause. The rub is that any given outcome has many possible causes. Banks sell fewer homes for many possible reasons, only one of which is a recovering market.

Here are some other possibilities. The banks expect prices to rise in the future, and they are holding on to the inventory. The economy is sputtering and banks are tightening up on mortgage lending, making it harder to sell homes. Instead of selling the homes, the banks decide to destroy the homes to reduce supply and raise prices. The mysterious third group of sellers has put a lot of homes on the market. And so on.

In making claims based on observational data, one must conduct side investigations to rule out other causes.

***

From a Trifecta Checkup perspective, this chart addresses an interesting Question. The Visual design has hiccups. The biggest problem is that the Data provide an unsatisfactory answer to the question at hand. (Type DV)

I would assume that the remainder are sales by existing home owners.

House prices have risen over the last year by a surprising amount (10%+) and that would reduce the foreclosures. So the figures for what they are trying to prove are already available. Agree that their model is too simplistic to actually prove anything.

I don't agree with the nested/paired bar chart bit at all. When showing two series like this the important thing is good visual separation. Clearly separation is better in the original chart.
Which series should be the fat one is the same problem as asking which series should be the black or the blue. It is a question of where to put focus. I think focus is more even in the original.

Stacked bar charts are very problematic as the top series is impossible to see properly. Here we have the same problem of focus but much worse: which should be on top?

If one wants to show the relative parts, just calculate it and show that instead, as two series. Maybe with a separate chart of the sum.

Jorgen: if you notice my language, I'd not have used a bar chart at all. The best way to show separation is a dot plot.

Yes, sorry, I did read that and I think it is a good choice too. Sometimes I forget to write what I agree with. I agree with most of your writings.
Still..
The drawback with a dot plot is that you mostly show the differences between values, not their quantitative relation. That is much clearer in a bar chart.
About separation: separation can be a problem if the values are the same.
