
Emergent patterns

It's always a pleasure to read blow-by-blow accounts of how charts were constructed.  The piece on time-travel maps was instructive.  Similarly, in the previous post, I quoted the following:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

At first sight, this appears to be a case of removing outliers, a practice many statisticians recommend.  Except that the data omitted were not outliers.  Indeed, when both the x- and y-variables are bounded (between 0% and 100% share of the House seats; between -100% and +100% change in share), there can be no extreme values: the omitted points sit on the boundary of the possible range, not beyond it.
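To make the bounds concrete, here is a minimal sketch that enumerates every change in seat share a delegation can experience. It assumes the delegation keeps the same number of seats between the two elections, and the function name `possible_share_changes` is made up for this illustration.

```python
from fractions import Fraction


def possible_share_changes(seats):
    """Every possible change in a party's seat share (in percentage points)
    for a delegation of `seats` members, assuming the seat count is fixed."""
    # Possible shares: 0 seats, 1 seat, ..., all seats, as exact percentages.
    shares = [Fraction(100 * k, seats) for k in range(seats + 1)]
    # A change is any "after" share minus any "before" share.
    return sorted({after - before for before in shares for after in shares})


for seats in (1, 2):
    changes = possible_share_changes(seats)
    print(seats, [float(c) for c in changes])
# 1 [-100.0, 0.0, 100.0]
# 2 [-100.0, -50.0, 0.0, 50.0, 100.0]
```

A one-seat state like South Dakota can only register -100, 0 or +100, and a two-seat state like New Hampshire adds just the two half steps. A swing from 100% to 0% is therefore the edge of the scale for such states, not an aberrant measurement.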

In effect, when the author eliminated those eight points, he followed the "emergent pattern" theory, by which I mean the notion of removing data until a pattern "emerges".  (By the way, emergence is now a science, as expounded here.)  If enough data are removed, one can produce any pattern one pleases.  One can find subsets of the data to support a hypothesis of a positive linear, flat linear or quadratic relationship, as shown below.
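The same effect is easy to reproduce on made-up data. The sketch below does not use the election data or recreate the chart; it generates 50 synthetic points with no relationship between x and y, then greedily drops eight of them, each time choosing the point whose removal raises the correlation the most. The helper names `pearson` and `drop_points_until_pattern` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def drop_points_until_pattern(points, n_drop):
    """Greedily remove the point whose removal increases r the most."""
    kept = list(points)
    for _ in range(n_drop):
        best = max(
            range(len(kept)),
            key=lambda i: pearson(*zip(*(kept[:i] + kept[i + 1:]))),
        )
        kept.pop(best)
    return kept


rng = random.Random(0)
# Synthetic data: x in [0, 100], y in [-100, 100], drawn independently.
points = [(rng.uniform(0, 100), rng.uniform(-100, 100)) for _ in range(50)]

r_full = pearson(*zip(*points))
subset = drop_points_until_pattern(points, 8)
r_subset = pearson(*zip(*subset))
print(f"r with all 50 points: {r_full:.2f}")
print(f"r after dropping 8:   {r_subset:.2f}")
```

Swapping `max` for `min` pushes the correlation in the negative direction instead, which is the point: the "pattern" is a product of the selection rule, not of the data.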


Focusing now on the full data set in the upper left corner, one is hard pressed to conclude that a positive correlation exists between the two variables. In particular, most states experienced no change in their share of House seats, and in these states, income growth ranged from under 20% to over 40%, which is pretty much the full extent of variability across the data set.




The link to "Emergence is now a science" is broken. Care to share the right one?


This one:


gah, except that didn't work. let's try again: link


The link is fixed now. Thanks for the note.


Thinking about a solution to the percentage problem: instead of focussing on states as a whole, why not look at the chance of switching party for each congressperson, using, perhaps, some form of ordinal logistic regression, possibly within a mixed model?

Could rank the possible changes:
Switch to D
Stay same
Switch to R

Then the IVs (independent variables) could be at both the state level and the district level (and maybe the national level, too).
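For readers unfamiliar with the model suggested above, here is a minimal sketch of how an ordinal (proportional-odds) logistic model turns a linear predictor into probabilities for the three ranked outcomes. Nothing here is fitted to data: the cutpoints and the predictor values are invented for illustration, the function name `ordinal_probs` is hypothetical, and the mixed-model (random effects) part of the suggestion is left out.

```python
import math

OUTCOMES = ("Switch to D", "Stay same", "Switch to R")


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probs(eta, cutpoints):
    """Proportional-odds model: P(Y <= k) = logistic(cutpoint_k - eta).

    `eta` is the linear predictor built from the IVs (state-level,
    district-level, ...); larger values shift mass toward "Switch to R".
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))
    ]
    return dict(zip(OUTCOMES, probs))


# Hypothetical cutpoints separating the three ordered outcomes.
cutpoints = (-2.0, 2.0)
for eta in (-1.0, 0.0, 1.0):
    probs = ordinal_probs(eta, cutpoints)
    print(eta, {name: round(p, 3) for name, p in probs.items()})
```

Because each seat contributes one outcome, this framing avoids the coarse percentages that small delegations produce when states are the unit of analysis.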


Kaiser, as you are aware, these are not emergent patterns but scientist-produced patterns arising from data manipulation. They should not be called “emergent”. An emergent pattern is not produced by the scientist; it is only seen by the scientist. In the case of emergence, the underlying elements produce the pattern through their relations. The relation of the elements to each other gives access to a new, emergent layer of meaning and to new properties that were not seen by inspecting the isolated elements. The emergent phenomenon is biased, or leads to false conclusions, when the inclusion or exclusion of the elements that belong to an entity is not clearly defined.


The reference to the Emergence book was definitely tongue-in-cheek. However, I do think one of the challenges of that field is to ascertain that observed patterns are not of random origin.

This theme in fact arises again and again. Classical statisticians are very familiar with it through the multiple comparisons issue in the analysis of experiments. In machine learning, and in particular in the area known as "association rules", the biggest challenge is how to weed out correlations that just happen to show up in your sample.
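A small simulation shows how readily chance correlations appear. This is a sketch on synthetic data only: an outcome of 20 random observations is compared against 200 candidate variables that are pure noise, and the threshold of 0.444 is the approximate two-sided 5% critical value of |r| for 20 observations.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
n_obs, n_candidates = 20, 200
threshold = 0.444  # approx. two-sided 5% cutoff for |r| when n = 20

outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
# Every candidate is independent noise, so any "hit" is spurious.
rs = [
    pearson([rng.gauss(0, 1) for _ in range(n_obs)], outcome)
    for _ in range(n_candidates)
]
false_hits = [r for r in rs if abs(r) > threshold]

print(f"{len(false_hits)} of {n_candidates} noise variables pass the 5% test")
print(f"largest |r| found by chance: {max(abs(r) for r in rs):.2f}")
```

With a 5% test applied 200 times, about ten spurious "findings" are expected even though no real relationship exists, which is why multiple-comparison corrections or a held-out sample are needed before trusting any one of them.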
