What if the RNC assigned seating randomly

The punditry has spoken: the most important data question at the Republican Convention is where different states are located. Here is the FiveThirtyEight take on the matter:


They crunched some numbers and argue that Trump's margin of victory in the state primaries is the best indicator of how close to the front that state's delegation is situated.

Others have put this type of information on a map:


A scatter plot with an added "trendline" is often misleading: your eyes are drawn to the line and distracted from the points that sit far away from it. In fact, the R-squared of the regression line is only about 20%. This is quite obvious from the distribution of green shades in the map below.


So, I wanted to investigate how robust this regression line is. The way statisticians address this question is as follows: imagine that the seating had been assigned completely at random - how likely is it that the actual seating plan would have arisen from random assignment?

Take the seating assignments from the scatter plot, then randomly shuffle them to create simulated random seating plans. We keep the same slots: for example, four states were given #1 positions in the actual arrangement, so in every simulation, four states get #1 positions - it's just that which four states is decided by flipping coins.
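This shuffle-and-refit procedure (a permutation test) can be sketched as follows. The margins and seating positions here are toy numbers, not the real convention data, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for the real thing: each state's Trump
# primary margin, and a seating rank (1 = closest to the front) that
# loosely tracks the margin plus noise.
n_states = 50
trump_margin = rng.uniform(-20, 40, n_states)
noise = rng.normal(0, 15, n_states)
seating = np.argsort(np.argsort(-trump_margin + noise)) + 1

def slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    return np.polyfit(x, y, 1)[0]

actual_slope = slope(trump_margin, seating)

# Shuffle the seating positions: same slots, random states.
sim_slopes = np.array([
    slope(trump_margin, rng.permutation(seating))
    for _ in range(500)
])

# How many standard errors the "actual" slope sits from the average
# random plan -- the rarity measure discussed below.
z = (actual_slope - sim_slopes.mean()) / sim_slopes.std()
```

Because the shuffles destroy any link between margin and seat, the simulated slopes scatter around zero, and the actual (here, simulated-as-real) slope lands far out in the tail.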

I did one hundred simulated seating plans at a time. For each plan, I created the scatter plot of seating position versus Trump margin (a mirror image of the FiveThirtyEight chart) and fitted a regression line. The following shows the slopes of the first 200 simulations:


The more negative the slope, the more power Trump margin has in explaining the seating arrangement.

Notice that even though all these plans were created at random, the magnitude of the slopes ranges widely. In fact, one randomly created plan sits right below the actual RNC plan shown in red. So it is possible--but very unlikely--that the RNC plan was drawn up at random.

Another view of this phenomenon is the histogram of the slopes:


This again shows that the actual seating plan is very unlikely to be produced by a random number generator. (I plotted 500 simulations here.)

In statistics, we measure rarity in "standard errors". The actual plan is almost, but not quite, three standard errors away from the average random plan. A rule of thumb is that 3 standard errors or more is rare. (This corresponds to over 99% confidence.)


PS. Does anyone have the data corresponding to the original scatter plot? There are other things I want to do with the data but I'd need to find (a) the seating position by state and (b) the primary results nicely set in a spreadsheet.

Once more, superimposing time series creates silly theories

After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.

The contentious issue is that X and Y might appear correlated when, in fact, both data series are strongly correlated with time (e.g. population almost always grows with time), while X and Y are not correlated with each other.

Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.

The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.

The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X and Y and ignoring time, we introduce time as an omitted variable, which can still be controlling both X and Y series.

The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.

This is because time is still the invisible hand: it still runs roughly from left to right across the chart. The pattern becomes visible if we connect the points with line segments in temporal order, as in the chart below.




One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.
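The de-trending step can be sketched as follows. The two series here are simulated stand-ins (both trend upward with time but are otherwise independent), not the actual miles-driven and labor-participation data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical series that both trend with time but are
# otherwise unrelated.
t = np.arange(100)
x = 0.5 * t + rng.normal(0, 3, 100)   # trends up
y = 0.8 * t + rng.normal(0, 5, 100)   # also trends up, independently

# The raw X-Y correlation is inflated by the shared time trend.
raw_corr = np.corrcoef(x, y)[0, 1]

def detrend(series, t):
    """Residuals after removing a fitted linear time trend."""
    coeffs = np.polyfit(t, series, 1)
    return series - np.polyval(coeffs, t)

# Correlation of the de-trended residuals collapses toward zero.
resid_corr = np.corrcoef(detrend(x, t), detrend(y, t))[0, 1]
```

The raw correlation is close to 1 purely because of the time trend; once each series is reduced to its above/below-trend residuals, the correlation essentially vanishes.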

Here is the result (right). We now have a random scatter of points that average about zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, the per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.

What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.



Complex is not random

There is a tendency to mistake complexity for randomness.  Faced with lots of data, especially data squeezed into a small area, one often has trouble seeing patterns, leading to a presumption of randomness -- when, upon careful analysis, distinctive patterns can be recognized.

We encountered this when looking at the "sad tally" of the Golden Gate Bridge suicides (here, here, here, here and here).  Robert Kosara's recent work on scribbling maps of zip codes also highlights the hidden patterns behind seemingly random numbers.

Robert found a related example (via Information Aesthetics, originally here): the artist takes random numbers (lottery numbers) and renders them in a highly irrelevant graphical construct, as if to prove that spider webs can be generated randomly.

According to Infosthetics, each color represents a number between 1 and 49, which means the graph contains 49 colored zigzag lines (not counting gridlines and axes).  Each point on the year axis represents a frequency of occurrence.

Imagine being tasked with using this chart to ascertain the fairness of the lottery, that is, the randomness of the winning numbers.  The complexity of this spider web makes a tough job impossible!  We must avoid the tendency to jump to a conclusion of randomness based on this non-evidence.

In fact, testing for randomness can be done using any of the methods described in the postings on the "Sad Tally" (links above).  A first step will be to plot the frequency of occurrence data as a simple column chart with 1 to 49 on the horizontal axis.  We'd like to show that the resulting histogram is flat, on average over all years.
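A standard way to check that flatness formally is a chi-square goodness-of-fit test against the uniform distribution. The sketch below uses simulated fair-lottery draws (the draw counts and seed are arbitrary), not the actual winning numbers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a fair lottery: 6 distinct numbers from 1-49 per drawing.
draws = np.array([
    rng.choice(np.arange(1, 50), size=6, replace=False)
    for _ in range(2000)
])
counts = np.bincount(draws.ravel(), minlength=50)[1:]  # counts for 1..49

# Chi-square statistic against a flat (uniform) histogram.
expected = counts.sum() / 49
chi_sq = ((counts - expected) ** 2 / expected).sum()

# With 48 degrees of freedom, the 95th percentile of the chi-square
# distribution is about 65.2; a fair lottery should usually fall below it.
```

A statistic well above that critical value would be evidence against fairness; a spider-web rendering of the same counts tells us nothing either way.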

The sad tally 4: the boxplot

Given several datasets (scatter plots), how does one tell random from non-random?  We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.

In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: for purely random data, we expect a certain degree of variability; non-random data betray their character by showing more or less variability at different points in the distribution (see right).  Each scatter plot reveals variability around the mean (0), but the variability apparently has a different shape in Dataset 9 (the real data) than in the other datasets.  We saw this in a "cumulative distribution" plot and, less strongly, in a plot of the frequency of extreme values.

Given what we know now, a much simpler plot will suffice.  The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).
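The quantities a Tukey boxplot displays can be computed directly. This is a sketch on toy counts (not the actual per-location data), using Tukey's usual 1.5 x IQR rule for flagging outliers:

```python
import numpy as np

def box_summary(data):
    """The key features a Tukey boxplot displays."""
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    # Tukey's rule: points beyond 1.5 * IQR from the box edges
    # are drawn as individual outlier dots.
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lo or x > hi]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Toy counts echoing the suicide data: mostly moderate values plus
# one extreme location with 61 deaths.
counts = [12, 15, 17, 18, 19, 20, 21, 22, 23, 25, 27, 30, 61]
summary = box_summary(counts)
```

On these toy counts, the extreme value of 61 falls well beyond the upper fence and is flagged as an outlier, exactly the feature that makes Dataset 9 stand out in the boxplot.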

The evidence of non-randomness in Dataset 9 now starkly confronts us.  Its box is much wider; its median is noticeably lower than those of the other 8 datasets; the extreme value of 61 is clearly out of whack.


Now, let's take a step back and look at what this non-randomness means.  We are concluding that the choice of suicide location is not random.  Location 69/70 (the outlier, with 61 deaths) is definitely the most popular.  Partly because of this, many of the other locations have had fewer than 20 deaths, which is fewer than expected if people had selected locations at random.

Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.

The sad tally

John Shonder, a reader, alerted me to the following unusual chart which identifies the precise locations where people jumped from the Golden Gate Bridge:


He asks: is the choice of location "random"?

This is a very rich question and different statisticians will take different approaches.   In this post, I take a purely visual, non-rigorous look at the question; and if I have time (and if other readers haven't commented already), I may discuss more rigorous methods in the future.

First, I restrict my attention to light poles 43 through 112, i.e. the bridge segment that lies above the water.  Also, I only consider the north-south locations: in other words, 43 and 44 are counted as one, so are 111 and 112.  Otherwise, the distribution is clearly biased (towards the water and the east side).

When we say "random", we usually mean there is equal chance that someone will jump from location 43/44 or from location 111/112 or any location in between.  There are 35 locations and 755 documented suicides, averaging to 21.6 suicides per location.  But 21.6 is an average, which is not directly observable; even assuming that the choice of location is random, we would not find exactly 21-22 suicides at each location.  (Similarly, even if there is a 50/50 chance of getting a head when we flip a coin, in any given run of 100 flips, it is very unlikely that we will see exactly 50 heads.)
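Both numbers in that paragraph are easy to check. The coin-flip aside follows from the binomial formula:

```python
from math import comb

# 755 jumpers spread over 35 equally likely locations:
# the "random" benchmark rate per location.
avg = 755 / 35            # about 21.6

# The analogy in the text: even a fair coin rarely lands
# on exactly 50 heads in 100 flips.
p_exact_50 = comb(100, 50) / 2 ** 100   # roughly 0.08
```

So even under perfect fairness, "exactly 50 heads" happens only about 8% of the time, which is why we should expect counts to fluctuate around 21.6 rather than hit it.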

So, at some locations we will see more than 21.6 deaths; at others, fewer.  The question becomes whether the fluctuations are too much to refute the notion that the choice of location is random.

In the following set of graphs, I ran some simulations.  Eight of the nine graphs represent scenarios under which I sent 755 people to the bridge and randomly assigned each one of the 35 locations to jump from (okay, this is a thought experiment only; please don't do this at home).  The x-axis represents locations; the y-axis represents the number of suicides at that location -- but on a standardized scale.


The standardized scale allows us to compare across graphs.  The zero line represents the mean number of suicides per location.  The number of suicides at most locations is within one standard deviation away from this mean (i.e. between -1 and 1 on the y-axis).  In some extreme cases, the number of suicides is more than 3 standard deviations larger than the mean (i.e. greater than 3).
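One simulated graph's worth of data, including the standardization step, can be sketched like this (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# 755 jumpers, each assigned one of 35 locations uniformly at random.
locations = rng.integers(0, 35, size=755)
counts = np.bincount(locations, minlength=35)

# Standardize so graphs are comparable: subtract the mean count and
# divide by the standard deviation across locations.
z_scores = (counts - counts.mean()) / counts.std()
```

After standardization the zero line is the average location, and a value of 3 means a location three standard deviations above it, regardless of the raw scale of the counts.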

Back to randomness: well, one of the 9 graphs is the real data from the map above.  If you can guess which of the 9 is real, then the real data is probably not random.  If you can't, then the real data may be random!

I will publish the answer tomorrow.  In the meantime, feel free to take a guess and/or comment on what other approach you'd take.  One take-away from this exercise is that it's very hard to tell non-random from random unless it is very obvious.

References: San Francisco Chronicle, Boing Boing