The sad tally 5: comparing quantiles

Today I return to the analysis of the sad tally: are suicide locations on the Golden Gate Bridge random or, more generally, how does one determine whether a sequence of numbers is random?  The visual evidence, from cumulative distributions and box plots, tells us that the shape of the distribution matters.  One way to compare two distributions directly is to compare their quantiles.

The following chart shows the (smoothed) cumulative distribution of some non-random data (Dataset 9) on the left, and randomly generated data on the right.  It is clear that the two lines are not the same shape; is there a systematic way to compare them?

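[Figure "Suicidescdf2": smoothed cumulative distributions of Dataset 9 (left) and randomly generated data (right)]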
The orange line marks the point at which the number of suicides equals 40% of the total.  On the left, this means the suicides committed between locations 41 and 72 account for 40% of the total.  On the right, the same share is reached between locations 41 and 70.  The pink line similarly marks the point at which the suicides equal 20% of the total.  Notice that at this point on the distribution the locations differ markedly: 41-65 on the left versus 41-58 on the right.
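For readers who want to reproduce the lookup behind the orange and pink lines, here is a minimal sketch.  It assumes the data arrive as a list of suicide counts, one per consecutive location, with the first entry belonging to location 41 as in the chart.  The function name and the toy counts are made up for illustration and are not the actual tally.

```python
from itertools import accumulate


def location_at_percent(counts, percent, first_location=41):
    """Return the first location at which the running count reaches
    `percent` percent of the total.

    `counts[i]` is the number of events at location `first_location + i`.
    Integer arithmetic is used so that the comparison is exact.
    """
    total = sum(counts)
    for offset, running in enumerate(accumulate(counts)):
        if running * 100 >= percent * total:
            return first_location + offset
    raise ValueError("percent must be between 0 and 100")


# Made-up counts for locations 41 to 45, purely for illustration.
toy_counts = [1, 1, 2, 4, 2]

print(location_at_percent(toy_counts, 40))  # 43: locations 41-43 hold 4 of 10
print(location_at_percent(toy_counts, 20))  # 42: locations 41-42 hold 2 of 10
```

Running the same lookup on two datasets at the same percentage gives the pair of locations that the chart compares.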

Such comparisons can be made at many points on the distribution: 10%, 20%, 30%, etc.  The result is a qqplot (quantile-quantile plot), as shown below.  Each distribution is compared to an ideal "uniform" (i.e. random) distribution, represented by the straight line.  Not surprisingly, the data on the right, which were generated randomly, look much closer to random.  The left line is consistently above the straight line, which indicates a systematic difference from random.

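[Figure "Suicidesqq": qqplots of Dataset 9 (left) and randomly generated data (right) against the uniform distribution]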
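The numbers behind such a plot can be sketched in a few lines.  This is an illustration under stated assumptions rather than the code that produced the chart: each observation is taken to be a location number, the uniform reference is taken to run over a hypothetical range `low` to `high`, and quantiles are read off at every 10%.  The function names, the range 41 to 81, and both samples are invented for the example.

```python
import random


def empirical_quantile(ordered, percent):
    """Smallest observed value with at least `percent` percent of the
    data at or below it. `ordered` must already be sorted."""
    n = len(ordered)
    rank = (percent * n + 99) // 100  # integer ceiling of percent * n / 100
    return ordered[max(0, rank - 1)]


def qq_against_uniform(values, low, high, percents=range(10, 100, 10)):
    """Return (uniform quantile, observed quantile) pairs.

    Under a uniform distribution on [low, high], the quantile at
    `percent` is simply low + percent * (high - low) / 100.
    """
    ordered = sorted(values)
    return [
        (low + pct * (high - low) / 100, empirical_quantile(ordered, pct))
        for pct in percents
    ]


# Invented, deliberately lopsided sample: everything sits at high locations.
skewed = [60] * 10 + [70] * 10 + [80] * 10
skewed_pairs = qq_against_uniform(skewed, low=41, high=81)

# Invented random sample, seeded so that the run is repeatable.
rng = random.Random(0)
random_sample = [rng.randint(41, 80) for _ in range(500)]
random_pairs = qq_against_uniform(random_sample, low=41, high=81)

for (expected, seen), (_, seen_random) in zip(skewed_pairs, random_pairs):
    print(f"uniform {expected:5.1f}   skewed {seen:3d}   random {seen_random:3d}")
```

If the observed quantile goes on the vertical axis, a sample whose quantiles all exceed the uniform ones plots above the straight line, which is the pattern described for the left panel.  The seeded random sample should stay much closer to the uniform column.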

P.S.  I have neglected the tricky issue of how much difference from random is required to pronounce the visual evidence conclusive.  Usually, after inspecting graphs, we have to resort to mathematics by running statistical tests.  But statistical tests, with the omnipresent p-values, often give a false sense of security, particularly where the theory is incomplete, as is the case in tests of randomness.  Running statistical tests without visualizing the data is dangerous.
