## Complex is not random

##### Jan 10, 2007

There is a tendency to mistake complexity for randomness.  Faced with lots of data, especially data squeezed into a small area, one often has trouble seeing patterns and so presumes randomness, even though careful analysis can reveal distinctive patterns.

We encountered this when looking at the "sad tally" of the Golden Gate Bridge suicides (here, here, here, here and here).  Robert Kosara's recent work on scribbling maps of zip codes also highlights the hidden patterns behind seemingly random numbers.

Robert found a related example (via Information Aesthetics, originally here): the artist takes random numbers (lottery numbers), and renders them in a highly irrelevant graphical construct, as if to prove that spider webs can be generated randomly.

According to Infosthetics, each color represents a number between 1 and 49, which means the graph contains 49 colored zigzag lines (not counting gridlines and axes).  Each point on the year axis represents a frequency of occurrence.

Imagine you are tasked with using this chart to ascertain the fairness of the lottery, that is, the randomness of the winning numbers.  The complexity of this spider web makes a tough job impossible!  We must avoid the tendency to jump to the conclusion of randomness based on this non-evidence.

In fact, testing for randomness can be done using any of the methods described in the postings on the "Sad Tally" (links above).  A first step will be to plot the frequency of occurrence data as a simple column chart with 1 to 49 on the horizontal axis.  We'd like to show that the resulting histogram is flat, on average over all years.
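As a rough numerical companion to that column chart, one could compute a chi-square goodness-of-fit statistic against a flat distribution. The sketch below is only an illustration: the function name `chi_square_uniform` is made up for this example, the draws are simulated rather than real lottery results, and a 6-of-49 format is an assumption (the post only says the numbers run from 1 to 49).

```python
import random
from collections import Counter


def chi_square_uniform(draws, n_balls=49):
    """Chi-square statistic for the hypothesis that every ball is equally likely.

    `draws` is an iterable of draws, each an iterable of ball numbers 1..n_balls.
    """
    counts = Counter(ball for draw in draws for ball in draw)
    total = sum(counts.values())
    expected = total / n_balls  # a flat histogram gives every ball this count
    return sum(
        (counts.get(ball, 0) - expected) ** 2 / expected
        for ball in range(1, n_balls + 1)
    )


if __name__ == "__main__":
    # Simulated stand-in data (NOT real results): 2000 draws of 6 balls from 49.
    rng = random.Random(0)
    simulated = [rng.sample(range(1, 50), 6) for _ in range(2000)]
    print(f"chi-square statistic: {chi_square_uniform(simulated):.2f}")
```

With 49 categories the reference distribution has 48 degrees of freedom, so a statistic far above roughly 65 (the approximate 5% critical value) would cast doubt on flatness. Because the balls within one draw are chosen without replacement, the counts are not fully independent, so that threshold should be treated as approximate.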


"the peak on the outer line of 2006, for instance, is number 41 (green color) & has a value of 18%, which is the frequency of appearing as winner in that year."

This is an opportunity for me to complain again about the mania for turning everything into a percentage. Why not simply say that in 52 weeks of the two lotteries, the number 41 appeared on 19 occasions? Presenting the results as integers would also show how heavily quantised the results are, which isn't apparent from the percentages alone (eventually the viewer should notice that the results are 17.31% or 18.27%, never anything in between).
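The quantisation is easy to see with a few lines of arithmetic. This is a minimal sketch that assumes exactly 104 draws in a year (52 weeks of two lotteries, as supposed in this comment); the helper name `attainable_percentages` is invented for the illustration.

```python
def attainable_percentages(n_draws, lowest_count, highest_count):
    """Percentages a number can show if it appeared between the two counts."""
    return [
        round(100 * count / n_draws, 2)
        for count in range(lowest_count, highest_count + 1)
    ]


if __name__ == "__main__":
    # Assumption: 104 draws in the year, so each extra appearance adds 100/104 points.
    for count, pct in zip(range(17, 21), attainable_percentages(104, 17, 20)):
        print(f"{count} appearances -> {pct:.2f}%")
```

Under that assumption, consecutive counts are just under one percentage point apart, which is why no value between 17.31% and 18.27% can occur.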

Recently I saw a graph of percentage Democratic seats in Congress over a century or so, and the data was in percent. Okay, that made the majority line a simple straight horizontal 50%, but it wouldn't have been beyond the wit of the graph maker to have the majority line rise as a step curve over the decades, giving the opportunity to show the actual number of seats, the actual seat majorities, and the changing size of the House, all in one convenient graph.

i just happened upon this post in my rss reader. i had coincidentally been examining this same data when a user posted it on swivel.com.

here's a scatter plot of the lottery numbers in question. this seems to be a simple visual way to show that complex-looking data is random.

Ah, truly a graph that fits this site's motto: "Recycling chartjunk as junk art."

I'm particularly fond of the lines connected between 2006 and 1988, as if there was some sort of great Mandala (lottery wheel?) at work here.

Thanks to huned for the data. I have to take back my complaint about normalising all the years; I had assumed they had two lotteries a week=104 a year, but it seems they had 123-124 a year, with substantially fewer than that in some years. I don't know what pattern would produce that number of draws a year, but if it's variable, then of course it's not wrong to take a percentage.

It's suspicious that the day value of each date from the Swivel data (is there another source?) never exceeds 12. I wonder if m/d/y got crossed with d/m/y somewhere.

Doing a distribution on day of week on each interpretation yields two plausible results. With the dates as provided each day of week is equally likely. Reversing month and day fields shows Sunday (5%) to be much less common than the others which are almost equal but not quite.
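For anyone wanting to repeat the comparison, here is a minimal sketch of the idea. It assumes each record has already been parsed into three integers in the order they appear in the file, with both of the first two fields at most 12 (as observed above), so either reading is a valid date. The function name `weekday_counts` and the sample row are hypothetical, not taken from the Swivel data.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_counts(rows, swap=False):
    """Tally days of the week for (first, second, year) rows.

    swap=False reads the rows as month/day/year;
    swap=True reads the same rows as day/month/year.
    """
    counts = Counter()
    for first, second, year in rows:
        month, day = (second, first) if swap else (first, second)
        # date.weekday() returns 0 for Monday through 6 for Sunday
        counts[WEEKDAYS[date(year, month, day).weekday()]] += 1
    return counts


if __name__ == "__main__":
    sample = [(1, 10, 2007)]  # hypothetical row: Jan 10 or Oct 1, 2007
    print("as m/d/y:", dict(weekday_counts(sample)))
    print("as d/m/y:", dict(weekday_counts(sample, swap=True)))
```

Running both tallies over the full data set and comparing the two distributions is the check described above.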

Looking at the day of week *by year* with the reversed month/day interpretation (http://www.forthgo.com/blog/wp-content/uploads/2007/01/lottobyday.PNG) shows a believable pattern: Sunday lottos stopped in 1993 while Friday lottos started and Saturday lottos picked up. Distributions before and after those transitions are steady.

Either way, it seems there is quite a big chunk of data missing, which is not readily noticed because of the large volume of data present.


Thanks for sharing this data. I’ve been searching for it lately for my review.
