## The sad tally 4: the boxplot

##### Nov 15, 2005

Given several datasets (scatter plots), how does one tell random from non-random?  We plot features of the data structure and hope to see abnormal behavior, which we take to indicate the presence of non-randomness.

In the last post, we noticed that the key distinguishing feature is the degree of variability in the data: for purely random data, we expect a certain degree of variability; non-random data betray their character by showing more or less variability at different points in the distribution (see right).  So each scatter plot reveals variability around the mean (0), but the variability apparently has a different shape in Dataset 9 (the real data) than in the other datasets.  We saw this in a "cumulative distribution" plot and, less strongly, in a plot of the frequency of extreme values.

Given what we know now, a much simpler plot will suffice.  The boxplot, an invention of John Tukey, a giant among statisticians, succinctly summarizes the key features of a distribution, such as the median (bold line), the 25th and 75th percentiles (edges of the box), the spread (height of the box), and outliers (individual dots above or below the box).

The evidence of non-randomness in Dataset 9 now starkly confronts us.  Its box is much taller (a wider spread); its median line sits noticeably lower than those of the other 8 datasets; and the extreme value of 61 is clearly out of whack.
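The numbers behind such a boxplot can be sketched in a few lines.  This is only an illustration under stated assumptions: it uses the raw counts tabulated in "The sad tally 2: the data" (which lists 30 locations, not all 35), the "inclusive" quartile method, and the common 1.5 x IQR rule for flagging outliers; the plot described above may have been drawn on a standardized scale and with different conventions.

```python
import statistics

# Counts per location as tabulated in "The sad tally 2: the data".
# Only 30 locations are listed there, so this is not the full set of 35.
counts = [31, 29, 35, 35, 25, 24, 12, 25, 32, 12, 15, 12, 22, 17, 17,
          40, 61, 28, 13, 12, 19, 11, 11, 12, 13, 11, 4, 16, 17, 18]

# Quartiles: edges of the box (q1, q3) and the bold median line.
q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Assumed outlier rule: points beyond 1.5 * IQR from the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in counts if c < lower_fence or c > upper_fence]

print(f"median={median}, box=({q1}, {q3}), IQR={iqr}, outliers={outliers}")
```

Under these assumptions, the only count flagged as an outlier is the 61 at location 69/70.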

Now, let's take a step back and look at what this non-randomness means.  We are concluding that the choice of suicide location is not random.  Location 69/70 (the outlier, with 61 deaths) is definitely the most popular.  Partly because of this, many of the other locations have had fewer than 20 deaths, which is fewer than expected if people had randomly selected locations.

Next time, I will describe another way to compare distributions; this method is more advanced but also more direct.

## The sad tally 3: first analysis

##### Nov 13, 2005

Last week, I posed the question of how one can ascertain if a data set is "random" using Golden Gate Bridge suicide data compiled by the SF Chronicle.  For review, see here (and less importantly, here).  I plan to document my own exploration slowly over a few posts as the material may be a bit heavy for those not trained in statistics.  Don't run away yet as I'll try to explain things in simple terms as best I can.

In the original plots, I compared eight randomly generated datasets with the real suicide data (bottom right graph here).  The random data simulate the scenario in which death-seekers picked at random the locations from which to jump off the bridge.  While in theory there should be about 21.6 suicides at each of the 35 locations, the random data exhibited a wide spread, with the number of deaths at some locations more than 2 to 3 standard deviations away from the average (norm).  Standard deviation is a way to measure the spread of data around the average value.
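For a rough sense of scale, here is a back-of-the-envelope sketch.  It assumes each of the 755 people independently picks one of the 35 locations with equal probability, so the count at any single location is binomial; the standardized plots may well have used each dataset's own observed standard deviation instead, which this sketch does not reproduce.

```python
import math

n_people = 755        # documented suicides, from the post
n_locations = 35      # paired light-pole locations, from the post
p = 1 / n_locations   # assumption: every location is equally likely

mean = n_people * p                      # expected count per location
sd = math.sqrt(n_people * p * (1 - p))   # binomial standard deviation

print(f"mean = {mean:.1f}, sd = {sd:.2f}")
for k in (1, 2, 3):
    print(f"within {k} sd: {mean - k * sd:.1f} to {mean + k * sd:.1f}")
```

Under this assumption the standard deviation is roughly 4.6 deaths, so a location 3 standard deviations above the mean would still have only about 35 deaths.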

Staring at a graph like the one shown above, how does one decide whether the pattern is random or not?  I will, for now, stick to visualization methods.

(1) Frequency of extreme values
Intuitively, if the choice were random, the number of suicides should be close to the mean at most locations, with few extreme values.  If the number of locations having extreme values in the real data is significantly higher (or lower) than the same count in the random data, one might reasonably conclude that the real data is not random.
In my chart, there are 9 lines, each a dataset.  I highlighted two unusual lines in red.  We have evidence that the real data (Dataset 9) is not random since its line deviated from most other lines.  This evidence is inconclusive because Dataset 4 also strayed even though it was randomly generated.

This illustrates why the question of randomness is tricky: we rely on strange (rare) behavior to detect non-randomness but strange things can happen randomly too.  Therefore, it is best to collect many pieces of evidence.
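The "frequency of extreme values" idea can be sketched as a small counting function.  The helper name `count_extremes`, the use of each dataset's own mean and population standard deviation, and the toy counts are all assumptions made for illustration; the chart itself may have been built differently.

```python
import statistics

def count_extremes(counts, k):
    """Number of locations more than k standard deviations from the mean."""
    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)  # assumption: population SD of this dataset
    return sum(abs(c - mean) > k * sd for c in counts)

# Made-up counts (mean 18, SD 16): four ordinary locations and one spike.
toy = [10, 10, 10, 10, 50]
for k in (0.25, 1.5, 2.5):
    print(k, count_extremes(toy, k))
```

Plotting this count against `k` for each dataset gives one line per dataset, and a dataset whose line strays from the pack is a candidate for non-randomness.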

(2) Cumulative distribution
Like Robert, I plotted the cumulative distribution.  This graph shows the % of locations with no more than 10, 20, 30, etc. suicides.
Again, each "staircase" line represents one of the 9 datasets.  They all converge at the point (20, 0.6): that is, in each dataset, about 21 locations (60% x 35 locations = 21) had 20 suicides or fewer.

The red line clearly stood out in this chart.  It is Dataset 9.  So we have stronger evidence still that it is different from random data.

Observe that for random data (the black lines), a steep rise occurred between 14 and 35 suicides, indicating that if locations were selected randomly, most of them would see approximately 14-35 suicides.  By contrast, the red line rose fast at about 10 suicides or so, then saw less action in the 25+ range, and had an extreme outlier at 60.  In other words, the variance (spread) in the real data was higher than can be expected in random data.
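Each "staircase" is an empirical cumulative distribution, which takes only a line or two to compute.  The function name `ecdf` and the toy counts are hypothetical; this is a sketch of the general technique, not the code behind the chart.

```python
def ecdf(counts, x):
    """Fraction of locations with at most x suicides."""
    return sum(c <= x for c in counts) / len(counts)

# Made-up counts for four locations, purely to show the staircase.
toy = [5, 15, 25, 35]
for x in (10, 20, 30, 40):
    print(x, ecdf(toy, x))
```

A steep stretch of the staircase means many locations have counts in that range; a long flat stretch followed by a late step is the signature of an outlier.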

These visualizations, of extreme values and cumulative distribution, facilitated our objective of comparing the real and the random data.  The difference showed up more potently here than in the scatter plots.

In a future post, I will look at other ways, some simpler and some more complex, to bring out variance as the important differentiator.

## On the popularity of bar charts

##### Nov 12, 2005

I and others have commented, to no apparent avail, on the inadequacy of the bar chart and its variants, such as the paired/grouped bar chart.

These two examples appeared on the same day in the Wall Street Journal.  The junkchart versions, using line charts, are clearly superior in drawing attention to the key messages.

In the first example, the improved chart facilitates comparison on either the time axis or the type of media.

In the second example (below), all of the key messages came out more potently, including the reversal of growth directions, the cross-over circa 2001-2, the dip in early-stage investments in 1999, and the leveling off of early-stage investments in recent years.

One other trend remains buried in both versions: the total proportion of VC funds invested in seed, early and late stage companies increased from about 55% to 70% of the total in these 10 years.  One wonders what other investment type suffered during this period...

Reference: "TV On-Demand May Make Ads More Targeted" and "Venture-Capitalists Think Large", Wall Street Journal, Nov 9 2005.

## Dizzying dots

##### Nov 09, 2005

One of Tufte's many contributions is the concept of "data-ink ratio": how much of the ink used to print a chart is used to show the data as opposed to, say, decoration?

This example, showing the quintile ranking of utility funds, has a very low data-ink ratio.  The dots serve only one purpose, to make the reader dizzy.  The data stands out once the dots are banished, as shown below.

Reference: "Lipper Leaders", Wall Street Journal (free this week), Nov 7 2005.

## Uh um

##### Nov 08, 2005

An analysis of "uh" "um" linguistic data is cited at Prof. Gelman's blog.  I highly recommend it.  My comments are at his site (tried to do a Trackback but didn't succeed).

## Polling and the obvious

##### Nov 07, 2005

I noticed a creative but flawed attempt to improve upon the stacked bar chart.  In the usual stacked bar presentation, it is easy to compare the leftmost and rightmost categories, but not the middle categories.   For example (below left), try comparing the percentage who thought women's legal rights were the same against the percentage who thought family rights were the same.

Above right, one designer's solution is to disaggregate the bars (blue, gray, red), turning them into columns of squares.  Disaggregating the bars is a good idea but the use of squares is unfortunate, especially when the relative percentage is made proportional to the edge length, not the area of the square.  Observe that one can fit four "50%" squares into the "100%" square.
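The arithmetic behind this complaint is easy to check.  In the sketch below, `edge_for_share` is a hypothetical helper that returns the edge length a square would need for its area, rather than its edge, to be proportional to the percentage.

```python
import math

def edge_for_share(share, full_edge=1.0):
    """Edge length that makes a square's AREA proportional to `share`."""
    return full_edge * math.sqrt(share)

for share in (1.0, 0.5, 0.25):
    naive_area = share ** 2            # area when the EDGE is set to `share`
    fair_edge = edge_for_share(share)  # edge when the AREA is set to `share`
    print(f"share={share:.2f}  naive area={naive_area:.4f}  fair edge={fair_edge:.4f}")
```

With edge-proportional squares, the "50%" square covers only a quarter of the "100%" square, which is why four of them fit inside it; an area-proportional "50%" square would need an edge about 71% as long as the full one.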

I'd welcome any ideas that would improve upon the stacked bar/column chart.  How to make the middle categories easier to compare?

The poll itself raises more questions than it answers:

• Biased sample: Asking immigrants to compare conditions between the U.S. and other countries is like asking someone who just paid \$2 million for a one-bedroom apartment in Manhattan whether there is a housing bubble.  This group made a conscious decision to come to the U.S., possibly to escape what they consider unsatisfactory circumstances in their places of birth.  It makes me wonder why we need this poll since the answers are rather obvious.
• Heterogeneity: Immigrants come in all stripes and their answers to these questions will most likely be affected by where they came from, what socio-economic status they occupied in their home countries, what level of education they have, where they live in the U.S., their family income and so on.  The aggregate numbers do not mean much when the underlying population is so diverse.
• It would be instructive to compare these results with polls where they ask foreigners to rate their perception of America.

Two results were omitted from the graph for unknown reasons: 34% thought the U.S. was better on "safety from crime" and 28% thought the U.S. was better on "moral values of society".

Reference: "Migrant Worry", New York Times Magazine, Nov 6, 2005.

## The sad tally 2: the data

##### Nov 05, 2005

The last post contained a little riddle: which of the 9 graphs (if any) is different from the other 8?  I will disclose the answer below, so to avoid the spoiler, read the previous post first.

Here is the data gleaned from the graphic in the SF Chronicle (any error is purely mine):

Location,Frequency
101,31
99,29
97,35
95,35
93,25
91,24
89,12
87,25
85,32
83,12
81,15
79,12
77,22
75,17
73,17
71,40
69,61
67,28
65,13
63,12
61,19
59,11
57,11
55,12
53,13
51,11
49,4
47,16
45,17
43,18

There are two ways to solve the riddle.

First, one can think of it as a pattern matching problem: which of the 9 graphs contains a pattern that matches that in the map?  This really isn't the point I was trying to make but I realize now that the question could have been interpreted this way.  In this line of reasoning, one needs to identify the features that distinguish the pattern in the map.  The most standout feature, for me, is the spike at location 69/70.  Only the last two graphs contain spikes near this location and more careful inspection will reveal the bottom right chart to have the real data.

Alternatively, one can ignore the context (of the sad tally) and treat this as a problem of comparing probability distributions.  This was my original intent.  Is there an "odd man out" among the 9 distributions?

We now know that the bottom right chart contains the real data and the other 8 charts plot random data.  If the real data is the "odd man out", which features of the distribution allow us to differentiate it from the other graphs?  I'll discuss my findings on some features in the next post.

## The sad tally

##### Nov 04, 2005

John Shonder, a reader, alerted me to the following unusual chart which identifies the precise locations where people jumped from the Golden Gate Bridge:

He asks: is the choice of location "random"?

This is a very rich question and different statisticians will take different approaches.   In this post, I take a purely visual, non-rigorous look at the question; and if I have time (and if other readers haven't commented already), I may discuss more rigorous methods in the future.

First, I restrict my attention to light poles 43 through 112, i.e. the bridge segment that lies above the water.  Also, I only consider the north-south locations: in other words, 43 and 44 are counted as one, so are 111 and 112.  Otherwise, the distribution is clearly biased (towards the water and the east side).

When we say "random", we usually mean there is equal chance that someone will jump from location 43/44 or from location 111/112 or any location in between.  There are 35 locations and 755 documented suicides, averaging to 21.6 suicides per location.  But 21.6 is only an average, and an average need not be observed anywhere; even assuming that the choice of location is random, we would not find exactly 21-22 suicides at each location.  (Similarly, even if there is a 50/50 chance of getting a head when we flip a coin, in any given run of 100 flips, it is very unlikely that we will see exactly 50 heads.)
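The coin-flip aside can be checked directly.  This is a small worked calculation, assuming a fair coin and independent flips: the number of ways to get exactly 50 heads, divided by the number of equally likely sequences of 100 flips.

```python
import math

# Exactly 50 heads in 100 fair flips: C(100, 50) favourable sequences
# out of 2**100 equally likely ones.
p_exactly_50 = math.comb(100, 50) / 2 ** 100

print(f"P(exactly 50 heads in 100 flips) = {p_exactly_50:.4f}")
```

The probability comes out at about 8%, so even the single most likely outcome is missed more than nine times out of ten.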

So, at some locations we will see more than 21.6 deaths; at others, fewer.  The question becomes whether the fluctuations are large enough to refute the notion that the choice of location is random.

In the following set of graphs, I ran some simulations.  Eight of the nine graphs represent scenarios under which I sent 755 people to the bridge and randomly assigned each of them one of the 35 locations to jump from (okay, this is a thought experiment only; please don't do this at home).  The x-axis represents locations; the y-axis represents the number of suicides at that location -- but on a standardized scale.

The standardized scale allows us to compare across graphs.  The zero line represents the mean number of suicides per location.  The number of suicides at most locations is within one standard deviation of this mean (i.e. between -1 and 1 on the y-axis).  In some extreme cases, the number of suicides is more than 3 standard deviations larger than the mean (i.e. greater than 3).
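One such random scenario can be sketched as follows.  The function names, the seed, and the choice to standardize with each dataset's own mean and population standard deviation are assumptions made for illustration; the post does not say exactly how the simulations or the standardized scale were computed.

```python
import random
import statistics

def simulate_counts(n_people=755, n_locations=35, seed=None):
    """Assign each person a uniformly random location; return counts per location."""
    rng = random.Random(seed)
    counts = [0] * n_locations
    for _ in range(n_people):
        counts[rng.randrange(n_locations)] += 1
    return counts

def standardize(counts):
    """Express each count as standard deviations above or below the mean."""
    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)  # assumption: the dataset's own population SD
    return [(c - mean) / sd for c in counts]

counts = simulate_counts(seed=1)  # one random "dataset"
z = standardize(counts)           # values to plot on the y-axis
print(f"largest z = {max(z):.2f}, smallest z = {min(z):.2f}")
```

Running it with eight different seeds would give eight random panels to set beside the real data.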

Back to randomness: well, one of the 9 graphs is the real data from the map above.  If you can guess which of the 9 is real, then the real data is probably not random.  If you can't, then the real data may be random!

I will publish the answer tomorrow.  In the meantime, feel free to take a guess and/or comment on what other approach you'd take.  One take-away from this exercise is that it's very hard to tell non-random from random unless it is very obvious.

References: San Francisco Chronicle, Boing Boing

## Wrong variable and omitted record

##### Nov 01, 2005

The rise of robots elicited an uninspired, robotic graphical response from the Economist, reprinted by Mahalanobis.

A first fix, shown on the left below, puts the two data series in a scatter plot.  One would be mistaken to read a linear relationship between 2004 installations and 2004 stock into this plot: such a comparison is meaningless because countries differ significantly in the number of robots deployed (Japan has over 300,000 while many other countries have fewer than 1,000).

A second fix substitutes the 2004 growth rate for the absolute number of installations.  It is now clear that the growth rate is not much associated with the size of the installed base, contradicting the perceived linear relationship from before.  (Note that the x-axis is plotted on a log scale.)  The European countries, most of which have grown their stock of robots at a higher rate than Japan, are shown in red.

In order to highlight the Europe/Japan comparison, one can plot the European average, rather than individual countries. The message is less murky because the graph is less busy. The following set illustrates this.  What really stands out from these graphs is China (& Taiwan), not Europe.  Incidentally, China was omitted from the Economist chart, which is a rather mischievous deletion -- but an understandable one, since China's data point is hidden when the original data series of installations versus stock is plotted (green text on the left chart).

Reference: Economist; United Nations Economic Commission for Europe

I'm writing this from a different computer while I'm travelling and I'm having trouble with the tools at my disposal.  Apologies for some glitches in the charts.