Jan 10, 2007

Complex is not random

There is a tendency to mistake complexity for randomness.  Faced with lots of data, especially when squeezed into a small area, one often has trouble seeing patterns, leading to a presumption of randomness -- when upon careful analysis, distinctive patterns can be recognized.

We encountered this when looking at the "sad tally" of the Golden Gate Bridge suicides (here, here, here, here and here).  Robert Kosara's recent work on scribbling maps of zip codes also highlights the hidden patterns behind seemingly random numbers.

Estrellaloto

Robert found a related example (via Information Aesthetics, originally here): the artist takes random numbers (lottery numbers) and renders them in a highly irrelevant graphical construct, as if to prove that spider webs can be generated randomly.

According to Infosthetics, each color represents a number between 1 and 49, which means the graph contains 49 colored zigzag lines (not counting gridlines and axes).  Each point on the year axis represents a frequency of occurrence.

Imagine being tasked with using this chart to ascertain the fairness of the lottery, that is, the randomness of the winning numbers.  The complexity of this spider web makes a tough job impossible!  We must avoid the tendency to jump to the conclusion of randomness based on this non-evidence.

In fact, testing for randomness can be done using any of the methods described in the postings on the "Sad Tally" (links above).  A first step will be to plot the frequency of occurrence data as a simple column chart with 1 to 49 on the horizontal axis.  We'd like to show that the resulting histogram is flat, on average over all years.
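As a rough sketch of that first step, the snippet below tallies how often each number from 1 to 49 appears, prints a crude text column chart, and computes a Pearson chi-square statistic against a flat histogram.  The chi-square test is a stand-in of my choosing, not necessarily one of the methods from the "Sad Tally" posts, and the draws are simulated (hypothetical data, six numbers per draw), not real lottery results.

```python
import random
from collections import Counter


def frequency_table(draws, lo=1, hi=49):
    """Count how often each number lo..hi appears across all draws."""
    counts = Counter(n for draw in draws for n in draw)
    return [counts.get(n, 0) for n in range(lo, hi + 1)]


def chi_square_uniform(freqs):
    """Pearson chi-square statistic against a flat (uniform) histogram."""
    expected = sum(freqs) / len(freqs)
    return sum((f - expected) ** 2 / expected for f in freqs)


# ASSUMPTION: hypothetical data, 1,000 simulated draws of 6 distinct numbers.
rng = random.Random(2007)
draws = [rng.sample(range(1, 50), 6) for _ in range(1000)]

freqs = frequency_table(draws)
stat = chi_square_uniform(freqs)

# Crude column chart turned on its side: one row per number, one '#' per 5 hits.
for number, f in enumerate(freqs, start=1):
    print(f"{number:2d} {'#' * (f // 5)}")

print(f"chi-square = {stat:.1f} with {len(freqs) - 1} degrees of freedom")
```

A statistic far above what a chi-square distribution with 48 degrees of freedom would produce hints at an uneven histogram.  Treat the comparison as a rough yardstick only: the numbers within a single draw are distinct, so the counts are not fully independent.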

Dec 01, 2006

Smoking-Screening

Smokeathome2

Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.

This graphic, unfortunately chosen, contains many distractions from the main message, including:

  • the liberal sprinkling of colors
  • the inclusion of data for 1, 2, 3, 4, 5, 6 days, almost all of which were effectively zero
  • the redundant vertical scale, as all the data already appeared on the chart itself
  • the comparison of smokers to "total sample" (rather than non-smokers)
     

The last point merits special attention.  The total sample contains households with smokers as well as households without smokers.  Any statistic from the total sample is a weighted average of these two types of households.  It is better to directly compare the two household types than to indirectly compare one type to the overall.

Further, households without smokers should be extremely likely to have no smoking in residence all week.  And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households.  That is to say, the total sample statistics do not add much to the analysis.  Our junkart version below corrects for this as well as other things.
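The weighted-average argument can be checked with a few lines of arithmetic.  In this sketch the 76% share and the "about half" figure come from the discussion above; the rate for no-smoker households is an assumption picked purely for illustration, since the post does not report it.

```python
def total_sample_rate(share_no_smoker, rate_no_smoker, rate_smoker):
    """Total-sample rate as a weighted average of the two household types."""
    return share_no_smoker * rate_no_smoker + (1 - share_no_smoker) * rate_smoker


share_no_smoker = 0.76  # share of households without smokers (from the post)
rate_smoker = 0.50      # "about half" of smoker households never smoke at home
rate_no_smoker = 0.98   # ASSUMPTION: hypothetical value, not from the survey

total = total_sample_rate(share_no_smoker, rate_no_smoker, rate_smoker)
print(f"never smoke in residence, total sample: {total:.1%}")
print(f"gap to no-smoker households: {rate_no_smoker - total:.1%}")
print(f"gap to smoker households:    {total - rate_smoker:.1%}")
```

Under these assumed inputs the total-sample figure sits much closer to the no-smoker rate than to the smoker rate, which is why comparing smokers to the total sample tells us little more than comparing them to non-smokers would.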

Redo_smokeathome

One of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within.  Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data.  In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.
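The aggregation step might look like the sketch below.  The percentages are hypothetical: they are chosen only to agree with the rough figures quoted above (about half never, about 40% all the time), and the split across 1 to 6 days is made up.

```python
def collapse_days(pct_by_days):
    """Collapse a 0..7 days-per-week distribution into three categories."""
    return {
        "never": pct_by_days[0],
        "part of the week": sum(pct_by_days[d] for d in range(1, 7)),
        "all the time": pct_by_days[7],
    }


# ASSUMPTION: hypothetical percentages of smoker households by number of
# days per week with smoking in residence; not the survey's actual data.
smoker_households = {0: 50, 1: 2, 2: 2, 3: 2, 4: 1, 5: 2, 6: 1, 7: 40}

reduced = collapse_days(smoker_households)
for category, pct in reduced.items():
    print(f"{category:>16}: {pct}%")
```

Eight bars, six of them near zero, become three, and the two numbers that carry the message are no longer crowded out.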

I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics.  Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.


Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.

Nov 26, 2006

Wading in waste

Sciam_bacteria

A poor graphic leaves readers wading in waste, in this case, the waste of time.  (Thanks to a tip from Dr. Bruce W.)

This very busy chart conveys a simple research finding: the density of bacteria increases with the prevalence of impervious surfaces.  As Bruce pointed out, underlying this chart are but six observations taken at selected tidal creeks, each observation being a (paired) measurement of bacteria count and prevalence of impervious surfaces.

A factory's worth of graphical elements was employed, including columns, pies, colors, data labels, legends and so on.  The result is utter confusion.  How is it that the tip of each column does not coincide with the center of each pie?  Do equal-sized pies imply equal surface areas?  What is the bacteria count at each location?

Redo_bacteria

A scatter plot brings out the key correlation with minimal fuss.

Reference: "Wading in Waste", Scientific American, June 2006

Oct 02, 2006

Graphical equity 2

Based on my last post, Zuil and Lope engaged in a lively conversation about "flow charts", apparently also called "Sankey charts" in some circles.  Here is an example Zuil found at the EIA site:

Govt_sankey

Zuil commented that

Though often difficult to draw, Sankey diagrams are IMHO unbeatable to represent any type of lossless flow (energy, money, fluids, etc).

I mostly agree: flow charts are great at tracing flows, and it's easy to figure out proportional sources and uses from this example.  Moreover, as Lope suggested, it's fun (to read).

But... the data content of this chart is lower than that of the network graph or the Marimekko.  Imagine removing all the lines (arcs) in the network graph: that is what the flow chart includes.  It achieves more readability by simplification.


Graphical equity 1

I've been slow checking my email lately: several of you have pointed me to interesting charts; I will work through them over the next week or so.  This post is inspired by John S. who forwarded two charts, illustrating where the U.S. gets its energy and how the U.S. uses its energy.

Govt_energy

The first visualization, created by the Energy Information Administration, emphasizes the physical connections between energy sources and energy use sectors.  This construct is known as a "network graph", and is widely used by engineers; the ovals/rectangles are called "nodes", the lines "arcs".  It functions well as a map visualizing physical relationships but it fails as a vessel for data.  Problems are multiple:

  • The web of arcs is messy and gets worse with more nodes
  • Here, each node has either an input or an output but not both, keeping it simple.  If a node is allowed to take both input and output (the so-called transhipment node), then the graph gets messier
  • Arcs converging at a node leave little space for data labels

Optimist123_energy

Next, the Skeptical Optimist blog recast the data onto a construct known as "Marimekko" to management consultants.  Deconstructed, these are column charts in which the width of each column represents the relative size of each energy source.

This one does a fairly effective job showing that most of our transportation needs are met with oil, our electricity needs are met with coal, our energy sources are roughly split between oil, gas and coal, and so on.

One weakness of Marimekko is "inequity": by its origin as a column chart, it elevates one variable over the other.  What's the relative size of energy used by the industrial sector (blue)?  That's not a question easily answered by this chart.  Even when the column segments are adjoining, as in the case of electricity use (yellow), it is very taxing to size up the yellow area relative to the total area.
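To see the inequity in numbers, here is a minimal sketch of the geometry behind a Marimekko.  The table is entirely made up (arbitrary units summing to 100) and is not the EIA data; the function names are likewise hypothetical.

```python
def marimekko(table):
    """Return column widths and segment heights for table[source][sector]."""
    grand_total = sum(sum(row.values()) for row in table.values())
    # Column width: each source's share of all energy.
    widths = {src: sum(row.values()) / grand_total for src, row in table.items()}
    # Segment height: each sector's share within one source's column.
    heights = {
        src: {sec: v / sum(row.values()) for sec, v in row.items()}
        for src, row in table.items()
    }
    return widths, heights


def sector_share(widths, heights, sector):
    """Overall share of a sector: sum of width x height over all columns."""
    return sum(widths[src] * heights[src].get(sector, 0.0) for src in widths)


# ASSUMPTION: hypothetical energy flows, source -> use sector.
table = {
    "oil": {"transportation": 27, "industrial": 10, "residential": 3},
    "gas": {"industrial": 8, "residential": 8, "electricity": 7},
    "coal": {"industrial": 2, "electricity": 21},
    "nuclear": {"electricity": 8},
    "renewable": {"industrial": 2, "electricity": 4},
}

widths, heights = marimekko(table)
print(f"oil column width: {widths['oil']:.0%}")
print(f"industrial share: {sector_share(widths, heights, 'industrial'):.0%}")
```

The share of a source is a single width the reader can see directly, while the share of a sector requires multiplying width by height in every column and adding up the pieces.  That sum is what the eye has to approximate when sizing up the blue or yellow areas.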

So it is that we seek a graph that treats the two variables (source, use sector) equitably.  More later.

Update: Jon posted a response here, and points to a tutorial for creating Marimekko type charts.

Sep 22, 2006

Small and beautiful

Nyt_allegiant

The creator of this map understands that small is beautiful: simple concepts deserve simple charts.

As discussed in the NYT article, Allegiant's business model is small and beautiful -- rather than focusing on popular routes between major cities like most startup airlines, Allegiant serves a web of routes going to just two destinations.

In this map, the two destinations are clearly labeled; all the originating cities are marked, with those serving both Las Vegas and Orlando highlighted.  Extra information is provided through shading of the states served, and through the route lines (roughly indicating distance / time).

This simple chart can be made simpler by removing the route lines.  Not much is lost by removing them.

Reference: "Flying Where Big Airlines Aren't", New York Times, Sep 21 2006.

Aug 28, 2006

The dots don't connect

Nyt_stockowner

The New York Times published a bar chart reminiscent of the one discussed here last week.  They added the 50% line and did not cluster the countries into groups of five.

I like this chart for clarity and simplicity.  (Removing the decimals from the data labels would improve it.)  The U.S. and her special partner stand out as countries with the highest outside ownership of corporate shares.

So far, so good.

Until I scanned the article itself, which startled and started with:

It turns out that most American investors are not xenophobic... Shareholders in the United States have been criticized as harboring "home bias" -- allocating far less to foreign stocks than they would if they did not let familiarity, patriotism and national loyalties stand in the way.

The dots don't connect, notwithstanding the academic references contained.  The chart shows how much of U.S. stocks is owned by outsiders (which includes some foreigners but also many U.S. investors).  What has this to do with how much money U.S. investors spend on foreign stocks?

Even a good chart can't save a poor story.

Reference: "Investors without Borders", New York Times, Aug 27, 2006

Aug 07, 2006

Illusion or junk? 2

Bondchart_1

Previously we saw that the appearance of stacked area charts changes with how variables are ordered.  This is a serious deficiency.

Let's return to the bond market chart.

What is the relative prevalence of each type of debt over time?  In the original chart, this information is buried and can be extracted only painstakingly.

Redo_bonddata

The junkart version brings out this insight without much fuss.  As a percentage of total bond market debt, the share of US Treasuries has dropped by half over the last two decades, much of it happening in the late 1990s.  Meanwhile, the share of mortgage debt more than doubled during the same period, much of it occurring during 1985-1993.  The current distribution is also more balanced than at any time in the last 20 years, as can be seen from the narrow spread.

A few design features are worth noting.  The vertical axis is given on both sides of the chart.  Limited colors are introduced to help readers distinguish the various lines.  Light vertical gridlines are provided to allow analysis during each 5-year period.  Non-essential tick labels are removed from vertical axes.

In fact, this is again a variant of the Bumps chart.  For more, see here, here and here.



Reference: Data from Bondmarkets.com (via Mahalanobis)
