
Tricks of the trade 2

In a previous post, I explained the value of sketching when creating graphs. Today, I will share a few other graphs that plot the same data as we discussed the other day, regarding the proportion of time spent on developing different modules of software.

A stacked column chart, suggested by John J., would look like this:

Compared to the profile chart, this chart has some weaknesses:

  • it's difficult to read off the proportions for middle blocks like Blinksale-Billing;
  • because the middle blocks "float", it is impossible to compare them properly;
  • it requires as many colors as there are variables.

These problems get worse as the data scale: more difficult to read off the data; more colors needed.

The Marimekko, suggested by Bernard L., is the same chart as above except that the widths of the columns are made proportional to the relative number of lines of code.  However, because these four companies do not make up the entire universe, proportional widths make little sense here.

The profile chart can be drawn up in two ways:
These charts typically display the results of cluster analysis, a statistical data mining technique that discovers groups of like objects within a large data set.  Often, the computer will only tell you that these 15 belong to Cluster 1, those 22 form Cluster 2, etc.

To figure out why the 15 belong together, the analyst needs to plot the explanatory variables against cluster index.  Now, think of WuFoo, FeedBurner, etc. as clusters, and the proportion of code given to Application, etc. as variables.

While the line segments don't signify anything real, they trace out  the precise paths our eyes would take when reading the stacked column chart above!  Remember we wanted to compare the number of lines given to each function across companies.  If shown the column chart, my eyes would flip across the top of the  Application (blue) blocks from WuFoo to regonline.  This path is exactly the brown line on our first profile chart.

The numbers for Marketing, Support and Billing are much easier to read too as they all start from zero for each company.
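
As a rough sketch of the data preparation behind a profile chart: the snippet below turns lines-of-code counts into per-company proportions, then regroups them into one series per function, so that every series is measured from zero.  The counts and company labels are made-up placeholders (not the figures from the original article), and the function names are my own.

```python
# Hypothetical lines-of-code counts; NOT the figures from the original article.
loc = {
    "CompanyA": {"Application": 8000, "Marketing": 1000, "Support": 500, "Billing": 500},
    "CompanyB": {"Application": 6000, "Marketing": 2000, "Support": 1000, "Billing": 1000},
}

def proportions(counts):
    """Convert raw counts for one company into shares that sum to 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}

def profile_series(data):
    """Return one series per function: its share at each company, in company order."""
    companies = list(data)
    shares = {c: proportions(data[c]) for c in companies}
    functions = list(next(iter(data.values())))
    return companies, {f: [shares[c][f] for c in companies] for f in functions}

companies, series = profile_series(loc)
for function, values in series.items():
    # Each printed list is one line on the profile chart.
    print(function, [round(v, 2) for v in values])
```

Each list in `series` would become one line on the chart, with the companies along the horizontal axis.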

The right chart is another possibility but for this particular situation, I prefer the left one.

Finally, I am less familiar with the "parallel coordinates plot" that Derek talked about.  I believe it is a variant of the profile chart.

Dizzy display

Xan G. tells us that these "inconsistent pie charts ... make [his] head hurt".  The dizzy array of colors is unfortunate, especially when "Application" gets a medium blue in three of four pies but an orange-red in one of them.  Just like the baby names charts, it's important to keep the background constant when constructing small multiples.

We cite from the horse's mouth:

The goal of this section was to uncover any [software development] task that might be overlooked [by these startup companies]. When writing a software product, the tendency is to focus 100% on the application. Items like support, marketing, and especially billing never cross your mind.

The junkart version below is designed to bring out this one message: that Blinksale has distinguished itself from the rest by having spent more time developing code for purposes other than the application itself.

I removed the raw counts of lines of code and focused only on the relative proportions.  The former does nothing to argue the author's case.

The pie charts fail our self-sufficiency test.  The reader must rely on the data table and data labels to understand the chart.  If removed, the key message is obscured.

Source: "Web App Autopsy", ParticleTree, June 2007.

Baby names and success

While we speak of baby names, David F. nominates this set of 6 charts from WSJ.  Compare this with Wattenberg's names voyager, and the benefit of interactive graphics is immediately evident.

In David's words:

They show graphs of six different names, but the two on the bottom use a dramatically different scale (from 1st to ~20th, instead of from 1st to 1000th). The introductory text notes the difference, but it is still a shock.

We like the use of "small multiples" but their impact is compromised if we don't keep the background material constant so that readers can compare between charts.  By having  different scales, the message was distorted: Mary has had a much larger drop than David, and it's easily missed in these charts.
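
To make the point about a constant background concrete, here is a minimal sketch of how one might compute a single axis range shared by every panel of a small-multiples display.  The names and ranks are invented for illustration (they are not the WSJ data), and `shared_limits` is a hypothetical helper.

```python
# Hypothetical popularity ranks over time (1 = most popular); not the WSJ data.
panels = {
    "NameA": [1, 3, 20, 60],
    "NameB": [2, 4, 9, 15],
}

def shared_limits(panels):
    """Return one (low, high) range that covers the values in every panel."""
    values = [v for series in panels.values() for v in series]
    return min(values), max(values)

low, high = shared_limits(panels)

# Every panel is drawn against the same limits, so a fall from rank 1 to 60
# visibly dwarfs a fall from rank 2 to 15.
for name, series in panels.items():
    print(name, "axis:", (low, high), "data range:", (min(series), max(series)))
```

Letting each panel pick its own limits would stretch both series to fill their frames and hide the difference in magnitude.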

Lines should take the place of areas which carry scant meaning in this context.

The use of blue and red is a nice touch but dovetailing the male and female charts strikes us as excessive fun.  It would have been clearer to give the sons and the daughters their own columns.

The article itself relates the anguish of modern parents in naming their babies.  Much of this angst can be traced to serious econometric studies that claim to have found cause-and-effect relationships between someone's name and their eventual success in life.  Some of this research was highlighted in Freakonomics, for example.  My stance is that all such studies are dubious, there being innumerable confounding factors (socio-economic, genetic, cultural, luck, etc.).  In addition, the measured response can range from "happiness" to income to many other metrics.  The danger of finding something because one looks hard enough is very real.  We don't currently have tools powerful enough to substantiate this sort of study.

Source: "The Baby-Name Business", Wall Street Journal, June 22, 2007

Losing the tune

Duncan C. nominates this Wall Street Journal chart.  A sure sign of trouble is when the accompanying article waxes about a new on-line music service, another one that practises "loss-based" pricing, i.e. priced to ensure a loss; the article does not mention anything about this chart.  It just stood on the side, like wallpaper.

Its key message does not seem to connect with the data.  "Growth in digital music downloads has not been rapid enough to offset declining CD sales": but in terms of total units, the chart shows a small dropoff in CD sales coupled with an explosion in digital songs sold.  Besides, the units of discs, songs and iPods are not directly comparable.

Source: "Listen to Music Free But Pay to Carry", Wall Street Journal, June 5, 2007.


An anonymous reader dropped a comment pointing us to Martin Wattenberg's gallery at Business Week.  Martin's work falls into the category of information visualization, which typically concerns cramming as much high-dimensional data as possible onto 2D or 3D displays, augmented heavily by colors, shapes, interactivity, superpositioning and other tricks.  Often pleasing to the eye, these graphics usually take time to warm up to.  Sites like Infosthetics and Visual Complexity cover them well.

Martin is responsible for the baby names visualization, which tracks the popularity of names over the years.

Martin also created treemaps like this one.  Does this show relative stock performance better than other designs?

Foreground, background

Derek C. points us to this effort by a science journalist to use graphs to help "clarify the concept of climate change".  The graph on the left shows that actual greenhouse gas emissions have exceeded the level predicted by the most pessimistic climate models.  The 3D bar chart on the right examines which countries had most increased emissions since 1990.

While the bar chart contains many of Tufte's "ducks" (not sorted by percent change, 3D, color, gridlines, sufficiency, etc.), it's the left chart that can be made more powerful.

The casual observer does not need to know which model led to which trajectory of predictions; the graph is vastly simplified, and the message much clearer in the junkart version.  (I only included the CDIAC data because I didn't locate the EIA numbers.)

The general point here is recognizing what is foreground, and what is background.  Aside from gridlines, data labels, axis labels and so on, some of the data usually constitute background material, often as in this case being used to establish comparability.

One message I got out of this chart is that these climate models have done a good job!  (Now, I have no idea if part of the curve included the training period.  It is curious that the predictions were very narrowly contained in the early 1990s.)

Source: The Island of Doubt Blog, June 6, 2007.

The Immigrants' Path

A recent Wall Street Journal editorial used this chart (via the National Foundation for American Policy) to claim success for the "Bracero" guest worker program, initiated in 1942.  Their analysis:

... illegal border crossings subsequently plummeted.  Between 1953 and 1959, they fell by some 95%.  In 1960, mainly in response to complaints from labor unions, the program was scaled back and eventually phased out.




Long-time readers may recall Friedman's Crossover Law of Petropolitics, where the opportune criss-crossing of lines plotted along double axes was taken as proof of causality.  Friedman's Law lurked here, right in the 1953-1959 range.


The NFAP went one better: in their original version, they blew up the 1953-1959 period to show us the criss-crossing lines!

We see trouble right from the start.  The "subsequent" effect that proved the case occurred in 1953, over 10 years after the program started. During that first decade, the number of apprehensions rose 4388%, in spite of the guest worker program.

A scatter plot (below left) now shows the lack of any meaningful relationship between these two variables.  While high admissions appeared together with low apprehensions, any level of admissions had historically been paired with low apprehensions.


On the right, I connected the dots in chronological order.  Any claim of a negative relationship between admissions and apprehensions has been debunked.  From 1942 on (as we trace the line clockwise from lower left), first the nation experienced stepwise increasing admissions coupled with stepwise increasing apprehensions; then it witnessed sharply dropping apprehensions with relatively stable admissions; and finally it saw plummeting admissions while apprehensions remained low.  Three separate episodes, three distinct patterns.  There was no association, let alone causation.
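
For readers who want to trace such a path themselves, the sketch below orders the points by year and labels how each variable moved along every segment of the connected scatter plot.  The records are invented placeholders, not the NFAP figures, and both helper functions are hypothetical.

```python
# Hypothetical (year, admissions, apprehensions) records; NOT the NFAP data.
records = [
    (1944, 60, 30),
    (1942, 5, 10),
    (1943, 50, 12),
    (1945, 50, 100),
]

def chronological_path(records):
    """Sort by year and pair up consecutive points of the connected scatter plot."""
    ordered = sorted(records)  # tuples compare on their first element, the year
    return list(zip(ordered, ordered[1:]))

def describe(segment):
    """Say whether each variable went up, down, or stayed flat along one segment."""
    (y0, adm0, app0), (y1, adm1, app1) = segment

    def direction(before, after):
        return "up" if after > before else "down" if after < before else "flat"

    return (f"{y0}-{y1}: admissions {direction(adm0, adm1)}, "
            f"apprehensions {direction(app0, app1)}")

for segment in chronological_path(records):
    print(describe(segment))
```

Runs of segments with the same pair of labels correspond to the separate episodes described above; a single consistent relationship would show one pattern throughout.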

Source: "Immigration Plan B", Wall Street Journal, June 13 2007.

A disconnect

The Times ran a slate of graphics "analyzing" seven nights of concerts by a blogger.  On the left is one of these charts.

I am not sure what to make of it.  All I can say is the chart designer had fun.  More on his blog.

Source: "7 Nights of Bright Eyes (in as Many Colors)", New York Times, June 10, 2007

Airline bumps and bump charts

The Harvard Social Science Statistics blog pointed to an NYT article about revenue optimization in the airline industry.  Huge props to the Times for explaining the science (and art and politics) of one of the most successful applications of operations research.

In short, valuable business travellers want refundable tickets.  Because of this and other reasons, about 10% of booked tickets become no shows.  Airlines recoup the loss by over-booking.  Implicitly, they trade off the potential for dissatisfying a few unlucky passengers (who would be bumped from their flights) and the potential for flying with 10% empty seats (in addition to unsold seats).  Optimization algorithms (constantly tuned by entry-level staff) try to strike a balance.
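
The trade-off can be illustrated with a toy model.  The sketch below assumes each booked passenger shows up independently with probability 0.9 (matching the roughly 10% no-show rate above) and computes the expected number of bumped passengers and empty seats for a given booking level.  The independence assumption, the 100-seat flight and the function names are mine; real revenue-management systems are far more elaborate.

```python
from math import comb

def show_distribution(booked, show_prob):
    """Binomial probabilities that exactly k of the booked passengers show up."""
    return [comb(booked, k) * show_prob**k * (1 - show_prob)**(booked - k)
            for k in range(booked + 1)]

def expected_bumped_and_empty(capacity, booked, show_prob=0.9):
    """Expected bumped passengers and expected empty seats on one flight."""
    probs = show_distribution(booked, show_prob)
    bumped = sum(p * max(k - capacity, 0) for k, p in enumerate(probs))
    empty = sum(p * max(capacity - k, 0) for k, p in enumerate(probs))
    return bumped, empty

# Hypothetical 100-seat flight at several booking levels.
for booked in (100, 105, 110, 115):
    bumped, empty = expected_bumped_and_empty(100, booked)
    print(f"booked={booked}  expected bumped={bumped:.2f}  expected empty={empty:.2f}")
```

Booking exactly to capacity bumps nobody but leaves about ten seats empty on average; each extra booking fills some of those seats at the cost of a growing chance of bumping someone.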

Recently, because the average percentage of seats sold has been going up, the room for such manoeuvring has been squeezed, leading to higher bump rates, and more travellers being stranded.  There is some variation across airlines due to the level of sophistication of their revenue optimization algorithms, corporate strategy, etc.

The following charts present data by airline of the bump rates in 2005 and 2006.  One would be interested in answering questions such as:

  • Which airlines have the best (or worst) bump rate?
  • Are some airlines consistently better (or worse) at controlling the bump rate?
  • Which airlines have improved (or worsened) from year to year?
  • Are the differences of practical significance?


The original chart shown on the left does not reveal the answers readily.  My favourite bumps chart offers them up clearly (well, except on the question of significance).

The biggest problem, though, is the header: number of passengers per 10,000 bumped.  The data plotted appeared to be the reverse: the number of bumps per 10,000 passengers.  Otherwise, there would have been more bumped passengers than passengers!
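
A quick arithmetic check shows why the header must be wrong.  In the sketch below the counts are hypothetical (not the published airline data) and the function names are my own.

```python
def bumps_per_10k(bumped, passengers):
    """Bumped passengers per 10,000 passengers carried."""
    if bumped > passengers:
        raise ValueError("cannot bump more passengers than there are")
    return 10_000 * bumped / passengers

def passengers_per_10k_bumped(bumped, passengers):
    """The literal reading of the header: passengers per 10,000 bumped."""
    return 10_000 * passengers / bumped

# Hypothetical airline-year: 1,200 bumps among 1,000,000 passengers.
print(bumps_per_10k(1_200, 1_000_000))              # 12.0
print(passengers_per_10k_bumped(1_200, 1_000_000))  # in the millions
```

Since bumped passengers are a subset of all passengers, the second quantity can never fall below 10,000, so small plotted values can only be bumps per 10,000 passengers.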

Source: "Bumped Fliers and No Plan B", New York Times, May 30, 2007.