
Halloween comes early: scare of the Times

Has the New York Times lost the plot? This was the thought in my head when I saw this scary thing last Friday, right in the center of the front page of the New York edition:


This Halloween scare is supposed to tell us who the biggest political donors are, and to what party. Here is a close-up view:

Well, this one looks slightly different from the one shown above, but it is the one currently carried on the website.

I think the stacked squares, staggered both horizontally and vertically, are supposed to represent something, but I can't figure out what. Maybe a loudspeaker? An accordion?

This chart fails our self-sufficiency test. The only way to appreciate the scale of the donations is to extract the data from the chart, at which point it is no better than a data table.

The use of the unexplained red color for the Chamber of Commerce (the subject of the accompanying article) is also problematic. Based on the inset, it's pretty clear that the Chamber is "leaning Republican" so it should just be colored pink.

What is very hard to understand is why the amount of overlapping from one square to the next square is not the same.

We can visualize the problem as follows:

According to the data, the US Chamber of Commerce donated about the same amount as the Democratic Senatorial Campaign Committee ($21.1 million vs. $22 million). However, the visible areas on the chart are vastly different. There is a presumption that readers will look behind the squares to see each full square, but that's pretty hard to do when they are stacked 15-deep!

What's the amount of distortion due to this design?


A lot. Much of it comes from the fact that the lowest-ranked item is the only full, unobstructed square, so it happens to show the largest visible area.
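To put numbers on the distortion, here is a quick Python sketch. The 15% stagger and the assumption that each square's area encodes dollars are mine, not the Times'; treat it as an illustration only.

```python
import math

# Two near-equal donation totals from the chart, in $ millions.
chamber, dscc = 21.1, 22.0

# If each square's AREA encodes the dollar amount, the side lengths are:
side_c, side_d = math.sqrt(chamber), math.sqrt(dscc)

def visible_area(side, offset_frac=0.15):
    """Area of the L-shaped strip left visible when the next square in the
    stack covers all but an assumed 15% stagger."""
    w = offset_frac * side
    return side**2 - (side - w)**2

# A buried square shows only ~28% of its area, while the front square
# shows 100% -- so two near-equal values can look wildly different.
print(visible_area(side_d) / dscc)
```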


Here's a dot plot that conveys the essential information with minimal fuss:

Of course, a bar chart works fine too.


Happy Halloween!

The nothing that is

When Mike K. sent this in, he had a few comments, including "This is, of course, from the Chronicle of Higher Education" and "talking about a math course" -- meaning that, "very naively", we would "impose a higher standard." Should scientists be held to a higher standard? Lead by example, perhaps? I had the same feeling when I wrote the "Unscientific American" posts about the charts, published in Scientific American, that'd flunk Ed Tufte's intro class.




In one word: confusion. Mike couldn't understand the relationship between the first row of bubbles and the second row of bubbles. It is as if the one course taken at Bronx Community College results in credits recognized everywhere! (You basically have to read all the footnotes to get some clues.)

Also note the usual confusion about areas and diameters.
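This area-versus-diameter confusion is worth a quick illustration. A minimal sketch, with made-up values:

```python
import math

# Bubble charts should encode the value in AREA, not in diameter.
small, large = 10, 20   # invented values in a true 2:1 ratio

# Wrong: scale the diameter by the value -- area grows with the square,
# so a 2x value difference reads as a 4x difference.
def area_if_diameter_scaled(v):
    return math.pi * (v / 2) ** 2

# Right: scale the area by the value -- diameter grows as sqrt(value).
def diameter_if_area_scaled(v):
    return 2 * math.sqrt(v / math.pi)

print(area_if_diameter_scaled(large) / area_if_diameter_scaled(small))  # 4.0
```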

In addition, the zero-bubbles prove themselves to be the nothing that is. They expose the folly of using bubbles when the data series contain zeroes (not to mention negative numbers). We can visualize this problem:


It gets worse.

For the mathematically inclined, we actually have an impossible situation: the size-4 bubble really contains the zero bubble plus a size-4 bubble; that is, 0 + 4 = 4. But if 0 has positive area, then the area of the 4 on the left-hand side of the equation must be smaller than the area of the 4 on the right-hand side. So, basically, don't use bubble charts if your data contain zeroes.
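The contradiction can be stated in a couple of lines of code. A sketch, where the base area given to the zero bubble and the slope are arbitrary choices of mine:

```python
# If the zero bubble gets a positive area, area can no longer be
# proportional to value: area(v) = base + k * v with base > 0.
def area(v, base=1.0, k=2.0):
    return base + k * v

# Additivity fails: 0 + 4 = 4 in the data, but the drawn areas disagree.
print(area(0) + area(4) > area(0 + 4))  # True
```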

Audio bookmarks

I look at a fair number of online videos, especially those embedded on blogs. But I haven't seen this feature implemented broadly. It is a wow feature.

Look at the dots above the progress bar: they tell you what topic is being discussed and allow you to jump back and forth between segments. (The particular dot I moused over said "Randy Moss.") The video I saw came from this link.


This simple-looking feature is immensely useful to users. You can efficiently search through the audio file and find the segments you're interested in. It's like bookmarks students might put on pages of a textbook for easy reference, except these are audio bookmarks.

Why isn't this feature more prevalent? I think it's because of the amount of manual effort needed to set it up. Consider how the data have to be processed. In digital form, the audio file is just a bunch of bits (ones and zeroes), so no computer -- or human -- can identify topics from the data stored that way. Someone has to listen to the audio file, mark off the segments manually, and tag them. Only then can the audio bookmarks be plotted on the progress bar... basically a dot plot with time on the horizontal axis.
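Once the tagging is done, the plotting is trivial. A sketch of how the bookmarks might be stored and mapped onto the bar -- the segments, labels, and dimensions are all invented, apart from the "Randy Moss" dot mentioned above:

```python
# Manually tagged segments: a start time plus a topic label.
bookmarks = [
    {"start_sec": 0,   "label": "Intro"},
    {"start_sec": 95,  "label": "Randy Moss"},
    {"start_sec": 310, "label": "Playoff picture"},
]

def dot_positions(bookmarks, duration_sec, bar_width_px):
    """Convert segment start times into x-pixel offsets on the progress bar."""
    return [round(b["start_sec"] / duration_sec * bar_width_px)
            for b in bookmarks]

# A 10-minute clip on a 400-pixel bar -- effectively a dot plot with
# time on the horizontal axis.
print(dot_positions(bookmarks, duration_sec=600, bar_width_px=400))  # [0, 63, 207]
```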

In theory, you can train a computer to listen to an audio file and approximate this task. The challenge is to attain the required accuracy so you don't need to hire an army of people to correct mistakes.

A very simple concept but immensely functional. Great job!

Showing dynamics on a business chart

Dave S. achieved a rare feat, which is to send in a great-looking set of charts. This post at Asymco is worth reading in its entirety; the author Horace discusses the process by which he worked through several charts, arriving at the one he's most happy with.


The secret to the success here is the careful framing of the question, and the collection of the appropriate data to address it. The question concerns the competition among wireless phone vendors over the last three years. It was established that the right way to view this competition is in two dimensions: share of revenues, and share of profits. Note the word "share". Share of profits is not a metric that is often discussed, but it is the right metric to compare with share of revenues -- getting both numbers onto the same comparable scale is what makes this work.

Needless to say, the raw data one would collect come from the financial statements of the eight individual vendors. Plotting those numbers directly would be a mistake. So you take the numbers, making sure that you're really counting wireless revenues and wireless profits, and then compute the shares. (I am not actually sure that they have wireless profit data, because large companies like Apple and Nokia typically don't break out profits by line of business, even if they provide revenues by line of business.)
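Computing the shares themselves is the easy part once the raw numbers are in hand. A sketch with invented figures (not Asymco's data):

```python
# Raw quarterly wireless financials per vendor, in $ billions (invented).
vendors = {
    "Apple": {"revenue": 8.8, "profit": 1.6},
    "Nokia": {"revenue": 9.5, "profit": 1.1},
    "RIM":   {"revenue": 3.9, "profit": 0.9},
}

def shares(vendors, field):
    """Each vendor's fraction of the industry total for the given field."""
    total = sum(v[field] for v in vendors.values())
    return {name: v[field] / total for name, v in vendors.items()}

rev_share = shares(vendors, "revenue")
profit_share = shares(vendors, "profit")
# Both metrics now live on the same 0-to-1 scale, so they can share a chart.
```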

Horace also avoided the plague of plotting all time-series data as line charts (similar to the plague of plotting all geographic data on maps). By plotting revenues and profits simultaneously, he no longer can plot time (years) on one of the two axes, and that is a good thing.


This is the final graph Horace landed on. It puts all the vendors at the origin in 2007 and then tells us where they landed in 2010 in terms of revenue and profit share growth/decline.

It would be even better if he made the scales work harder: e.g., use equal lengths for a 10% change along both the vertical and horizontal axes. Alternatively, the scales could be set such that each unit on either axis represents equal dollars.

This is a very focused chart that answers the question about the relative change in positioning of each vendor. What it doesn't answer is each vendor's starting or ending position. Note that while Nokia is depicted as losing share on both revenues and profits, Nokia still has twice the revenue share of the other vendors, and out-earns everyone except Apple!

I am not saying this is a bad chart. It is designed to answer the relative question, not the absolute question. That's all.


There is one way to have the cake and eat it too. Horace almost created that chart: he showed two scatter plots, one for 2007 and one for 2010.

If he just overlaid one on the other, and used lines to connect the dots for each phone vendor, he would have a chart that shows absolute and relative values all at once. Here's a crude illustration (missing the labels to show that the arrow end of each line represents the 2010 position):


I like this kind of chart a lot. It is great for showing dynamics in a set of variables, without actually making the chart dynamic.

(Even on this chart, it is better to harmonize the two scales.)


Unscientific American 2: a review of key concepts

This is part 2 of a two-part piece on a set of charts published in Scientific American, describing the results of a poll asking their readers about their trust of various types of experts. In part 1, I looked in detail at one of the charts: this chart has many problems-- problems with the execution of the bar chart, problems with the choice of comparisons, problems with the design of poll questions.


A long time ago, I wrote about racetrack graphs. An example shows up here. Note the length of the yellow track relative to the length of the pink track: the ratio should be about 2:1, but this type of chart creates a many-fold distortion.
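The source of the illusion is simple geometry: readers judge arc length, but arc length is angle times radius, so the same percentage drawn on a short inner track looks far smaller. A sketch, with the radii and values assumed for illustration:

```python
import math

def arc_length(value, radius, full_circle=100):
    """Length of the arc when `value` (out of full_circle) is drawn as an
    angle on a track of the given radius."""
    angle = 2 * math.pi * value / full_circle
    return angle * radius

# Invented values in a true 2:1 ratio, drawn on assumed radii.
outer = arc_length(80, radius=100)
inner = arc_length(40, radius=40)
print(outer / inner)  # 5.0 -- the 2:1 difference reads as 5:1
```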

Bar chart, please.


For this chart illustrating fears of technology, the designer chose to emphasize the data (percentages), with a subtle embedding of a column chart in which the columns were over-stretched to form squares.

Love these pictures, very telling.

The data jump at us, rendering the graphical element--the embedded column chart--secondary, even redundant.

But columns are very useful things. They are self-sufficient. We can judge the lengths of columns (or bars) very easily, so we do not need to print data next to columns.

However, thin columns work better than thick columns, so stretching columns to squares is not a good idea, even if the relative areas are not distorted.

Also, even the choice of colors here can be quibbled with. For me, the dark pink draws more attention than the medium orange, which means that my eyes and my brain (reading the printed data) start having an argument.

So, here is a sketch of what a bar chart would look like. See, you don't need to print the data on a bar chart--it's not hard at all to see that about 50% fear nuclear power.  I have lightened the pink.  Just add in the pictures, and the captions on the right side of the chart, and you have an improved chart.


Now, start the stopwatch, and measure how much time it takes to figure out what this chart is telling you:


This is truly a head-turning chart if you know what I mean...

In the article, this chart has the title "Climate Denial on the Decline". Maybe, maybe not--but oh dear, how are we to tell?

Instead of fixing the chart, it is perhaps more important to fix the poll question. It's not a good idea to describe a trend when one has data from this year and last year only. If they were isolating the effect of some major event (say, a Kyoto protocol) that happened during the year, it would be fine, but that's not what is attempted here.

For part 1, see here.




Unscientific American 1: misreadings

Chris P. sent me to this set of charts / infographics with the subject line "all sorts of colors and graphs." I let the email languish in my inbox, and I now regret it. For three reasons: one, the topic of how scientists can communicate better with, and thus exert stronger influence on, the public is very close to my heart (as you can tell from my blogs and book), and this article presents results from a poll on this topic of the on-line readers of Scientific American and Nature magazines; two, some of the charts are frankly quite embarrassing to have appeared in venerable publications of a scientific nature (sigh); three, these charts provide a convenient platform to review some of the main themes on Junk Charts over the years.

Since the post is so long, I have split it into two parts. In part 1, I explore one chart in detail. In part 2, I use several other charts to illustrate some concepts that have been frequently deployed on Junk Charts.


Exhibit A is this chart:


First, take a look at the top left corner. At first glance, I took the inset to mean: among scientists, how much do they trust scientists (i.e., their peers) on various topics?  That seemed curious, as that wouldn't be a question I'd have thought to ask, certainly not as the second question in the poll.

On further inspection, that is a misreading of this chart. The "scientists" represented above are objects, not subjects, of the first question. As the caption tells us, the respondents rated scientists at 3.98 overall, which is an average rating across many topics. The bar chart below tells us how the respondents rated scientists on individual topics, thus providing us information on the spread of ratings.

Unfortunately, this chart raises more questions than it answers. For one, you're left working out how the average could be 3.98 (at the 4.0 white line) when all but three of the topic ratings fall below 3.98. Did they use a weighted average without letting on?

Oops, I misread the chart, again. I think what I stumbled on here is the design of the poll itself. The overall rating is probably a separate question, not at all related to the individual topic ratings. In theory, each person can assign a subjective importance as well as a rating to each topic; the average of the ratings, weighted by their respective importance, would form his or her overall rating of scientists. That would impose consistency on the two levels of ratings. In practice, it assumes that the topics span the space of what each person considers when rating the scientists overall.
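To make that concrete, here is a sketch of an importance-weighted overall rating. The ratings and weights are invented; note how the overall can sit above most of the individual topic ratings when a highly rated topic carries most of the weight:

```python
# Invented topic ratings (1-to-5 scale) and subjective importance weights.
topics = {
    "evolution":      {"rating": 4.4, "importance": 0.5},
    "nuclear power":  {"rating": 3.6, "importance": 0.2},
    "climate change": {"rating": 3.7, "importance": 0.3},
}

# Weighted average: sum of rating * importance, divided by total importance.
overall = (sum(t["rating"] * t["importance"] for t in topics.values())
           / sum(t["importance"] for t in topics.values()))
print(round(overall, 2))  # 4.03 -- higher than two of the three topic ratings
```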


The bar chart has a major problem... it does not start at zero. Since some bars are about half as long as the longest, you might think the level of trust associated with nuclear power or climate change is around 2 (negative). But it's not; it's in the 3.6 range. This is a lack of self-sufficiency: the reader cannot understand the chart without fishing out the data.

Now, ask this question: in a poll in which respondents are asked to rate things on a scale of 1, 2, 3, 4, 5, do you care about the average rating to 2 decimal places? The designer of the graphic seems to think not, as the rating was rounded to the nearest 0.5 and presented using the iconic 5-star motif. I think this is a great decision!
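The half-star conversion is a one-liner; a quick sketch:

```python
def to_half_stars(rating):
    """Round a decimal rating to the nearest half star."""
    return round(rating * 2) / 2

print(to_half_stars(3.98))  # 4.0
print(to_half_stars(2.3))   # 2.5
```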

But then, the designer fell for loss aversion: having converted the decimals to half-stars, he should have dropped the decimals; instead, he tucked them at the bottom of each picture. This is no mere trivia. Now, the reader is forced to process two different scales showing the same information. Instead of achieving simplification by adopting the star system, the reader is now examining the cracks: is the trust given to citizens groups the same as that given to journalists (both 2.5 stars), or do "people" trust citizens groups more (higher decimal rating)?


The biggest issues with this chart concern the identification of the key questions and how to collect data to address those questions. This is the top corner of the Trifecta checkup.

1) The writer keeps telling us "people" trust this and that, but the poll only covered on-line readers of Scientific American and Nature magazines. One simply cannot generalize from that segment of the population to the common "people".

2) Insufficient attention has been paid to selecting the right wording in the questions. For example, in Exhibit A, while the overall trust question was phrased as trusting the "accuracy" of the information provided by scientists vs. other groups, the trust questions on individual topics mentioned only a generic "trust". Unless one thinks "trust" is a synonym of "accuracy", the differential choice of words makes these two sets of responses hard to compare. And comparing them is precisely what they chose to do.


In part 2, I examine several other charts, taking stops at several concepts we use on Junk Charts a lot.


Ranking airlines: no easy task


Reader Joel D. submitted this chart showing airline revenues of major airlines around the world, another chart that puts bubbles on top of a map.  

Joel said:

Cool, but really quite naughty. Take the different size bubbles of British Airways (12.8) and Air France-KLM (29.7). Grossly disproportionate. I appreciate the designer's attempts to introduce a geographic element to this but the immediate take-outs here from the bubbles are misleading. 

Sometimes a bar chart is all life needs.

I think it fails all three facets of the Trifecta checkup: it does not have a well-defined practical question; the data is not processed properly; and the chart type does not work with this data.

  • Most airlines are multinational companies that make substantial revenues outside their home countries... so the locations of their registered headquarters are irrelevant. What is the question being addressed? It would appear to be: where are the headquarters of the largest airlines in the world? I don't think that is an especially engaging question. What might be more interesting, for example, is the split between domestic and international revenues for different airlines, or the split among airlines of the revenues within each continent.
  • Besides, the aggregate revenue data are not very useful for comparison purposes. They ignore population... a circle in a European country is in reality much "larger" than a circle of the same size in China! $200 million from 2 billion people is very different from $200 million from, say, 50 million. The right base for this data is probably something like revenue per passenger or per passenger mile.
  • The inclusion of Fedex also must be thought through thoroughly. I'd imagine that all the large airlines of the world also have freight divisions, and if we really want to address both passenger and commercial air revenues on the same chart (with which I don't agree), we should at least break out the freight revenues.
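The per-passenger normalization suggested above is a one-liner. A sketch with invented figures (not the chart's actual data):

```python
# Raw revenue vs revenue per passenger (all figures invented).
airlines = {
    "Airline A": {"revenue_usd_m": 29700, "passengers_m": 71},
    "Airline B": {"revenue_usd_m": 12800, "passengers_m": 33},
}

def revenue_per_passenger(d):
    # $ millions divided by millions of passengers = dollars per head
    return d["revenue_usd_m"] / d["passengers_m"]

for name, d in airlines.items():
    print(name, round(revenue_per_passenger(d)))
```

With these made-up numbers, a better-than-2:1 gap in raw revenue shrinks to a near-even per-passenger comparison, which is the point of choosing the right base.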

Visualizing your inbox

Bill Zeller, a PhD student at Princeton, sent me the link to his project "graph your inbox", that is an attempt to visualize the "data" in your Gmail account.


Seems to me that it acts as a sophisticated "search my mail" engine. The most interesting part is the ability to click on a point or a bar in one of the charts and have the corresponding emails show up in the preview panel. This interactive ability is also available in modern commercial graphing packages, and it is extremely useful for data exploration.

Technically, this is a compelling achievement: consider the amount of data being processed, organized, summarized, and plotted.

I think he needs to figure out some compelling use cases for something like this. Can you help? How would you use this capability if it were available?