Apr 14, 2008

Progress and retrogress

Joran E. pointed to this "icky" chart he found on Clive Crook's blog at the Atlantic.
Orig_tertiary

He ordered a "junkchart treatment", so here it comes.

First we wanted to process the triangles, dots and squares to make sense of this data.  We noted that the data came from a single year (2005) so the chart did not trace the development of the education sector over time.  But wait, it used a different route to get at the same idea.  The author compared different generations within each country to see if more and more citizens took university degrees.  So each vertical "arrow" was kind of a historical record of different generations within a country.  Under this criterion, Korea and Japan had come a long way while the US and China stagnated.

The chart is nearly impossible to read as designed.  There is little reason to sort by the 25-34-year-old proportion when the message concerns improvement over generations.  Besides, what about countries that apparently retrogressed (like Russia and Germany)?

Redo_tertiary2

For this data, I returned to my favored bumps chart.  Here is version one.  There are two ways to read this chart.  Across countries, we note that most of the European states (blue) had similar profiles showing roughly a constant rate of growth.  The Asian duo of Japan and Korea (brown) had the most marked growth.  In North America (black), Canada has diverged from the US since the 35-44 generation.

Alternatively, we can focus on the change generation-over-generation.  From 55-64 to 45-54, almost all countries in this sample (except Japan) grew at the same rate.  Then between 45-54 and 35-44, the two Asian countries clearly set the pace.  The step from 35-44 to 25-34 is the most interesting: Korea did not slow, while Japan slowed a little but still grew as fast as Canada.  A trio of European countries (Spain, Ireland, France) outpaced their neighbors.

Below I show version two.  This one combines bumps chart with small multiples.  North America, Europe and Asia/Australia are now in separate charts.  This removes clutter.

Redo_tertiary

Apr 12, 2008

Hanging tough

Orig_literacy

Reader Nick B. sent in this example calling it "interesting".  The chart tells a compelling story once we figure out what it is.  Grasping the tree structure is key.

It illustrates the important idea that averaging sometimes masks  variations in the data.  For example, while the province of Guerrero scored 78% on literacy, the municipalities within Guerrero had scores ranging from 28% to 90%.

It also shows that the gender gap was larger in the less literate Metlatonoc municipality than in the more literate Cuautitian.

In addition, it tells us that while Mexico on average measured very well on literacy, subpopulations within Mexico spanned the world's best and worst (from about Mali's level to Italy's).

While I find this chart adequate, the pieces hanging off each other did not seem ideal, especially the two overlapping municipality pieces which were placed next to each other.  However, it is tough to come up with an alternative.  Here's one attempt; the changes are mild.

Redo_literacy_2

I prefer the horizontal orientation.

The branches are emphasized (as opposed to the "T" junction) because that's a key part of the story.

The national level, especially the span between Mali and Italy, is de-emphasized; I treat it as gridlines.

Instead of placing the overlapping pieces next to each other, I let the ranges literally overlap, which serves to stress this feature.


Jan 24, 2008

Oscar diseconomy

Oscar

Business Week dissected the beneficiaries of the Oscar show as shown on the right.  Although this doesn't work well as a data graphic, if thought of as a variant on the data table, it is more engaging for readers.

Let's have some fun with the Oscar statue.  First, putting a bar chart next to the statue confirms that the height of the segments (rather than the area) is in proportion to the dollar values (below left).

Tufte, Chambers and others have shown that our eyes react to the areas, not heights.  So next, I estimated the areas but stretched them out into segments of equal width.  Squeezing the entire column back down to the height of the statue, the following chart (below right) puts perceived proportions next to the true proportions, displaying visually the extent of distortion. 

Redo_oscar
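
The gap between height-based and area-based proportions can be sketched numerically.  The snippet below is only an illustration, not the calculation behind the chart: the segment labels, dollar values and widths are all made up, and it assumes that a segment's area is roughly its height times the average width of the statue at that level.

```python
def shares(amounts):
    """Return each amount as a fraction of the total."""
    total = sum(amounts)
    return [a / total for a in amounts]


# Hypothetical segments, top to bottom: (label, dollar value, average statue width).
# None of these numbers come from the Business Week graphic.
segments = [
    ("head", 10, 1.0),
    ("torso", 30, 2.0),
    ("legs", 40, 1.5),
    ("base", 20, 3.0),
]

# Heights are drawn in proportion to dollars, so height share equals dollar share.
true_share = shares([value for _, value, _ in segments])

# Assumed area of a segment: height times average width.
area_share = shares([value * width for _, value, width in segments])

for (label, _, _), t, a in zip(segments, true_share, area_share):
    print(f"{label:<6} true {t:5.1%}  perceived (area) {a:5.1%}")
```

Under these invented widths, the wide base looks bigger than its dollar share and the narrow head looks smaller, which is the kind of distortion the side-by-side columns are meant to expose.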

Reference: "News you need to know", Business Week, Jan 28 2008.

Jan 04, 2008

Maps and dots

Happy New Year

The cosmos of university ranking got more interesting recently with the advent of the "brain map" by Wired magazine.  This new league table counts the total number of winners of five prestigious international prizes (Nobel, Fields, Lasker, Turing, Gairdner) in the past 20 years (up to 2007); and the researcher found that almost all winners were affiliated with American institutions.
Wired_brainmap
As discussed before, the map is a difficult graphical object; it acts like a controlling boss.  In this brain map, the concentration of institutions in the North American land mass causes over-crowding, forcing the designer to insert guiding lines drawing our attention in myriad directions.  These lines scatter the data asunder, interfering with the primary activity of comparing universities.

Wired_dots

The chain of dots object cannot stand by itself without an implicit structure (e.g. rows of 10).  This limitation was apparent in the hits and misses chart as well.  Sticking fat fingers on paper to count dots is frustrating.  Simple bars allow readers to compare relative strength with less effort.

Redo_brainmap_2

In the junkart version, we ditched the map construct completely,  retaining only the east-west axis.  [For lack of space (and time), I omitted the US East Coast and Washington-St. Louis.]  With this small multiples presentation, one can better contrast institutions.

To help comprehend the row structure, I inserted thin strikes to indicate zero awards.  A limitation of the ranking method is also exposed: UC-SF has a strong medical school and, not surprisingly, it has received a fair share of Nobel (medicine), Lasker and Gairdner prizes; but elsewhere, zero Lasker and Gairdner prizes could be due to a less competitive medical school or none at all!


Reference: "Mapping Who's Winning the Most Prestigious Prizes in Science and Technology", Wired magazine, Nov 2007.

Dec 16, 2007

Hits and misses

In this NYT article, we are told that "the most likely result when a policeman discharges a gun is that he or she will miss the target completely."  That's a shocker for those of us conditioned by Hollywood movies to think anyone who picks up a gun for the first time hits the villain right on the temple.  The following graphic attempts to tell the story.

Nyt_bullets

The one hit here is how the distances are visually presented.  The elliptical lines remind us of the neglected variable of direction; they also mean the scale is correct only along one direction.

Nyt_bullets3

The dot matrix construct highlights the absolute numbers of shots, hits and misses but barely addresses the key issue of hit rates (accuracy).  Specifically, this data set was presumably collected to explore the relationship between hit rates and distances from the target.  The use of different widths clouds our judgement of proportions.  To wit, it is not obvious that the 10-wide block and the 40-wide block shown left depict roughly equal hit rates (23%, 29%).

Redo_bullets

The junkart version adopts a different approach.  This is the Lorenz curve, often used to show income inequality (see also here and here).  Here, the shots were ordered from closest to furthest from target, then summed up by distance segments.  For example, shots from 0 to 6 feet accounted for 60% of all shots but 72% of all hits.

If distance does not affect hit rates, we'd expect 60% of all shots to result in 60% of all hits.  This data point would show up on the 45-degree diagonal on the chart, labelled "totally unpredictable".  Any data appearing above the diagonal indicates that closer shots are more accurate, accounting for more than their fair share of hits.
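
For readers who want to try this at home, here is a minimal sketch of the cumulative-share calculation behind such a curve.  The counts are hypothetical, not the figures from the Firearms Discharge Report; they were chosen only so that the first segment reproduces the 60% of shots and 72% of hits mentioned above.  The function name and segment labels are likewise my own.

```python
from itertools import accumulate


def lorenz_points(segments):
    """Cumulative share of shots and of hits, nearest distance segment first.

    segments: list of (label, shots, hits), ordered from closest to furthest.
    Returns a list of (label, cumulative share of shots, cumulative share of hits).
    """
    total_shots = sum(shots for _, shots, _ in segments)
    total_hits = sum(hits for _, _, hits in segments)
    cum_shots = accumulate(shots for _, shots, _ in segments)
    cum_hits = accumulate(hits for _, _, hits in segments)
    return [
        (label, cs / total_shots, ch / total_hits)
        for (label, _, _), cs, ch in zip(segments, cum_shots, cum_hits)
    ]


# Hypothetical counts, for illustration only.
segments = [
    ("0-6 ft", 60, 18),
    ("7-20 ft", 30, 6),
    ("21+ ft", 10, 1),
]

points = lorenz_points(segments)
for label, x, y in points:
    position = "above" if y > x else "on or below"
    print(f"{label:<8} shots {x:4.0%}  hits {y:4.0%}  ({position} the diagonal)")
```

Each point is plotted with the share of shots on the horizontal axis and the share of hits on the vertical axis; the further the points sit above the diagonal, the more distance matters.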

Comparing the fitted blue line and the diagonal, one sees that distance is a weak predictor of hit rate.  The police commissioner explains this in the article; many other variables also affect accuracy, including "the adrenaline flow, the movement of the target, the movement of the shooter, the officer, the lighting conditions, the weather..."

Note that the shots with "unknown" distances were removed from the analysis.  Also, the categories of 21-45 and 45-above were combined: the rates were similar and with only three hits, it does not make sense to treat these as separate categories.

Of course, this version would not work well in the mass media.  For that, one can just plot hit rates against the distance categories.

Source: "A Hail of Bullets, a Heap of Uncertainty", New York Times, Dec 9 2007; New York Firearms Discharge Report 2006.

Oct 15, 2007

Sense of proportion

[I'm back from vacation.  Will provide my reaction to the responses to the Gelman challenge, and for those who have sent me email, I will work through them soon.]

The NYT commented on a trend among marketers to shift their advertising spending from so-called "measured" media like print and TV to so-called "unmeasured" media like product placements, contests, etc. 
The following chart accompanied the article:

Nyt_ads_2


This construct is akin to a population pyramid; it's great for comparing two groups along one metric, say age groups between males and females.  Here, the two halves aren't comparable groups but two different metrics.  The main metric, that is, the proportion of unmeasured, is not directly depicted: the reader must figure out mentally how much of each bar the black part covers.  Also, the companies are sorted by unmeasured media spending but this leaves the measured spending with a jagged profile, confusing matters.

As for the little white slits on the gray bars, they are admittedly cute but it is difficult to compare the detailed breakdown between print, TV and other media among companies.

The following dot plot gives the two halves equal weight.  (Pink dots are measured, blue unmeasured.)  It's not a very interesting graphic though.  The sense of proportion is still missing.

Redoads1

I settled on a scatter plot which relates the proportion spent on unmeasured to the total amount of spending.  It appears that the largest advertisers had the lowest proportional unmeasured spend while the smallest (among the majors) had the highest.  (It's only a weak correlation: a linear fit yields only 16% R-squared.)
Redoads2
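
For those curious about the R-squared figure, the sketch below shows how a least-squares line and its R-squared are computed.  The spending numbers are invented for illustration (they are not the advertisers in the NYT chart), so the result will not match the 16% quoted above; the function name is also my own.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical advertisers: total spend ($ billions) and proportion unmeasured.
total_spend = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
unmeasured_share = [0.55, 0.30, 0.50, 0.35, 0.45, 0.30]

slope, intercept, r2 = linear_fit_r2(total_spend, unmeasured_share)
print(f"slope {slope:.3f}, intercept {intercept:.3f}, R-squared {r2:.0%}")
```

A negative slope with a low R-squared is what "weak correlation" means here: bigger spenders tend to have a smaller unmeasured share, but total spend explains little of the variation.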

Source: "The New Advertising Outlet: Your Life", New York Times, Oct 14, 2007.

Sep 27, 2007

A challenge

The Gelman blog has issued a challenge on how to present the following Venn diagram in a more comprehensible way.  This one is pretty tough.
Gelmanvenn

Antony Unwin sent in this entry:

Unwinvenn_2
Do you have other ideas?

May 17, 2007

People picture

Ind_cancersurvival

This graphic appeared on the front page of the British paper, the Independent.  I find it to be effective, although definitely not efficient a la Tufte: the data-ink ratio is abysmal.  Two data points on the entire page, with both data labels drawn in extra large font!

It can be improved if the 24 guys are given a different color so we can see the amount of improvement between 1971 and "NOW".

Some may complain that the use of percentages obscured population growth during this period.  Perhaps there should be fewer men on the left than on the right.  Unfortunately, that would in turn obscure the comparison of percentages.

A bit of research into the data (at Cancer Research UK) reveals that the average survival rate hides a very wide range of rates (by type of cancer, by gender, by gender and type, etc.).  One might argue that the average is quite meaningless for most users.

An alternative construct is a time series chart showing the increase in survival rate over time.  It would plot more data and depict a trend (or lack thereof).  I'd have to agree with the editor that such a chart would look unattractive on the newsstand.

Source: "Cancer: the good news", The Independent, May 16, 2007


May 06, 2007

Visualizing sensitivity

A reader wrote:

I'm a loyal reader who hopes you'll indulge him in just one or two questions.

In finance (valuation, specifically), we often create two-way sensitivity tables. Unfortunately, a three-way sensitivity table is what's most often called for. Of course, we work around this by producing multiple two-way tables.

Now, obviously, it's pretty hard to build a three-way table or chart in two dimensions, and the use-bigger-bubbles method doesn't really make sense in this kind of application-- but can you conceive of a good way to present the data in any other form?

3waydata_2

As he indicated, we typically see multiple two-way data tables for such data.  The virtue of this approach is that the data is exceptionally well-organized; it's great for looking up the outcome given the three dimensions.  (I called them Red, Green and Blue to protect the innocent.)

Further, starting from a baseline i.e. a particular cell in the table, it's easy to move our eyes up, down or jump tables to observe the impact of changing dimensions (so-called sensitivity analysis).

These data tables facilitate "local" sensitivity analysis but obscure "global" sensitivity: staring at those numbers, we feel lost in the trees and can't see the forest.  What's the effect of increasing Green on average?  What's the effect of increasing Green while decreasing Blue? etc. etc.

3waygraph

The junkart construct (right) is made to address these questions.  The black stripes establish the baseline, the overall range of values.  Then, if interested in the effect of Red = 0.11, we can compare those red stripes with the black.  Since the spread is wide, we note that Red = 0.11 is not a strong indicator of value, and to the extent it is, it points to lesser values.

What about Red = 0.11 and Green = 2?  Now, we focus on the first red stripes and the first green stripes.  We note that the overlapping region (which is where both conditions apply) is highly concentrated to the low end of value range.  Thus, we conclude that under those conditions, value is low (below 10,000) and further, that it is low primarily because Green = 2.

On and on for any one-way, two-way or three-way effects.
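
The same comparisons can be made numerically by looking at the spread of values with and without a condition applied.  The sketch below is an illustration under loud assumptions: the valuation formula is a made-up placeholder, and the grid levels other than those quoted in this post (0.09, 0.11, 2, 16, 0.28, 0.30) are invented.

```python
from itertools import product

# Grid levels; only some of these appear in the post, the rest are hypothetical.
RED = [0.09, 0.10, 0.11]
GREEN = [2, 4, 8, 16]
BLUE = [0.28, 0.30]


def value(red, green, blue):
    """Placeholder valuation formula, NOT the reader's actual model."""
    return 1000 * green * blue / red


# Every scenario in the three-way table, keyed by (red, green, blue).
scenarios = {combo: value(*combo) for combo in product(RED, GREEN, BLUE)}


def spread(scenarios, red=None, green=None, blue=None):
    """Min and max value over the scenarios matching the fixed dimensions."""
    matching = [
        v
        for (r, g, b), v in scenarios.items()
        if (red is None or r == red)
        and (green is None or g == green)
        and (blue is None or b == blue)
    ]
    return min(matching), max(matching)


print("baseline (black)     ", spread(scenarios))
print("Red = 0.11           ", spread(scenarios, red=0.11))
print("Red = 0.11, Green = 2", spread(scenarios, red=0.11, green=2))
```

A condition whose spread is nearly as wide as the baseline says little about value; one whose spread collapses to a narrow band, as the two-way condition does here, says a lot.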

Although it's not the purpose of the chart, local sensitivity can also be observed.  For example, the highest value comes from Red = 0.09, Green = 16 and Blue = 0.30.  What if Blue decreases to 0.28?  We start on the Blue = 0.28 layer; going from right to left, as we see a blue stripe, we scan vertically to find the corresponding red and green stripes; at the 3rd stripe from the right, we find the scenario of interest.  Such analysis would benefit from adding an interactive vertical guiding line.

Do you prefer 3-D plots?  Contour plots? Feel free to share your ideas!

Apr 28, 2007

Cutting through the noise

A terrific application of tag clouds can be seen over at pollster.com, following the first debate of Democratic Presidential hopefuls the other night.  Here is Senator Biden's "tag cloud", depicting the top 50 words that came out of his mouth that night.  The size of each word is proportional to how often he uttered it.

Bidentag400_2

Having not seen the debate, I can use this summary device to get a quick read on what his main points were.  It's clear that he talked about the war ("Iraq", "troops"), education ("teachers", "students"), abortion ("roe", "wade" but interestingly not the word "abortion").  Of course, if he had a distinct message, that would have been even better.  For what the tag cloud exposed (assuming it was done right) was that he was pretty much all over the place, touching on many different things about equally often.

It is disconcerting that a word like "so-called" made it into the top 50.  Better is that "better" is his #1 word.

It is typical to process text-based data by removing all the most common words that do not carry real meaning (um, ur, the, so-called, etc.) but in this case, keeping them is helpful so the candidates can catch problems like the excessive use of "so-called".

However, the tag cloud would have been improved if "stemming" were used to collapse "talk" and "talking", "teacher" and "teachers", etc.
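
To make the mechanics concrete, here is a minimal sketch of the counting behind a tag cloud, with optional stop-word removal and stemming.  It is not how Pollster built theirs: the stop-word list is a tiny illustrative one, the suffix-stripping "stemmer" is a crude stand-in for a real algorithm such as Porter's, and the sample sentence is invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "um"}


def naive_stem(word):
    """Crude suffix stripping so that "talking" and "talks" both become "talk"."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_counts(text, top_n=50, drop_stopwords=False, stem=False):
    """Return the top_n (word, count) pairs, most frequent first."""
    # Keep hyphens and apostrophes inside words so "so-called" stays one token.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    if stem:
        words = [naive_stem(w) for w in words]
    return Counter(words).most_common(top_n)


def font_sizes(counts, max_pt=48):
    """Font size proportional to frequency; the top word gets max_pt."""
    if not counts:
        return {}
    top = counts[0][1]
    return {word: max_pt * count / top for word, count in counts}


# Invented sample text, not a debate transcript.
sample = (
    "We need better teachers. A teacher talks, and talking about "
    "the so-called plan is not better."
)
print(word_counts(sample, top_n=5, stem=True))
print(font_sizes(word_counts(sample, top_n=5, stem=True)))
```

Leaving drop_stopwords off keeps fillers like "so-called" visible, while turning stem on merges "teacher" with "teachers" and "talks" with "talking".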

Clintontag400_2

Pollster did tag clouds for every candidate.  Comparing them provides even more insights!  Here's one for Senator Clinton.  Her message is much more focused: quite a lot of time spent proclaiming her "readiness" for "President", quite a bit on "healthcare" and quite a bit on the "war".

As Pollster correctly pointed out, it is unclear if the size of words could be compared across tag clouds.  If so, the setup would be even more powerful.

The entire set of tag clouds can be seen here.  Long-time readers of this blog will remember that we advocated such use back in Jan 2006, when discussing the "concordance" feature at Amazon.  This successful application validates our enthusiasm.
