Jul 21, 2008

Joining the fun

We hope this is an indication that the British paper the Guardian (with one of the best websites out there) is joining the fun. It appears that they have quietly debuted an interactive graphics feature. The first edition addressed the oil price crisis.

This time-series chart has much to be commended:

Guk_blackgold1


The use of inflation-adjusted figures seems obvious but we don't see much of it in the press. Highlighting the peaks and providing annotation (when moused over) is an excellent touch. The gridlines and axis labels (especially the year axis) are thankfully restrained. We don't see the need for the unadjusted series (blue line), however. The fact that the gap grows larger the further back we go tells us little; it invites readers to read more into it than what it truly is, the time value of money.
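To see why the gap between the two lines carries no information about oil, here is a minimal sketch of an inflation adjustment. The prices and index values are made up for illustration (they are not the Guardian's data), and `to_real` is a hypothetical helper name.

```python
# Hypothetical (year, nominal $/barrel, price index) rows, made up for illustration.
rows = [(1980, 37.0, 80.0), (1998, 12.0, 160.0), (2008, 130.0, 215.0)]
base_index = rows[-1][2]  # express everything in the latest year's dollars

def to_real(nominal, index, base=base_index):
    """Restate a nominal price in base-year dollars."""
    return nominal * base / index

# The adjusted/unadjusted ratio depends only on the price index, not on oil.
ratios = {year: base_index / index for year, _, index in rows}

for year, nominal, index in rows:
    print(year, nominal, round(to_real(nominal, index), 2), round(ratios[year], 2))
```

The ratio column shrinks toward 1 as the years approach the base year, whatever the oil price does, which is all the widening gap on the chart is showing.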

Later on, they used an oil barrel object to illustrate the components of the retail oil price. The height of each section of the cylinder is indeed proportional to the data plotted. If only they had colored the end of the cylinder gray instead of green! As it stands, the green portion has about the same area as the red.


Guk_blackgold2


Reference: "Interactive: oil price", Guardian, July 14 2008.

Jun 21, 2008

Close races

Nyt_citylimits1 Perhaps harkening to the close race between Obama and Clinton, the designer chose to illustrate this with what we have called the "racetrack" graph.  We have previously discussed the problems here and here.

Nyt_citylimits2 In this rendition, a pie chart was divided into three race tracks with "cities" getting the inside track and "rural/small cities" getting the outside track.  (As the Clinton supporters might say, elitism was in the air.)  There were two great choices: the courage to not print the data and let the chart speak for itself, and the wisdom to white out the votes for "others".

Nevertheless, as we discussed before, the data is coded into the angles rather than the lengths of the strips, which presents a real problem in comparing vote shares.  For example, try figuring out if there were more Obama supporters in rural Tennessee than there were Clinton supporters in cities in Tennessee (bottom right).

Nyt_citylimits3 Also note where the white "others" spaces were, and the impossibility of comparing them.

The arrangement for Wisconsin, meanwhile, posed a challenge for anyone who wanted to estimate how many rural Wisconsin voters went for "others".


In the junkart version, we go with the two-sided bar chart, typically found in population pyramids.  The information presented jumps out at you.

Redo_citylimits3 This chart is essentially the same as the racetrack; one just needs to straighten out the strips from the original chart, and pull the Clinton ones clockwise, and Obama ones anti-clockwise.



Reference: some recent issue of New York Times magazine.

Jun 13, 2008

A budding field

Avinash has an interesting piece about some examples of visualization of Web data.  That's a very rich area since there is so much data.  I agree with his observation that there are precious few truly great charts that have thus far appeared.  (Note, though, that typically the more data, the more noise.  See this post.)

He discussed a tag cloud display of the top cities from which website visitors hail.  We like tag clouds too. See here, here and here.

He praised a particular pie chart because "the pie ... is just a stage prop".  It worked because all the data was printed on the chart itself.  This violates our self-sufficiency principle: if all the data is printed on the chart, and the only way to read it is to look at the data, then the chart serves no purpose.  More here.

He liked Amazon's feature showing the distribution of customer ratings. Me too. It is a powerful example of small graphics that make a huge impact. Here is the typical Web rating display:
Amazon1 Almost everyone uses the statistical average. This hides information about how dispersed (or not) customers' reactions were. The current Amazon display gives us this information:
Amazon2
Notice that 108 customers actually gave this book the lowest rating even though the average was four stars.
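A small sketch of the same point: the mean collapses the distribution into one number, while the per-star counts show the dispersion. Only the 108 one-star ratings come from the display above; the other counts are hypothetical, chosen so the average rounds to four stars.

```python
# Ratings by star level. Only the 108 one-star count is from the post;
# the rest are hypothetical values for illustration.
counts = {5: 600, 4: 150, 3: 60, 2: 40, 1: 108}

total = sum(counts.values())
mean = sum(stars * n for stars, n in counts.items()) / total
print(f"average: {mean:.1f} stars from {total} ratings")

# A text version of the distribution display: one bar per star level.
for stars in sorted(counts, reverse=True):
    bar = "#" * round(40 * counts[stars] / total)
    print(f"{stars} star {bar:<40} {counts[stars]}")
```

The average alone would never reveal that the one-star group is larger than the two-star and three-star groups combined in this made-up sample.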

The most intriguing example was Google's comparison of keyword performance to the site average.  It's a good idea but the execution is wanting.

Googlekeywords

Firstly, I believe the percentages are much better presented as index values, with 100 being the site average. Secondly, it is unnerving to have red associated with positive values, green with negative values, or to have negative values on the right of positive values. I think they realize green and to the right should represent "good" (bounce rate of visitors lower than average) but this just doesn't work. Thirdly, are the data labels really necessary? They impede our sight lines when comparing bars. And do we need to know the values to two decimal places?
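For what the index suggestion would look like in practice, here is a minimal sketch. The keyword names and bounce rates are hypothetical, and `to_index` is an illustrative helper, not anything from Google's tool.

```python
def to_index(value, site_average):
    """Express a metric as an index where the site average equals 100."""
    return round(100 * value / site_average)

site_bounce_rate = 40.0  # hypothetical site-wide bounce rate, in percent
keyword_bounce_rates = {"keyword A": 30.0, "keyword B": 40.0, "keyword C": 50.0}

indexed = {k: to_index(v, site_bounce_rate) for k, v in keyword_bounce_rates.items()}
for keyword, index in indexed.items():
    print(f"{keyword}: {index}")
```

Rounding to whole index points also takes care of the two-decimal-place complaint: 75, 100 and 125 are read at a glance, with no signs or colors to decode.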




PS. Apologies for the inconsistent font.  Typepad continues its mischief: I couldn't change the font size after adding a hyperlink.  Apparently I have to fix the font size before adding a link.  You also might notice the changing font size as I write this paragraph.  Don't know why there was a switch; I didn't ask for it.

Jun 07, 2008

The right scale

Oftentimes, picking the right scale for a chart makes all the difference.  The following chart showed up in the New York Times Magazine some time ago.  Readers will immediately recognize this as "infotainment" rather than a serious attempt to convey the data.

Nyt_minutes2

The data came from a study by the Center on Education Policy which counted the amount of instruction time spent on various subjects at a sample of elementary schools in the U.S.

A simple bar chart would make a nice graphic, as shown on the right.  Instead of sorting by decreasing minutes, we pulled out "lunch" and "recess" since they belong to a separate category.

Our main focus, though, is on the scale. The original report - and thus the original graphic - used minutes per week. We contend that minutes per day (or even hours per day) is more user-friendly. This is because any number makes sense only in comparison to other numbers. There is no easy reference for a number such as 500 minutes per week. However, being told it's 100 minutes per day (or 1 hr 40 min per day) means a lot because everyone knows there are 24 hours in a day.
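The rescaling is simple arithmetic. This sketch assumes a five-day school week (which is what takes 500 minutes per week to 100 minutes per day); `per_day` is an illustrative helper name.

```python
SCHOOL_DAYS_PER_WEEK = 5  # assumption: a five-day school week

def per_day(minutes_per_week, days=SCHOOL_DAYS_PER_WEEK):
    """Convert weekly minutes to daily minutes plus an 'H hr M min' label."""
    minutes = minutes_per_week / days
    hours, remainder = divmod(round(minutes), 60)
    return minutes, f"{hours} hr {remainder} min"

minutes, label = per_day(500)
share_of_day = minutes / (24 * 60)  # fraction of a 24-hour day
print(minutes, label, f"{share_of_day:.1%}")
```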

This is a small example of a larger problem with using averages.  The media loves to give out statistics like six people are dying of diabetes every minute (e.g. here).  This is typically done by dividing the total number of diabetes-related deaths in a year by the number of minutes in a year.   Why divide by total number of minutes in a year?  The fallacy of such a calculation is evident if one applies this logic to natural deaths (since we all have to die some day).  As the world population grows, there will just be more and more people dying every minute!
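A rough sketch of how such a per-minute figure is produced, and why it grows with population even when the underlying risk does not change. The death rate and population figures below are hypothetical round numbers, not actual statistics.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600

def deaths_per_minute(annual_deaths):
    """The headline statistic: annual deaths spread evenly over every minute."""
    return annual_deaths / MINUTES_PER_YEAR

# Working backwards: "six every minute" implies this many deaths per year.
implied_annual = 6 * MINUTES_PER_YEAR

# Hypothetical: hold the death rate fixed and let the population grow.
rate_per_100k = 50
per_minute = {}
for population in (6_000_000_000, 8_000_000_000):
    annual = population * rate_per_100k // 100_000
    per_minute[population] = deaths_per_minute(annual)
    print(population, annual, round(per_minute[population], 2))
```

The rate per 100,000 people is identical in both rows, yet the per-minute count rises, which is why the population is the more sensible reference point than the clock.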

Choosing the appropriate reference point -- just like picking the right scale -- is the beginning of any good analysis.


Reference: New York Times magazine, April 27 2008; Center on Education Policy.

Apr 14, 2008

Progress and retrogress

Joran E. pointed to this "icky" chart he found on Clive Crook's blog at the Atlantic.
Orig_tertiary

He ordered a "junkchart treatment", so here it comes.

First we wanted to process the triangles, dots and squares to make sense of this data.  We noted that the data came from a single year (2005) so the chart did not trace the development of the education sector over time.  But wait, it used a different route to get at the same idea.  The author compared different generations within each country to see if more and more citizens took university degrees.  So each vertical "arrow" was kind of a historical record of different generations within a country.  Under this criterion, Korea and Japan had come a long way while the US and China stagnated.

The chart is quite impossible to read as designed.  There is little reason to sort by 25-34-year-old proportion when the message concerns improvement over generations.  Besides, what about countries that apparently retrogressed?  (like Russia and Germany)

Redo_tertiary2 For this data, I returned to my favored bumps chart. Here is version one. There are two ways to read this chart. Across countries, we note that most of the European states (blue) had similar profiles showing roughly a constant rate of growth. The Asian duo of Japan and Korea (brown) had the most marked growth. In North America (black), Canada has diverged from the US starting with the 35-44 generation.

Alternatively, we can focus on the change generation-over-generation. From 55-64 to 45-54, almost all countries in this sample (except Japan) grew at the same rate. Then between 45-54 and 35-44, the two Asian countries clearly set the pace. The step from 35-44 to 25-34 is most interesting: Korea did not slow, while Japan slowed a little but still grew as fast as Canada. A trio of European countries (Spain, Ireland, France) outpaced their neighbors.

Below I show version two.  This one combines bumps chart with small multiples.  North America, Europe and Asia/Australia are now in separate charts.  This removes clutter.

Redo_tertiary

 

Apr 12, 2008

Hanging tough

Orig_literacy

Reader Nick B. sent in this example calling it "interesting".  The chart tells a compelling story once we figure out what it is.  Grasping the tree structure is key.

It illustrates the important idea that averaging sometimes masks  variations in the data.  For example, while the province of Guerrero scored 78% on literacy, the municipalities within Guerrero had scores ranging from 28% to 90%.

It also shows that the gender gap was larger in the less literate Metlatonoc municipality than in the more literate Cuautitian.

In addition, it tells us that while Mexico on average measured very well on literacy, subpopulations within Mexico spanned the world's best and worst (from about Mali's level to Italy's).

While I find this chart adequate, the pieces hanging off each other did not seem ideal, especially the two overlapping municipality pieces which were placed next to each other.  However, it is tough to come up with an alternative.  Here's one attempt; the changes are mild.

Redo_literacy_2 I prefer the horizontal orientation.

The branches are emphasized (as opposed to the "T" junction) because that's a key part of the story.

The national level, especially the span between Mali and Italy, is de-emphasized; I treat it as gridlines.

Instead of placing the overlapping pieces next to each other, I let the ranges literally overlap, which serves to stress this feature.


 

 

Jan 24, 2008

Oscar diseconomy

Oscar Business Week dissected the beneficiaries of the Oscar show as shown on the right. Although this doesn't work well as a data graphic, if thought of as a variant on the data table, it is more engaging for readers.

Let's have some fun with the Oscar statue. First, putting a bar chart next to the statue confirms that the height of the segments (rather than the area) is in proportion to the dollar values (below left).

Tufte, Chambers and others have shown that our eyes react to the areas, not heights.  So next, I estimated the areas but stretched them out into segments of equal width.  Squeezing the entire column back down to the height of the statue, the following chart (below right) puts perceived proportions next to the true proportions, displaying visually the extent of distortion. 
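The stretch-and-squeeze procedure can be sketched as follows. The segment names, heights and average widths are hypothetical stand-ins (the real ones were estimated from the statue image), so only the method, not the numbers, reflects the chart.

```python
# Hypothetical statue segments: (label, height, average width).
segments = [("head", 10, 2), ("torso", 30, 4), ("base", 10, 8)]

total_height = sum(h for _, h, _ in segments)
total_area = sum(h * w for _, h, w in segments)

comparison = {}
for label, h, w in segments:
    true_share = h / total_height        # what the data say (height encodes value)
    perceived_share = h * w / total_area  # what the eye sees (area)
    # Equal-width column squeezed back to statue height:
    perceived_height = perceived_share * total_height
    comparison[label] = (true_share, perceived_share, perceived_height)
    print(f"{label:<6} true {true_share:.0%}  perceived {perceived_share:.0%}")
```

Wide segments such as the base end up looking more important than their data values warrant, while narrow ones are understated.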

Redo_oscar

Reference: "News you need to know", Business Week, Jan 28 2008.

Jan 04, 2008

Maps and dots

Happy New Year

The cosmos of university ranking got more interesting recently with the advent of the "brain map" by Wired magazine.  This new league table counts the total number of winners of five prestigious international prizes (Nobel, Fields, Lasker, Turing, Gairdner) in the past 20 years (up to 2007); and the researcher found that almost all winners were affiliated with American institutions.
Wired_brainmap
As discussed before, the map is a difficult graphical object; it acts like a controlling boss.  In this brain map, the concentration of institutions in the North American land mass causes over-crowding, forcing the designer to insert guiding lines drawing our attention in myriad directions.  These lines scatter the data asunder, interfering with the primary activity of comparing universities.

Wired_dots The chain of dots object cannot stand by itself without an implicit structure (e.g. rows of 10).  This limitation was apparent in the hits and misses chart as well.  Sticking fat fingers on paper to count dots is frustrating.  Simple bars allow readers to compare relative strength with less effort.

Redo_brainmap_2

In the junkart version, we ditched the map construct completely,  retaining only the east-west axis.  [For lack of space (and time), I omitted the US East Coast and Washington-St. Louis.]  With this small multiples presentation, one can better contrast institutions.

To help comprehend the row structure, I inserted thin strikes to indicate zero awards. A limitation of the ranking method is also exposed: UC-SF has a strong medical school and, not surprisingly, it has received a fair share of Nobel (medicine), Lasker and Gairdner prizes; but at other institutions, zero Lasker and Gairdner prizes could be due to a less competitive medical school or to having none at all!


Reference: "Mapping Who's Winning the Most Prestigious Prizes in Science and Technology", Wired magazine, Nov 2007.

Dec 16, 2007

Hits and misses

In this NYT article, we are told that "the most likely result when a policeman discharges a gun is that he or she will miss the target completely."  That's a shocker for those of us conditioned by Hollywood movies to think anyone who picks up a gun for the first time hits the villain right on the temple.  The following graphic attempts to tell the story.

Nyt_bullets

The one hit here is how the distances are visually presented.  The elliptical lines remind us of the neglected variable of direction; it also means the scale is correct only along one direction.

The dot matrix construct highlights the absolute numbers of shots, hits and misses but barely addresses the key issue of hit rates (accuracy). Nyt_bullets3 Specifically, this data set was presumably collected to explore the relationship between hit rates and distances from the target.  The use of different widths clouds our judgement of proportions.  To wit, it is not obvious that the 10-wide block and the 40-wide block shown left depict roughly equal hit rates (23%, 29%).

Redo_bullets The junkart version adopts a different approach.  This is the Lorenz curve, often used to show income inequality (see also here and here).  Here, the shots were ordered from closest to furthest from target, then summed up by distance segments.  For example, shots from 0 to 6 feet accounted for 60% of all shots but 72% of all hits.

If distance does not affect hit rates, we'd expect 60% of all shots to result in 60% of all hits.  This data point would show up on the 45-degree diagonal on the chart, labelled "totally unpredictable".  Any data appearing above the diagonal indicates that closer shots are more accurate, accounting for more than their fair share of hits.
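For readers who want to reproduce the construction, here is a minimal sketch of how the curve's points are computed. The shot and hit counts are hypothetical, chosen only so that the first band matches the 60% of shots and 72% of hits quoted above; they are not the figures from the Firearms Discharge Report.

```python
# Hypothetical (distance band, shots, hits), ordered from closest to furthest.
bands = [("0-6 ft", 300, 90), ("7-20 ft", 120, 25), ("21+ ft", 80, 10)]

total_shots = sum(shots for _, shots, _ in bands)
total_hits = sum(hits for _, _, hits in bands)

# Each point is (cumulative share of shots, cumulative share of hits).
points = [(0.0, 0.0)]
cum_shots = cum_hits = 0
for _, shots, hits in bands:
    cum_shots += shots
    cum_hits += hits
    points.append((cum_shots / total_shots, cum_hits / total_hits))

for x, y in points:
    position = "above" if y > x else "on"
    print(f"{x:.0%} of shots -> {y:.0%} of hits ({position} the diagonal)")
```

If hit rates were the same at every distance, every point would sit on the diagonal; the further the points bulge above it, the more distance matters.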

Comparing the fitted blue line and the diagonal, one sees that distance is a weak predictor of hit rate.  The police commissioner explains this in the article; many other variables also affect accuracy, including "the adrenaline flow, the movement of the target, the movement of the shooter, the officer, the lighting conditions, the weather..."

Note that the shots with "unknown" distances were removed from the analysis.  Also, the categories of 21-45 and 45-above were combined: the rates were similar and with only three hits, it does not make sense to treat these as separate categories.

Of course, this version would not work well in the mass media.  For that, one can just plot hit rates against the distance categories.

Source: "A Hail of Bullets, a Heap of Uncertainty", New York Times, Dec 9 2007; New York Firearms Discharge Report 2006.

Oct 15, 2007

Sense of proportion

[I'm back from vacation.  Will provide my reaction to the responses to the Gelman challenge, and for those who have sent me email, I will work through them soon.]

The NYT commented on a trend among marketers to shift their advertising spending from so-called "measured" media like print and TV to so-called "unmeasured" media like product placements, contests, etc. 
The following chart accompanied the article:

Nyt_ads_2


This construct is akin to a population pyramid; it's great for comparing two groups along one metric, say age groups between males and females.  Here, the two halves aren't comparable groups but two different metrics.  The main metric, that is, the proportion of unmeasured, is not directly depicted: the reader must figure out mentally how much of each bar the black part covers.  Also, the companies are sorted by unmeasured media spending but this leaves the measured spending with a jagged profile, confusing matters.

As for the little white slits on the gray bars, they are admittedly cute but it is difficult to compare the detailed breakdown between print, TV and other media among companies.

The following dot plot gives the two halves equal weight. Redoads1 (Pink dots are measured, blue unmeasured.) It's not a very interesting graphic though. The sense of proportion is still missing.

I settled on a scatter plot which relates the proportion spent on unmeasured to the total amount of spending.  It appears that the largest advertisers had the lowest proportional unmeasured spend while the smallest (among the majors) had the highest.  (It's only a weak correlation: a linear fit yields only 16% R-squared.)
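As a reminder of what the R-squared figure measures, here is a small sketch for a straight-line fit with one predictor, where R-squared equals the squared correlation. The spending figures are hypothetical and will not reproduce the 16% quoted above; `r_squared` is an illustrative helper.

```python
def r_squared(xs, ys):
    """R-squared of a least-squares line with one predictor (squared correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical advertisers: total spend ($ billions) and unmeasured share.
total_spend = [1, 2, 3, 4, 5]
unmeasured_share = [0.50, 0.30, 0.45, 0.35, 0.30]

r2 = r_squared(total_spend, unmeasured_share)
print(f"R-squared: {r2:.0%}")
```

A low value means total spending explains only a small part of the variation in the unmeasured share, so the downward pattern in the scatter plot should be read as a tendency rather than a rule.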
Redoads2

Source: "The New Advertising Outlet: Your Life", New York Times, Oct 14, 2007.








