Sep 17, 2007

Structuring a chart

This chart from the NYT was intended to show how the EPA has moved the bar on vehicle mileage ratings: 2008 estimates were lower than 2007 estimates across the board, regardless of manufacturer, model and city/highway.

The chart was built from one basic component, repeated for each model. 
I like the discreet gridlines (the white ticks), which enable readers to count off the mileage ratings.

The data is rich: ratings were given along three dimensions (model, year of estimate and city/highway).  Readers would benefit from stronger guidance on where to look for the most pertinent information.  As the chart stands, it is merely a container for the data.  It fails our self-sufficiency test: all the data were printed on the chart, and the bars add little.

In the junkart version, I use knowledge of the data to structure the chart. First, noting that sedans, hybrids and trucks/SUVs/minivans have different levels of mileage ratings, I clustered the models into three groups.  Second, the city and highway ratings were separated into two columns, as I consider the between-model comparisons more important than city-highway comparisons.
The chart is a dot plot, with a vertical tick for 2007 estimates and a dot for 2008 estimates.  It's easy to see that all dots sit to the left of the vertical ticks.

More subtly, we can also see that the hybrids appeared to have been penalized more.  Or perhaps, the higher the rating, the larger the downward adjustment...

Source: "Mileage Ratings Are Still Estimates, Though Closer to Reality", New York Times, Sept 16 2007.

Aug 15, 2007

Could-be-light entertainment

It's the heat of the summer so here's another entertaining contribution.  Mike K, a reader, helpfully points us to this chart from The Onion (a satirical paper).

The artist must know some best practices since he/she can get so many things wrong at once.  At least he/she can do math: the percentages do add up to 100.

Histograms are the second most popular chart; that's a surprise!

Source: "America's Most Popular Charts", The Onion, Jan 7, 2007.

Aug 08, 2007

On the bubble

A couple of you noticed this table of bubbles in the Times, and asked what I think of it.  Dustin J suggested that this could be considered a decent application of bubble charts.  I agree, with some reservations.

The data set is the best thing about this chart.  The riches that lie beneath!  Many questions can be addressed, including:

  • Which Presidential candidates are getting the most face time?
  • Are candidates seen equally often across the stations?
  • Are there differences between network and cable stations in terms of total face time?  In terms of individual face time?
  • Are there Democratic/Republican leanings by station?  by type of station?

The intrepid can even build a regression out of it.

The bubble chart contains answers to all those questions but nothing jumps out. Okay, it's easy to see the station that gives each candidate the most face time.  Anything else requires moderate to a lot of effort.  Here's the junkart version.


The list of things done to the data is long:

  • Candidates are grouped together by party
  • Candidates within each party are arranged in order of decreasing maximum face time
  • Stations are arranged by increasing total face time; this order happens to retain the network vs cable divide
  • A heat map construct is used instead of bubbles.  The legend is missing, but there are four shades for each color: darkest = top 10% (above the 90th percentile); medium = 50th to 90th percentile; light = below the 50th percentile, excluding zeroes; white = no face time.  In raw numbers, 90th percentile = 81 minutes, 50th percentile = 19 minutes.
  • The only data shown are the totals by candidate and totals by station.
  • On the right margin are little bar charts that show the distribution of network/cable for each candidate.
  • On the bottom margin are little column charts showing the distribution of party affiliation by station.
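
The shading rule in the heat map bullet can be written down as a small function.  This is a minimal sketch in Python, not the code used to make the chart: the function name `shade` and the treatment of values that fall exactly on a cutoff are my assumptions; only the two cutoffs (81 and 19 minutes) come from the list above.

```python
# Cutoffs quoted in the post: 90th percentile = 81 minutes, 50th = 19 minutes.
P90_MINUTES = 81
P50_MINUTES = 19

def shade(minutes):
    """Map one candidate-by-station face time (in minutes) to a heat-map shade.

    Assumption: a value exactly equal to a cutoff goes to the darker bin.
    """
    if minutes == 0:
        return "white"      # no face time
    if minutes >= P90_MINUTES:
        return "darkest"    # top 10%
    if minutes >= P50_MINUTES:
        return "medium"     # between the 50th and 90th percentiles
    return "light"          # bottom half, excluding zeroes

# Hypothetical face times, for illustration only (not from the NYT data).
for m in (0, 5, 19, 40, 81, 120):
    print(m, shade(m))
```

Binning into four shades throws away precision on purpose: the reader compares rough levels across the grid rather than reading off individual minutes.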

A few observations follow:

  • Cable stations gave much more face time to the candidates in general.  Fox, no surprise, gives Republicans 85% of its time while all the others were roughly equal.
  • The more mainstream the candidate, the more balanced was the time spent on networks versus cable.  John McCain (R), Hillary Clinton (D) and John Edwards (D) had the highest proportion of network time.
  • More time is not necessarily good: McCain was the clear winner in face time, but his campaign is struggling.

Source: "Tracking Face Time", New York Times, August 1, 2007.

Jul 29, 2007

Transgender trends

One of the many gratifications of blogging is to connect with others who have similar interests; so it has been fantastic to receive user submissions (though admittedly I don't check my inbox frequently enough).  The thoughtfulness of these nominations continues to impress me.

Evan sent in 254 charts he created after looking at the post on baby names.  An example is shown on the right.

He is particularly interested in the question of names that are given to both males and females. 

For example, the bottom chart shows that Jordan is primarily a male name, and saw a period of growth followed by decline, although the decline has been more severe on the male side than the female side. 

It's a nice touch to label the most recent year.  I'd also label the values for the most recent year on the axes.

Evan also offers the following solution to the scaling problem we identified in the original WSJ chart:

My solution was just to put two charts on each chart. One at a fixed scale for every chart to give a sense of size and one at a variable scale to better show the shape of the plot.

In other words, for less popular names, the top chart would look much more compressed.

There are many more charts to sift through on his site.  Evan welcomes suggestions.

Jul 12, 2007

More prevalent versus more likely

Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line.  This is a pretty chart that does an admirable job with a difficult data set.

The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense.  So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line.  In addition, the total of each column can be much more than 100% because multiple responses were allowed.

Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people.  A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers".  But this is wrong because the chart hides the age distribution.  While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives".  This is the difference between prevalence and incidence rate.  (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
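
A small calculation shows why the two readings differ.  The sketch below is in Python; the 70% inactive rate for seniors is from the chart, but the population shares and the inactive rate for everyone else are invented purely for illustration, since the chart hides the age distribution.

```python
# Share of each age group that is "Inactive" (what the chart's columns show).
# Only the 70% for Seniors comes from the chart; the other rate is hypothetical.
inactive_rate = {"Seniors": 0.70, "Everyone else": 0.40}

# Share of the surveyed population in each group (hidden by the chart; hypothetical).
population_share = {"Seniors": 0.10, "Everyone else": 0.90}

# Inactives contributed by each group, as a fraction of the whole population.
inactive_mass = {g: inactive_rate[g] * population_share[g] for g in inactive_rate}
total_inactive = sum(inactive_mass.values())

# Share of all Inactives who belong to each group (the tempting "row" reading).
share_of_inactives = {g: inactive_mass[g] / total_inactive for g in inactive_mass}

for group, share in share_of_inactives.items():
    print(f"{group}: {share:.0%} of Inactives")
```

Under these made-up shares, seniors are the group most likely to be inactive, yet they account for only about one in six "Inactives".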

The construct of the square grids is less damaging than it seems.  In effect, the data has been rescaled by dividing by 10.  The reader is then forced to apply "rounding".  If you are someone who sees $19.95 as $19, then you'd round down the partial rows.  If you see $19.95 as $20, you'd round up the partial rows.  So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.

Here's another example where the profile chart shines.  Because the percentages don't sum up to 100%, the other alternatives like stacked bar charts and Marimekko/mosaic charts don't work.  (Prior discussion of this issue here.)

This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities.  The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives".  We also see that the likelihood of being "Collectors" has little to do with age.

Source: "Inside Innovation -- In Data", Business Week, June 11 2007.


Jun 19, 2007

Wizardry

An anonymous reader dropped a comment pointing us to Martin Wattenberg's gallery at Business Week.  Martin's work falls into the category of information visualization, which typically concerns cramming as much high-dimensional data as possible onto 2D or 3D displays, augmented heavily by colors, shapes, interactivity, superpositioning and other tricks.  Often pleasing to the eye, these graphics usually take time to warm up to.  Sites like Infosthetics and Visual Complexity cover them well.

Martin is responsible for the baby names visualization, which tracks the popularity of names over the years.

Martin also created treemaps like this one.  Does this show relative stock performance better than other designs?

May 23, 2007

Looking for survival

Daniel W of esnips has started a collection of graphics on visualizing web statistics.  The following graph is an attempt to capture the ability of the web-site to attract returning customers.

The time axis serves double duty here: it is an indication of which "cohort" the users belong to, in other words, when they signed up; it is, also, the month of returning visits.

A more typical chart used by statisticians is the survival curve.  As shown here, these are the same curves as above but shifted to share the same starting point.  Now, the time axis is interpreted as number of months after registration.  The chart answers: of 100 members who registered in January, how many returned one month later, two months later, and so on?
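
The realignment from calendar months to months after registration is a simple transformation.  Here is a minimal Python sketch; the cohort sizes and return counts are hypothetical (I don't have Daniel's numbers), and the data layout and the name `survival_curve` are my own choices for illustration.

```python
# Hypothetical data: for each sign-up cohort, the calendar month of registration
# (as an index), the cohort size, and the members returning in each later month.
cohorts = {
    "Jan": {"start": 0, "size": 200, "returns": {1: 120, 2: 90, 3: 70}},
    "Feb": {"start": 1, "size": 150, "returns": {2: 96, 3: 75}},
    "Mar": {"start": 2, "size": 100, "returns": {3: 68}},
}

def survival_curve(cohort):
    """Return [(months_since_registration, returning per 100 members), ...]."""
    curve = [(0, 100.0)]  # every curve starts at the same point
    for calendar_month in sorted(cohort["returns"]):
        age = calendar_month - cohort["start"]  # shift to months after sign-up
        rate = 100.0 * cohort["returns"][calendar_month] / cohort["size"]
        curve.append((age, rate))
    return curve

for name, cohort in cohorts.items():
    print(name, survival_curve(cohort))
```

Dividing by cohort size is what makes cohorts of different sizes comparable on one axis; the oldest cohort naturally yields the longest curve.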

If the purpose is to evaluate the consistency of retaining customers by cohort, then this graphic is less cluttered.  I also used a fading metaphor to color the lines so that the oldest cohort (also, the longest line) is the faintest.  Line labels are best hidden, and revealed interactively when the user mouses over a line of interest.

Not sure if Daniel was plotting real data; in general, we expect a certain amount of criss-crossing.  If the data is real, then his site has seen uninterrupted improvement every month thus far.

Source: The Web Analytics Graph Collection, eSnips.

May 06, 2007

Visualizing sensitivity

A reader wrote:

I'm a loyal reader who hopes you'll indulge him in just one or two questions.

In finance (valuation, specifically), we often create two-way sensitivity tables. Unfortunately, a three-way sensitivity table is what's most often called for. Of course, we work around this by producing multiple two-way tables.

Now, obviously, it's pretty hard to build a three-way table or chart in two dimensions, and the use-bigger-bubbles method doesn't really make sense in this kind of application-- but can you conceive of a good way to present the data in any other form?

As he indicated, we typically see multiple two-way data tables for such data.  The virtue of this approach is that the data is exceptionally well-organized; it's great for looking up the outcome given the three dimensions.  (I called them Red, Green and Blue to protect the innocent.)

Further, starting from a baseline i.e. a particular cell in the table, it's easy to move our eyes up, down or jump tables to observe the impact of changing dimensions (so-called sensitivity analysis).

These data tables facilitate "local" sensitivity analysis but obscure "global" sensitivity: staring at those numbers, we feel lost in the trees and can't see the forest.  What's the effect of increasing Green on average?  What's the effect of increasing Green while decreasing Blue?  And so on.
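
One way to make a "global" question concrete is to average the outcome at each level of one dimension over all combinations of the other two.  The Python sketch below does this; the `value` function is a hypothetical stand-in (I don't know the reader's valuation model), and the grid uses only the levels mentioned in this post.

```python
from itertools import product
from statistics import mean

# Levels mentioned in the post; the reader's actual tables may contain more.
reds = [0.09, 0.11]
greens = [2, 16]
blues = [0.28, 0.30]

def value(red, green, blue):
    """Hypothetical stand-in for the valuation model (not the reader's formula)."""
    return 1000 * green * blue / red

# One row per scenario: the three-way table, flattened.
scenarios = [(r, g, b, value(r, g, b)) for r, g, b in product(reds, greens, blues)]

def average_by(index, rows):
    """Average value at each level of one dimension, over all other dimensions."""
    levels = sorted({row[index] for row in rows})
    return {lv: mean(row[3] for row in rows if row[index] == lv) for lv in levels}

print("Green:", average_by(1, scenarios))
print("Red:  ", average_by(0, scenarios))
```

Averages hide the spread, which is why a display of the full range of values at each level tells more than the table of means alone.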

The junkart construct (right) is made to address these questions.  The black stripes establish the baseline, the overall range of values.  Then, if interested in the effect of Red = 0.11, we can compare those red stripes with the black.  Since the spread is wide, we note that Red = 0.11 is not a strong indicator of value, and to the extent it is, it points to lesser values.

What about Red = 0.11 and Green = 2?  Now, we focus on the first red stripes and the first green stripes.  We note that the overlapping region (which is where both conditions apply) is highly concentrated to the low end of value range.  Thus, we conclude that under those conditions, value is low (below 10,000) and further, that it is low primarily because Green = 2.

On and on for any one-way, two-way or three-way effects.

Although it's not the purpose of the chart, local sensitivity can also be observed.  For example, the highest value comes from Red = 0.09, Green = 16 and Blue = 0.30.  What if Blue decreases to 0.28?  We start on the Blue = 0.28 layer; going from right to left, as we see a blue stripe, we scan vertically to find the corresponding red and green stripes; the 3rd stripe from the right, we find the scenario of interest.  Such analysis would benefit from adding an interactive vertical guiding line.

Do you prefer 3-D plots?  Contour plots? Feel free to share your ideas!

Apr 25, 2007

Shower of bullets

Here's one of those infographics that makes the reader work hard (via Dustin J).  The graphic in its full glory is here; it's much too large to be reproduced, and I have clipped off the bottom half.

Much to the designer's credit, he extracted data of interest, rather than trying to cram everything onto the page.  In particular, he was most interested in the distribution of deaths among different age groups, the types of deaths (suicides, homicides) and the identities of the deceased (race, gender).

Just like the election fraud graphic, such rich data lend themselves to multiple levels of aggregation.  Here, the designer focuses on the most detailed level, making it easiest to see facts like "among the 18-25 age group, there were 6 black men murdered per day".

However, it takes much more attention to notice higher-level facts like "homicides per day are relatively flat across age groups while suicides heavily skew toward 40+".

In the junkart version, I decided to emphasize the more aggregated data, showing the number of deaths of each type across age groups.  The detailed break-down by race and gender is shoved into parentheses, as it can be skipped by less serious readers.

The reader who discovers the homicide/suicide pattern described above may surmise that homicide gunfire deaths are more "random" while suicides, being premeditated, may affect older people disproportionately.  More research would be needed to confirm this and other suspicions.

Source: "An Accounting of Daily Gun Deaths", New York Times, April 21 2007.

 

Apr 20, 2007

Embedding logic

Bernard L. (from France) submitted this bubble chart for consideration.  It accompanied an NYT article claiming the absence of evidence of election fraud.  (Of course, as is well-known, absence of evidence is not the same as evidence of absence.  Here, I'm purely interested in data presentation.)

As a seasoned consultant, Bernard asked if a Marimekko chart would be superior.

This is one ambitious chart.  Ignoring the bubbles (which are more nuisance than anything), we are asked to interpret data at three different levels of aggregation in one go.

First, there were 95 cases classified into five indictment types.  Second, these cases resulted in either convictions or acquittals/dismissals.  Third, among the cases ending in convictions (the highlighted area), we were shown the occupations of those convicted.

By flattening three levels into one table, some key information is obscured.  For example, how many cases resulted in conviction?  The reader has to compute either 95-25 or 26+31+10+3.  What percent of civil rights violation convictions were committed by party/campaign workers?  It's not 2/3 = 67% (bottom row) but rather 2/2 = 100%.
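
The arithmetic the reader is forced to do can be spelled out.  The Python sketch below uses only the figures quoted above; the variable names are mine, and I am assuming the 3 in the bottom row is the number of civil rights cases.

```python
# Figures quoted in the post.
total_cases = 95
acquitted_or_dismissed = 25
conviction_counts = [26, 31, 10, 3]  # the four counts the reader must add up

convictions = total_cases - acquitted_or_dismissed
assert convictions == sum(conviction_counts)  # both routes give the same total
print("convictions:", convictions)

# Civil rights violations (bottom row): 2 convictions, both of party/campaign
# workers.  Assumption: the row total of 3 counts all cases of this type.
civil_rights_cases = 3
civil_rights_convictions = 2
by_campaign_workers = 2

wrong = by_campaign_workers / civil_rights_cases        # divides by all cases
right = by_campaign_workers / civil_rights_convictions  # divides by convictions only
print(f"wrong denominator: {wrong:.0%}, right denominator: {right:.0%}")
```

The point is the choice of denominator: a share "of convictions" must be divided by convictions, a number the flattened table never prints.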

The following junkart brings out the logic that is embedded in the complicated bubble-table.  While there is a lot on the page, the text labels plus the flow directions allow readers to absorb the data one level at a time.

I have not attempted the Marimekko as I am not a fan of such charts.  You're welcome to try.

Source: "In 5-Year Effort, Scant Evidence of Voter Fraud", New York Times, April 2007.

PS. I will be working through the backlog of reader submissions.  Thanks for your patience.  Keep them coming!

 

Remark (Apr 25 2007): Thanks to readers for keeping me honest (see comments below).  The conviction rates shown previously were indeed the inverse.  I have now fixed them.
