Mar 28, 2008

Two books

Nathan from FlowingData has announced a competition to win Tufte's classic book on the visual representation of data.  There are still a few days left to participate.  While Tufte's more recent books have grown repetitive, he has still published one of the most accessible books on this topic.

I also had the pleasure of reading Naomi Robbins' Creating More Effective Graphs.  She adopts a cookbook format, providing hints on graphs in one, two and more dimensions, scales, visual clarity and so on.  Since she has already read Cleveland, Tufte, etc., she manages to put all that learning inside one cover.  The page design - with half of every page blank - is refreshingly easy on the eyes.  Examples are generously included.

Let's review her views on some of the topics we discuss frequently on Junk Charts:

Starting axis at zero: she thinks "all bar charts must include zero.  However, the answer is not as clear for line charts or other charts for which we judge positions along a common scale." (p.240)

Jittering: she does not provide a clear guideline but gives an example of a strip chart with jittered dots, commenting that "it gives a much better indication of the distributions than would a plot without jittering" (p.85), so I infer that she's generally in favor.

Parallel coordinates plot / profile plot: she provides an example of such a plot on p.141 and describes how to read such a plot.  Again, I infer she's in favor.

Jan 22, 2008

Football rankings 1.1

Long-time reader Jon sent in a different view of the QB data.  He uses a nifty tool in Excel to generate a parallel coordinates plot (also called profile plot) on which pairs of QBs can be highlighted and compared.

Jon_garrard

This chart exploits the foreground/background concept very nicely.  One way to deal with abundant data is to highlight only those bits that matter to the question at hand and relegate the rest to the background.

The gray lines in the background provide context without grabbing undue attention. He also converted every metric to a scale between 0 and 1, similar to what we did with our version.
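The post does not say exactly how the metrics were converted to a 0-to-1 scale.  One common choice is min-max rescaling, sketched below in Python; the function name and the sample numbers are hypothetical, chosen only for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread to rescale: place every value at the midpoint.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one metric for four quarterbacks (not from the post).
metric = [56.1, 60.2, 64.9, 68.9]
scaled = min_max_scale(metric)
print([round(s, 3) for s in scaled])
```

Applying the same function to each metric separately puts them all on a common vertical scale, which is what lets one profile line cross metrics measured in different units.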

The Eli Manning / Philip Rivers comparison shows that both QBs were below average on most of these metrics, with Manning near the bottom of each.




Jul 26, 2007

Noisy subways

This NYC subway report is impossible to read.
Nyt_subwayreport

However, it is very difficult to find a good way to show the information.  In fact, the data contain very little of it.  Curiously, the ratings are so dispersed that each line is graded high on some categories and low on others.  Here's one view of it:

Redo_subwayreport

I have grouped the subway lines together (A/C/E, 4/5/6, etc.).  The metrics are plotted left to right in the same order as in the original.  Is it all noise and no signal?

(I just realized the vertical axis is reversed: best ratings are at the bottom, worst ratings at the top.  Doesn't matter anyway since I can't see any patterns.)

Source: "No. 1 Train is Rated Highest by Commuter Advocates", New York Times, July 24 2007.

PS. Two contributions from readers.  Still looking for insight from this data...

Trains789fg5_2 Trainspotmatrix_2


Jul 12, 2007

More prevalent versus more likely

Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line.  This is a pretty chart that does an admirable job with a difficult data set.

Bw_onlinedata

The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense.  So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line.  In addition, the total of each column can be much more than 100% because multiple responses were allowed.

Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people.  A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers".  But this is wrong because the chart hides the age distribution.  While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives".  This is the difference between prevalence and incidence rate.  (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
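A small numeric sketch in Python may make the two readings concrete.  Only the 70% figure for "Seniors" comes from the chart; the other group names, the age distribution and the remaining rates are made up for illustration.

```python
# Share of respondents in each age group: HYPOTHETICAL, the chart does not show this.
population_share = {"Younger": 0.45, "Boomers": 0.45, "Seniors": 0.10}

# Proportion of each age group who are "Inactives".
# Only the Seniors figure (70%) is from the post; the others are hypothetical.
inactive_rate = {"Younger": 0.30, "Boomers": 0.40, "Seniors": 0.70}


def share_of_inactives(population_share, inactive_rate):
    """Turn 'rate of inactives within a group' into 'share of all inactives from that group'."""
    joint = {g: population_share[g] * inactive_rate[g] for g in population_share}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


result = share_of_inactives(population_share, inactive_rate)
for group, share in result.items():
    print(f"{group}: {share:.1%} of all Inactives")
```

Under these assumed numbers, "Seniors" have the highest inactive rate yet contribute the smallest share of "Inactives", simply because there are few of them.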

The construct of the square grids is less damaging than it seems.  In effect, the data has been rescaled by dividing by 10.  The reader is then forced to apply "rounding".  If you are someone who sees $19.95 as $19, then you'd round down the partial rows.  If you see $19.95 as $20, you'd round up the partial rows.  So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.
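The rounding described above can be written out as a short Python sketch.  The function name and the "down"/"up" labels are hypothetical; the 34% figure is the "Young Teens" value quoted earlier.

```python
import math


def squares_read(pct, habit):
    """Number of whole rows (units of 10%) a reader takes away from a percentage."""
    tenths = pct / 10
    if habit == "down":   # the reader who sees $19.95 as $19
        return math.floor(tenths)
    if habit == "up":     # the reader who sees $19.95 as $20
        return math.ceil(tenths)
    raise ValueError("habit must be 'down' or 'up'")


print(squares_read(34, "down"), squares_read(34, "up"))
```

Either way the reader ends up with a whole number between 0 and 10, which is the level of precision the grid is pushing toward.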

Here's another example where the profile chart shines.  Because the percentages don't sum up to 100%, the other alternatives like stacked bar charts and "Marimekko"/mosaic charts don't work.  (Prior discussion of this issue here.)

Redo_onlinedata

This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities.  The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives".  We also see that the likelihood of being "Collectors" has little to do with age.

Source: "Inside Innovation -- In Data", Business Week, June 11 2007.


Jun 29, 2007

Tricks of the trade 2

In a previous post, I explained the value of sketching when creating graphs. Today, I will share a few other graphs that plot the same data as we discussed the other day, regarding the proportion of time spent on developing different modules of software.

A stacked column chart, suggested by John J., would look like this:
Redo_wufoo3

Compared to the profile chart, this chart has some weaknesses:

  • it's difficult to read off the proportions for middle blocks like Blinksale-Billing;
  • because the middle blocks "float", it is impossible to compare them properly;
  • it requires as many colors as there are variables.

These problems get worse as the data scale: more difficult to read off the data; more colors needed.

The Marimekko, suggested by Bernard L., is the same chart as above except that the widths of the columns are made proportional to the relative number of lines of code.  However, because these four companies do not make up the entire universe, proportional widths make little sense here.

The profile chart can be drawn up in two ways:
Redo_wufoo2
These charts typically display results of cluster analysis.  This is a statistical data mining technique that discovers groups of like objects within a large data set.  Oftentimes, the computer will only tell you these 15 belong to Cluster 1, those 22 form Cluster 2, etc.

To figure out why the 15 belong together, the analyst needs to plot the explanatory variables against cluster index.  Now, think of WuFoo, FeedBurner, etc. as clusters, and the proportion of code given to Application, etc. as variables.
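As a rough sketch of that step, the Python below averages each explanatory variable within each cluster; those per-cluster averages are what a profile chart would plot.  The function name and the data are hypothetical, and a real analysis might summarize with medians or other statistics instead of means.

```python
from collections import defaultdict


def cluster_profiles(rows, labels):
    """Return {cluster label: [mean of each variable over that cluster's rows]}."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


# HYPOTHETICAL data: four objects, two explanatory variables, two clusters.
rows = [[1.0, 10.0], [3.0, 14.0], [8.0, 2.0], [10.0, 4.0]]
labels = [1, 1, 2, 2]

profiles = cluster_profiles(rows, labels)
for label, means in profiles.items():
    print(f"Cluster {label}: {means}")
```

Each cluster then becomes one position on the horizontal axis and each variable becomes one line, which is the layout used for the companies and code categories here.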

While the line segments don't signify anything real, they trace out the precise paths our eyes would take when reading the stacked column chart above!  Remember we wanted to compare the number of lines given to each function across companies.  If shown the column chart, my eyes would flip across the top of the Application (blue) blocks from WuFoo to regonline.  This path is exactly the brown line on our first profile chart.

The numbers for Marketing, Support and Billing are much easier to read too as they all start from zero for each company.

The right chart is another possibility but for this particular situation, I prefer the left one.

Finally, I am less familiar with the "parallel coordinates plot" that Derek talked about.  I believe it is a variant of the profile chart.

Jun 26, 2007

Dizzy display

Wufoo

Xan G. tells us that these "inconsistent pie charts ... make [his] head hurt".  The dizzying array of colors is unfortunate, especially when "Application" gets a medium blue in three of four pies but an orange-red in one of them.  Just like the baby names charts, it's important to keep the background constant when constructing small multiples.

We cite from the horse's mouth:

The goal of this section was to uncover any [software development] task that might be overlooked [by these startup companies]. When writing a software product, the tendency is to focus 100% on the application. Items like support, marketing, and especially billing never cross your mind.

The junkart version below is designed to bring out this one message: that Blinksale has distinguished itself from the rest by having spent more time developing code for purposes other than the application itself.

Redo_wufoo

I removed the raw counts of lines of code and focused only on the relative proportions.  The former does nothing to argue the author's case.

The pie charts fail our self-sufficiency test.  The reader must rely on the data table and data labels to understand the chart.  If those are removed, the key message is obscured.

Source: "Web App Autopsy", ParticleTree, June 2007.

Mar 17, 2007

Picking up the right file

The Institutional Investor advises its readers:

Going public may just be the most important -- and nerve racking -- decision any company will make.  Managing and pricing an IPO is tricky, so picking the right underwriter is crucial.  Bankers often boast of their league table prowess to win mandates, but quantity does not necessarily mean quality.

By quantity, they meant the amount of underwriting fees (revenues) earned; and by quality, the average stock performance of the newly-public companies, as of Feb 16, 2007.

Ten banks were compared on the two Qs using this chart, which is best described as the "file folder chart".

Iporanks

Amusingly, its creator sized the height of each file according to the quality metric, which is the return % listed at the top right corner of each file.  The files were sorted by decreasing quality.  Since each file is a parallelogram, and assuming the files share the same base width, its area is proportional to quality.

However, the files overlap, preventing us from comparing the areas of the files.  Besides, the point made in the article about the importance of both Qs is lost since this chart stressed quality over quantity.  Quantity showed up as a low dot on the tallest file and a high dot on the shortest file.

Redo_iporanks

The junkart version restores the balance.  The blue lines highlight several banks that scored high on one metric but low on the other.  The construct is a profile chart, with only two variables.

Curious readers may wonder if there were only 10 banks in the IPO underwriting market.  Far from it.  The chart designer introduced a selection bias because banks were included based on Quantity, and only then rated on Quality.  This means there may well be a boutique firm with small revenues but higher quality than any of the 10 in the plot.

Furthermore, much useful information is missing, including the dispersion of returns, the number of deals, etc.

Reference: "Grading the IPO Underwriters", Institutional Investor, March 2007.

Sep 29, 2006

Where are the crimes?

Msn_crime

The author of this data table and the readers are asking the same question, "Where are the crimes?", but for different reasons.

While the author wanted to convey regional differences in crime growth, as readers, we are not sure which part of the table to look at; every cell is given equal "weight".

Redo_crime

Judging from this "profile plot", we can conclude:

  • the Mid-West (blue line) experienced a crime spurt that is very much worse than the national average (dots) in all categories except forcible rapes and murder
  • the West (red line), in general, had crime increases less severe than the national average
  • that said, the regional profiles are relatively similar, showing few meaningful regional differences (compared to other profile plots I've seen)

Reference: "Communities Grapple With Rise in Violence", MSNBC.com
Thanks to Maya for sending in the link.
