Apr 12, 2007

Peripherals 2

In terms of interactive charting, Google Finance did much more than hide the legend.  In their main stock price chart, they used a number of neat features.

Google_ahm1

This chart effectively conveys a huge amount of information in a small space.  The bottom strip which shows relative prices for the past two years provides context to interpret the five-day movement shown in the main chart area.  I prefer to see a scale on the bottom strip as well. 

The sliding scrollbar can be dragged to show historical data.  Besides, the width of the window shown in the main area can be controlled.  For instance:

Google_ahm2

Without any effort, we are now looking at a 3-month chart for Q2 2006.  Notice the summary statistic on the top right corner also morphed.  The axis scale changed, and it never did start from zero to begin with.  (This shortcoming is alleviated by the profile chart in the bottom strip.)

Further, by placing the cursor in the chart area, we can highlight a particular day: a dot appeared on the price curve, the volume on that day was highlighted, and the text on the top right switched.  That text is what we typically place inside the chart area as a "data label".  The effect of moving it to the corner is similar to hiding the legend: it makes the graph more legible and provides space for longer descriptions.  As we move the cursor from left to right, the graph dynamically adapts.  Marvellous!

Google_ahm3

It may not be obvious the amount of data processing that has to take place to implement these sorts of features. I don't have space to address the data issue but maybe some of our readers can comment on it. 

Mar 27, 2007

Illusory disparity

The WSJ published a chart with the cheeky title of "Rich Get Richer" (reminiscent of the Economist).  The underlying data concerned one-, three- and ten-year returns for the buyout fund category.  For each return class, the overall mean and the means for the top and bottom 25% funds were depicted.

I won't go into the relevance of the title as I simply could not figure out how it connected with the data.  The following shows the original chart side by side with the junkart version.

Redo_richgetricher

Improvements include:

  • Lines show the comparisons with a minimum of fuss compared with colored bars
  • The overall mean return is placed in the middle of each line segment where it belongs, instead of being the first column
  • The axis label, "annualized return", tells readers what is the performance measure
  • Adding the word "funds" to "top quartile" and "bottom quartile" removes the possible confusion that those represent individual returns of the funds ranked at 25th and 75th percentiles, rather than the average returns of the bottom 25% and top 25% of funds
  • The linear construct paints the correct picture that individual fund returns fall into a continuum

(Thanks to my students for some of these points.)

Reference: Wall Street Journal, Mar 3-4 2007.

Mar 21, 2007

Dot com bubbles

Web_dotcombubbles Thanks to Dustin J for the pointer as well as the title of this post.  Dotcom bubbles is the most appropriate name for this overblown chart (featured as the "chart of the day" here).

The chart has no title or axis labels so only the diligent reader will figure out that the data consist of acquisition value of several high-profile Internet companies in the past three years.

There are less data than it seems.  Both the heights and the areas of the bubbles indicate the same thing, the deal values.  If we are supposed to see a trend, we are not finding it.

Most of these deals are not directly comparable anyway.  Webex and Ironport are infrastructure type companies with real business models.  Skype is a phone service.  Ask Jeeves is not a leader in its own space. Myspace and YouTube are traffic sites.

Reference: "Chart of the Day: Web deals", Valleywag, Mar 15 2007.

Feb 25, 2007

Going out on a limb

Earlier in the month, Prof. Gelman linked to Brandon's fascinating analysis of on-line weather forecasting accuracy.  I have done some additional analysis of the data and the result can be visualized as follows.

Redoonlineweather


I'll concentrate my comments on three observations:

  • CNN was the clear winner in forecasting accuracy during this period based on two criteria: its median error in forecasting daily lows, and its median error in forecasting daily highs.  Moreover, both the median errors were zero, which gives us confidence about its accuracy.  The Weather Channel (TWC) and Intellicast (INT) were not far behind.
  • The ability to forecast highs was better across the board than that of forecasting lows (except BBC).  I am not sure why this should be so.
  • Overall, our weather forecasters were much too risk-averse.  Notice that the errors were heavily biased in the lower left quadrant.  A negative error on low temperatures means predicted low is higher than actual low; a negative error on high temperatures means predicted high is lower than actual high.  Taking these together, we observe that the range of actual temperatures have generally been larger than the range of predicted temperatures!  No one was willing to go out on a limb, so to speak, to forecast extremes.

Actually, I believe this inability or unwillingness to forecast extreme values is endemic to all forecasting methodologies.

Before closing, I mention that the graph was based on a subset of Brandon's data.  I only considered same-day forecasts, did not consider Unisys (because they didn't provide forecasts for lows), and also noted that there might be bias since there were breaks in the time series.  Also, I retained the sign information and didn't take absolute values as Brandon did.

Jan 24, 2007

Convenience charting

Statisticians have long riled against "convenience sampling", that is, the practice of selecting samples based on what's easily available, not at random.  Say picking your friends.

Wpost_childmortality Dustin J sent in this example of what can only be called "convenience charting".  Dustin said he had no clue what this chart is saying, and I am not surprised. 

The chart plots a statistical object known as the "survival function".  It is likely that "survival analysis" was done, after which the chart creator  picked up the resulting statistical object and dumped it onto this "convenience chart".

If we take the top line on the "child survival" graph, it shows the probability of one child surviving up to a certain age, if the child belonged to a family with 1-3 kids.  The chance is about 92.5% that the child will survive through age 2, and 88% that the child will survive through age 18.  The difference between those percentages is due to the chance that the child may die between ages 2 and 18.

A slight transformation of the data will make this point much clearer.  What is the probability of a child dying by a certain age?  Using the example, a child has 12% chance to die by age 18, and 7.5% chance of dying between ages 0-2.

Redochildmortality The junkart chart depicts this probability.  (I reverse-engineered the data which explains why the distances between the line segments look strange.)

What this chart doesn't address is how we are to interpret the probability of "a child dying" in a family with more than one child.  Is it a random child dying?  At least one child dying?  Exactly one child dying (the other X-1 surviving)? 

The original chart also committed a number of standard errors.  The child survival function represent probabilities, not percentages.  The third category should be 8-11 kids, not 7-11.  If we are picky, then we would also like to see "confidence intervals" because there must have been many fewer families in the 12+ sample than the 1-3 sample.  In the second chart (which I don't have space to discuss), some data labels are missing, which indicates a presumption that all readers have seen the first chart.

Reference:  "Child, Parents Drive Each Other to Early Graves", Washington Post, Jan 14, 2007. 

Jan 17, 2007

Losing count of Doomsday

The Doomsday Clock is making the news today: because of the  growing nuclear threat and continued denial of global warming, scientists say we are "five minutes from Doomsday".

Nyt_doomsdayclock This graph traces the movement of the clock's hand over the last few decades.  (I think it appeared on the New York Times website but I cannot find it now.)

The little tickmarks are superfluous, and the thin white borders between red columns serve only to make us dizzy.
As shown below, a line chart is much easier on the eyes.







Redo_doomsday Now, a question for the scientists: Why the clock analogy?  Does it reflect a kind of fatalism that we can never be more than 60 minutes away from Armageddon?  How many minutes were we from Doomsday two hours ago?

Dec 01, 2006

Smoking-Screening

Smokeathome2

Behind the smokescreen lies the informative conclusion: among households with smokers, about 40% smoke in residence all the time while about half never smoke in residence.

This graphic, unfortunately chosen, contains many distractions from the main message, including:

  • the liberal sprinkling of colors
  • the inclusion of data for 1, 2, 3, 4, 5, 6 days, almost all of which were effectively zero
  • the redundant vertical scale, as all the data already appeared on the chart itself
  • the comparison of smokers to "total sample" (rather than non-smokers)
     

The last point merits special attention.  The total sample contains households with smokers as well as households without smokers. Any data from the total sample is a weighted average of these two types of households.  It is better to directly compare the two household types than to indirectly compare one type to the overall.

Further, households without smokers should be extremely likely to have no smoking in residence all week. 
And if most households have no smokers (76% of this sample), then the statistics of the total sample will mimic those of no-smoker households. That is to say, the total sample statistics do not add much to the analysis.  Our junkart version below corrects for this as well as other things.

Redo_smokeathomeOne of the key functions of a graph is data reduction, i.e. to aggregate data in such a way as to expose the information contained within.  Typically, a graph that uses aggregated data is clearer and stronger than one that plots every piece of data.  In this example, by combining 1-6 days into a single category ("smokes in residence part of the week"), we have a graph that is much more readable.

I want to thank Dr. Mike Rabinoff for inspiring me to look up these second-hand smoking statistics.  Mike recently published a book called "Ending the Tobacco Holocaust", which tells you more than you want to know about the tobacco industry.


Reference: "Second Hand Smoke Survey: Final Report", Madison Department of Public Health, Dec 2003.

Nov 28, 2006

Dropped, just like that

Quakeradsm_1 Frank W. sent in a timely reminder of the start-at-zero rule.  This ad from Quaker Oats pitches the impossible: the smart consumer will never believe that cholesterol levels can be dropped just like that!  According to Frank's measurement, the column heights plunged 77% from Week 1 to Week 4 in this chart.

In fact, if the vertical axis had started from 0, then the drop would more appropriately appear to be 5%.  Now, even that would have been a miracle, in my opinion.

Thus, I would like to know what is a "point" of cholesterol, and what do they mean by "representative" drop.  I suppose they are asking me to call that number.

My previous posts (with commentary from readers) about starting at zero can be found here and here.

Sep 12, 2006

Working with lines

Here's how a great idea can be made better.

Nyt_pharmacies

The unifying axis on the right hand side, described as "comparable percentage-change scales", is a great concept.  The data being plotted are the cumulative percent return for each stock from the start of 2006 to the day of publication.

Redo1apharmacies_1If the three lines are superimposed, we can see the relative performance throughout the year.  Within these three stocks, Walgreens has clearly underperformed until recently.  Also, plotting weekly rather than daily returns reduces clutter.  The only grid-line of importance is the 0% line, which is what is left.

In addition, the three other axes, depicting actual prices, are redundant; removing them significantly enhances readability. 

Some will insist that actual prices must be shown; the following includes key bits of data in a subtle way.

Redo2_pharmacy





Reference: "Drugstores are Looking More Like a Growth Story", New York Times, Sept 10 2006.

Aug 07, 2006

Illusion or junk? 2

Bondchart_1Previously we saw that the appearance of stacked area charts changes with how variables are ordered.  This is a serious deficiency.

Lets return to the bond market chart.

What is the relative prevalence of each type of debt over time?  In the original chart, this information is buried and can be extracted only painstakingly.

Redo_bonddataThe jun
kart version brings out this insight without much fuss.  As a percentage of total bond market debt, US Treasuries has dropped by half over the last two decades, much of it happening in the late 1990s.  Meanwhile, mortgage debt more than doubled during the same period, much of it occurring during 1985-1993.  The current distribution is also more balanced than in the last 20 years, as can be seen from the narrow spread.

A few design features are worth noting.  The vertical axis is given on both sides of the chart.  Limited colors are introduced to help readers distinguish the various lines.  Light vertical gridlines are provided to allow analysis during each 5-year period.  Non-essential tick labels are removed from vertical axes.

In fact, this is again a variant of the Bumps chart.  For more, see here, here and here.



Reference: Data from Bondmarkets.com (via Mahalanobis)

Mentions


  • My Amazon.com Wish List

  • Yahoo! Picks

Search Junk Charts


  • Custom Search

Residues

July 2008

Sun Mon Tue Wed Thu Fri Sat
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31