Feb 10, 2008

Ordering and grouping

The Times reported that January retail sales generally disappointed, and consumers showed a preference for discount retailers over department stores.

Nyt_retailjan


Redo_retailjan

Taking the bar chart on the right, re-ordering by change in same-store sales, and grouping companies by type of retailer, we can present the data to match the text more closely.  The divergent performance between discount retailers and department stores is readily visible.












Reference: "Weak January dashed retailers' gift-card hopes", Feb 8 2008.

 

Dec 02, 2007

Live dynamic graphics

In the second interesting item of the week, I return to the fabulous Google Finance chart, which shows the distribution of stock market returns by sector.  I wrote about it twice (here and here).  In the original post, I saluted the engineers for figuring out the formidable technical issues of turning a live dynamic data stream into a live dynamic graphic but didn't go into details.  (Trust me.)

Goog_oops The other night, this chart popped up on my browser.

Oops.

If someone kept a tally of such mishaps, my guess is that they would show up 1-5% of the time.

The triple challenge of generating this graphic is the volume of data that needs to be processed, the velocity at which it changes, and the flicker of time from input to output, probably not more than a few minutes. The analysis and charting must be maintained continuously during market hours.  For any such project, the thing to manage is the error rate, and one should be totally thrilled if it's in the range Google engineers have achieved.

Nov 30, 2007

Digging deeper

Two items from other places caught my eye this week as they directly relate to some things we discussed on this blog.

First, I second Andrew's suggestion of a recent NYT article for teaching the concept of margin of error, or how to read political poll coverage intelligently.  Towards the end of this piece is a small gem:

Some pundits began by saying the horse race numbers were close but then tried to marshal evidence that they were not. On ABC's own Web site, Chris Cillizza, wrote: "Among women in the Post poll, Obama actually leads Clinton 32 percent to 31 percent among women. Voters 45 years of age or older are similarly divided, choosing Clinton by a 27 percent to 26 percent margin over Obama. Ditto for those who earn $50,000 or less a year; 29 percent for Clinton, 29 percent for Obama."

Mr. Cillizza failed to mention that if the margin of sampling error is plus or minus five percentage points for all of the likely Democratic caucus goers, then it is even higher for subgroups like women.

In a recent post, I called this the "oft-used device of subgroup support of a hypothesis".  This example illustrates the fallacy more clearly.  It's the "let's dig deeper since we haven't found the gold yet" phenomenon.  Such analysis suffers from two serious statistical problems.  The article deals with the sample size problem: a subgroup contains fewer respondents than the full sample, so its margin of error is necessarily larger; this means the bar for statistical significance has been raised, and rare is the case where such analysis could lead to any further insights.  (Of course, I am assuming the original poll was not designed to be analyzed at the subgroup level.)
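To see how quickly the margin grows, here is a minimal sketch using the textbook formula for a simple random sample. The sample sizes are made up for illustration (the article does not give them), and a real poll with weighting would have somewhat larger margins.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of sampling error for a proportion.

    Assumes a simple random sample of size n; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sizes: a full sample of 400 and a subgroup of half that.
full_sample = 400
subgroup = 200

moe_full = margin_of_error(full_sample)
moe_sub = margin_of_error(subgroup)

print(f"full sample: +/- {moe_full:.1%}")
print(f"subgroup:    +/- {moe_sub:.1%}")
```

Halving the sample size inflates the margin by a factor of the square root of two, so a one-point lead within a subgroup sits far inside the noise.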

The other issue -- more difficult to explain and omitted in the article -- is the multiple hypothesis problem.  It is well known that if we dig around long enough, we may get so dizzy that anything that glitters will look like gold.  In other words, false positives.  As with the sample size problem, the remedy is to raise the bar for statistical significance even higher.  In practice, this frequently wipes out the rationale for such analysis.
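A rough sketch of why the bar must rise: if each subgroup comparison is tested at the usual 5% level, the chance of at least one false positive climbs fast with the number of comparisons. The calculation below assumes the tests are independent, which subgroup cuts of one poll are not, so treat it as an illustration only. The Bonferroni correction shown is one common remedy, not something the article prescribes.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / m

alpha = 0.05
m = 10  # hypothetical number of subgroups examined

fwer = familywise_error_rate(alpha, m)
per_test = bonferroni_threshold(alpha, m)

print(f"chance of a spurious finding across {m} subgroups: {fwer:.0%}")
print(f"Bonferroni per-test level: {per_test}")
```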

I will address the other interesting item in a new post.

Nov 11, 2007

Red-lining by marriage

Bbc_family Tom W., a reader, noticed this map featured on a BBC News page about the UK family.

One can roughly make out the shape of Great Britain so this is some kind of cartogram.
The title announces that this cartogram concerns the "distribution of population". 

In a typical map like this, the redder reds would indicate higher densities of people.  Yet, the article tells us that the population is divided evenly into 85 squares, each containing "roughly half a million people over 18 years old".

Instead, we seem to have 500K widowed people next to 500K re-married people (most of whom prefer the coasts, by the way), etc.  Apparently, the Brits practise a form of red-lining based on marital status!

The S/M/W/D/R labels are also redundant and very distracting; and the white gridlines interfere with our ability to read the grey boundaries.

Source: "The UK family", BBC News.

Oct 17, 2007

Points of comparison

Econ_mortgage In light of the current housing crisis, arising from mortgage defaults, I pulled this graphic from a Jan 2007 opinion piece that plotted historical default rates of mortgages.  Notice the high degree of stretching on the vertical axis that exaggerates the volatility: essentially, the annual delinquency rate ranged from 1.75% to 2.65% during the last six years or so.  One might be forgiven for thinking that a 2% default rate is quite acceptable.

Nyt_mortgage_2 Compare the above chart to the pair that showed up in the NYT in Oct 2007 (see right).  The default rates here are in the 10-20% range, very alarming indeed.

The two graphics illustrate a key issue of "aggregation" in statistical analysis.  The first graphic is super-aggregated: all types of mortgages of all ages are put together to calculate each year's default rate.  The second graphic homes in on subprime mortgages only.

More importantly, the second graphic presents data in "vintages".  Each line represents loans originated during a particular year (a "vintage").  This establishes comparability.  On the first chart, each point in time represents the default rate of mortgages averaged over all ages (some loans may be only a few months old; others may be 15 years old).  Since the default rate is much higher for very young mortgages than for older mortgages, such averaging hides crucial information.
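A toy calculation shows how the averaging hides information. The loan counts below are invented, not taken from either chart, and the sketch only splits one snapshot by vintage; the NYT graphic goes further by lining vintages up at the same loan age.

```python
# Made-up loan book: vintage year -> (loans outstanding, loans in default).
book = {
    2000: (6000, 60),
    2005: (3000, 90),
    2007: (1000, 120),
}

def default_rate(loans, defaults):
    return defaults / loans

# One rate per vintage versus a single pooled rate.
by_vintage = {year: default_rate(*counts) for year, counts in book.items()}
total_loans = sum(loans for loans, _ in book.values())
total_defaults = sum(defaults for _, defaults in book.values())
aggregate = default_rate(total_loans, total_defaults)

print(f"aggregate: {aggregate:.1%}")
for year, rate in sorted(by_vintage.items()):
    print(f"{year} vintage: {rate:.1%}")
```

In this invented book the pooled rate looks tame because the large stock of old loans swamps the small, troubled new vintage.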

Overall, the NYT graphic very effectively conveys the alarming trend of new mortgages performing much worse, especially those originated in 2007.

Redo_mortgage It can benefit from two slight edits: adding a few more years, and using vertical lines (the most critical comparisons are default rates for loans of a given age!)  Something like this...


Sources: "As Defaults Rise, Washington Worries", New York Times, Oct 16 2007; "Mounting Mortgage Credit Problems", economy.com, Jan 23 2007.

Aug 12, 2007

Non-elites

From Mikhail Simkin comes some intriguing analysis of "experts"; in this line of research, experts are compared to the "general public" and often "proved" to be shenanigans. Stock pickers don't do better than apes; economists don't do better than Big Macs; you get the idea.  In a new twist, Simkin puts twelve images of modern art on his website, and asks visitors to distinguish between those by grand masters and those "ridiculous fakes" produced by him apparently on a computer.

Since conventional wisdom says elite universities provide better education, Simkin attempted to find out if there is a difference between "elites" and "the crowd" in their ability to recognize modern art. (Elites, to him, meant the Ivy League and Oxbridge.)  The following pair of histograms clinched his point:

we see that there is not much difference between the elite and the crowd.

Simkin_fakeart


Since the shapes of the histograms are similar, one might be inclined to agree with the statement.  That is, until one notes the wildly different scales: only 143 of the 56,020 quiz-takers could be identified as "elites".

The shapes are clarified if we use a relative scale (percentages) rather than absolute scale.  Further, the difference is more easily seen when cumulative percentages are plotted.  In other words, we are interested in comparing the proportion of respondents who score X points or fewer out of 12.

Redo_fakeart

Two features are worth noting:

  • A gap opens up between 4 and 7: specifically, 40% of "non-elites" scored 7 points or below while only 25% of "elites" scored 7 points or below.
  • The curves criss-cross around 11 to 12: this shows that "non-elites" were more likely to have perfect scores (although this difference is small).  Perhaps museum directors don't have .edu addresses.
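The conversion from raw counts to cumulative percentages is simple enough to sketch. The counts below are invented (Simkin's actual histogram values are not reproduced here); the point is only that two groups of very different sizes land on the same 0-100 scale.

```python
from itertools import accumulate

def cumulative_percent(counts):
    """Percent of respondents scoring at or below each score.

    counts[i] is the number of respondents with score i.
    """
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]

# Hypothetical counts for scores 0..12; one group is 100 times larger.
small_group = [0, 0, 1, 2, 4, 6, 8, 10, 9, 6, 3, 1, 0]
large_group = [c * 100 for c in small_group]

small_curve = cumulative_percent(small_group)
large_curve = cumulative_percent(large_group)

for score, (a, b) in enumerate(zip(small_curve, large_curve)):
    print(f"score <= {score:2d}: {a:5.1f}%  {b:5.1f}%")
```

Because both curves are in percentages, identical distributions coincide exactly whatever the group sizes, and any gap that does appear reflects a real difference in scores.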

Notice that I plotted Elite vs Non-Elite rather than Elite vs All Respondents.  Using "All Respondents" seems innocuous, and in this case it makes no noticeable difference since Elites were a tiny proportion of the total.  But when the test group accounts for a significant proportion of the total, the value for "All Respondents" is influenced by that of the test group, which shrinks the apparent gap.  As a general rule, compare A to not A.
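A quick arithmetic sketch, with invented rates and shares, shows how much of the gap can vanish when A is compared to the total instead of to not A.

```python
def overall_rate(rate_a, rate_not_a, share_a):
    """Rate for all respondents as a weighted average of the two groups."""
    return share_a * rate_a + (1 - share_a) * rate_not_a

# Hypothetical numbers: group A is 40% of respondents.
rate_a, rate_not_a, share_a = 0.30, 0.20, 0.40

rate_all = overall_rate(rate_a, rate_not_a, share_a)
gap_vs_not_a = rate_a - rate_not_a
gap_vs_all = rate_a - rate_all

print(f"A vs not A: {gap_vs_not_a:.2f}")
print(f"A vs all:   {gap_vs_all:.2f}")
```

The gap against "all" is always the true gap multiplied by the share of not A, so the larger the test group, the more the comparison is diluted.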

Simkin's exercise raises many statistical issues of design, which we won't discuss here.

Source: "Properly Prescribed" (via RSS Significance)

Aug 08, 2007

On the bubble

Nyt_candmins A couple of you noticed this table of bubbles in the Times, and asked what I think of it.  Dustin J suggested that this could be considered a decent application of bubble charts.  I agree, with some reservations.

The data set is the best thing about this chart.  The riches that lie beneath!  Many questions can be addressed, including:

  • Which Presidential candidates are getting the most face time?
  • Are candidates seen equally often across the stations?
  • Are there differences between network and cable stations in terms of total face time?  In terms of individual face time?
  • Are there Democratic/Republican leanings by station?  by type of station?

The intrepid can even build a regression out of it.

The bubble chart contains answers to all those questions but nothing jumps out. Okay, it's easy to see the station that gives each candidate the most face time.  Anything else requires moderate to a lot of effort.  Here's the junkart version.


Redocandmins_2 The list of things done to the data is long:

  • Candidates are grouped together by party
  • Candidates within each party are arranged in order of decreasing maximum face time
  • Stations are arranged by increasing total face time; this order happens to retain the network vs cable divide
  • A heat map construct is used instead of bubbles: the legend is missing but there are four shades for each color: darkest = top 10%; medium = 50th to 90th percentile; light = below the 50th percentile, excluding zeroes; white = no face time.  In raw numbers, 90th percentile = 81 minutes, 50th percentile = 19 minutes.
  • The only data shown are the totals by candidate and totals by station.
  • On the right margin are little bar charts that show the distribution of network/cable for each candidate.
  • On the bottom margin are little column charts showing the distribution of party affiliation by station.
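The shading rule in the list above can be sketched as a small function. The 19 and 81 minute cut-offs come from the text; how a value exactly on a cut-off is treated is my assumption here, as are the shade names.

```python
def shade(minutes, p50=19, p90=81):
    """Map face time in minutes to one of four heat map shades.

    Assumption: values exactly at a cut-off fall in the darker bin.
    """
    if minutes == 0:
        return "white"
    if minutes >= p90:
        return "darkest"
    if minutes >= p50:
        return "medium"
    return "light"

# Hypothetical face times, in minutes.
for minutes in [0, 5, 19, 60, 81, 140]:
    print(f"{minutes:3d} min -> {shade(minutes)}")
```

Binning into a handful of shades throws away precision on purpose: the eye only needs to separate heavy, moderate, light and no coverage.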

A few observations follow:

  • Cable stations gave much more face time to the candidates in general.  Fox, no surprise, gives Republicans 85% of its time while all the others were roughly equal.
  • The more mainstream the candidate, the more balanced was the time spent on networks versus cable.  John McCain (R), Hillary Clinton (D) and John Edwards (D) had the highest proportion of network time.
  • More time is not necessarily good, since McCain was the clear winner but his campaign is struggling.

Source: "Tracking Face Time", New York Times, August 1, 2007.

Jul 12, 2007

More prevalent versus more likely

Aleks pointed to an interesting Business Week chart used to explain what people in different age groups are doing on-line.  This is a pretty chart that does an admirable job with a difficult data set.

Bw_onlinedata The key to this chart, unfortunately missing, is that the percentages must be read as vertical columns to make sense.  So the top left square says 34% of "Young Teens" who answered the survey said they create web pages on-line.  In addition, the total of each column can be much more than 100% because multiple responses were allowed.

Realizing the above, we should interpret the bottom (grey) row as saying: "Older boomers" and "seniors" are more likely to be "Inactives" than younger people.  A tempting interpretation is: "Inactives" are more likely to be "seniors" and "older boomers".  But this is wrong because the chart hides the age distribution.  While 70% of "Seniors" are inactive, "Seniors" may represent a small proportion of the population, and thus they may not account for a large proportion of "Inactives".  This is the difference between prevalence and incidence rate.  (Another way to grasp this is to add the percentages across a row and try and fail to understand what the row sum could mean.)
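To make the distinction concrete, here is a sketch that flips the conditional. The 70% inactive rate for seniors is from the chart; the population shares and the rate for everyone else are invented, since the chart hides the age distribution.

```python
# Hypothetical population shares; only the 70% figure is from the chart.
groups = {
    "seniors": {"share": 0.10, "inactive_rate": 0.70},
    "everyone else": {"share": 0.90, "inactive_rate": 0.40},
}

# Overall proportion of inactives, weighting each group by its share.
p_inactive = sum(g["share"] * g["inactive_rate"] for g in groups.values())

# Of all inactives, what fraction are seniors?
seniors = groups["seniors"]
senior_share_of_inactives = seniors["share"] * seniors["inactive_rate"] / p_inactive

print(f"seniors who are inactive:  {seniors['inactive_rate']:.0%}")
print(f"inactives who are seniors: {senior_share_of_inactives:.0%}")
```

Under these assumed shares, most seniors are inactive yet only a small minority of inactives are seniors, which is exactly the reading the chart cannot support or refute.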

The construct of the square grids is less damaging than it seems.  In effect, the data has been rescaled by dividing by 10.  The reader is then forced to apply "rounding".  If you are someone who sees $19.95 as $19, then you'd round down the partial rows.  If you see $19.95 as $20, you'd round up the partial rows.  So the designer has pushed you to think in terms of whole numbers between 0 and 10, in other words, in units of 10%, rather than units of 1% or, horror of horrors, 0.1% or at some other unrealistic precision.

Here's another example where the profile chart shines.  Because the percentages don't sum up to 100%, the other alternatives like stacked bar charts and Marimekko/mosaic charts don't work.  (Prior discussion of this issue here.)

Redo_onlinedata

This version gives a column view of the data, the lines linking percentages of each age group performing on-line activities.  The profiles nicely cluster into three groups: the younger people are more likely to say they are "joiners", "spectators" or "creators" but less likely to be "inactives".  We also see that the likelihood of being "Collectors" has little to do with age.

Source: "Inside Innovation -- In Data", Business Week, June 11 2007.


May 22, 2007

Visualizing web statistics

Tim inquired about:

how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).

Let's work with this sample data set.  Web1 I ordered it from highest to lowest click rate, which is the primary metric of interest.  The number of page views is of interest too as sometimes rarely-visited pages may have high click rates.

At this point, it's important to know the context.  Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?

Web_lift The first construct I tried is the "lift curve" often used in marketing.  It's the same thing as the Lorenz curve used by demographers but interpreted differently.  Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the pages and 59% of the clicks; etc.  The relative click rates are immediately clear from the steepness of the line segments.  The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.

Web_scatter If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views.  The second construct is the "scatter plot" of % clicks versus % page views.  The steepness of the line through the origin helps us compare the click rates.  Bicycles is clearly inferior in generating clicks.

Both these constructs are highly efficient; adding new data does not expand the chart at all.

Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate).  This means that any data point above the diagonal has above-average click rate.
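Both constructs fall out of the same few lines of arithmetic. The sketch below uses invented page names and counts, not the sample data set above; it computes the points of the lift curve and the click rate index for each page.

```python
from itertools import accumulate

# Hypothetical data: page -> (page views, clicks).
pages = {"A": (1000, 100), "B": (3000, 150), "C": (6000, 60)}

total_views = sum(views for views, _ in pages.values())
total_clicks = sum(clicks for _, clicks in pages.values())
overall_rate = total_clicks / total_views

# Order pages from highest to lowest click rate, as in the text.
ranked = sorted(pages, key=lambda name: pages[name][1] / pages[name][0], reverse=True)

view_share = [pages[name][0] / total_views for name in ranked]
click_share = [pages[name][1] / total_clicks for name in ranked]

# Scatter plot slope: share of clicks over share of views,
# which equals the page's click rate divided by the overall click rate.
click_rate_index = {name: c / v for name, v, c in zip(ranked, view_share, click_share)}

# Lift curve: cumulative share of views against cumulative share of clicks.
lift_curve = list(zip(accumulate(view_share), accumulate(click_share)))

for name, point in zip(ranked, lift_curve):
    print(name, f"index={click_rate_index[name]:.2f}", f"cumulative={point[0]:.0%},{point[1]:.0%}")
```

An index above 1 puts a page above the diagonal on the scatter plot, and the same page shows up as a steeper-than-average segment on the lift curve.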

May 03, 2007

Less is more

Suparse Derek pointed me to the style.org site which also parses political speeches.  Their preferred graphic is not the tag cloud but a labeled bar chart.

From top to bottom, each bar represents a sentence; the length of each bar is the length of each sentence.  Further, the user can specify word pairs for comparison.  Here the red bars are sentences containing the word "freedom"; the blue bars, "security".

It's a good illustration of the "small multiples" principle in constructing comparative graphics.

However, the choice of dimensions is perplexing.  I'd be much more interested in the timing of mentions of those words, rather than which sentence they appeared in.  I also find the length of each sentence to be irrelevant.

Redo_suparse Here's one concept that brings out the point better.  It uses less space and voluntarily gives up some of the data (the sentence structure).
