Jul 29, 2007

Transgender trends

One of the many gratifications of blogging is to connect with others who have similar interests; so it has been fantastic to receive user submissions (though admittedly I don't check my inbox frequently enough).  The thoughtfulness of these nominations continues to impress me.

Evan sent in 254 charts he created after looking at the post on baby names.  An example is shown on the right.

He is particularly interested in the question of names that are given to both males and females. 

For example, the bottom chart shows that Jordan is primarily a male name, and saw a period of growth followed by decline, although the decline has been more severe on the male side than the female side. 

It's a nice touch to label the most recent year.  I'd also label the values for the most recent year on the axes.

Evan also offers the following solution to the scaling problem we identified in the original WSJ chart:

My solution was just to put two charts on each chart. One at a fixed scale for every chart to give a sense of size and one at a variable scale to better show the shape of the plot.

In other words, for less popular names, the top chart would look much more compressed.

There are many more charts to sift through on his site.  Evan welcomes suggestions.

Jul 18, 2007

Mid-week entertainment: dogma

This chart from a Wall Street Journal editorial has been making the rounds lately, being ridiculed left and right.  A number of you have been leaving comments here so I'm putting it front and center as our light entertainment for the week.

The chart is being used to justify an economic concept called the "Laffer Curve", which claims that lowering tax rates can increase total tax receipts (for example, because fewer people will cheat the government).  As far as I know, it is dogma, and has never been proven empirically.

I also agree with Prof. Gelman's skepticism about using countries as experimental units to inform domestic policy.

Fire away!



Further reading:

Junk Chart readers

Economist's View
Tufte blog
Gelman blog


And more:

Cosmic Variance
Brad DeLong

Jun 15, 2007

The Immigrants' Path

A recent Wall Street Journal editorial used this chart (via the National Foundation for American Policy) to claim success for the "Bracero" guest worker program, initiated in 1942.  Their analysis:

... illegal border crossings subsequently plummeted.  Between 1953 and 1959, they fell by some 95%.  In 1960, mainly in response to complaints from labor unions, the program was scaled back and eventually phased out.

Long-time readers may recall Friedman's Crossover Law of Petropolitics, where the opportune criss-crossing of lines plotted along double axes was taken as proof of causality.  Friedman's Law lurked here, right in the 1953-1959 range.

 

The NFAP went one better: in their original version, they blew up the 1953-1959 period to show us the criss-crossing lines!

We see trouble right from the start.  The "subsequent" effect that proved the case occurred in 1953, over 10 years after the program started. During that first decade, the number of apprehensions rose 4388%, in spite of the guest worker program.

A scatter plot (below left) now shows the lack of any meaningful relationship between these two variables.  While high admissions appeared together with low apprehensions, any level of admissions had historically been paired with low apprehensions.


On the right, I connected the dots in chronological order.  Any claim of a negative relationship between admissions and apprehensions has been debunked.  From 1942 on (as we trace the line clockwise from lower left), first the nation experienced stepwise increasing admissions coupled with stepwise increasing apprehensions; then it witnessed sharply dropping apprehensions with relatively stable admissions; and finally it saw plummeting admissions while apprehensions remained low.  Three separate episodes, three distinct patterns.  There was no association, let alone causation.

Source: "Immigration Plan B", Wall Street Journal, June 13 2007.

May 22, 2007

Visualizing web statistics

Tim inquired about:

how to create an elegant graph for Web visitor traffic statistics that shows both how many views a page gets and then how many people click that page to go further ("conversion rate"). Part of the problem is that conversion rates vary from, say, .3% to 50% (a wide range).

Let's work with this sample data set.  I ordered it from highest to lowest click rate, which is the primary metric of interest.  The number of page views is of interest too, as rarely-visited pages may sometimes have high click rates.

At this point, it's important to know the context.  Specifically, who controls the allocation of pages? Did the data come from a randomized experiment? Or did they get a self-selected sample (e.g. web surfers deciding which section of the site to visit)?

The first construct I tried is the "lift curve" often used in marketing.  It's the same thing as the Lorenz curve used by demographers but interpreted differently.  Here, we see that Guitar pages accounted for 26% of the page views but 37% of the clicks; House pages accounted for an incremental 44% of the page views and 59% of the clicks; etc.  The relative click rates are immediately clear from the steepness of the line segments.  The lift curve is appropriate for the self-selected case, in which we can take the allocation of page views as fixed.
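For readers who want to try this, here is a minimal sketch of how the points on such a lift curve could be computed.  The section names come from the example, but the counts are made up for illustration (they are not the data in the table), and `lift_curve` is a hypothetical helper name.

```python
def lift_curve(sections):
    """Return section order and cumulative (% page views, % clicks) points.

    `sections` maps a section name to a (page_views, clicks) pair.
    Sections are accumulated from highest to lowest click rate.
    """
    total_views = sum(views for views, _ in sections.values())
    total_clicks = sum(clicks for _, clicks in sections.values())
    # Click rate = clicks / page views; sort highest first.
    ordered = sorted(
        sections.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True
    )
    points = [(0.0, 0.0)]
    cum_views = cum_clicks = 0
    for _, (views, clicks) in ordered:
        cum_views += views
        cum_clicks += clicks
        points.append(
            (100 * cum_views / total_views, 100 * cum_clicks / total_clicks)
        )
    return [name for name, _ in ordered], points


# Hypothetical counts, NOT the data from the post.
sample = {"Guitar": (1000, 100), "House": (2000, 120), "Bicycles": (1000, 20)}
order, points = lift_curve(sample)
for name, (pct_views, pct_clicks) in zip(order, points[1:]):
    print(f"{name}: {pct_views:.0f}% of views, {pct_clicks:.0f}% of clicks (cumulative)")
```

Each line segment joins two consecutive points, so a steeper segment means a section with a higher click rate.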

If the allocation of page views is a decision to be made, then it doesn't make much sense to accumulate page views.  The second construct is the "scatter plot" of % clicks versus % page views.  The steepness of the line through the origin helps us compare the click rates.  Bicycles is clearly inferior in generating clicks.

Both these constructs are highly efficient; adding new data does not expand the chart at all.

Keen readers will observe that the slope of the line is not the click rate but rather a click rate index (relative to the overall click rate).  This means that any data point above the diagonal has above-average click rate.
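The reason is simple arithmetic: dividing a section's share of clicks by its share of page views gives its click rate divided by the overall click rate.  A small sketch, again with made-up counts and a hypothetical function name:

```python
def click_rate_index(views, clicks, total_views, total_clicks):
    """Slope of the line from the origin to (% page views, % clicks)."""
    share_of_clicks = clicks / total_clicks
    share_of_views = views / total_views
    return share_of_clicks / share_of_views


# Hypothetical counts, NOT the data from the post.
sample = {"Guitar": (1000, 100), "House": (2000, 120), "Bicycles": (1000, 20)}
total_views = sum(v for v, _ in sample.values())
total_clicks = sum(c for _, c in sample.values())
overall_rate = total_clicks / total_views

indices = {}
for name, (views, clicks) in sample.items():
    indices[name] = click_rate_index(views, clicks, total_views, total_clicks)
    # The same number, computed as click rate relative to the overall rate.
    relative_rate = (clicks / views) / overall_rate
    print(f"{name}: index {indices[name]:.2f}, relative click rate {relative_rate:.2f}")
```

An index above 1 puts the point above the diagonal, and an index below 1 puts it below.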

Mar 01, 2007

Information gain and loss

The previous two posts indicated that CNN, TWC and Intellicast had the best on-line weather forecasting accuracy by looking at the median and mean error in predicting daily low and high temperatures over 41 days.  Is it possible to differentiate between those three?

For that, we need more data so I switched from summary statistics back to the data.  In this new chart, the day by day errors were plotted.  The gridlines labelled errors within 5 degrees, which is an arbitrary guideline for acceptable / unacceptable.  The three scatters looked remarkably similar although CNN appeared to hit the bull's eye (the middle square) with less bias (errors more evenly distributed) but not much better accuracy overall (similar number of unacceptable errors).


Feb 27, 2007

Mean and median

In the comments of the last post on on-line weather forecasts, Hadley raised the evergreen statistical question of mean vs median.  In this context, median error is unaffected by particular days in which the forecaster makes extreme errors while mean error takes into account the magnitude of every forecasting error in the sample.

Which one to use depends on the situation.  Brandon, who did the original analysis, was motivated by planning a trip to an unfamiliar location.  In this case, he might be better served by lower mean error, which would imply few extremely bad forecasts.

On the other hand, if I am interested in my local weather, then I'd likely be less concerned about a few extremely bad forecasts, and more concerned that the forecast is on the money on most days.  Then perhaps the median error would come into play.
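To see the difference in numbers, here is a minimal sketch using two made-up sets of absolute forecast errors (in degrees) that differ on a single day.  The values are invented for illustration and are not Brandon's data.

```python
from statistics import mean, median

# Hypothetical absolute errors over seven days, NOT the real forecast data.
steady = [0, 1, 1, 1, 2, 2, 3]
one_bad_day = [0, 1, 1, 1, 2, 2, 20]  # same, except one extreme miss

for label, errors in [("steady", steady), ("one bad day", one_bad_day)]:
    print(f"{label}: median = {median(errors)}, mean = {mean(errors):.2f}")
```

The median is the same for both forecasters, while the mean more than doubles for the one with the extreme miss.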

It turns out it doesn't much matter for our weather forecast data.  In this new chart, I superimposed the mean error data (in black).  The scatter of points was exactly as it was for median error (in red).  (MSN had a particularly bad forecast for a low temperature one day, which pulled its location to the left.)

This shows further that the difference between CNN, Intellicast and The Weather Channel is negligible.

Feb 25, 2007

Going out on a limb

Earlier in the month, Prof. Gelman linked to Brandon's fascinating analysis of on-line weather forecasting accuracy.  I have done some additional analysis of the data and the result can be visualized as follows.



I'll concentrate my comments on three observations:

  • CNN was the clear winner in forecasting accuracy during this period based on two criteria: its median error in forecasting daily lows, and its median error in forecasting daily highs.  Moreover, both the median errors were zero, which gives us confidence about its accuracy.  The Weather Channel (TWC) and Intellicast (INT) were not far behind.
  • The ability to forecast highs was better across the board than that of forecasting lows (except BBC).  I am not sure why this should be so.
  • Overall, our weather forecasters were much too risk-averse.  Notice that the errors were heavily concentrated in the lower left quadrant.  A negative error on low temperatures means predicted low is higher than actual low; a negative error on high temperatures means predicted high is lower than actual high.  Taking these together, we observe that the range of actual temperatures has generally been larger than the range of predicted temperatures!  No one was willing to go out on a limb, so to speak, to forecast extremes.
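The last observation can be checked directly by comparing ranges, which avoids any question about how the errors are signed.  A minimal sketch with invented temperatures (not Brandon's data):

```python
# Each tuple: (predicted_low, predicted_high, actual_low, actual_high).
# Hypothetical values in degrees, NOT the real forecast data.
days = [
    (40, 60, 37, 63),
    (45, 62, 44, 66),
    (38, 55, 39, 58),
]

predicted_ranges = [p_high - p_low for p_low, p_high, _, _ in days]
actual_ranges = [a_high - a_low for _, _, a_low, a_high in days]

# Count the days on which the forecast range was narrower than what happened.
narrower = sum(p < a for p, a in zip(predicted_ranges, actual_ranges))
print(f"Forecast range narrower than actual range on {narrower} of {len(days)} days")
```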

Actually, I believe this inability or unwillingness to forecast extreme values is endemic to all forecasting methodologies.

Before closing, I mention that the graph was based on a subset of Brandon's data.  I only considered same-day forecasts, did not consider Unisys (because they didn't provide forecasts for lows), and also noted that there might be bias since there were breaks in the time series.  Also, I retained the sign information and didn't take absolute values as Brandon did.

Feb 01, 2007

Error spotting

My friend Augustine pointed me to this interesting graph showing the time of sunset over the course of a year.  (The original author's write-up is here.)


Of course, one can produce a perfect chart by looking up meteorological records.  The main interest in this graph is how it was constructed.  Each cell in the graph represents an hour of a day, with days running across and time running down.  The cells that are not dark each contain a photograph of the sunset contributed to Flickr, the photo-sharing site.  So this is in effect a graph created through mass collaboration (about 35,000 photos).

The "white" band roughly indicates the sunset.  What intrigues me is the variability... what are the reasons for lighted cells appearing all over the graph?

Some ideas include:

  • Different time zones
  • Incorrect time setting by some photographers
  • Erroneous tagging of photos as "sunset"

Jan 16, 2007

Subjectivity

When I look at charts like this one, I ponder: Should graph designers adopt "objectivity" as practiced by American journalists?

Is it even possible to make "objective" charts?  Every design choice we make seems to chip away at some of the detachment.  In this chart, the choice to order important web-site features by shopper -- rather than merchant -- ratings is a tacit preference for those ratings.  Bringing out key messages in the data is a subjective act, isn't it?

Are "objective" charts useful?  In our example, the design choices are kept to a minimum, and so it seems is its usefulness.  In comparing shopper and merchant ratings, one would be most interested in identifying the most effective web-site features as well as those features offered by merchants that find little resonance with shoppers-users.  These questions are better addressed by directly plotting the average rank and the ranking gap between merchants and shoppers (see below).

Notice that I said "ranking" rather than "rating".  The footnote discloses that the ratings were obtained from two different surveys conducted by two different companies at two different times.  How should we interpret the difference of 13% between the 89% of shoppers rating "Free Shipping" "very to extremely helpful" and the 76% of merchants rating "Free Shipping" "somewhat to very valuable"?
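Converting each survey's ratings to ranks sidesteps that question, since ranks only compare features within the same survey.  Here is a minimal sketch of the calculation.  Only the two "Free Shipping" percentages come from the article; every other number is invented for illustration, and the function name is hypothetical.

```python
def to_ranks(ratings):
    """Rank features within one survey: 1 = highest rating."""
    ordered = sorted(ratings, key=lambda feature: -ratings[feature])
    return {feature: position + 1 for position, feature in enumerate(ordered)}


# Percent of respondents giving a favorable rating.
# Only "Free Shipping" (89 and 76) is from the article; the rest is hypothetical.
shoppers = {"Free Shipping": 89, "Keyword Search": 80, "Top Sellers": 40, "Store Locator": 70}
merchants = {"Free Shipping": 76, "Keyword Search": 70, "Top Sellers": 72, "Store Locator": 30}

shopper_rank = to_ranks(shoppers)
merchant_rank = to_ranks(merchants)

summary = {}
for feature in shoppers:
    average_rank = (shopper_rank[feature] + merchant_rank[feature]) / 2
    # Positive gap: merchants rank the feature higher than shoppers do.
    gap = shopper_rank[feature] - merchant_rank[feature]
    summary[feature] = (average_rank, gap)
    print(f"{feature}: average rank {average_rank}, ranking gap {gap:+d}")
```

With these made-up numbers, "Top Sellers" shows a positive gap (merchants like it more than shoppers do) and "Store Locator" a negative one.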

In the junkart chart, we can focus on three groups of features:

  • the three top features ("Promo Discounts", "Free Shipping" and "Keyword Search") which attained the best average rank and least ranking gap;
  • the three "orphan" features ("Recommended Products", "Top Sellers", "Gift Selection") created by loving web-site producers, abandoned by independent-minded shoppers;
  • the three "neglected stepchildren" ("Shop the Catalog", "Store Locator", "Product Comparison") whose importance to shoppers was vastly underestimated by the merchants.

Unfortunately, while being "objective",  the data table fails to point out anything of interest to the reader.

Reference: "Consumers want one thing -- merchants are delivering another", Internet Retailer, Jan 2007.

Dec 15, 2006

Emergent patterns

It's always a pleasure to read blow-by-blow accounts of how charts were constructed.  The piece on time-travel maps was instructive.  Similarly in the previous post, I quoted the following:

It’s easier to answer this question if you leave out the six states that didn’t elect any Republicans in 2000; after all, they didn’t have any to throw out. If you also remove New Hampshire and South Dakota, where the percentage of Republicans elected dropped to 0 from 100 — New Hampshire only has two seats in the House and South Dakota has one — a pattern starts to appear.

At first sight, this appears as a case of removing outliers, which many statisticians recommend.  Except that the data omitted were not outliers.  Indeed, when both x- and y-variables are bounded (between 0% and 100% share of the House seats; between -100% and +100% change in share), there can be no extreme values.

In effect, when the author eliminated those eight points, he followed the "emergent pattern" theory, by which I mean the notion of removing data until a pattern "emerges".  (By the way, emergence is now a science, as expounded here.)  If enough data is removed, one can produce any pattern one pleases.  One can find subsets of data to support a hypothesis of positive linear, flat linear or quadratic, as shown below.
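A toy illustration of the same point, using six invented points rather than the election data: the full set has zero correlation, yet dropping half of the points produces a perfect positive or a perfect negative relationship.

```python
from math import sqrt


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical points, NOT the election data.
full = [(1, 1), (2, 2), (3, 3), (1, 3), (3, 1), (2, 2)]
rising = [(1, 1), (2, 2), (3, 3)]    # keep only points that fit "positive"
falling = [(1, 3), (2, 2), (3, 1)]   # keep only points that fit "negative"

print("full data:", pearson(full))
print("rising subset:", pearson(rising))
print("falling subset:", pearson(falling))
```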


Focusing now on the full data set in the upper left corner, one is hard pressed to conclude that a positive correlation exists between the two variables.  In particular, most states experienced no change in the share of House seats, and in these states, the income growth ranged from under 20% to over 40%, which is pretty much the extent of variability across the full data set.
