Dec 09, 2007

Lacking buzz

Nielsen, they of the ratings, is roughing it in the information age.  When they announced on-line tracking tools, Wired quipped: "It's looking like online video policing companies will have to make room for another deputy."  Last year, cable companies revolted over a service measuring the effectiveness of commercials.

Via the Data Mining blog, I learnt about yet another new on-line offering, called "Hey! Nielsen" for obscure reasons.  (Perhaps Hey! Nielsen is the new Yahoo! !)

The site is an enigma wrapped in a mystery.  The official description says:

Hey! Nielsen is the place to make a name for yourself while trading opinions on TV, movies, music, personalities, web sites and more.

How does one "trade" opinions?

According to the FAQ, the "Hey! Nielsen" score, the cornerstone of the site, is:

a real-time indicator of a topic's impact and value and you play a major role. As the site evolves and users submit their opinions and commentary, the score will rise or fall based on a number of factors including, but not limited to, user opinions, news coverage, and raw data from our sister sites Billboard.com, HollywoodReporter.com, and BlogPulse.com.

Sounds like a product aimed at marketers to help them track public opinion but offering little control over sampling. 

The "Hey! Nielsen" buzz chart (below) captures the change in "Hey! Nielsen" score over time.

[Image: "Hey! Nielsen" buzz chart]

This chart is an unfortunate case of flipping background into foreground.  What grabs our attention are those hideous white circles with numbers in them.  The legend explains that these are the daily numbers of opinions on the subject, in other words, the daily sample sizes.  As they stand now (with the site still in beta), they serve to expose the low level of participation, leading to small sample sizes, and irrelevance.  But what if the site became super-popular: would the circles say 56234, 19245, 90257, etc.?  Why would visitors care about daily sample sizes anyway?  Mousing over these circles reveals text, but in most cases it is blocked by neighboring white circles.

In the meantime, the circles obscure the line which shows the trend in the "Hey! Nielsen" score over time.  This chart reminds me of that Google toy known as Google Trends.  The Googlers provide no vertical scale so the graphs are unreadable.  "Hey! Nielsen"ers provide a vertical scale -- kind of -- but the graphs are still meaningless: what does a score of 881 mean?  how about 724?  what is the maximum score?  what is the minimum?  Beware numbers without context.

The vertical axis does start from zero but has an odd spacing of tick labels. The gridlines are distracting and serve no purpose.  The orange area under the curve also makes little sense.

We look forward to seeing version 2.0.

 

Dec 02, 2007

Live dynamic graphics

In the second interesting item of the week, I return to the fabulous Google Finance chart, which shows the distribution of stock market returns by sector.  I wrote about it twice (here and here).  In the original post, I saluted the engineers for figuring out the formidable technical issues of turning a live dynamic data stream into a live dynamic graphic but didn't go into details.  (Trust me.)

The other night, this chart popped up on my browser.

Oops.

If someone kept track of how often such a mishap shows up, the tally would probably come to 1-5% of the time.

The triple challenge of generating this graphic is the volume of data that needs to be processed, the velocity at which it changes, and the flicker of time from input to output, probably not more than a few minutes. The analysis and charting must be maintained continuously during market hours.  For any such project, the thing to manage is the error rate, and one should be totally thrilled if it's in the range Google engineers have achieved.

Nov 16, 2007

Large tables

Richard J. asked how we might make sense of this table.

Large tables present lots of challenges.  The trick is to enhance the table with colors and shapes; and as usual, remove any data that doesn't help make your argument.

This table compares countries across different measures of privacy.  Each measure is rated on a scale of 1 to 5, with some blanks.  These ratings are averaged to obtain an overall rating, listed on the right.

In the junkart version, the ratings are presented as slots inside a box.   The overall rating is placed right below the name of the country since this is the most important measure, and the one by which the countries are ordered.  The rows and columns are reversed so as to explain how the overall rating can be decomposed into individual metrics for each country.  I have only shown the top five countries but obviously the chart can be extended to cover all the data.

[Image: junkart version of the privacy table]

If desired, the top 5 countries in each measure can be given a different color: this would increase the data-ink ratio on the chart.  One weakness of this type of chart is that the rows and columns do not have equal status: comparing across rows is more difficult than comparing up and down columns.

Richard also wonders about their treatment of the blanks.  It appears that they omit blanks so each country's overall rating is the average of its non-blank measures.  Omitting blanks may seem innocuous but in fact, this is equivalent to assigning each blank measure a rating equal to the country's average non-blank rating.  Richard wonders if this is the best way to treat these blanks.
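To see why the two treatments are equivalent, here is a minimal Python sketch.  The ratings are made up for illustration (they are not taken from the Privacy International table), and the function names are my own.  Filling each blank with the mean of the country's non-blank ratings leaves the average unchanged, whereas filling with some other value, say the scale midpoint of 3, shifts it.

```python
from statistics import mean

def average_ignoring_blanks(ratings):
    """Average only the measures that were actually rated (blanks are None)."""
    return mean(r for r in ratings if r is not None)

def average_with_fill(ratings, fill):
    """Replace every blank with `fill`, then average all measures."""
    return mean(fill if r is None else r for r in ratings)

# Hypothetical country: six measures on the 1-5 scale, two of them blank.
ratings = [2, None, 4, 3, None, 1]

skip = average_ignoring_blanks(ratings)          # blanks omitted
imputed = average_with_fill(ratings, fill=skip)  # blanks set to the country's own mean
midpoint = average_with_fill(ratings, fill=3)    # blanks set to the scale midpoint

print(skip, imputed, midpoint)
```

The first two numbers agree; the third does not, which is the sense in which omitting blanks is a modeling choice rather than a neutral default.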

 

Source: "Leading surveillance societies", Privacy International.

(Thanks to Richard for sending me the data.)

Nov 11, 2007

Red-lining by marriage

Tom W., a reader, noticed this map featured on a BBC News page about the UK family.

One can roughly make out the shape of Great Britain so this is some kind of cartogram.  The title announces that this cartogram concerns the "distribution of population".

In a typical map like this, the redder reds would indicate higher densities of people.  Yet, the article tells us that the population is divided evenly into 85 squares, each containing "roughly half a million people over 18 years old".

Instead, we seem to have 500K widowed people next to 500K re-married people (most of whom prefer the coasts, by the way), etc.  Apparently, the Brits practise a form of red-lining based on marital status!

The S/M/W/D/R labels are also redundant and very distracting; and the white gridlines interfere with our ability to read the grey boundaries.

Source: "The UK family", BBC News.

Oct 28, 2007

Clocks and pies

Keith A submitted this graphical idea from the folks at Ikea (via Boing Boing). 
[Image: Ikea chart]
Based on the comments, it seems like some people really like this presentation!

Consider these for amusement:

  • Does the "9" on Sunday mean 9 am or 9 pm?  (This chart mixes A.M. and P.M. hours in a totally nonchalant way.)
  • If the above is too easy, try the "9" for Saturday!
  • Why was "9" displayed on Sunday anyway?  Meanwhile, why wasn't "7" displayed for Saturday?  (How were the hour labels chosen?)
  • Why was "Closed" written on the chart while "High", "Mid", and "Low" were put into the legend?
  • Since pie charts show proportions, is it possible to describe what proportions were plotted?

Reminds me of this pie chart.



Sep 17, 2007

Structuring a chart

This chart from the NYT was intended to show how the EPA has moved the bar on vehicle mileage ratings: 2008 estimates were lower than 2007 estimates across the board, regardless of manufacturer, model and city/highway.

The chart was built from one basic component, repeated for each model.
I like the discreet gridlines (the white ticks) which enable readers to count off the mileage ratings.

The data is rich: ratings were given along three dimensions (model, year of estimate and city/highway).  Readers would benefit from stronger guidance on where to look for the most pertinent information.  As the chart stands, it is merely a container for the data.  It fails our self-sufficiency test: all the data were printed on the chart, and the bars add little.

In the junkart version, I use knowledge of the data to structure the chart. First, noting that sedans, hybrids and trucks/SUVs/minivans have different levels of mileage ratings, I clustered the models into three groups.  Secondly, the city and highway ratings were separated into two columns as I consider the between-model comparisons more important than city-highway comparisons.
The chart is a dot plot, with a vertical tick for 2007 estimates and a dot for 2008 estimates.  It's easy to see that all dots sit to the left of vertical ticks.

More subtly, we can also see that the hybrids appeared to have been penalized more.  Or perhaps, the higher the rating, the larger the downward adjustment...

Source: "Mileage Ratings Are Still Estimates, Though Closer to Reality", New York Times, Sept 16 2007.

Aug 08, 2007

On the bubble

A couple of you noticed this table of bubbles in the Times, and asked what I think of it.  Dustin J suggested that this could be considered a decent application of bubble charts.  I agree, with some reservations.

The data set is the best thing about this chart.  The riches that lie beneath!  Many questions can be addressed, including:

  • Which Presidential candidates are getting the most face time?
  • Are candidates seen equally often across the stations?
  • Are there differences between network and cable stations in terms of total face time?  In terms of individual face time?
  • Are there Democratic/Republican leanings by station?  by type of station?

The intrepid can even build a regression out of it.

The bubble chart contains answers to all those questions but nothing jumps out. Okay, it's easy to see the station that gives each candidate the most face time.  Anything else requires moderate to a lot of effort.  Here's the junkart version.


The list of things done to the data is long:

  • Candidates are grouped together by party
  • Candidates within each party are arranged in order of decreasing maximum face time
  • Stations are arranged by increasing total face time; this order happens to retain the network vs cable divide
  • A heat map construct is used instead of bubbles: the legend is missing but there are four shades for each color: darkest = top 10%; medium = 50th - 90th percentile; light = bottom 50% excepting zeroes; white = no face time.  In raw numbers, 90th percentile = 81 minutes, 50th percentile = 19 minutes.
  • The only data shown are the totals by candidate and totals by station.
  • On the right margin are little bar charts that show the distribution of network/cable for each candidate.
  • On the bottom margin are little column charts showing the distribution of party affiliation by station.
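The shading rule in the heat map bullet can be sketched in a few lines of Python.  This is only an illustration: the two cutoffs (19 and 81 minutes) are the ones quoted above, but the function name, the shade labels, and the handling of values that land exactly on a cutoff are my assumptions.

```python
def shade_for(minutes, p50=19, p90=81):
    """Map a candidate-station face time (in minutes) to one of four shades.

    p50 and p90 are the 50th and 90th percentile cutoffs quoted in the post.
    Treating a value equal to a cutoff as belonging to the darker bin is an
    assumption made for this sketch.
    """
    if minutes == 0:
        return "white"    # no face time
    if minutes >= p90:
        return "darkest"  # top 10%
    if minutes >= p50:
        return "medium"   # between the 50th and 90th percentiles
    return "light"        # bottom half, excluding zeroes

# Hypothetical face times, not taken from the Times data.
for minutes in (0, 5, 40, 100):
    print(minutes, shade_for(minutes))
```

Binning into a handful of shades throws away precision on purpose: the reader compares cells by category rather than trying to judge bubble areas.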

A few observations follow:

  • Cable stations gave much more face time to the candidates in general.  Fox, no surprise, gives Republicans 85% of its time while all the others were roughly equal.
  • The more mainstream the candidate, the more balanced was the time spent on networks versus cable.  John McCain (R), Hillary Clinton (D) and John Edwards (D) had the highest proportion of network time.
  • More time is not necessarily good since McCain was the clear winner but his campaign is struggling.

Source: "Tracking Face Time", New York Times, August 1, 2007.

Jul 29, 2007

Transgender trends

One of the many gratifications of blogging is to connect with others who have similar interests; so it has been fantastic to receive user submissions (though admittedly I don't check my inbox frequently enough).  The thoughtfulness of these nominations continues to impress me.

Evan sent in 254 charts he created after looking at the post on baby names.  An example is shown on the right.

He is particularly interested in the question of names that are given to both males and females. 

For example, the bottom chart shows that Jordan is primarily a male name, and saw a period of growth followed by decline, although the decline has been more severe on the male side than the female side. 

It's a nice touch to label the most recent year.  I'd also label the values for the most recent year on the axes.

Evan also offers the following solution to the scaling problem we identified in the original WSJ chart:

My solution was just to put two charts on each chart. One at a fixed scale for every chart to give a sense of size and one at a variable scale to better show the shape of the plot.

In other words, for less popular names, the top chart would look much more compressed.
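One way to read Evan's solution is as a choice of axis limits, sketched here in Python.  The names and counts are hypothetical (they are not from his charts), and `axis_limits` is my own helper, not his code.  The fixed scale uses the largest value across all names; the variable scale uses each name's own maximum.

```python
def axis_limits(series_by_name):
    """Return fixed and variable vertical-axis limits for each name.

    fixed    -> same upper limit for every chart (the largest value overall)
    variable -> upper limit set by that name's own peak
    """
    fixed_top = max(max(counts) for counts in series_by_name.values())
    return {
        name: {"fixed": (0, fixed_top), "variable": (0, max(counts))}
        for name, counts in series_by_name.items()
    }

# Hypothetical yearly counts for one common and one uncommon name.
births = {
    "popular_name": [1200, 5400, 9800, 7300],
    "rare_name": [10, 45, 120, 80],
}

limits = axis_limits(births)
print(limits["rare_name"])
```

On the fixed scale the uncommon name's peak of 120 sits on an axis that runs to 9800, so its line hugs the bottom; the variable scale is what reveals its shape.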

There are many more charts to sift through on his site.  Evan welcomes suggestions.

May 23, 2007

Looking for survival

Daniel W of esnips has started a collection of graphics on visualizing web statistics.  The following graph is an attempt to capture the ability of the web-site to attract returning customers.

The time axis serves double duty here: it indicates which "cohort" the users belong to, in other words, when they signed up; it is also the month of returning visits.

A more typical chart used by statisticians is the survival curve.  As shown here, these are the same curves as above but aligned to the same starting point.  Now, the time axis is interpreted as number of months after registration.  Of 100 members who registered in January, how many returned one month later, two months later, etc.?
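The re-indexing behind a survival curve is simple enough to sketch in Python.  Everything below is hypothetical: the cohort sizes and return counts are invented (I don't have Daniel's data), months are numbered 1 to 12 within a single year, and `align_cohorts` is a name I made up.

```python
def align_cohorts(cohorts):
    """Shift each cohort so that time is measured from its registration month.

    cohorts: {start_month: (cohort_size, {calendar_month: returning_users})}
    returns: {start_month: {months_after_registration: percent_returning}}
    """
    aligned = {}
    for start, (size, visits) in cohorts.items():
        aligned[start] = {
            month - start: 100.0 * count / size
            for month, count in sorted(visits.items())
        }
    return aligned

# Hypothetical cohorts: 100 users registered in January (month 1),
# 200 registered in February (month 2).
cohorts = {
    1: (100, {2: 40, 3: 30, 4: 25}),
    2: (200, {3: 90, 4: 70}),
}

curves = align_cohorts(cohorts)
print(curves)
```

After alignment, every cohort starts at the same point on the horizontal axis, so the lines can be compared directly at one month out, two months out, and so on.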

If the purpose is to evaluate the consistency of retaining customers by cohort, then this graphic is less cluttered.  I also used a fading metaphor to color the lines so that the oldest cohort (also, the longest line) is the faintest.  Line labels are best hidden, and revealed interactively when the user mouses over a line of interest.

Not sure if Daniel was plotting real data; in general, we expect a certain amount of criss-crossing.  If the data is real, then his site has seen uninterrupted improvement every month thus far.

Source: The Web Analytics Graph Collection, eSnips.

Apr 25, 2007

Shower of bullets

Here's one of those infographics that makes the reader work hard (via Dustin J).  The graphic in its full glory is here; it's much too large to be reproduced, and I have clipped off the bottom half.

Much to the designer's credit, he extracted data of interest, rather than trying to cram everything onto the page.  In particular, he was most interested in the distribution of deaths among different age groups, the types of deaths (suicides, homicides) and the identities of the deceased (race, gender).

Just like the election fraud graphic, such rich data lend themselves to multiple levels of aggregation.  Here, the designer focuses on the most detailed level, making it easiest to see facts like "among the 18-25 age group, there were 6 black men murdered per day".

However, it takes much more attention to notice higher-level facts like "homicides per day are relatively flat across age groups while suicides heavily skew toward 40+".

In the junkart version, I decided to emphasize the more aggregated data, showing the number of deaths of each type across age groups. The detailed break-down of race and gender is shoved into parentheses, as it can be skipped by less serious readers.

The reader who discovers the homicide/suicide pattern described above may surmise that homicide gunfire deaths are more "random" while suicides, being premeditated, may affect older people disproportionately.  More research would be needed to confirm these and other suspicions.

Source: "An Accounting of Daily Gun Deaths", New York Times, April 21 2007.

 
