
Light entertainment: famous people, sleep, publication bias

Bernard L. tipped us about this "infographic":

[Infographic: sleep schedules of famous people]

The chart is missing a title. The arcs present "sleep schedules" for the named people. The "data" comes from a book, and I wonder about its accuracy.

Also note the inherent "publication bias": people who do not follow a rigid schedule cannot describe a sleep schedule, and thus take themselves out of the chart.


Respect the reader's time

A graphic illustrating how Americans spend their time is a perfect foil to make the important case that the reader's time is a scarce resource. I wrote about this at the ASA forum in 2011 (link).

In the same WSJ that carried the DSL speed chart (link), they boldly placed the following graphic in the center of the front page of the printed edition:

[Image: WSJ treemap of American time use, printed front page]

The visual form is a treemap displaying the results of the recently released American Time Use Survey (link to pdf).

What does the designer want us to learn from this chart?

***

What jumps out first is the importance of various activities, starting with sleep, then work, TV, leisure/sports, etc.

If you read the legend, you'll notice that the colors mean something. The blue activities take up more time in 2013 compared to 2003. Herein, we encounter the first design hiccup.

The size of the blocks (which codes the absolute amount) and the color of the blocks (which codes the relative change in the amount) compete for our attention. According to Bill Cleveland's research, size is perceived more strongly than color. Thus, the wrong element wins.

Next, if we have time on our hands, we might read the data labels. Each block has two labels: the absolute values for 2003 and for 2013. In effect, the designer is administering an arithmetic test, asking the reader to compute the change in time spent in his or her head.

It appears that the designer's key message is "Aging Americans sleep more, work less", with the subtitle "TV remains No.1 hobby".

***

Now compare the treemap to this set of "boring" bar charts:

[Image: WSJ online bar charts of time use]

This visualization of the same data appears in WSJ online in lieu of the treemap. Here, the point of the article is made clear; the reader need not struggle with mental gymnastics.

(One can grumble about the designer's blindness to red-green color-blindness but otherwise, the graphic is pretty good.)


***

When I see this sort of data, I like to make a Bumps chart. So here it is:

[Chart: redone Bumps chart of time use, 2003 vs. 2013]

The labeling of the smaller categories poses a challenge because the lines are so close together. However, those numbers are so small that none of the changes would be considered statistically significant.
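
For readers who want to build one, here is a minimal sketch of a Bumps chart (slopegraph) in matplotlib; the activities and hour values are placeholders, not the actual ATUS estimates:

```python
# A minimal Bumps chart (slopegraph) in matplotlib.
# The activities and hour values are placeholders, not the actual ATUS estimates.
import matplotlib.pyplot as plt

hours = {  # activity: (hours per day in 2003, in 2013)
    "Sleep": (8.5, 8.7),
    "Work": (3.7, 3.5),
    "TV": (2.6, 2.8),
    "Leisure/sports": (0.7, 0.7),
}

fig, ax = plt.subplots(figsize=(4, 6))
for activity, (y03, y13) in hours.items():
    ax.plot([0, 1], [y03, y13], marker="o")           # one line per activity
    ax.text(-0.05, y03, f"{activity} {y03}", ha="right", va="center")
    ax.text(1.05, y13, str(y13), ha="left", va="center")

ax.set_xticks([0, 1])
ax.set_xticklabels(["2003", "2013"])
ax.set_xlim(-0.8, 1.3)
ax.set_ylabel("Hours per day")
plt.tight_layout()
plt.show()
```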

***

From a statistical/data perspective, a very important question must be raised. What is the error bar around these estimates? Is there anything meaningful about an observed difference of fewer than 10 minutes?

Amusingly, the ATUS press release (link to pdf) has a technical note that warns us about the reliability of estimates, but nowhere in the press release can one actually find the value of a standard error, a confidence interval, etc. After emailing them, I did get the information promptly. The standard error of one estimate is roughly 0.025-0.05 hours, which means that the standard error of a difference is roughly 0.05-0.1 hours, which means that a confidence interval around any estimated difference is roughly 0.1-0.2 hours, or 6-12 minutes.
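
To spell out the arithmetic, here is a quick check (the exact factor for the standard error of a difference is √2 ≈ 1.41; the rough doubling above rounds it up conservatively):

```python
# Back-of-envelope check of the confidence-interval arithmetic above.
for se in (0.025, 0.05):       # standard error of one estimate, in hours
    se_diff = 2 * se           # exact factor is sqrt(2) ~ 1.41; rounded up to 2 here
    half_width = 2 * se_diff   # roughly a 95% confidence interval
    print(f"SE={se} hr -> CI half-width ~ {half_width:.1f} hr = {half_width * 60:.0f} min")
# prints 0.1 hr (6 min) and 0.2 hr (12 min), the 6-12 minute range quoted above
```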

Except for the top three categories, it's hard to know whether the reported differences are real or merely artifacts of sampling error.

***

A further problem with the data is its detachment from reality. There are two layers of averaging going on: one at the population level and one at the time level. In reality, not everyone does these things every day. This dataset is really only interesting to statisticians.

So, in a Trifecta Checkup, the treemap is a Type DV and the bar chart is a Type D.


Getting the basics right is half the battle

I was traveling quite a lot recently, and last week, I read the Wall Street Journal cover to cover for the first time in a while. I am happy to report that there are many more data graphics than I remember from past editions.

The following chart illustrating findings of an FCC report on broadband speeds has a number of issues (a related blog post containing this chart can be found here):

[Chart: WSJ comparison of actual vs. advertised broadband speeds by ISP]

The biggest problem with the visual elements is the lack of linkage between the two components, which should be connected: the chart on the right presents average speeds by broadband technology while the one on the left presents results for individual ISPs. Evidently, the designer treats the two parts as separate.

If that was the intention, there are two decisions that create confusion for readers. First, the charts use two different but related scales. Just add 100% to the scale of the left chart and you get the scale of the right chart. There really is no need for two different scales.

Secondly, orange and blue are used in both charts but for different purposes. In the left chart, orange denotes all ISPs whose actual speeds were below their advertised speeds. In the right chart, orange denotes ISPs using DSL technology.

I also do not understand why some ISP names are bolded. The bolded companies include several cable providers (but not all), several DSL providers (but not all), one fiber provider, and no satellite providers.

Lastly, I'd prefer they stick to one of "advertised" and "promised". I do like the axis labels, saying "faster than" and "slower".

***

One challenge with the data is that the FCC report (here) does not provide a mathematical linkage between the technology averages and the ISP data. We know that the 91% for DSL is the average of the DSL-based ISPs shown on the left of the chart, but we don't know the weights (relative popularity) of each ISP, so we can't check the computation.

But if we think of the average by technology as a reference point against which to measure individual ISPs, we can still use the data, and more efficiently, as in the following dot plot, where the vertical lines indicate the appropriate technology average:

[Chart: redone dot plot of ISP speeds with technology-average reference lines]

(The cable section should have come before the DSL section but you get the idea.)
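
For those curious how such a dot plot might be put together, here is a minimal matplotlib sketch; the ISP names and speed percentages are placeholders, not the FCC's figures:

```python
# Sketch of a dot plot with a vertical reference line at each technology's average.
# ISP names and speed percentages are placeholders, not the FCC's actual figures.
import matplotlib.pyplot as plt

groups = {  # technology: [(ISP, actual speed as % of advertised), ...]
    "DSL":   [("ISP A", 85), ("ISP B", 91), ("ISP C", 95)],
    "Cable": [("ISP D", 99), ("ISP E", 104), ("ISP F", 110)],
}

fig, ax = plt.subplots()
y = 0
for tech, isps in groups.items():
    ys = list(range(y, y + len(isps)))
    ax.scatter([pct for _, pct in isps], ys)
    for yi, (name, _) in zip(ys, isps):
        ax.text(47, yi, f"{name} ({tech})", va="center")
    avg = sum(pct for _, pct in isps) / len(isps)
    ax.vlines(avg, ys[0] - 0.4, ys[-1] + 0.4, linestyles="dashed")  # technology average
    y += len(isps) + 1  # leave a gap between technology groups

ax.axvline(100, color="gray")  # 100% = delivered exactly what was advertised
ax.set_xlim(45, 120)
ax.set_yticks([])
ax.set_xlabel("Actual speed as % of advertised")
plt.show()
```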

The key message of the chart, in my mind, is that DSL providers as a class over-promise and under-deliver.

In a Trifecta Checkup, this is a Type V chart.



A great visual of complicated schedules

Reader Joe D. tipped me about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).

The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:

[Image: Marey-style diagram of Boston T trains over one day (top portion)]

I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both the time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I say correlated rather than proportional because the stations are evenly spaced on this chart even though the actual distances between them vary). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures (a sketch of how such a diagram is drawn follows the list):

  • For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
  • The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.
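
Here is a minimal matplotlib sketch of the Marey construction; the stations named are real Red Line stops, but the times are made up for illustration:

```python
# Marey-style diagram: stations across, time running down, one line per train.
# Station spacing is even here, as in the original chart; real distances vary.
import matplotlib.pyplot as plt

stations = ["Alewife", "Harvard", "Park St", "JFK/UMass"]  # a few Red Line stops
x = range(len(stations))

# Minutes past 6:00 am at which each (made-up) train reaches each station.
trips = [
    [0, 9, 18, 28],
    [10, 19, 29, 41],   # a slightly slower trip -> shallower slopes
    [20, 29, 38, 48],
]

fig, ax = plt.subplots(figsize=(5, 6))
for times in trips:
    ax.plot(x, times, marker="o", color="steelblue")

ax.set_xticks(list(x))
ax.set_xticklabels(stations, rotation=45, ha="right")
ax.set_ylabel("Minutes past 6:00 am")
ax.invert_yaxis()  # time flows downward, as in the MBTA project
plt.tight_layout()
plt.show()
```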

Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.

[Image: selected train highlighted, with linked subway-map snapshot at 6:11 am]

Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates the snapshot of the entire T system shown on the left: the map displays the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I clicked on earlier.

This is a master class in linking multiple charts and using interactivity wisely.

***

You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown moving through the system. That is the image the project designers have blessed with the top position of their Github page.

However, the image above allows us to see why the Marey diagram is the far superior representation of the data.

What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)

Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.

It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.
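
To answer the question numerically rather than visually, one could compute each trip's duration and flag trips slower than their route's average. A sketch in pandas, with column names that are my assumption, not necessarily how the project's data files are organized:

```python
# Flag behind-schedule trips, defined as slower than the average trip on the same route.
# The column names (trip_id, route, depart, arrive) are assumed for illustration.
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "route":   ["Red", "Red", "Orange", "Orange"],
    "depart":  pd.to_datetime(["06:00", "06:15", "06:05", "06:20"]),
    "arrive":  pd.to_datetime(["06:45", "07:10", "06:40", "06:55"]),
})

trips["duration"] = trips["arrive"] - trips["depart"]
route_avg = trips.groupby("route")["duration"].transform("mean")
trips["behind_schedule"] = trips["duration"] > route_avg
print(trips[["trip_id", "route", "duration", "behind_schedule"]])
```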

Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.

***

On that last question, the project designers offer up an alternative Marey. Think of this as an indexed view: each trip is indexed to its starting point. The following setting shows the morning rush hour compared to the rest of the day:

[Image: indexed Marey view comparing the morning rush hour to the rest of the day]

I think they could utilize this display better by showing hourly averages rather than every single schedule. Instead of letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which, according to the text, are the morning rush, afternoon rush, midday lull and evening lull.

The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.
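
A sketch of the data step behind that suggestion, again with assumed column names and period cutoffs: index each observation to its trip's departure, then average the elapsed times by period:

```python
# Average "indexed" trip profiles by period instead of drawing every schedule.
# Column names (trip_id, station, time) and period cutoffs are assumptions.
import pandas as pd

obs = pd.DataFrame({
    "trip_id": [1, 1, 1, 2, 2, 2],
    "station": ["A", "B", "C", "A", "B", "C"],
    "time": pd.to_datetime(["07:00", "07:10", "07:25",
                            "13:00", "13:08", "13:20"]),
})

# Index each trip to its own departure: elapsed minutes since the trip began.
start = obs.groupby("trip_id")["time"].transform("min")
obs["elapsed_min"] = (obs["time"] - start).dt.total_seconds() / 60
obs["period"] = obs["time"].dt.hour.map(
    lambda h: "AM rush" if 7 <= h < 10 else "rest of day")

# One average profile per period -- far fewer lines than one per trip.
profile = obs.groupby(["period", "station"])["elapsed_min"].mean().unstack("station")
print(profile)
```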

***

There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.

Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.


Second Dataviz Workshop Soon to Start, and Feedback from First Workshop

I'm excited to announce that there will be a summer session for my Dataviz Workshop at NYU (starting June 21). This is a chart-building workshop run like a creative writing workshop. You will work on a personal project throughout the term, receive feedback from classmates, and continually improve the product. I have previously written about the First Workshop here (with syllabus), here, here and here.

Here is the link to register for the course. (Note: the correct class time is 10a - 1p.)

***

The participants in the First Workshop were very happy with their experience. I can now report on the end-of-course survey. Ten people took the class, and seven responded to the survey. The satisfaction scores are as follows:

[Chart: course satisfaction ratings]

It's very gratifying to see that almost everyone thought the class time was well spent. During class, students gave each other feedback on projects. A key to making these sessions work is that students should be both givers and takers. It is really important that they become as comfortable giving critique as taking feedback. I asked the students to self-assess and this is what they said:

[Chart: self-assessed comfort giving vs. taking feedback]

I'd also add that the few students who enrolled in the course with less background than the average ended up participating fully and actively in the discussion. As an instructor, I want to get out of the way while keeping the conversation on track. Based on the following rating, I think I did fine:

[Chart: rating of instructor facilitation]

One piece of feedback I received during class--not reflected here--is that some students want to spend more time discussing the reading. I assign three books, which everyone loved, but I believe it is hard to finish all three in time for the second class. They would like the discussion of the books spread over the course of the term. This arrangement would present a challenge: due to the nature of a workshop, the first two sessions cannot involve project discussion, which is one of the reasons why I give introductory lectures and assign the books up front. In addition, students spend a lot of time during the term working on their own projects and reviewing their classmates' projects, and I worry that assigning more reading would distract from those activities.

Indeed, the course is not a gut course. Several students were surprised by how much work they put in. One or two learned that preparing the data took ten times as much time as they expected. (They selected particularly difficult datasets to work with.)

[Chart: student-reported workload]

One specific suggestion is to add a session in the computer lab. This would create an opportunity for students to share their knowledge: those who are good coders can help others with pre-processing tasks, and those who are good with Illustrator can show others how to make the charts pretty. I am not ready for this change in the summer session, but in the fall, I'll likely experiment with it.

Finally, the tools used by students are diverse: Excel (5), Illustrator (3), R (2), followed by Powerpoint, Pixelmator (draft stage), Tableau, Stata, Paint and SQL Server (1 each). Three of the students put their work on a Web page, which was the most popular format.

***

If you are serious about dataviz, please join me this summer for the Second Art of Data Visualization Workshop.

Click on this link to register for the course.





A reader submits a Type DV analysis

Darin Myers at PGi was kind enough to send over an analysis of a chart using the Trifecta Checkup framework. I'm reproducing the critique in full, with a comment at the end.

***

[Chart: daily screen time by device across countries, from KPCB Internet Trends]

At first glance this looks like a valid question, with good data, presented poorly (Type V). Checking the fine print (glad it’s included), the data falls apart.

Question

It’s a good question…What device are we using the most? With so much digital entertainment being published every day, it pays to know what your audience is using to access your content. The problem is this data doesn’t really answer that question conclusively.

Data

This was based on survey data asking respondents “Roughly how long did you spend yesterday…watching television (not online) / using the internet on a laptop or PC / on a smartphone / on a tablet?” Survey respondents were limited to those who owned or had access to a TV and a smartphone and/or tablet.

  • What about feature phones?
  • Did they ask everyone on the same day, random days, or are some days over represented here?
  • This is self-reported, not tracked…who accurately remembers their screen time on each device a day later? I imagine the vast majority of answers were round numbers (30 or 45 minutes, 2 hours). Yet the chart displays precision to the minute, which the respondents did not really provide.

In fact the Council for Research Excellence found that self-reported screen time does not correlate with actual screen time. “Some media tend to be over-reported whereas others tend to be under-reported – sometimes to an alarming extent.” -Mike Bloxham, director of insight and research for Ball State

Visual

The visual has the usual problem with stacked bar charts: it is easy to judge the first bar and the total, but not the values in between. This may not be an issue given the question, but the presentation focuses on an individual piece of tech (smartphones), so the design should focus on smartphones. At the very least, smartphones should be the first column in the chart, and the chart should be sorted by smartphone usage.

My implementation is simply to compare smartphone usage to the usage of the next-highest device. Overall, 53% of the time people are using a smartphone compared to something else. I went back and forth on whether I should keep the Tablet category in the key even though it was never the first- or second-most-used device. In the end, I decided to keep it to parallel the source visual.

[Chart: Darin's first redo, smartphone usage vs. the next-highest device]
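
[Ed: the data step behind this first redo is simple. Here is a sketch with made-up minutes, not the survey's actual values:]

```python
# For each country, compare smartphone minutes with the next-highest device.
# The minutes below are made up for illustration, not the survey's actual values.
minutes = {  # country: {device: minutes per day}
    "Country A": {"TV": 132, "Laptop/PC": 117, "Smartphone": 181, "Tablet": 110},
    "Country B": {"TV": 147, "Laptop/PC": 103, "Smartphone": 91, "Tablet": 43},
}

for country, by_device in minutes.items():
    phone = by_device["Smartphone"]
    others = {d: m for d, m in by_device.items() if d != "Smartphone"}
    runner_up, runner_min = max(others.items(), key=lambda kv: kv[1])
    verdict = "more than" if phone > runner_min else "less than"
    print(f"{country}: smartphone {phone} min, {verdict} {runner_up} ({runner_min} min)")
```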

Despite the data problems, I was really interested in seeing the breakdown by device within each country, so I built the chart below with ranks added (in bold). I also built some simple interaction to sort by a column when you click its header [Ed: I did not attach the interactive excel sheet that came with the submission]. As a final touch, I displayed the color corresponding to the most-used device as a box to the left of the country name. It's easy to see that the vast majority of countries use smartphones the most.

[Chart: Darin's second redo, device usage by country with ranks]

***

Hope you enjoyed Darin's analysis and revamp of the chart. The diagnosis is spot on. I like the second revision of the chart, especially for analysts who really want to know the exact numbers. The first redo has the benefit of greater simplicity, though it can be a tough sell to an audience: using color to indicate the second most popular device dissociates the color from the length of the bar.

The biggest problem in the original treatment is the misalignment of the data with the question being asked. In addition to the points made by Darin, the glaring issue relates to the respondent population. The analysis only includes people who have at least a smartphone or a tablet. But many people in less-developed countries own neither device. In those countries, TV screen time is likely strongly underestimated: people who watch TV but do not own a smartphone or tablet are simply dropped from consideration.

For this same reason, the other footnoted claim, that the sampling frame accounts for ~70 percent of the global population, is irrelevant.


Missing data, mysterious order, reverse causation wipe out a simple theory

New York Times columnist Floyd Norris published a set of charts purporting to show that the housing market in the U.S. is on the mend. Not so quick, Floyd.

His theory - originating from an economist at Hanley Wood, a real estate research firm - is that in a recovering market, the share of new home sales by home builders should be higher than the share by banks, as the bank share is associated with foreclosed houses. The data offered are both in aggregate and by region. I'm particularly interested in the regional chart from a design perspective.

The published chart is the one shown on the left below. I am not a fan of nested bar charts. I don't think there is any justification for treating two data series (here, share by banks and share by builders) differently. Which of the two series should one assign to the fatter bars?

If we slim the fat bars down, we retrieve the more conventional paired bars chart, shown on the right. Of the two, I prefer the paired version.

[Image: nested bar chart (left) and paired bar chart (right) of home-sale shares]

***

[Image: stacked bar version of home-sale shares]

There is a weakness with both versions. The theory rests on the relative share, which is clearer in a stacked presentation, as shown on the right.

This presentation also shines the light on a dark corner of Norris's analysis. In every city but Detroit, an unmentioned group of sellers accounts for the majority of home sales! Nowhere in the article did Norris tell readers who those sellers are, and why they are ignored.

In all these charts, I have kept the original order of cities. Before reading further, see if you can tease out the criterion for sorting the cities.

With some effort, you'll learn that the cities are arranged in the order of degree of housing recovery, which is measured by the difference in share: the cities at the top (Houston, Dallas, etc.) have a higher share of builders selling than banks selling.

Ironically, the difference in share is the least emphasized data in a nested bar chart. In fact, how you compute the difference depends on the relative share! When the olive bar is longer than the blue bar, the reader sizes up the white space between the edges of the bars; when the blue bar is longer, though, the reader must look inside the blue area, and compute the interior distance.

The reader can use some help here. Possible fixes include adding a footnote, a note informing readers that up implies stronger recovery, or a visual separation between the cities in which the share by builders exceeds that by banks and the cities in which the reverse holds.

Here is a dotplot with annotations. The separation between the dots is easily estimated.

[Chart: redone dot plot of home-sale shares, with annotations]

***

Recall the theory that in recovering markets, banks account for a lesser share of home sales. The analyst turned this into a metric by taking the difference between the share by builders and the share by banks.

This metric is highly problematic. The first problem, already discussed, is that there exist more than these two types of sellers, and it is absolutely not the case that if the share by banks goes down, the share by builders goes up.

Another issue is that the structure of the housing market probably differs across cities. The chart promotes the view that there is a general trend extending to all markets. In fact, the variation over time within one city should be more telling than the variation across twenty cities at a point in time.

And there is the third strike.

This is a confusion between forward and reverse causation (see Andrew's post here for a general discussion of this important practical issue). The Floyd Norris/Hanley Wood theory expresses a forward causation: if a housing market is recovering, then banks will work through their inventory of foreclosed homes, and account for a decreasing share of home sales.

The analysis addresses the reverse of this relationship. The analyst observes that banks (in some cities) are selling fewer homes, and concludes that the housing market is recovering. Notice that this is a problem of reverse causation: instead of cause -> effect, we have effect -> cause. The rub is that any given outcome has many possible causes. Banks sell fewer homes for many possible reasons, only one of which is a recovering market. 

Here are some other possibilities: the banks expect prices to rise in the future and are holding on to the inventory; the economy is sputtering and banks are tightening mortgage lending, making it harder to sell homes; instead of selling the homes, the banks decide to destroy them to reduce supply and raise prices; the mysterious third group of sellers has put a lot of homes on the market; and so on.

In making claims based on observational data, one must conduct side investigations to rule out other causes.

***

[Image: Trifecta Checkup diagram]

From a Trifecta Checkup perspective, this chart addresses an interesting Question. The Visual design has hiccups. The biggest problem is that the Data provide an unsatisfactory answer to the question at hand. (Type DV)