A great visual of complicated schedules

Reader Joe D. tipped me off about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e., the T).

The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:

Mbta_viz_1

I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I use the word correlated because the distances between stations are not constant as portrayed in this chart). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:

  • For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
  • The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.

Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.

Mbta_viz_2

 Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I have clicked on before.

This is a master class in linking multiple charts and using interactivity wisely.

***

You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that project designers have blessed with the top position of their Github page.

However, the image above allows us to see why the Marey diagram is the far superior representation of the data.

What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)

Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.

It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.
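The behind-schedule definition above can be sketched in a few lines. This is only an illustration: the function name `late_trips`, the trip ids, and the durations are all made up, and it assumes we have one end-to-end duration (in minutes) per trip on a single route.

```python
from statistics import mean

def late_trips(durations_by_trip):
    """Return the ids of trips slower than the average trip on the same route.

    durations_by_trip maps a (hypothetical) trip id to its end-to-end
    duration in minutes.
    """
    average = mean(durations_by_trip.values())
    return sorted(trip for trip, minutes in durations_by_trip.items()
                  if minutes > average)

# Invented durations for four trips on one route; the average is 44.5 minutes.
example = {"A": 40, "B": 42, "C": 55, "D": 41}
print(late_trips(example))  # ['C']
```

On the Marey diagram, the same comparison is done by eye: a late train is a line whose slope differs from that of its neighbors.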

Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.

***

On that last question, the project designers offer up an alternative Marey. Think of this as an indexed view. Each trip is indexed to its starting point. The following setting shows the morning rush hour compared to the rest of the day:

Mbta_viz_3

I think they could utilize this display better if they did not show every single schedule but showed the hourly average instead. Rather than letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which, according to the text, are the morning rush, afternoon rush, midday lull and evening lull.

The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.

***

There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.

Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.


Two good maps, considered

A Reflection on the past year:

Thanks to you for continuing to make this blog a success. Writing it has given me much enjoyment over the years, and I have learned much from your comments as well as from the visualization projects of many colleagues. 2013 also saw the publication of my new book Numbersense: How to Use Big Data to Your Advantage (link). I thank those of you who have purchased the book, and supported my writing. For those who haven't, please check it out. I have also been speaking at various events, mostly about interpreting data analyses published in the mass media, and building effective data analytics teams. In addition, I am heavily involved in the new Certificate in Analytics and Data Visualization at New York University (link). While the frequency of posting has suffered a little due to my other projects, I hope you found the contents as engaging, fun, and constructive as before.

Looking forward to 2014, I have as usual a basket of projects. Besides the two blogs, I will be expanding my teaching at NYU, including a visualization workshop that I'll be writing about here soon; taking on consulting projects; evangelizing better communications of data and analytics; and prospecting several book projects. I continue to spend most of the week at Vimeo, where my team analyzes data.

This will be my last post in 2013. It is an extra-long post to tide you over to the New Year. Happy New Year!

Kaiser

***

A short while ago, I was in correspondence with Thomas Rhiel, who created a lovely map depicting the age of buildings in Brooklyn (link). In this case, it's the data that piques my interest. I haven't seen this type of data visualized before. The map type is exquisitely aligned to the data: buildings are geographically located and the age is a third, non-geographical dimension which is encoded in the colors. Red-orange is the most recent while green-blue is the oldest.

 

Bklynr_bldgmap1

 

The data is at the level of individual buildings. If you hover over a building, you find the raw data including the address and the year of construction. The details seem to show that even the shape of each building is depicted. This really impressed me since a lot of manual labor must have been applied (according to Rhiel, there is a source for this type of data). Here is the map at its most magnified:

Bklynr_bldgmap2

I came across this starry patch near the Manhattan Bridge, in which the buildings show up as red asterisks. (Rhiel said the shape came from the data. I am not sure I believe the data. Does anyone live near Sands Street?)

 

Bklynr_rhiel_q

The map is useful if you are interested in questions such as "where are the new developments" (look for the deep red buildings) or "what's the average age of the buildings in a specific block" or "what's the age distribution of the buildings in a set of blocks". At the magnified level shown above, the street names are available to help readers orient themselves. The light gray color keeps the roads and the names safely in the background.

Now, zoomed to the other extreme, we get the image of the whole of Brooklyn:

Bklynr_bldgmap3

 

I have a couple of suggestions for Rhiel. This view presumes a knowledge of Brooklyn's geography that I don't have. Unlike the magnified view, there are no text labels to help us decipher the different sections of Brooklyn. It would be nice if there were a background map to indicate the better-known areas like Williamsburg, Brooklyn Heights, or Red Hook.

The other concern is the apparent lack of pattern shown here. At this level, an appropriate question is which sections of Brooklyn are being redeveloped and which sections have older buildings. I see sprinkles of colors everywhere, giving the impression that everything is average. I suggested to Rhiel that aggregating the data would help bring out the pattern.

In data visualization, there is an obsession with plotting the "raw data" at its most granular level. Sometimes, this strategy backfires. It's the classic signal versus noise problem. Aggregation is a noise removal procedure. If, for example, Rhiel gives up the data for individual buildings, including those beloved building shapes, and looks at the average age of buildings within each block, or even each Census tract, I suspect that the resulting map would be more informative.
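To make the aggregation idea concrete, here is a minimal sketch that averages the year of construction within each block. The record layout (a block id paired with a year built) and the function name are assumptions for illustration; I have not seen how Rhiel's source data is organized.

```python
from collections import defaultdict
from statistics import mean

def average_year_by_block(buildings):
    """Average the construction year of the buildings in each block.

    buildings is an iterable of (block_id, year_built) pairs; both field
    names are hypothetical.
    """
    years = defaultdict(list)
    for block_id, year_built in buildings:
        years[block_id].append(year_built)
    return {block_id: mean(values) for block_id, values in years.items()}

# Three invented buildings spread over two blocks.
sample = [("block-1", 1900), ("block-1", 2000), ("block-2", 1950)]
print(average_year_by_block(sample))
```

Coloring each block by its average, instead of each building by its own year, trades away the building shapes in exchange for a smoother, more readable pattern. Swapping in Census tracts only changes the grouping key.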

It turns out that the Graphics team at the New York Times just published an interactive map that illustrates exactly what I suggested to Rhiel. Since this post is getting long, please go to the next post to continue reading.

 


The exception to the rule against dual axes

Dual axes are almost always a bad idea. But there is one situation in which I'd use them.

***

Last week, Alberto Cairo (link) engaged in a Twitter/blogging debate about a chart that first appeared in Reuters concerning the state of the woman CEO in the Fortune 500 companies. Here is the chart under discussion:

Original_women_ceo_left

This chart already is cleaner and more useful than the original original, which came from a research report from Catalyst (link):

Catalyst_us_ceos

Jonathan Keller re-made the Reuters chart as follows:

Keller_women_ceo_left

 

Cairo Jorge Camões contributed this version:

  Cairo_women_ceo_left

The Voila blog (link) has yet another take:

Voila_women_ceo_left

Then Chris Moore, responding to Cairo, created this view and also left some insightful comments:

Women_ceo_cmoore

***

What's at stake here? There are really three related topics of discussion.

First, there is the matter of the upper limit of the vertical axis. Three solutions were suggested: 100 percent, 50 percent, and 4 percent. (Cairo at one point suggested 25 percent, which can be wrapped into the 50 percent bucket.) In reality, this is an argument over which of two key messages should be emphasized. The first message is that women still comprise a pathetically small proportion of Fortune 500 CEOs. The second message is more hopeful, that the growth in this proportion has been quite rapid since 1995.

All versions of the chart actually display both messages. In the Reuters chart (as well as Moore and Cairo), the message about the absolute proportion of women is given as an annotation while the Keller and Voila versions extend the vertical axis, thus encoding this message directly into the chart. Conversely, the Keller and Voila versions deemphasize the growth in proportions, and so I'd have preferred to see a note about that growth when using their versions.

Voila selects a 50% upper limit because the 50/50 split has an intuitive meaning in the context of gender balance. Because the resulting chart is so visually arresting, and so biased to one of the two key messages, I'd only consider it if the point of the display is to draw attention to the female deficit.

***

The second disagreement is in using absolute counts versus relative proportions. Moore chose absolute counts. I am in this camp as well. This is primarily because we are talking about Fortune 500 and the 500 number is an idee fixe. In Moore's version, I find the data labels distracting since all the numbers are small and insignificant.

Finally, the linkage between the absolute and the relative numbers also produces multiple solutions. Cairo's post pinpoints this issue. His solution is to include an inset pie chart with an arrow to explicitly link the two views. Moore likes the inset idea, but experimented with a donut chart or a partition in place of the pie chart. He also removes the explicit guiding arrow.

***

It turns out this dataset is perfectly made for the dual axes. The absolute counts and relative proportions are in one to one correspondence because it's really only one data series expressed twice. This happy situation leads to one line that can be cross-referenced on two axes, one side showing counts and the other side showing proportions. This is shown in my version below (the orange line).

Redo_women_ceo

In addition to having two axes, I have plotted two related data series. The second series (in red) shows the incremental change in the number of women CEOs from the previous year (also shown in both counts and proportions).

The first series (the same one everyone plotted) draws attention to the first message, that the growth rate of women CEOs is quite strong since 1995. The second series is a bit of a downer on that message, suggesting that from the absolute count perspective, the progress (only one or two additions per year) has been painfully slow, and not that impressive.
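The reason one line can be read off two axes is that the proportion is just the count divided by a fixed 500. A small sketch, with invented counts (not the Catalyst figures) and hypothetical function names, shows both the conversion and the year-over-year increments of the second series:

```python
FORTUNE_500 = 500  # fixed denominator: this is what makes the two axes interchangeable

def to_percent(count, total=FORTUNE_500):
    """Convert a count of women CEOs into a percentage of the Fortune 500."""
    return 100.0 * count / total

def yearly_increments(counts):
    """Change in the count from each year to the next."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]

# Invented counts for five consecutive years, for illustration only.
counts = [2, 3, 5, 6, 8]
print([to_percent(c) for c in counts])  # [0.4, 0.6, 1.0, 1.2, 1.6]
print(yearly_increments(counts))        # [1, 2, 1, 2]
```

Because the conversion is a constant rescaling, every tick on the count axis lines up with exactly one tick on the percent axis (20 CEOs is 4 percent, for instance). That is not true of the typical dual-axis chart, where the two scales are chosen independently.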

Thanks again to Alberto for making me aware of this discussion. This has been fun!

 

PS. I have left out the other chart and may return to it in a future post.


Beyond the obvious

Flowing Data has been doing some fine work on the baby names data. The names voyager is a successful project by Martin Wattenberg that has received praise from many corners. It's one of these projects that have taken on a commercial life as you can see from the link.

Here is a typical area chart presentation of the baby names data:

Namevoyager

The typical insight one takes from this chart is that the name "Michael" (as a boy's name) reached a peak in the 1970s and has not been as popular lately. The data is organized as a series of trend lines, for each name and each gender.

Speaking of area charts, I have never understood their appeal. If I were to click on Michael in the above chart, the design responds by restricting itself to all names starting with "Michael", meaning it includes Michael given to a girl, and Michaela, for example. See below.

Namevoyager_michael

What is curious is that the peak has a red lining. At first thought, one expects to find hiding behind the blue Michael a girl's name that is almost as popular. But this is a stacked area chart so in fact, the girl's name (Michael given to a girl, if you mouse over it) is much less popular than the boy Michael (20,000 to 500 roughly).

***

Nathan decides to dig a layer deeper. Is there more information beyond the popularity of baby names over time?

In this post, Nathan zeroes in on the subset of names that are "unisex," that is to say, have been used to name both boys and girls. He selects the top 35 names based on a mean-square-error criterion and exposes the gender bias for each name. The metric being plotted is no longer pure popularity but gender popularity. The larger the red area, the greater the proportion of girls being given that name.

You can readily see some interesting trends. Kim (#34) has become almost predominantly female since the 1960s. On the other hand, Robbie (#18) used to be predominantly female but is now mostly a boy's name.

Most-unisex-names1

 One useful tip when performing this analysis is to pay attention to the popularity of each name (the original metric) even though you've decided to switch to the new metric of gender bias. This is because the relative proportions are unstable and difficult to interpret for less popular names. For example, the Name Voyager shows no values for Gale (#29) after the 1970s, which probably explains the massive gyrations in the 1990s and beyond.
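One way to act on this tip is to refuse to compute the gender proportion when too few babies were given the name in that year. The sketch below is mine, not Nathan's method; the function name and the cutoff of 100 babies are arbitrary assumptions.

```python
def girl_share(girls, boys, min_total=100):
    """Proportion of babies with this name who are girls.

    Returns None when fewer than min_total babies carry the name, since
    the ratio is too unstable to interpret. The default cutoff of 100 is
    an arbitrary choice for illustration.
    """
    total = girls + boys
    if total < min_total:
        return None
    return girls / total

print(girl_share(75, 25))  # 0.75
print(girl_share(3, 1))    # None: same ratio, but only four babies
```

Both calls have the same 3-to-1 ratio, yet only the first is based on enough babies to be worth plotting.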


Graph redesign is hot

Joe D., a long-time reader, points us to a few blogs that have been actively creating redesigns of charts, similar to how we do it here.

First up, here are some examples from Storytelling With Data (link).

This example transformed a grouped bar chart into a line chart, something that I have long advocated. I'm still waiting for the day when market research companies start to switch from bars to lines.

Stwd_Student Makeover 2

***

Jorge Camoes, also a long-time reader, produced a redesign of a chart on military spending first printed in Time magazine. (link)

  Redo_militaryspend

Dual-axis plots have been pilloried here often, especially when the two axes have different and incompatible units, as in here. As usual, transforming to a scatter plot is a good first step, which is what Jorge has done here. He then connected the dots to indicate the time evolution of the relationship. This is a smart move here just because the pattern is so stark.

The chart now illustrates an "inflexion point" in 2000. Prior to 2000, troop size was decreasing while the budget was stable. After 2000, budget increased sharply while troop size remained relatively stable.

Now peer back at the original chart. You can discern the sharp decrease in troop size over time, and the sharp increase in budget over time, but separately. The chart teases a cross-over point around 1995 which turned out to be misleading. This is a great illustration of why dual-axis plots are dangerous.


Once more, superimposing time series creates silly theories

After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.

The contentious issue is that X and Y might appear correlated but in fact, what we are observing is that both data series are strongly correlated with time (e.g. population almost always grows with time), and X and Y may not be correlated with each other.

Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.

The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.

Redo_milesdriven_1

The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X and Y and ignoring time, we introduce time as an omitted variable, which can still be controlling both X and Y series.

The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.

This is because time is still the invisible hand. Time is running from left to right on the chart still. This pattern is visible if we have line segments connecting the data in temporal order, as in the chart below.

Redo_milesdriven2 

 

***

One solution to this problem is to de-trend the data. We want to remove the effect of time from each of the two data series individually, then we plot the residual signals against each other.

Redo_milesdriven_3

Here is the result (right). We now have a random scatter of points that average about zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, the per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.

What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.
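For readers who want to try this, here is a minimal sketch of the procedure. It assumes a straight-line trend fitted by least squares, which is only one of many ways to establish a trend and not necessarily the one behind the chart above. The two series are invented so that each rises with time but their deviations from trend are unrelated.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

def detrend(ys):
    """Subtract the fitted linear trend, leaving what is above or below trend."""
    slope, intercept = linear_fit(ys)
    return [y - (intercept + slope * t) for t, y in enumerate(ys)]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Invented series: both trend upward, with unrelated wiggles around the trend.
x = [1, 0, 3, 2, 5, 4]
y = [1, 0, 5, 7, 6, 11]

print(correlation(x, y))                    # about 0.77: the shared time trend
print(correlation(detrend(x), detrend(y)))  # essentially zero once the trend is removed
```

The first number is the illusion created by time; the second is what remains when each series is measured against its own trend.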




 

 


Superimposing time series is the biggest source of silly theories

Business Insider (link) published the following chart and declared "the end of the car age in one chart". The chart superimposed the monthly motor vehicle miles driven per capita and the labor force participation rate.

Bi_milesvspartiipation

This is the conclusion of the post:

There's a logical connection between the two. Not in the workforce? You're less inclined to drive.

It's strange that they chose to show a time series going back to the 1970s. The conclusion is logical only for the last five years of the data. Looking back even another decade, to the last recession (2001), one finds the exact opposite conclusion: as the work force participation rate fell, the per-capita miles driven went up.

The other problem is causation creep, about which I have written on the sister blog (link). This chart merely shows correlation (and that is questionable). The conclusion of cause and effect is purely theory. Another theory would be the rise in telecommuting and work-from-home situations. A counter-theory would be that the unemployed may have more free time to drive. Another theory is that gas prices have gone up:

US-Fuel-Prices-Long-2-19-2013

Any time series you can find that has a peak during the 2000s can be similarly interpreted as having caused people to stop driving. Here's a chart of real house prices from Calculated Risk.

RealPricesDec2012

Falling house prices cause people to stop driving. Or perhaps falling house prices cause people to lose jobs.


Bad charts can happen to good people

I shouldn't be surprised by this. No sooner did I sing the praise of Significance magazine (link) than a reader sent me to some charts that are not deserving of their standard.

Here is one such chart (link):

Sig_ukuni1
Quite a few problems crop up here. The most hurtful is that the context of the chart is left to the text. If you read the paragraph above, you'll learn that the data represents only a select group of institutions known as the Russell Group; and in particular, Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision as the designer weighs one missing year against one missing institution (and a mighty important one at that). This issue is easily fixed by a few choice words.

You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". This chart draws our attention to the tangle of up and down segments, giving us the impression that the data is too complicated to extract a clear message.

The decision to use 21 colors for 21 schools is baffling as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is the fact that you need more than say three or four colors.

The order of institutions listed in the legend is approximately the reverse of their appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the order of legend entries.

If the whitespace were removed (I'm talking about the space between 0% and 2.25% and between 8% and 10%), the lines could be more spread out, and perhaps labels can be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.

The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart. The schools are ordered by their "inclusiveness" from left to right.

Redo_hesa

This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).

In addition, I see a negative reputation effect, with the proportion of students from low-participation areas decreasing with increasing reputation. I'm basing this on name recognition. Perhaps UK readers can confirm if this is correct. If correct, it's a big miss in terms of interesting features in this dataset.

 

 


A chart that stops the story-telling impetus

We all like to tell stories. One device that has produced a lot of stories, and provoked much imagination is the dual-axis plot showing two time series. Is there a correlation or is there not? Unfortunately, most of these stories are false.

Claremont_homes

Looking at the following chart (link) showing the home sales and median home price in Claremont over the last six years, one gets the sense that the two variables move in tandem, kind of. Both time series appear to reach a peak in 2006 and a trough in 2011. In 2010, both series seem to be levelling off.

When the designer places two series on the same chart, he or she is implicitly saying: there is an interesting relationship between these two data sets.

But this is not always the case. Two data sets may have little to do with each other. This is especially true if each data set shows high variability over time as in here.

***

Below is another view of the same data. In order to visualize any year-to-year effect or quarterly effect, I split the data along those dimensions. The year-to-year effect is quite strong although there isn't any interesting pattern. The quarterly effect is not so strong, and as the directions of the paths indicate, this effect is not consistent from year to year.

Redo_claremont

The scales on each axis are "standardized" meaning 0 is the average value, 1 is one standard deviation above the average, etc. Movements of 1 to 2 standard deviations are not unusual so one can see that almost all values on the chart are within 2 SD.
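Standardizing is a one-line formula: subtract the average and divide by the standard deviation. A minimal sketch follows, using invented numbers rather than the Claremont data. It uses the population standard deviation; statistical software often divides by n - 1 instead, and I am not asserting which one JMP applies.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values so that 0 is the average and 1 is one standard deviation above it."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption, see the note above
    return [(v - mu) / sigma for v in values]

# Invented values with mean 5 and standard deviation 2.
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Putting both home sales and median price on this scale is what allows them to share one chart without the arbitrary choices of a dual-axis plot.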

There just doesn't seem to be a compelling story here. This chart taxes our imagination.

PS. In case you're wondering, this chart is made using Graph Builder in JMP (except for the arrows). I also wish JMP would allow me to use 1, 2, 3, 4 (column data) as my plot objects instead of the standard dots and crosses, etc.

[4/11/2012: Thanks to Ken L. for submitting this chart. Also, Rob Simmon on Twitter points out that the house price data should be inflation-adjusted.]


Necessity is the mother of invention

When there's a need to wow audiences with smart data analysis, there's invention.

Let's start with the U.S. home ownership data. The total occupied homes are subdivided into owner-occupied and renter-occupied. Thus, in any given year, we can compute the proportion of homes that are owner- or renter-occupied. We use blue for owner and red for renter, as follows:

Redo_owner1

Just to confirm, if we superimpose these two charts, we see that the proportions add up to 100%. One chart is the mirror image of the other:

Redo_owner2

Now that we have confirmed the data is okay, we pull the charts apart. We change the scale of the renter chart so that the change over time is more clearly displayed. Since the home ownership bubble burst, it's the rental market that has grown.

Redo_owner3

 

It's time for some magic! We superimpose the charts again to obtain this:

Redo_owner4

[Ed: The remainder of the post below is modified from the original version based on reader comments]

The chart designer managed to make the two data series look different even though one series is the mirror image of the other.

***

The inspiration of this post came from reader Leanne C. who submitted this MSNBC chart:

Msnbc_renternation

Initially, I mistakenly assumed what is plotted are proportions. It just so happened that the total occupied units in the U.S. is in the 100M range and the owner v. rental are split 70M / 30M. I looked at the left end of the chart, and saw in 2001, about 33 of rental and about 69 of owner, which happens to add up to 100 (with rounding error). But if I had looked at the right end of the chart, where rental is 39 and owner is 75, then it would have been clear it's not adding up.

In any case, this chart looks different if we make the scales the same. In the following, each unit of both axes represents 2M units. There really is no justifiable reason why the scales should be different given that they both measure the same objects.

Redo_owner5

But using different ranges on each axis also presents a challenge: it is tempting to read meaning into the gaps between the two lines but these gaps merely reflect the choice of axis ranges.

Instead, we should convert all these units into growth indices. Let 100 be the year 2001 units. The following chart then shows what's really going on in housing:

Redo_owner7
Between 2001 and 2008, renter- and owner-occupied units experienced the same total growth (about 4%) although the trajectories were different: owner-occupied units went up steadily during this period while renter-occupied units declined until 2004 and then grew at a faster rate between 2004 and 2008. Since 2008, renter-occupied units have continued at about the same growth rate while owner-occupied units have flattened out and may be slightly declining.
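The growth index used above takes one line to compute: divide every value by the base-year value and multiply by 100. A sketch with invented numbers (not the actual housing counts) and a hypothetical function name:

```python
def growth_index(values, base_position=0):
    """Express each value relative to the base period, which is set to 100."""
    base = values[base_position]
    return [100.0 * v / base for v in values]

# Invented unit counts (in millions) for three years; the first is the base year.
print(growth_index([50, 52, 55]))  # [100.0, 104.0, 110.0]
```

Once both series start at 100, they can share a single axis, and the gap between the lines means something: it is the difference in cumulative growth since the base year.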