A great visual of complicated schedules

Reader Joe D. tipped me off about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).

The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:


I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I use the word correlated because the distances between stations are not constant as portrayed in this chart). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:

  • For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
  • The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.

Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.


Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I clicked on earlier.

This is a master class in linking multiple charts and using interactivity wisely.


You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that the project designers have blessed with the top position on their GitHub page.

However, the image above allows us to see why the Marey diagram is the far superior representation of the data.

What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)

Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.

It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.
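To make the "behind schedule" definition concrete, here is a minimal sketch. The trip IDs, routes and durations are made up for illustration, and the function name is my own; the project itself may define lateness differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trips: (trip_id, route, end-to-end minutes). Made-up values.
trips = [
    ("R1", "Red", 42.0),
    ("R2", "Red", 44.0),
    ("R3", "Red", 58.0),
    ("B1", "Blue", 20.0),
    ("B2", "Blue", 22.0),
]

def behind_schedule(trips):
    """Return the trips that took longer than the average trip on their route."""
    by_route = defaultdict(list)
    for _, route, minutes in trips:
        by_route[route].append(minutes)
    route_mean = {route: mean(values) for route, values in by_route.items()}
    return [tid for tid, route, minutes in trips if minutes > route_mean[route]]

late = behind_schedule(trips)
```

On the Marey diagram, these are the lines that look flatter or steeper than their neighbors, depending on which axis carries time.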

Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.


On that last question, the project designers offer up an alternative Marey. Think of this as an indexed view. Each trip is indexed to its starting point. The following setting shows the morning rush hour compared to the rest of the day:


I think they could make better use of this display if they showed the hourly average instead of every single schedule. Instead of letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which, according to the text, are the morning rush, afternoon rush, midday lull and evening lull.

The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.
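Here is a rough sketch of what I mean by indexing each trip to its start and then averaging by hour. The stop times are invented, and the helper names are mine, not the project's.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trips: minutes after midnight at each of three stations (made-up).
trips = [
    [420, 426, 434],  # departs 7:00
    [435, 443, 453],  # departs 7:15
    [720, 725, 731],  # departs 12:00
]

def index_to_start(trip):
    """Re-express each stop as minutes elapsed since the trip's first stop."""
    return [t - trip[0] for t in trip]

def hourly_average(trips):
    """Average the indexed trips that depart within the same clock hour."""
    by_hour = defaultdict(list)
    for trip in trips:
        by_hour[trip[0] // 60].append(index_to_start(trip))
    return {
        hour: [mean(stop) for stop in zip(*indexed)]
        for hour, indexed in by_hour.items()
    }

averages = hourly_average(trips)
```

With one line per hour, the number of lines no longer depends on how many trains ran, so the steepness of the lines has nothing to compete with.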


There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.

Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.

Two good maps, considered part 2

This is a continuation of my previous post on the map of the age of Brooklyn's buildings, in which I suggested that aggregating the data would bring out the geographical patterns better.

For its map illustrating the pattern of insurance coverage in several large cities across America, the New York Times team produced two versions, one using dots to plot the raw data (at the finest level, each dot represents 40 residents) and another showing aggregate data to the level of Census tracts.

We can therefore compare the two views side-by-side.


The structure of this data is similar to that of the Brooklyn map. Where Rhiel has age of buildings as the third dimension, the NYT has the insurance status of people living in Census tracts. (Given that the Census does not disclose individual responses, we know that the data is really tract-level. The "persons" being depicted can be thought of as simulated.) The NYT data poses a greater challenge because it is categorical. Each "person" has one of four statuses: "uninsured", "public insurance", "private insurance" and "both public and private insurance". The last category is primarily due to aggregation to the tract level. By contrast, the Brooklyn data is "continuous" (ordinal, to be specific) in the year of construction.

The aggregated chart at the bottom speaks to me much more loudly. What it gives up in granularity, both at the geographical level and at the metric level, it gains in clarity and readability. The dots on the top chart end up conveying mostly information about population density across Census tracts, which distracts readers from taking in the spatial pattern of the uninsured. The chart at the bottom aggregates the data to the level of a tract. Also, instead of showing all four levels of insuredness, the chart at the bottom concentrates its energy on showing the proportion of uninsured.

In short, the chart that uses fewer elements (areas rather than dots), fewer colors, fewer individual data points ends up answering the question of "mapping uninsured Americans" more effectively. (It is a common misunderstanding that aggregation throws away data -- in fact, aggregation consumes the data.)
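As a small illustration of aggregation consuming the data, the sketch below collapses categorical "persons" into one number per tract. The tract names and records are invented; I have no knowledge of how the NYT team actually processed theirs.

```python
from collections import Counter

# Hypothetical simulated "persons": (tract, insurance status). Made-up records.
people = [
    ("tract_A", "uninsured"),
    ("tract_A", "private insurance"),
    ("tract_A", "uninsured"),
    ("tract_A", "public insurance"),
    ("tract_B", "private insurance"),
    ("tract_B", "private insurance"),
    ("tract_B", "uninsured"),
    ("tract_B", "both public and private insurance"),
]

def uninsured_share(people):
    """Proportion of uninsured persons in each tract."""
    total = Counter(tract for tract, _ in people)
    uninsured = Counter(tract for tract, status in people if status == "uninsured")
    return {tract: uninsured[tract] / total[tract] for tract in total}

shares = uninsured_share(people)
```

Every record contributes to the result; the four statuses simply get reduced to the one proportion the chart is about.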


When designers choose to plot raw data, they often find a need to compensate for its weakness of losing the signal in the noise. One of the strategies is to produce a hover-over effect that shows aggregated statistics, like this:


Notice the connection between this and my previous comment. What the aggregated map displays are two elements of the hover-over: the boundary of the Census tract, and the first statistic (the proportion of uninsured).

In addition to the hassle of having to hover over different tracts asynchronously, the reader also loses the ability to interpret the statistics. For example, is the proportion of uninsured (21.4%) a good or bad number? The reader can't tell unless he or she has an understanding of the full range of possibilities. In the other chart, this task has been performed by the designer when constructing the legend:


This trade-off between relative and absolute metrics is one of the key decisions designers have to make all the time. Relative metrics also have problems. For instance, on the bottom chart, the reader loses the understanding of the relative population density between different Census tracts.

A similar design problem faced by Rhiel in the Brooklyn chart is whether to use the year of construction (e.g. 2003) as the metric or the age of buildings (10 years old). Rhiel chose the former while some other designer would have selected the latter.


Again, thanks for reading, and see you next year!


Two good maps, considered

A Reflection on the past year:

Thanks to you for continuing to make this blog a success. Writing it has given me much enjoyment over the years, and I have learned much from your comments as well as from the visualization projects of many colleagues. 2013 also saw the publication of my new book Numbersense: How to Use Big Data to Your Advantage (link). I thank those of you who have purchased the book, and supported my writing. For those who haven't, please check it out. I have also been speaking at various events, mostly about interpreting data analyses published in the mass media, and building effective data analytics teams. In addition, I am heavily involved in the new Certificate in Analytics and Data Visualization at New York University (link). While the frequency of posting has suffered a little due to my other projects, I hope you found the contents as engaging, fun, and constructive as before.

Looking forward to 2014, I have as usual a basket of projects. Besides the two blogs, I will be expanding my teaching at NYU, including a visualization workshop that I'll be writing about here soon; taking on consulting projects; evangelizing better communications of data and analytics; and prospecting several book projects. I continue to spend most of the week at Vimeo, where my team analyzes data.

This will be my last post in 2013. It is an extra-long post to tide you over to the New Year. Happy New Year!



A short while ago, I was in correspondence with Thomas Rhiel who created a lovely map depicting the age of buildings in Brooklyn (link). In this case, it's the data that piques my interest. I haven't seen this type of data visualized before. The map type is exquisitely aligned to the data: buildings are geographically located and the age is a third, non-geographical dimension which is encoded in the colors. Red-orange is the most recent while green-blue is the oldest.




The data is at the level of individual buildings. If you hover over a building, you find the raw data including the address and the year of construction. The details seem to show that even the shape of each building is depicted. This really impressed me since a lot of manual labor must have been applied (according to Rhiel, there is a source for this type of data). Here is the map at its most magnified:


I came across this starry patch near the Manhattan Bridge, in which the buildings show up as red asterisks. (Rhiel said the shape came from the data. I am not sure I believe the data. Does anyone live near Sands Street?)



The map is useful if you are interested in questions such as "where are the new developments" (look for the deep red buildings) or "what's the average age of the buildings in a specific block" or "what's the age distribution of the buildings in a set of blocks". At the magnified level shown above, the street names are available to help readers orient themselves. The light gray color keeps the roads and the names safely in the background.

Now, zoomed to the other extreme, we get the image of the whole of Brooklyn:



I have a couple of suggestions for Rhiel. As someone who is not familiar with the geography of Brooklyn, this view presumes knowledge that I don't have. Unlike the magnified view, there are no text labels to help us decipher the different sections of Brooklyn. It would be nice if there is a background map to indicate the better-known areas like Williamsburg or Brooklyn Heights or Red Hook, etc.

The other concern is the apparent lack of pattern shown here. At this level, an appropriate question is which sections of Brooklyn are being redeveloped and which sections have older buildings. I see sprinkles of colors everywhere, giving the impression that everything is average. I suggested to Rhiel that aggregating the data would help bring out the pattern.

In data visualization, there is an obsession with plotting the "raw data" at its most granular level. Sometimes, this strategy backfires. It's the classic signal versus noise problem. Aggregation is a noise removal procedure. If, for example, Rhiel gives up the data for individual buildings, including those beloved building shapes, and looks at the average age of buildings within each block, or even Census tracts, I suspect that the resulting map would be more informative.
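A minimal sketch of the block-level aggregation I have in mind follows. The block IDs and construction years are invented, and the reference year of 2013 is my assumption for turning a year of construction into an age.

```python
from collections import defaultdict
from statistics import mean

REFERENCE_YEAR = 2013  # assumption: ages are measured as of this year

# Hypothetical buildings: (block_id, year of construction). Made-up values.
buildings = [
    ("block_1", 1910),
    ("block_1", 1930),
    ("block_2", 2003),
    ("block_2", 2011),
]

def mean_age_by_block(buildings, reference_year=REFERENCE_YEAR):
    """Average age of the buildings on each block."""
    years = defaultdict(list)
    for block, year in buildings:
        years[block].append(year)
    return {block: reference_year - mean(ys) for block, ys in years.items()}

ages = mean_age_by_block(buildings)
```

Coloring each block by this single number would trade the building shapes for a pattern that can be read at the borough level.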

It turns out that the Graphics team at the New York Times just published an interactive map that illustrates exactly what I suggested to Rhiel. Since this post is getting long, please go to the next post to continue reading.


There's nothing wrong with Eli Manning on this chart

The Giants QB Eli Manning is in the news for the wrong reason this season. His hometown paper, the New York Times, looked the other way, focusing on one metric that he still excels at, which is longevity. Think of him as football's Cal Ripken. The graphic (link) though is fun to look at while managing to put Eli's streak in context. It is a great illustration of recognizing foreground/background issues. (I had to snip the bottom of the chart.)


After playing around with this graphic, please go read Kevin Quealy's behind-the-scenes description of the various looks that were discarded (link). He showed 19 sketches of the data. Sketching cannot be stressed enough. If you don't have discarded sketches, you don't have a great chart.

Pay attention to tradeoffs that are being made along the way. For example, one of the sketches showed the proportion of possible games started:


I like this chart quite a bit. The final selection arranges the data by team rather than by player so necessarily, the information about proportion of possible games started fell by the wayside.

(Disclosure: I'm on Team Philip. Good to see that he is right there with Eli even on this metric.)



Highlight the right elements of a chart

The big news in the tech world is Steve Ballmer's retirement announcement. Andrew Sullivan cites this chart by Derek Thompson as a reason for Ballmer's departure: (original article)


How about this version?



What makes this version better?

  • Having the Microsoft/Wintel area at the bottom means the boundary of the area traces its rise and fall
  • Choosing a heavy color for Microsoft/Wintel draws attention to the main stage
  • Focus numerical labels on the particular items that convey the story, i.e. the numbers highlighted at the top of the original chart in red
  • Subtle and sparse gridlines tied to the key message
  • Tilt labels to fit inside areas
  • Place data labels inside chart next to the highlighted features
  • Draw attention to the boundary of the Microsoft/Wintel area


Ruining the cake with too much icing

Reader Steve S. tried to spoil my new year with this chart he didn't like:


Or maybe he's just chiding me for recommending Bumps charts. This example is very confusing, a tangled mess.

But not so fast.

The dataset has two characteristics that don't sit well with bumps charts. One is too many things being ranked (twenty). Two is too much rank swapping that happens over time (14 periods).

The latter challenge can be tamed by aggregating the time dimension. For some reason, the period under examination was the first half year after the debut of these computers. Do we really need to know the weekly statistics?

We can keep all 14 periods. If so, we should be judicious in selecting the colors, the lines and dashed lines, and gridlines, and so on. In particular, look for a story and use foreground/background techniques to highlight the story.

Here's a version that focuses on the brands that moved the greatest number of ranks either up or down during this period:


Here's one that tracks how the top five fared over this period of time. It turns out that despite all the noisy movements, not much happened at the top of the rankings:


Not knowing many of these computer brands, I really have no idea why seven colors were used and why different tints of the six colors were chosen. I also don't have a clue why some lines were dashed and others were solid.

Looking closely, I learn that the Sony PC was given a black color because its label does not show up on either side. It was a product that did not rank among the top 20 at the start nor at the end of this time period. This Sony PC should be consigned to the dustbin of history, and yet in the color scheme selected for the original chart, the black solid line is the most visible!


I'd like to see an interactive layer added to this chart that brings out the "information". Two of the tabs can be "top movers" and "top five brands" as discussed above. If you hover over these tabs, the appropriate lines are highlighted.
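The two tabs boil down to two simple selections from the ranking table. Below is a rough sketch with invented brands and ranks; I am measuring movement as the net change from the first to the last period, which is only one of several reasonable definitions.

```python
# Hypothetical rankings: brand -> rank in each period (1 = best). Made-up values.
ranks = {
    "Brand A": [1, 1, 2, 1],
    "Brand B": [2, 3, 1, 2],
    "Brand C": [3, 2, 3, 6],
    "Brand D": [6, 5, 4, 3],
    "Brand E": [4, 4, 5, 4],
    "Brand F": [5, 6, 6, 5],
}

def top_movers(ranks, n=2):
    """Brands with the largest net rank change between first and last period."""
    change = {brand: r[-1] - r[0] for brand, r in ranks.items()}
    return sorted(change, key=lambda brand: abs(change[brand]), reverse=True)[:n]

def top_at_start(ranks, n=5):
    """Brands holding the best n ranks in the first period."""
    return sorted(ranks, key=lambda brand: ranks[brand][0])[:n]

movers = top_movers(ranks)
leaders = top_at_start(ranks, n=3)
```

The brands returned by either function go in the foreground with strong colors; everything else can be drawn in a light gray.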


A winning graphic of early voting

This pair of WSJ charts I like very much.


The article talks about the effect of early voting during Presidential elections in the States. People are allowed to mail in their votes as early as 2 months before the November 6 election.

The chart on the right identifies all the states that allow early voting, and in particular, it highlights (in orange) the seven battleground states that allow early voting. This shows the designer keenly aware of what's important and what's not important on the chart. The states are ordered by the first date of voting, instead of alphabetically. (I do have a question about why several of the gray lines towards the bottom of the chart do not reach November 6. Probably because mail-in voting is closed prior to Election Day in some states...)

If the data were to be available, a nice addition to this chart is to include the distribution of early votes over time. It's useful to see if North Carolina voters tend to spread their mail-in votes evenly over the 2 month period, or if most of them get sent close to Election Day, or some other pattern. Changing the bar chart to a dot plot and using the density of dots to indicate frequency would work fine here.

Instead of the first date of voting, the chart would be more informative if it plots the average date of voting (among mail-in voters). This is because the first date of voting is an extreme value and there may be few voters who vote on that day. If we have to pick one number to represent all early voters, we should pick the one with the average (or median) voting time. Again, this is constrained by whether such data is publicly released.
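To show how far apart the first date and the typical date can be, here is a small sketch that finds the median voting date from daily counts. The counts and dates are invented, since I don't have the real data.

```python
from datetime import date

# Hypothetical daily counts of early votes in one state (made-up numbers).
daily_votes = {
    date(2012, 9, 6): 10,
    date(2012, 10, 6): 40,
    date(2012, 10, 27): 150,
    date(2012, 11, 3): 300,
}

def median_voting_date(daily_votes):
    """First date by which at least half of all early votes have been cast."""
    total = sum(daily_votes.values())
    running = 0
    for day in sorted(daily_votes):
        running += daily_votes[day]
        if running * 2 >= total:
            return day
    return None  # only reached when there are no votes at all

first_day = min(daily_votes)
median_day = median_voting_date(daily_votes)
```

In this made-up example the first vote arrives two months before the election but the median voter waits until the final days, which is exactly the kind of gap the first-date ordering hides.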


The chart on the left is also well executed. The title should include the additional fact that only battleground states are depicted. I'd also extend the vertical axis to 100% since the data are proportions. The beauty of this presentation is that it functions on several levels, whether you are interested in knowing that not much changed in Iowa from 2004 to 2008, or the fact that almost 8 of 10 mail-in votes in Colorado were early votes, or that in both Colorado and North Carolina, the proportion of mail-in votes more than doubled between 2004 and 2008.

Neither of these is a fancy chart, but they pack quite a bit of useful information.


As good as Bolt

The accomplished graphics team at NYT outdid themselves with this feature on the 100m dash through Olympic history (link). You should really go and check out the full presentation.


They start with a data table like the one shown on the right. It's a boring list of names and winning times by year and by medal type. What can one do to animate this data? The NYT team found many ways.

The presentation consists of a static dot plot plus a short movie.

They found many ways to convey the meaning of the tenths and hundredths of a second that separate the top performers. In the dot plot, for example, they did not draw the actual winning times. Instead, they converted the differences in winning times into distances. Here is the right section of the chart:


We are drawn into compressing time and place, having Usain Bolt race all of the former winners and assuming everyone ran the same race they did in real life. The dot plot tells us how far ahead of each past winner Bolt is.
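One simple way to turn a time gap into a distance is to assume each runner holds a constant speed over the whole race. That is my simplification, not necessarily how the NYT team computed it, and the times below are round illustrative numbers.

```python
def metres_behind(winner_time, other_time, race_length=100.0):
    """Distance the slower runner still has to cover when the winner finishes.

    Assumes constant speed: the slower runner covers
    race_length * winner_time / other_time metres in winner_time seconds.
    """
    return race_length * (1 - winner_time / other_time)

# Illustrative times in seconds, not actual Olympic results.
gap = metres_behind(winner_time=9.6, other_time=10.0)
```

Under this assumption, four tenths of a second becomes a gap of about four metres, which is far easier to picture than the raw times.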

Some time ago, I wrote about the "audiolization" of duration data, in another piece about a NYT chart (link). They deployed this strategy beautifully at the end of the short film. The runners were aligned like keys on a piano, and the resulting sound is like playing a scale across the keyboard. Lovely, that is to say.



The authors bring in a number of other data points to create reference points for understanding this data. For example, if you blink, you might miss the national jerseys worn by each winner in the hypothetical competition:


Later, the dominance of American runners is plainly shown via white lanes:


The perspective hides the relative impotency of American sprinters in recent Olympics. This view of the surge of Caribbean runners makes up for it:



Next, they compared the times for U.S. age group record holders to Olympic winning times. This is a fun way to look at the data. (Pardon the strutting Play button.)


They play with foreground/background here in an effective way. The 15- and 16-year-old age-group record holder is said to be "good enough for a bronze as recently as 1980".

Fun aside, think twice before you repeat this "insight". It falls into the category of those things that sound impressive but are quite meaningless. For one thing, the gap between the two runners is affected by a multitude of factors: the age of the runner (which is elevated here over and above other factors), the nationality of the runner, and the time of the run. This last point is key: if we compare the 15-to-16-year-old 100m record time from 1980 to the winning times of Olympic medalists from that year, the gap would be much wider.

Also, pay attention to the distribution of runners. It gets very crowded very quickly near the top end of the scale. In other words, while the gap as measured in part-seconds may seem small, the gap as measured in individual athletes would be very wide -- we'd find loads of athletes whose times fit into the gap illustrated here.


According to the dot plot, in some years, like the 1950s, there were no gold medalists. Looking at the data here, I think this is an overplotting effect, where two times were so close that the dots were literally on top of each other. This creates the situation where one of the dots will be on top of the other, and which one is on top is a feature of the software you're using. Jittering is one common strategy to deal with this problem, or we can just place the gold, silver and bronze dots on their own levels. The latter strategy would look exactly like the over-the-top view used in the short film:


(We'll also note that this view has time running left to right, which is perhaps more natural than time running bottom up, as in the dot plot. However, we are used to seeing runners cross the finish line from left to right on a TV screen so this is a case of eight ounces and half a pound.)
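Going back to the overplotting problem in the dot plot, giving each medal its own level takes only a fixed offset per medal. Here is a sketch; the offsets and the tied results are made up for illustration.

```python
# Fixed vertical offsets per medal; the sizes are arbitrary illustrative choices.
MEDAL_OFFSET = {"gold": 0.25, "silver": 0.0, "bronze": -0.25}

# Hypothetical results: (year, medal, winning time in seconds). Made-up values.
results = [
    (1952, "gold", 10.4),
    (1952, "silver", 10.4),
    (1952, "bronze", 10.4),
]

def dot_positions(results):
    """Return (x, y) per dot: x is the time, y is the year nudged by medal."""
    return [(time, year + MEDAL_OFFSET[medal]) for year, medal, time in results]

positions = dot_positions(results)
```

Unlike random jittering, the offsets are deterministic, so the gold dot always sits in the same place relative to the silver and bronze and no medal can be hidden by the drawing order.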

In the short film, I find the gigantic play/pause button at the center of the screen an annoyance, ruining my enjoyment. (I'm using Firefox and a Mac.)


Now, go check out the entire feature (link), and applaud the effort.

Spring flowers and striking hours

Reader Joe DiNoto sent me to the following National Post (Canada) chart via Twitter, complaining about the circles. (The full chart is found here.)


This chart is supposed to show that the students in Quebec are wrong to go on strike against a roughly 10% increase in tuition fees because the cost of education in Quebec is dwarfed by those in other provinces. This particular message is visible by virtue of the small amount of space occupied by the Quebec "flower" relative to other provinces.

However, to convey that message would require only a chart of the average tuition of the seven provinces. The dataset here contains a lot more information than just the average: it has the tuition by major. But, does the general pattern of relative tuitions apply to individual majors? This chart type (a disguised bubble chart) does the reader few favors. (At least, the designer managed to keep each "petal" at the same angles; otherwise it would make our lives even harder.)


In order to bring out the tuition by major comparison, the following set of dot plots helps:


The purple dots are Quebec tuitions. The gray dots are the remaining provinces. We find that Quebec is at the bottom of the cost scale for every major. We also learn that the variance of tuition for dentistry, medicine, and law is very high. Surprisingly, the business degree is rather cheap - maybe the demand for it up north is lower?

Look what I found: two amazing charts

While doing some research for my statistics blog, I came across a beauty by Lane Kenworthy from almost a year ago (link) via this post by John Schmitt (link).

How embarrassing is the cost effectiveness of U.S. health care spending?


When a chart is executed well, no further words are necessary.

I'd only add that the other countries depicted are "wealthy nations".


Even more impressive is this next chart, which plots the evolution of cost effectiveness over time. An important point to note is that the U.S. started out in 1970 similar to the other nations.


Let's appreciate this beauty:

  • Let the data speak for itself. Time goes from bottom left to upper right. As more money is spent, life expectancy goes up. However, the slope of the line is much smaller for the US than the other countries. There is no need to add colors, data labels, interactivity, animation, etc.
  • Recognize what's important, what's not. The US line is in a different color, much thicker and properly made the foreground of the chart.
  • Rather than clutter up the chart, the other 19 lines are anonymized. They all have the same color and thickness, and all given one aggregate label. This is an example of overcoming loss aversion (see this post for more): it is ok to suppress some of the data.
  • The axis labeling is superb. Tufte preaches this clean style. There is no need to use regularly-spaced axis labels... use data-informed labels. Unfortunately, software is way behind on this issue. You can do this in R but that's about it.