An example of focusing the chart on a message

Via Jimmy Atkinson on Twitter, I am alerted to this chart from the Wall Street Journal.


The title of the article is "Fiscal Constraints Await the Next President." The key message is that "the next president looks to inherit a particularly dismal set of fiscal circumstances." Josh Zumbrun, who tipped Jimmy about this chart on Twitter, said that it is worth spending time on.

I like the concept of the chart, which juxtaposes the economic condition that faced each president at inauguration, and how his performance measured against expectation, as represented by CBO predictions.

The top portion of the graphic did require significant time to digest:


A glance at the sidebar informs me that there are two scenarios being depicted, the CBO projections and the actual deficit-to-GDP ratios. Then I got confused on several fronts.

One can of course blame the reader (me) for mis-reading the chart but I think dataviz faces a "the reader is always right" situation -- although there can be multiple types of readers for a given graphic so maybe it should say "the readers are always right."

I kept lapsing into thinking that the bold lines (in red and blue) are actual values while the gray line/area represents the predictions. That's because in most financial charts, the actual numbers are in the foreground and the predictions act as background reference materials. But in this rendering, it's the opposite.

For a while, a battle was raging in my head. There are a few clues that the bold red/blue lines cannot represent actual values. For one thing, I don't recall Reagan as a surplus miracle worker. Also, some of the time periods overlap, and one assumes that the CBO issued one projection only at a given time. The Obama line also confused me as the headline led me to expect an ugly deficit but the blue line is rather shallow.

Then, I got even more confused by the units on the vertical axis. According to the sidebar, the metric is deficit-to-GDP ratio. The majority of the lines live in negative territory. Does the negative of the negative imply positive? Could the sharp upward turn of the Reagan line indicate massive deficit spending? Or maybe the axis should be relabelled surplus-to-GDP ratio?


As I proceeded to re-create this graphic, I noticed that some of the tick marks are misaligned. There are various inconsistencies related to the start of each projection, the duration of the projection, the matching between the boxes and the lines, etc. So the data in my version is just roughly accurate.

To me, this data provides a primary reference for how presidents perform on the surplus/deficit compared to expectations as established by the CBO projections.


I decided to only plot the actual surplus/deficit ratios for the duration of each president's tenure. The start of each projection line is the year in which the projection is made (as per the original). We can see the huge gap in every case. Either the CBO analysts are very bad at projections, or the presidents didn't do what they promised during the elections.




The state of the art of interactive graphics

Scott Klein's team at ProPublica published a worthy news application called "Hell and High Water" (link). I took some time taking in the experience. It's a project that needs room to breathe.

The setting is Houston, Texas, and the subject is what happens when the next big hurricane hits the region. The reference point is Hurricane Ike, which struck Galveston in 2008.

This image shows the depth of flooding at the height of the disaster in 2008.


The app takes readers through multiple scenarios. This next image depicts what would happen (according to simulations) if something similar to Ike plus 15 percent stronger winds hits Galveston.


One can also speculate about what might happen if the so-called "Mid Bay" solution is implemented:


This solution is estimated to cost about $3 billion.


I am drawn to this project because the designers liberally use some things I praised in my summer talk at the Data Meets Viz conference in Germany.

Here is an example of hover-overs used to annotate text. (My mouse is on the words "Nassau Bay" at the bottom of the paragraph. Much of the Bay would be submerged at the height of this scenario.)


The design has a keen awareness of foreground/background issues. The map uses sparse static labels, indicating the most important landmarks. All other labels are hidden unless the reader hovers over specific words in the text.

I think plotting population density would have been more impactful. With the current set of labels, the perspective is focused on business and institutional impact. I think there is a missed opportunity to highlight the human impact. This can be achieved by coding population density into the map colors. I believe the colors on the map currently represent terrain.


This is a successful interactive project. The technical feats are impressive (read more about them here). A lot of research went into the articles; huge amounts of details are included in the maps. A narrative flow was carefully constructed, and the linkage between the text and the graphics is among the best I've seen.

Happy new year. Did you have a white Christmas?

Happy 2016.

I spent time with the family in California, wiping out any chance of a white Christmas, although I hear that the probability would have been minuscule even had I stayed.

I did come across a graphic that tried to drive the point home, via NOAA.


Unfortunately, this reminded me a little of the controversial Florida gun-deaths chart (see here):


In this graphic, the designer played with the up-is-bigger convention, drawing some loud dissent.

Begin with the question addressed by the NOAA graphic: which parts of the country have the highest likelihood of having a white Christmas? My first instinct is to look at the darkest regions, which ironically match the places with the smallest chance of snow.

Surely, the designer's idea is to play with white Christmas. But I am not liking the result.


Then, I happened upon an older version (2012) of this map, also done by NOAA. (See this Washington Post blog for example.)


There are a number of design choices that make this version more effective.

The use of an unrelated brown color to cordon off the bottom category (0-10%) is a great idea.

Similarly, the play of hue and shade allows readers to see the data at multiple levels, first at the top level of more likely, less likely, and not likely, and then at the more detailed level of 10 categories.

Finally, there is no whiteness inside the US boundary. The top category is the lightest shade of purple, not exactly white. In the 2015 version above, the white of the snowy regions is not differentiated from the white of the Great Lakes.

I am still not convinced about the inversion of the darker-is-larger convention though. How about you?




Putting a final touch on Bloomberg's terrific chart of social movements

My friend Rhonda D. wins a prize for submitting a good chart. This is Bloomberg's take on the current Supreme Court case on gay marriage (link). Their designer places this movement in the context of prior social movements such as women's suffrage and inter-racial marriage.


Previously, I mentioned New York Times' coverage using "tile maps." While the Times places geography front and center, Bloomberg prefers to highlight the time scale. (In the bottom section of Bloomberg's presentation, they use tile maps as well.)

These are the little things I love about the graphic shown above:

  • The very long time horizon really allows us to see our own lifetime as a small section of the history of the nation
  • The gray upper envelope showing the size of the union is essential background data presented subtly
  • The inclusion of "prohibition" representing a movement that failed (I wish they had included more examples of movements that do not succeed)
  • The open circle and arrow indicators to differentiate between ongoing and settled issues

They should have let the movements finish by connecting the open circles to the upper envelope. Like this:


This makes the steepness of the lines jump out even more. In addition, it makes a distinction between the movements that succeeded and the movement that failed. (Prohibition was repealed in 1933. The line between 1920 and 1933 could be more granular if such data are available.)


Designers fuss over little details and so should you

Those who attended my dataviz talks have seen a version of the following chart, which showed up yesterday in the New York Times (link):


This chart shows the fluctuation in Arctic sea ice cover over time.

The dataset is a simple time series but contains a bit of complexity. There are several ways to display this data that help readers understand the complex structure. This particular chart should be read at two levels: there is a seasonal pattern that is illustrated by the dotted curve, and then there are annual fluctuations around that average seasonal pattern. Each year's curve is off from the average in one way or another.

The 2015 line (black) is hugging the bottom of the envelope of curves, which means the ice cover is at a historic low.

Meanwhile the lines for 2010-2014 (blue) all trace near the bottom of the historic collection of curves.


There are several nice touches on this graphic, such as the ample annotation describing interesting features of the data, the smart use of foreground/background to make comparisons, and the use of countries and states (note the vertical axis labels) to bring alive the measure of coverage.

Check out my previous post about this data set.

Also, this post talks about finding real-life anchors to help readers judge size data.

My collection of posts about New York Times graphics.


PS. As Mike S. pointed out to me on Twitter, the measure is "ice cover", not ice volume so I edited the wording above. The language here is tricky because we don't usually talk about the "cover" of a country or state so I am using "coverage". The term "surface area" also makes more sense for describing ice than a country.

Cloudy and red

Note: I'm traveling during the holidays so updates will be infrequent.


Reader Daniel L. pointed me to a blog post discussing the following weather map:


The author claimed that many readers misinterpreted the red color as meaning high temperatures when he intended to show higher-than-normal temperatures. In other words, the readers did not recognize a relative scale is in play.

That is a minor issue that can be fixed by placing a label on the map.
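The difference between the two readings of the red color can be made concrete with a small sketch. The cities and temperatures below are invented purely for illustration; the only point is that a relative scale plots the observed value minus the normal value, so a cold place can show up as the "reddest" one.

```python
# Hypothetical readings in degrees Fahrenheit; none of these numbers come from the map.
observed = {"Miami": 80.0, "Chicago": 45.0, "Denver": 38.0}
normal = {"Miami": 76.0, "Chicago": 33.0, "Denver": 40.0}


def anomalies(observed, normal):
    """Return observed minus normal temperature for each location."""
    return {city: observed[city] - normal[city] for city in observed}


result = anomalies(observed, normal)
for city, delta in result.items():
    print(f"{city}: {delta:+.1f} degrees versus normal")
```

In this made-up example Chicago is much colder than Miami in absolute terms but has the larger departure from normal, which is exactly the distinction a label on the map would need to spell out.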

There are several more irritants, starting with the abundance of what Ed Tufte calls chartjunk. The county boundaries do not serve a purpose, nor is it necessary to print so many place names. The state boundaries, too, are too imposing. The legend fails to explain what the patch of green in Florida means.

The article itself links to a different view of this data on a newly launched site called Climate Prediction Center, by the National Oceanic and Atmospheric Administration (link). Here is a screenshot of the continental U.S.


This chart is the other extreme, bordering on too simple.

I'd suggest adding a little bit of interactivity to this chart, such as:

  • Hiding the state boundaries and showing them on hover only
  • Selectively printing the names of major cities to help readers orient themselves
  • Selectively printing the names of larger cities around the color boundaries
  • Using a different background map that focuses on the U.S. rather than the entire North American continent

This is a Type V chart.

A great visual of complicated schedules

Reader Joe D. tipped me about a nice visualization project by a pair of grad students at WPI (link). They displayed data about the Boston subway system (i.e. the T).

The project has many components, one of which is the visualization of the location of every train in the Boston T system on a given day. This results in a very tall chart, the top of which I clipped:


I recall that Tufte praised this type of chart in one of his books. It is indeed an exquisite design, attributed to Marey. It provides data on both time and space dimensions in a compact manner. The slope of each line is positively correlated with the velocity of the train (I use the word correlated because the distances between stations are not constant as portrayed in this chart). The authors acknowledge the influence of Tufte in their credits, and I recognize a couple of signatures:

  • For once, I like how they hide the names of the intermediate stations along each line while retaining the names of the key stations. Too often, modern charts banish all labels to hover-overs, which is a practice I dislike. When you move the mouse horizontally across the chart, you will see the names of the unnamed stations.
  • The text annotations on the right column are crucial to generating interest in this tall, busy chart. Without those hints, readers may get confused and lost in the tapestry of schedules. If you scroll to the middle, you find an instance of train delay caused by a disabled train. Even with the hints, I find that it takes time to comprehend what the notes are saying. This is definitely a chart that rewards patience.

Clicking on a particular schedule highlights that train, pushing all the other lines into the background. The side panel provides a different visual of the same data, using a schematic subway map.


Notice that my mouse is hovering over the 6:11 am moment (represented by the horizontal guide on the right side). This generates a snapshot of the entire T system shown on the left. This map shows the momentary location of every train in the system at 6:11 am. The circled dot is the particular Red Line train I clicked on earlier.

This is a master class in linking multiple charts and using interactivity wisely.


You may feel that the chart using the subway map is more intuitive and much easier to comprehend. It also becomes very attractive when the dots (i.e., trains) are animated and shown to move through the system. That is the image that the project designers have blessed with the top position of their GitHub page.

However, the image above allows us to see why the Marey diagram is the far superior representation of the data.

What are some of the questions you might want to answer with this dataset? (The Q of our Trifecta Checkup)

Perhaps figure out which trains were behind schedule on a given day. We can define behind-schedule as slower than the average train on the same route.

It is impossible to figure this out on the subway map. The static version presents a snapshot while the dynamic version has moving dots, from which readers are challenged to estimate their velocities. The Marey diagram shows all of the other schedules, making it easier to find the late trains.
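The behind-schedule definition given above (slower than the average train on the same route) can be sketched in a few lines. The trip records here are hypothetical, and I am assuming each trip covers the whole route, so a longer end-to-end time means a slower train.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trips: (route, trip id, minutes from first to last station).
trips = [
    ("Red", "R1", 42.0),
    ("Red", "R2", 44.0),
    ("Red", "R3", 58.0),
    ("Blue", "B1", 20.0),
    ("Blue", "B2", 22.0),
]


def behind_schedule(trips):
    """Return ids of trips that took longer than the mean trip on their route."""
    by_route = defaultdict(list)
    for route, _, minutes in trips:
        by_route[route].append(minutes)
    averages = {route: mean(values) for route, values in by_route.items()}
    return [trip_id for route, trip_id, minutes in trips if minutes > averages[route]]


late = behind_schedule(trips)
print(late)
```

Under this strict definition, any route whose trips are not all identical will have at least one "late" train, so in practice one would probably add a tolerance above the average.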

Another question you might ask is how a delay in one train propagates to other trains. Again, the subway map doesn't show this at all but the Marey diagram does - although here one can nitpick and say even the Marey diagram suffers from overcrowding.


On that last question, the project designers offer up an alternative Marey. Think of this as an indexed view. Each trip is indexed to its starting point. The following setting shows the morning rush hour compared to the rest of the day:


I think they could make better use of this display by showing the hourly averages instead of every single schedule. Instead of letting readers play with the time scale, they should pre-compute the periods that are the most interesting, which, according to the text, are the morning rush, afternoon rush, midday lull and evening lull.

The trouble with showing every line is that the density of lines is affected by the frequency of trains. The rush hours have more trains, causing the lines to be denser. The density gradient competes with the steepness of the lines for our attention, and completely overwhelms it.
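Here is a rough sketch of the two steps discussed above: indexing each trip to its starting point, and then replacing the individual trips with one average line per hour. The data layout and the numbers are my own assumptions, not the project's actual format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (departure hour, clock minutes at each successive station).
trips = [
    (8, [480, 484, 489, 495]),
    (8, [490, 495, 501, 508]),
    (13, [780, 783, 787, 791]),
]


def index_trip(times):
    """Re-express a trip as minutes elapsed since its first station."""
    start = times[0]
    return [t - start for t in times]


def hourly_average(trips):
    """Average the indexed trips, station by station, within each departure hour."""
    by_hour = defaultdict(list)
    for hour, times in trips:
        by_hour[hour].append(index_trip(times))
    return {
        hour: [mean(station) for station in zip(*indexed)]
        for hour, indexed in by_hour.items()
    }


averages = hourly_average(trips)
print(averages)
```

With one line per hour, the number of lines no longer depends on how many trains run, so the steepness of each line is the only thing left to compare.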


There really is a lot to savor in this project. You should definitely spend some time reviewing it. Click here.

Also, there is still time to sign up for my NYU chart-making workshop, starting on Saturday. For more information, see here.

Two good maps, considered part 2

This is a continuation of my previous post on the map of the age of Brooklyn's buildings, in which I suggested that aggregating the data would bring out the geographical patterns better.

For its map illustrating the pattern of insurance coverage in several large cities across America, the New York Times team produced two versions, one using dots to plot the raw data (at the finest level, each dot represents 40 residents) and another showing aggregate data to the level of Census tracts.

We can therefore compare the two views side-by-side.


The structure of this data is similar to that of the Brooklyn map. Where Rhiel has age of buildings as the third dimension, the NYT has the insurance status of people living in Census tracts. (Given that the Census does not disclose individual responses, we know that the data is really tract-level. The "persons" being depicted can be thought of as simulated.) The NYT data poses a greater challenge because it is categorical. Each "person" has one of four statuses: "uninsured", "public insurance", "private insurance" and "both public and private insurance". The last category is primarily due to aggregation to the tract level. By contrast, the Brooklyn data is "continuous" (ordinal, to be specific) in the year of construction.

The aggregated chart at the bottom speaks to me much more loudly. What it gives up in granularity, both at the geographical level and at the metric level, it gains in clarity and readability. The dots on the top chart end up conveying mostly information about population density across Census tracts, which distracts readers from taking in the spatial pattern of the uninsured. The chart at the bottom aggregates the data to the level of a tract. Also, instead of showing all four levels of insuredness, the chart at the bottom concentrates its energy on showing the proportion of uninsured.

In short, the chart that uses fewer elements (areas rather than dots), fewer colors, fewer individual data points ends up answering the question of "mapping uninsured Americans" more effectively. (It is a common misunderstanding that aggregation throws away data -- in fact, aggregation consumes the data.)


When designers choose to plot raw data, they often find a need to compensate for its weakness of losing the signal in the noise. One of the strategies is to produce a hover-over effect that shows aggregated statistics, like this:


Notice the connection between this and my previous comment. What the aggregated map displays are two elements of the hover-over: the boundary of the Census tract, and the first statistic (the proportion of uninsured).

In addition to the hassle of having to hover over different tracts asynchronously, the reader also loses the ability to interpret the statistics. For example, is the proportion of uninsured (21.4%) a good or bad number? The reader can't tell unless he or she has an understanding of the full range of possibilities. In the other chart, this task has been performed by the designer when constructing the legend:


This trade-off between relative and absolute metrics is one of the key decisions designers have to make all the time. Relative metrics also have problems. For instance, on the bottom chart, the reader loses the understanding of the relative population density between different Census tracts.

A similar design problem faced by Rhiel in the Brooklyn chart is whether to use the year of construction (e.g. 2003) as the metric or the age of buildings (10 years old). Rhiel chose the former while some other designer would have selected the latter.


Again, thanks for reading, and see you next year!


Two good maps, considered

A Reflection on the past year:

Thanks to you for continuing to make this blog a success. Writing it has given me much enjoyment over the years, and I have learned much from your comments as well as from the visualization projects of many colleagues. 2013 also saw the publication of my new book Numbersense: How to Use Big Data to Your Advantage (link). I thank those of you who have purchased the book, and supported my writing. For those who haven't, please check it out. I have also been speaking at various events, mostly about interpreting data analyses published in the mass media, and building effective data analytics teams. In addition, I am heavily involved in the new Certificate in Analytics and Data Visualization at New York University (link). While the frequency of posting has suffered a little due to my other projects, I hope you found the contents as engaging, fun, and constructive as before.

Looking forward to 2014, I have as usual a basket of projects. Besides the two blogs, I will be expanding my teaching at NYU, including a visualization workshop that I'll be writing about here soon; taking on consulting projects; evangelizing better communications of data and analytics; and prospecting several book projects. I continue to spend most of the week at Vimeo, where my team analyzes data.

This will be my last post in 2013. It is an extra-long post to tide you over to the New Year. Happy New Year!



A short while ago, I was in correspondence with Thomas Rhiel, who created a lovely map depicting the age of buildings in Brooklyn (link). In this case, it's the data that piques my interest. I haven't seen this type of data visualized before. The map type is exquisitely aligned to the data: buildings are geographically located and the age is a third, non-geographical dimension which is encoded in the colors. Red-orange is the most recent while green-blue is the oldest.




The data is at the level of individual buildings. If you hover over a building, you find the raw data including the address and the year of construction. The details seem to show that even the shape of each building is depicted. This really impressed me since a lot of manual labor must have been applied (according to Rhiel, there is a source for this type of data). Here is the map at its most magnified:


I came across this starry patch near the Manhattan Bridge, in which the buildings show up as red asterisks. (Rhiel said the shape came from the data. I am not sure I believe the data. Does anyone live near Sands Street?)



The map is useful if you are interested in questions such as "where are the new developments" (look for the deep red buildings) or "what's the average age of the buildings in a specific block" or "what's the age distribution of the buildings in a set of blocks". At the magnified level shown above, the street names are available to help readers orient themselves. The light gray color keeps the roads and the names safely in the background.

Now, zoomed to the other extreme, we get the image of the whole of Brooklyn:



I have a couple of suggestions for Rhiel. This view presumes a knowledge of Brooklyn's geography that I don't have. Unlike the magnified view, there are no text labels to help us decipher the different sections of Brooklyn. It would be nice if there were a background map to indicate the better-known areas like Williamsburg, Brooklyn Heights or Red Hook.

The other concern is the apparent lack of pattern shown here. At this level, an appropriate question is which sections of Brooklyn are being redeveloped and which sections have older buildings. I see sprinkles of colors everywhere, giving the impression that everything is average. I suggested to Rhiel that aggregating the data would help bring out the pattern.

In data visualization, there is an obsession with plotting the "raw data" at its most granular level. Sometimes, this strategy backfires. It's the classic signal versus noise problem. Aggregation is a noise removal procedure. If, for example, Rhiel gives up the data for individual buildings, including those beloved building shapes, and looks at the average age of buildings within each block, or even each Census tract, I suspect that the resulting map would be more informative.
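As a minimal sketch of what aggregating to the block level means, the snippet below collapses building-level records into one average year of construction per block. The block ids and years are invented for illustration; they are not taken from Rhiel's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (block id, year built).
buildings = [
    ("block-A", 1900),
    ("block-A", 1920),
    ("block-A", 2000),
    ("block-B", 2005),
    ("block-B", 2011),
]


def average_year_by_block(buildings):
    """Collapse individual buildings into one average construction year per block."""
    by_block = defaultdict(list)
    for block, year in buildings:
        by_block[block].append(year)
    return {block: mean(years) for block, years in by_block.items()}


block_years = average_year_by_block(buildings)
print(block_years)
```

Each block would then be colored by a single value, so neighboring blocks can be compared directly instead of asking the eye to average a sprinkle of building colors.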

It turns out that the Graphics team at the New York Times just published an interactive map that illustrates exactly what I suggested to Rhiel. Since this post is getting long, please go to the next post to continue reading.


There's nothing wrong with Eli Manning on this chart

The Giants QB Eli Manning is in the news for the wrong reasons this season. His hometown paper, the New York Times, looked the other way, focusing on one metric that he still excels at, which is longevity. This is like Cal Ripken in baseball. The graphic (link), though, is fun to look at while managing to put Eli's streak in context. It is a great illustration of awareness of foreground/background issues. (I had to snip the bottom of the chart.)


After playing around with this graphic, please go read Kevin Quealy's behind-the-scenes description of the various looks that were discarded (link). He showed 19 sketches of the data. Sketching cannot be stressed enough. If you don't have discarded sketches, you don't have a great chart.

Pay attention to tradeoffs that are being made along the way. For example, one of the sketches showed the proportion of possible games started:


I like this chart quite a bit. The final selection arranges the data by team rather than by player so necessarily, the information about proportion of possible games started fell by the wayside.

(Disclosure: I'm on Team Philip. Good to see that he is right there with Eli even on this metric.)