Graphing highly structured data

The following sankey diagram appeared in my LinkedIn feed the other day, and I agree with the poster that this is an excellent example.


It's an unusual use of a flow chart to show the P&L (profit and loss) statement of a business. It makes sense since these are flows of money. The graph explains how Spotify makes money - or how little profit it claims to have earned on over 2.5 billion of revenues.

What makes this chart work so well?

The first thing to notice is how they handled negative flows (costs). They turned the negative numbers into positive numbers, and encoded the signs of the numbers as colors. This doesn't come as naturally as one might think. The raw data are financial tables with revenues shown as positive numbers and costs shown as negative numbers, perhaps in parentheses. Like this:


Now, some readers are sure to have an issue with using the red-green color scheme. I suppose gray-red can be a substitute.
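The sign-to-color step described above can be sketched in a few lines. This is only an illustration: the row labels and amounts are made up (they are not Spotify's actual figures), and `to_flow` is a hypothetical helper, not part of any charting library.

```python
# Hypothetical P&L rows in the style of a financial table: revenues are
# positive, costs are negative. The amounts are invented for illustration.
pnl_rows = [
    ("Subscription revenue", 2100),
    ("Advertising revenue", 400),
    ("Cost of revenue", -1800),
    ("Research and development", -300),
]


def to_flow(label, amount, inflow_color="green", outflow_color="red"):
    """Return a flow with a positive width; the sign moves into the color."""
    color = inflow_color if amount >= 0 else outflow_color
    return {"label": label, "width": abs(amount), "color": color}


flows = [to_flow(label, amount) for label, amount in pnl_rows]
for flow in flows:
    print(flow)
```

Swapping the palette, say to gray and red, is then a matter of passing `inflow_color="gray"`; the widths are untouched.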

The second smart decision is to pare down the details. There are only four cost categories shown in the entire chart. The cost of revenue represents more than two-thirds of all revenues, and we know nothing about sub-categories of this cost.

The third feature is where the Spotify logo is placed. This directs our attention to the middle of the diagram. This is important because typically on a sankey diagram you read from left to right. Here, the starting point is really the column labeled "total Spotify revenue". The first column just splits the total revenue between subscription revenue and advertising revenue.

Putting the labels of the last column inside the flows improves readability as well.

On the whole, a job well done.


Sankey diagrams have limitations. The charts need to be simple enough to work their magic.

It's difficult to add a time element to the above chart, for example. The next question a business analyst might want to ask is how the revenue/cost/profit structure at Spotify has changed over time.

Another question a business analyst might ask is the revenue/cost/profit structure of premium vs ad-supported users. We have a third of the answer - the revenue split. Depending on relative usage, and content preference, the mix of royalties is likely not to replicate the revenue split.

Yet another business analyst might be interested in comparing Spotify's business model to a competitor. It's also not simple to handle this on a sankey diagram.


I searched for alternative charts, and when you look at what's out there, you appreciate the sankey version more.

Here is a waterfall chart, which is quite popular:


Here is a stacked column chart, rooted at zero:


Of course, someone has to make a pie chart - in this case, two pie charts:






Reading between the gridlines

Reader Jamie H. pointed me to the following chart in the Guardian (link), which originated from Spotify.


This chart is likely inspired by the Arctic ice cover chart discussed here last year (link):


Spotify calls its chart "the Coolness Spiral of Death" while the other one is called "Arctic Death Spiral".

The spiral chart has many problems, some of which I discussed in the post from last year. Just take a look at the headline, and then the black dotted spiral. Does the shape evoke the idea of rapid evolution, followed by maturation? Or try to figure out the amount of evolution between ages 18 and 30.


Instead of the V corner of the Trifecta, I'd like to focus on the D corner today. When I look at charts, I'm always imagining the data behind the chart. Here are some questions to ponder:

  • Given that Spotify was founded in 2006 (not quite 10 years ago), how are they able to discern someone's music taste from 14 through 48?
  • The answer to the above question is they don't have a longitudinal view of anyone's music taste. They are comparing today's 14-year-old kid with today's 48-year-old adult. Under what assumptions would such an analysis yield the same outcome as a proper analysis that tracks the same people over time?
  • If the phenomenon under study follows a predictable trend, there will be little difference between the two ways of looking at the data. For example, teeth in the average baby follow a certain sequence of emergence, first incisors at six months, and first molars at 14 months (according to Wikipedia). Observing John's teething at six months and David's at 14 months won't yield much difference from looking at John at six then 14 months. Does music taste evolve like human growth?
  • Unfortunately, no. Imagine that a new genre of music suddenly erupts and it becomes popular among every generation of listeners. This causes the Spotify curve to shift towards the origin at all ages. However, if you take someone who is currently 30 years old, the emergence of the new genre should affect his profile at age 30 but not anytime before. In fact, the new music creates a sharp shift at different locations of everyone's taste profile depending on one's age!
  • Let's re-interpret the chart, and accept that each spoke in the wheel concerns a different cohort of people. So we are looking at generational differences. Is the Spotify audience representative of music listeners? Particularly, is each Spotify cohort representative of all listeners of that age?
  • I find it unlikely since Spotify has that "cool" factor. It is probably more representative for younger age groups. Among older customers, there should be some bias. How does this affect the interpretation of the taste profile?
  • If we find that one cohort differs from another cohort, it is important to establish that the gap is a generational difference and not due to the older age group being biased (self-selected) in some way.



Setting the right priority

On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.


This classic Excel chart has some basic construction issues:

  • The data labels are excessive
  • The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
  • With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just include the legend text in the existing text annotation!

In terms of the Trifecta checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income and this distribution has not changed noticeably despite the Long Tail phenomenon.

But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.

It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.

So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
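To make the two stories concrete, here is a minimal sketch that separates them. The years and dollar amounts are hypothetical, chosen only to echo the orders of magnitude mentioned above; they are not taken from the report.

```python
# Hypothetical income in $ billions (NOT the report's data).
income = {
    2000: {"top_1pct": 3.0, "everyone_else": 1.0},
    2013: {"top_1pct": 2.25, "everyone_else": 0.75},
}


def top_share(year_amounts):
    """Share of total income earned by the top 1 percent in one year."""
    return year_amounts["top_1pct"] / sum(year_amounts.values())


# Story 1: total income (what the stacked columns make easy to read).
totals = {year: sum(amounts.values()) for year, amounts in income.items()}
# Story 2: the top 1 percent's share (what the report's headline is about).
shares = {year: top_share(amounts) for year, amounts in income.items()}

for year in income:
    print(year, f"total={totals[year]:.2f}", f"top share={shares[year]:.0%}")
```

With these made-up numbers the total falls by a quarter while the share does not move at all, which is why plotting the share directly serves the headline message better.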


Digital music business needs numbersense

Joran E. sends us to the following chart via Twitter.


Link to the original here.


The top chart fails our self-sufficiency test. There are only eight numbers in the data. All eight numbers are printed onto the chart. If they were removed, the chart would be neutered.

The triangle elements are distracting and pointless. The data is encoded in the two ends of those black lines. The two data series have very different scales so that when plotted on the same canvas, the information on digital albums (in red) becomes almost imperceptible.

The tiny font size strains our eyes.


But the bigger problem with the chart is the absence of numbersense.

Start with summing the number of digital albums and the number of tracks. While both units are literally units, they are different units.

A host of statistical adjustments is called for. Revenues would be more telling than unit sales since the average price paid is probably not constant across time. Price is typically inverse to quantity. Singles are cheaper than albums so comparing the units of tracks and the units of albums makes little sense.


The chart on the bottom is a nice idea but again can use some adjusting. As far as we know, the 47 weeks of sales data have not been seasonally adjusted. While the second half of 2013 looked worse than the first half, this insight is remarkable only if this pattern was not likely based on history. Adding lines from previous years would help put things into perspective.
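One crude way to put weekly sales into historical perspective is to compare each week with the same week a year earlier. The sketch below does only that; it is not a full seasonal adjustment, and the weekly figures are hypothetical.

```python
# Hypothetical weekly unit sales (in millions) for the same four calendar
# weeks in two consecutive years. These are not the chart's actual data.
sales_2012 = [26.0, 25.0, 24.0, 30.0]
sales_2013 = [24.7, 23.5, 22.8, 28.2]


def year_over_year_change(current, previous):
    """Relative change of each week versus the same week one year earlier."""
    return [(c - p) / p for c, p in zip(current, previous)]


yoy = year_over_year_change(sales_2013, sales_2012)
for week, change in enumerate(yoy, start=1):
    print(f"week {week}: {change:+.1%}")
```

In this made-up example the last week has the highest raw sales in both years, yet its year-over-year change is no better than the others, which is the kind of seasonal effect that raw weekly sales can hide.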

Besides, if sales of other consumption goods fell by 20 percent while sales of digital music dropped by 10%, then by comparison, the digital music industry has fared well. In order to understand the song download data series, we also need to consult the trend for other related goods.

Lastly, by using an area chart, the designer is cornered into starting the vertical axis at zero. If a line chart were used instead, there would be no need to start the axis at zero, and consequently, the drop in weekly sales would appear more pronounced.

When an industry is imploding, let's focus on a metric that remains constant

Augustine F. (@acfou) was not amused by a set of charts made by the Bureau of Labor Statistics, via Business Insider (link). Here's one of them:


The article's message is that the book, periodical and music stores industry has shrunk drastically (over 50%) in the last 10 years, but unless you spend time studying the chart, you're not likely to get this picture.

The bubbles are going right and up, which usually is indicative of an increasing trend. What is tripping us up is the employment level occupying the horizontal axis rather than the expected time dimension. The only real way to see the plunge in employment is to focus on the horizontal axis, and to notice the deepening color of the bubbles.

The chart is actually a scatter plot of number of firms versus number of employees. The slope of the line gives us the number of firms per employee, which is also unexpected since the usual metric is its reciprocal, the number of employees per company. However, since the slope is essentially constant, highlighting this number is pointless. While the industry is collapsing, the average workforce of the surviving firms has remained more or less the same.

I added a cone to the chart to visualize the narrow range in which the employees per firm varied during the past decade.

As if it's not confusing enough, the reciprocal of the slope is coded to the size of the bubbles on the chart. This requires a legend to explain.  All of this means that readers' attention is directed to the average work force metric, instead of the drop in employment.


The following indexed chart shows that the number of employees and the number of firms dropped in step during the ten years. Both dropped about 55% during the decade. This just confirms that the average employee per firm metric is not meaningful.
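Indexing both series to a common base year is simple arithmetic. Here is a minimal sketch, using hypothetical counts (not the BLS figures) picked so that both series fall by 55 percent.

```python
# Hypothetical counts by year; the real BLS numbers are not reproduced here.
employees = {2002: 200_000, 2007: 150_000, 2012: 90_000}
firms = {2002: 20_000, 2007: 15_000, 2012: 9_000}


def index_to_base(series, base_year):
    """Rescale a series so that its value in base_year equals 100."""
    base = series[base_year]
    return {year: 100 * value / base for year, value in series.items()}


employees_idx = index_to_base(employees, base_year=2002)
firms_idx = index_to_base(firms, base_year=2002)
employees_per_firm = {year: employees[year] / firms[year] for year in employees}

for year in employees:
    print(year, employees_idx[year], firms_idx[year], employees_per_firm[year])
```

When the two indexed lines move in step like this, their ratio (employees per firm) is constant by construction, so it carries no information about the decline.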


If you follow the link to the BLS analysis, you'll find some other interesting data, namely the "internet publishing" industry. Does it make sense to talk about the drastic decline in traditional publishing without talking about the rise of the "substitute" industry? The chart below shows that the new jobs created in Internet publishing filled almost all of the hole left in the traditional publishing industry. The decline from 2009 on may not be specific to the industry; it could just be the Great Recession. (As defined, I don't think the two industry sectors are exactly what I'm looking for, but it's close enough.)




Can information be beautiful when information doesn't exist?

Reader Steve S. sent in this article that displays nominations for the "Information is Beautiful" award (link). I see "beauty" in many of these charts but no "information". Several of these charts have appeared on our blog before.

Let's use the Trifecta checkup on these charts. (More about the Trifecta checkup here.)


The topic of this chart is both tangible and interesting. As someone who loves books, I do want to know what genres of books typically win awards.

However, both the data collection and graphical design make no sense.

The data collection problem presents a huge challenge and it's easy to get wrong. The problem is how narrow should a theme be. If it's too narrow, you can imagine every book has its own set of themes. If it's too wide, each theme maps to lots of books. The challenge is how to select the themes such that they have similar "widths". For example, "death" is a very wide theme and lots of books contain it, as indicated by the black lines. "Nanny trust issues" is a very narrow theme, and only one of those books deals with this theme. When there is such a theme, is its lack of popularity due to its narrow definition or due to writers not being interested in it?


The caption of this chart said "Cover stars: Charting 50 years up until 2010, this graphic shows The Beatles to be the most covered act in living memory." If that is the message, a much simpler chart would work a lot better.

Since the height of the chart indicates the number of covers sold in that year, the real information being shown is the boom and bust cycles of the worldwide economy. So, a lot more records were sold in 2005, and then the market tanked in 2008, for example.

That's why the data analyst should think twice before plotting raw data. Most data like these should be adjusted. In this case, you could either compare artists against one another in each year (by using proportions) or you have to do a seasonal and trend adjustment. I also don't see the point of highlighting year-to-year fluctuations. Nor do I understand why only in certain years is the top-rated cover identified by name and laurel wreath.



I talked about this stream graph of 311 calls back in 2010. See the post here.



I featured this set of infographics/pie charts back in 2011. See the post here.



This chart is a variant of the one from New York Times that I discussed here. I like the proper orientation on the NYT's version. The color scheme here may be slightly more attractive.



Reading the landscape

Here are some posts I find worth reading on other graphics blogs:

Nick has done wonderful work on the evolution of the rail industry in the U.S., with a flow chart showing how mergers have produced the four giants of today, as well as small multiples of maps showing how they split up the country.

A lovely feature of the flow chart is the use of red lines to let readers see at a glance that Union Pacific is the only rail company that has lasted the entire 4 decades, while the other 3 giants came into being within the last 20 years.

On the maps, notice a slight inconsistency between the left and right columns: on the right side, both maps have the same set of anchor cities, which act as "axes" to help readers compare the maps; on the left side, the sets of anchor cities are not identical. It would also be interesting to see a version with all four route maps superimposed and differentiated by color. That may bring out the competitive structure better.


Georgette has a nice post summarizing issues with picking colors when producing charts. Her blog is called Moved by Metrics.


Meanwhile, Martin finds a shockingly poor pie chart here.


There was a time when you'd find the kind of heatmaps featured here by Nathan as wallpaper in my office. It's a great visualization tool for exploring temporal patterns in large data sets. However, I'd never even think of putting these in a presentation.  It's a starting point, not an end-point, of an analysis project. Some things are wonderful for consumption only in private!






Did days get longer in the last 30 years? Fast Company thinks so.

Craig N. sent us to this infographic from Fast Company about MTV's 30th anniversary, nominating it as the worst infographic ever.


Apply the self-sufficiency test to this chart. Wish away the printed data. Now, does the chart convey any message?  Where is the data embedded? Is it in the white dot, the black dot, the gold ring, the gold disc, the black ring, the eye-white? All of the above?

Now, do the same test on this chart (I removed the sales data, replacing it with years):


How would one compare the white to the orange? If one measures the lengths of the sides, the ratio of white to orange is about 1.32. If one compares areas of the squares, then the ratio is 1.73. Note that this requires the reader to see through the orange area to size up the area of the large white square. Alternatively, we can compute the ratio of the white area as observed to the orange square, and that ratio is 0.73.

The real ratio between 1980 and 2010 sales is given as 3.9/2.7 = 1.44. Given rounding errors, it seems like the designer may have used a ratio of lengths of the sides.

The problem is the same whether sides or areas are used. Can the reader figure out that the 1980 sales is about 40% higher than the 2010 sales?

I suspect that most of us react primarily to the visible areas, which means that we'd have gotten the direction of the change wrong, let alone the magnitude.
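The three ways of reading the squares can be worked out from the measured side ratio alone. A minimal sketch, assuming the orange square sits entirely inside the white one (so the visible white is the white square minus the orange square):

```python
side_ratio = 1.32        # white side / orange side, as measured above
data_ratio = 3.9 / 2.7   # 1980 sales / 2010 sales, from the printed data

# Reading 1: compare side lengths.
by_sides = side_ratio
# Reading 2: compare full areas (requires "seeing through" the orange square).
by_areas = side_ratio ** 2
# Reading 3: compare the visible white area with the orange square.
# Assumes the orange square lies fully inside the white square.
by_visible_area = by_areas - 1

print(f"data:         {data_ratio:.2f}")
print(f"sides:        {by_sides:.2f}")
print(f"full areas:   {by_areas:.2f}")
print(f"visible area: {by_visible_area:.2f}")
```

Squaring the rounded 1.32 gives a value a hair above the 1.73 quoted above, which is just rounding in the measurement. The point stands: only the visible-area reading falls below 1, so a reader relying on it would conclude that sales went up rather than down.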



Craig really dislikes this one. It's a variant of the racetrack chart. As any athlete knows, inner tracks are shorter than outer tracks. Could it be that days have gotten longer in the last 30 years? Apparently, the editors at Fast Company think so.
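The geometry behind the racetrack problem is just arc length. Here is a small sketch, assuming the chart encodes each value as the angle swept and places each series on a ring of a different radius (the radii below are arbitrary):

```python
import math


def arc_length(radius, angle_degrees):
    """Length of a circular arc with the given radius and swept angle."""
    return 2 * math.pi * radius * angle_degrees / 360


# The same value (same swept angle) drawn on an inner and an outer ring.
inner = arc_length(radius=1.0, angle_degrees=180)
outer = arc_length(radius=2.0, angle_degrees=180)

print(f"inner arc: {inner:.3f}")
print(f"outer arc: {outer:.3f}")
print(f"outer / inner: {outer / inner:.1f}")
```

Identical data produce a bar twice as long on the outer ring, so readers who compare lengths rather than angles will see a difference that isn't in the data.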

The chart that reveals a mysterious death

I agree with Business Insider that the following chart is attractively drawn. It nicely illustrates the rise and fall of various music media over time.


Area charts are more visually appealing than line charts, largely because line charts frequently leave large patches of white space. But one should be aware of some shortcomings of area charts.

Notice that the outer envelope of the area chart represents the growth in music sales across all media, not to be mistaken for the growth of any particular medium. However, the primary message of this chart relates to the change in mix among different media, not the growth of the total market. Because of the stacking of different areas on top of each other, it is not an easy task to read the growth of any individual piece, such as CDs.


Unlike Business Insider, which found some answers on this chart, I find that this chart raises a mysterious -- and important -- question: what happened around 2001 to damage CD sales? Since according to this chart, digital sales didn't really show up till 2004-ish, there is a gap of two years or so when CD sales dropped drastically, seemingly of their own volition.

Political theater

Jens, a long-time reader, tried to re-make the boring data tables used to report poll data.  Here is an example from USA Election Polls (left) and his enhanced version (right).


Like Jens, I find most of the tabular presentation of poll data underwhelming.  Too much data hiding all the useful information.  For example, the pollster and polling date data provide a context for super-serious poll watchers to interpret the data; however, they do not present themselves in a way that actually helps readers.  Read further for versions that bring out this data much better.

Meanwhile, Jens' revision uses color and ordering to bring out the current state of affairs.  The addition of electoral votes allows us to understand the relative weight of each row, countering the weakness of the tabular format, that each row has the same height, implying erroneously that they have the same importance.

There are a number of good web-sites where this type of data is presented in attractive ways.

I have been a fan of Political Arithmetik, which made great use of the pollster and polling date data mentioned above.  Those data have been averaged to show the overall trend while the individual poll results are plotted as dots in the background.  The polling date data is embedded in the horizontal positions of the dots.  Even more impressively, the margins of error are presented.  Remarkably, this race has been a statistical tie for all these months, the 95% lower limit never quite making it above the zero level.
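To see what a "statistical tie" means in numbers, here is a minimal sketch of the margin of error on the lead in a single poll. It assumes a simple random sample and uses the standard multinomial variance of a difference between two shares from the same sample; the poll figures are hypothetical, and this is not a description of how Political Arithmetik computes its bands.

```python
import math


def lead_interval(share_a, share_b, sample_size, z=1.96):
    """Approximate 95% interval for the lead (share_a - share_b).

    Assumes both shares come from one simple random sample, so
    Var(lead) = (share_a + share_b - lead**2) / sample_size.
    """
    lead = share_a - share_b
    variance = (share_a + share_b - lead ** 2) / sample_size
    margin = z * math.sqrt(variance)
    return lead - margin, lead + margin


# Hypothetical poll: 48% vs 45% among 1,000 respondents.
lower, upper = lead_interval(0.48, 0.45, sample_size=1000)
print(f"lead: +3.0 points, 95% interval: {lower:+.1%} to {upper:+.1%}")
```

A three-point lead with this sample size has an interval that straddles zero, which is the situation described above: the lower limit never makes it above the zero level.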


Another great site is  Below, they essentially turned Jens' enhanced table into a map.  The legend on the right perhaps represents what they call "East Coast bias"?  All of Nathan's graphs are very attractively produced; I just wish he'd put more labels on them (such as the differentials corresponding to shades of red and blue.)