Don't pick your tool before having your design

My talk at Parsons seemed like a success, based on the conversation it generated, and the fact that people stuck around till the end. One of my talking points is that one should not pick a tool before having a design.

Then, last night on Twitter, I found an example to illustrate this. Jim Fonseca tweeted about this chart from Business Insider: (link)

BI_nflpenaltycount01

The style is clean and crisp, which I credit them for. Jim was not happy about the length of the columns. It seems that no matter how many times we repeat the start-at-zero rule, people continue to ignore it.

So here we go again. The 2015 column is about double the height of the 2013 column, but 730 is nowhere near double 617; it is only about 18 percent higher.
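To put numbers on the distortion, here is a minimal sketch in Python. The counts 617 and 730 come from the chart; the helper names `bar_height_ratio` and `baseline_for_ratio` are my own illustration. The baseline it finds is only what "about double" would imply, not a value read off the original chart.

```python
def bar_height_ratio(smaller, larger, baseline=0.0):
    """Ratio of drawn column heights when the axis starts at `baseline`."""
    if baseline >= smaller:
        raise ValueError("baseline must sit below both values")
    return (larger - baseline) / (smaller - baseline)


def baseline_for_ratio(smaller, larger, ratio):
    """Axis start at which `larger` is drawn `ratio` times as tall as `smaller`."""
    if ratio <= 1:
        raise ValueError("ratio must be greater than 1")
    return (ratio * smaller - larger) / (ratio - 1)


# Penalty counts quoted in the post.
penalties_2013 = 617
penalties_2015 = 730

# With a zero baseline, column heights match the data.
honest_ratio = bar_height_ratio(penalties_2013, penalties_2015)

# Assumption: the columns look exactly 2x apart. This solves for the
# axis start that would produce that impression.
implied_baseline = baseline_for_ratio(penalties_2013, penalties_2015, ratio=2)

print(f"ratio with zero baseline: {honest_ratio:.2f}")
print(f"axis start implied by a 2x look: {implied_baseline:.0f}")
```

The further the axis start creeps up toward the smaller value, the more the visual ratio inflates, which is why the start-at-zero rule matters for columns.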

The standard remedy for this is to switch to a line chart, or a dot plot. Something like this can be quickly produced in any software:

Redo_binflpenaltywk3_2

Is this the best we can do?

Not if we are willing to free ourselves from the tool. Think about the message: NFL referees have been calling more penalties this year. Compared to what?

I want to leave readers no doubt as to what my message is. So I sketched this version:

Redo_binflpenaltywk3

This version cannot be produced directly from a tool (without contorting your body into various painful positions).

The lesson is: Make your design, then find a way to execute it.


Round-up of up-coming events

I finally got around to updating the event listings. In the coming months, I will be giving a number of talks on data visualization.

Next week, I will be speaking to the Data Visualization New York meetup, ably organized by Naomi Robbins. The event is heavily over-subscribed, so apologies to those who can't make it in.

In October, I will be offering a short class on data visualization at an executive education event at Columbia University. The event is "Leading Business Change Through Analytics". The fantastic program covers the management and leadership skills necessary to turn data insights into measurable business results. You can still register to attend.

In addition, I will be giving a proseminar at NYU's Applied Quantitative Reasoning program in the Sociology department.

I will also be visiting classes by Andrew Gelman (Columbia) and Ray Vella (NYU) next month.

***

You can follow my events from my sister blog. Click here and look in the right column.

If you come to one of these events, do come up and say hi!

 


Nice title but dubious message

I like to use declarative titles for charts. This chart below, found in an investment magazine published by Charles Schwab, wants to tell us that emerging markets "perform differently."

Cs_emergingmkt

That is a nice concise message. Now, what does the chart say?

Readers have to jump through some hoops. First, the axes are flipped from their normal posture. Time typically is shown running horizontally. And market returns which range widely from positive to negative values are frequently displayed vertically. But not here.

Second, this chart treats all three categories of equity returns (domestic, international developed markets, international emerging markets) equally, even though the title draws attention to emerging markets. In fact, emerging markets is placed last in the legend. Try blocking the top section and just staring at the grouped bar chart -- the emerging markets do not jump out.

Third, we are asking ourselves what the designer/analyst means by "performing differently." The most obvious difference is the blue spike corresponding to the 79% return in 2009. But in many other years, the blue bar is not obviously different.

One way to interpret "perform differently" is that the emerging market returns exhibit low correlation with the returns in either domestic or international-developed markets. (Such a finding would be helpful to investors looking for diversification.) The scatter plot can be used to examine correlations.

Redo_csemergingmkt

The pattern is surprising. The chart on the left shows that emerging market returns are highly correlated in a linear way with international developed-market returns. The chart on the right shows that domestic returns are less correlated with emerging market returns but the correlation is still pretty strong.

There were two unusual years, one (2009) in which emerging markets did quite a bit better and another (2013) in which emerging markets did quite a bit worse.

These observations imply that the data do not really support the title of the original chart.
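For readers who want a number to go with the scatter plots, the correlation check can be sketched as follows. This is a plain Pearson correlation written out by hand; the function name `pearson` is my own, and the return series are made-up placeholders, not the Schwab data, so the printed value says nothing about the actual markets.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Hypothetical annual returns in percent, for illustration only.
emerging_returns = [30.0, -50.0, 75.0, 20.0, -15.0, 18.0, -3.0]
developed_returns = [12.0, -43.0, 32.0, 8.0, -12.0, 17.0, 22.0]

r = pearson(emerging_returns, developed_returns)
print(f"correlation (made-up data): {r:.2f}")
```

A value near 1 would contradict a "performs differently" headline, at least under the low-correlation reading of that phrase.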

 


Tufte soundbites

Tom B. alerted me to an interview with Ed Tufte, by Ad Age (link). It's a good read. The journalist attended one of Tufte's courses but then the interview was conducted via email. So it reads like a condensed version of Tufte's writing, stuffed with his many colorful coinages.

I like this comment related to Big Data:

First: "overwhelming data" is a bit of a hoax. Many of the time measurements have enormous serial correlation (just because you can measure to the millisecond doesn't mean you've learned anything about a process that moves to a monthly rhythm) and extreme high collinearities in the things measured (as in the endless web metrics, many of which are measuring the same thing over and over). Finally, most website data bizarrely and deliberately overstates the extent and intensity of website activity.


Rethinking the index data, with modesty and clarity in mind

I discussed the rose chart used in the Environmental Performance Index (EPI) report last week. This type of data is always challenging to visualize.

One should start with an objective. If the goal is a data dump, that is to say, all you want is to deliver the raw data in its full glory to the user, then you should just print a set of data tables. This has traditionally been the delivery mechanism of choice.

If, on the other hand, your interest is communicating insights, then you need to ask some interesting questions. One such question is how do different regions and/or countries compare with each other, not just in the overall index but also in the major sub-indices?

Learning to ask such a question requires first understanding the structure of the data. As described in the previous post, the EPI is a weighted average of a bunch of sub-indices. Each sub-index measures "distance to a target," which is then converted into a scale from 0 to 100. This formula all but guarantees that at the aggregate level, the EPI is not going to be 0 or 100: a country would have to score 100 on all sub-indices to attain EPI perfection!
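The weighted-average structure is easy to see in a few lines of Python. This is a sketch only: the function name `weighted_index`, the three sub-indices, and the weights 5, 3, 2 are hypothetical stand-ins, not the actual EPI components or weights.

```python
def weighted_index(scores, weights):
    """Weighted average of sub-index scores, each on a 0 to 100 scale."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical weights for three hypothetical sub-indices.
hypothetical_weights = [5, 3, 2]

# Only a perfect score on every sub-index yields an aggregate of 100.
perfect = weighted_index([100, 100, 100], hypothetical_weights)

# Missing the target on even the lowest-weighted sub-index pulls the total down.
one_miss = weighted_index([100, 100, 0], hypothetical_weights)

print(perfect, one_miss)
```

Because every sub-index with a positive weight contributes, the aggregate can reach either end of the scale only when all of them do so together.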

Here is a design sketch to address the question posed above:

Redo_epi_regional

For a print version, I chose several reference countries listed at the bottom that span the range of common values. In the final product, hovering over a stripe should disclose a country and its EPI. Then the reader can construct comparisons of the type: "Thailand has a value of 53, which places it between Brazil and China."

The chart reveals a number of insights. Each region stakes out its territory within the EPI scale. There are no European countries with EPI lower than 45 while there are no South Asian countries with EPI higher than 50 or so. Within each region, the distribution is very wide, and particularly so in the East Asia and Pacific region. Europe is clearly the leading region, followed by North America.

The same format can be replicated for every sub-index.

This type of graph addresses a subset of the set of all possible questions and it does so in a clear way. Modesty in your goals often helps.

 


A not-so-satisfying rose

At the conference in Bavaria, Jay Emerson asked participants to provide comments on the data visualization of the 2014 Environmental Performance Index (link). We looked at the country profiles in particular. Here is one for Singapore:

Singapore

The main object of interest here is the "rose chart." To understand it, we need to know the methodology behind the index. The index is a weighted average of nine sub-indices, as shown in the table at the bottom. In many cases, the sub-index is itself an average of sub-sub-indices. These lower-level indices measure the distance between a country's performance and some target performance, typically set at the international level. But those distances are converted into a scale between 0 and 100 so the country with a score of zero did the worst in terms of meeting the target while the country with 100 did the best.

In the rose chart, the circle is divided evenly into nine sectors, each representing a sub-index. The data are encoded in the radius of the sectors. Colors map to the sub-index, and the legend is provided in two ways: a hover-over on the Web, and the table below.

Here is the equation that connects the data (EPI) to the area of the sectors:

Rose_formula

There are a number of issues with this representation. First, because of the squaring of the EPI, the area is distorted. If one country's EPI is twice another's, its sector area is four times as large. Another way to see this is to notice that as the EPI increases, the curved edge of the sector moves outwards, tracing a larger circumference.
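The squaring effect can be checked with a short sketch. It assumes the score is used directly as the radius of a sector spanning one-ninth of a circle, which is how the text above describes the encoding; the function name `sector_area` and the sample scores 40 and 80 are mine. The square-root rescaling shown at the end is a common remedy for area encodings in general, not necessarily the adjustment proposed in this post.

```python
from math import pi, sqrt


def sector_area(radius, n_sectors=9):
    """Area of one sector when a circle of `radius` is split into equal sectors."""
    return pi * radius ** 2 / n_sectors


# Two hypothetical sub-index scores, one twice the other.
low_score, high_score = 40, 80

# Score used directly as the radius: area grows with the square of the score.
radius_encoded_ratio = sector_area(high_score) / sector_area(low_score)

# Square root of the score used as the radius: area grows in step with the score.
area_encoded_ratio = sector_area(sqrt(high_score)) / sector_area(sqrt(low_score))

print(f"score ratio: {high_score / low_score:.1f}")
print(f"area ratio, radius = score: {radius_encoded_ratio:.1f}")
print(f"area ratio, radius = sqrt(score): {area_encoded_ratio:.1f}")
```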

Another issue is the one-ninth factor, which implies that each of those nine sub-indices is equally important. The diagram below shows that interpretation to be incorrect. (The nine sub-indices are shown in the second layer from the outside in.)

Epi_index_components

A third issue is illustrated in the Singapore rose. Notice from the table below that Singapore scored zero on Fisheries. But in the rose, Fisheries has a non-zero area. Think of this practice as coring an apple. The middle circle of radius k should be ignored. If the sector that has the color of Fisheries has zero area, then the entire red circle shown below should have zero area.

Rose-core

With these three adjustments, the encoding formula becomes rather more complicated:

Rose_formula2

where x depends on the weight of the sub-index, and k is the radius of the sector that represents value zero.

***
The rose/radar/spider type charts are more useful when placed side by side to compare countries. But even then, this chart form doesn't work well for this dataset. This is because the spacing of countries within each sub-index is not uniform.

The site has a visualization of the distribution of sub-index scores by issue:

Epi_by_issue

We can see that in the case of water resources, most countries are not doing very well at all. In terms of air quality, most countries except for those in the right tail have performed quite well. It is hard to interpret the indices unless one has an idea of the full distribution.

***

Finally, one thing the EPI people did makes me happy. They have created PDFs and images of their data visualizations, so it is quite easy to save and keep some of this work. All too often, browser-based technologies create visualizations that can't be saved.