A data graphic that solves a consumer problem

Saw this great little sign at Ippudo, the ramen shop, the other day:

Ippudo_board

It's a great example of highly effective data visualization. The names on the board are sake brands. 

The menu (a version of a data table) is the conventional way of displaying this information.

The Question

Customers are selecting a sake. They don't have a favorite, or don't recognize many of these brands. They know a bit about their preferences: I like full-bodied, or I want the dry one. 

The Data

On a menu, the key data are missing. So the first order of business is to find data on full- and light-bodied, and dry and sweet. The pricing data are omitted, possibly because they would clutter up the design, or because the shop doesn't want customers to focus on price - or both.

The Visual

The design uses a scatter plot. The customer finds the right quadrant, thus narrowing the choices to three or four brands. Then, the positions on the two axes allow the customer to drill down further.

This user experience is leaps and bounds above scanning a list of names, and asking someone who may or may not be an expert.

Back to the Data

The success of the design depends crucially on selecting the right data. Baked into the scatter plot is the assumption that the designer knows the two factors most influential to the customer's decision. Technically, this is a "variable selection" problem: of all factors determining the brand choice, which two are the most important? 

Think about the downside of selecting the wrong factors: the scatter plot would then make it harder to choose a sake than the menu does.

 


Beauty is in the eyes of the fishes

Reader Patrick S. sent in this old gem from Germany.

Swimmingpoolsvisitors_ger

He said:

It displays the change in numbers of visitors to public pools in the German city of Hanover. The invisible y-axis seems to be, um, nonlinear, but at least it's monotonic, in contrast to the invisible x-axis.

There's a nice touch, though: The eyes of the fish are pie charts. Black: outdoor pools, white: indoor pools (as explained in the bottom left corner).

It's taken from a 1960 publication of the city of Hanover called *Hannover: Die Stadt in der wir leben*.

This is the kind of chart that Ed Tufte made (in)famous. The visual elements do not serve the data at all, except for the eyeballs. The design becomes a mere vessel for the data table. The reader who wants to know the growth rate of swimmers has to do a tank of work.

The eyeballs though.

I like the fact that these pie charts do not come with data labels. This part of the chart passes the self-sufficiency test. In fact, the eyeballs contain the most interesting story in this chart. In those four years, the visitors to public pools switched from mostly indoor pools to mostly outdoor pools. These eyeballs show that pie charts can be effective in specific situations.

Now, Hanover fishes are quite lucky to have free admission to the public pools!


Playfulness in data visualization

The Newslab project takes aggregate data from Google's various services and finds imaginative ways to enliven the data. The Beautiful in English project makes a strong case for adding playfulness to your data visualization.

Newslab_language_wordsnake

The data came from Google Translate. The authors look at 10 languages, and the top 10 words users ask to translate from those languages into English.

The first chart focuses on the most popular word for each language. The crawling snake presents the "worldwide" top words.

The crawling motion and the curvature are not required by the data, but they insert a dimension of playfulness that engages the reader's attention.

The alternative of presenting a data table loses this virtue without gaining much in return.

Readers are asked to click on the top word in each country to reveal further statistics on the word.

For example, the word "good" leads to the following:

Newslab_language_top1_details

 

***

The second chart presents the top 10 words by language in a lollipop style:

Newslab_language_japanese10

The above diagram shows the top 10 Japanese words translated into English. This design sacrifices concision in order to achieve playfulness.

The standard format is a data table with one column for each country, and 10 words listed below each country header in order of decreasing frequency.

The creative lollipop display generates more extreme emotions - positive, or negative, depending on the reader. The data table is the safer choice, precisely because it does not engage the reader as deeply.

 

 


The art of arranging bars

Twitter friend Janie H. asked how I would visualize a hypothetical third column of this chart that contains the change from 2016 to 2017:

Techpriorities_data_table

This table records the results from a survey question by eMarketer, asking respondents ("marketers") to identify their top 5 technology priorities in the next 12 months.

I suggested the following:

Redo_techpriorities_order1

A hype-chasing phenomenon is clearly at play. Internet of Things and wearable technology are so last year. This year, it's all about A.I. Interestingly, something like "Big data" has been able to sustain the hype for another year.

A design decision I made is to encode the magnitude of the change in the bar lengths while encoding the direction of the change in the colors. One can of course follow the more canonical design of placing the negative bars on the left side of the data labels. My decision is a subtle way of imposing the hierarchy - first I care about magnitude, then I care about direction.
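To make the encoding concrete, here is a minimal Python sketch of the data preparation behind such a chart. The survey shares below are made up for illustration (they are not eMarketer's figures), and the function name `encode_changes` is my own. The point is only that magnitude and direction are split into two separate fields, with the rows sorted on magnitude first.

```python
# Hypothetical survey shares (percent of respondents) for 2016 and 2017.
# These numbers are invented for illustration; they are not eMarketer's data.
priorities = {
    "Artificial intelligence": (10, 25),
    "Big data": (40, 41),
    "Internet of Things": (30, 18),
    "Wearable technology": (20, 9),
}

def encode_changes(data):
    """Split each year-over-year change into magnitude and direction,
    then sort rows so the largest changes come first."""
    rows = []
    for name, (y2016, y2017) in data.items():
        change = y2017 - y2016
        rows.append({
            "name": name,
            "magnitude": abs(change),                       # drives bar length
            "direction": "up" if change >= 0 else "down",   # drives bar color
        })
    return sorted(rows, key=lambda r: r["magnitude"], reverse=True)

rows = encode_changes(priorities)
for r in rows:
    # A crude text bar: one character per percentage point of change.
    print(f'{r["name"]:<24} {r["direction"]:<5} {"#" * r["magnitude"]}')
```

Sorting on `magnitude` rather than on the signed change is what puts magnitude ahead of direction in the hierarchy: gainers and losers end up interleaved, and only the color tells them apart.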

Here is a third way:

Redo_techpriorities_order2

This design imposes a different hierarchy. Your eyes are drawn to the top/bottom of the chart.

Any of these designs beats the data table by a mile. It's just too much work for the reader to figure out the value of the changes from the table.


Enhanced tables, and supercharged spreadsheets with in-cell tech

Old-timer Chris P. sent me to this Bloomberg article about Vanguard ETFs and low-cost funds (link). The article itself is interesting, and I will discuss it on the sister blog some time in the future.

Chris is impressed with this table included with the article:

Bloomberg_vanguard

This table indeed presents the insight clearly. Those fund sectors in which Vanguard does not compete have much higher costs than the fund sectors in which Vanguard is a player. The author calls this the "Vanguard effect."

This is a case where finding a visual design to beat this table is hard.

For a certain type of audience, namely financial, the spreadsheet is like rice or pasta; you simply can't live without it. The Bloomberg spreadsheet does one better: the bands of blue contrast with the white cells, neatly dividing those funds into two groups.

If you use spreadsheets a lot, you should definitely look into in-cell charts. Perhaps Tufte's sparkline is the most famous but use your imagination. I also wish vendors would support in-cell charts more eagerly.

Here is a vision of what in-cell technology can do with the above spreadsheet. (The chart is generated in R.)

  Redo_bloomberg_vanguard2

 

 


Rethinking the index data, with modesty and clarity in mind

I discussed the rose chart used in the Environmental Performance Index (EPI) report last week. This type of data is always challenging to visualize.

One should start with an objective. If the goal is a data dump, that is to say, all you want is to deliver the raw data in its full glory to the user, then you should just print a set of data tables. This has traditionally been the delivery mechanism of choice.

If, on the other hand, your interest is communicating insights, then you need to ask some interesting questions. One such question is how do different regions and/or countries compare with each other, not just in the overall index but also in the major sub-indices?

Learning to ask such a question requires first understanding the structure of the data. As described in the previous post, the EPI is a weighted average of a bunch of sub-indices. Each sub-index measures "distance to a target," which is then converted into a scale from 0 to 100. This formula makes it all but certain that at the aggregate level, the EPI is not going to be 0 or 100: a country would have to score 100 on all sub-indices to attain EPI perfection, and 0 on all of them to hit the floor!
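A small Python sketch shows why the extremes are so hard to reach. The sub-index names and weights here are hypothetical, chosen only to keep the arithmetic easy; they are not the EPI's actual components or weights.

```python
def composite_index(scores, weights):
    """Weighted average of sub-index scores, each on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_weight

# Hypothetical sub-indices and weights (not the real EPI structure).
weights = {"air": 40, "water": 30, "biodiversity": 30}

# A country must hit the target on every sub-index to score 100 overall.
perfect = {"air": 100, "water": 100, "biodiversity": 100}
# Missing the target on just one sub-index pulls the composite below 100.
near_perfect = {"air": 100, "water": 100, "biodiversity": 80}

print(composite_index(perfect, weights))       # 100.0
print(composite_index(near_perfect, weights))  # 94.0
```

Because the composite is an average, it can never exceed the best sub-index score or fall below the worst one, which is why the overall index bunches toward the middle of the scale.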

Here is a design sketch to address the question posed above:

Redo_epi_regional

For a print version, I chose several reference countries listed at the bottom that span the range of common values. In the final product, hovering over a stripe should disclose a country and its EPI. Then the reader can construct comparisons of the type: "Thailand has a value of 53, which places it between Brazil and China."

The chart reveals a number of insights. Each region stakes out its territory within the EPI scale. There are no European countries with EPI lower than 45 while there are no South Asian countries with EPI higher than 50 or so. Within each region, the distribution is very wide, and particularly so in the East Asia and Pacific region. Europe is clearly the leading region, followed by North America.

The same format can be replicated for every sub-index.

This type of graph addresses a subset of the set of all possible questions and it does so in a clear way. Modesty in your goals often helps.

 


Reimagining the league table

The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.

For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.

Espn_league_table_nfl_2014

If you want to know weekly results, you have to scroll to each team's section, and look at this format:

Espn_cowboys_2014_team

For the graph that I envisioned for the talk, I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.

***

The graph that I ended up using is this one:

  All_teams_season_winloss_vs_points

 The teams are sorted by winning percentage. One thing should be pretty clear... the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play determining if the team wins or loses.

The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.

The same chart but with colored dots, and a legend:

Jc_dots_two_layers

Only one line of dots per team instead of two, and also requiring a legend:

Jc_dots_one_line

 Jittering is a popular solution to separating co-located dots but the effect isn't very pleasing to my eye:

Jc_dots_oneline_jittered
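For readers who haven't used jittering, here is a minimal sketch of the idea in Python. The scores are hypothetical, and the function name `jitter` and its `width` parameter are my own; plotting libraries typically offer their own version. Each dot keeps its true position along the Points axis and gets a small random nudge in the other direction so that ties no longer sit on top of each other.

```python
import random

def jitter(values, width=0.2, seed=0):
    """Add small uniform noise to each value so co-located dots separate.
    The seed is fixed so the sketch gives the same picture every run."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical weekly points for one team; several weeks share a score.
points = [17, 17, 17, 24, 24, 31]

# Before jittering, every dot for the team sits on the same row (y = 0).
y_positions = jitter([0.0] * len(points))

for x, y in zip(points, y_positions):
    print(x, round(y, 3))
```

The random offsets carry no information, which is part of why the result can look untidy: the eye reads vertical position as meaningful even when it is just noise.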

Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.

Jc_dots_two_panels

 

As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.


Where are the millionaires? Where's the news?

The financial media, ranging from the Wall Street Journal to Zero Hedge, blogged about the geographical distribution of U.S. millionaires. The stories came with a map, and in the case of the latter, two data tables ranked by ascending and descending prevalence of millionaires. The map looks like this:

  Wsj_millionaires

The talking point lifted from the press release of Phoenix Marketing, which is the source of the data, focuses improbably on North Dakota. For example, the WSJ blog began with:

The state making the fastest climb up the millionaire rankings doesn’t have a single Tiffany or Saks Fifth Avenue store. The closest BMW dealership is a six-hour drive from the capital.

Welcome to North Dakota, which jumped 14 spots in the annual rankings of millionaire households per capita released by Phoenix Marketing International.

The trouble is, you can't pick North Dakota out of the map; it just doesn't stand out. The map uses a different methodology of ordering the states, by groupings of the prevalence of millionaires, that is, the proportion of households in each state who are labeled "millionaires" by Phoenix Marketing.

The text, by contrast, draws attention to the change in the rank of states using the proportion of households who are millionaires as the ranking criterion. This data is two steps removed from the data used for the map (start with the map data, compute the year-to-year change, then convert to ranks).

***

State-level averages pose a challenge: state population varies a lot, and this leads to variability in the estimates of smaller states. You are likely to find smaller states over-represented in the top and bottom of state ranking charts. I talked about a similar situation relating to interpreting high school test data (see this post, and the Prologue of Numbersense).
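A quick simulation illustrates the effect. This is a sketch under strong assumptions of my own: every state has the same true rate of millionaire households (5 percent), and each state's estimate is based on a number of households proportional to its size. Neither assumption comes from the Phoenix Marketing data; the goal is only to show that size alone can push small states to the extremes.

```python
import random
import statistics

def simulated_rates(n_states, households, true_rate, rng):
    """Estimated millionaire rate for each of n_states equally sized states,
    where every household is a millionaire with the same probability."""
    rates = []
    for _ in range(n_states):
        hits = sum(rng.random() < true_rate for _ in range(households))
        rates.append(hits / households)
    return rates

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical sizes: small states observe 500 households, large ones 20,000.
small = simulated_rates(20, 500, 0.05, rng)
large = simulated_rates(20, 20_000, 0.05, rng)

print("small states: spread =", round(statistics.pstdev(small), 4))
print("large states: spread =", round(statistics.pstdev(large), 4))
```

Even though every simulated state has the same underlying rate, the small states scatter much more widely around it, so a ranking of these estimates would tend to place small states at both the top and the bottom.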

Instead of using proportion of households who are millionaires, I prefer to use the number of millionaires per 1,000 households. Mathematically, these two are equivalent. If we plot that metric versus the size of states (number of households), we see the familiar pattern:

Redo_millionaires

I labeled the North Dakota data point to show how unremarkable it is. While it may have risen in "rank", it is still ranked below median in terms of number of millionaires per 1,000 households. Also notice that among states with a similar number of households, the millionaires metric ranges wildly from 40 to 70 per 1,000 households.

An interpretation of these state average millionaire metrics has to account for state population size.

***

The following map illustrates the ups and downs between 2007 and 2013 by state.  (I found 2007 data but not the 2012 data.)

Redo_millionaires2

Think of an accounting equation. In this view, the positive changes must balance out the negative changes since I am only concerned about any shift in mix. What this map shows is that Texas, California, New York, and Washington have the top net gains in the number of millionaires while Florida and Michigan have the biggest net losses. North Dakota is again in the middle of the bunch.

This view ignores the total net change in millionaires as it focuses on the mix by state. You'd need to figure out what the relevant question is before you can come up with a good visualization of this (or any) data.
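One way to compute such a shift in mix is to compare each state's share of the national total in the two years. The sketch below is an assumption about the calculation, not a record of exactly how the map was built, and the state names and counts are invented.

```python
def mix_shift(before, after):
    """Change in each state's share of the national total,
    in percentage points. The shifts sum to zero by construction."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    return {
        state: 100 * (after[state] / total_after - before[state] / total_before)
        for state in before
    }

# Hypothetical millionaire household counts (thousands); not the real data.
count_2007 = {"A": 400, "B": 300, "C": 300}
count_2013 = {"A": 550, "B": 300, "C": 250}

shifts = mix_shift(count_2007, count_2013)
for state, pts in shifts.items():
    print(state, round(pts, 2))
```

Notice that state B has exactly the same count in both years yet shows a negative shift, because the national total grew. That is the sense in which this view ignores the total net change and looks only at the mix.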

 

 


A straight line going nowhere fast, despite tweets and likes

Ken B., another Australian reader, wasn't too proud of this effort, apparently excerpted from an HSBC report by the Sydney Morning Herald (link):

Smu_ausdollar

Ken: If you plot ranking by ranking it magically turns into a straight line.

***

There are a few other annoyances. The gridlines, data labels, double-edged arrow, and bars are all based on the same data, which can easily be conveyed with a ranked table. In fact, just turn the chart 90 degrees clockwise, get rid of everything else except the names of countries, and you have a much more readable figure.

The completely unnecessary legend is an Excel special. If only one data series is plotted, it should be automatic to suppress the legend.

The three-letter acronyms for different currencies are a futile educational lesson, kind of like plotting geographical data on maps (in many cases). For most readers, the message of the chart does not require knowing the names of the currencies, nor their acronyms. Those who care about acronyms, say currency traders, most likely already know those letters.

***

Just like I don't understand how we can define "over-rated" or "under-rated" restaurants (see this post and this), I also don't understand how we can define "over-valued" or "under-valued" currencies given the impossibility of knowing the "true value" of any currency.

***

I just had to point your attention to the fact that 123 people tweeted this article, and 221 liked this item on Facebook. And these actions form part of the so-called Big Data revolution.