
The less-is-more story, and its meta

The Schwab magazine has an interesting discussion of a marketing research study purportedly showing "less is more" when it comes to consumer choice. They summarized the experimental setup and results in the following succinct graphic:

Schwab jam displays - Jun 4 2017 - 3-45 PM - p3

The data consist of nested proportions. For example, among those seeing display 1, 60% stopped to look at the jams, and among those who stopped, 3% purchased.
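To make the nesting concrete, here is a minimal sketch that converts the two nested percentages quoted above into counts per 100 shoppers. The function name and the base of 100 are my own choices for illustration, not something taken from the Schwab graphic or the research paper.

```python
def nested_counts(base, stop_rate, purchase_rate_among_stoppers):
    """Convert nested proportions into counts out of `base` shoppers.

    The purchase rate applies only to shoppers who stopped, so the two
    rates are multiplied to get purchasers as a share of everyone.
    """
    stopped = base * stop_rate
    purchased = stopped * purchase_rate_among_stoppers
    return {
        "stopped": stopped,
        "purchased": purchased,
        "stopped_no_purchase": stopped - purchased,
        "did_not_stop": base - stopped,
    }


# Display 1, using the figures quoted in the text: 60% stopped, 3% of those purchased.
display_1 = nested_counts(100, 0.60, 0.03)
for group, count in display_1.items():
    print(f"{group}: {count:.1f} per 100 shoppers")
```

Out of every 100 shoppers who saw display 1, about 60 stopped and fewer than 2 purchased, which is the overall rate of 0.60 × 0.03 = 1.8%.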

The nesting is presented as overlap in this design. The blue figures on pink are those shoppers who stopped as well as purchased. The blue figures with no background are those who stopped but did not purchase. The blue figures disregarding background color include everyone who stopped. What about the gray? Those are the shoppers who did not stop at the jam display, which is not a key number. To understand what proportion of shoppers stopped, the reader must take in the entire set of figures, in effect giving the blue and blue/pink figures a change of clothes.

***

In this version, we make it easier to estimate the proportions:

Redo_schwab_jams

Each branch starts with 100 figures. The nesting structure is clearly depicted.

***

It turns out that the original design messed up the numbers. They were trying to be precise. The right side (Display 2) had 29 figures on each row, summing to 260, exactly the number of subjects in that treatment cell. The left side had 28 figures per row (one fewer!), summing to 233. However, according to the research paper being cited, they analyzed 242 subjects who saw Display 1. Nine shoppers went missing.

The extra precision, even if correctly rendered, interferes with our comprehension of proportions. Less is more, indeed!

***

P.S. If you know someone interested in upgrading their skills to join the expanding business analytics workforce, send them to my new venture, Principal Analytics Prep, a next-gen bootcamp that helps people transition careers. Contact me for more information.


Some like it packed, some like it piled, and some like it wrapped

In addition to Xan's "packed bars" (which I discussed here), there are some related efforts to improve upon the treemap. To recap, the treemap is a design to show parts against the whole, and it works by packing rectangles into the bounding box. Frequently, this leads to odd-shaped rectangles, e.g. really thin and really tall ones, and it asks readers to estimate relative areas of differently-scaled boxes. We often make mistakes in this task.

The packed bar chart approaches this challenge by allowing only the width of the box to vary with the data. The height of every box is identical, so readers only have to compare lengths.

Via Twitter, Adil pointed me to this article by him and his collaborators that describes a few alternatives.

One of the options is the "wrapped bar chart" introduced by Stephen Few. Like Xan, he restricts the variation to the lengths of the bars while keeping the heights fixed. But he goes further and abandons packing completely. Instead of packing, Few wraps the bars. Start with a large bar chart with many categories filling up a tall plotting area. He then divides the bars into blocks and places them side by side. Here is an example showing 50 states, ranked by total electoral votes:

Umd_few_wrapped_bars

You can see the white space because there is no packing. This version makes it easier to see the relative importance of the different blocks of states, but it is tough to tell how much the first block of 13 states accounts for. The wrapped bar chart is organized like a set of small multiples, except that the scale in each panel is allowed to vary.
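The wrapping step can be sketched in a few lines. This is only an illustration under my own assumptions: the function names are hypothetical, the block size of 13 is borrowed from the first block mentioned above, and the values are made up rather than actual electoral votes. It also computes the share of each block, which is the quantity the wrapped layout makes hard to read off.

```python
def wrap_bars(values, block_size):
    """Rank values in descending order and split them into consecutive blocks."""
    ranked = sorted(values, reverse=True)
    return [ranked[i:i + block_size] for i in range(0, len(ranked), block_size)]


def block_shares(blocks):
    """Share of the grand total accounted for by each block."""
    total = sum(sum(block) for block in blocks)
    return [sum(block) / total for block in blocks]


# Hypothetical skewed data for 50 categories (NOT real electoral votes).
values = [v * v for v in range(1, 51)]

blocks = wrap_bars(values, block_size=13)
shares = block_shares(blocks)

# Each panel gets its own scale, set here by the largest bar in the block.
panel_scales = [max(block) for block in blocks]

for i, (block, share, scale) in enumerate(zip(blocks, shares, panel_scales), start=1):
    print(f"block {i}: {len(block)} bars, share {share:.1%}, panel scale 0..{scale}")
```

Because each panel is scaled to its own maximum, bars of equal drawn length in different panels stand for different values, which is why the block shares have to be computed rather than eyeballed.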

Another option is the "piled bars." This option, presented by Yalçın, Elmqvist, and Bederson, brings packing back. But unlike the packed bars or the treemap, the outside envelope no longer represents the total amount. In the "piled bars" design, the top X categories act as the canvas, and the smaller categories are packed inside these bars rather than around them. Take a look at this example, which plots GDP growth of different countries:

Umd_piledbars

 The inset on the left column is instructive. The green (smallest) and red (medium) bars are packed inside the blue (largest) bars. In this example, it doesn't make sense to add up GDP growth rates, so it doesn't matter that the outer envelope does not equal the total. It would not work as well with the electoral vote data in the previous example.

I wonder whether a piled dot plot works better than a piled bar chart. The piled bar chart shares a problem with the stacked area chart: every visible piece other than the first represents the difference between its value and the next lower category's value, rather than the value of the data point itself. Readers are led to compare the green, red and blue pieces, but the corresponding values are not truly comparable, nor of primary interest.

This problem goes away if the bars are represented by dots.
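A small sketch shows the difference between what the reader sees in a piled row and what the data say. The three values are hypothetical, and the function is my own illustration, not code from the Yalçın, Elmqvist, and Bederson paper.

```python
def visible_segments(values):
    """For bars piled in one row (all starting at zero and overlapping),
    return the values in ascending order and the visible length of each.

    Only the smallest bar is fully visible; each larger bar shows just the
    part that sticks out beyond the next smaller one.
    """
    ordered = sorted(values)
    segments = []
    previous = 0
    for value in ordered:
        segments.append(value - previous)
        previous = value
    return ordered, segments


# Hypothetical values for the green (smallest), red and blue (largest) bars.
ordered, segments = visible_segments([9, 2, 5])
print("values:          ", ordered)   # [2, 5, 9]
print("visible segments:", segments)  # [2, 3, 4]

# In a piled dot plot, each dot sits at the value itself, so the positions
# are the data and no differencing is involved.
dot_positions = ordered
```

The visible pieces (2, 3, 4) look much closer in size than the underlying values (2, 5, 9), which is the comparison trap described above.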

***

What strikes me as the key paragraph in the article by Yalçın et al. is the following:

To understand graphical perception performance, we studied three basic tasks:

1) How accurately can we estimate the difference between two data points?
2) How accurately can we estimate the rank of a data point among all the rest?
3) How accurately can we guess the distribution characteristic of the whole dataset?

As chart designers, we have to prioritize these tasks. There is unlikely to be a single chart form that will prevail on all three tasks. So if the designer starts with the question that he or she wants to address, that leads to the key task that the visualization should enable, which leads to the chart form that best facilitates that task.


Unintentional deception of area expansion #bigdata #piechart

Someone sent me this chart via Twitter, as an example of yet another terrible pie chart. (I couldn't find that tweet anymore but thank you to the reader for submitting this.)

Uk_itsurvey_left

At first glance, this looks like a pie chart with the radius as a second dimension. But that is the wrong interpretation.

In a pie chart, we typically encode the data in the angles of the pie sectors, or equivalently, the areas of the sectors. In this special case, the angle is invariant across the slices, and the data are encoded in the radius.

Since the data are found in the radii, let's deconstruct this chart by reducing each sector to its left-side edge.

This leads to a different interpretation of the chart: it’s actually a simple bar chart, manipulated.

Redo_ukitsurvey_1

The process of the manipulation runs against what data visualization should be. It takes the bar chart (bottom right) that is easy to read, introduces slants so it becomes harder to digest (top right), and finally absorbs a distortion to go from inefficient to incompetent (left).

What is this distortion I just mentioned? When readers look at the original chart, they are not focusing on the left-side edge of each sector but they are seeing the area of each sector. The ratio of areas is not the same as the ratio of lengths. Adding purple areas to the chart seems harmless but in fact, despite applying the same angles, the designer added disproportionately more area to the larger data points compared to the smaller ones.

  Redo_ukitsurvey_2

To remedy this situation, the designer has to make the length of each edge proportional to the square root of the data value, so that the area of each sector is proportional to the data. But of course, the simple bar chart is more effective.
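The size of the distortion follows from the area of a circular sector, which is one half of the angle (in radians) times the radius squared. The sketch below uses hypothetical data values and an arbitrary common angle; neither comes from the original chart.

```python
import math


def sector_area(radius, angle):
    """Area of a circular sector with the given radius and angle in radians."""
    return 0.5 * angle * radius ** 2


values = [10, 20, 40]   # hypothetical data, largest is 4x the smallest
angle = math.pi / 6     # the same angle for every sector (assumed)

# Radius proportional to the data: areas grow with the square of the data.
areas_linear = [sector_area(v, angle) for v in values]
ratio_linear = areas_linear[-1] / areas_linear[0]

# Radius proportional to the square root of the data: areas track the data.
areas_sqrt = [sector_area(math.sqrt(v), angle) for v in values]
ratio_sqrt = areas_sqrt[-1] / areas_sqrt[0]

print(f"data ratio:                 {values[-1] / values[0]:.1f}")
print(f"area ratio, radius = value: {ratio_linear:.1f}")
print(f"area ratio, radius = sqrt:  {ratio_sqrt:.1f}")
```

With the radius set to the data, a value four times as large gets about sixteen times the area. With the square-root radius, the area ratio matches the data ratio of four.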


What do we think of the "packed" bar chart?

Xan Gregg - my partner in the #onelesspie campaign to replace terrible Wikipedia pie charts one at a time - has come up with a new chart form that he calls "packed bars". It's a combination of bar charts and the treemap.

Here is an example of a packed bar chart, in which the top 10 companies in the S&P 500 index are displayed:

Xangregg_packedbars_tutorial

What he's doing is adding context to help interpret the data. So frequently these days, we encounter data analyses of the "Top X" or "Bottom Y" type. Such analyses are extremely limited in utility as they ignore the bulk of the data. The extreme values have little to nothing to say about the rest of the data. This problem is particularly acute in skewed data.

Compare the two versions:

Xangregg_packedbars_az

The left chart is a Top 10 analysis. The reader knows nothing about the market cap of the other 490 companies. The right chart provides the context. We can see that the Top 10 companies have a combined market cap that is roughly a quarter of the total market cap in the S&P 500. We also learn about the size of the next 10 versus the Top 10, etc.

As with any chart form, a nice dataset can really surface its power. I really like what the packed barchart reveals about the election data by county:

Xangregg_purplepackedbars

(Thanks to Xan for providing me this image.)

Notice the preponderance of red on the right side and the gradual shift from blue/purple to pink/red moving left to right. This is very effective at showing one of the most important patterns in American politics - the small counties are mostly deep red while the Democratic base is to be found primarily in large metropolitan areas. I have previously featured a number of interesting election graphics here. Washington Post's nation of peaks is another way to surface this pattern.

Xan would love to get feedback about this chart type. He has put up a blog post here with more details. I also love this animation he created to show how the packing occurs.


The art of contaminating data

Schwab_indexfundassets_sm

This is one of those innocent-looking charts that could have been a poster child for artistic embellishment. The straightforward time-series chart is deemed too boring. The designer shows admirable restraint in inserting “information-free” content, such as the dense gridlines (graph paper) and the 3D effect (ticker).

These additions seem harmless, but they are not.

Here I turn off the color.

Redo_schwab_indexassets_bw_sm

After the 3D effect is applied, the reader no longer knows whether to look at the top or bottom edge of the ticker.

This view makes this point even clearer.

Jc_redo_schwab_indexassets_bw2_sm

The art contaminates the data.


Announcing a new venture

This is a great time for people in the data business. If you go on LinkedIn and look for data jobs, there are several thousand open positions, just in the New York area. Every department within any business is accumulating data, and they need people to help them get value out of the data.

There are also lots of people I meet who would like to transition their careers to take advantage of these open positions but too many of them are being turned away. Many of these people have great backgrounds in other fields (economics, chemistry, psychology, engineering, IT, etc.), and have the analytical smarts to excel in these new data jobs. They are not getting hired. That's because as hiring managers, we prefer hiring the experienced person who doesn't need additional training. We also poach experienced people from other employers, instead of training new talent, creating a vicious cycle.

This is the problem that I am trying to solve by launching my new venture - Principal Analytics Prep.

 

We_make_data_unicorn_design

 

The single biggest complaint hiring managers have about the talent pool is that people's skills are too narrow: sometimes too technical, sometimes too "soft". Hiring managers in business units outside engineering and software development (for example, marketing, operations, finance, and customer service) want to hire people who can analyze and interpret data in the business context, communicate findings to non-technical audiences, and contribute to inter-departmental teams that solve business problems.

For Principal Analytics Prep, I have assembled a group of passionate instructors - who are in director or above positions in industry, and hiring managers for their teams - to design a broad-based curriculum that helps people upgrade their skills to meet industry needs. Our courses range from coding to statistical reasoning to business skills. The faculty have worked at places such as American Express, Cisco, Goldman Sachs, HBO, McKinsey, Mount Sinai, SiriusXM Radio, and Vimeo, with an average of 10 years in industry.

We are not a pure coding academy, so we want to assemble people from all disciplines.

We will be launching the first class of students this summer in NYC.

***

Blog readers, you can help me in the following ways:

  • If you know anyone who's looking to upgrade their skills and get into the business analytics/data science field, tell them about the program
  • If you are interested in teaching a course, contact me
  • I am also looking for part-time help with administration and operations, so if you believe in my vision, contact me

If you have suggestions, please leave a comment. Thank you.