Webinar Wednesday

Lyon_onlinestreaming


I'm delivering a quick-fire Webinar this Wednesday on how to make impactful data graphics for communication and persuasion. Registration is free, at this link.

***

In the meantime, I'm preparing a guest lecture for the Data Visualization class at Yeshiva University Sims School of Management. The goal of the lecture is to emphasize the importance of incorporating analytics into the data visualization process.

Here is the lesson plan:

  1. Introduce the Trifecta Checkup (link), which is the general framework for effective data visualizations
  2. Provide examples of Type D data visualizations, i.e. graphics that have good production values but fail due to issues with the data or the analysis
  3. Hands-on demo of an end-to-end data visualization process
  4. Lessons from the demo including the iterative nature of analytics and visualization; and sketching
  5. Overview of basic statistics concepts useful to visual designers

 


The French take back cinema, but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than French movies, the gap has been narrowing in the last two decades. How do I know? It's the subject of this infographic:

DataCinema

How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:

Datacinema_top

 

You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an unusual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once at the center, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the areas of the donut slices, yet those areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths causes the ratios to shift from the correct ones!
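The distortion can be checked with a little geometry. Below is a minimal sketch, assuming the slice is an annular sector whose area is half the angle times the difference of the squared radii. The shares, the inner radius, and the rule that ring width scales with the share are all made-up values for illustration, not measurements from the infographic.

```python
import math


def slice_area(share, inner_radius, ring_width):
    """Area of a donut slice covering `share` of the full circle."""
    outer_radius = inner_radius + ring_width
    angle = 2 * math.pi * share
    # Annular sector: half the angle times the difference of squared radii.
    return 0.5 * angle * (outer_radius ** 2 - inner_radius ** 2)


# Hypothetical shares and dimensions, for illustration only.
share_a, share_b = 0.6, 0.3
inner = 1.0

# Equal ring width: the area ratio matches the ratio of the shares.
equal_ratio = slice_area(share_a, inner, 0.5) / slice_area(share_b, inner, 0.5)

# Ring width also scaled by the share: the same data are encoded twice.
varied_ratio = slice_area(share_a, inner, share_a) / slice_area(share_b, inner, share_b)

print(f"data ratio:                {share_a / share_b:.2f}")
print(f"area ratio, equal widths:  {equal_ratio:.2f}")
print(f"area ratio, varied widths: {varied_ratio:.2f}")
```

With equal widths the area ratio tracks the data; once the width also varies with the data, the larger slice is exaggerated.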

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.

***

The one donut is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change of U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange donut slices from one pie chart to another. 

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.

Redo_junkcharts_frenchmovies

A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.

 

P.S.

1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks Carla Paquet at JMP for help translating from French.


Big Macs in Switzerland are amazing, according to my friend

Bigmac_ch

Note for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.

***

The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:

Econ_bigmacindex

The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, who is a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than the same thing in the States. So, part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rates.

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:

Redo_jc_econbigmac2018

I transformed the data as I don't find it compelling to learn that Russian Big Macs cost 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for a U.S. Big Mac will buy him/her almost two and a half Big Macs in Russia.

The arrows pointing left indicate that in most countries, the values of their currencies declined relative to the dollar from 2017 to 2018 (at least from the Big Mac Index point of view). The only exception is Turkey, where in 2018 the price paid for one U.S. Big Mac buys more Big Macs than it did in 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.
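The transformation and the grouping described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the 60% figure is the approximate Russian discount quoted in the text, and rounding to the nearest half is my assumption about how countries fall into half-Big-Mac groups.

```python
def big_macs_per_us_price(discount_vs_us):
    """Number of local Big Macs bought with the price of one U.S. Big Mac.

    `discount_vs_us` is the fraction by which the local burger is cheaper
    (0.60 means 60% cheaper); a negative value means it is more expensive.
    """
    if discount_vs_us >= 1:
        raise ValueError("a discount of 100% or more has no finite ratio")
    return 1.0 / (1.0 - discount_vs_us)


def nearest_half(value):
    """Round to the nearest half unit (assumed grouping rule)."""
    return round(value * 2) / 2


russia = big_macs_per_us_price(0.60)
print(russia, nearest_half(russia))
```

A burger that is 60% cheaper costs 0.4 of the U.S. price, so the U.S. price buys 1 / 0.4 of them.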

This example demonstrates yet again that, to make good data visualization, one has to describe an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta - a guide to it is here.

(P.S. I noticed that Bitly just decided unilaterally to deactivate my customized Bitly link that was configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped why "unreliability" is a feature of the offering by most Tech companies.)


Two good charts can use better titles

NPR has this chart, which I like:

Npr_votersgunpolicy

It's a small-multiples set of bumps charts. Nice, clear labels. No unnecessary things like axis labels. Intuitive organization by Major Factor, Minor Factor, and Not a Factor.

Above all, the data convey a strong, surprising message - despite many high-profile gun violence incidents this year, some Democratic voters are actually much less likely to see guns as a "major factor" in deciding their vote!

Of course, the overall importance of gun policy is down but the story of the chart is really about the collapse on the Democratic side, in a matter of two months.

The one missing thing about this chart is a nice, informative title: In two months, gun policy went from a major to a minor issue for some Democratic voters.

***

I am impressed by this Financial Times effort:

Ft_millennialunemploy

The key here is the analysis. Most lazy analyses compare millennials to other generations at their current ages, but this analyst looked at each generation at the same age range of 18 to 33 (i.e. controlling for age).

Again, the data convey a strong message - millennials have significantly higher un(der)employment than previous generations at their age range. Similar to the NPR chart above, the overall story is not nearly as interesting as the specific story - it is the pink area ("not in labour force") that is driving this trend.

Specifically, millennial unemployment is high because the proportion of people classified as "not in labour force" has doubled in 2014, compared to all previous generations depicted here. I really like this chart because it lays waste to a prevailing theory spread around by reputable economists - that somehow after the Great Recession, demographic trends are causing the explosion in people classified as "not in labour force". These people are nobodies when it comes to computing the unemployment rate. They literally do not count! There is simply no reason why someone who has just graduated from college should be out of the labour force by choice. (Dean Baker has a discussion of the theory that people not wanting to work is a long-term trend.)

The legend would be better placed to the right of the columns, rather than the top.

Again, this chart benefits from a stronger headline: BLS Finds Millennials are twice as likely as previous generations to have dropped out of the labour force.


Two thousand five hundred ways to say the same thing

Wallethub published a credit card debt study, which includes the following map:

Wallethub_creditcardpaydownbyCity

Let's describe what's going on here.

The map plots cities (N = 2,562) in the U.S. Each city is represented by a bubble. The color of the bubble ranges from purple to green, encoding the percentile ranking based on the amount of credit card debt that was paid down by consumers. Purple represents the 1st percentile, the lowest amount of paydown, while green represents the 99th percentile, the highest amount of paydown.

The bubble size is encoding exactly the same data, apparently in a coarser gradation. The more purple the color, the smaller the bubble. The more green the color, the larger the bubble.

***

The design decisions are baffling.

Purple is more noticeable than the green, but signifies the less important cities, with the lesser paydowns.

With over 2,500 bubbles crowding onto the map, over-plotting is inevitable. The purple bubbles are printed last, dominating the attention but those are the least important cities (1st percentile). The green bubbles, despite being larger, lie underneath the smaller, purple bubbles.

What might be the message of this chart? Our best guess is: the map explores the regional variation in the paydown rate of credit card debt.

The analyst provides all the data beneath the map. 

Wallethub_paydownbyCity_data

From this table, we learn that the ranking is not based on total amount of debt paydown, but the amount of paydown per household in each city (last column). That makes sense.

Shouldn't it be ranked by the paydown rate instead of the per-household number? Dividing the "Total Credit Card Paydown by City" by "Total Credit Card Debt Q1 2018" should yield the paydown rate. Surprise! This formula yields a column consisting entirely of 4.16%.

What does this mean? They applied the national paydown rate of 4.16% to every one of 2,562 cities in the country. If they had plotted the paydown rate, every city would attain the same color. To create "variability," they plotted the per-household debt paydown amount. Said differently, the color scale encodes not credit card paydown as asserted but amount of credit card debt per household by city.
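Here is a minimal sketch of why this happens. The three cities and their figures are invented for illustration; only the 4.16% rate comes from the table. When one national rate is applied to every city, the computed rate is constant, and ranking cities by paydown per household is the same as ranking them by debt per household.

```python
NATIONAL_PAYDOWN_RATE = 0.0416  # the 4.16% figure recovered from the table

# Hypothetical cities: (name, total credit card debt in dollars, households).
cities = [
    ("City A", 900_000_000, 100_000),
    ("City B", 300_000_000, 50_000),
    ("City C", 120_000_000, 40_000),
]


def build_rows(cities, rate):
    """Derive per-city columns when one rate is applied to every city."""
    rows = []
    for name, debt, households in cities:
        paydown = debt * rate  # same national rate applied everywhere
        rows.append({
            "city": name,
            "rate": paydown / debt,
            "debt_per_household": debt / households,
            "paydown_per_household": paydown / households,
        })
    return rows


def ranking(rows, key):
    """City names ordered from highest to lowest value of `key`."""
    return [r["city"] for r in sorted(rows, key=lambda r: r[key], reverse=True)]


rows = build_rows(cities, NATIONAL_PAYDOWN_RATE)
for r in rows:
    print(r["city"], f"{r['rate']:.2%}", round(r["paydown_per_household"], 2))
```

Because the paydown is just the debt multiplied by a constant, the two per-household columns are perfectly proportional.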

Here is a scatter plot of the credit card debt amount against the paydown amount.

Redo_creditcardpaydown_scatter

A perfect alignment!

This credit card debt paydown map is an example of a QDV chart, in which there isn't a clear question, there is almost no data, and the visual contains several flaws. (See our Trifecta checkup guide.) We are presented 2,562 ways of saying the same thing: 4.16%.

 

P.S. [6/22/2018] Added scatter plot, and cleaned up some language.


Digital revolution in China: two visual takes

The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as making a Silicon Valley. 

Economist_digitalsilkroad

The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:

  • GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
  • The visual design favors GDP per capita, which is encoded as the shade of color of each province. The Internet penetration data appear to have been added on as an afterthought.

If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!

***

If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.

Redo_jc_china_digitalsilkroad2

(Click on the image to see it in full.) Here is the top of the graphic:

Redo_jc_china_digitalskilkroad_detail

The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because regional pattern isn't salient here.

The Internet penetration data were grouped into five categories as well, putting them on an equal footing with GDP per capita.

 


Well-structured, interactive graphic about newsrooms

Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.

The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.

One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.

At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.

Goog_newsrooms_gender_1

The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)

The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.

I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.

***

The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.

Goog_newsrooms_gender_4

Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.

Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.

***

Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.

Goog_newsrooms_gender_3

The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.

While I typically recommend showing a balanced axis for a bipolar scale, this chart may be an exception. Moving to the right side is progress but the target sits in the middle; the goal isn't to get the dots to the far right, so much of the right panel is wasted space.

 


Several problems with stacked bar charts, as demonstrated by a Delta chart designer

In the Trifecta Checkup (link), I like to see the Question and the Visual work well together. Sometimes, you have a nice message but you just pick the wrong Visual.

An example is the following stacked column chart, used in an investor presentation by Delta.

Delta_aircraft

From what I can tell, the five types of aircraft are divided into RJ (regional jet) and others (perhaps, larger jets). Within each of those groups, there are two or three subtypes. The primary message here is the reduction in the RJ fleet and the expansion of Small/Medium/Large.

One problem with a stacked column chart with five types is that it takes too much effort to understand the trends of the middle types.

The two types on the edges are not immune to confusion either. As shown below, both the dark blue (Large) type and the dark red (50-seat RJ) type are associated with downward sloping lines except that the former type is growing rapidly while the latter is vanishing from the mix!

Redo_delta_aircraft

In this case, the slopegraph (Bumps-type chart) can overcome some of the limitations.

Redo_deltaaircraft_2

***

This example was used in my new dataviz workshop, launched in St. Louis yesterday. Thank you to the participants for making it a lively session!


Steel tariffs, and my new dataviz seminar

I am developing a new seminar aimed at business professionals who want to improve their ability to communicate using charts. I want any guidance to be tool-agnostic, so that attendees can implement them using Excel if that’s their main charting software. Over the 12+ years that I’ve been blogging, certain ideas keep popping up; and I have collected these motifs and organized them for the seminar. This post is about a recent chart that brings up a few of these motifs.

This chart has been making the rounds in articles about the steel tariffs.

2018.03.08steel_1

The chart shows the Top 10 nations that sell steel to the U.S., which together account for 78% of all imports. 

The chart shows a few signs of design. These things caught my eye:

  1. the pie chart on the left delivers the top-line message that 10 countries account for almost 80% of all U.S. steel imports
  2. the callout gives further information about which 10 countries and how much each nation sells to the U.S. This is a nice use of layering
  3. on the right side, progressive tints of blue indicate the respective volumes of imports

On the negative side of the ledger, the chart is marred by three small problems. Each of these problems concerns inconsistency, which creates confusion for readers.

  1. Inconsistent use of color: on the left side, the darker blue indicates lower volume while on the right side, the darker blue indicates higher volume
  2. Inconsistent coding of pie slices: on the right side, the percentages add up to 78% while the total area of the pie is 100%
  3. Inconsistent scales: the left chart carrying the top-line message is notably smaller than the right chart depicting the secondary message. Readers’ first impression is drawn to the right chart.

Easy fixes lead to the following chart:

Redo_steelimports_1

***

The central idea of the new dataviz seminar is that there are many easy fixes that are often missed by the vast majority of people making Excel charts. I will present a stack of these motifs. If you're in the St. Louis area, you get to experience the seminar first. Register for a spot here.

Send this message to your friends and coworkers in the area. Also, contact me if you'd like to bring this seminar to your area.

***

I also tried the following design, which brings out some other interesting tidbits, such as that Canada and Brazil together sell the U.S. about 30% of its imported steel, the top 4 importers account for about 50% of all steel imports, etc. Color is introduced on the chart via a stylized flag coloring.

Redo_steelimports_2


The tech world in which everyone is below average

Laura pointed me to an infographic about tech worker salaries in major tech hubs (link).

What's wrong with this map?

Entrepreneur_techsalaries_map

The box "Global average" is doubly false. It is not global, and it is not the average!

The only non-American cities included in this survey are Toronto, Paris and London.

The only city with an average salary above the "Global average" is the San Francisco Bay Area. Since the Bay Area does not outweigh all other cities combined in the number of tech workers, it is impossible to get an average of $135,000.

***

Here is the second chart.

What's wrong with these lines?

Entrepreneur_techsalaries_lines

This chart frustrates the reader's expectations. The reader interprets it as a simple line chart, based on three strong hints:

  • time along the horizontal axis
  • data labels show dollar units
  • lines linking time

Each line seems to show the trend of average tech worker salary, in dollar units.

However, that isn't the designer's intention. Let's zoom in on Chicago and Denver:

Entrepreneur_techsalaries_lines2

The number $112,000 (Denver) sits below the number $107,000 (Chicago). It appears that each chart has its own scale. But that's not the case either.

For a small-multiples setup, we expect all charts to use the same scale. Even though the data labels are absolute dollar amounts, the vertical axis is on a relative scale (percent change). To make things even more complicated, the percent change is computed relative to the minimum of the three annual values, no matter in which year it occurs.

Redo_entrepreneurtechsalarieslines2

That's why $106,000 (Chicago) is at the same level as $112,000 (Denver). Those are the minimum values in the respective time series. As shown above, these line charts are easier to understand if the axis is displayed in its true units of percent change.

The choice of using the minimum value as the reference level interferes with comparing one city to the next. For Chicago, the line chart tells us 2015 is about 2 percent above 2016 while 2017 is 6 percent above. For Denver, the line chart tells us that 2016 is about 2 percent above the 2015 and 2017 values. Now what's the message again?
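The two reference levels, the minimum value versus the earliest year, can be compared in a short sketch. The salary series below are made-up numbers shaped loosely like the description above, not the survey's data, and the function names are hypothetical.

```python
def pct_above_reference(series, reference):
    """Percent difference of each annual value from a reference value."""
    return {year: 100.0 * (value / reference - 1.0) for year, value in series.items()}


def index_to_minimum(series):
    """Reference is the smallest value, whichever year it falls in."""
    return pct_above_reference(series, min(series.values()))


def index_to_first_year(series):
    """Reference is the value in the earliest year."""
    return pct_above_reference(series, series[min(series)])


# Hypothetical salary series, for illustration only.
city_a = {2015: 108_000, 2016: 106_000, 2017: 112_000}
city_b = {2015: 112_000, 2016: 114_000, 2017: 112_000}

for name, series in (("City A", city_a), ("City B", city_b)):
    print(name, "vs minimum:   ", index_to_minimum(series))
    print(name, "vs first year:", index_to_first_year(series))
```

Indexing to the minimum puts the zero line in a different year for each city, while indexing to the first year gives every city the same starting point, so the lines can be compared directly.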

Here I index all lines to the earliest year.

Redo_junkcharts_entrepreneurtechsalaries_lines

In a Trifecta Checkup analysis (link), I'd be suspicious of the data. Did tech salaries in London really drop by 15-20 percent in the last three years?