
Fifty-nine intersections supporting forty dots of data

My friend Ray V. asked how this chart can be improved:

Econ_rv_therichgetsricher

Let's try to read this chart. The Economist is always the best at writing headlines, and this one is simple and to the point: the rich get richer. This is about inequality but not just inequality - the growth in inequality over time.

Each country has four dots, divided into two pairs. From the legend, we learn that the line represents the gap between the rich and the poor. But what is rich and what is poor? Looking at the sub-header, we learn that the population is divided by domicile, and the per-capita GDP of the poorest and richest regions are drawn. This is an indirect metric, and may or may not be good, depending on how many regions a country is divided into, the dispersion of incomes within each region, the distribution of population between regions, and so on.

Now, looking at the axis labels, it's pretty clear that the data depicted are not in dollars (or currency), despite the reference to GDP in the sub-header. The numbers represent indices, relative to the national average GDP per head. For many of the countries, the poorest region produces about half as much per-capita GDP as the richest region.

Back to the original question. A growing inequality would be represented by a longer line below a shorter line within each country. That is true in some of these countries. The exceptions are Sweden, Japan, and South Korea.

***
It doesn't jump out that the key task requires comparing the lengths of the two lines. Another issue is the outdated convention of breaking up a line (Britain) when the line is of extreme length - particularly unwise given that the length of the line encodes the key metric in the chart.

Further, it has a low data-ink ratio, à la Tufte. The gridlines, reference lines, and data lines weave together in a complex pattern, creating 59 intersections in a chart that contains only 36 numbers.

***

I decided to compute a simpler metric - the ratio of rich to poor. For example, in the UK, the richest area produces about 20 times as much GDP per capita as the poorest one in 2015. That is easier to understand than an index relative to the national average.
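The ratio can be computed directly from the two plotted indices, because the national average cancels out in the division. Here is a minimal sketch; the function name and the index values are made up for illustration and are not read off the Economist chart.

```python
def rich_poor_ratio(richest_index, poorest_index):
    """Ratio of the richest region's GDP per head to the poorest region's.

    Both inputs are indices relative to the national average, so the
    national average cancels out when one is divided by the other.
    """
    if poorest_index <= 0:
        raise ValueError("poorest_index must be positive")
    return richest_index / poorest_index


# Hypothetical (richest, poorest) index values for two years, illustration only.
example = {"2000": (180.0, 60.0), "2015": (200.0, 50.0)}

ratios = {year: rich_poor_ratio(rich, poor) for year, (rich, poor) in example.items()}
print(ratios)
```

A ratio that rises from one year to the next signals a widening gap, which is the comparison the original chart asks readers to make by judging line lengths.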

I had fun making the following chart, although many standard forms like the Bumps chart (i.e. slopegraph) or paired columns and so on also work.

Redo_econ_jc_richgetricher

This chart is influenced by Ed Tufte, who spent a good number of pages in his first book advocating stripping even the standard column chart to its bare essence. The chart also acknowledges the power of design to draw attention.

 

 

PS. Sorry I counted incorrectly. The chart has 36 dots not 40. 


Three pies and a bar: serving visual goodness

If you are not sick of the Washington Post article about friends (not) letting friends join the other party, allow me to write yet another post on, gasp, that pie chart. And sorry to have kept reader Daniel L. waiting: when he submitted this chart to me, he pointed out that he had tremendous difficulty understanding it:

Wpost_friendsparties4

 

This is not one pie but six pies on a platter. There are two sources of confusion: first, the repeated labels of Republicans and Democrats to refer to different groups of people; and second, the indecision between using two or four categories of "how many".

Let me begin by re-ordering and re-labeling the chart:

Redo_junkcharts_friendsparties4

From this version, one can pull out the key messages of the analysis: (A) most voters, regardless of party, have mostly friends from the same party, and (B) Republicans are more likely than Democrats to have more friends from the other party. A third, but really not that interesting, point is that regardless of party, people have about the same likelihood to befriend Independents.

In visualization, "less is more" is frequently appropriate. So, here is a view of the same chart, using two categories instead of four.

Redo_junkcharts_friendsparties4b

An added advantage is that only two colors are required, so even grayscale can work.

The new arrangement of the pie platter makes it clear that there really isn't that much difference between Republican and Democratic voters along this dimension. Thus, visualizing the aggregate gets us to the same place.

Redo_junkcharts_friendsparties4c

After three servings of pies, the reader might be craving some energy bars:

Redo_junkcharts_friendsparties4d

One can say that for very simple data like this, pie charts are acceptable. However, the stacked bar is better.

Thanks again Daniel, and it's a pleasure to serve you!


Lop-sided precincts, a visual exploration

In the last post, I discussed one of the charts in the very nice Washington Post feature, delving into polarizing American voters. See the post here. (Thanks again Daniel L.)

Today's post is inspired by the following chart (I am showing only the top of it - click here to see the entire chart):

Wpost_friendsparties2_top

The chart plots each state as a separate row, so like most such charts, it is tall. The data analysis behind the chart is fascinating and unusual, although I find the chart harder to grasp than expected. The analyst starts with precinct-level data, and determines which precincts were "lop-sided," defined as having a winning margin of over 50 percent for the winner (either Trump or Clinton). The analyst then sums the voters in those lop-sided precincts, and expresses this as a percent of all voters in the state.

For example, in Alabama, the long red bar indicates that about 48% of the state's voters live in lop-sided precincts that went for Trump. It's important to realize that not all such people voted for Trump - they happened to live in precincts that went heavily for Trump. Interestingly, about 12% of the state's voters reside in precincts that went heavily for Clinton. Thus, overall, 60% of Alabama's voters live in lop-sided precincts.
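To make the calculation concrete, here is a rough sketch of how such a metric could be computed from precinct-level counts. It is not the Post analyst's code. I am assuming that the margin is measured as a share of the votes cast for the two candidates, and the precinct counts are invented for illustration.

```python
def lopsided_shares(precincts, threshold=0.50):
    """Share of a state's voters living in lop-sided precincts, by winner.

    precincts: list of (trump_votes, clinton_votes) tuples, one per precinct.
    A precinct counts as lop-sided when the winner's margin, as a fraction of
    the two-candidate vote (an assumption), exceeds `threshold`.
    Returns (share_in_trump_precincts, share_in_clinton_precincts).
    """
    total = sum(t + c for t, c in precincts)
    if total == 0:
        raise ValueError("no votes recorded")
    trump_side = 0
    clinton_side = 0
    for t, c in precincts:
        votes = t + c
        if votes == 0:
            continue
        margin = (t - c) / votes
        # All voters in a lop-sided precinct are counted, not only the winner's voters.
        if margin > threshold:
            trump_side += votes
        elif margin < -threshold:
            clinton_side += votes
    return trump_side / total, clinton_side / total


# Hypothetical precinct counts, for illustration only.
precincts = [(800, 200), (450, 550), (100, 900), (600, 400)]
trump_share, clinton_share = lopsided_shares(precincts)
print(trump_share, clinton_share, trump_share + clinton_share)
```

Note that the numerator sums every voter in a lop-sided precinct, which is why the 48% in Alabama does not mean that 48% of the state voted for Trump.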

This is more sophisticated than the usual analysis that shows up in journalism.

The bar chart may confuse readers for several reasons:

  • The horizontal axis is labeled "50-point plus margin for Trump/Clinton" and runs from 0% to somewhere in the 40-60% range. This label seemingly implies that the values being plotted are winning margins. However, the sub-header tells readers that the data values are percentages of total voters in the state.
  • The shades of colors are not explained. I believe the dark shade indicates the winning party in each state, so Trump won Alabama and Clinton, California. The addition of this information allows the analysis to become multi-dimensional. It also reveals that the designer wants to address how lop-sided precincts affect the outcome of the election. However, adding shade in this manner effectively turns a two-color composition into a four-color composition, adding to the processing load.
  • The chart adopts what Howard Wainer calls the "Alabama first" ordering. This always messes up the designer's message because the alphabetical order typically does not yield a meaningful correlation.

The bars are facing out from the middle, which is the 0% line. This arrangement is most often used in a population pyramid, and used when the designer feels it important to let readers compare the magnitudes of two segments of a population. I do not feel that the Democrat versus Republican comparison within each state is crucial to this chart, given that most states were not competitive.

What is more interesting to me is the total proportion of voters who live in these lop-sided precincts. The designer agrees on this point, and employs bar stacking to make this point. This yields some amazing insights here: several Democratic strongholds such as Massachusetts surprisingly have few lop-sided precincts.

***
Here then is a remake of the chart according to my priorities. Click here for the full chart.

Redo_wpost_friendsparties2_top

The emphasis is on the total proportion of voters in lop-sided precincts. The states are ordered by that metric from most lop-sided to least. This draws out an unexpected insight: most red states have a relatively high proportion of voters in lop-sided precincts (~ 30 to 40%) while most blue states - except for the quartet of Maryland, New York, California and Illinois - do not exhibit such demographic concentration.

The gray/grey area offers a counterpoint, that most voters do not live in lop-sided districts.

P.S. I should add that this is one of those chart designs that frustrate standard - I mean, point-and-click - charting software because I am placing the longest bar segments on the left, regardless of color.
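For those who script their charts, the ordering logic is simple to express. This is only a sketch of the data preparation, not of the drawing itself. The Alabama figures are the approximate ones quoted above; "State B" and "State C" and their values are hypothetical.

```python
def order_for_remake(states):
    """Prepare rows for the remade chart.

    states: dict mapping state name -> (trump_pct, clinton_pct), the percent of
    voters living in lop-sided precincts won by each candidate.
    Returns a list of (name, total_pct, segments), sorted from most to least
    lop-sided, where segments lists the longer bar segment first.
    """
    rows = []
    for name, (trump, clinton) in states.items():
        segments = sorted(
            [("Trump", trump), ("Clinton", clinton)],
            key=lambda seg: seg[1],
            reverse=True,  # longest segment goes on the left, regardless of color
        )
        rows.append((name, trump + clinton, segments))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


states = {
    "State B": (5, 30),    # hypothetical
    "Alabama": (48, 12),   # approximate values quoted in the post
    "State C": (10, 8),    # hypothetical
}
rows = order_for_remake(states)
for name, total, segments in rows:
    print(name, total, segments)
```

Because the color of the leftmost segment changes from row to row, the segments have to be carried as (label, value) pairs rather than as two fixed series, which is exactly what point-and-click software makes awkward.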


Let's not mix these polarized voters as the medians run away from one another

Long-time follower Daniel L. sent in a gem, by the Washington Post. This is a multi-part story about the polarization of American voters, nicely laid out, with superior analyses and some interesting graphics. Click here to see the entire article.

Today's post focuses on the first graphic. This one:

Wpost_friendsparties1

The key messages are written out on the 2017 charts: namely, 95% of Republicans are more conservative than the median Democrat, and 97% of Democrats are more liberal than the median Republican.

This is a nice statistical way of laying out the polarization. There are a number of additional insights one can draw from the population distributions: for example, in the bottom row, the Democrats have been moving left consistently, and decisively in 2017. By contrast, Republicans moved decisively to the right from 2004 to 2017. I recall reading about polarization in past elections but it is really shocking to see the extreme in 2017.

A really astounding but hidden feature is that the median Democrat and the median Republican were not too far apart in 1994 and 2004 but the gap exploded in 2017.
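Both statistics, the share of one party beyond the other party's median and the gap between the two medians, are easy to compute from respondent-level scores. Here is a minimal sketch. I am assuming each respondent has a numeric ideology score in which higher means more conservative; the scores below are invented and are not the survey data behind the chart.

```python
from statistics import median


def share_beyond_other_median(own_scores, other_scores, more_conservative):
    """Fraction of own_scores lying strictly beyond the median of other_scores.

    Scores are assumed to run from liberal (low) to conservative (high).
    Set more_conservative=True to count scores above the other party's median,
    False to count scores below it.
    """
    other_median = median(other_scores)
    if more_conservative:
        count = sum(1 for s in own_scores if s > other_median)
    else:
        count = sum(1 for s in own_scores if s < other_median)
    return count / len(own_scores)


# Hypothetical ideology scores, for illustration only.
democrats = [-8, -6, -4, -2, 5]
republicans = [-5, 1, 3, 6, 8]

rep_share = share_beyond_other_median(republicans, democrats, more_conservative=True)
dem_share = share_beyond_other_median(democrats, republicans, more_conservative=False)
median_gap = median(republicans) - median(democrats)
print(rep_share, dem_share, median_gap)
```

Tracking `median_gap` across survey years gives the single number behind the "exploding gap" described above.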

***

I would like to solve a few minor problems on this graphic. It's a bit confusing to have each chart display information on both Republican and Democratic distributions. The reader has to understand that in the top row, the red area represents Republican voters but the blue line shows the median Democrat.

Also, I want to surface two key insights: the huge divide that developed in 2017, and the exploding gap between the two medians.

Here is the revised graphic:

  Redo_wpost_friendsparties1

On the left side, each chart focuses on one party, and the trend over the three elections. The reader can cross charts to discover that the median voter in one party is more extreme than essentially all of the voters of the other party. This same conclusion can be drawn from the exploding gap between the median voters in either party, which is explicitly plotted in the lower right chart. The top right chart is a pretty visualization of how polarized the country was in the 2017 election.

 


Excel is the graveyard of charts, no!

It's true that Excel is responsible for large numbers of horrible charts. I just came across a typical example recently:

Ewolff_meanmedianincome

This figure comes from Edward Wolff's 2012 paper, "The Asset Price Meltdown and the Wealth of the Middle Class." It's got all the hallmarks of Excel defaults. It's not a pleasing object to look at.

However, it's also true that Excel can be used to make nice charts. Here is a remake:

Redo_meanmedianincome2

This chart is made almost entirely in Excel - the only edit I made outside Excel is to decompose the legend box.

It takes five minutes to make the first chart; it takes probably 30 minutes to make the second chart. That is the difference between good and bad graphics. Excel users: let that be your inspiration!


Dataviz Seminar and other upcoming events

Please help me spread the word on several upcoming events. If you're coming, please say hi!

 

Data Visualization Seminar - JMP Explorers Series

WHEN: October 4, 2017, Wed, 9 am - 2:30 pm (ET)
WHERE: New School, 63 5th Avenue, New York
REGISTER HERE: Link

In this seminar, I offer tips on making effective visualizations of data, summarizing over a dozen years of critiquing thousands of data graphics.

PS. New Yorkers: I typically start the seminar with an example of dataviz with a local flavor. If you've seen something interesting recently, send it my way!

 

Principal Analytics Prep Information Session & Webinar on Digital Ad Fraud Analytics

WHEN: October 11, 2017, Wed, 7 - 8 pm (ET)
WHERE: Online
REGISTER HERE: Link

In this webinar, I will discuss the data analytics revolution, and answer questions on how to start or develop your career in this exciting field. In addition, I invited Dr. Augustine Fou, a leading ad fraud researcher, to comment on the recent scandals of fake data in digital advertising. Augustine and I raised the alarm on this huge problem in a Harvard Business Review article in 2015!

Earlier this year, I launched Principal Analytics Prep, an intensive, 12-week bootcamp, created and staffed by leading industry experts, designed to open doors to new careers in data analytics and data science. In the past 15 years, I established and led data teams at SiriusXM Radio and Vimeo, in addition to teaching and running academic programs at Columbia and NYU.

How to Break into the Hottest Sector of the Job Market: Data Science & Analytics

WHEN: October 12, 2017, Thur, 6 - 7:30 pm (ET)
WHERE: New York Public Library, Small Business & Industry Library (SIBL), 188 Madison Avenue, New York
MORE INFO: Link to NYPL

In this talk, I discuss what data science & analytics is, why this sector is exploding, what trends are driving such growth, and how you can take advantage of this jobs boom.

If you can't make it in person, a short version of this talk will be presented at the Principal Analytics Prep online information session mentioned above. You can register here.