Say it thrice: a nice example of layering and story-telling

I enjoyed the New York Times's data viz showing how actively the Democratic candidates were criss-crossing the nation in the month of March (link).

It is a great example of layering the presentation, starting with an eye-catching map at the most aggregate level. The designers looped through the same dataset three times.

Nyt_candidatemap_1

This compact display packs quite a lot. We can easily identify which were the most popular states; and which candidate visited which states the most.

I noticed how they handled the legend. There is no explicit legend. The candidate names are spread around the map. The size legend is also missing, replaced by a short sentence explaining that size encodes the number of cities visited within the state. For a chart like this, having a precise size legend isn't that useful.

The next section presents the same data in a small-multiples layout. The heads are replaced by dots.

Nyt_candidatemap_2

This allows more precise comparison of one candidate to another, and one location to another.

This display has one shortcoming. If you compare the left two maps above, those for Amy Klobuchar and Beto O'Rourke, it looks like they have visited roughly similar number of cities when in fact Beto went to 42 compared to 25. Reducing the size of the dots might work.

Then, in the third visualization of the same data, the time dimension is emphasized. Lines are used to animate the daily movements of the candidates, one by one.

Nyt_candidatemap_3

Click here to see the animation.

When repetition is done right, it doesn't feel like repetition.

 


Visually exploring the relationship between college applicants and enrollment

In a previous post, we learned that top U.S. colleges have become even more selective over the last 15 years, driven by a doubling of the number of applicants while class sizes have nudged up by just 10 to 20 percent. 

Redo_pewcollegeadmissions

The top 25 most selective colleges are included in the first group. Between 2002 and 2017, their average rate of admission dropped from about 20% to about 10%, almost entirely explained by applicants per student doubling from 10 to almost 20. A similar upward movement in selectivity is found in the first four groups of colleges, which on average accept at least half of their applicants.

Most high school graduates however are not enrolling in colleges in the first four groups. Actually, the majority of college enrollment belongs to the bottom two groups of colleges. These groups also attracted twice as many applicants in 2017 relative to 2002 but the selectivity did not change. They accepted 75% to 80% of applicants in 2002, as they did in 2017.

***

In this post, we look at a different view of the same data. The following charts focus on the growth rates, indexed to 2002. 

Collegeadmissions_5

To my surprise, the number of college-age Americans  grew by about 10% initially but by 2017 has dropped back to the level of 2002. Meanwhile, the number of applications to the colleges continues to climb across all eight groups of colleges.

The jump in applications made selectivity surge at the most selective colleges but at the less selective colleges, where the vast majority of students enroll, admission rate stayed put because they gave out many more offers as applications mounted. As the Pew headline asserted, "the rich gets richer."

Enrollment has not kept up. Class sizes expanded about 10 to 30 percent in those 15 years, lagging way behind applications and admissions.

How do we explain the incremental applications?

  • Applicants increasing the number of schools they apply to
  • The untapped market: applicants who in the past would not have applied to college
  • Non-U.S. applicants: this is part of the untapped market, but much larger

An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 

Pew_collegeadmissions

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

Pew_collegeadmissions_growthThe vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Redo_pewcollegeadmissions

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. For the top four most selective groups of colleges, they have been able to progressively attract more applicants. Since class size did not expand appreciably, more applicants result in ever-lower admit rate. Lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses admit rate. 

 

 

 


The Bumps come to the NBA, courtesy of 538

The team at 538 did a post-mortem of their in-season forecasts of NBA playoffs, using Bumps charts. These charts have a long history and can be traced back to Cambridge rowing. I featured them in these posts from a long time ago (link 1, link 2). 

Here is the Bumps chart for the NBA West Conference showing all 15 teams, and their ranking by the 538 model throughout the season. 

Fivethirtyeight_nbawest_bumps

The highlighted team is the Kings. It's a story of ascent especially in the second half of the season. It's also a story of close but no cigar. It knocked at the door for the last five weeks but failed to grab the last spot. The beauty of the Bumps chart is how easy it is to see this story.

Now, if you'd focus on the dotted line labeled "Makes playoffs," and note that beyond the half-way point (1/31), there are no further crossings. This means that the 538 model by that point has selected the eight playoff teams accurately.

***

Now what about NBA East?

Fivethirtyeight_nbaeast_bumps

This chart highlights the two top teams. This conference is pretty easy to predict at the top. 

What is interesting is the spaghetti around the playoff line. The playoff race was heart-stopping and it wasn't until the last couple of weeks that the teams were settled. 

Also worthy of attention are the bottom-dwellers. Note that the chart is disconnected in the last four rows (ranks 12 to 15). These four teams did not ever leave the cellar, and the model figured out the final rankings around February.

Using a similar analysis, you can see that the model found the top 5 teams by mid December in this Conference, as there are no further crossings beyond that point. 

***
Go check out the FiveThirtyEight article for their interpretation of these charts. 

While you're there, read the article about when to leave the stadium if you'd like to leave a baseball game early, work that came out of my collaboration with Pravin and Sriram.


Trump resistance chart: cleaning up order, importance, weight, paneling

Morningconsult_gopresistance_trVox featured the following chart when discussing the rise of resistance to President Trump within the GOP.

The chart is composed of mirrored bar charts. On the left side, with thicker pink bars that draw more attention, the design depicts the share of a particular GOP demographic segment that said they'd likely vote for a Trump challenger, according to a Morning Consult poll.

This is the primary metric of interest, and the entire chart is ordered by descending values from African Americans who are most likely (67%) to turn to a challenger to those who strongly support Trump and are the least likely (17%) to turn to someone else.

The right side shows the importance of each demographic, measured by the share of GOP. The relationship between importance and likelihood to defect from Trump is by and large negative but that fact takes a bit of effort to extract from this mirrored bar chart arrangement.

The subgroups are not complete. For example, the only ethnicity featured is African Americans. Age groups are somewhat more complete with under 18 being the only missing category.

The design makes it easy to pick off the most disaffected demographic segments (and the least, from the bottom) but these are disparate segments, possibly overlapping.

***

One challenge of this data is differentiating the two series of proportions. In this design, they use visual cues, like the height and width of the bars, colors, stacked vs not, data labels. Visual variety comes to the rescue.

Also note that the designer compensated for the lack of stacking on the left chart by printing data labels.

***

When reading this chart, I'm well aware that segments like urban residents, income more than $100K, at least college educated are overlapping, and it's hard to interpret the data the way it's been presented.

I wanted to place the different demographics into their natural groups, such as age, income, urbanicity, etc. Such a structure also surfaces demographic patterns, e.g. men are slightly more disaffected than women (not significant), people earning $100K+ are more unhappy than those earning $50K-.

Further, I'd like to make it easier to understand the importance factor - the share of GOP. Because the original form orders the demographics according to the left side, the proportions on the right side are jumbled.

Here is a draft of what I have in mind:

Redo_voxGOPresistance

The widths of the line segments show the importance of each demographic segment. The longest line segments are toward the bottom of the chart (< 40% likely to vote for Trump challenger).

 


A second take on the rural-urban election chart

Yesterday, I looked at the following pictograms used by Business Insider in an article about the rural-urban divide in American politics:

Businessinsider_ruraldistricts

The layout of this diagram suggests that the comparison of 2010 to 2018 is a key purpose.

The following alternate directly plots the change between 2010 and 2018, reducing the number of plots from 4 to 2.

Redo_jc_businessinsider_ruraldistricts_2

The 2018 results are emphasized. Then, for each party, there can be a net add or loss of seats.

The key trends are:

  • a net loss in seats in "Pure rural" districts, split by party;
  • a net gain of 3 seats in "rural-suburban" districts;
  • a loss of 10 Democratic seats balanced by a gain of 13 Republican seats.

 


The merry-go-round of investment bankers

Here is the start of my blog post about the chart I teased the other day:

Businessinsider_ibankers

 

Today's post deals with the following chart, which appeared recently at Business Insider (hat tip: my sister).

It's immediately obvious that this chart requires a heroic effort to decipher. The question shown in the chart title "How many senior investment bankers left their firms?" is the easiest to answer, as the designer places the number of exits in the central circle of each plot relating to a top-tier investment bank (aka "featured bank"). Note that the visual design plays no role in delivering the message, as readers just scan the data from those circles.

Anyone persistent enough to explore the rest of the chart will eventually discover these features...

***

The entire post including an alternative view of the dataset is a guest blog at the JMP Blog here. This is a situation in which plotting everything will make an unreadable chart, and the designer has to think hard about what s/he is really trying to accomplish.


The French takes back cinema but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than French movies, the gap has been narrowing in the last two decades. How do I know? It's the subject of this infographic

DataCinema

How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:

Datacinema_top

 

You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an usual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once at the center, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the area of the donut slices except that the areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths causes the ratios to shift from the correct ones!

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.

***

The one donut is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change of U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange donut slices from one pie chart to another. 

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.

Redo_junkcharts_frenchmovies

A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.

 

P.S.

1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks Carla Paquet at JMP for help translating from French.


Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way. Register here.), as it deals with a related issue.

The first chart shows an income gap widening between 1970 and 2016.

Nyt_crazyrichasians_incomegap1

This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.

Nyt_crazyrichasians_incomeratio2

Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appears at the bottom of the list while blacks shows up as the second "worse" in terms of income inequality. Even though the underlying data are the same, what can be seen in the Bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.

What else do not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team to realize that the two-lines design provides an incomplete, potentially misleading picture.

***

The third chart in the series is a marvellous scatter plot (with one small snafu, which I'd get t0).

Nyt_crazyrichasians_byethnicity

What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian (most of whom are Indians) arrivals push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicites can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.

 

 

 

 

 


Graphical advice for conference presenters - demo

Yesterday, I pulled this graphic from a journal paper, and said one should not copy and paste this into an oral presentation.

Example_presentation_graphic

So I went ahead and did some cosmetic surgery on this chart.

Redo_example_conference_graphic

I don't know anything about the underlying science. I'm just interpreting what I see on the chart. It seems like the key message is that the Flowering condition is different from the other three. There are no statistical differences between the three boxplots in the first three panels but there is a big difference between the red-green and the purple in the last panel. Further, this difference can be traced to the red-green boxplots exhibiting negative correlation under the Flowering condition - while the purple boxplot is the same under all four conditions.

I would also have chosen different colors, e.g. make red-green two shades of gray to indicate that these two things can be treated as the same under this chart. Doing this would obviate the need to introduce the orange color.

Further, I think it might be interesting to see the plots split differently: try having the red-green boxplots side by side in one panel, and the purple boxplots in another panel.

If the presentation software has animation, the presenter can show the different text blocks and related materials one at a time. That also aids comprehension.

***

Note that the plot is designed for an oral presentation in which you have a minute or two to get the message across. It's debatable as to whether journal editors should accept this style for publications. I actually think such a style would improve reading comprehension but I surmise some of you will disagree.