Clearing a forest of labels

This chart by the Financial Times has a strong message, and I like a lot about it:

Ft-europe-growth

The countries are by and large aligned along a diagonal, with the poorer countries growing strongly between 2007 and 2019 while the richer countries suffered negative growth.

A small issue with the chart is the thick forest of text - redundant text. The sub-title, the axis titles, the quadrant labels, and the left-right-half labels all repeat the same things. In the following chart, I simplify the text:

Redo_fteuropegrowth_text

Typically, I don't put axis titles as a sub-header (or, header of the graphic) but as this may be part of the FT style, I respected it.


Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to conduct a more nuanced analysis of group differences.

***

The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry are shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap, but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.

This is the histogram for arts, entertainment and recreation.

Bloomberg_genderpaygap_arts

The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation, which pays the average female about 40% more than the average male.

This is the histogram for the public sector.

Bloomberg_genderpaygap_public

It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.

***

The second part of the visualization looks at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, which frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)

Bloomberg_genderpaygap_public_pie

On the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.
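
The calculation the reader has to do in their head can be sketched in a few lines of Python. This is only an illustration: the function names are mine, and the four shares are hypothetical values, not the Bloomberg data. It relies on the four quartiles being equal-sized, so the overall share is the plain average of the quartile shares.

```python
from statistics import mean

def overall_share(quartile_shares):
    """Overall female share, given the female share in each of
    four equal-sized pay quartiles."""
    return mean(quartile_shares)

def representation_gap(quartile_shares):
    """Each quartile's female share minus the overall female share.
    Negative means under-represented, positive means over-represented."""
    overall = overall_share(quartile_shares)
    return [share - overall for share in quartile_shares]

# Hypothetical shares, ordered from the top 25% earners to the bottom 25%.
example = [0.60, 0.70, 0.70, 0.80]

print(overall_share(example))
print(representation_gap(example))
```

With these made-up numbers, 60% women among the top earners is under-representation, because the overall share works out to 70%.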

***

In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.

Redo_junkcharts_bloombergukgendergap

The net sum of the bar lengths is a plausible measure of gender inequality.

The industries are sorted from the ones employing the fewest women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart (the small multiples of pie charts), the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.
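
Here is a rough sketch of that bookkeeping, under my own assumptions about how the bars are measured (top bar = overall share minus top-quartile share, bottom bar = bottom-quartile share minus overall share). The function names and the example shares are hypothetical.

```python
def deviations(quartile_shares):
    """Deviation of each quartile's female share from the overall share.
    The quartiles are equal-sized, so the overall share is their average."""
    overall = sum(quartile_shares) / len(quartile_shares)
    return [share - overall for share in quartile_shares]

def bar_lengths(quartile_shares):
    """Return (top_bar, bottom_bar) for shares ordered top to bottom."""
    devs = deviations(quartile_shares)
    top_bar = -devs[0]      # under-representation among top earners
    bottom_bar = devs[-1]   # over-representation among bottom earners
    return top_bar, bottom_bar

# Hypothetical shares, ordered from the top 25% earners to the bottom 25%.
example = [0.40, 0.50, 0.55, 0.65]

top_bar, bottom_bar = bar_lengths(example)
print(top_bar + bottom_bar)       # combined bar length
print(sum(deviations(example)))   # zero, up to floating-point rounding
```

Because the deviations always sum to zero, whatever the top and bottom quartiles do not account for must show up in the middle two. Under this reading, the combined bar length is simply the bottom-quartile share minus the top-quartile share.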

***

The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.

 

 


An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 

Pew_collegeadmissions

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

Pew_collegeadmissions_growth

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.
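
The mismatch is easy to check with the two lines I drew in. The sketch below uses those two pairs of values; the function names are mine. The last function goes beyond what the Pew chart does: it measures the rise on a log scale, which is one standard way to make equal growth rates produce equal slopes.

```python
import math

def growth_rate(start, end):
    """Relative change, e.g. 1.0 means 100% growth."""
    return (end - start) / start

def linear_rise(start, end):
    """Vertical distance covered on an axis of absolute numbers."""
    return end - start

def log_rise(start, end):
    """Vertical distance covered on a log10 axis (my addition, not Pew's)."""
    return math.log10(end) - math.log10(start)

pairs = [(500_000, 1_000_000), (1_000_000, 2_000_000)]
for start, end in pairs:
    print(growth_rate(start, end), linear_rise(start, end), log_rise(start, end))
```

Both pairs have the same growth rate, but over the same time span the second line rises twice as far on the absolute axis, so it is twice as steep.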

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.
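
A minimal sketch of the metric follows. The bucket labels and totals are made up, chosen only so that the ratios land on the 20 and 4 quoted above; they are not the Pew figures.

```python
def applicants_per_enrolled(applicants, enrolled):
    """Applicants per enrolled student for one bucket of colleges."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    return applicants / enrolled

# Hypothetical bucket totals: (total applicants, total enrolled students).
buckets = {
    "most selective (few colleges)": (200_000, 10_000),
    "least selective (many colleges)": (2_000_000, 500_000),
}

ratios = {
    name: applicants_per_enrolled(applicants, enrolled)
    for name, (applicants, enrolled) in buckets.items()
}
print(ratios)
```

The raw totals mostly reflect how many colleges sit in each bucket. Dividing by enrollment removes that dependence, so the buckets can be compared directly.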

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Redo_pewcollegeadmissions

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. The top four most selective groups of colleges have been able to attract progressively more applicants. Since class sizes did not expand appreciably, more applicants result in ever-lower admit rates. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.


NYT hits the trifecta with this market correction chart

Yesterday, on the front page of the Business section, the New York Times published a pair of charts that perfectly captures the story of the ongoing turbulence in the stock market.

Here is the first chart:

Nyt_marketcorrection_1

Most market observers are very concerned about the S&P entering "correction" territory, which the industry arbitrarily defines as a drop of 10% or more from a peak. This corresponds to the shortest line on the above chart.

The chart promotes a longer-term reflection on the recent turbulence, using two reference points: the index has returned to where it stood at the start of 2018, and it is about 16 percent higher than at the beginning of 2017.

This is all done tastefully in a clear, understandable graphic.

Then, in a bit of a rhetorical flourish, the bottom of the page makes another point:

Myt_marketcorrection2

Looking back over a 10-year period, this chart shows that the S&P has exploded by 300% since 2009.

A connection is made between the two charts via the color of the lines, plus the simple, effective annotation "Chart above".

The second chart adds even more context, through vertical bands indicating previous corrections (drops of at least 10%). These moments are connected to the first graphic via the beige color. The extra material conveys the message that the market has survived multiple corrections during this long bull period.

Together, the pair of charts addresses a pressing current issue, and presents a direct, insightful answer in a simple, effective visual design, so it hits the Trifecta!

***

There are a couple of interesting challenges related to connecting plots within a multiple-plot framework.

While the beige color connects the concept of "market correction" in the top and bottom charts, it can also be a source of confusion. The orientation and the visual interpretation of those bands differ. The first chart uses one horizontal band while the chart below shows multiple vertical bands. In the first chart, the horizontal band refers to a definition of correction while in the second chart, the vertical bands indicate experienced corrections.

Is there a solution in which the bands have the same orientation and same meaning?

***

These graphs address a problem in visualizing growth over time. Growth rates are anchored to some starting time. A ten percent reduction means nothing unless you are told ten percent of what.

Using different starting times as reference points, one gets different values of growth rates. With highly variable series of data like stock prices, picking starting times even a day apart can lead to vastly different growth rates.

The designer here picked several obvious reference times, and superimposed multiple lines on the same plotting canvas. Instead of having four lines on one chart, we have three lines on one, and four lines on the other. This limits the number of messages per chart, which speeds up cognition.

The first chart depicts this visual challenge well. Look at the start of 2018. It appears as if you could produce the second line by resetting the start point to 0 and dragging the remaining portion of the top line down. The part of the top line to the right of Jan 2018 looks just like the second line, which starts at Jan 2018.

Jc_marketcorrection1

However, a closer look reveals that the shape may be the same but the magnitude isn't. There is a subtle re-scaling in addition to the re-set to zero.

The same thing happens at the starting moment of the third line. You can't just drag the portion of the first or second line down - there is also a needed re-scaling.
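
The re-scaling can be seen with a toy series. The index levels below are hypothetical, and `rebase` is my own helper, not anything from the NYT graphic. Growth measured from a later start is the level divided by the new base, minus one. Subtracting the old growth rates only shifts the line down; the result still has to be divided by one plus the growth up to the new base.

```python
def rebase(levels, base_index):
    """Cumulative growth of each later level relative to levels[base_index]."""
    base = levels[base_index]
    return [level / base - 1 for level in levels[base_index:]]

# Hypothetical index levels at three points in time.
levels = [100, 120, 132]

from_start = rebase(levels, 0)    # growth measured from the first point
from_middle = rebase(levels, 1)   # growth measured from the second point

# Dragging the first line down so it passes through 0 at the second point:
shifted_only = from_start[2] - from_start[1]

# The shift alone overstates the growth; it must also be re-scaled.
rescaled = shifted_only / (1 + from_start[1])

print(from_middle[-1], shifted_only, rescaled)
```

In this toy case the shifted line ends near 12% while the properly rebased line ends near 10%. The shape is the same but the magnitude is not.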


A second take on the rural-urban election chart

Yesterday, I looked at the following pictograms used by Business Insider in an article about the rural-urban divide in American politics:

Businessinsider_ruraldistricts

The layout of this diagram suggests that the comparison of 2010 to 2018 is a key purpose.

The following alternative directly plots the change between 2010 and 2018, reducing the number of plots from 4 to 2.

Redo_jc_businessinsider_ruraldistricts_2

The 2018 results are emphasized. Then, for each party, there can be a net add or loss of seats.

The key trends are:

  • a net loss in seats in "Pure rural" districts, split by party;
  • a net gain of 3 seats in "rural-suburban" districts;
  • a loss of 10 Democratic seats balanced by a gain of 13 Republican seats.

 


The French take back cinema but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than French movies, the gap has been narrowing in the last two decades. How do I know? It's the subject of this infographic:

DataCinema

How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:

Datacinema_top

 

You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an unusual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once at the center, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the area of the donut slices, except that the areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths causes the ratios to shift from the correct ones!
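
To see how large the distortion can get, here is a small sketch using the area of a ring segment: half the angle times the difference of the squared outer and inner radii. The shares and radii are hypothetical, not measured from the infographic.

```python
import math

def slice_area(share, inner_radius, outer_radius):
    """Area of a donut slice covering `share` of the full circle."""
    angle = 2 * math.pi * share
    return 0.5 * angle * (outer_radius**2 - inner_radius**2)

# Two hypothetical slices: one is twice the share of the other.
big_share, small_share = 0.50, 0.25

# Same ring width for both slices: the area ratio matches the data ratio.
ratio_equal_width = slice_area(big_share, 1.0, 2.0) / slice_area(small_share, 1.0, 2.0)

# Narrower ring for the smaller slice: the area ratio is inflated.
ratio_varied_width = slice_area(big_share, 1.0, 2.0) / slice_area(small_share, 1.0, 1.5)

print(ratio_equal_width, ratio_varied_width)
```

With equal widths the area ratio is 2, matching the data. Shrinking the smaller slice's ring pushes the ratio to almost 5, even though the data have not changed.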

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.

***

The one donut is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change of U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange donut slices from one pie chart to another. 

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.

Redo_junkcharts_frenchmovies

A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.

 

P.S.

1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks to Carla Paquet at JMP for help translating from French.


No Latin honors for graphic design

Paw_honors_2018

This chart appeared in a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton, 47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported a GPA of 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-. Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.

***

Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.

***

Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percentage point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.

Redo_paw_honors_2018

The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100-dot chart is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.

Redo_paw_honors_2

On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.
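
The grouping rule on the left chart is the same greedy split one uses when making change. A short sketch follows; the function name is mine, and the counts come from the three sums just quoted.

```python
def group_counts(n):
    """Split n dots into groups of ten, groups of five, and singletons,
    always using the largest group first."""
    tens, rest = divmod(n, 10)
    fives, ones = divmod(rest, 5)
    return tens, fives, ones

honors = {"Summa": 13, "Magna": 16, "Cum Laude": 19}
for name, dots in honors.items():
    print(name, group_counts(dots))
# Summa (1, 0, 3)
# Magna (1, 1, 1)
# Cum Laude (1, 1, 4)
```

Each tuple reads as (10-groups, 5-groups, singletons), so no category ever leaves more than four loose dots to count.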

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

I also exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader figures out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).

***

Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost evenly into thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of counting!


Made in France stereotypes

France is on my mind lately, as I prepare to bring my dataviz seminar to Lyon in a couple of weeks.  (You can still register for the free seminar here.)

The following Made in France poster brings out all the stereotypes of the French.

Made_in_france_small

(You can download the original PDF here.)

It's a sankey diagram with so many flows that it screams "it's complicated!" This is an example of a graphic for want of a story. In a Trifecta Checkup, it's failing in the Q(uestion) corner.

It's also failing in the D(ata) corner. Take a look at the top of the chart.

Madeinfrance_totalexports

France exported $572 billion worth of goods. The diagram then plots eight categories of exports, ranging from wines to cheeses:

Madeinfrance_exportcategories

Wine exports totaled $9 billion which is about 1.6% of total exports. That's the largest category of the eight shown on the page. Clearly the vast majority of exports are excluded from the sankey diagram.

Are the 8 the largest categories of exports for France? According to this site, those are (1) machinery (2) aircraft (3) vehicles (4) electrical machinery (5) pharmaceuticals (6) plastics (7) beverages, spirits, vinegar (8) perfumes, cosmetics.

Compare: (1) wines (2) jewellery (3) perfume (4) clothing (5) cheese (6) baked goods (7) chocolate (8) paintings.

It's stereotype central. Name 8 things associated with the French brand and cherry-pick those.

Within each category, the diagram does not show all of the exports either. It discloses that the bars for wines show only $7 billion of the $9 billion worth of wines exported. This is because the data only capture the "Top 10 Importers." (See below for why the designer did this... France exports wine to more than 180 countries.)

Finally, look at the parade of key importers of French products, as shown at the bottom of the sankey:

Madeinfrance_topimporters

The problem with interpreting this list of countries is best felt by attempting to describe which countries ended up on this list! It's the list of countries that belong to the top 10 importers of one or more of the eight chosen products, ordered by the total value of imports in those 8 categories only but only including the value in any category if it rises to the top 10 of the respective category.

In short, with all those qualifications, the size or rank of the black bars does not convey any useful information.

***

One feature of the chart that surprised me was the absence of flows in the Wine category from France to Italy or Spain. (Based on the above discussion, you should realize that no flows does not mean no exports.) So I went to the Comtrade database that is referenced in the poster, and pulled out all the wine export data.

How does one visualize where French wines are going? After fiddling around with the numbers, I came up with the following diagram:

Redo_jc_frenchwineexports

I like this type of block diagram which brings out the structure of the dataset. The key features are:

  • The total wine exports to the rest of the world were 1.4 billion litres in 2016
  • Half of it went to five European neighbors, the other half to the rest of the world
  • On the left half, Germany took a third of those exports; the UK and Switzerland together took another third; and the final third went to Belgium and the Netherlands
  • On the right half, the countries in the blue zone accounted for three-fifths with the unspecified countries taking two-fifths.
  • As indicated, the two-fifths (in gray) represent 20% of total wine exports, and were spread out among over 180 countries.
  • The three-fifths of the blue zone were split in half, with the first half going to North America (about 2/3 to USA and 1/3 to Canada) and the second half going to Asia (2/3 to China and 1/3 to Japan)
  • As the title indicates, the top 9 importers of French wine covered 80% of the total volume (in litres) while the other 180+ countries took 20% of the volume
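
The nested fractions in the list above can be multiplied out to confirm that they add up. This is only a check on the rounded fractions quoted in the bullets, not on the underlying Comtrade data, and the dictionary labels are mine.

```python
from fractions import Fraction as F

europe = F(1, 2)                    # five European neighbors
elsewhere = F(1, 2)                 # everyone else
blue_zone = elsewhere * F(3, 5)     # named non-European importers
north_america = blue_zone * F(1, 2)
asia = blue_zone * F(1, 2)

shares = {
    "Germany": europe * F(1, 3),
    "UK + Switzerland": europe * F(1, 3),
    "Belgium + Netherlands": europe * F(1, 3),
    "USA": north_america * F(2, 3),
    "Canada": north_america * F(1, 3),
    "China": asia * F(2, 3),
    "Japan": asia * F(1, 3),
    "Other 180+ countries": elsewhere * F(2, 5),
}

top_nine = sum(shares.values()) - shares["Other 180+ countries"]

for name, share in shares.items():
    print(f"{name}: {float(share):.0%}")
print(f"Top 9 importers: {float(top_nine):.0%}")
```

Using exact fractions avoids rounding drift. The pieces sum to one, the gray zone comes to one-fifth, and the nine named importers to four-fifths, matching the 20% and 80% in the list.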

The most time-consuming part of this exercise was finding an appropriate structure that can be easily explained in a visual manner.


Finding simple ways to explain complicated data and concepts, using some Pew data

A reader submitted the following chart from Pew Research for discussion.

Pew_ST-2014-09-24-never-married-08

The reader complained that this chart was difficult to comprehend. What are some of the reasons?

The use of color is superfluous. Each line is a "cohort" of people being tracked over time. Each cohort is given its own color or hue. But the color or hue does not signify much.

The dotted lines. This design element requires a footnote to explain. The reader learns that some of the numbers on the chart are projections because those numbers pertain to time well into the future. The chart was published in 2014, using historical data so any numbers dated 2014 or after (and even some data before 2014) will be projections. The data are in fact encoded in the dots, not the slopes. Look at the cohort that has one solid line segment and one dotted line segment - it's unclear which of those three data points are projections, and which are experienced.

The focus on within-cohort trends. The line segments indicate the desire of the designer to emphasize trends within each cohort. However, it's not clear what the underlying message is. It may be that more and more people are not getting married (i.e. fewer people are getting married). That trend affects each of the three age groups - and it's easier to paint that message by focusing on between-cohort trends.

***

Here is a chart that emphasizes the between-cohort trends.

Redo_jc_pewmarriagebyage

A key decision is to not mix oil and water. The within-cohort analysis is presented in its own chart, next to the between-cohort analysis. It turns out that some of the gap between cohorts can be explained by people deferring marriage to later in life. The steep line on the right indicates that a bigger proportion of people now get married between 35 and 44 than in previous cohorts.

I experimented a bit with the axes here. Several pie charts are used in lieu of axis labels. I also plotted a dual axis with the proportion of unmarried on one side, and the corresponding proportion of married on the other side.