Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to take a more nuanced analysis of group differences.

***

The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry is shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.

This is the histogram for arts, entertainment and recreation.

Bloomberg_genderpaygap_arts

The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation that pays the average female about 40% more than the average male.

This is the histogram for the public sector.

Bloomberg_genderpaygap_public
It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.

***

The second part of the visualization look at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, that frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)

Bloomberg_genderpaygap_public_pieOn the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.

***

In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.

Redo_junkcharts_bloombergukgendergap

The net sum of the bar lengths is a plausible measure of gender inequality.

The industries are sorted from the ones employing fewer women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart - the small multiples of pie charts, the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.

***

The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.

 

 


An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 

Pew_collegeadmissions

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

Pew_collegeadmissions_growthThe vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Redo_pewcollegeadmissions

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. For the top four most selective groups of colleges, they have been able to progressively attract more applicants. Since class size did not expand appreciably, more applicants result in ever-lower admit rate. Lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses admit rate. 

 

 

 


Visualizing the 80/20 rule, with the bar-density plot

Through Twitter, Danny H. submitted the following chart that shows a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.

Dannyheiman_twitter_youtubedata

In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.

I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slows down cognition and adds hardly to our understanding.

I like the chart title that explains what it is about.

Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.

***

My initial reaction on Twitter is to use a mirrored bar chart, like this:

Jc_redo_youtube_mirrorbar_lobsided

I ended up spending quite a bit of time exploring other concepts. In particular, I like to find an integrated way to present this information. Most charts, such as the mirrored bar chart, a Bumps chart (slopegraph), and Lorenz chart, keep the two series of percentages separate.

Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers while the top creators ("super-creators") are cramped inside a slither of a bar, which is invisible in the original chart.

What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.

Jc_redo_youtube_bar_h_2col

Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.

The embedded tessellation shows the structure of the data: the bottom third of the views are generated by a huge number of creators, producing a few views each - resulting in a high density. The top 38% of the views correspond to a small number of super-creators - appropriately shown by a bar of low density.

For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)

Here is what the bar-density plot looks like when the distribution is essentially uniform:

Jc_redo_youtube_bar_even
The density inside each bar is roughly the same, indicating that the creators are roughly equally important.

 

P.S.

1) The next post on the bar-density plot, with some experimental R code, will be available here.

2) Check out my first Linkedin "article" on this topic.

 

 

 

 

 


Labels, scales, controls, aggregation all in play

JB @barclaysdevries sent me the following BBC production over Twitter.

Johnbennett_barclaysdevries_bbc_chinagrowth

He was not amused.

This chart pushes a number of my hot buttons.

First, I like to assume that readers don't need to be taught that 2007 and 2018 are examples of "Year".

Second, starting an area chart away from zero is equally as bad as starting a bar chart not at zero! The area is distorted and does not reflect the relative values of the data.

Third, I suspect the 2007 high point is a local peak, which they chose in order to forward a sky-is-falling narrative related to China's growth.

So I went to a search engine and looked up China's growth rate, and it helpfully automatically generated the following chart:

Google_chinagrowth

Just wow! This chart does a number of things right.

First, it confirms my hunch above. 2007 is a clear local peak and it is concerning that the designer chose that as a starting point.

Second, this chart understands that the zero-growth line has special meaning.

Third, there are more year labels.

Fourth, and very importantly, the chart offers two "controls". We can look at China's growth relative to India's and relative to the U.S.'s. Those two other lines bring context.

JB's biggest complaint is that the downward-sloping line confuses the issue, which is that slowing growth is still growth. The following chart conveys a completely different message but the underlying raw data are the same:

Redo_chinagdpgrowth

 


The French takes back cinema but can you see it?

I like independent cinema, and here are three French films that come to mind as I write this post: Delicatessen, The Class (Entre les murs), and 8 Women (8 femmes). 

The French people are taking back cinema. Even though they purchased more tickets to U.S. movies than French movies, the gap has been narrowing in the last two decades. How do I know? It's the subject of this infographic

DataCinema

How do I know? That's not easy to say, given how complicated this infographic is. Here is a zoomed-in view of the top of the chart:

Datacinema_top

 

You've got the slice of orange, which doubles as the imagery of a film roll. The chart uses five legend items to explain the two layers of data. The solid donut chart presents the mix of ticket sales by country of origin, comparing U.S. movies, French movies, and "others". Then, there are two thin arcs showing the mix of movies by country of origin. 

The donut chart has an usual feature. Typically, the data are coded in the angles at the donut's center. Here, the data are coded twice: once at the center, and again in the width of the ring. This is a self-defeating feature because it draws even more attention to the area of the donut slices except that the areas are highly distorted. If the ratios of the areas are accurate when all three pieces have the same width, then varying those widths causes the ratios to shift from the correct ones!

The best thing about this chart is found in the little blue star, which adds context to the statistics. The 61% number is unusually high, which demands an explanation. The designer tells us it's due to the popularity of The Lion King.

***

The one donut is for the year 1994. The infographic actually shows an entire time series from 1994 to 2014.

The design is most unusual. The years 1994, 1999, 2004, 2009, 2014 receive special attention. The in-between years are split into two pairs, shrunk, and placed alternately to the right and left of the highlighted years. So your eyes are asked to zig-zag down the page in order to understand the trend. 

To see the change of U.S. movie ticket sales over time, you have to estimate the sizes of the red-orange donut slices from one pie chart to another. 

Here is an alternative visual design that brings out the two messages in this data: that French movie-goers are increasingly preferring French movies, and that U.S. movies no longer account for the majority of ticket sales.

Redo_junkcharts_frenchmovies

A long-term linear trend exists for both U.S. and French ticket sales. The "outlier" values are highlighted and explained by the blockbuster that drove them.

 

P.S.

1. You can register for the free seminar in Lyon here. To register for live streaming, go here.
2. Thanks Carla Paquet at JMP for help translating from French.


No Latin honors for graphic design

Paw_honors_2018This chart appeared on a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton,  47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported GPA to be 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-.  Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.

***

Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.

***

Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percent point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.

Redo_paw_honors_2018

The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100 dots is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.

Redo_paw_honors_2

On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

Further, I exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader figured out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).

***

Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost even by thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of the counting!

 

 

 

 

 


Finding simple ways to explain complicated data and concepts, using some Pew data

A reader submitted the following chart from Pew Research for discussion.

Pew_ST-2014-09-24-never-married-08

The reader complained that this chart was difficult to comprehend. What are some of the reasons?

The use of color is superfluous. Each line is a "cohort" of people being tracked over time. Each cohort is given its own color or hue. But the color or hue does not signify much.

The dotted lines. This design element requires a footnote to explain. The reader learns that some of the numbers on the chart are projections because those numbers pertain to time well into the future. The chart was published in 2014, using historical data so any numbers dated 2014 or after (and even some data before 2014) will be projections. The data are in fact encoded in the dots, not the slopes. Look at the cohort that has one solid line segment and one dotted line segment - it's unclear which of those three data points are projections, and which are experienced.

The focus on within-cohort trends. The line segments indicate the desire of the designer to emphasize trends within each cohort. However, it's not clear what the underlying message is. It may be that more and more people are not getting married (i.e. fewer people are getting married). That trend affects each of the three age groups - and it's easier to paint that message by focusing on between-cohort trends.

***
Here is a chart that emphasizes the between-cohort trends.

Redo_jc_pewmarriagebyage

A key decision is to not mix oil and water. The within-cohort analysis is presented in its own chart, next to the between-cohort analysis. It turns out that some of the gap between cohorts can be explained by people deferring marriage to later in life. The steep line on the right indicates that a bigger proportion of people now gets married between 35 and 44 than in previous cohorts.

I experimented a bit with the axes here. Several pie charts are used in lieu of axis labels. I also plotted a dual axis with the proportion of unmarried on the one side, and the corresponding proportion of married on the other side.


Education deserts: places without schools still serve pies and story time

I very much enjoyed reading The Chronicle's article on "education deserts" in the U.S., defined as places where there are no public colleges within reach of potential students.

In particular, the data visualization deployed to illustrate the story is superb. For example, this map shows 1,500 colleges and their "catchment areas" defined as places within 60 minutes' drive.

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 2

It does a great job walking through the logic of the analysis (even if the logic may not totally convince - more below). The areas not within reach of these 1,500 colleges are labeled "deserts". They then take Census data and look at the adult population in those deserts:

Screenshot-2018-8-22 Who Lives in Education Deserts More People Than You Might Think 4

This leads to an analysis of the racial composition of the people living in these "deserts". We now arrive at the only chart in the sequence that disappoints. It is a pair of pie charts:

Chronicle_edudesserts_pie

 The color scheme makes it hard to pair up the pie slices. The focus of the chart should be on the over or under representation of races in education deserts relative to the U.S. average. The challenge of this dataset is the coexistence of one large number, and many small numbers.

Here is one solution:

Redo_jc_chronedudesserts

***

The Chronicle made a commendable effort to describe this social issue. But the analysis has a lot of built-in assumptions. Readers should look at the following list and see if you agree with the assumptions:

  • Only public colleges are considered. This restriction requires the assumption that the private colleges pretty much serve the same areas as public colleges.
  • Only non-competitive colleges are included. Precisely, the acceptance rate must be higher than 30 percent. The underlying assumption is that the "local students" won't be interested in selective colleges. It's not clear how the 30 percent threshold was decided.
  • Colleges that are more than 60 minutes' driving distance away are considered unreachable. So the assumption is that "local students" are unwilling to drive more than 60 minutes to attend college. This raises a couple other questions: are we only looking at commuter colleges with no dormitories? Is the 60 minutes driving distance based on actual roads and traffic speeds, or some kind of simple model with stylized geometries and fixed speeds?
  • The demographic analysis is based on all adults living in the Census "blocks" that are not within 60 minutes' drive of one of those colleges. But if we are calling them "education deserts" focusing on the availability of colleges, why consider all adults, and not just adults in the college age group? One further hidden assumption here is that the lack of colleges in those regions has not caused young generations to move to areas closer to colleges. I think a map of the age distribution in the "education deserts" will be quite telling.
  • Not surprisingly, the areas classified as "education deserts" lag the rest of the nation on several key socio-economic metrics, like median income, and proportion living under the poverty line. This means those same areas could be labeled income deserts, or job deserts.

At the end of the piece, the author creates a "story time" moment. Story time is when you are served a bunch of data or analyses, and then when you are about to doze off, the analyst calls story time, and starts making conclusions that stray from the data just served!

Story time starts with the following sentence: "What would it take to make sure that distance doesn’t prevent students from obtaining a college degree? "

The analysis provided has nowhere shown that distance has prevented students from obtaining a college degree. We haven't seen anything that says that people living in the "education deserts" have fewer college degrees. We don't know that distance is the reason why people in those areas don't go to college (if true) - what about poverty? We don't know if 60 minutes is the hurdle that causes people not to go to college (if true).We know the number of adults living in those neighborhoods but not the number of potential students.

The data only showed two things: 1) which areas of the country are not within 60 minutes' driving of the subset of public colleges under consideration, 2) the number of adults living in those Census blocks.

***

So we have a case where the analysis is incomplete but the visualization of the analysis is superb. So in our Trifecta analysis, this chart poses a nice question and has nice graphics but the use of data can be improved. (Type QV)

 

 

 


Environmental science can use better graphics

Mike A. pointed me to two animated maps made by Caltech researchers published in LiveScience (here).

The first map animation shows the rise and fall of water levels in a part of California over time. It's an impressive feat of stitching together satellite images. Click here to play the video.

Caltech_groundwater_map1

The animation grabs your attention. I'm not convinced by the right side of the color scale in which the white comes after the red. I'd want the white in the middle then the yellow and finally the red.

In order to understand this map and the other map in the article, the reader has to bring a lot of domain knowledge. This visualization isn't easy to decipher for a layperson.

Here I put the two animations side by side:

Caltech_groundwater_side

The area being depicted is the same. One map shows "ground deformation" while the other shows "subsidence". Are they the same? What's the connection between the two concepts (if any)?  On a further look, one notices that the time window for the two charts differ: the right map is clearly labeled 1995 to 2003 but there is no corresponding label on the left map. To find the time window of the left map, the reader must inspect the little graph on the top right (1996 to 2000).

This means the time window of the left map is a subset of the time window of the right map. The left map shows a sinusoidal curve that moves up and down rhythmically as the ground shifts. How should I interpret the right map? The periodicity is no longer there despite this map illustrating a longer time window. The scale on the right map is twice the magnitude of the left map. Maybe on average the ground level is collapsing? If that were true, shouldn't the sinusoidal curve drift downward over time?

Caltech_groundwater_sineThe chart on the top right of the left map is a bit ugly. The year labels are given in decimals e.g. 1997.5. In R, this can be fixed by customizing the axis labels.

I also wonder how this curve is related to the map it accompanies. The curve looks like a model - perfect oscillations of a fixed period and amplitude. But one suppose the amount of fluctuation should vary by location, based on geographical features and human activities.

The author of the article points to both natural and human impacts on the ground level. Humans affect this by water usage and also by management policies dictated by law. It would be very helpful to have a map that sheds light on the causes of the movements.


Graphical advice for conference presenters - demo

Yesterday, I pulled this graphic from a journal paper, and said one should not copy and paste this into an oral presentation.

Example_presentation_graphic

So I went ahead and did some cosmetic surgery on this chart.

Redo_example_conference_graphic

I don't know anything about the underlying science. I'm just interpreting what I see on the chart. It seems like the key message is that the Flowering condition is different from the other three. There are no statistical differences between the three boxplots in the first three panels but there is a big difference between the red-green and the purple in the last panel. Further, this difference can be traced to the red-green boxplots exhibiting negative correlation under the Flowering condition - while the purple boxplot is the same under all four conditions.

I would also have chosen different colors, e.g. make red-green two shades of gray to indicate that these two things can be treated as the same under this chart. Doing this would obviate the need to introduce the orange color.

Further, I think it might be interesting to see the plots split differently: try having the red-green boxplots side by side in one panel, and the purple boxplots in another panel.

If the presentation software has animation, the presenter can show the different text blocks and related materials one at a time. That also aids comprehension.

***

Note that the plot is designed for an oral presentation in which you have a minute or two to get the message across. It's debatable as to whether journal editors should accept this style for publications. I actually think such a style would improve reading comprehension but I surmise some of you will disagree.