More visuals of the economic crisis

As we move into the next phase of the dataviz bonanza arising from the coronavirus pandemic, we will see a shift from simple descriptive graphics of infections and deaths to bivariate explanatory graphics claiming (usually spurious) correlations.

The FT is leading the way with this effort, and I hope all those who follow will make a note of several wise decisions they made.

  • They source their data. Most of the data about business activities come from private entities, many of which are data vendors who make money selling the data. In this article, FT got restaurant data from OpenTable, retail foot traffic data from Springboard, box office data from Box Office Mojo, flight data from Flightradar24, road traffic data from TomTom, and energy use data from European Network of Transmission System Operators for Electricity.
  • They generally let the data and charts speak without "story time". The text primarily describes the trends of the various data series.
  • They selected sectors that are obviously impacted by the shutdowns so any link between the observed trends and the crisis is plausible.

The FT charts are examples of clarity. Here is the one about road traffic patterns in major cities:

Ft_roadusage_corona_wrongsource

The cities are organized into regions: Europe, US, China, other Asia.

The key comparison is the last seven days versus the historical averages. The stories practically jump out of the page. Traffic in Paris collapsed on Tuesday. Wuhan is still locked down despite the falloff in infections. Drivers of Tokyo are probably wondering why teams are not going to the Olympics this year. Londoners? My guess is they're determined to not let another Brexit deadline slip.

***

I'd hope we go even further than FT when publishing this type of visual analytics involving "Big Data." These business data obtained from private sources typically have OCCAM properties: they are observational, seemingly complete, uncontrolled, adapted and merged. All these properties make the data very challenging to interpret.

The coronavirus case and death counts are simple by comparison. People are now aware of all the problems from differential rates of testing to which groups are selectively tested (i.e. triage) to how an infection or death is defined. The problems involving Big Data are much more complex.

I have three additional proposals:

Disclosure of Biases and Limitations

The private data have many more potential pitfalls. Take OpenTable data for example. The data measure restaurant bookings, not revenues. It measures gross bookings, not net bookings (i.e. removing no-shows). Only a proportion of restaurants use OpenTable (which cost owners money). OpenTable does not strike me as a quasi-monopoly so there are competitors with significant market share. The restaurants that use OpenTable do not form a random subsample of all restaurants. One of the most popular restaurants in the U.S. are pizza joints, with little of no seating, which do not feature in the bookings data. OpenTable also has differential popularity by country, region, or probably cuisine. 

I believe data journalists ought to provide such context in a footnote. Readers should have the information to judge whether they believe the data are sufficiently representative. Private data vendors who want data journalists to feature their datasets should be required to supply a footnote that describes the biases and limitations of their data.

Data journalists should think seriously about how they headline this type of chart. The standard practice is what FT adopted. The headline said "Restaurant bookings have collapsed" with a small footnote saying "Source: OpenTable". Should the headline have said "OpenTable bookings have collapsed" instead?

Disclosure of Definitions and Proxies

In the road traffic chart shown above, the metric is called "TomTom traffic congestion index". In order to earn this free advertising (euphemistically called "earned media" by industry), TomTom should be obliged to explain how this index is constructed. What does index = 100 mean?

[For example, it is curious that the Madrid index values are much lower across the board than those in Paris and Roma.]

For the electric usage chart, FT discloses the name of the data provider as a group of "43 electricity transmission system operators in 36 countries across Europe." Now, that is important context but can be better. The group may consist of 43 operators but how many of them are in the dataset? What proportion of the total electric usage do they account for in each country? If they have low penetration in a particular country, do they just report the low statistics or adjust the numbres?

If the journalist decides to use a proxy, for example, OpenTable restaurant bookings to reflect restaurant revenues, that should be explained, perhaps even justified.

Data as a Public Good

If private businesses choose to supply data to media outlets as a public service, they should allow the underlying data to be published.

Speaking from experience, I've seen a lot of bad data. It's one thing to hold your nose when the data are analyzed to make online advertising more profitable, or to find signals to profit from the stock market. It's another thing for the data analysis to drive public policy, in this case, policies that will have life-or-death implications.


Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)

Wsj_roundup_img1


The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.

***

In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.

Wsj_roundup_img2

Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).

Redo_roundupwsj0

The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design perogative and does not encode data.  

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.

***

The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.

Redo_dicambacomplaints1

Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.

Redo_dicambacomplaints2

Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.


This chart tells you how rich is rich - if you can read it

Via twitter, John B. sent me the following YouGov chart (link) that he finds difficult to read:

Yougov_whoisrich

The title is clear enough: the higher your income, the higher you set the bar.

When one then moves from the title to the chart, one gets misdirected. The horizontal axis shows pound values, so the axis naturally maps to "the higher your income". But it doesn't. Those pound values are the "cutoff" values - the line between "rich" and "not rich". Even after one realizes this detail, the axis  presents further challenges: the cutoff values are arbitrary numbers such as "45,001" sterling; and these continuous numbers are treated as discrete categories, with irregular intervals between each category.

There is some very interesting and hard to obtain data sitting behind this chart but the visual form suppresses them. The best way to understand this dataset is to first think about each income group. Say, people who make between 20 to 30 thousand sterling a year. Roughly 10% of these people think "rich" starts at 25,000. Forty percent of this income group think "rich" start at 40,000.

For each income group, we have data on Z percent think "rich" starts at X. I put all of these data points into a heatmap, like this:

Redo_junkcharts_yougovuk_whoisrich

Technical note: in order to restore the horizontal axis to a continuous scale, you can take the discrete data from the original chart, then fit a smoothed curve through those points, and finally compute the interpolated values for any income level using the smoothing model.

***

There are some concerns about the survey design. It's hard to get enough samples for higher-income people. This is probably why the highest income segment starts at 50,000. But notice that 50,ooo is around the level at which lower-income people consider "rich". So, this survey is primarily about how low-income people perceive "rich" people.

The curve for the highest income group is much straighter and smoother than the other lines - that's because it's really the average of a number of curves (for each 10,000 sterling segment).

 

P.S. The YouGov tweet that publicized the small-multiples chart shown above links to a page that no longer contains the chart. They may have replaced it due to feedback.

 

 


Marketers want millennials to know they're millennials

When I posted about the lack of a standard definition of "millennials", Dean Eckles tweeted about the arbitrary division of age into generational categories. His view is further reinforced by the following chart, courtesy of PewResearch by way of MarketingCharts.com.

PewResearch-Generational-Identification-Sept2015

Pew asked people what generation they belong to. The amount of people who fail to place themselves in the right category is remarkable. One way to interpret this finding is that these are marketing categories created by the marketing profession. We learned in my other post that even people who use the term millennial do not have a consensus definition of it. Perhaps the 8 percent of "millennials" who identify as "boomers" are handing in a protest vote!

The chart is best read row by row - the use of stacked bar charts provides a clue. Forty percent of millennials identified as millennials, which leaves sixty percent identifying as some other generation (with about 5 percent indicating "other" responses). 

While this chart is not pretty, and may confuse some readers, it actually shows a healthy degree of analytical thinking. Arranging for the row-first interpretation is a good start. The designer also realizes the importance of the diagonal entries - what proportion of each generation self-identify as a member of that generation. Dotted borders are deployed to draw eyes to the diagonal.

***

The design doesn't do full justice for the analytical intelligence. Despite the use of the bar chart form, readers may be tempted to read column by column due to the color scheme. The chart doesn't have an easy column-by-column interpretation.

It's not obvious which axis has the true category and which, the self-identified category. The designer adds a hint in the sub-title to counteract this problem.

Finally, the dotted borders are no match for the differential colors. So a key message of the chart is buried.

Here is a revised chart, using a grouped bar chart format:

Redo_junkcharts_millennial_id

***

In a Trifecta checkup (link), the original chart is a Type V chart. It addresses a popular, pertinent question, and it shows mature analytical thinking but the visual design does not do full justice to the data story.

 

 


Where are the Democratic donors?

I like Alberto's discussion of the attractive maps about donors to Democratic presidential candidates, produced by the New York Times (direct link).

Here is the headline map:

Nyt_demdonormaps

The message is clear: Bernie Sanders is the only candidate with nation-wide appeal. The breadth of his coverage is breath-taking. (I agree with Alberto's critique about the lack of a color scale. It's impossible to know if the counts are trivial or not.)

Bernie's coverage is so broad that his numbers overwhelm those of all other candidates except in their home bases (e.g. O'Rourke in Texas).

A remedy to this is to look at the data after removing Bernie's numbers.

Nyt_demdonormap_2

 

This pair of maps reminds me of the Sri Lanka religions map that I revisualized in this post.

Redo_srilankareligiondistricts_v2

The first two maps divide the districts into those in which one religion dominates and those in which multiple religions share the limelight. The third map then shows the second-rank religion in the mixed-religions districts.

The second map in the NYT's donor map series plots the second-rank candidate in all the precincts that Bernie Sanders lead. It's like the designer pulled off the top layer (blue: Bernie) to reveal what's underneath.

Because all of Bernie's data are removed, O'Rourke is still dominating Texas, Buttigieg in Indiana, etc. An alternative is to pull off the top layer in those pockets as well. Then, it's likely to see Bernie showing up in those areas.

The other startling observation is how small Joe Biden's presence is on these maps. This is likely because Biden relies primarily on big donors.

See here for the entire series of donor maps. See here for past discussion of New York Times's graphics.


Tightening the bond between the message and the visual: hello stats-cats

The editors of ASA's Amstat News certainly got my attention, in a recent article on school counselling. A research team asked two questions. The first was HOW ARE YOU FELINE?

Stats and cats. The pun got my attention and presumably also made others stop and wonder. The second question was HOW DO YOU REMEMBER FEELING while you were taking a college statistics course? Well, it's hard to imagine the average response to that question would be positive.

What also drew me to the article was this pair of charts:

Counselors_Figure1small

Surely, ASA can do better. (I'm happy to volunteer my time!)

Rotate the chart, clean up the colors, remove the decimals, put the chart titles up top, etc.

***

The above remedies fall into the V corner of my Trifecta checkup.

Trifectacheckup_junkcharts_imageThe key to fixing this chart is to tighten the bond between the message and the visual. This means working that green link between the Q and V corners.

This much became clear after reading the article. The following paragraphs are central to the research (bolding is mine):

Responses indicated the majority of school counselors recalled experiences of studying statistics in college that they described with words associated with more unpleasant affect (i.e., alarm, anger, distress, fear, misery, gloom, depression, sadness, and tiredness; n = 93; 66%). By contrast, a majority of counselors reported same-day (i.e., current) emotions that appeared to be associated with more pleasant affect (i.e., pleasure, happiness, excitement, astonishment, sleepiness, satisfaction, and calm; n = 123; 88%).

Both recalled emotive experiences and current emotional states appeared approximately balanced on dimensions of arousal: recalled experiences associated with lower arousal (i.e., pleasure, misery, gloom, depression, sadness, tiredness, sleepiness, satisfaction, and calm, n = 65, 46%); recalled experiences associated with higher arousal (i.e., happiness, excitement, astonishment, alarm, anger, distress, fear, n = 70, 50%); current emotions associated with lower arousal (n = 60, 43%); current experiences associated with higher arousal (i.e., n = 79, 56%).

These paragraphs convey two crucial pieces of information: the structure of the analysis, and its insights.

The two survey questions measure two states of experiences, described as current versus recalled. Then the individual affects (of which there were 16 plus an option of "other") are scored on two dimensions, pleasure and arousal. Each affect maps to high or low pleasure, and separately to high or low arousal.

The research insight is that current experience was noticably higher than recalled experience on the pleasure dimension but both experiences were similar on the arousal dimension.

Any visualization of this research must bring out this insight.

***

Here is an attempt to illustrate those paragraphs:

Redo_junkcharts_amstat_feline

The primary conclusion can be read from the four simple pie charts in the middle of the page. The color scheme shines light on which affects are coded as high or low for each dimension. For example, "distressed" is scored as showing low pleasure and high arousal.

A successful data visualization for this situation has to bring out the conclusion drawn at the aggregated level, while explaining the connection between individual affects and their aggregates.


Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

Visualizingbaseball_coverThe best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one receives a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. A lesser number of pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has played out?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.

***

In the final chapters, Albert introduced the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation will have a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to make the prediction that the Red Sox has a say 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not the stated goal of Albert to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.

***

While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is first describe the analytical conclusion, then introduce a chart to illustrate that conclusion. The inverse would be: Start with the chart, and use the chart to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):

Albert_visualizingbaseball_chart

By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL. Or that the axis labels were “horizontal location” and “vertical location” (check) instead of px and pz. [Note: The chart above was taken from the book's github site; in the  Figure 5.8 in the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit” .

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.

***

Visualizing Baseball is not the book for readers who learn by running code as no code is included in the book. A github page by the author hosts the code, but only the R/ggplot2 code for generating the data visualization. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don’t recommend learning coding from copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.


Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.

Wsj_mediancompanypay

People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.

Wsj_medianpay_dunkinstarbucks

***

I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.

***

See previous discussion of WSJ Graphics.

 


Elegant way to present a pair of charts

The Bloomberg team has come up with a few goodies lately. I was captivated by the following graphic about the ebb and flow of U.S. presidential candidates across recent campaigns. Link to the full presentation here.

The highlight is at the bottom of the page. This is an excerpt of the chart:

Bloomberg_presidentialcandidates_1

From top to bottom are the sequential presidential races. The far right vertical axis is the finish line. Going right to left is the time before the finish line. In 2008, for example, there are candidates who entered the race much earlier than typical.

This chart presents an aggregate view of the data. We get a sense of when most of the candidates enter the race, when most of them are knocked out, and also a glimpse of outliers. The general pattern across multiple elections is also clear. The design is a stacked area chart with the baseline in the middle, rather than the bottom, of the chart.

Sure, the chart can disappoint those readers who want details and precise numbers. It's not immediately apparent how many candidates were in the race at the height of 2008, nor who the candidates were.

The designer added a nice touch. By clicking on any of the stacks, it transforms into a bar chart, showing the extent of each candidate's participation in the race.

Bloomberg_presidentialcandidates_2

I wish this was a way to collapse the bar chart back to the stack. You can reload the page to start afresh.

***

This elegant design touch makes the user experience playful. It's also an elegant way to present what is essentially a panel of plots. Imagine the more traditional presentation of placing the stack and the bar chart side by side.

This design does not escape the trade-off between entertainment value and data integrity. Looking at the 2004 campaign, one should expect to see the blue stack halve in size around day 100 when Kerry became the last man standing. That moment is not marked in the stack. The stack can be interpreted as a smoothed version of the count of active candidates.

Redo_bloombergpresidentialcandidates_3

I suppose some may complain the stack misrepresents the data somewhat. I find it an attractive way of presenting the big-picture message to an audience that mostly spend less than a minute looking at the graphic.


Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to take a more nuanced analysis of group differences.

***

The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry is shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.

This is the histogram for arts, entertainment and recreation.

Bloomberg_genderpaygap_arts

The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation that pays the average female about 40% more than the average male.

This is the histogram for the public sector.

Bloomberg_genderpaygap_public
It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.

***

The second part of the visualization look at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, that frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)

Bloomberg_genderpaygap_public_pieOn the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.

***

In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.

Redo_junkcharts_bloombergukgendergap

The net sum of the bar lengths is a plausible measure of gender inequality.

The industries are sorted from the ones employing fewer women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart - the small multiples of pie charts, the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.

***

The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.