Women workers taken for a loop or four

I was drawn to the following chart in Business Insider because of the calendar metaphor. (The accompanying article is here.)


Sometimes, the calendar helps readers grasp concepts faster but I'm afraid the usage here slows us down.

The underlying data consist of just four numbers: the wage gaps between race and gender in the U.S., considered simply from an aggregate median personal income perspective. The analyst adopts the median annual salary of a white male worker as a baseline. Then, s/he imputes the number of extra days that others must work to attain the same level of income. For example, the median Asian female worker must work 64 extra days (at her daily salary level) to match the white guy's annual pay. Meanwhile, Hispanic female workers must work 324 days extra.
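The imputation described above can be sketched in a few lines. This is a minimal sketch, assuming a 365-day year in which every day is paid at the same daily rate (the same simplification the chart makes). The function names are illustrative, and the $46,000 baseline is the rough Census figure cited later in this post.

```python
def extra_days(baseline_income, group_income, days_per_year=365):
    """Extra days a worker must work, at their own daily rate,
    to match the baseline annual income."""
    daily_rate = group_income / days_per_year
    days_needed = baseline_income / daily_rate
    return days_needed - days_per_year


def implied_income(baseline_income, extra, days_per_year=365):
    """Invert the calculation: the annual income implied by a given
    number of extra days."""
    return baseline_income * days_per_year / (days_per_year + extra)


baseline = 46_000  # rough median for white male workers, per the post
for extra in (64, 324):
    income = implied_income(baseline, extra)
    print(f"{extra} extra days implies an annual income of about ${income:,.0f}")
```

Note that the extra days grow faster than the pay gap: under these assumptions, a worker earning half the baseline needs a full extra year.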

There are a host of reasons why the calendar metaphor backfired.

Firstly, it draws attention to an uncomfortable detail of the analysis - which papers over the fact that weekends or public holidays are counted as workdays. The coloring of the boxes compounds this issue. (And the designer also got confused and slipped up when applying the purple color for Hispanic women.)

Secondly, the calendar focuses on Year 2 while Year 1 lurks in the background - white men have to work to get that income (roughly $46,000 in 2017 according to the Census Bureau).

Thirdly, the calendar view exposes another sore point around the underlying analysis. In reality, the white male workers are continuing to earn wages during Year 2.

The realism of the calendar clashes with the hypothetical nature of the analysis.


One can just use a bar chart, comparing the number of extra days needed. The calendar design can be considered a set of overlapping bars, wrapped around the shape of a calendar.

The staid bars do not bring to life the extra toil - the message is that these women have to work harder to get the same amount of pay. This led me to a different metaphor - the white men got to the destination in a straight line but the women must go around loops (extra days) before reaching the same endpoint.


While the above is a rough sketch, I made sure that the total length of the lines including the loops roughly matches the total number of days the women needed to work to earn $46,000.


The above discussion focuses solely on the V(isual) corner of the Trifecta Checkup, but this data visualization is also interesting from the D(ata) perspective. Statisticians won't like such a simple analysis that ignores, among other things, the different mix of jobs and industries underlying these aggregate pay figures.

Now go to my other post on the sister (book) blog for a discussion of the underlying analysis.



Where are the Democratic donors?

I like Alberto's discussion of the attractive maps about donors to Democratic presidential candidates, produced by the New York Times (direct link).

Here is the headline map:


The message is clear: Bernie Sanders is the only candidate with nation-wide appeal. The breadth of his coverage is breathtaking. (I agree with Alberto's critique about the lack of a color scale. It's impossible to know if the counts are trivial or not.)

Bernie's coverage is so broad that his numbers overwhelm those of all other candidates except in their home bases (e.g. O'Rourke in Texas).

A remedy to this is to look at the data after removing Bernie's numbers.



This pair of maps reminds me of the Sri Lanka religions map that I revisualized in this post.


The first two maps divide the districts into those in which one religion dominates and those in which multiple religions share the limelight. The third map then shows the second-rank religion in the mixed-religions districts.

The second map in the NYT's donor map series plots the second-rank candidate in all the precincts that Bernie Sanders leads. It's like the designer pulled off the top layer (blue: Bernie) to reveal what's underneath.

Because all of Bernie's data are removed, O'Rourke is still dominating Texas, Buttigieg in Indiana, etc. An alternative is to pull off the top layer in those pockets as well. Then, Bernie would likely show up in those areas.

The other startling observation is how small Joe Biden's presence is on these maps. This is likely because Biden relies primarily on big donors.

See here for the entire series of donor maps. See here for past discussion of New York Times's graphics.

SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you’d not regret spending time reading. After reading it, you have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere. 


This line chart offers a possible explanation for burgeoning protests. Residents perceived their freedoms eroding in the last decade.


If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:


The scale of these protests captures the imagination. The last part in the infographic places the number of protestors in context, by expressing it in terms of football pitches (as soccer fields are known outside the U.S.). This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension even though fields used for international matches are standardized to 105 m by 68 m.)


This chart could be presented as a bar chart. It’s just that the data have been re-scaled – from counting individuals to counting football pitches-ful of individuals. 
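The re-scaling is simple arithmetic, sketched below. This is a rough illustration only: it uses the 105 m by 68 m international pitch mentioned above, and `people_per_m2` is a hypothetical crowd-density parameter of my choosing, not a figure taken from the SCMP infographic.

```python
PITCH_LENGTH_M = 105
PITCH_WIDTH_M = 68
PITCH_AREA_M2 = PITCH_LENGTH_M * PITCH_WIDTH_M  # area of one international pitch


def pitches_needed(people, people_per_m2):
    """Number of pitches filled by a crowd standing at the given density.

    people_per_m2 is an assumed density, not a value from the infographic.
    """
    return people / (people_per_m2 * PITCH_AREA_M2)


# Example with an assumed density of 2 people per square metre.
print(round(pitches_needed(2_000_000, 2.0), 1))
```

Because the conversion is a division by a constant, the relative sizes of the protests are unchanged; only the unit is different.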

Here is the entire infographic.

An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 


It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 million to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.
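The mismatch between slope and growth rate can be checked with the two lines I drew in. The sketch below uses only the 500K, 1 million and 2 million values from the text; the function names are illustrative. It also shows, as an aside, that the rise would be identical on a log-scaled axis.

```python
import math


def growth_rate(start, end):
    """Relative growth: 1.0 means a 100% increase."""
    return (end - start) / start


def linear_rise(start, end):
    """Vertical distance covered on an axis of absolute numbers."""
    return end - start


def log_rise(start, end):
    """Vertical distance covered on a log10 axis."""
    return math.log10(end) - math.log10(start)


for start, end in [(500_000, 1_000_000), (1_000_000, 2_000_000)]:
    print(
        f"{start:>9,} to {end:>9,}: "
        f"growth {growth_rate(start, end):.0%}, "
        f"linear rise {linear_rise(start, end):,}, "
        f"log rise {log_rise(start, end):.3f}"
    )
```

Over the same time span, the second line rises twice as far as the first on the chart's axis even though both represent the same growth rate.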

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?


Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. The four most selective groups of colleges have been able to attract progressively more applicants. Since class size did not expand appreciably, more applicants result in an ever-lower admit rate. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.




Pretty circular things

National Geographic features this graphic illustrating migration into the U.S. from the 1850s to the present.



What to Like

It's definitely eye-catching, and some readers will be enticed to spend time figuring out how to read this chart.

The inset reveals that the chart is made up of little colored strips that mix together. This produces a pleasing effect of gradual color gradation.

The white rings that separate decades are crucial. Without those rings, the chart becomes one long run-on sentence.

Once the reader invests time in learning how to read the chart, the reader will grasp the big picture. One learns, for example, that migrants from the most recent decades have come primarily from Latin America (orange) or Asia (pink). Migrants from Europe (green) and Canada (blue) came in waves but have been muted in the last few decades.


What's baffling

Initially, the chart is disorienting. It's not obvious whether the compass directions mean anything. We can immediately understand that the further out we go, the larger the number of migrants. But what about which direction?

The key appears in the legend - which should be moved from bottom right to top left as it's so important. Apparently, continent/country of origin is coded in the directions.

This region-to-color coding seems to be rough-edged by design. The color mixing discussed above provides a nice artistic effect. Here, the reader finds out that mixing is primarily between two neighboring colors, thus two regions placed side by side on the chart. Thus, because Europe (green) and Asia (pink) are on opposite sides of the rings, those two colors do not mix.

Another notable feature of the chart is the lack of any data other than the decade labels. We won't learn how many migrants arrived in any decade, or the extent of migration as it impacts population size.

A couple of other comments on the circular design.

The circles expand in size for sure as time moves from inside out. Thus, this design only works well for "monotonic" data, that is to say, migration always increases as time passes.

The appearance of the chart is only mildly affected by the underlying data. Swapping the regions of origin changes the appearance of this design drastically.






Check out the Lifespan of News project

Alberto Cairo introduces another one of his collaborations with Google, visualizing Google search data. We previously looked at other projects here.

The latest project, designed by Schema, Axios, and Google News Initiative, tracks the trending of popular news stories over time and space, and it's a great example of making sense of a huge pile of data.

The design team produced a sequence of graphics to illustrate the data. The top news stories are grouped by category, such as Politics & Elections, Violence & War, and Environment & Science, each given a distinct color maintained throughout the project.

The first chart is an area chart that looks at individual stories, and tracks the volume over time.


To read this chart, you have to notice that the vertical axis measuring volume is a log scale, meaning that each tick mark up represents a 10-fold increase. Log scale is frequently used to draw far-away data closer to the middle, making it possible to see both ends of a wide distribution on the same chart. The log transformation introduces distortion deliberately. The smaller data look disproportionately large because of it.

The time scrolls automatically so that you feel a rise and fall of various news stories. It's a great way to experience the news cycle in the past year. The overlapping areas show competing news stories that shared the limelight at that point in time.

Just bear in mind that you have to mentally reverse the distortion introduced by the log scale.
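To get a feel for that distortion, here is a minimal sketch comparing where a value lands on a log axis versus a linear axis. The axis range of 1 to 10,000 is a hypothetical choice for illustration, not the range used in the project.

```python
import math


def log_position(value, vmin, vmax):
    """Fraction of the axis height (0 to 1) at which value sits on a log10 axis."""
    span = math.log10(vmax) - math.log10(vmin)
    return (math.log10(value) - math.log10(vmin)) / span


def linear_position(value, vmin, vmax):
    """Fraction of the axis height (0 to 1) at which value sits on a linear axis."""
    return (value - vmin) / (vmax - vmin)


vmin, vmax = 1, 10_000  # hypothetical axis range
for value in (10, 100, 1_000):
    print(
        f"value {value:>5}: log axis {log_position(value, vmin, vmax):.2f}, "
        f"linear axis {linear_position(value, vmin, vmax):.4f}"
    )
```

On this hypothetical axis, a story with one percent of the peak volume is drawn halfway up the log scale, which is why the smaller stories look larger than they are.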


In the second part of the project, they tackle regional patterns. Now you see a map with proportional symbols. The top story in each locality is highlighted with the color of the topic. As time flows by, the sizes of the bubbles expand and contract.


Sometimes, the entire nation was consumed by the same story, e.g. certain obituaries. At other times, people in different regions focused on different topics.


In the last part of the project, they describe general shapes of the popularity curves. Most stories have one peak although certain stories like the U.S. government shutdown have multiple peaks. There is also variation in terms of how fast a story rises to the peak and how quickly it fades away.

The most interesting aspect of the project can be learned from the footnote. The data are not direct hits to the Google News stories but searches on Google. For each story, one (or more) unique search terms are matched, and only those stories are counted. A "control" is established, which is an excellent idea. The control gives meaning to those counts. The control used here is the number of searches for the generic term "Google News." Presumably this is a relatively stable number that is a proxy for general search activity. Thus, the "volume" metric is really a relative measure against this control.
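The idea of a relative measure can be sketched as a simple ratio. The footnote does not spell out the exact formula, so dividing by the control count is my assumption, and the counts below are made up purely for illustration.

```python
def relative_volume(story_searches, control_searches):
    """Story searches expressed relative to the control term.

    Assumes the metric is a plain ratio; the project's actual formula
    is not given in the post.
    """
    if control_searches <= 0:
        raise ValueError("control_searches must be positive")
    return story_searches / control_searches


# Hypothetical counts: overall search activity doubles between two days,
# but the story's share relative to the control stays the same.
day_1 = relative_volume(story_searches=5_000, control_searches=1_000)
day_2 = relative_volume(story_searches=10_000, control_searches=2_000)
print(day_1, day_2)
```

The point of the control is visible here: a rise in raw searches that merely tracks general search activity does not register as a rise in the story's volume.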





NYT hits the trifecta with this market correction chart

Yesterday, on the front page of the Business section, the New York Times published a pair of charts that perfectly captures the story of the ongoing turbulence in the stock market.

Here is the first chart:


Most market observers are very concerned about the S&P entering "correction" territory, which the industry arbitrarily defines as a drop of 10% or more from a peak. This corresponds to the shortest line on the above chart.

The chart promotes a longer-term reflection on the recent turbulence, using two reference points: the index is back level with where it stood at the start of 2018, and it is about 16 percent higher than at the beginning of 2017.

This is all done tastefully in a clear, understandable graphic.

Then, in a bit of a rhetorical flourish, the bottom of the page makes another point:


Viewed over a 10-year period, this chart shows that the S&P has exploded by 300% since 2009.

A connection is made between the two charts via the color of the lines, plus the simple, effective annotation "Chart above".

The second chart adds even more context, through vertical bands indicating previous corrections (drops of at least 10%). These moments are connected to the first graphic via the beige color. The extra material conveys the message that the market has survived multiple corrections during this long bull period.

Together, the pair of charts addresses a pressing current issue, and presents a direct, insightful answer in a simple, effective visual design, so it hits the Trifecta!


There are a couple of interesting challenges related to connecting plots within a multiple-plot framework.

While the beige color connects the concept of "market correction" in the top and bottom charts, it can also be a source of confusion. The orientation and the visual interpretation of those bands differ. The first chart uses one horizontal band while the chart below shows multiple vertical bands. In the first chart, the horizontal band refers to a definition of correction while in the second chart, the vertical bands indicate experienced corrections.

Is there a solution in which the bands have the same orientation and same meaning?


These graphs solve a visual problem concerning the visualization of growth over time. Growth rates are anchored to some starting time. A ten-percent reduction means nothing unless you are told ten percent of what.

Using different starting times as reference points, one gets different values of growth rates. With highly variable series of data like stock prices, picking starting times even a day apart can lead to vastly different growth rates.

The designer here picked several obvious reference times, and superimposed multiple lines on the same plotting canvas. Instead of having four lines on one chart, we have three lines on one, and four lines on the other. This limits the number of messages per chart, which speeds up cognition.

The first chart depicts this visual challenge well. Look at the start of 2018. This second line appears as if you can just reset the start point to 0, and drag the remaining portion of the line down. The part of the top line (to the right of Jan 2018) looks just like the second line that starts at Jan 2018.


However, a closer look reveals that the shape may be the same but the magnitude isn't. There is a subtle re-scaling in addition to the re-set to zero.

The same thing happens at the starting moment of the third line. You can't just drag the portion of the first or second line down - there is also a needed re-scaling.
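The re-scaling can be seen in a small worked example. This is a sketch with a made-up price series, not S&P data: it computes percent change from two different reference points and shows that shifting the first line down to zero does not reproduce the second.

```python
def pct_change_from(prices, base_index):
    """Percent change of each later price relative to prices[base_index]."""
    base = prices[base_index]
    return [100.0 * (p / base - 1.0) for p in prices[base_index:]]


prices = [100.0, 120.0, 132.0, 108.0]  # hypothetical index levels

from_first = pct_change_from(prices, 0)   # growth measured from the first point
from_second = pct_change_from(prices, 1)  # growth measured from the second point

# "Dragging the line down": subtract the growth already achieved at the
# second point from the remaining portion of the first line.
shifted = [v - from_first[1] for v in from_first[1:]]

# The correct conversion also divides by the growth factor at the new base.
factor = 1.0 + from_first[1] / 100.0
rescaled = [v / factor for v in shifted]

print("from second point:", [round(v, 2) for v in from_second])
print("shifted only:     ", [round(v, 2) for v in shifted])
print("shifted + rescaled:", [round(v, 2) for v in rescaled])
```

The shifted line has the right shape but the wrong magnitudes; only after dividing by the growth factor at the new starting point does it match the line anchored there.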

Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way. Register here.), as they deal with a related issue.

The first chart shows an income gap widening between 1970 and 2016.


This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.


Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appear at the bottom of the list while blacks show up as the second "worst" in terms of income inequality. Even though the underlying data are the same, what can be seen in the Bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.

What else does not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team for realizing that the two-lines design provides an incomplete, potentially misleading picture.
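Why the two metrics can rank groups differently is easy to show with a toy example. The incomes below are hypothetical, chosen only to illustrate the effect; they are not the NYT's figures.

```python
def income_gap(p90, p10):
    """Absolute distance between the two lines in the two-lines design."""
    return p90 - p10


def income_ratio(p90, p10):
    """Ratio of 90th-percentile to 10th-percentile income (bumps chart metric)."""
    return p90 / p10


# Hypothetical groups: (10th-percentile income, 90th-percentile income)
groups = {
    "group A": (10_000, 100_000),
    "group B": (20_000, 120_000),
}

for name, (p10, p90) in groups.items():
    print(f"{name}: gap {income_gap(p90, p10):,}, ratio {income_ratio(p90, p10):.1f}")
```

Group B has the wider gap but the lower ratio, because the ratio is driven by the small bottom-decile number that gets squashed on the dollar scale.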


The third chart in the series is a marvellous scatter plot (with one small snafu, which I'll get to).


What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian arrivals (most of whom are Indian) push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicities can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.






Visualizing the Thai cave rescue operation

The Thai cave rescue was a great story with a happy ending. It's also one that lends itself to visualization. A good visualization can explain the rescue operation more efficiently than mere words.

A good visual should bring out the most salient features of the story, such as:

  • Why was the operation so daunting?
  • What were the tactics used to overcome those challenges?
  • How long did it take?
  • What were the specific local challenges that must be overcome?
  • Were there any surprises?

In terms of what made the rescue challenging, some of the following are pertinent:

  • How far in were they?
  • How deep were they trapped?
  • How much of the caves were flooded? Why couldn't they come out by themselves?
  • How much headroom was there in different sections of the cave "tunnel"?

There were many attempts at visualizing the Thai cave rescue operation. The best ones I saw were: BBC (here, here), The New York Times (here), South China Morning Post (here) and Straits Times (here). It turns out each of these efforts focuses on some of the aspects above, and you have to look at all of them to get the full picture.


BBC's coverage began with a top-down view of the route of the rescue, which seems to be the most popular view adopted by news organizations. This is easily understood because of the standard map aesthetic.


The BBC map is missing a smaller map of Thailand to place this in a geographical context.

While this map provides basic information, it doesn't address many of the elements that make the Thai cave rescue story compelling. In particular, human beings are missing from this visualization. The focus is on the actions ("diving", "standing"). This perspective also does not address the water level, the key underlying environmental factor.


Another popular perspective is the sideways cross-section. The Straits Times has one:


The excerpt of the infographic presents a nice collection of data that show the effort of the rescue. The sideways cross-section shows the distance and the up-and-down nature of the journey, the level of flooding along the route, plus a bit about the headroom available at different points. Most of these diagrams bring out the "horizontal" distance but somehow ignore the "vertical" distance. One possibility is that the real trajectory is curvy - but if we can straighten out the horizontal, we should be able to straighten out the vertical too.

The NYT article gives a more detailed view of the same perspective, with annotations that describe key moments along the rescue route.


If, like me, you like to place humans into this picture, then you have to go back to the Straits Times, where they have an expanded version of the sideways cross-section.


This is probably my favorite single visualization of the rescue operation.

There are better cartoons of the specific diving actions, though. For example, the BBC has this visual that shows the particularly narrow part of the route, corresponding to the circular inset in the Straits Times version above.


The drama!

NYT also has a set of cartoons. Here's one:



There is one perspective that curiously has been underserved in all of the visualizations - this is the first-person perspective. Imagine the rescuer (or the kids) navigating the rescue route. It's a cross-section from the front, not from the side.

Various publications try to address this by augmenting the top-down route view with sporadic cross-sectional diagrams. Recall the first map we showed from the BBC. On the right column are little annotations of this type (here):


I picked out this part of the map because it shows that the little human figure serves two potentially conflicting purposes. In the bottom diagram, the figurine shows that there is limited headroom in this part of the cave, plus the actual position of the figurine on the ledge conveys information about where the kids were. However, on the top cross-section, the location of the figure conveys no information; the only purpose of the human figure is to show how tall the cave is at that site.

The South China Morning Post (here - the site appeared to be down when I wrote this) has this wonderful animation of how the shape of the headroom changed as they navigated the route. Please visit their page to see the full animation. Here are two screenshots:



This little clip adds a lot to the story! It'd be even better if the horizontal timeline at the bottom were replaced by the top-down route map.

Thank you to all the various dataviz teams for these great efforts.




Some Tufte basics brought to you by your favorite birds

Someone sent me this via Twitter, found on the Data is Beautiful reddit:


The chart does not deliver on its promise: It's tough to know which birds like which seeds.

The original chart was also provided in the reddit:


I can see why someone would want to remake this visualization.

Let's just apply some Tufte fixes to it, and see what happens.

Our starting point is this:


First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.

That is a problem because for this data, you'd like green > yellow > red > white. (By the way, it's not explained what white means. I'm assuming it means the least preferred, so not preferred that one wouldn't consider that seed type relevant.)

Compare the above with this version that uses a one-dimensional sequential color scale:


The white color still stands out more than necessary. Fix this using a gray color.


What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.


The gridlines are also too thick. Here's a slimmed-down look:


The visual is much improved.

But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.
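The column re-ordering can be sketched as follows. This is a minimal illustration, assuming preferences are coded as scores (0 for not eaten up to 3 for most preferred) and that "popularity" means the total score across birds; the bird and seed names and the scores are placeholders, not the chart's data.

```python
# Hypothetical preference scores: 0 = not eaten ... 3 = most preferred
prefs = {
    "bird A": {"seed Z": 0, "seed Y": 1, "seed X": 3},
    "bird B": {"seed Z": 0, "seed Y": 3, "seed X": 2},
    "bird C": {"seed Z": 1, "seed Y": 0, "seed X": 3},
}


def order_seeds_by_popularity(prefs):
    """Return seed names sorted from most to least popular.

    Popularity is taken to be the sum of scores across all birds;
    ties are broken alphabetically so the order is stable.
    """
    totals = {}
    for scores in prefs.values():
        for seed, score in scores.items():
            totals[seed] = totals.get(seed, 0) + score
    return sorted(totals, key=lambda seed: (-totals[seed], seed))


column_order = order_seeds_by_popularity(prefs)
print("columns:", column_order)
for bird, scores in prefs.items():
    # Each row of the grid, with cells arranged in the new column order.
    print(bird, [scores[seed] for seed in column_order])
```

Sorting by a summary statistic like this is what pushes the densely colored columns to the left and the sparse ones to the right.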


Look for your favorite bird. Then find out which are its most preferred seeds.

Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)



PS. [7/23/18] Fixed the 5th and 6th images, and the animated gif as well. The row labels were scrambled in the original version.