Pretty circular things

National Geographic features this graphic illustrating migration into the U.S. from the 1850s to the present.



What to Like

It's definitely eye-catching, and some readers will be enticed to spend time figuring out how to read this chart.

The inset reveals that the chart is made up of little colored strips that mix together. This produces a pleasing effect of gradual color gradation.

The white rings that separate decades are crucial. Without those rings, the chart becomes one long run-on sentence.

Once the reader invests time in learning how to read the chart, the reader will grasp the big picture. One learns, for example, that migrants from the most recent decades have come primarily from Latin America (orange) or Asia (pink). Migrants from Europe (green) and Canada (blue) came in waves but have been muted in the last few decades.


What's baffling

Initially, the chart is disorienting. It's not obvious whether the compass directions mean anything. We can immediately understand that the further out we go, the larger the number of migrants. But what about the direction?

The key appears in the legend - which should be moved from bottom right to top left as it's so important. Apparently, continent/country of origin is coded in the directions.

This region-to-color coding seems to be rough-edged by design. The color mixing discussed above provides a nice artistic effect. Here, the reader finds out that mixing is primarily between two neighboring colors, thus two regions placed side by side on the chart. Thus, because Europe (green) and Asia (pink) are on opposite sides of the rings, those two colors do not mix.

Another notable feature of the chart is the lack of any data other than the decade labels. We won't learn how many migrants arrived in any decade, or the extent of migration as it impacts population size.

A couple of other comments on the circular design.

The circles expand in size for sure as time moves from inside out. Thus, this design only works well for "monotonic" data, that is to say, migration always increases as time passes.

The appearance of the chart is only mildly affected by the underlying data. Swapping the regions of origin changes the appearance of this design drastically.






Trump resistance chart: cleaning up order, importance, weight, paneling

Vox featured the following chart when discussing the rise of resistance to President Trump within the GOP.

The chart is composed of mirrored bar charts. On the left side, with thicker pink bars that draw more attention, the design depicts the share of a particular GOP demographic segment that said they'd likely vote for a Trump challenger, according to a Morning Consult poll.

This is the primary metric of interest, and the entire chart is ordered by descending values from African Americans who are most likely (67%) to turn to a challenger to those who strongly support Trump and are the least likely (17%) to turn to someone else.

The right side shows the importance of each demographic, measured by the share of GOP. The relationship between importance and likelihood to defect from Trump is by and large negative but that fact takes a bit of effort to extract from this mirrored bar chart arrangement.

The subgroups are not complete. For example, the only ethnicity featured is African Americans. Age groups are somewhat more complete with under 18 being the only missing category.

The design makes it easy to pick off the most disaffected demographic segments (and the least, from the bottom) but these are disparate segments, possibly overlapping.


One challenge of this data is differentiating the two series of proportions. In this design, they use visual cues, like the height and width of the bars, colors, stacked vs not, data labels. Visual variety comes to the rescue.

Also note that the designer compensated for the lack of stacking on the left chart by printing data labels.


When reading this chart, I'm well aware that segments like urban residents, income more than $100K, at least college educated are overlapping, and it's hard to interpret the data the way it's been presented.

I wanted to place the different demographics into their natural groups, such as age, income, urbanicity, etc. Such a structure also surfaces demographic patterns, e.g. men are slightly more disaffected than women (not significant), people earning $100K+ are more unhappy than those earning $50K-.

Further, I'd like to make it easier to understand the importance factor - the share of GOP. Because the original form orders the demographics according to the left side, the proportions on the right side are jumbled.

Here is a draft of what I have in mind:


The widths of the line segments show the importance of each demographic segment. The longest line segments are toward the bottom of the chart (< 40% likely to vote for Trump challenger).


McKinsey thinks the data world needs more dataviz talent

Note about last week: While not blogging, I delivered four lectures on three topics over five days: one on the use of data analytics in marketing for a marketing class at Temple; two on the interplay of analytics and data visualization, at Yeshiva and a JMP Webinar; and one on how to live during the Data Revolution at NYU.

This week, I'm back at blogging.

McKinsey publishes a report confirming what most of us already know or experience - the explosion of data jobs that just isn't stopping.

On page 5, it says something that is of interest to readers of this blog: "As data grows more complex, distilling it and bringing it to life through visualization is becoming critical to help make the results of data analyses digestible for decision makers. We estimate that demand for visualization grew roughly 50 percent annually from 2010 to 2015." (my bolding)

The report contains a number of unfortunate graphics. Here's one:


I applied my self-sufficiency test by removing the bottom row of data from the chart. Here is what happened to the second circle, representing the fraction of value realized by the U.S. health care industry.


What does the visual say? This is one of the questions in the Trifecta Checkup. We see three categories of things that should add up to 100 percent. With a little more effort, we find the two colored categories are each 10% while the white area is 80%. 

But that's not what the data say, because there is only one thing being measured: how much of the potential has already been realized. The two colors are an attempt to visualize the uncertainty of the estimated proportion, which in this case is described as 10 to 20 percent underneath the chart.

If we have to describe what the two colored sections represent: the dark green section is the lower bound of the estimate while the medium green section is the range of uncertainty. The edge between the two sections is the actual estimated proportion (assuming the uncertainty bound is symmetric around the estimate)!

A first attempt to fix this might be to use line segments instead of colored arcs. 


The middle diagram emphasizes the mid-point estimate while the right diagram, the range of estimates. Observe how differently these two diagrams appear from the original one shown on the left.

This design only works if the reader perceives the chart as a "racetrack" chart. You have to see the invisible vertical line at the top, which is the starting line, and measure how far around the track the symbol has gone. I have previously discussed why I don't like racetracks (for example, here and here).


Here is a sketch of another design:


The center figure will have to be moved and changed to a different shape. This design conveys the sense of a goal (at 100%) and how far one is along the path. The uncertainty is represented by wave-like elements that make the exact location of the pointer arrow appear as wavering.





No Latin honors for graphic design

This chart appeared in a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton,  47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported GPA to be 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-.  Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.


Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.


Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percent point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.


The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100 dots is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.


On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.
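The grouping rule for the left chart can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `split_into_groups` is hypothetical, and the dot counts are the rounded percentages read off the pictogram above.

```python
def split_into_groups(count):
    """Split a dot count into groups of 10, groups of 5, and leftover singletons."""
    tens, rest = divmod(count, 10)
    fives, singles = divmod(rest, 5)
    return tens, fives, singles

# Rounded percentage points per honor level, as read off the pictogram.
honors = {"Summa": 13, "Magna": 16, "Cum Laude": 19}

for level, dots in honors.items():
    tens, fives, singles = split_into_groups(dots)
    # Build the "10 + 5 + 1" style breakdown: 10-groups first, then 5-groups, then singletons.
    parts = ["10"] * tens + ["5"] * fives + ([str(singles)] if singles else [])
    print(f"{level} = {' + '.join(parts)}")
```

The same function also shows why the 5-group-only variant on the right chart needs more connectors: a 10-group is simply replaced by two 5-groups.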

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

Further, I exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader has figured out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).


Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost evenly into thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of the counting!






Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way. Register here.), as they deal with a related issue.

The first chart shows an income gap widening between 1970 and 2016.


This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.


Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appear at the bottom of the list while blacks show up as the second "worst" in terms of income inequality. Even though the underlying data are the same, what can be seen in the bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.
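To see how the two metrics can rank groups differently, here is a minimal Python sketch. The dollar amounts and group names are hypothetical, chosen only to illustrate the effect; they are not the NYT data.

```python
def gap_and_ratio(p90, p10):
    """Return (income gap, income ratio) for 90th- and 10th-percentile incomes."""
    return p90 - p10, p90 / p10

# Hypothetical incomes in dollars, for illustration only (not from the article).
groups = {
    "Group A": {"p90": 120_000, "p10": 15_000},
    "Group B": {"p90": 60_000, "p10": 6_000},
}

for name, incomes in groups.items():
    gap, ratio = gap_and_ratio(incomes["p90"], incomes["p10"])
    print(f"{name}: gap = {gap:,}, ratio = {ratio:.1f}")
```

In this made-up example, Group A has the larger gap while Group B has the larger ratio. A chart scaled to the large top incomes shows the gap, and the small bottom incomes that drive the ratio barely register.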

What else does not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team for realizing that the two-lines design provides an incomplete, potentially misleading picture.


The third chart in the series is a marvellous scatter plot (with one small snafu, which I'll get to).


What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian (most of whom are Indians) arrivals push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicities can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.






Finding simple ways to explain complicated data and concepts, using some Pew data

A reader submitted the following chart from Pew Research for discussion.


The reader complained that this chart was difficult to comprehend. What are some of the reasons?

The use of color is superfluous. Each line is a "cohort" of people being tracked over time. Each cohort is given its own color or hue. But the color or hue does not signify much.

The dotted lines. This design element requires a footnote to explain. The reader learns that some of the numbers on the chart are projections because those numbers pertain to time well into the future. The chart was published in 2014, using historical data so any numbers dated 2014 or after (and even some data before 2014) will be projections. The data are in fact encoded in the dots, not the slopes. Look at the cohort that has one solid line segment and one dotted line segment - it's unclear which of those three data points are projections, and which are experienced.

The focus on within-cohort trends. The line segments indicate the desire of the designer to emphasize trends within each cohort. However, it's not clear what the underlying message is. It may be that more and more people are not getting married (i.e. fewer people are getting married). That trend affects each of the three age groups - and it's easier to paint that message by focusing on between-cohort trends.

Here is a chart that emphasizes the between-cohort trends.


A key decision is to not mix oil and water. The within-cohort analysis is presented in its own chart, next to the between-cohort analysis. It turns out that some of the gap between cohorts can be explained by people deferring marriage to later in life. The steep line on the right indicates that a bigger proportion of people now gets married between 35 and 44 than in previous cohorts.

I experimented a bit with the axes here. Several pie charts are used in lieu of axis labels. I also plotted a dual axis with the proportion of unmarried on the one side, and the corresponding proportion of married on the other side.

Some Tufte basics brought to you by your favorite birds

Someone sent me this via Twitter, found on the Data is Beautiful reddit:


The chart does not deliver on its promise: It's tough to know which birds like which seeds.

The original chart was also provided in the reddit:


I can see why someone would want to remake this visualization.

Let's just apply some Tufte fixes to it, and see what happens.

Our starting point is this:


First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.

That is a problem because for this data, you'd like green > yellow > red > white. (By the way, it's not explained what white means. I'm assuming it means the least preferred, so not preferred that one wouldn't consider that seed type relevant.)

Compare the above with this version that uses a one-dimensional sequential color scale:


The white color still stands out more than necessary. Fix this using a gray color.


What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.


The gridlines are also too thick. Here's a slimmed-down look:


The visual is much improved.

But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.
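The column re-ordering can be sketched in Python as follows. The birds, seeds, and scores here are hypothetical stand-ins (3 = most preferred, 0 = not preferred), and defining "popularity" as the sum of scores across birds is my assumption for this sketch.

```python
# Hypothetical preference scores: 3 = most preferred, 0 = not preferred.
preferences = {
    "Cardinal": {"sunflower": 3, "millet": 1, "thistle": 0},
    "Finch":    {"sunflower": 2, "millet": 1, "thistle": 3},
    "Sparrow":  {"sunflower": 2, "millet": 3, "thistle": 0},
}

def order_seeds_by_popularity(prefs):
    """Return seed names sorted from most to least popular (sum of scores across birds)."""
    totals = {}
    for scores in prefs.values():
        for seed, score in scores.items():
            totals[seed] = totals.get(seed, 0) + score
    return sorted(totals, key=lambda seed: totals[seed], reverse=True)

seed_order = order_seeds_by_popularity(preferences)
print(seed_order)  # column order for the heatmap, left to right
```

Other definitions of popularity, such as counting how many birds rate a seed as most preferred, would work equally well; the point is that the column order should carry information.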


Look for your favorite bird. Then find out which are its most preferred seeds.

Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)



PS. [7/23/18] Fixed the 5th and 6th images and also in the animated gif. The row labels were scrambled in the original version.


Two thousand five hundred ways to say the same thing

Wallethub published a credit card debt study, which includes the following map:


Let's describe what's going on here.

The map plots cities (N = 2,562) in the U.S. Each city is represented by a bubble. The color of the bubble ranges from purple to green, encoding the percentile ranking based on the amount of credit card debt that was paid down by consumers. Purple represents 1st percentile, the lowest amount of paydown while green represents 99th percentile, the highest amount of paydown.

The bubble size is encoding exactly the same data, apparently in a coarser gradation. The more purple the color, the smaller the bubble. The more green the color, the larger the bubble.


The design decisions are baffling.

Purple is more noticeable than the green, but signifies the less important cities, with the lesser paydowns.

With over 2,500 bubbles crowding onto the map, over-plotting is inevitable. The purple bubbles are printed last, dominating the attention but those are the least important cities (1st percentile). The green bubbles, despite being larger, lie underneath the smaller, purple bubbles.

What might be the message of this chart? Our best guess is: the map explores the regional variation in the paydown rate of credit card debt.

The analyst provides all the data beneath the map. 


From this table, we learn that the ranking is not based on total amount of debt paydown, but the amount of paydown per household in each city (last column). That makes sense.

Shouldn't it be ranked by the paydown rate instead of the per-household number? Dividing the "Total Credit Card Paydown by City" by "Total Credit Card Debt Q1 2018" should yield the paydown rate. Surprise! This formula yields a column consisting entirely of 4.16%.

What does this mean? They applied the national paydown rate of 4.16% to every one of 2,562 cities in the country. If they had plotted the paydown rate, every city would attain the same color. To create "variability," they plotted the per-household debt paydown amount. Said differently, the color scale encodes not credit card paydown as asserted but amount of credit card debt per household by city.
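A small Python sketch makes the circularity concrete. Only the 4.16% national rate comes from the study; the city names, debt totals, and household counts are hypothetical.

```python
import math

NATIONAL_PAYDOWN_RATE = 0.0416  # the single rate found in the published table

# Hypothetical cities: (total credit card debt in dollars, number of households).
cities = {
    "City A": (500_000_000, 50_000),
    "City B": (120_000_000, 20_000),
    "City C": (900_000_000, 60_000),
}

results = {}
for name, (debt, households) in cities.items():
    paydown = debt * NATIONAL_PAYDOWN_RATE       # how the paydown column appears to be built
    results[name] = {
        "rate": paydown / debt,                  # recovers the national rate for every city
        "debt_per_hh": debt / households,
        "paydown_per_hh": paydown / households,  # the quantity the map actually ranks
    }

for name, r in results.items():
    print(f"{name}: rate = {r['rate']:.2%}, "
          f"debt/hh = {r['debt_per_hh']:,.0f}, paydown/hh = {r['paydown_per_hh']:,.0f}")
```

Because the per-household paydown is just the per-household debt multiplied by a constant, ranking cities by one is the same as ranking them by the other.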

Here is a scatter plot of the credit card amount against the paydown amount.


A perfect alignment!

This credit card debt paydown map is an example of a QDV chart, in which there isn't a clear question, there is almost no data, and the visual contains several flaws. (See our Trifecta checkup guide.) We are presented 2,562 ways of saying the same thing: 4.16%.


P.S. [6/22/2018] Added scatter plot, and cleaned up some language.




Fantastic visual, but the Google data need some pre-processing

Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries across the world that ask "how." The project web page is here.

The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like in France. (It's always instructive to think about how they would count "France" queries. Is it queries written in French? Queries from an IP address in France? A combination of the above?)


I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data concern the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.

By comparison, the Russian picture looks very different:


Are the Russians more sensible? Their searches are far and away about the washing machine, which is the most complicated piece of equipment on the graphic.

At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:


I have to confess that I searched for "how to make soft boiled eggs". That led me to a lot of different webpages, mostly created for people who search for how to make a soft boiled egg. All of them contain lots of advertising, and the answer boils down to cook it for 6 minutes.


The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.

In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.

The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all France queries about fixing home appliances. That is really the meaning of popularity we want to have but we don't have. We can obtain true popularity measures by "normalizing" the Search Index: just sum up the Index Values of all the appliances and divide the Search Index by the sum of the Indices. After normalizing, the numbers can be interpreted as proportions and they add up to 100% for each country. When not normalized, the indices do not add to 100%.

Take the case in which we have five appliances, and let's say all five appliances are equally popular, comprising 20% of searches each. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add to 500!

By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.
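The normalization, and the two extreme cases just described, can be sketched in Python. The appliance names and the skewed index values are hypothetical; only the rule that the top-ranked item gets an index of 100 comes from the discussion above.

```python
def normalize(indices):
    """Convert Search Index values into shares of the total that sum to 1."""
    total = sum(indices.values())
    return {item: value / total for item, value in indices.items()}

# Case 1: five equally popular appliances, so every index is 100.
equal = {"window": 100, "door": 100, "toilet": 100, "fridge": 100, "washer": 100}

# Case 2 (hypothetical values): one query dominates, the rest are tiny.
skewed = {"washer": 100, "window": 2, "door": 1, "toilet": 1, "fridge": 1}

for label, indices in [("equal", equal), ("skewed", skewed)]:
    shares = normalize(indices)
    top_item = max(shares, key=shares.get)
    print(f"{label}: sum of indices = {sum(indices.values())}, "
          f"top share = {shares[top_item]:.0%}")
```

An index of 100 corresponds to a 20% share in the first case and a share above 95% in the second, which is why raw indices are hard to compare across countries.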

If you realize this, then you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total queries is accounted for by the top query.

In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.



Well-structured, interactive graphic about newsrooms

Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.

The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.

One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.

At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.


The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)

The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.

I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.


The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.


Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.

Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.


Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.


The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.

While I typically recommend showing a balanced axis for a bipolar scale, this chart may be an exception. Moving to the right side is progress but the target sits in the middle; the goal isn't to get the dots to the far right, so much of the right panel is wasted space.