The Newslab project takes aggregate data from Google's various services and finds imaginative ways to enliven the data. The Beautiful in English project makes a strong case for adding playfulness to your data visualization.
The data came from Google Translate. The authors look at 10 languages, and the top 10 words users ask to translate from those languages into English.
The first chart focuses on the most popular word for each language. The crawling snake presents the "worldwide" top words.
The crawling motion and the curvature are not required by the data, but they insert a dimension of playfulness that engages the reader's attention.
The alternative of presenting a data table loses this virtue without gaining much in return.
Readers are asked to click on the top word in each country to reveal further statistics on the word.
For example, the word "good" leads to the following:
The second chart presents the top 10 words by language in a lollipop style:
The above diagram shows the top 10 Japanese words translated into English. This design sacrifices concision in order to achieve playfulness.
The standard format is a data table with one column for each country, and 10 words listed below each country header in order of decreasing frequency.
The creative lollipop display generates more extreme emotions - positive, or negative, depending on the reader. The data table is the safer choice, precisely because it does not engage the reader as deeply.
Today's chart comes from Pew Research Center, and the big question is why the colors?
The data show the age distributions of adherents of different religions. It's a stacked bar chart, in which the ages have been grouped into the young (under 15), the old (60 plus) and everyone else. Five religions are afforded their own bars while "folk" religions are grouped as one, as are "other" religions. There is even a bar for the unaffiliated. "World" presumably is the aggregate of all the other bars, weighted by the popularity of each religion group.
So far so good. But what is it that demands 9 colors, and 27 total shades? In other words, one shade for every data point on this chart.
Here is a more restrained view:
Let's follow the designer's various decisions. The choice of those age groups indicates that the story is really happening at the "margins": Muslims and Hindus have higher proportions of younger followers while Jews and Buddhists have higher concentrations of older followers.
Therein lies the problem. Because of their lengths, their central locations, and their tints, the middle sections of the bars are the most eye-catching: the reader is glancing at the wrong part of the chart.
So, let me fix this by re-ordering the three panels:
Is there really a need to draw those gray bars? The middle age group (grab-all) only exists to assure readers that everyone who's supposed to be included has been included. Why plot it?
The above chart says "trust me, what isn't drawn here constitutes the remaining population, and the whole adds to 100%."
Another issue with these charts, exacerbated by inflexible software defaults, is the forced choice of elevating one variable to a super status above the others. In the Pew chart, the rows are ordered by decreasing proportion of the young age group, except for the "everyone" group pinned as the bottom row. Therefore, the green bars (old age group) are not in any particular order, making their pattern much harder to comprehend.
In the final version, I break the need to keep bars of the same religion on the same row:
Five colors are used. Three of them are used to cluster similar religions: Muslims and Hindus (in blue) have higher proportions of the young compared to the world average (gray) while the religions painted in green have higher proportions of the old. Christians (in orange) are unusual in that the proportions are higher than average in both young and old age groups. The "everyone" and unaffiliated groups are given their own colors.
The colors here serve two purposes: connecting the two panels, and revealing the cluster structure.
The box "Global average" is doubly false. It is not global, and it is not the average!
The only non-American cities included in this survey are Toronto, Paris and London.
The only city with average salary above the "Global average" is San Francisco Bay Area. Since the Bay Area does not outweigh all other cities combined in the number of tech workers, it is impossible to get an average of $135,000.
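The arithmetic behind this objection is easy to check. A weighted average cannot sit above every component except one unless that one component carries most of the weight. A quick sketch with made-up salary and headcount figures (none of these numbers come from the survey):

```python
# Hypothetical (salary, headcount) pairs; all numbers are made up for
# illustration, not taken from the Hired survey.
cities = {
    "SF Bay Area": (145_000, 300_000),
    "New York":    (120_000, 250_000),
    "Chicago":     (107_000, 100_000),
}

total_workers = sum(n for _, n in cities.values())
weighted_avg = sum(s * n for s, n in cities.values()) / total_workers

# Even with the Bay Area at nearly half the total weight, the average
# is pulled well below the claimed $135,000 "global average".
print(round(weighted_avg))
```

Unless the Bay Area outweighs all other cities combined, the overall average must land much closer to the cheaper cities, which is why the $135,000 figure cannot be an average of these salaries.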
Here is the second chart.
What's wrong with these lines?
This chart frustrates the reader's expectations. The reader interprets it as a simple line chart, based on three strong hints:
time along the horizontal axis
data labels show dollar units
lines linking the data points over time
Each line seems to show the trend of average tech worker salary, in dollar units.
However, that isn't the designer's intention. Let's zoom in on Chicago and Denver:
The number $112,000 (Denver) sits below the number $107,000 (Chicago). It appears that each chart has its own scale. But that's not the case either.
For a small-multiples setup, we expect all charts should use the same scale. Even though the data labels are absolute dollar amounts, the vertical axis is on a relative scale (percent change). To make things even more complicated, the percent change is computed relative to the minimum of the three annual values, no matter which year it occurs.
That's why $106,000 (Chicago) is at the same level as $112,000 (Denver). Those are the minimum values in the respective time series. As shown above, these line charts are easier to understand if the axis is displayed in its true units of percent change.
The choice of using the minimum value as the reference level interferes with comparing one city to the next. For Chicago, the line chart tells us 2015 is about 2 percent above 2016 while 2017 is 6 percent above. For Denver, the line chart tells us that 2016 is about 2 percent above the 2015 and 2017 values. Now what's the message again?
Here I index all lines to the earliest year.
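The difference between the chart's indexing scheme and the proposed fix can be sketched in a few lines. The dollar figures below are illustrative stand-ins, not the article's actual data:

```python
# Two ways to index a salary time series: relative to the series minimum
# (the chart's choice) versus relative to the earliest year (the fix).
# Dollar figures are hypothetical, chosen to resemble the discussion.
def index_to_min(series):
    base = min(series)  # reference year varies by city
    return [round(100 * (v - base) / base, 1) for v in series]

def index_to_first(series):
    base = series[0]  # reference year is 2015 for every city
    return [round(100 * (v - base) / base, 1) for v in series]

chicago = [108_000, 106_000, 112_000]  # 2015, 2016, 2017 (hypothetical)
denver = [112_000, 114_000, 112_000]

# Indexed to the minimum, each city's zero lands on a different year,
# so the same vertical position means different things across panels.
print(index_to_min(chicago))   # zero at 2016
print(index_to_min(denver))    # zero at 2015 (and 2017)
# Indexed to the first year, zero means 2015 in every panel.
print(index_to_first(chicago))
```

With a common reference year, the vertical positions become directly comparable across the small multiples.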
In a Trifecta Checkup analysis (link), I'd be suspicious of the data. Did tech salaries in London really drop by 15-20 percent in the last three years?
I came across this chart from the OurWorldinData website, and this one would make the late Hans Rosling very happy.
If you ever attended one of Professor Rosling's talks, you know he was bitter that the amazing gains in public health worldwide (particularly in less developed nations) during the last few decades have gone largely unnoticed. This chart makes it clear: note especially the dramatic plunge in extreme poverty, rise in vaccinations, drop in child mortality, and improvement in education and literacy, mostly achieved in the last few decades.
This set of charts has a simple but powerful message. It's the simplicity of execution that really helps readers get that powerful message.
The text labels on the left and right side of the charts are just perfect.
Little things that irk me:
I am not convinced by the liberal use of colors - I would render the "other" category of each chart in a consistent gray, for six colors total. Admittedly, the varied colors do make the chart more interesting to look at.
Even though the gridlines are muted, I still find them excessive.
There is a coding bug in the Vaccination chart right around 1960.
When I look at this chart (from Business Insider), I try to understand the decisions made by its designer - which things are important to her/him, and which things are less important.
The chart shows average salaries in the top 2 percent of income earners. The data are split by gender and by state.
First, I notice that the designer chooses to use the map form. This decision suggests that the spatial pattern of top incomes is of top interest to the designer because she/he is willing to accept the map's constraints - namely, the designer loses control of the x and y dimensions, as well as the area and shape of the data containers. For the U.S. state map, there is no elegant solution to the large number of small states problem in the Northeast.
Second, I notice the color choice. The designer provides actual values on the visualization but also groups all state-average incomes into five categories. It's not clear how she/he determines the boundaries of these income brackets. There are many more dark blue states than light blue states in the map for men. Because women's incomes are everywhere lower than men's, the map at the bottom fits all states into two large buckets, plus Connecticut. Women's incomes are lower than men's, but there is no need to break the data down by gender to convey this message.
Third, the use of two maps indicates that the designer does not care much about gender comparisons within each state. These comparisons are difficult to accomplish on the chart - one must bob one's head up and down to make the comparisons. The head bobbing isn't even enough: one must then pull out a calculator and compute the ratio of the women's average to the men's. If the designer wants to highlight state-level comparisons, she/he could have plotted the gender ratio on a single map, like this:
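The quantity plotted on that single map is just one number per state. A minimal sketch of the computation, using entirely hypothetical salary figures for three states:

```python
# Hypothetical top-2-percent average salaries by state; these figures
# are invented for illustration, not taken from Business Insider.
men = {"MT": 310_000, "NY": 420_000, "OK": 350_000}
women = {"MT": 280_000, "NY": 310_000, "OK": 260_000}

# One number per state: women's average as a share of men's average.
ratio = {state: women[state] / men[state] for state in men}

for state, r in sorted(ratio.items(), key=lambda kv: -kv[1]):
    print(f"{state}: women earn {r:.0%} of the men's average")
```

Plotting this single ratio on one map spares the reader both the head bobbing and the mental division.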
So far, I infer that the key questions are (a) the gender gap in aggregate (b) the variability of incomes within each gender, or the spatial clustering (c) the gender gap within each state.
(a) is better conveyed in more aggregate form. Goal (b) is defeated by the lack of clear clustering. (c) is not helped by the top-bottom split.
In making the above chart, I discover a pattern - women fare better in smaller states like Montana, Iowa, and North & South Dakota. Meanwhile, the disparity in New York is of the same degree as in Oklahoma and Wyoming.
This chart tells readers a bit more about the underlying data, without having to print the entire dataset on the page.
This chart by Axios is well made. The full version is here.
It's easy to identify all the Cat 5 hurricanes. Only the important ones are labeled; the other labels are hidden behind the hover. The chart provides a good answer to the question: what time of the year do the worst hurricanes strike? It's harder to compare the maximum speeds of the hurricanes.
I wish there were a way to incorporate geography. I'd be willing to trade off the trajectory of wind speeds, as the max speed is of most use.
Recently, I noted how we have to learn to hate defaults in data visualization software. I was reminded again of this point when reviewing this submission from long-time reader & contributor Chris P.
The chart is included in this Medium article, which credits Mott Capital Management as the source.
Look at the axis labels on the right side. They have the hallmarks of software defaults. The software designer decided that the axis labels will be formatted in exactly the same way as the data in that column: this means $XXX.XXB, with two decimal places. The same formatting rule is in place for the data labels, shown in boxes.
Why put tick marks at the odd intervals, 37.50, 62.50, 87.50, ... ? What's wrong with 40, 60, 80, 100, ...? It comes down to machine thinking versus human thinking.
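The human-thinking alternative is well known in graphics programming: snap the tick interval to a "nice" number (1, 2, or 5 times a power of ten) rather than mechanically dividing the axis range. A simplified sketch of both behaviors, using a hypothetical axis range:

```python
import math

def machine_ticks(lo, hi, n):
    # Machine thinking: split the range into n equal parts,
    # whatever awkward values fall out.
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]

def nice_step(raw):
    # Human thinking: snap the raw step to 1, 2, or 5
    # times a power of ten.
    mag = 10 ** math.floor(math.log10(raw))
    frac = raw / mag
    if frac < 1.5:
        nice = 1
    elif frac < 3.5:
        nice = 2
    elif frac < 7.5:
        nice = 5
    else:
        nice = 10
    return nice * mag

def human_ticks(lo, hi, n):
    step = nice_step((hi - lo) / n)
    t = math.ceil(lo / step) * step  # first round value inside the range
    ticks = []
    while t <= hi + 1e-9:
        ticks.append(t)
        t += step
    return ticks

print(machine_ticks(12.5, 137.5, 5))  # 12.5, 37.5, 62.5, 87.5, ...
print(human_ticks(12.5, 137.5, 5))    # 20, 40, 60, 80, ...
```

The machine version reproduces the chart's 37.50/62.50/87.50 labels; the human version produces the round numbers a reader actually expects.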
This software places the most recent values into data labels, formatted as boxes that point to the positions of those values on the axis. Evidently, it doesn't have a plan for overcrowding. At the bottom of the axis, we see four labels for six lines. The blue, pink and orange labels point to the wrong places on the axis.
Worse, it's unclear what those "most recent" values represent. I have added gridlines for each year on the excerpt shown right. The lines extend to 2017, which isn't even half over.
Now, consider the legend. Which version do you prefer?
Most likely, the original dataset has columns named "Amazon.com Revenue (TTM)", "Dillard's Revenue (TTM)", etc. so the software just picks those up and prints them in the legend text.
The chart is an output from YCharts, which I learned is a Bloomberg terminal competitor. It probably uses one of the available Web graphing packages out there. These packages typically emphasize ease of use through automating the process of data visualization. Ease of use is defined as rigid defaults that someone determines are the optimal settings. Users then discover that there is no getting around those settings; in some cases, a coding interface is available, which defeats the goal of user-friendliness.
The problem lies in defining what ease of use means. Ease of use should mean expanding, not restricting, choices. Setting rigid defaults restricts choices. In addition to providing good defaults, the software designer should make it simple for users to override them. Ideally, each of the elements (data labels, gridlines, tick marks, etc.) can be independently removed, shifted, expanded, reduced, re-colored, or edited from their original settings.
The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:
The original chart was published by the Stat News website (link).
I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data are not freely available. The claim is that the data come from self-reports by 36,000 physicians.
I am not sure whether I trust these data. For example:
Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second highest salary number comes from South Dakota. And then Idaho. Also, these high-salary states are correlated with the lowest gender wage gaps.
I suspect that sample size is an issue. They do not report sample size at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S. so at that level, on average, they have only 90 samples per MSA. When split by gender, the average sample size is less than 50. Then, they are comparing differences, so we should see the standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.
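The back-of-envelope arithmetic above can be made concrete. Assuming a salary standard deviation (the figure below is purely illustrative; Doximity reports no such number), the standard error of a gender-gap estimate at these sample sizes is alarmingly large:

```python
import math

# 36,000 self-reports spread over ~400 MSAs.
reports, msas = 36_000, 400
per_msa = reports / msas      # average samples per MSA
per_gender = per_msa / 2      # average samples per gender per MSA

# Assumed salary standard deviation - purely illustrative, not from
# the Doximity report.
sd = 100_000

# Standard error of the difference between two independent means.
se_gap = math.sqrt(sd**2 / per_gender + sd**2 / per_gender)

print(per_msa, per_gender, round(se_gap))
```

Under these assumptions, a measured gender gap of $20,000 in a typical MSA would be about one standard error from zero, i.e. statistically indistinguishable from noise, before any multiple-comparisons correction is even applied.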
I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?
Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on either axis is a nice idea, well executed.
I don't see the point of drawing the circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing to it. Because of this construct, the size of the bubbles is now encoding the male average salary, taking attention away from the gender gap which is the point of the chart.
I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.
This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and works as if the dataset is complete. There is no indication of any concern about sample sizes, after the analyst drills down to finer areas of the dataset. While there are other variables available, such as specialty, and other variables that can be merged in, such as income levels, all of which may explain at least a portion of the gender wage gap, no attempt has been made to incorporate other factors. We are stuck with a bivariate analysis that does not control for any other factors.
Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)
P.S. The Stat News article reports that the researchers at Doximity claimed that they controlled for "hours worked and other factors that might explain the wage gap." However, in Doximity's own report, there is no language confirming how they included the controls.