
Hog wild about dot maps

Reader Chris P. sent me this chart.

This was meant to be "light entertainment." See the Twitter discussion below.



Let's think a bit about the dot map as a data graphic.

Dot maps are one-dimensional. The dot's location is used to indicate the latitude and longitude, and therefore the x,y coordinates cannot encode any other data. If we have basically a black/white chart, as in this hog map, the dot can only encode binary data (yes/no).

The legend says "each dot represents 5,000 hogs." Think about how that statement applies to these scenarios:

  • Do you expect to see something different between the dot representing 4,200 and the one showing 4,900?
  • Do you expect to see something different between the dot representing 400 and 4,000?
  • Do you expect to see something different between the location with 4,800 hogs and 9,600 hogs?

Based on the legend, the designer would need two dots to represent 10,000 hogs. But those two dots pertain to the same location. Sometimes, "jitter" is added, and the two dots are placed side by side. However, with the scale of the map of the U.S., and the dots representing seemingly small neighborhoods, jitter creates more confusion than anything. Also, what about 3, 4, 5, ... dots in the same location?
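To make the three scenarios above concrete, here is a minimal sketch of how hog counts could turn into dots. The map does not say which rounding rule the designer used, so the three rules below (floor, nearest, ceiling) are assumptions for illustration only; the 5,000 figure is the one from the legend.

```python
import math

HOGS_PER_DOT = 5_000  # from the map legend

# Three hypothetical rounding rules; the chart does not say which applies.
def dots_floor(hogs):
    """One dot for each complete block of 5,000 hogs."""
    return hogs // HOGS_PER_DOT

def dots_nearest(hogs):
    """Round to the nearest whole number of dots (halves round up)."""
    return math.floor(hogs / HOGS_PER_DOT + 0.5)

def dots_ceil(hogs):
    """At least one dot for any location with hogs."""
    return math.ceil(hogs / HOGS_PER_DOT)

counts = [400, 4_000, 4_200, 4_800, 4_900, 9_600, 10_000]
table = {h: (dots_floor(h), dots_nearest(h), dots_ceil(h)) for h in counts}

for hogs, (f, n, c) in table.items():
    print(f"{hogs:>6} hogs -> floor: {f}, nearest: {n}, ceil: {c}")
```

Under every one of these rules, 4,200 and 4,900 hogs produce the same picture, and whether 400 hogs show up at all depends entirely on a choice the reader cannot see.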


Looking at the details above, are the dots jittered or do they represent neighboring locations?

Sometimes, colors are used to encode data on a dot map. But each dot can contain only one color, so it typically shows only the top category in each location.

Dot maps are very limited. Think before you use them.


Beauty is in the eyes of the fishes

Reader Patrick S. sent in this old gem from Germany.


He said:

It displays the change in numbers of visitors to public pools in the German city of Hanover. The invisible y-axis seems to be, um, nonlinear, but at least it's monotonic, in contrast to the invisible x-axis.

There's a nice touch, though: The eyes of the fish are pie charts. Black: outdoor pools, white: indoor pools (as explained in the bottom left corner).

It's taken from a 1960 publication of the city of Hanover called *Hannover: Die Stadt in der wir leben*.

This is the kind of chart that Ed Tufte made (in)famous. The visual elements do not serve the data at all, except for the eyeballs. The design becomes a mere vessel for the data table. The reader who wants to know the growth rate of swimmers has to do a tank of work.

The eyeballs though.

I like the fact that these pie charts do not come with data labels. This part of the chart passes the self-sufficiency test. In fact, the eyeballs contain the most interesting story in this chart. In those four years, the visitors to public pools switched from mostly indoor pools to mostly outdoor pools. These eyeballs show that pie charts can be effective in specific situations.

Now, Hanover fishes are quite lucky to have free admission to the public pools!

Playfulness in data visualization

The Newslab project takes aggregate data from Google's various services and finds imaginative ways to enliven the data. The Beautiful in English project makes a strong case for adding playfulness to your data visualization.

The data came from Google Translate. The authors look at 10 languages, and the top 10 words users ask to translate from those languages into English.

The first chart focuses on the most popular word for each language. The crawling snake presents the "worldwide" top words.

The crawling motion and the curvature are not required by the data, but they insert a dimension of playfulness that engages the reader's attention.

The alternative of presenting a data table loses this virtue without gaining much in return.

Readers are asked to click on the top word in each country to reveal further statistics on the word.

For example, the word "good" leads to the following:




The second chart presents the top 10 words by language in a lollipop style:


The above diagram shows the top 10 Japanese words translated into English. This design sacrifices concision in order to achieve playfulness.

The standard format is a data table with one column for each country, and 10 words listed below each country header in order of decreasing frequency.

The creative lollipop display generates more extreme emotions - positive, or negative, depending on the reader. The data table is the safer choice, precisely because it does not engage the reader as deeply.



Lines, gridlines, reference lines, regression lines, the works

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.


This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare unfavorably to the gender ratio in the neighborhoods, as we saw in the previous post).


Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. The labels inform the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when the chart contains gridlines, we expect the labels to sit right around each gridline, either on top or just below the line. Here the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines are drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people have worked on these labels, as there exist two patterns: the first is "X% Leaders are Women", and the second is "Y% Female." (Actually, the top and bottom labels are also inconsistent, one using "women" and the other "female".)

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should be 25%, 35%, ..., 65%, 75%, in essence the same scale as the vertical axis.

Here is the same chart with improved axis labels:


Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proportion of females as the rest of the staff. In the following plot (right side), I added in the 45-degree line. Note that it is positioned awkwardly on top of the grid system. The culprit is the incompatible gridlines.


The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.



Now that we have dealt with the purely visual issues, let me get to a statistical issue that's been troubling me. It's about that yellow line. It's supposed to be a regression line that runs through the points.

Does it appear biased downwards to you? It just seems that there are too many dots above and not enough below. The distance of the furthest points above also appears to be larger than that of the distant points below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of leadership staff. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)
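The claim that the average must sit on the regression line can be checked with a few lines of code. This is a minimal sketch, assuming the yellow line is meant to be an ordinary (unweighted) least-squares fit with an intercept and that "AVERAGE" is the simple mean of each variable; the six data points are made up for illustration and are not taken from the Newslab chart.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares intercept is what forces the line through the means.
    intercept = y_bar - slope * x_bar
    return slope, intercept, x_bar, y_bar

# Hypothetical newsrooms: % female among staff and among leaders.
staff_pct = [30.0, 38.0, 42.0, 45.0, 51.0, 58.0]
leader_pct = [25.0, 41.0, 36.0, 50.0, 47.0, 62.0]

slope, intercept, x_bar, y_bar = fit_line(staff_pct, leader_pct)
predicted_at_mean = intercept + slope * x_bar

print(f"average point: ({x_bar:.2f}, {y_bar:.2f})")
print(f"line at average staff share: {predicted_at_mean:.2f}")
```

Whatever the slope turns out to be, the fitted value at the average staff share equals the average leadership share. If the line were weighted (say, by newsroom size) while the average were not, the two would no longer have to coincide, which is one possible source of the mismatch.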

Note the corresponding chart for racial diversity has it right. The yellow line does pass through the average point here:



In practice, how do problems seep into dataviz projects? It's the fact that you don't get to the last chart via a clean, streamlined process but that you pass through a cycle of explore-retrench-synthesize, frequently bouncing ideas between several people, and it's challenging to keep consistency!

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.



Well-structured, interactive graphic about newsrooms

Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.

The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.

One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.

At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.


The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)

The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.

I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.


The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.


Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.

Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.


Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.


The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.

While I typically recommend showing a balanced axis for a bipolar scale, this chart may be an exception. Moving to the right side is progress but the target sits in the middle; the goal isn't to get the dots to the far right, so much of the right panel is wasted space.


Discoloring the chart to re-discover its plot

Today's chart comes from Pew Research Center, and the big question is why the colors?


The data show the age distributions of people who follow different religions. It's a stacked bar chart, in which the ages have been grouped into the young (under 15), the old (60 plus) and everyone else. Five religions are afforded their own bars while "folk" religions are grouped as one, as are "other" religions. There is even a bar for the unaffiliated. "World" presumably is the aggregate of all the other bars, weighted by the popularity of each religion group.
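If "World" is indeed a popularity-weighted aggregate, the arithmetic would look like the following sketch. The group names, population weights and age shares below are all hypothetical placeholders, not Pew's figures; only the three age bands come from the chart.

```python
# Hypothetical groups: (population weight, age shares that sum to 1).
groups = {
    "Group A": (2.0, {"under 15": 0.30, "15 to 59": 0.60, "60 plus": 0.10}),
    "Group B": (1.0, {"under 15": 0.20, "15 to 59": 0.62, "60 plus": 0.18}),
    "Group C": (0.5, {"under 15": 0.18, "15 to 59": 0.60, "60 plus": 0.22}),
}

def weighted_shares(groups):
    """Average each age band across groups, weighting by population."""
    total = sum(pop for pop, _ in groups.values())
    bands = next(iter(groups.values()))[1].keys()
    return {
        band: sum(pop * shares[band] for pop, shares in groups.values()) / total
        for band in bands
    }

world = weighted_shares(groups)
for band, share in world.items():
    print(f"{band}: {share:.1%}")
```

Because the largest hypothetical group is also the youngest, the aggregate "under 15" share lands above the simple average of the three bars, which is why the weighting matters when reading a "World" row.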

So far so good. But what is it that demands 9 colors, and 27 total shades? In other words, one shade for every data point on this chart.

Here is a more restrained view:



Let's follow the designer's various decisions. The choice of those age groups indicates that the story is really happening at the "margins": Muslims and Hindus have higher proportions of younger followers while Jews and Buddhists have higher concentrations of older followers.

Therein lies the problem. Because of the lengths, their central locations, and the tints, the middle section of each bar is the most eye-catching: the reader is glancing at the wrong part of the chart.

So, let me fix this by re-ordering the three panels:

Is there really a need to draw those gray bars? The middle age group (grab-all) only exists to assure readers that everyone who's supposed to be included has been included. Why plot it?


The above chart says "trust me, what isn't drawn here constitutes the remaining population, and the whole adds to 100%."


Another issue with these charts, exacerbated by inflexible software defaults, is the forced choice of imbuing one variable with a super status above the others. In the Pew chart, the rows are ordered by decreasing proportion of the young age group, except for the "everyone" group pinned as the bottom row. Therefore, the green bars (old age group) are not in any particular order, and their pattern is much harder to comprehend.

In the final version, I break the need to keep bars of the same religion on the same row:


Five colors are used. Three of them are used to cluster similar religions: Muslims and Hindus (in blue) have higher proportions of the young compared to the world average (gray) while the religions painted in green have higher proportions of the old. Christians (in orange) are unusual in that the proportions are higher than average in both young and old age groups. Everyone and unaffiliated are given separate colors.

The colors here serve two purposes: connecting the two panels, and revealing the cluster structure.