Two views of earthquake occurrence in the Bay Area

This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake occurs, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set - in other words, the graphical details have no utility.


The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes over 6.0 in size occurred. The author appears to have classified quakes into three groups: "Largest" which are those at 6.5 or over; "Smaller but damaging" which are those between 6.0 and 6.5; and those below 6.0 (not shown).

For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.

image from

This chart focuses on the timing of rare events. The time between events is not as easy to see. 

What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?




Two good charts can use better titles

NPR has this chart, which I like:


It's a small multiples of bumps charts. Nice, clear labels. No unnecessary things like axis labels. Intuitive organization by Major Factor, Minor Factor, and Not a Factor.

Above all, the data convey a strong, surprising, message - despite many high-profile gun violence incidents this year, some Democratic voters are actually much less likely to see guns as a "major factor" in deciding their vote!

Of course, the overall importance of gun policy is down but the story of the chart is really about the collapse on the Democratic side, in a matter of two months.

The one missing thing about this chart is a nice, informative title: In two months, gun policy went from a major to a minor issue for some Democratic voters.


 I am impressed by this Financial Times effort:


The key here is the analysis. Most lazy analyses compare millennials to other generations but at current ages but this analyst looked at each generation at the same age range of 18 to 33 (i.e. controlling for age).

Again, the data convey a strong message - millennials have significantly higher un(der)employment than previous generations at their age range. Similar to the NPR chart above, the overall story is not nearly as interesting as the specific story - it is the pink area ("not in labour force") that is driving this trend.

Specifically, millennial unemployment rate is high because the proportion of people classified as "not in labour force" has doubled in 2014, compared to all previous generations depicted here. I really like this chart because it lays waste to a prevailing theory spread around by reputable economists - that somehow after the Great Recession, demographics trends are causing the explosion in people classified as "not in labor force". These people are nobodies when it comes to computing the unemployment rate. They literally do not count! There is simply no reason why someone just graduated from college should not be in the labour force by choice. (Dean Baker has a discussion of the theory that people not wanting to work is a long term trend.)

The legend would be better placed to the right of the columns, rather than the top.

Again, this chart benefits from a stronger headline: BLS Finds Millennials are twice as likely as previous generations to have dropped out of the labour force.





Discoloring the chart to re-discover its plot

Today's chart comes from Pew Research Center, and the big question is why the colors?


The data show the age distributions of people who believe different religions. It's a stacked bar chart, in which the ages have been grouped into the young (under 15), the old (60 plus) and everyone else. Five religions are afforded their own bars while "folk" religions are grouped as one, and so have "other" religions. There is even a bar for the unaffiliated. "World" presumably is the aggregate of all the other bars, weighted by the popularity of each religion group.

So far so good. But what is it that demands 9 colors, and 27 total shades? In other words, one shade for every data point on this chart.

Here is a more restrained view:



Let's follow the designer's various decisions. The choice of those age groups indicates that the story is really happening at the "margins": Muslims and Hindus have higher proportions of younger followers while Jews and Buddhists have higher concentrations of older followers.

Therein lies the problem. Because of the lengths, their central locations, and the tints, the middle section of each bar is the most eye-catching: the reader is glancing at the wrong part of the chart.

So, let me fix this by re-ordering the three panels:

Is there really a need to draw those gray bars? The middle age group (grab-all) only exists to assure readers that everyone who's supposed to be included has been included. Why plot it?


The above chart says "trust me, what isn't drawn here constitutes the remaining population, and the whole adds to 100%."


Another issue of these charts, exacerbated by inflexible software defaults, is the forced choice of imbuing one variable with a super status above the others. In the Pew chart, the rows are ordered by decreasing proportion of the young age group, except for the "everyone" group pinned as the bottom row. Therefore, the green bars (old age group) are not in a particular order, its pattern much harder to comprehend.

In the final version, I break the need to keep bars of the same religion on the same row:


Five colors are used. Three of them are used to cluster similar religions: Muslims and Hindus (in blue) have higher proportions of the young compared to the world average (gray) while the religions painted in green have higher proportions of the old. Christians (in orange) are unusual in that the proportions are higher than average in both young and old age groups. Everyone and unaffiliated are given separate colors.

The colors here serve two purposes: connecting the two panels, and revealing the cluster structure.





Several problems with stacked bar charts, as demonstrated by a Delta chart designer

In the Trifecta Checkup (link), I like to see the Question and the Visual work well together. Sometimes, you have a nice message but you just pick the wrong Visual.

An example is the following stacked column chart, used in an investor presentation by Delta.


From what I can tell, the five types of aircraft are divided into RJ (regional jet) and others (perhaps, larger jets). With each of those types, there are two or three subtypes. The primary message here is the reduction in the RJ fleet and the expansion of Small/Medium/Large.

One problem with a stacked column chart with five types is that it takes too much effort to understand the trends of the middle types.

The two types on the edges are not immune to confusion either. As shown below, both the dark blue (Large) type and the dark red (50-seat RJ) type are associated with downward sloping lines except that the former type is growing rapidly while the latter is vanishing from the mix!


 In this case, the slopegraph (Bumps-type chart) can overcome some of the limitations.



This example was used in my new dataviz workshop, launched in St. Louis yesterday. Thank you to the participants for making it a lively session!

Doing my duty on Pi Day #onelesspie

Xan Gregg and I started a #onelesspie campaign a few years ago. On Pi Day each year, we find a pie chart, and remake it. On Wikipedia, you can find all manners of pie chart. Try this search, and see for yourself.

Here's one found on the Wiki page about the city of Ogema, in Canada:


This chart has 20 age groups, each given a different color. That's way too much!

I was able to find data on 10-year age groups, not five. But the "shape" of the distribution is much easily seen on a column chart (a histogram).


Only a single color is needed.

The reason why I gravitated to this chart was the highly unusual age distribution... this town has almost uniform distribution of age groups, with each of the 10-year ranges accounting for about 11% of the population. Given that there are 9 groups, a perfectly even distribution would be 11% for each column. (Well, the last group of 80+ is cheating a bit as it has more than 10 years.)

I don't know about Ogema. Maybe a reader can explain this unusual age distribution!




Looking above the waist, dataviz style

I came across this chart on NYU's twitter feed. 


Growth has indeed been impressive; the dataviz less so. Here's the problem with not starting the vertical scale of a column chart at zero:


In a column chart, the heights of the columns should be proportional to the data. Here they are misaligned because an equal amount has been chopped off below 30,000 from all columns. The light purple that I layered on top of the chart presents the correct heights of the columns, assuming that the first column for 2007 indeed properly encoded the data.

The dark purple top of each column represents the "lie factor." It is the amount of exaggeration created by chopping off those legs. The lie factor is of Ed Tufte coinage.


The designer probably wanted to show the year-to-year trend more starkly. Doubling the number of applications in 10 years is pretty impressive. The solution is not to chop off the legs but to look above the waist. You can't fix the column chart but you can switch to a line chart, as follows:


In a line chart, we are mostly concerned with the changing slope of the line segments going from year to year. The slopes encode the year-on-year growth rates. 


When design goes awry

One can't accuse the following chart of lacking design. Strong is the evidence of departing from convention but the design decisions appear wayward. (The original link on Money here)



The donut chart (right) has nine sections. Eight of the sections (excepting A) have clearly all been bent out of shape. It turns out that section A does not have the right size either. The middle gray circle is not really in the middle, as seen below.


The bar charts (left) suffer from two ills. Firstly, the full width of the chart is at the 50 percent mark, so readers are forced to read the data labels to understand the data. Secondly, only the top two categories are shown, thus the size of the whole is lost. A stacked bar chart would serve better here.

Here is a bardot chart; the "dot" part of it makes it easier to see a Top 2 box analysis.


I explain the bardot chart here.


 PS. Here is Jamie's version (from the comment below):




When your main attraction is noise

Peter K. asked me about this 538 chart, which is a stacked column chart in which the percentages appear to not add up to 100%. Link to the article here.

538-cox-evangelicals-1Here's my reply:

They made the columns so tall that the "rounding errors" (noise) disclosed in the footnotes became the main attraction.


The gap between the highest and lowest peaks looks large but mostly due to the aspect ratio. The  gap is only ~2% at the widest (101% versus 99%) so it is the rounding error disclosed below the chart.

The lesson here is to make sure you suppress the noise and accentuate your data!