Two views of earthquake occurrence in the Bay Area

This article has a nice description of earthquake occurrence in the San Francisco Bay Area. A few quantities are of interest: when the next quake will occur, the size of the quake, the epicenter of the quake, etc. The data graphic included in the article fails the self-sufficiency test: the only way to read this chart is to read out the entire data set from the labels - in other words, the graphical details have no utility.

Earthquake-probability-chart

The article points out the clustering of earthquakes. In particular, there is a 68-year "quiet period" between 1911 and 1979, during which no quakes of magnitude 6.0 or above occurred. The author appears to have classified quakes into three groups: "Largest," those of magnitude 6.5 or above; "Smaller but damaging," those between 6.0 and 6.5; and those below 6.0 (not shown).

For a more standard and more effective visualization of this dataset, see this post on a related chart (about avian flu outbreaks). The post discusses a bubble chart versus a column chart. I prefer the column chart.

image from junkcharts.typepad.com

This chart focuses on the timing of rare events. The time between events is not as easy to see. 

What if we want to focus on the "quiet years" between earthquakes? Here is a visualization that addresses the question: when will the next one hit us?

Redo_jc_earthquakeprobability

The downside of discouraging pie charts

It's no secret most dataviz experts do not like pie charts.

Our disdain for pie charts causes people to look for alternatives.

Sometimes, the alternative is worse. Witness:

Schwab_bloombergaggregatebondindex

This chart comes from the Spring 2018 issue of On Investing, the magazine for Charles Schwab customers.

It's not a pie chart.

Redo_jc_bondindex

I'm forced to say the pie chart is preferred.

The original chart fails the self-sufficiency test. Here is the 2007 chart with the data removed.

Bloombergbondindex_sufficiency

It's very hard to figure out how large those pieces are, so any reader trying to understand this chart will resort to reading the data labels, which means the visual representation does no work!

Or, you can use a dot plot.

Redo_jc_bondindex2

This version emphasizes the change over time.

Beauty is in the eyes of the fishes

Reader Patrick S. sent in this old gem from Germany.

Swimmingpoolsvisitors_ger

He said:

It displays the change in numbers of visitors to public pools in the German city of Hanover. The invisible y-axis seems to be, um, nonlinear, but at least it's monotonic, in contrast to the invisible x-axis.

There's a nice touch, though: The eyes of the fish are pie charts. Black: outdoor pools, white: indoor pools (as explained in the bottom left corner).

It's taken from a 1960 publication of the city of Hanover called *Hannover: Die Stadt in der wir leben*.

This is the kind of chart that Ed Tufte made (in)famous. The visual elements do not serve the data at all, except for the eyeballs. The design becomes a mere vessel for the data table. The reader who wants to know the growth rate of swimmers has to do a tank of work.

The eyeballs though.

I like the fact that these pie charts do not come with data labels. This part of the chart passes the self-sufficiency test. In fact, the eyeballs contain the most interesting story in this chart. In those four years, the visitors to public pools switched from mostly indoor pools to mostly outdoor pools. These eyeballs show that pie charts can be effective in specific situations.

Now, Hanover fishes are quite lucky to have free admission to the public pools!


When design goes awry

One can't accuse the following chart of lacking design. The evidence of departing from convention is strong, but the design decisions appear wayward. (The original link on Money is here.)

Mc_cellphones_money17

The donut chart (right) has nine sections. Eight of them (all except A) have clearly been bent out of shape. It turns out that section A is not the right size either, and the gray circle is not really in the middle, as seen below.

Redo_mc_cellphone
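
In a faithfully drawn donut, each section's arc angle is simply its share of the total times 360 degrees. A minimal sketch with hypothetical shares (the chart's actual values are not reproduced here):

```python
# Correct arc angles for a proportional donut chart.
# The nine shares below are hypothetical, not the chart's data.
shares = [28, 22, 15, 10, 8, 7, 5, 3, 2]  # percentages

total = sum(shares)
angles = [360 * s / total for s in shares]  # degrees per section

# The angles must sum to a full circle; any section whose drawn
# angle deviates from its computed value has been bent out of shape.
assert abs(sum(angles) - 360) < 1e-9
for share, angle in zip(shares, angles):
    print(f"{share:>3}% -> {angle:.1f} deg")
```

Checking a published donut against this rule only requires measuring the arc angles and comparing them with the printed percentages.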

The bar charts (left) suffer from two ills. First, the full width of the chart represents only 50 percent, so readers are forced to read the data labels to understand the data. Second, only the top two categories are shown, so the size of the whole is lost. A stacked bar chart would serve better here.

Here is a bardot chart; the "dot" part of it makes it easier to see a Top 2 box analysis.

Redo_jc_mc_cellphone_2

I explain the bardot chart here.

PS. Here is Jamie's version (from the comment below):

Jamie_mc_cellphone

The visual should be easier to read than your data

A reader sent this tip in some time ago and I lost track of who he/she is. This graphic looks deceptively complex.

MW-FW350_1milli_20171016112101_NS

What's complex is not the underlying analysis. The design is complex and so the decoding is complex.

The question of the graphic is a central concern of anyone who's retired: how long will one's savings last? There are two related metrics to describe the durability of the stash, and they are both present on this chart. The designer first presumes that one has saved $1 million for retirement. Then he/she computes how many years the savings will last. That, of course, depends on the cost of living, which naively can be expressed as a projected annual expenditure. The designer allows the cost of living to vary by state, which is the main source of variability in the computations. The time-based and dollar-based metrics are directly linked to one another via a formula.
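
Under the naive assumption of constant spending and no investment returns (the article's exact methodology is not spelled out), the two metrics are linked by simple division:

```python
# How long $1 million lasts, assuming constant annual expenditure
# and ignoring investment returns and inflation. The expenditure
# figures in the examples below are hypothetical.
SAVINGS = 1_000_000

def years_and_months(annual_expenditure: float) -> tuple:
    """Convert the dollar-based metric into the time-based one."""
    total_months = round(SAVINGS / annual_expenditure * 12)
    return divmod(total_months, 12)  # (years, months)

# A cheaper state stretches the same savings over more years.
print(years_and_months(40_000))  # (25, 0)
print(years_and_months(60_000))  # (16, 8)
```

Because the two numbers are deterministic transforms of one another, encoding both on the chart adds no information.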

The design encodes the time metric in a grid of dots, and the dollar-metric in the color of the dots. The expenditures are divided into eight segments, given eight colors from deep blue to deep pink.

Thirteen of those dots are invariable, appearing in every state. Readers are drawn into a ranking of the states, which is nothing but a ranking of costs of living. (We don't know, but presume, that the cost of living computation is appropriate for retirees, and not averaged.) This order obscures any spatial correlation.

There are a few production errors in the first row in which the year and month numbers are misstated slightly; the numbers should be monotonically decreasing. In terms of years and months, the difference between many states is immaterial.

The pictogram format is more popular than it deserves: only highly motivated readers will count individual dots. If readers are merely reading the printed text, which contains all the data encoded in the dots, then the graphic has failed the self-sufficiency principle - the visual elements are not doing any work.

***

In my version, I surface the spatial correlation using maps. The states are classified into sensible groups that allow a story to be told around the analysis. Three groups of states are identified and separately portrayed. The finer variation among states within each group appears as shading.

Redo_howlonglive

Data visualization should make the underlying data easier to comprehend. It's a problem when the graphic is harder to decipher than the underlying dataset.


A pretty good chart ruined by some naive analysis

The following chart showing wage gaps by gender among U.S. physicians was sent to me via Twitter:

Statnews_physicianwages

The original chart was published by the Stat News website (link).

I am most curious about the source of the data. It apparently came from a website called Doximity, which collects data from physicians. Here is a link to the PR release related to this compensation dataset. However, the data is not freely available. The claim is that these data come from self-reports by 36,000 physicians.

I am not sure whether I trust this data. For example:

Stat_wagegapdoctor_1

Do I believe that physicians in North Dakota earn the highest salaries on average in the nation? And not only that, they earn almost 30% more than the average physician in New York. Does the average physician in ND really earn over $400K a year? If you are wondering, the second-highest salary number comes from South Dakota, and then Idaho. Also, these high-salary states are among those with the lowest gender wage gaps.

I suspect that sample size is an issue. They do not report sample size at the level of their analyses. They apparently published statistics at the level of MSAs. There are roughly 400 MSAs in the U.S. so at that level, on average, they have only 90 samples per MSA. When split by gender, the average sample size is less than 50. Then, they are comparing differences, so we should see the standard errors. And finally, they are making hundreds of such comparisons, for which some kind of multiple-comparisons correction is needed.
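
To see why roughly 45 respondents per gender per MSA is thin ground, consider the standard error of a difference between two group means. All figures below are illustrative assumptions, not Doximity's numbers:

```python
import math

def se_of_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Physician salaries are widely dispersed; assume a standard
# deviation of $100K (an illustrative guess) and ~45 respondents
# per gender per MSA.
se = se_of_difference(100_000, 45, 100_000, 45)
print(round(se))  # about 21,000
```

With a standard error of that size, a 95% interval around an observed gap spans roughly plus or minus $42K, so many MSA-level gaps could be indistinguishable from noise, even before applying any multiple-comparisons correction.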

I am pretty sure some of you are doctors, or work in health care. Do those salary numbers make sense? Are you moving to North/South Dakota?

***

Turning to the Visual corner of the Trifecta Checkup (link), I have a mixed verdict. The hover-over effect showing the precise values on either axis is a nice idea, well executed.

I don't see the point of drawing a circle inside a circle. The wage gap is already on the vertical axis, and the redundant representation in dual circles adds nothing to it. Because of this construct, the size of the bubbles now encodes the male average salary, taking attention away from the gender gap, which is the point of the chart.

I also don't think the regional analysis (conveyed by the colors of the bubbles) is producing a story line.

***

This is another instance of a dubious analysis in this "big data" era. The analyst makes no attempt to correct for self-reporting bias, and works as if the dataset is complete. There is no indication of any concern about sample sizes, after the analyst drills down to finer areas of the dataset. While there are other variables available, such as specialty, and other variables that can be merged in, such as income levels, all of which may explain at least a portion of the gender wage gap, no attempt has been made to incorporate other factors. We are stuck with a bivariate analysis that does not control for any other factors.

Last but not least, the analyst draws a bold conclusion from the overly simplistic analysis. Here, we are told: "If you want that big money, you can't be a woman." (link)

P.S. The Stat News article reports that the researchers at Doximity claimed to have controlled for "hours worked and other factors that might explain the wage gap." However, Doximity's own report contains no language explaining how those controls were implemented.


Round things, square things

The following chart traces the flow of funds into AI (artificial intelligence) startups.

Financial-times-graphic-recent-funding-for-ai-machine-learning-2014-machine-learning-post

I found it on this webpage and it is attributed to Financial Times.

Here, I apply the self-sufficiency test to show that the semicircles are playing no role in the visualization. When the numbers are removed, readers cannot understand the data at all. So the visual elements are toothless.

Ft_ai_funding2

Actually, it's worse. The data got encoded in the diameters of the semicircles, but not the areas. Thus, anyone courageously computing the ratios of the areas finds their effort frustrated.
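
The size of the distortion is easy to work out: when a value drives the diameter, the area the eye perceives grows with the square of the value. A sketch with made-up funding amounts:

```python
import math

def semicircle_area_from_diameter(value, scale=1.0):
    """Area when the VALUE is (incorrectly) mapped to the diameter."""
    radius = value * scale / 2
    return math.pi * radius**2 / 2

def diameter_for_area(value, scale=1.0):
    """Diameter that makes the AREA proportional to the value."""
    return math.sqrt(value) * scale

a, b = 100, 400  # b's funding is 4x a's (made-up numbers)

# Diameter encoding: the perceived (area) ratio balloons to 16x.
ratio = semicircle_area_from_diameter(b) / semicircle_area_from_diameter(a)
print(round(ratio, 6))  # 16.0

# Area encoding: diameters in the ratio sqrt(4) = 2 keep areas at 4x.
print(diameter_for_area(b) / diameter_for_area(a))  # 2.0
```

This is why the usual convention is to scale the radius (or diameter) by the square root of the value, so that area, not width, carries the data.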

Here is a different view that preserves the layout:

Redo_ft_ai_funding

The two data series in the original chart show the current round of funding and the total funds raised. In the junkcharts version, I decided to compare the new funds versus the previously-raised funds so that the total area represents the total funds raised.


Counting the Olympic medals

Reader Conor H. sent in this daily medals table at the NBC website:

Nbc_olympicmedals

He commented that the bars are not quite the right lengths. So even though China and Russia both won five total medals that day, the bar for China is slightly shorter.

One issue with the stacked bar chart is that the reader's attention is drawn to the components rather than the whole. Yet in this case, the most important statistic is the total number of medals.

Here is a different view of the data:

Redo_olympicmedalsdaily


A Tufte fix for austerity

Trish, who attended one of my recent data visualization workshops, submitted a link to the Illinois Atlas of Austerity.

Atlas_il_austerity_clients

Shown on the right is one of the charts included in the presentation.

This is an example of a chart that fails my self-sufficiency test.

There is no chance that readers are getting any information out of the graphical elements (the figurines of 100 people each).

Everyone who tries to learn something from this chart will be reading the data labels directly.

The entire dataset is printed on the chart itself. If you cover up the data, the chart becomes unreadable!

***

Here is a simple fix that resettles the figurines onto a bar chart:

Redo_atlas_il_clients_1

Tufte would not be amused by this composition. The figurines are purely decorative.

This version is more likely to delight Tufte:

Redo_atlas_il_clients_2

It is the edges of the bars in the bar chart that make all the difference.

***

Aside from the visual problems, there is also a data issue: they should have controlled for the sizes of the different programs.


Scorched by the heat in Arizona

Reader Jeffrey S. saw this graphic inside a Dec 2 tweet from the National Weather Service (NWS) in Phoenix, Arizona.

Nwsphoenix_bars

In a Trifecta checkup (link), I'd classify this as Type QV.

The problems with the visual design are numerous and legendary. The column chart where the heights of the columns are not proportional to the data. The unnecessary 3D effect. The lack of self-sufficiency (link). The distracting gridlines. The confusion of year labels that do not increment from left to right.

The more hidden but more serious issue with this chart is the framing of the question. The main message of the original chart is that the last two years have been the hottest two years in a long time. But it is difficult for readers to know if the differences of less than one degree from the first to the last column are meaningful since we are not shown the variability of the time series.

The green line asserts that 1981 to 2010 represents the "normal". It is unclear why that period is normal and the years 2011-2015 are abnormal. Maybe they are using the word normal in a purely technical sense to mean "average." If so, it is better to just say average.
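
If "normal" here indeed just means the 1981-2010 mean, the computation is a plain average over the base period. A sketch with made-up Phoenix temperatures (the NWS data is not reproduced):

```python
import random

# Made-up December average temperatures, one per year, near 74F.
random.seed(0)
temps = {year: 74 + random.uniform(-1.5, 1.5) for year in range(1981, 2016)}

# The green "normal" line: the average over the 1981-2010 base period.
base = [temps[y] for y in range(1981, 2011)]
normal = sum(base) / len(base)

# Later years are then judged as departures from that average.
anomalies = {y: round(temps[y] - normal, 2) for y in range(2011, 2016)}
print(round(normal, 2))
```

Nothing about the choice of base period makes 2011-2015 "abnormal"; it is simply the reference window over which the average is taken.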

***
For this data, I prefer to see the entire time series from 1981 to 2015, which allows readers to judge the variability as well as the trending of the average temperatures. In the following chart, I also label the five years with the highest average temperatures.

Redo_nws_phoenix_avgtemp_2