Aligning the visual and the data

The Washington Post reported a surge in donations to the Democrats after the death of Justice Ruth Ginsberg (link). A secondary effect, perhaps unexpected, was that donors decided to spread the money around; the proportion of donors who gave to six or more candidates jumped to 65%, where normally it is at 5%.


The text tells us what to look for, and the axis labels are commendably restrained. The color scheme is also intuitive.

There is something frustrating about this chart, though. It's that the spike is shown upside down. The level that the arrow points at is 45%, which is the total of the blue columns. The visual suggests the proportion of multiple beneficiaries (2 or more) should be 55%. There is a divergence between what the visual is saying and what the data are saying. Whichever number is correct, the required proportion is the inverse of the level shown on the percentage axis!


This is the same chart flipped over.


Now, the number we need can be read off the vertical axis.

I also moved the color legend to the right side so that the entries can be printed vertically, in the same direction as the data. This is one of the unspoken rules of data visualization I featured in my feature for


In the Trifecta Checkup (link), the issue is with the green arrow between the D corner and the V corner. The data and the visual are not in sync. 


Bubble charts, ratios and proportionality

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)


The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.


In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.


Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).


The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design perogative and does not encode data.  

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.


The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.


Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.


Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.

Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.


People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.



I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.


See previous discussion of WSJ Graphics.


The ebb and flow of an effective dataviz showing the rise and fall of GE

Wsj_ebbflowGE_800A WSJ chart caught my eye the other day – I spotted someone looking at it in a coffee shop, and immediately got a hold of a copy. The chart plots the ebb and flow of GE’s revenues from the 1980s to the present.

What grabbed my attention? The less-used chart form, and the appealing but not too gaudy color scheme.

The chart presents a highly digestible view of the structure of GE’s revenues. We learn about GE’s major divisions, as well as how certain segments split from or merged with others over time. Major acquisitions and divestitures are also depicted; if these events are the main focus, the designer should find ways to make these moments stand out more.

An interesting design decision concerns the sequence of the divisions. One possible order is by increasing or decreasing importance, typically indicated by proportional revenues. This is complicated by the changing nature of the business over the decades. So financial services went from nothing to the largest division by far to almost disappearing.

The sequencing need not be data-driven; it can be design-constrained. The merging and splitting of business units are conveyed via linking arrows. Longer arrows are unsightly, and meshes of arrows are confusing.

On this chart, the long arrow pointing from the orange to the gray around 2004 feels out of place. What if the financial services block is moved to the right of the consumer block? That will significantly shorten the long arrow. It won’t create other entanglements as the media block is completely disjoint and there are no other arrows tying financial services to another division.



To improve readability, the bars are spaced out horizontally. The addition of whitespace distorts the proportionality. So, in 2001, the annotation states that financial services (orange) accounted for “about half of the revenues,” which is directly contradicted by the visual perception – readers find the orange bar to be clearly shorter than the total length of the other bars. This is a serious deficiency of the chart form but this chart conveys the "ebb and flow" very well.

Making people jump over hoops

Take a look at the following chart, and guess what message the designer wants to convey:


This chart accompanied an article in the Wall Street Journal about Wells Fargo losing brokers due to the fake account scandal, and using bonuses to lure them back. Like you, my first response to the chart was that little has changed from 2015 to 2017.

It is a bit mysterious the intention of the whitespace inserted to split the four columns into two pairs. It's not obvious that UBS and Merrill are different from Wells Fargo and Morgan Stanley. This device might have been used to overcome the difficulty of reading four columns side by side.

The additional challenge of this dataset is the outlier values for UBS, which elongates the range of the vertical axis, squeezing together the values of the other three banks.

In this first alternative version, I play around with irregular gridlines.


Grouped column charts are not great at conveying changes over time, as they cause our eyes to literally jump over hoops. In the second version, I use a bumps chart to compactly highlight the trends. I also zoom in on the quarterly growth rates.


The rounded interpolation removes the sharp angles from the typical bumps chart (aka slopegraph) but it does add patterns that might not be there. This type of interpolation however respects the values at the "knots" (here, the quarterly values) while a smoother may move those points. On balance, I like this treatment.


PS. [6/2/2017] Given the commentary below, I am including the straight version of the chart, so you can compare. The straight-line version is more precise. One aspect of this chart form I dislike is the sharp angles. When there are more lines, it gets very entangled.


Lines that delight, lines that blight

This WSJ graphic caught my eye. The accompanying article is here.


The article (judging from the sub-header) makes two separate points, one about the total amount of money raised in IPOs in a year, and the change in market value of those newly-public companies one year from the IPO date.

The first metric is shown by the size of the bubbles while the second metric is displayed as distances from the horizontal axis. (The second metric is further embedded, in a simplified, binary manner, in the colors of the bubbles.)

The designer has decided that the second metric - performance after IPO - to be more important. Therefore, it is much easier for readers to know how each annual cohort of IPOs has performed. The use of color to map to the second metric (and not the first) also helps to emphasize the second metric.

There are details on this chart that I admire. The general tidiness of it. The restraint on the gridlines, especially along the horizontal ones. The spatial balance. The annotation.

And ah, turning those bubbles into lollipops. Yummy! Those dotted lines allow readers to find the center of each bubble, which is where the values of the second metrics lie. Frequently, these bubble charts are presented without those guiding lines, and it is often hard to find the circles' anchors.

That leaves one inexplicable decision - why did they place two vertical gridlines in the middle of two arbitrary years?

Story within story, bar within bar

This Wall Street Journal offering caught my eye.


It's the unusual way of displaying proportions.

Your first impression is to interpret the graphic as a bar chart. But it really is a bar within a bar: the crux of the matter - gender balance - is embedded in individual bars.

Instead of pie charts or stacked bar charts, we see  stacked columns within each bar.

I see what the designer is attempting to accomplish. The first message is the sharp decline in gender equality at higher job titles. The next message is the sharp drop in the frequency of higher job titles.

This chart is a variant of the "Marimekko" chart (beloved by management consultants), also called the mosaic chart. The only difference being how the distribution of jobs in the work force is coded.

The Marimekko is easier to understand:


A key advantage of this version is to be found in the thin columns.

Here is another way to visualize this data, drawing attention to the gender gap.


In the other versions, the reader must do subtractions to figure out the size of the gaps.

Here are the cool graphics from the election

There were some very nice graphics work published during the last few days of the U.S. presidential election. Let me tell you why I like the following four charts.

FiveThirtyEight's snake chart


This chart definitely hits the Trifecta. It is narrowly focused on the pivotal questions of election night: which candidate is leading? if current projections hold, which candidate would win? how is the margin of victory?

The chart is symmetric so that the two sides have equal length. One can therefore immediately tell which side is in the lead by looking at the middle. With a little more effort, one can also read from the chart which side has more electoral votes based only on the called states: this would be by comparing the white parts of each snake. (This is made difficult by the top-bottom mirroring. That is an unfortunate design decision - I'd would have preferred to not have the top-bottom reversal.)

The length of each segment maps to the number of electoral votes for the particular state, and the shade of colors reflect the size of the advantage.

In a great illustration of less is more, by aggregating all called states into a single white segment, and not presenting the individual results, the 538 team has delivered a phenomenal chart that is refreshing, informative, and functional.

 Compare with a more typical map:


 New York Times's snake chart

Snakes must be the season's gourmet meat because the New York Times also got inspired by those reptiles by delivering a set of snake charts (link). Here's one illustrating how different demographic segments picked winners in the last four elections.



They also made a judicious decision by highlighting the key facts and hiding the secondary ones. Each line connects four points of data but only the beginning and end of each line are labeled, inviting readers to first and foremost compare what happened in 2004 with what happened in 2016. The middle two elections were Obama wins.

This particular chart may prove significant for decades to come. It illustrates that the two parties may be arriving at a cross-over point. The Democrats are driving the lower income classes out of their party while the upper income classes are jumping over to blue.

While the chart's main purpose is to display the changes within each income segment, it does allow readers to address a secondary question. By focusing only on the 2004 endpoints, one can see the almost linear relationship between support and income level. Then focusing on the 2016 endpoints, one can also see an almost linear relationship but this is much steeper, meaning the spread is much narrower compared to the situation in 2004. I don't think this means income matters a lot less - I just think this may be the first step in an ongoing demographic shift.

This chart is both fun and easy to read, packing quite a bit of information into a small space.


Washington Post's Nation of Peaks

The Post prints a map that shows, by county, where the votes were and how the two Parties built their support. (Link to original)


The height represents the number of voters and the width represents the margin of victory. Landslide victories are shown with bolded triangles. In the online version, they chose to turn the map sideways.

I particularly like the narratives about specific places.

This is an entertaining visual that draws you in to explore.


Andrew Gelman's Insight

If you want quantitative insights, it's a good idea to check out Andrew Gelman's blog.

This example is a plain statistical graphic but it says something important:


There is a lot of noise about how the polls were all wrong, the entire polling industry will die, etc.

This chart shows that the polls were reasonably accurate about Trump's vote share in most Democratic states. In the Republican states, these polls consistently under-estimated Trump's advantage. You see the line of red states starting to bend away from the diagonal.

If the total error is about 2%, as stated in the caption of the chart, then the average error in the red states must have been about 4%.

This basic chart advances our understanding of what happened on election night, and why the result was considered a "shock."



An example of focusing the chart on a message

Via Jimmy Atkinson on Twitter, I am alerted to this chart from the Wall Street Journal.


The title of the article is "Fiscal Constraints Await the Next President." The key message is that "the next president looks to inherit a particularly dismal set of fiscal circumstances." Josh Zumbrun, who tipped Jimmy about this chart on Twitter, said that it is worth spending time on.

I like the concept of the chart, which juxtaposes the economic condition that faced each president at inauguration, and how his performance measured against expectation, as represented by CBO predictions.

The top portion of the graphic did require significant time to digest:


A glance at the sidebar informs me that there are two scenarios being depicted, the CBO projections and the actual deficit-to-GDP ratios. Then I got confused on several fronts.

One can of course blame the reader (me) for mis-reading the chart but I think dataviz faces a "the reader is always right" situation -- although there can be multiple types of readers for a given graphic so maybe it should say "the readers are always right."

I kept lapsing into thinking that the bold lines (in red and blue) are actual values while the gray line/area represents the predictions. That's because in most financial charts, the actual numbers are in the foreground and the predictions act as background reference materials. But in this rendering, it's the opposite.

For a while, a battle was raging in my head. There are a few clues that the bold red/blue lines cannot represent actual values. For one thing, I don't recall Reagan as a surplus miracle worker. Also, some of the time periods overlap, and one assumes that the CBO issued one projection only at a given time. The Obama line also confused me as the headline led me to expect an ugly deficit but the blue line is rather shallow.

Then, I got even more confused by the units on the vertical axis. According to the sidebar, the metric is deficit-to-GDP ratio. The majority of the line live in the negative territory. Does the negative of the negative imply positive? Could the sharp upward turn of the Reagan line indicate massive deficit spending? Or maybe the axis should be relabelled surplus-to-GDP ratio?


As I proceeded to re-create this graphic, I noticed that some of the tick marks are misaligned. There are various inconsistencies related to the start of each projection, the duration of the projection, the matching between the boxes and the lines, etc. So the data in my version is just roughly accurate.

To me, this data provide a primary reference to how presidents perform on the surplus/deficit compared to expectations as established by the CBO projections.


I decided to only plot the actual surplus/deficit ratios for the duration of each president's tenure. The start of each projection line is the year in which the projection is made (as per the original). We can see the huge gap in every case. Either the CBO analysts are very bad at projections, or the presidents didn't do what they promised during the elections.




Denver outspends everyone on this

Someone at the Wall Street Journal noticed that Denver's transit agency has outspent other top transit agencies, after accounting for number of rides -- and by a huge margin.

But the accompanying graphic conspires against the journalist.


For one thing, Denver is at the bottom of the page. Denver's two bars do not stand out in any way. New York's transit system dwarfs everyone else in both number of rides and total capital expenses and funding. And the division into local, state, and federal sources of funds is on the page, absorbing readers' mindspace for unknown reasons.

But Denver is an outlier, as can be seen here: