This chart advises webpages to add more words

A reader sent me the following chart. In addition to the graphical glitch, I was asked about the study's methodology.


I was able to trace the study back to this page. The study itself uses a line chart, rather than the bar chart above, though its axis also does not start at zero. The line shows that web pages ranked higher by Google on the first page tend to have more words, i.e. longer content may help with Google ranking.


On the bar chart, Position 1 is more than 6 times as big as Position 10, if one compares the bar areas. But it's really only 20% larger in the data.
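A quick way to quantify this distortion is Tufte's "lie factor": the size of the effect shown in the graphic divided by the size of the effect in the data. Here's a minimal sketch using the approximate numbers above (both are rough readings off the chart, not exact figures from the study):

```python
# Tufte's lie factor: (effect shown in graphic) / (effect in data).
# Rough readings from the chart: the Position 1 bar covers about 6 times
# the area of the Position 10 bar, but the underlying word count is only
# about 20% larger.
shown_ratio = 6.0   # ratio of bar areas, Position 1 vs Position 10
data_ratio = 1.2    # ratio of word counts: 20% larger

lie_factor = (shown_ratio - 1) / (data_ratio - 1)
print(round(lie_factor, 1))  # about 25: a roughly 25-fold exaggeration
```

A lie factor of 1 means the graphic is faithful to the data; anything much above (or below) 1 signals distortion.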

In this case, even the line chart is misleading. If we extend the Google Position to 20, the line would quickly dip below the horizontal axis if the same trend applies.
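To see why, extend a straight line through values eyeballed from the chart (the exact word counts are in the study; these are rough estimates for illustration):

```python
# Eyeballed values: about 1950 words at Position 1, about 1550 words at
# Position 10, so the line drops roughly 44 words per position.
slope = (1550 - 1950) / (10 - 1)

# Extrapolate the same trend to Position 20.
value_at_20 = 1950 + slope * (20 - 1)   # about 1106 words

# The chart's truncated axis starts around 1400 (hypothetical), so the
# extended line falls off the bottom of the chart well before Position 20.
axis_min = 1400
print(value_at_20 < axis_min)  # True
```

The line "dips below the axis" only because the axis doesn't start at zero; that is the trap of truncated axes.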

The line chart also includes too much grid, one of Tufte's favorite complaints. The Google position is an integer, and yet the chart's gridlines imply that a rank of 0.5 is possible.

Any chart of this data should supply information about the variance around these average word counts. I would like to see a side-by-side box plot, for example.

Another piece of context is the word counts for results on the second or third pages of Google results. Where are the short pages?


Turning to methodology, we learn that the research team analyzed 1 million pages of Google search results, and they also "removed outliers from our data (pages that contained fewer than 51 words and more than 9999 words)."

When you read a line like this, you have to ask some questions:

How do they define "outlier"? Why do they choose 51 and 9,999 as the cut-offs?

What proportion of the data was removed at either end of the distribution?

If these proportions are small, then the outliers are not going to affect that average word count by much, and thus there is no point to their removal. If they are large, we'd like to see what impact removing them might have.

In any case, the median is a better number to use here; better still, show us the distribution, not just an average.
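To illustrate why the questions above matter, here is a sketch with made-up word counts (not the study's data): the mean is sensitive to the cut-offs, while the median barely moves.

```python
import numpy as np

# Made-up page word counts, with one extreme value at each end.
words = np.array([40, 450, 500, 600, 700, 800, 900, 1200, 4000, 12000])

# Apply the study's stated cut-offs: keep pages with 51 to 9,999 words.
trimmed = words[(words >= 51) & (words <= 9999)]

print(words.mean(), trimmed.mean())          # 2119.0 vs 1143.75
print(np.median(words), np.median(trimmed))  # 750.0 vs 750.0
```

Removing two points shifts the mean by nearly half, yet leaves the median untouched, which is exactly why the trimming decision deserves scrutiny when averages are reported.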

It could well be true that Google's algorithm favors longer content, but we need to see more of the data to judge.



Big Macs in Switzerland are amazing, according to my friend

Note for those in or near Zurich: I'm giving a Keynote Speech tomorrow morning at the Swiss Statistics Meeting (link). Here is the abstract:

The best and the worst of data visualization share something in common: these graphics provoke emotions. In this talk, I connect the emotional response of readers of data graphics to the design choices made by their creators. Using a plethora of examples, collected over a dozen years of writing online dataviz criticism, I discuss how some design choices generate negative emotions such as confusion and disbelief while other choices elicit positive feelings including pleasure and eureka. Important design choices include how much data to show; which data to highlight, hide or smudge; what research question to address; whether to introduce imagery, or playfulness; and so on. Examples extend from graphics in print, to online interactive graphics, to visual experiences in society.


The Big Mac index seems to never want to go away. Here is the latest graphic from the Economist, saying what it says:


The index never made much sense to me. I'm in Switzerland, and everything here is expensive. My friend, a U.S. transplant, seems to have adopted McDonald's as his main eating-out venue. Online reviews indicate that the quality of the burger served in Switzerland is much better than that of the same item in the States. So part of the price differential can be explained by quality. The index also confounds several other issues, such as local inflation and exchange rates.

Now, on to the data visualization, which is primarily an exercise in rolling one's eyeballs. In order to understand the red and blue line segments, our eyes have to hop over the price bubbles to the top of the page. Then, in order to understand the vertical axis labels, unconventionally placed on the right side, our eyes have to zoom over to the left of the page, and search for the line below the header of the graph. Next, if we want to know about a particular country, our eyes must turn sideways and scan from bottom up.

Here is a different take on the same data:


I transformed the data, as I don't find it compelling to learn that Russian Big Macs cost 60% less than American Big Macs. Instead, on my chart, the reader learns that the price paid for one U.S. Big Mac buys almost two and a half Big Macs in Russia.
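The transformation is just a ratio flipped around. A sketch with illustrative prices (the precise figures are in the Economist's dataset):

```python
# Illustrative Big Mac prices in USD (not the exact Economist figures).
us_price = 5.51
russia_price = 2.29   # Russian price converted at market exchange rates

# The Economist's framing: Russian Big Macs cost roughly 58% less.
discount = 1 - russia_price / us_price

# My framing: the price of one U.S. Big Mac buys about 2.4 Big Macs
# in Russia - a quantity, which is easier to feel than a percentage.
big_macs_per_us_price = us_price / russia_price
print(round(discount, 2), round(big_macs_per_us_price, 1))  # 0.58 2.4
```

Same data, different question: the second framing puts the reader in the position of a shopper rather than a currency analyst.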

The arrows pointing left indicate that, in most countries, the values of their currencies declined relative to the dollar from 2017 to 2018 (at least from the Big Mac Index point of view). The only exception is Turkey: in 2018, the price of one U.S. Big Mac bought more Turkish Big Macs than it did in 2017.

The decimal differences are immaterial so I have grouped the countries by half Big Macs.

This example demonstrates yet again that, to make a good data visualization, one has to pose an interesting question, make appropriate transformations of the data, and then choose the right visual form. I describe this framework as the Trifecta Checkup - a guide to it is here.

(P.S. I noticed that Bitly unilaterally decided to deactivate my customized Bitly link, configured years and years ago, when it switched design (?). So I had to re-create the custom link. I have never grasped why "unreliability" is a feature of the offerings of most tech companies.)

Some Tufte basics brought to you by your favorite birds

Someone sent me this via Twitter, found on the Data is Beautiful reddit:


The chart does not deliver on its promise: It's tough to know which birds like which seeds.

The original chart was also provided in the reddit:


I can see why someone would want to remake this visualization.

Let's just apply some Tufte fixes to it, and see what happens.

Our starting point is this:


First, consider the colors. Think for a second: order the colors of the cells by which ones stand out most. For me, the order is white > yellow > red > green.

That is a problem because, for this data, you'd like green > yellow > red > white. (By the way, it's not explained what white means. I'm assuming it indicates the least preferred seeds - so unpreferred that the seed type isn't even relevant to that bird.)

Compare the above with this version that uses a one-dimensional sequential color scale:


The white color still stands out more than necessary. Fix this using a gray color.


What else is grabbing your attention when it shouldn't? It's those gridlines. Push them into the background using white-out.


The gridlines are also too thick. Here's a slimmed-down look:


The visual is much improved.

But one more thing. Let's re-order the columns (seeds). The most popular seeds are shown on the left, and the least on the right in this final revision.


Look for your favorite bird. Then find out which are its most preferred seeds.
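For those who want to replicate this sequence of fixes, here is a minimal matplotlib sketch with made-up birds, seeds, and preference scores (the actual dataset is on the reddit thread): a one-dimensional sequential color scale, thin white gridlines pushed into the background, and columns reordered by seed popularity.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # render off-screen
import matplotlib.pyplot as plt

# Made-up preference scores (3 = favorite, 0 = not preferred).
birds = ["Cardinal", "Chickadee", "Goldfinch", "Junco"]
seeds = ["Sunflower", "Safflower", "Nyjer", "Millet"]
prefs = np.array([[3, 2, 0, 1],
                  [3, 1, 2, 0],
                  [1, 0, 3, 0],
                  [2, 0, 1, 3]])

# Final fix: order the columns so the most popular seeds come first.
order = np.argsort(-prefs.sum(axis=0))
prefs = prefs[:, order]
seeds = [seeds[i] for i in order]

fig, ax = plt.subplots()
# Sequential scale: darker = more preferred, and the zero cells
# recede as the lightest shade instead of standing out.
ax.imshow(prefs, cmap="Greens")
ax.set_xticks(range(len(seeds)), labels=seeds)
ax.set_yticks(range(len(birds)), labels=birds)

# Thin white gridlines between cells, pushed into the background.
ax.set_xticks(np.arange(-0.5, len(seeds)), minor=True)
ax.set_yticks(np.arange(-0.5, len(birds)), minor=True)
ax.grid(which="minor", color="white", linewidth=0.8)
ax.tick_params(which="minor", length=0)

fig.savefig("bird_seeds.png")
```

Every one of the Tufte fixes above is one or two lines; none requires fighting the software.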

Here is an animated gif to see the transformation. (Depending on your browser, you may have to click on it to view it.)



PS. [7/23/18] Fixed the 5th and 6th images and also in the animated gif. The row labels were scrambled in the original version.


Lines, gridlines, reference lines, regression lines, the works

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.


This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare unfavorably to the gender ratio in the neighborhoods, as we saw in the previous post).


Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. The labels inform the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when a chart contains gridlines, we expect the labels to sit right at each gridline, either on top of or just below the line. Here, the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines are drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people worked on these labels, as there exist two patterns: the first is "X% Leaders are Women," and the second is "Y% Female." (Actually, the top and bottom labels are also inconsistent, one using "women" and the other "female.")

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should be 25%, 35%, ..., 65%, 75%, in essence the same scale as the vertical axis.

Here is the same chart with improved axis labels:


Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proportion of females as the rest of the staff. In the following plot (right side), I added the 45-degree line. Note that it is positioned awkwardly on top of the grid system. The culprit is the incompatible gridlines.


The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.



Now that we've dealt with the purely visual issues, let me get to a statistical issue that's been troubling me. It's about that yellow line, which is supposed to be a regression line running through the points.

Does it appear biased downwards to you? It just seems that there are too many dots above and not enough below. The distance of the furthest points above also appears to be larger than that of the distant points below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of leadership staff. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)
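This is a standard property of ordinary least squares: the fitted line always passes through the point of averages (x̄, ȳ). A quick check with simulated data (not the newsroom dataset):

```python
import numpy as np

# Simulated data loosely mimicking the chart: staff % female (x) and
# leadership % female (y) for 50 hypothetical newsrooms.
rng = np.random.default_rng(42)
x = rng.uniform(25, 75, 50)
y = 0.9 * x + rng.normal(0, 8, 50)

# Fit the least-squares regression line.
slope, intercept = np.polyfit(x, y, 1)

# The fitted line evaluated at mean(x) recovers mean(y) exactly
# (up to floating-point error), no matter how the data scatter.
print(abs((slope * x.mean() + intercept) - y.mean()) < 1e-8)  # True
```

So a "regression" line that visibly misses the average point, as the yellow line does in the gender chart, cannot be the least-squares fit.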

Note the corresponding chart for racial diversity has it right. The yellow line does pass through the average point here:



In practice, how do problems seep into dataviz projects? You don't get to the last chart via a clean, streamlined process; you pass through a cycle of explore-retrench-synthesize, frequently bouncing ideas among several people, and it's challenging to keep everything consistent.

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.



A chart Hans Rosling would have loved

I came across this chart from the OurWorldinData website, and this one would make the late Hans Rosling very happy.


If you ever attended one of Professor Rosling's talks, you'd know he was bitter that the amazing gains in public health worldwide (particularly in less developed nations) during the last few decades have gone little noticed. This chart makes the gains clear: note especially the dramatic plunge in extreme poverty, the rise in vaccinations, the drop in child mortality, and the improvement in education and literacy, mostly achieved in the last few decades.

This set of charts has a simple but powerful message. It's the simplicity of execution that really helps readers get that powerful message.

The text labels on the left and right side of the charts are just perfect.


Little things that irk me:

I am not convinced by the liberal use of colors - I would make the "other" category of each chart a consistent gray, for six colors total. (Granted, having different colors does make the chart more interesting to look at.)

Even though the gridlines are muted, I still find them excessive.

There is a coding bug in the Vaccination chart right around 1960.


Fifty-nine intersections supporting forty dots of data

My friend Ray V. asked how this chart can be improved:


Let's try to read this chart. The Economist is always the best at writing headlines, and this one is simple and to the point: the rich get richer. This is about inequality but not just inequality - the growth in inequality over time.

Each country has four dots, divided into two pairs. From the legend, we learn that the line represents the gap between the rich and the poor. But what is rich and what is poor? Looking at the sub-header, we learn that the population is divided by domicile, and the per-capita GDPs of the poorest and richest regions are drawn. This is an indirect metric, and may or may not be a good one, depending on how many regions a country is divided into, the dispersion of incomes within each region, the distribution of population between regions, and so on.

Now, looking at the axis labels, it's pretty clear that the data depicted are not in dollars (or any currency), despite the reference to GDP in the sub-header. The numbers represent indices, relative to the national average GDP per head. For many of the countries, the poorest region produces about half the per-capita GDP of the richest region.

Back to the original question. A growing inequality would be represented by a longer line below a shorter line within each country. That is true in some of these countries. The exceptions are Sweden, Japan, and South Korea.

It doesn't jump out that the key task requires comparing the lengths of the two lines. Another issue is the outdated convention of breaking up a line (Britain) when it is of extreme length - particularly unwise given that the length of the line encodes the key metric of the chart.

Further, the chart has a low data-ink ratio, a la Tufte. The gridlines, reference lines, and data lines weave together in a complex pattern, creating 59 intersections in a chart that contains only 36 numbers.


I decided to compute a simpler metric - the ratio of rich to poor. For example, in the UK, the richest area produced about 20 times as much GDP per capita as the poorest one in 2015. That is easier to understand than an index relative to the average region.
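The computation is trivial, which is part of its appeal. A sketch with index values eyeballed from the chart (not exact figures):

```python
# Index values relative to the national average GDP per head,
# eyeballed from the chart for the UK in 2015 (not exact figures).
richest_index = 6.0    # richest region, far above the national average
poorest_index = 0.3    # poorest region, well below the average

# The simpler metric: how many times richer is the richest region?
ratio = richest_index / poorest_index
print(round(ratio))  # 20
```

Dividing one index by the other cancels out the national average, leaving a direct rich-to-poor comparison that needs no footnote to interpret.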

I had fun making the following chart, although many standard forms like the Bumps chart (i.e. slopegraph) or paired columns and so on also work.


This chart is influenced by Ed Tufte, who spent a good number of pages in his first book advocating stripping even the standard column chart to its bare essence. The chart also acknowledges the power of design to draw attention.



PS. Sorry I counted incorrectly. The chart has 36 dots not 40. 

Shocker: ease of use requires expanding, not restricting, choices

Recently, I noted how we have to learn to hate defaults in data visualization software. I was reminded again of this point when reviewing this submission from long-time reader & contributor Chris P.


The chart is included in this Medium article, which credits Mott Capital Management as the source.

Look at the axis labels on the right side. They have the hallmarks of software defaults. The software designer decided that the axis labels will be formatted in exactly the same way as the data in that column: this means $XXX.XXB, with two decimal places. The same formatting rule is in place for the data labels, shown in boxes.

Why put tick marks at the odd intervals, 37.50, 62.50, 87.50, ... ? What's wrong with 40, 60, 80, 100, ...? It comes down to machine thinking versus human thinking.
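In matplotlib, for instance, overriding this kind of default takes two lines: a locator for round tick intervals and a formatter that drops the spurious decimals. (This is a generic sketch with made-up numbers, not YCharts' actual settings.)

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FuncFormatter

# Made-up revenue series in $ billions; the point is the axis, not the data.
fig, ax = plt.subplots()
ax.plot([2013, 2014, 2015, 2016, 2017], [35, 55, 70, 90, 110])

# Human thinking: ticks at round intervals (40, 60, 80, ...) and
# labels without pointless decimals ($40B, not $40.00B).
ax.yaxis.set_major_locator(MultipleLocator(20))
ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f"${v:.0f}B"))

fig.canvas.draw()
print([t.get_text() for t in ax.get_yticklabels()])
```

The machine's even spacing of the data range produces 37.50, 62.50, 87.50; the human override produces round numbers at no cost.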

This software places the most recent values into data labels, formatted as boxes that point to the positions of those values on the axis. Evidently, it doesn't have a plan for overcrowding. At the bottom of the axis, we see four labels for six lines. The blue, pink and orange labels point to the wrong places on the axis.

Worse, it's unclear what those "most recent" values represent. I have added gridlines for each year on the excerpt shown right. The lines extend to 2017, which isn't even half over.

Now, consider the legend. Which version do you prefer?


Most likely, the original dataset has columns named " Revenue (TTM)", "Dillard's Revenue (TTM)", etc. so the software just picks those up and prints them in the legend text.


The chart is an output from YCharts, which I learned is a Bloomberg terminal competitor. It probably uses one of the available Web graphing packages out there. These packages typically emphasize ease of use, through automating the process of data visualization. Ease of use is defined as rigid defaults that someone has determined to be the optimal settings. Users then discover that there is no getting around those settings; in some cases, a coding interface is available, which defeats the goal of user-friendliness.

The problem lies in defining what ease of use means. Ease of use should require expanding, not restricting, choices. Setting rigid defaults restricts choices. In addition to providing good defaults, the software designer should make it simple for users to make their own choices. Ideally, each of the elements (data labels, gridlines, tick marks, etc.) can be independently removed, shifted, expanded, reduced, re-colored, edited, etc. from their original settings.

It's your fault when you use defaults

The following chart showed up on my Twitter feed last week. It's a cautionary tale for using software defaults.


At first glance, the stacking of years in a bar chart makes little sense. This is particularly so when there doesn't appear to be any interesting annual trend: the four segments seem to have roughly equal length almost everywhere.

This designer might be suffering from what I have called "loss aversion" (link). Loss aversion in data visualization is the fear of losing your data, which causes people to cling to every little bit of data they have.

Several problems of the chart come from software defaults. The bars are ordered alphabetically, making it difficult to discern a trend. The horizontal axis labels are given in single dollars and single units, even though the designer intends millions, as indicated in the chart titles.

The one horrifying feature of this chart is the 3D effect. The third dimension contains no information at all. In fact, it destroys information, as readers who use the vertical gridlines to estimate the lengths of the bars will be sadly misled. As shown below, readers must draw imaginary lines to figure out the horizontal values.


The Question of this chart is the distribution of book sales (revenues and units) across different genres. When the designer chose to stack the bars (i.e. sum the yearly data), he or she decided that the details of specific years are not as important as the totals - the right conclusion, since the bar segments have similar values within each genre.

So let's pursue this direction and average the data, plotting average yearly sales.
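With made-up numbers (the real ones are in the original chart), the averaging-and-sorting step looks like this:

```python
import numpy as np

# Made-up yearly revenues ($ millions) for four genres over four years.
sales = {
    "Adult Fiction":    [520, 545, 530, 510],
    "Adult Nonfiction": [560, 575, 590, 610],
    "K-12 Education":   [480, 470, 495, 505],
    "Higher Education": [390, 400, 385, 375],
}

# Collapse the four stacked segments into one average per genre, and
# sort by value rather than alphabetically so a pattern can emerge.
avg = {genre: float(np.mean(v)) for genre, v in sales.items()}
for genre, value in sorted(avg.items(), key=lambda kv: -kv[1]):
    print(f"{genre}: {value:.1f}")
```

Two small decisions - averaging instead of stacking, sorting by value instead of by name - undo most of the damage the defaults caused.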


This chart shows that there are two major types of genres. In the education world, the unit prices of (text)books are very high while sales are relatively small by units but in aggregate, the dollar revenues are high. In the "adult" world, whether it's fiction or non-fiction, the unit price is low while the number of units is high, which results in similar total dollar revenues as the education genres.


Simple lesson here: learn to hate software defaults.

Much more to do after selecting a chart form


I sketched out this blog post right before the Super Bowl - and was really worked up, as I happened to be flying into Atlanta right after they won (well, according to any of our favorite "prediction engines," the Falcons had a 95%+ chance of winning it all a minute from the end of the 4th quarter!). What I'd give to be in the Super Bowl-winning city the day after the victory!

Maybe next year. I didn't feel like publishing about Super Bowl graphics when the wound was so raw. But now is the moment.

The following chart came from the Orange County Register in the run-up to the Super Bowl. (The bobble-head quarterbacks also came from OCR.) The original article is here.


The choice of a set of dot plots is inspired. The dot plot is one of those under-utilized chart types - for comparing two or three objects along a series of metrics, it has to be one of the most effective charts.

To understand this type of design, readers have to collect three pieces of information: first, recognize the dot symbols - which color or shape represents which object being compared; second, understand the direction of the axis; third, recognize that the distance between the paired dots encodes the size of the difference between the two objects.

The first task is easy enough here as red stands for Atlanta and blue for New England - those being the team colors.

The second task is deceptively simple. It appears that a ranking scale is used for all metrics with the top ("1st") shown on the left side and the bottom ("32nd") shown on the right. Thus, all 32 teams in the NFL are lined up left to right (i.e. best to worst).

Now, focus your attention on the "Interceptions Caught" metric, third row from the bottom. The designer indicated "Fewest" on the left and "Most" on the right. For those who don't know American football, an interception caught is a good defensive play; it means your defensive player grabs a ball thrown by the opposing team (usually by their quarterback), causing a turnover. Therefore, the more interceptions caught, the better the defense is playing.

Glancing back at the chart, you learn that on the "Interceptions Caught" metric, the worst team is shown on the left while the best team is shown on the right. The same reversal happened with "Fumbles Lost" (fewest is best), "Penalties" (fewest is best), and "Points Allowed per Game" (fewest is best). For four of nine metrics, right is best while for the other five, left is best.

The third task is the most complicated. A ranking scale always has the weakness that a gap of one rank does not yield information on how important the gap is. It's a complicated decision to select what type of scale to use in a chart like this, and in this post, I shall ignore this issue, and focus on a visual makeover.


I find the nine arrays of 32 squares, essentially the grid system, much too insistent, elevating information that belongs to the background. So one of the first fixes is to soften the grid system, and the labeling of the axes.

In addition, given the meaningless nature of the rank number (as mentioned above), I removed those numbers and used team logos instead. The locations on the axes are sufficient to convey the relative ranks of the two teams against the field of 32.


Most importantly, the directions of all metrics are now oriented in such a way that moving left is always getting better.
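The reorientation is a small data transformation, sketched below with hypothetical ranks (1 = fewest, 32 = most, as on the original axes):

```python
# With 32 NFL teams, converting a fewest-to-most rank into a
# best-to-worst rank just means flipping it when more is better.
N_TEAMS = 32

def oriented_rank(rank, higher_is_better):
    """Return the rank on a scale where 1 (leftmost) is always best."""
    return N_TEAMS + 1 - rank if higher_is_better else rank

# Interceptions caught: more is better, so a team ranked 30th on the
# fewest-to-most scale is really the 3rd best.
print(oriented_rank(30, higher_is_better=True))    # 3

# Penalties: fewer is better, so the original rank already works.
print(oriented_rank(5, higher_is_better=False))    # 5
```

Once every metric runs in the same direction, the reader's eye no longer needs to check the axis labels row by row.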


While using logos for sports teams is natural, I ended up replacing those, as the size of the dots is such that the logos are illegible anyway.

The above makeover retains the original order of metrics. But to help readers address the key question of this chart - which team is better, the designer should arrange the metrics in a more helpful way. For example, in the following version, the metrics are subdivided into three sections: the ones for which New England is significantly better, the ones for which Atlanta is much better, and the rest for which both teams are competitive with each other.


In the Trifecta checkup (link), I speak of the need to align your visual choices with the question you are trying to address with the chart. This is a nice case study of strengthening that Q-V alignment.







Lines that delight, lines that blight

This WSJ graphic caught my eye. The accompanying article is here.


The article (judging from the sub-header) makes two separate points: one about the total amount of money raised in IPOs each year, and the other about the change in market value of those newly public companies one year after the IPO date.

The first metric is shown by the size of the bubbles while the second metric is displayed as distances from the horizontal axis. (The second metric is further embedded, in a simplified, binary manner, in the colors of the bubbles.)

The designer has decided that the second metric - performance after IPO - is the more important one. Accordingly, it is much easier for readers to see how each annual cohort of IPOs has performed. The use of color to encode the second metric (and not the first) also helps to emphasize it.

There are details on this chart that I admire. The general tidiness of it. The restraint on the gridlines, especially along the horizontal ones. The spatial balance. The annotation.

And ah, turning those bubbles into lollipops. Yummy! Those dotted lines allow readers to find the center of each bubble, which is where the value of the second metric lies. Frequently, these bubble charts are presented without such guiding lines, and it is often hard to find the circles' anchors.

That leaves one inexplicable decision - why did they place two vertical gridlines in the middle of two arbitrary years?