Sorting out the data, and creating the head-shake manual

Yesterday's post attracted a few good comments.

Several readers don't like the data used in the NAEP score chart. The authors labeled the metric "gain in NAEP scale scores" which I interpreted to be "gain scores," a popular way of evaluating educational outcomes. A gain score is the change in test score between (typically consecutive) years. I also interpreted the label "2000-2009" as the average of eight gain scores, in other words, the average year-on-year change in test scores during those 10 years.

After thinking about what reader mankoff wrote, which prompted me to download the raw data, I realized that the designer did not compute gain scores. "2000-2009" really means the difference between the 2009 score and the 2000 score, ignoring all values between those end points. So mankoff is correct in saying that the 2009 number was used in both "2000-2009" and "2009-2015" computations.
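
To make the distinction concrete, here is a minimal sketch of the two calculations, using made-up numbers rather than the actual NAEP scores:

```python
import pandas as pd

# Hypothetical scale scores for one student group; the real NAEP values differ
scores = pd.Series({2000: 226, 2003: 234, 2005: 237, 2007: 239, 2009: 239})

# What I had assumed: average the gain scores between consecutive assessments
avg_gain = scores.diff().dropna().mean()

# What the designer appears to have done: subtract the 2000 score from the
# 2009 score, ignoring every assessment in between
endpoint_change = scores.loc[2009] - scores.loc[2000]

print(avg_gain, endpoint_change)
```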

This treatment immediately raises concerns. Why is a 10-year period compared to a 7-year period?

Andrew prefers to see the raw scores ("scale scores") instead of relative values. Here is the corresponding chart:

Redo_naep2015d

I placed a line at 2009, just to see if there is a reason for that year to be special. (I don't think so.) The advantage of plotting raw scores is that they are easier to interpret. As Andrew said, less abstraction. It also soothes the nerves of those who are startled that the lines for white students appear at the bottom of the chart of gain scores.

I suppose the original designer chose score differentials to highlight the message about change in scores. One can nitpick that the message isn't particularly cogent: comparing 2009 and 2015 in 8th grade math or reading shows negligible change, and yet between those end points, the scores did spike and then drop back to the 2009 level.

One way to mitigate the confusion that mankoff encountered in interpreting my gain-score graphic is to use "informative" labels, rather than "uninformative" labels.

Redo_naep2015e

Instead of saying the vertical axis plots "gain scores" or "change in scores," directly label one end as "no progress" and the other end as "more progress."
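
For those who want to try this, here is a minimal matplotlib sketch of the labeling idea, using made-up gain scores (not the NAEP data):

```python
import matplotlib.pyplot as plt

# Made-up gain scores for two periods (illustration only)
periods = ["2000-2009", "2009-2015"]
gains = [13, 2]

fig, ax = plt.subplots()
ax.plot(periods, gains, marker="o")

# Replace the numeric tick labels with "informative" end labels
ax.set_yticks([0, max(gains)])
ax.set_yticklabels(["no progress", "more progress"])

plt.show()
```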

Everything on this chart is progress over time, and the stalling of progress is their message. This chart requires more upfront learning, after which the message jumps out. The chart of raw scores shown above has almost no perceptual overhead, but the message has to be teased out. I prefer the chart of raw scores in this case.

***

Let me now address another objection, which pops up every time I convert a bar chart to a line chart (a type of Bumps chart, which Tufte followers call a slopegraph). The objection is that the line chart causes readers to see a trend where there isn't one.

So let me make the case one more time.

Start with the original column chart. If you want to know that Hispanic students have seen progress in their 4th grade math scores grind to a halt, you have to shake your head involuntarily in the following manner:

Redo_naep15f

(Notice how the legend interferes with your line of sight.)

By the time you finish interpreting this graphic, you would have shaken your head in all of the following directions:

Redo_naep15g

Now, I am a scavenger. I collect all these lines and rearrange them into four panels of charts. That becomes the chart I showed in yesterday's post. All I have done is to bring to the surface the involuntary motions readers were undertaking. I didn't invent any trends.


Involuntary head-shaking is probably not an intended consequence of data visualization

This chart is in the Sept/Oct edition of Harvard Magazine:

Naep scores - Nov 29 2016 - 4-21 PM

Pretty standard fare. It is even Tufte-esque in its sparing use of axes, labels, and other non-data ink.

Does it bug you how much work you need to do to understand this chart?

Here is the junkchart version:

Redo_2016naep_v2

In the accompanying article, the journalist declared that student progress on NAEP tests came to a virtual standstill. This version highlights the drop-off in gains between the two periods, as measured by these "gain scores."

The clarity is achieved through proximity as well as slopes.

The column chart form has a number of deficiencies when used to illustrate this data. It requires too many colors. It induces involuntary head-shaking.

Most unforgivably, it leaves us with a puzzle: does the absence of a column mean no progress, or missing data?

Inset_2016naep

PS. The inclusion of 2009 in both time periods is probably an editorial oversight.

Political winds and hair styling

The Washington Post (link) and the New York Times (link) published dueling charts last week, showing the swing-swang of the political winds in the U.S. Of course, you know that the pendulum swung riotously rightward towards Republican red in this election.

The Post focused its graphic on the urban / not urban division within the country:

Wp_trollhair

Over Twitter, Lazaro Gamio told me they are calling these troll-hair charts. You certainly can see the imagery of hair blowing with the wind. In small counties (right), the wind is strongly to the right. In urban counties (left), the straight hair style has been in vogue since 2008. The numbers at the bottom of the chart drive home the story.

Previously, I discussed the Two Americas map by the NY Times, which covers a similar subject. The Times version emphasizes the geography, and is a snapshot while the Post graphic reveals longer trends.

Meanwhile, the Times published its version of a hair chart.

Nyt_hair_election

This particular graphic highlights the movement among the swing states. (Time moves bottom to top in this chart.) These states shifted left for Obama and marched right for Trump.

The two sets of charts have many similarities. They both use curvy lines (hair) as the main aesthetic feature. The left-right dimension is the anchor of both charts, and sways to the left or right are important tropes. In both presentations, the charts provide visual aid, and are nicely embedded within the story. Neither is intended as exploratory graphics.

But the designers diverged on many decisions, mostly in the D(ata) or V(isual) corner of the Trifecta framework.

***

The Times chart is at the state level while the Post uses county-level data.

The Times plots absolute values while the Post focuses on relative values (the cumulative swing from the 2004 position). In the Times version, the reader can see the popular-vote margin for any state in any election. The middle vertical line marks a zero margin, which is what decides the electoral votes (most states award all of theirs to the plurality winner of the popular vote). It is easy to find the crossover states and times.

The Post's designer did some data transformations. Everything is indexed to 2004. Each number in the chart is the county's current leaning relative to 2004. Thus, left of the vertical line means the county has shifted toward blue compared with 2004. The numbers are cumulative moving top to bottom. If a county is 10% left of center in the 2016 election, this effect may have come about this year, or 4 years ago, or 8 years ago, or some combination of the above. Again, left of center does not mean the county voted Democratic in that election. So, the chart must be read with some care.
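
Here is a minimal sketch of that transformation as I read it, using a made-up county (the margins are fabricated):

```python
import pandas as pd

# Fabricated Republican-minus-Democratic margins (percentage points) for one county
margin = pd.Series({2004: 5.0, 2008: -2.0, 2012: 1.0, 2016: 12.0})

# Index everything to 2004: each election's margin relative to the 2004 margin
swing_from_2004 = margin - margin.loc[2004]

# A negative (left-of-zero) value means the county has shifted toward blue
# relative to 2004; it does NOT mean the county voted Democratic that year
print(swing_from_2004)
```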

One complaint about anchoring the data is the arbitrary choice of the starting year. Indeed, the Times chart goes back to 2000, another arbitrary choice. But clearly, the two teams were aiming to address slightly different variations of the key question.

There is a design advantage to anchoring the data. The Times chart is noticeably more entangled than the Post chart; there is a lot more criss-crossing. This is particularly glaring given that the Times chart contains many fewer lines than the Post chart, since it plots states rather than counties.

Anchoring the data to a starting year has the effect of combing one's unruly hair. Mathematically, the designers are just shifting the lines so that they all start at the same location, without altering their curvature. Of course, this is double-edged: after the re-centering, left no longer means blue and right no longer means red.

The Times chart uses a different coping strategy. Each version of the chart applies a filter, highlighting a subset of lines to demonstrate a particular vignette: the swing states moved slightly to the right, the Republican states marched right, and the Democratic states also moved right. Without these filters, readers would be wincing at the Times's bad-hair day.

***

Another decision worth noting: the direction of time. The Post's choice of top to bottom seems more natural to me than the Times's reverse order but I am guessing some of you may have different inclinations.

Finally, what about the thickness of the lines? The Post encoded population (voter) size while the Times used electoral votes. This decision is partly driven by the choice of state versus county level data.

One can consider electoral votes as a kind of log transformation. The effect of electorizing the popular vote is to pull the extreme values toward the center. This significantly simplifies the designer's life. To wit, in the Post chart (shown below), a filter has to be applied to highlight key counties, and you notice that those lines are so thick that all the other counties become barely visible.

  Wp_trollhair_texas
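
To see the compression at work, compare the largest and smallest states (approximate 2016 figures; treat these as rough illustrations, not official numbers):

```python
# Approximate 2016 figures, for illustration only
population = {"California": 39_250_000, "Wyoming": 585_000}
electoral_votes = {"California": 55, "Wyoming": 3}

pop_ratio = population["California"] / population["Wyoming"]           # about 67x
ev_ratio = electoral_votes["California"] / electoral_votes["Wyoming"]  # about 18x

# Like a log transform, encoding electoral votes instead of population pulls
# the extremes toward the center, keeping line thicknesses manageable
print(round(pop_ratio), round(ev_ratio))
```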

 


Bumps chart goes mainstream

It’s a happy day when one of my favorite chart types, the Bumps chart, makes it to the Wall Street Journal, and the front page no less! (Link to article)

This chart shows the ground shifting in global auto production in the next five years, with Mexico and India gaining in rank over Germany and South Korea.

Wsj_auto_bumps

The criss-crossing of lines is key to reading these charts. A crossing ("bump") necessarily means one entity has surpassed the other entity in absolute terms, even though we are looking at the relative rank.

Of course, there is no Swiss Army Knife of charts. This graphic provides no clue as to the share of world production. It's quite possible that the first few countries account for the majority of the world's production, so that the rank shifts toward the bottom of the chart are relatively inconsequential. Wikipedia says that the top player (China) produces a quarter of the world's vehicles, and twice as many as the next biggest producer. Any country ranked below fourth accounts for less than 5 percent of global volume.
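
For readers who want to build one of these, the key step is converting absolute values to ranks. Here is a minimal pandas sketch with made-up production volumes (not the WSJ figures):

```python
import pandas as pd

# Made-up production volumes in millions of vehicles (illustration only)
prod = pd.DataFrame(
    {"2015": [24.5, 12.1, 9.3, 5.7, 4.6],
     "2020": [27.0, 11.5, 9.0, 6.5, 7.2]},
    index=["China", "US", "Japan", "Germany", "Mexico"],
)

# A bumps chart plots ranks, not volumes: rank 1 is the largest producer.
# When two lines cross, one country has overtaken the other in absolute volume.
ranks = prod.rank(ascending=False).astype(int)
print(ranks)
```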

***

I made a few minor edits in this version below. For example, it's unclear why both 2014 and 2015 are depicted, since there were no rank shifts between them and the 2015 data is a projection anyway. (I don't have any problem with the two red lines even though I didn't carry over the color scheme.)

  Redo_wsj_auto_trend


A startling chart about income inequality, with interpretative difficulties

Reader Robbi B. submitted the following chart posted to Twitter by Branko Milanovic:

Brano_img

The chart took a little time to figure out. This isn't a bad chart. Robbi wondered if there are alternative ways to plot this information.

The U.S. population is divided into percentiles across the horizontal axis, presumably based on the income distribution in some year (I'm guessing 2007, the start of the recession). For each percentile of people, the real per capita growth (decline) in disposable income is computed for two periods: the blue line shows the decline during the recession (2007-2010) and the orange shows the growth (in some cases further decline) during the recovery (2010-2013).
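
As I understand the construction, each percentile gets one growth rate per period. A minimal sketch of that computation, with fabricated income figures for three percentiles:

```python
import pandas as pd

# Fabricated real disposable income per capita, by income percentile (illustration)
income = pd.DataFrame(
    {2007: [8000, 30000, 120000],
     2010: [6800, 28500, 96000],
     2013: [7000, 29000, 115000]},
    index=[5, 50, 95],  # percentile of the income distribution
)

# Blue line: growth (decline) during the recession, 2007-2010
recession = income[2010] / income[2007] - 1
# Orange line: growth (decline) during the recovery, 2010-2013
recovery = income[2013] / income[2010] - 1

print(pd.DataFrame({"2007-2010": recession, "2010-2013": recovery}).round(3))
```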

This chart draws attention to the two tails of the distribution, namely, the bottom 10 percent and the top 5 percent. At one level, these two groups (excepting the bottom 2%) experienced the best of the recovery. But then, they also suffered the worst declines during the recession.

***

Here is one possible view of the same data, in a format with which I have been experimenting recently. You might call this a Bumps panel or a slopegraph panel.

Redo_branko

The slopes draw attention to the relative magnitudes of the declines and the subsequent recoveries. (I thinned the middle 80% substantially because there isn't much going on in that part of the dataset.) If I had more time, I'd have chosen a different color instead of grayscale for those lines.

I ignored any questions I have about the underlying data. How is disposable income defined and measured? Does it carry the same meaning across the entire spectrum of the income distribution? etc. (Milanovic points to the Survey of Consumer Finances as the source.)

***

One reason for the reading difficulty is the absence of a reference point. It's unclear how we should judge the orange line. Two reference points suggest themselves, but both are problematic. One is the zero line: which segments of the population experienced a recovery and which didn't? The other is the mirror image of the blue line: how much of what one lost during the recession had one recovered by 2013 (roughly speaking)?

Both of these easy interpretations worry me because they carry an assumption of equal guilt (blue line) and/or equal spoils (orange line). It is very possible that the unwarranted risk-taking or fraud was not spread evenly across the percentiles, and if so, it is impossible to judge whether the distribution exhibited in the blue line was "fair". It is then also impossible to know whether the distribution contained in the orange line was "fair". Indeed, if the orange line mirrored the blue line, then every segment recovered in proportion to what it lost, which would only make sense if all segments were equally culpable in the recession.


Where a scatter plot fails

Found this chart in the magazine that Charles Schwab sends to customers:

Schwab_riskreturn

When there are two variables, and their correlation is of interest, a scatter plot is usually recommended. But not here!

The text labels completely dominate this chart. The designer clearly tried hard to place them, but a careful look reveals that some boxes sit above their dots while others sit to the right, and the label for "Short Treasuries" has taken refuge quite a distance from its dot. The locations of the text boxes therefore cannot substitute for the dots.

***

Here is a different view of this data:

Redo_schwabriskreturn1

I am using a bumps-style chart, which allows the labels to be written horizontally outside the canvas. Instead of plotting all categories on the same chart, I use a small-multiples setup to differentiate three types of risk-return relationships.


Circular but insufficient

One of my students analyzed the following Economist chart for her homework.

Economist_book_sales_printversion

I was looking for it online, and found an interactive version that is a bit different (link). Here are three screen shots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.

  Economist_booksales_all

The online version serves as the self-sufficiency test for the print version. In testing self-sufficiency, we want to see whether the visual elements (i.e., the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much in sales each sector represents, nor reliably estimate the relative scale of print versus ebook (pink/red vs. yellow/orange) or the year-to-year growth rates.

As usual, when we see the entire dataset printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.

The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.

In the Trifecta Checkup, this is a Type V chart.

***

This particular dataset is made for the bumps-style chart:

Redo_economistbooksales


Respect the reader's time

A graphic illustrating how Americans spend their time is a perfect foil to make the important case that the reader's time is a scarce resource. I wrote about this at the ASA forum in 2011 (link).

In the same WSJ that carried the DSL speed chart (link), they boldly placed the following graphic in the center of the front page of the printed edition:

Wsj_atus_printed_sm

The visual form is a treemap displaying results from the recently released American Time Use Survey (link to pdf).

What does the designer want us to learn from this chart?

***

What jumps out first is the importance of various activities, starting with sleep, then work, TV, leisure/sports, etc.

If you read the legend, you'll notice that the colors mean something. The blue activities take up more time in 2013 compared to 2003. Herein, we encounter the first design hiccup.

The size of the blocks (which codes the absolute amount) and the color of the blocks (which codes the relative change in the amount) compete for our attention. According to Bill Cleveland's research, size is perceived more strongly than color. Thus, the wrong element wins.

Next, if we have time on our hands, we might read the data labels. Each block has two labels, the absolute values for 2003 and for 2013. In this, the designer is giving an arithmetic test. The reader is asked to compute the change in time spent in his or her head.

It appears that the designer's key message is "Aging Americans sleep more, work less", with the subtitle "TV remains No.1 hobby".

***

Wsj_atus2013

Now compare the treemap to this set of "boring" bar charts.

This visualization of the same data appears in WSJ online in lieu of the treemap. Here, the point of the article is made clear; the reader need not struggle with mental gymnastics.

(One can grumble about the blindness to red-green color-blindness but otherwise, the graphic is pretty good.)

***

When I see this sort of data, I like to make a Bumps chart. So here it is:

Redo_wsjatus

The labeling of the smaller categories poses a challenge because the lines are so close together. However, those numbers are so small that none of the changes would be considered statistically significant.

***

From a statistical/data perspective, a very important question must be raised. What is the error bar around these estimates? Is there anything meaningful about an observed difference of fewer than 10 minutes?

Amusingly, the ATUS press release (link to pdf) has a technical note that warns us about the reliability of estimates, but nowhere in the press release can one actually find the value of the standard error, a confidence interval, etc. After emailing them, I did get the information promptly. The standard error of one estimate is roughly 0.025-0.05 hours, which means that the standard error of a difference is roughly 0.05-0.1 hours, which means that a confidence interval around any estimated difference is roughly 0.1-0.2 hours, or 6-12 minutes.
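
For reference, here is the textbook arithmetic behind those ballpark numbers, assuming the two years' estimates are independent (a simplification on my part):

```python
import math

# Ballpark standard error of a single ATUS estimate, in hours (upper end of the
# 0.025-0.05 range quoted above)
se_single = 0.05

# Standard error of the difference between two independent estimates
se_diff = math.sqrt(se_single**2 + se_single**2)  # about 0.07 hours

# Rough 95% confidence interval half-width for the difference, in minutes
ci_half_width = 1.96 * se_diff * 60               # about 8 minutes

print(round(se_diff, 3), round(ci_half_width, 1))
```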

Except for the top three categories, it's hard to know whether the reported differences are real or just sampling noise.

***

A further problem with the data is its detachment from reality. There are two layers of averaging going on, once at the population level and once at the time level. In reality, not everyone does these things every day. This dataset is really only interesting to statisticians.

So, in a Trifecta Checkup, the treemap is a Type DV and the bar chart is a Type D.


Losing the big picture

One of the dangers of "Big Data" is the temptation to get lost in the details. You become so absorbed in the peeling of the onion that you don't realize your tear glands have dried up.

On Twitter, Hans Rosling linked to a visualization of tobacco use around the world (link to original). The setup is quite nice for exploration. I'd call this a "tool" rather than a visual.

***

Tobaccouse_1

Let's take a look at the concentric circles on the right.

I appreciate the designer's concept: the typical visualization of this type of data looks at relative rates, which obscures the fact that China and India have far and away the most smokers even though their rates are middling (24% and 13% respectively).

This circular chart is supposed to show the absolute distribution of smokers across so-called "super-regions" of the world.

Unfortunately, the designer decided to pile on additional details. The concentric circles present a geography lesson, in effect. For example, the high-income super-region is composed of high-income North America, Western Europe, high-income Asia Pacific, etc., and high-income North America is in turn composed of the USA, Canada, etc.

Notice something odd? The further out you go, the larger the circular segments but the smaller the number of people they represent! There are more people in the high-income super-region worldwide than in high-income North America, and in turn more people in high-income North America than in the USA. But the sizes of the graphical elements are reversed.
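
The geometry guarantees this reversal: for the same angular share, a ring segment's area grows as it moves outward. A quick check with arbitrary radii (I hold the angular share fixed for simplicity):

```python
import math

def ring_segment_area(r_inner, r_outer, angle_fraction):
    """Area of an annulus segment covering a given fraction of the full circle."""
    return angle_fraction * math.pi * (r_outer**2 - r_inner**2)

# The same 30% angular share at three nested rings (radii are arbitrary)
super_region = ring_segment_area(1, 2, 0.30)  # innermost ring
region = ring_segment_area(2, 3, 0.30)
country = ring_segment_area(3, 4, 0.30)       # outermost ring

# The outer segments come out largest even though they represent subsets
print(round(super_region, 2), round(region, 2), round(country, 2))
```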

***

Tobaccouse_2

In principle, the "bumps"-like chart used to show the evolution of tobacco prevalence in individual countries makes for a nice visual. In fact, Rosling marvelled that the global rate of consumption has fallen in recent years.

However, I'm often irritated when the designer pays no attention to what not to show. There are probably well over 200 lines densely packed into this chart. Over-plotting all but guarantees that some of these lines will literally never see the light of day. Try hovering over the lines and see for yourself.

The same chart with, say, 10 judiciously chosen lines (countries or regions) would serve the reader far better.

***

The discerning reader figures out that the best visual actually does not even show up on the dashboard. Go ahead, and click on the tab called "Data" on top of the page. You now see a presentation of each country's "data" by age group and by gender. This is where you can really come up with stories for what is going on in different countries.

For example, the British have done extremely well in reducing tobacco use. Look at how steep the declines are across the board for British men. (In most parts of the world, the prevalence of smoking is much higher among men than women.)

Tobaccouse_ukmen

Bulgaria, on the other hand, shows a rather odd pattern. It is one of the few countries in the bumps chart that showed a climb in smoking rates, at least in the early 2000s. Here the data for men are broken down into age groups.

Tobaccouse_bulgariamen

 
This chart exposes a weakness of the underlying data. The error bars indicate that what is being plotted is not actual data but modeled data. The error bars here are enormous. With the average at about 40% to 50% for many age groups, the confidence interval is also about 40% wide. Further, note that there were only three or four observations (purple dots), and curves are being fitted to those three or four dots, with extrapolation outside the window of observation. The end result is that the apparent uplift in smoking in the early 2000s is probably a figment of the modeler's imagination. You'd want to check whether there were changes in methodology around that time.

As a responsible designer of data graphics, you should focus less on comprehensiveness and more on highlighting the good data. I'm a firm believer in "no data is better than bad data".