Graphical advice for conference presenters - demo

Yesterday, I pulled this graphic from a journal paper and said one should not copy and paste it into an oral presentation.

Example_presentation_graphic

So I went ahead and did some cosmetic surgery on this chart.

Redo_example_conference_graphic

I don't know anything about the underlying science; I'm just interpreting what I see on the chart. The key message seems to be that the Flowering condition is different from the other three. There are no statistical differences among the three boxplots in each of the first three panels, but there is a big difference between the red-green pair and the purple in the last panel. Further, this difference can be traced to the red and green boxplots exhibiting negative correlation under the Flowering condition, while the purple boxplot is the same under all four conditions.

I would also have chosen different colors, e.g. making red and green two shades of gray to indicate that these two series can be treated as the same on this chart. Doing so would obviate the need to introduce the orange color.

Further, I think it might be interesting to see the plots split differently: try having the red and green boxplots side by side in one panel, and the purple boxplots in another panel.

If the presentation software has animation, the presenter can show the different text blocks and related materials one at a time. That also aids comprehension.

***

Note that the plot is designed for an oral presentation in which you have a minute or two to get the message across. It's debatable whether journal editors should accept this style for publications. I actually think such a style would improve reading comprehension, but I surmise some of you will disagree.


Is the chart answering your question? Excavating the excremental growth map

Economist_excrement_growth

San Franciscans are fed up with excremental growth. Understandably.

Here is how the Economist sees it - geographically speaking.

***

In the Trifecta Checkup analysis, one of the questions to ask is "What does the visual say?" - specifically, what it says with respect to the question being asked.

The question is how much the problem of human waste in SF grew from 2011 to 2017.

What does the visual say?

  • The number of complaints about human waste increased from 2011 to 2014 to 2017.
  • The areas generating complaints about human waste have expanded.
  • The worst areas are around downtown, and that has not changed over this period.

***

Now, what does the visual not say?

Let's make a list:

  • How many complaints are there in total in any year?
  • How many complaints are there in each neighborhood in any year?
  • What's the growth rate in number of complaints, absolute or relative?
  • What proportion of complaints are found in the worst neighborhoods?
  • What proportion of the area is covered by the green dots on each map?
  • What's the growth in terms of proportion of areas covered by the green dots?
  • Does the density of green dots reflect density of human waste or density of human beings?
  • Does the absence of a green dot indicate no complaints, or complaints below the threshold of the color scale?

There's more:

  • Is the growth in complaints a result of more reporting or more human waste?
  • Is each complainant unique? Or do some people complain multiple times?
  • Does each piece of human waste lead to one and only one complaint? In other words, what is the relationship between the count of complaints and the count of human waste?
  • Is it easy to distinguish between human waste and animal waste?

And more:

  • Are all complaints about human waste valid? Does anyone verify complaints?
  • Are the plotted locations describing where the human waste is or where the complaint was made?
  • Can all complaints be treated identically as a count of one?
  • What is the per-capita rate of complaints?

In other words, the set of maps provides almost no information about the excrement problem in San Francisco.

After you finish working, go back and ask what the visual is saying about the question you're trying to address!

As a reference, I found this map of the population density in San Francisco (link):

SFO_Population_Density

Common charting issues related to connecting lines, labels, sequencing

The following chart about "ranges and trends for digital marketing salaries" has some problems that appear in a great number of charts.

Marketingsherpa-chartofweek-062915-salaries

There's the head tilt required to read the job titles.

The order of the job titles is baffling: it's neither alphabetical nor sorted by salary.

The visual form suggests that we could read trends in salaries from left to right, but the only information about trends is the year-on-year salary change printed at the top of the chart.

Some readers will violently object to the connecting lines between job titles, which are discrete categories. In this case, I agree. I am a fan of so-called profile charts, in which we do connect discrete categories with lines - but those charts work because we are comparing the "profiles" of one group versus another group. Here, there is only one group.

The N=3,567 is weird. It says nothing about the reliability of the estimate for, say, Chief Marketing Officer.

***

A dot plot can be used for this dataset. Like this:

Redo_jc_digitalsalaries
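
For readers who want to experiment, below is a minimal sketch of such a dot plot in matplotlib. The job titles and salary figures are hypothetical placeholders, not the survey's numbers; sorting the titles by salary also fixes the baffling ordering noted above.

```python
import matplotlib.pyplot as plt

# Hypothetical salary ranges, sorted by salary (not the survey's data).
titles = ["Marketing Coordinator", "Marketing Manager",
          "Marketing Director", "Chief Marketing Officer"]
lows  = [38_000, 55_000, 80_000, 120_000]
highs = [55_000, 85_000, 130_000, 210_000]

fig, ax = plt.subplots(figsize=(6, 3))
for i, (lo, hi) in enumerate(zip(lows, highs)):
    ax.plot([lo, hi], [i, i], color="lightgray", zorder=1)    # salary range
    ax.scatter([lo, hi], [i, i], color="steelblue", zorder=2)  # endpoints

ax.set_yticks(range(len(titles)))
ax.set_yticklabels(titles)   # horizontal labels: no head tilt required
ax.set_xlabel("Annual salary ($)")
plt.tight_layout()
plt.show()
```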

The range of salaries is not a great metric as the endpoints could be outliers.

Also, the variability of salaries is affected by two factors: the variability between companies, and sampling variability (which depends on the sample size for each job title). A wide range here could mean that different companies pay different salaries for the same job title, or that very few survey responders held that job title.
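
To see the second factor concretely, here is a small simulation sketch. It assumes a lognormal salary distribution with made-up parameters; the point is that, holding the underlying pay distribution fixed, the observed range bounces around wildly when only a handful of responders hold a given job title.

```python
import numpy as np

rng = np.random.default_rng(7)
median_salary, sigma, trials = 90_000, 0.35, 1_000  # hypothetical parameters

for n in (5, 50, 500):  # number of survey responders holding the job title
    # np.ptp() returns max minus min, i.e., the observed salary range
    ranges = [np.ptp(rng.lognormal(np.log(median_salary), sigma, n))
              for _ in range(trials)]
    print(f"n={n:3d}: observed range varies from {min(ranges):9,.0f} "
          f"to {max(ranges):9,.0f} across {trials} simulated surveys")
```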

Fantastic visual, but the Google data need some pre-processing

Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries around the world that ask "how." The project web page is here.

The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like in France. (It's always instructive to think about how they would count "France" queries: queries from google.fr? Queries written in French? Queries from an IP address in France? A combination of the above?)

Howtofixit_france_appliances

I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data are encoded in the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.

By comparison, the Russian picture looks very different:

Howtofixit_russia_appliances

Are the Russians more sensible? Their searches are overwhelmingly about the washing machine, which is the most complicated piece of equipment on the graphic.

At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:

Howtofixit_world_cooking

I have to confess that I searched for "how to make soft boiled eggs." That led me to a lot of different webpages, mostly created for people who search for how to make a soft-boiled egg. All of them contain lots of advertising, and the answer boils down to: cook it for six minutes.

***

The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.

In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.

The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all France queries about fixing home appliances. That is the meaning of popularity we want, but it is not what we have. We can obtain true popularity measures by "normalizing" the Search Index: sum the Index values of all the appliances, then divide each Search Index by that sum. After normalizing, the numbers can be interpreted as proportions, and they add up to 100% for each country. When not normalized, the indices do not add up to 100%.
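
Here is a minimal sketch of that normalization, using the France indices quoted above (restricted to three appliances for brevity):

```python
# Search Index values: the top-ranked item is pegged at 100.
search_index = {"window": 100, "door": 96, "washing machine": 49}

total = sum(search_index.values())
proportions = {item: idx / total for item, idx in search_index.items()}

for item, p in proportions.items():
    print(f"{item}: {p:.1%}")   # these now sum to 100% within the country
```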

Take the case in which we have five appliances, and let's say all five are equally popular, each comprising 20% of searches. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add up to 500!

By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.

If you realize this, then you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total query volume is accounted for by the top query.

In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.

Lines, gridlines, reference lines, regression lines, the works

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.

Goog_newsrooms_gender_2

This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare unfavorably to the gender ratio in the neighborhoods, as we saw in the previous post).

***

Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. The labels belong to the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when a chart contains gridlines, we expect the labels to sit right at each gridline, either on top of or just below the line. Here, the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines are drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people worked on these labels, as there exist two patterns: the first is "X% Leaders are Women," and the second is "Y% Female." (The top and bottom labels are also inconsistent, one using "women" and the other "female.")

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should be 25%, 35%, ..., 65%, 75%, in essence the same scale as the vertical axis.

Here is the same chart with improved axis labels:

Jc_newsroomgender_1

Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proportion of females as the rest of the staff. In the following plot (right side), I added the 45-degree line. Note that it is positioned awkwardly on top of the grid system. The culprit is the incompatible gridlines.

Jc_newsroomgender_1

The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.

Jc_newsroomgender_3

***

Now that we've dealt with the purely visual issues, let me get to a statistical issue that's been troubling me: that yellow line. It's supposed to be a regression line that runs through the points.

Does it appear biased downwards to you? There seem to be too many dots above the line and not enough below. The points furthest above the line also appear to be more distant from it than the points furthest below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of leadership staff. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)
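
In brief (a standard least-squares fact, not specific to this project): the fitted line is $\hat{y} = \hat{a} + \hat{b}x$, and the least-squares intercept satisfies $\hat{a} = \bar{y} - \hat{b}\bar{x}$. Evaluating the line at $x = \bar{x}$ gives

$$\hat{a} + \hat{b}\bar{x} = (\bar{y} - \hat{b}\bar{x}) + \hat{b}\bar{x} = \bar{y},$$

so the regression line always passes through the average point $(\bar{x}, \bar{y})$.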

Note the corresponding chart for racial diversity has it right. The yellow line does pass through the average point here:

Goog_newsrooms_race_2

***

In practice, how do problems seep into dataviz projects? You don't get to the final chart via a clean, streamlined process; you pass through a cycle of explore, retrench, and synthesize, frequently bouncing ideas among several people, and it's challenging to keep everything consistent!

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.

Saying no thanks to a box of donuts

As I reported last week, Delaware's Department of Education is running a survey on dashboard design. The survey link is here.

One of the charts being evaluated is a box of donuts, as shown below:

Delaware_doe

I have written before about the problem with donut charts (see here). A box of donuts is worse than a single donut. Here, each donut represents a school year, and each depicts the composition of the student body by race/ethnicity. In aggregate, the composition has not changed drastically, although there are small changes from year to year.

In the following alternative, I use side-by-side line charts, sometimes called a slopegraph, to illustrate the change by race/ethnicity.

Redo_delaware_doe

The key decisions (see the sketch after the list below) are:

  • using slopes to encode the year-to-year changes, as opposed to having readers compute those changes by measuring and dividing
  • using color to show insights (whether each race/ethnicity has expanded, contracted, or remained stable across the three years), as opposed to merely identifying the data series
  • leaving implicit the fact that the percentages within each year sum to 100%, as opposed to explicitly presenting this fact in a circular arrangement
  • placing annual data side by side on the same plot region as opposed to separating them in three charts
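
Here is a minimal matplotlib sketch of those decisions, with made-up percentages standing in for the Delaware enrollment data:

```python
import matplotlib.pyplot as plt

# Hypothetical composition data for three school years (not DOE figures).
years = [2015, 2016, 2017]
shares = {
    "White":    [48, 47, 46],   # contracting
    "Hispanic": [16, 17, 18],   # expanding
    "Black":    [30, 30, 30],   # stable
    "Other":    [6, 6, 6],      # stable
}
# Color encodes the insight (the trend), not the identity of the group.
colors = {"White": "crimson", "Hispanic": "seagreen",
          "Black": "gray", "Other": "gray"}

fig, ax = plt.subplots(figsize=(5, 4))
for group, values in shares.items():
    ax.plot(years, values, marker="o", color=colors[group])
    ax.annotate(group, (years[-1], values[-1]),
                xytext=(6, 0), textcoords="offset points", va="center")

ax.set_xticks(years)
ax.set_ylabel("% of student body")
plt.tight_layout()
plt.show()
```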

***

There is still a further question of how big a change from year to year is considered material.

This is a good example of why there is never "complete data." In theory, the numbers on this chart are "complete," coming from administrative records. Even ignoring the possibility that some records are missing or incorrect, the set of students in the system varies from year to year, so a 1-percent increase in the proportion of Hispanic students may or may not indicate a real demographic trend.

The visual should be easier to read than your data

A reader sent in this tip some time ago, and I lost track of who he/she is. This graphic looks deceptively complex.

MW-FW350_1milli_20171016112101_NS

What's complex is not the underlying analysis. The design is complex and so the decoding is complex.

The question behind the graphic is a central concern of anyone who's retired: how long will one's savings last? There are two related metrics to describe the durability of the stash, and both are present on this chart. The designer first presumes that one has saved $1 million for retirement, then computes how many years the savings will last. That, of course, depends on the cost of living, which can naively be expressed as a projected annual expenditure. The designer allows the cost of living to vary by state, which is the main source of variability in the computations. The time-based and dollar-based metrics are directly linked to one another via a formula.
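
The formula is just division. A sketch, under the chart's simplifying assumptions (a fixed nest egg, constant annual spending, no investment returns or inflation); the expenditure figure below is hypothetical:

```python
savings = 1_000_000
annual_expenditure = 55_000  # hypothetical state-level cost of living

years_total = savings / annual_expenditure
years, months = divmod(round(years_total * 12), 12)
print(f"The savings last about {years} years and {months} months.")
```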

The design encodes the time metric in a grid of dots, and the dollar-metric in the color of the dots. The expenditures are divided into eight segments, given eight colors from deep blue to deep pink.

The design raises several issues:

  • Thirteen of the dots are invariable, appearing in every state.
  • Readers are drawn into a ranking of the states, which is nothing but a ranking of costs of living. (We don't know, but presume, that the cost-of-living computation is appropriate for retirees, and not a general average.) This ordering obscures any spatial correlation.
  • There are a few production errors in the first row, in which the year and month numbers are slightly misstated; the numbers should be monotonically decreasing.
  • In terms of years and months, the difference between many states is immaterial.
  • The pictogram format is more popular than it deserves: only highly motivated readers will count individual dots. And if readers merely read the printed text, which contains all the data encoded in the dots, then the graphic has failed the self-sufficiency principle - the visual elements are not doing any work.

***

In my version, I surface the spatial correlation using maps. The states are classified into sensible groups that allow a story to be told around the analysis. Three groups of states are identified and separately portrayed. The finer variations between states within each state group appear as shades.

Redo_howlonglive

Data visualization should make the underlying data easier to comprehend. It's a problem when the graphic is harder to decipher than the underlying dataset.

Choosing the right metric reveals the story behind the subway mess in NYC

I forgot who sent this chart to me - it may have been a Twitter follower. The person complained that the following chart exaggerates how much trouble the New York mass transit system (the MTA) has been facing in 2017, because of the choice of vertical axis limits.

Streetsblog_mtatraffic

This chart is vintage Excel, using Excel defaults. I find this style ugly and uninviting, but the chart does contain some good analysis. The analyst made two smart moves: the chart controls for month-to-month seasonality by plotting data for the same month across successive years; and the designation "12 month averages" really means moving averages with a window size of 12 months, which has the effect of smoothing out short-term fluctuations to reveal the longer-term trend.
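
A sketch of both moves on a simulated monthly ridership series (made-up numbers, in millions of passengers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.period_range("2014-01", "2017-12", freq="M")
trend = np.linspace(200, 195, len(months))                  # slow decline
season = 10 * np.sin(2 * np.pi * (np.asarray(months.month) - 1) / 12)
ridership = pd.Series(trend + season + rng.normal(0, 2, len(months)),
                      index=months)

# Smart move 1: compare the same month across years (controls seasonality).
april_by_year = ridership[ridership.index.month == 4]

# Smart move 2: a trailing 12-month moving average smooths out seasonality
# and noise, revealing the longer-term trend.
moving_avg = ridership.rolling(window=12).mean()

print(april_by_year.round(1))
print(moving_avg.dropna().round(1).tail(3))
```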

The red line is very alarming as it depicts a sustained negative trend over the entire year of 2017, even though the actual decline is a small percentage.

If this chart showed up on a business dashboard, the CEO would be extremely unhappy. Slow but steady declines are the most difficult trends to deal with because they cannot be explained by one-time impacts. Until the analytics department figures out the underlying cause, the decline is very difficult to curtail, and with each monthly report, the sense of despair grows.

Because the base number of passengers in the New York transit system is so high, using percentages to think about the shift in volume underplays the message. It's better to use actual millions of passengers lost. That's what I did in my version of this chart:

Redo_jc_mtarevdecline

The quantity depicted is the unexpected loss of revenue passengers, measured against a forecast. The forecast I used is the average of the past two years' passenger counts. Above the zero line means outperforming the forecast; in this case, since October 2016, performance has dipped ever farther below the forecast. By April 2017, the gap had widened to over 5 million passengers. That's a lot of lost customers and lost revenue, regardless of the percentage!
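
For concreteness, here is the forecast arithmetic on hypothetical April counts (in millions); the chart applies the same computation to every month:

```python
# Forecast for a month = average of that month's counts in the two prior
# years; the plotted quantity is actual minus forecast. Made-up numbers.
april = {2015: 161.0, 2016: 160.0, 2017: 155.3}

forecast_2017 = (april[2015] + april[2016]) / 2
shortfall = april[2017] - forecast_2017
print(f"April 2017: {shortfall:+.1f} million passengers vs. forecast")
# -> April 2017: -5.2 million passengers vs. forecast
```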

The biggest headache is investigating the cause of this decline. Most likely, it is a combination of factors.


Upcoming talks here and there

I'm giving a dataviz talk in San Ramon, CA on Thursday Nov 9. Go here to register.

***

Then next Monday (Nov 13, 11 am), I will be in Boston at Harvard Business Review, giving a "live whiteboard session" on A/B Testing. This talk will be streamed live on Facebook Live.

***

Finally, my letter to the editor of New York Times Magazine was published this past Sunday. This letter is a response to Susan Dominus's article about the "power pose" research, and the replication crisis in social science. Fundamentally, it is a debate over how data is used and analyzed in experiments, and therefore relevant to my readers. I added a list of resources in this blog post about the letter.

***

Those are some of my favorite topics: dataviz, A/B testing, and data-driven decision-making.


Lop-sided precincts, a visual exploration

In the last post, I discussed one of the charts in the very nice Washington Post feature delving into the polarization of American voters. See the post here. (Thanks again, Daniel L.)

Today's post is inspired by the following chart (I am showing only the top of it - click here to see the entire chart):

Wpost_friendsparties2_top

The chart plots each state as a separate row, so like most such charts, it is tall. The data analysis behind the chart is fascinating and unusual, although I find the chart harder to grasp than expected. The analyst starts with precinct-level data, and determines which precincts were "lop-sided," defined as having a winning margin of over 50 points for the winner (either Trump or Clinton). The analyst then sums the voters in those lop-sided precincts, and expresses the sum as a percent of all voters in the state.

For example, in Alabama, the long red bar indicates that about 48% of the state's voters live in lop-sided precincts that went for Trump. It's important to realize that not all such people voted for Trump - they happened to live in precincts that went heavily for Trump. Interestingly, about 12% of the state's voters reside in precincts that went heavily for Clinton. Thus, overall, 60% of Alabama's voters live in lop-sided precincts.
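
Here is a pandas sketch of that computation as I understand it, on three made-up precinct rows (the Post of course used real precinct returns):

```python
import pandas as pd

# Hypothetical precinct returns: vote shares for each candidate.
precincts = pd.DataFrame({
    "state":   ["AL", "AL", "AL"],
    "voters":  [1200, 800, 1000],
    "trump":   [0.80, 0.55, 0.20],
    "clinton": [0.18, 0.43, 0.78],
})

# A precinct is "lop-sided" when the winner's margin exceeds 50 points.
precincts["margin"] = (precincts["trump"] - precincts["clinton"]).abs()
precincts["winner"] = precincts[["trump", "clinton"]].idxmax(axis=1)
lopsided = precincts[precincts["margin"] > 0.50]

# Voters in lop-sided precincts, as a share of all voters in the state.
share = (lopsided.groupby(["state", "winner"])["voters"].sum()
         .div(precincts.groupby("state")["voters"].sum(), level="state"))
print(share)   # e.g. AL/clinton 0.33, AL/trump 0.40
```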

This is more sophisticated than the usual analysis that shows up in journalism.

The bar chart may confuse readers for several reasons:

  • The horizontal axis is labeled "50-point plus margin for Trump/Clinton" and has values running from 0% into the 40-60% range. This description seemingly implies that the values being plotted are winning margins. However, the sub-header tells readers that the data values are percentages of total voters in the state.
  • The shades of colors are not explained. I believe the dark shade indicates the winning party in each state, so Trump won Alabama and Clinton won California. This information makes the analysis multi-dimensional, and reveals that the designer wants to address how lop-sided precincts affect the outcome of the election. However, adding shades in this manner effectively turns a two-color composition into a four-color composition, adding to the processing load.
  • The chart adopts what Howard Wainer calls the "Alabama first" ordering. This always messes up the designer's message, because alphabetical order typically does not correlate with anything meaningful.

The bars face out from the middle, which is the 0% line. This arrangement is most often seen in population pyramids, and is used when the designer feels it important to let readers compare the magnitudes of two segments of a population. I do not feel that the Democrat-versus-Republican comparison within each state is crucial to this chart, given that most states were not competitive.

What is more interesting to me is the total proportion of voters who live in these lop-sided precincts. The designer agrees, and employs bar stacking to make this point. The stacking yields some amazing insights: several Democratic strongholds, such as Massachusetts, surprisingly have few lop-sided precincts.

***
Here then is a remake of the chart according to my priorities. Click here for the full chart.

Redo_wpost_friendsparties2_top

The emphasis is on the total proportion of voters in lop-sided precincts. The states are ordered by that metric, from most lop-sided to least. This draws out an unexpected insight: most red states have a relatively high proportion of voters in lop-sided precincts (roughly 30 to 40%), while most blue states - except for the quartet of Maryland, New York, California, and Illinois - do not exhibit such demographic concentration.

The gray area offers a counterpoint: most voters do not live in lop-sided precincts.

P.S. I should add that this is one of those chart designs that frustrate standard - I mean, point-and-click - charting software because I am placing the longest bar segments on the left, regardless of color.