Conceptualizing a chart using Trifecta: a practical example

In response to the reader who left a comment asking for ideas for improving the "marginal abatements chart" that was discussed here, I thought it might be helpful to lay out the process I go through when conceptualizing a chart. (Just a reminder, here is the chart we're dealing with.)

Ar_submit_Fig-3-2-The-policy-cost-curve-525

First, I'm very concerned about the long program names. I see their proper placement in a horizontal orientation as a hard constraint on the design. I'd reject every design that displays the text vertically, at an angle, or hides it behind some hover effect, or abbreviates or abridges the text.

Second, I strongly suggest re-thinking the "cost-effectiveness" metric on the vertical axis. Flipping the sign of this metric makes a return-on-investment-type metric, which is much more intuitive. Just to reiterate a prior point, it feels odd to be selecting more negative projects before more positive projects.

Third, I'd like to decide what metrics to place on the two axes. There are three main possibilities: a) benefits (that is, the average annual emissions abatement currently shown on the horizontal axis), b) costs, and c) some function that ties together costs and benefits (currently, this design uses cost per unit benefit and calls it cost-effectiveness, but a variety of similar metrics can be defined).

For each of these metrics, there is a secondary choice. I can use the by-project value or the cumulative value. The cumulative value is dependent on a selection order, in this case, determined by the criterion of selecting from the most cost-effective program to the least (regardless of project size or any other criteria).
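To make the by-project versus cumulative distinction concrete, here is a minimal sketch in Python. The program names and figures are made up for illustration (they are not taken from the chart), and I am assuming that "most cost-effective" means the lowest cost per unit of abatement.

```python
from itertools import accumulate

# Hypothetical programs: (name, annual abatement, annual cost).
# These figures are invented for illustration; they are not from the chart.
programs = [
    ("Program A", 40.0, -200.0),  # a negative cost means net savings
    ("Program B", 10.0, 300.0),
    ("Program C", 25.0, 250.0),
]

def cost_per_unit(program):
    """By-project metric: cost per unit of abatement."""
    _, abatement, cost = program
    return cost / abatement

# Selection order: most cost-effective (lowest cost per unit) first.
ordered = sorted(programs, key=cost_per_unit)

# Cumulative metrics depend on that selection order.
cum_abatement = list(accumulate(p[1] for p in ordered))
cum_cost = list(accumulate(p[2] for p in ordered))

for (name, _, _), abatement, cost in zip(ordered, cum_abatement, cum_cost):
    print(f"{name}: cumulative abatement {abatement}, cumulative cost {cost}")
```

Changing the sort key (say, to project size) changes every cumulative value, while the by-project values stay the same.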

This is where I'd bring in the Trifecta Checkup framework (see here for a guide).

Trifectacheckup_junkcharts_image
The decision of which metrics to use on the axes means I'm operating in the "D" corner. But this decision must be made with respect to the "Q" corner, thus the green arrow between the two. Which two metrics are the most relevant depends on what we want the chart to accomplish. That in turn depends on the audience and what specific question we are addressing for them.

Fourth, if the purpose of the chart is exploratory - that is to say, we use it to guide decision-makers in choosing a subset of programs, then I would want to introduce an element of interactivity. Imagine an interface that allows the user to move programs in and out of the chart, while the chart updates itself to compute the total costs and total benefits.
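The computation behind such an interface could be as simple as the following sketch. The program data and the function name `portfolio_totals` are hypothetical, and a real tool would attach this to a chart that redraws on every change.

```python
# Hypothetical program data: name -> (annual cost, annual abatement).
# The numbers are invented for illustration only.
PROGRAMS = {
    "Program A": (-200.0, 40.0),
    "Program B": (300.0, 10.0),
    "Program C": (250.0, 25.0),
}

def portfolio_totals(selected):
    """Return (total cost, total abatement) for the selected program names."""
    total_cost = sum(PROGRAMS[name][0] for name in selected)
    total_abatement = sum(PROGRAMS[name][1] for name in selected)
    return total_cost, total_abatement

# The user starts with two programs in the chart...
selection = {"Program A", "Program C"}
print(portfolio_totals(selection))

# ...then moves Program B in and Program A out; the totals are recomputed.
selection.add("Program B")
selection.discard("Program A")
print(portfolio_totals(selection))
```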

This last point ties together the entire Trifecta Checkup framework (link). The Question being exploratory in nature suggests a certain way of organizing and analyzing the Data as well as a Visual form that facilitates interacting with the information.

 

 


SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you’d not regret spending time reading. After reading it, you have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere. 

Scmp_protestsovertime

This line chart offers a possible explanation for the burgeoning protests. Residents perceived their freedoms eroding in the last decade.

Scmp_freedomsurvey

If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:

Scmp_protestcostume

The scale of these protests captures the imagination. The last part of the infographic places the number of protestors in context, by expressing it in terms of football pitches (as soccer fields are known outside the U.S.). This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension even though fields used for international matches are standardized to 105 m by 68 m.)

Scmp_protestscale_pitches

This chart could be presented as a bar chart. It’s just that the data have been re-scaled – from counting individuals to counting football pitches-ful of individuals. 

***
Here is the entire infographic.


Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.

Wsj_mediancompanypay

People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.

Wsj_medianpay_dunkinstarbucks

***

I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.

***

See previous discussion of WSJ Graphics.

 


Fantastic visual, but the Google data need some pre-processing

Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries across the world that ask "how." The project web page is here.

The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like in France (It's always instructive to think about how they would count "France" queries. Is it queries from google.fr? queries written in French? queries from an IP address in France? A combination of the above?)

Howtofixit_france_appliances

I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data are encoded in the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.

By comparison, the Russian picture looks very different:

Howtofixit_russia_appliances

Are the Russians more sensible? Their searches are far and away about the washing machine, which is the most complicated piece of equipment on the graphic.

At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:

Howtofixit_world_cooking

I have to confess that I searched for "how to make soft boiled eggs". That led me to a lot of different webpages, mostly created for people who search for how to make a soft boiled egg. All of them contain lots of advertising, and the answer boils down to cook it for 6 minutes.

***

The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.

In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.

The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all France queries about fixing home appliances. That is really the meaning of popularity we want to have but we don't have. We can obtain true popularity measures by "normalizing" the Search Index: just sum up the Index Values of all the appliances and divide the Search Index by the sum of the Indices. After normalizing, the numbers can be interpreted as proportions and they add up to 100% for each country. When not normalized, the indices do not add to 100%.

Take the case in which we have five appliances, and let's say all five appliances are equally popular, comprising 20% of searches each. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add to 500!

By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.
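The normalization described above can be sketched in a few lines of Python. The first example is the five-equal-appliances case from the text. The second uses made-up index values to mimic a Russia-like skew, since the actual indices are not listed here. The function name `normalize` is my own.

```python
def normalize(indices):
    """Convert Search Index values into shares that sum to 1."""
    total = sum(indices.values())
    return {item: value / total for item, value in indices.items()}

# Five equally popular appliances: every index is 100, so they sum to 500.
equal = {f"appliance_{i}": 100 for i in range(1, 6)}
equal_shares = normalize(equal)  # each share is 100/500 = 0.2

# A hypothetical skewed case; these index values are invented.
skewed = {"washing machine": 100, "window": 3, "door": 2, "fridge": 2, "toilet": 1}
skewed_shares = normalize(skewed)  # washing machine share is 100/108

print(sum(equal.values()), equal_shares)
print(sum(skewed.values()), skewed_shares)
```

In both cases the top item has an index of 100, yet its share of searches is 20% in one case and over 90% in the other. That is why the raw indices should not be compared across countries.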

If you realize this, then you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total queries is accounted for by the top query.

In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.

 

 


Playfulness in data visualization

The Newslab project takes aggregate data from Google's various services and finds imaginative ways to enliven the data. The Beautiful in English project makes a strong case for adding playfulness to your data visualization.

Newslab_language_wordsnake

The data came from Google Translate. The authors look at 10 languages, and the top 10 words users ask to translate from those languages into English.

The first chart focuses on the most popular word for each language. The crawling snake presents the "worldwide" top words.

The crawling motion and the curvature are not required by the data but they insert a dimension of playfulness that engages the reader's attention.

The alternative of presenting a data table loses this virtue without gaining much in return.

Readers are asked to click on the top word in each country to reveal further statistics on the word.

For example, the word "good" leads to the following:

Newslab_language_top1_details

 

***

The second chart presents the top 10 words by language in a lollipop style:

Newslab_language_japanese10

The above diagram shows the top 10 Japanese words translated into English. This design sacrifices concision in order to achieve playfulness.

The standard format is a data table with one column for each country, and 10 words listed below each country header in order of decreasing frequency.

The creative lollipop display generates more extreme emotions - positive, or negative, depending on the reader. The data table is the safer choice, precisely because it does not engage the reader as deeply.

 

 


Lines, gridlines, reference lines, regression lines, the works

This post is part 2 of an appreciation of the chart project by Google Newslab, advised by Alberto Cairo, on the gender and racial diversity of the newsroom. Part 1 can be read here.

In the previous discussion, I left out the following scatter bubble plot.

Goog_newsrooms_gender_2

This plot is available in two versions, one for gender and one for race. The key question being asked is whether the leadership in the newsroom is more or less diverse than the rest of the staff.

The story appears to be a happy one: in many newsrooms, the leadership roughly reflects the staff in terms of gender distribution (even though both parts of the whole compare unfavorably to the gender ratio in the neighborhoods, as we saw in the previous post).

***

Unfortunately, there are a few execution problems with this scatter plot.

First, take a look at the vertical axis labels on the right side. The labels inform the leadership axis. The mid-point showing 50-50 (parity) is emphasized with the gray band. Around the mid-point, the labels seem out of place. Typically, when the chart contains gridlines, we expect the labels to sit right around each gridline, either on top or just below the line. Here the labels occupy the middle of the space between successive gridlines. On closer inspection, the labels are correctly affixed, and the gridlines are drawn where they are supposed to be. The designer chose to show irregularly spaced labels: from the midpoint, it's a 15% jump on either side, then a 10% jump.

I find this decision confounding. It also seems as if two people have worked on these labels, as there exist two patterns: the first is "X% Leaders are Women", and the second is "Y% Female." (Actually, the top and bottom labels are also inconsistent, one using "women" and the other "female".)

The horizontal axis? They left out the labels. Without labels, it is not possible to interpret the chart. Inspecting several conveniently placed data points, I figured that the labels on the six vertical gridlines should be 25%, 35%, ..., 65%, 75%, in essence the same scale as the vertical axis.

Here is the same chart with improved axis labels:

Jc_newsroomgender_1

Re-labeling serves up a new issue. The key reference line on this chart isn't the horizontal parity line: it is the 45-degree line, showing that the leadership has the same proportion of females as the rest of the staff. In the following plot (right side), I added in the 45-degree line. Note that it is positioned awkwardly on top of the grid system. The culprit is the incompatible gridlines.

  Jc_newsroomgender_1

The solution, as shown below, is to shift the vertical gridlines by 5% so that the 45-degree line bisects every grid cell it touches.

Jc_newsroomgender_3

***

Now that we have dealt with the purely visual issues, let me get to a statistical issue that's been troubling me. It's about that yellow line. It's supposed to be a regression line that runs through the points.

Does it appear biased downwards to you? It just seems that there are too many dots above and not enough below. The distance of the furthest points above also appears to be larger than that of the distant points below.

How do we know the line is not correct? Notice that the green 45-degree line goes through the point labeled "AVERAGE." That is the "average" newsroom with the average proportion of female staff and the average proportion of female leaders. Interestingly, the average falls right on the 45-degree line.

In general, the average does not need to hit the 45-degree line. The average, however, does need to hit the regression line! (For a mathematical explanation, see here.)
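For readers who prefer a numerical check, here is a small sketch. It assumes an ordinary (unweighted) least-squares line with an intercept, and the data points are made up, not read off the chart. The slope and intercept come from the usual closed-form solution, and the last lines confirm that the fitted line passes through the point of averages.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept via the normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y * sum_xx - sum_x * sum_xy) / denom
    return slope, intercept

# Hypothetical newsrooms: % female staff and % female leaders (invented values).
staff = [30, 35, 40, 45, 50, 55]
leaders = [25, 38, 36, 50, 47, 60]

slope, intercept = fit_line(staff, leaders)
x_bar = sum(staff) / len(staff)
y_bar = sum(leaders) / len(leaders)

# The fitted value at the average x equals the average y.
fitted_at_mean = intercept + slope * x_bar
print(fitted_at_mean, y_bar, math.isclose(fitted_at_mean, y_bar))
```

If the line on the chart were fitted with weights (for example, by newsroom size), it would pass through the weighted averages instead, so this check applies only under the unweighted assumption.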

Note the corresponding chart for racial diversity has it right. The yellow line does pass through the average point here:

Goog_newsrooms_race_2

 ***

In practice, how do problems seep into dataviz projects? It's the fact that you don't get to the last chart via a clean, streamlined process but that you pass through a cycle of explore-retrench-synthesize, frequently bouncing ideas between several people, and it's challenging to keep consistency!

And let me repeat my original comment about this project - the key learning here is how they took a complex dataset with many variables, broke it down into multiple parts addressing specific problems, and applied the layering principle to make each part of the project digestible.

 

 


Well-structured, interactive graphic about newsrooms

Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.

The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.

One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.

At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.

Goog_newsrooms_gender_1

The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)

The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.

I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.

***

The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.

Goog_newsrooms_gender_4

Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.

Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.

***

Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.

Goog_newsrooms_gender_3

The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.

While I typically recommend showing a balanced axis for a bipolar scale, this chart may be an exception. Moving to the right side is progress but the target sits in the middle; the goal isn't to get the dots to the far right, so much of the right panel is wasted space.

 


Two nice examples of interactivity

Janie on Twitter pointed me to this South China Morning Post graphic showing off the mighty train line just launched between north China and London (!)

Scmp_chinalondonrail

Scrolling down the page simulates the train ride from origin to destination. Pictures of key regions are shown on the left column, as well as some statistics and other related information.

The interactivity has a clear purpose: facilitating cross-reference between two chart forms.

The graphic contains a little oversight ... The label for the key city of Xian, referenced on the map, is missing from the elevation chart on the left here:

Scmp_chinalondonrail_xian

 ***

I also like the way the New York Times handled interactivity in this chart showing the rise in global surface temperature since the 1900s. The accompanying article is here.

Nyt_surfacetemp

When the graph is loaded, the dots get printed from left to right. That's an attention grabber.

Further, when the dots settle, some years sink into the background, leaving the orange dots that show the years without the El Nino effect. The reader can use the toggle under the chart title to view all of the years.

This configuration is unusual. It's more common to show all the data, and allow readers to toggle between subsets of the data. By inverting this convention, it's likely few readers need to hit that toggle. The key message of the story concerns the years without El Nino, and that's where the graphic stands.

This is interactivity that succeeds by not getting in the way. 

 

 

 


Excellent visualization of gun violence in American cities

I like the Guardian's feature (undated) on gun violence in American cities a lot.

The following graphic illustrates the situation in Baltimore.

Guardian_gunviolence_baltimore

The designer starts by placing where the gun homicides occurred in 2015. Then, the graphic leads readers through an exploration of the key factors that might be associated with the spatial distribution of those homicides.

The blue color measures poverty levels. There is a moderate correlation between high numbers of dots (homicides) and deeper blue (poorer). The magenta color measures education attainment and the orange color measures proportion of blacks. In Baltimore, it appears that race is substantially better at explaining the prevalence of homicides.

This work is exemplary because it transcends description (first map) and explores explanations for the spatial pattern. Because three factors are explored together in a small-multiples layout, readers learn that no single factor can explain everything. In addition, we learn that different factors have different degrees of explanatory power.

Attentive readers will also find that the three factors of poverty, education attainment and proportion black are mutually correlated.  Areas with large black populations also tend to be poorer and less educated.

***

I also like the introductory section in which a little dose of interactivity is used to sequentially present the four maps, now superimposed. It then becomes possible to comprehend the rest quickly.

Guardian_guncrimemaps_stlouis_2

 ***

The top section is less successful as proportions are not easily conveyed via dot density maps.

Guardian_guncrime_map_prop

Dropping the map form helps. Here is a draft of what I have in mind. I just pulled some data from online sources at the metropolitan area (MSA) level, and it doesn't have as striking a comparison as the city-level data, it seems.

Redo_guardiangundeathsprop

 

 PS. On Twitter, Aliza tells me the article was dated January 9, 2017.


Storm story, a masterpiece

The visual story published by the New York Times on hurricane Irma is a masterpiece. See the presentation here.

The story starts with the standard presentation of the trajectories of past hurricanes on a map:

Nyt_irma_map

Maps are great at conveying location and direction but much is lost in this rendering - wind speeds, time, strength, energy, to name but a few.

The Times then switches to other chart forms to convey some of the other data. A line chart is used to convey the strength of wind speeds as the storms shake through the Atlantic. Some kind of approximation is used to straighten the trajectories along an east-west orientation.

Nyt_irma_notime

The key insight here is how strong Irma was pretty far out in the Atlantic. The lines in the background can be brought to life by clicking on them. This view omits some details - the passage of time is ignored, and location has been reduced to one dimension.

The display then switches again, and this time it shows time and wind speed.

Nyt_irma_nolocation

This shows Irma's strength, sustaining Category 5 level winds for three days. This line chart ignores location completely.

Finally, a composite metric called cyclone energy is introduced.

Nyt_irma_energy

This chart also ignores location. It does show Irma as a special storm. The storm that has reached the maximum energy by far is Ivan. Will Irma beat that standard? I am not so sure.

Each chart form has limitations. The use of multiple charts helps convey a story from multiple perspectives. A very nice example indeed.