The windy path to the Rugby World Cup

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently being played in Japan:

1920px-2019_Rugby_World_Cup_Qualifying_Process_Diagram.svg

Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.

Rugby_World_Cup_2019_Qualification_illustrated_v2

(This one hasn't been updated and still contains blank entries.)

***

What are some of the key messages one would want the dataviz to deliver?

  • For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
  • For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
  • What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
  • How many countries had a chance to win one of the eight spots?
  • Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Were the losers immediately eliminated, or were they offered another chance?

Here's my take on this chart:

Rugby_path_to_world_cup_sm

 


The time of bird seeds and chart tuneups

The recent post about multi-national companies reminded me of an older post, in which I stepped through data table enhancements.

Here is a video of the process. You can use any tool to implement the steps; even Excel is good enough.

 

 

The video is part of a series called "Data science: the Missing Pieces". In these episodes, I cover the parts of data science that are between the cracks, the little things that textbooks and courses do not typically cover - the things that often block students from learning efficiently.

If you have encountered such things, please comment below to suggest future topics. What is something about visualizing data you wish you had learned formally?

***

P.S. Placed here to please the twitter-bot

DSTMP2_goodchart_thumb

 

 


Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.

Redo_multinat_1

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

Redo_multinat_2

The horizontal gridlines attached to the zero line can also be removed:

Redo_multinat_3

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.

Redo_multinat_4
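The re-ordering step is easy to reproduce in any tool. Here is a minimal Python sketch of it. The sector names (other than technology) and all the differentials are made-up placeholders, not the Economist's data; a negative value means multinationals under-performed local firms.

```python
# Hypothetical rows: (label, multinational ROE minus local-firm ROE, in points).
# These numbers are illustrative placeholders, not the chart's actual data.
rows = [
    ("Technology", 4.0),
    ("All sectors", 0.0),
    ("Finance", -3.5),
    ("Energy", -1.0),
    ("Health care", 1.5),
]

def reorder(rows, aggregate="All sectors"):
    """Pin the aggregate row first, then sort the rest by differential,
    from the largest under-performance (most negative) to the smallest."""
    head = [r for r in rows if r[0] == aggregate]
    rest = sorted((r for r in rows if r[0] != aggregate), key=lambda r: r[1])
    return head + rest

ordered = reorder(rows)
for label, diff in ordered:
    print(f"{label:12s} {diff:+.1f}")
```

Keeping the aggregate row out of the sort is the design choice that matters: it serves as the reference point, so it should not float around with the sectors.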

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Redo_multinat_5

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

Redo_multinat_6

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Redo_multinat_7

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

Redo_multinat_8

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

Redo_junkcharts_econmultinat

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.


Tennis greats at the top of their game

The following chart of world No. 1 tennis players looks pretty but the payoff of spending time to understand it isn't high enough. The light colors against the tennis net backdrop don't work as intended. The annotation is well done, and it's always neat to tuck a legend inside the text.

Tableautennisnumberones

The original is found at Tableau Public (link).

The topic of the analysis appears to be the ages at which tennis players attained world #1 ranking. Here are the male players visualized differently:

Redo_junkcharts_no1tennisplayers

Some players like Jimmy Connors and Federer have second springs after dominating the game in their late twenties. It's relatively rare for players to get to #1 after 30.


Announcement: Advancing your data skills, Fall 2019

Interrupting the flow of dataviz with the following announcement.

If you're looking to shore up your data skills, modernize your skill set, or know someone looking for hands-on, high-touch instruction in Machine Learning, R, Cloud Computing, Data Quality, Digital Analytics, A/B Testing and Financial Analysis, Principal Analytics Prep is offering evening classes this Fall. Click here to learn about our courses.

Our instructors are industry veterans with 10+ years of practical industry experience. And class size is capped to 10, ensuring a high-touch learning environment.

Facebook_pap_parttimeimmersive_tree

 


Choosing between individuals and aggregates

Friend/reader Thomas B. alerted me to this paper that describes some of the key chart forms used by cancer researchers.

It strikes me that many of the "new" charts plot granular data at the individual level. This heatmap of gene expressions shows one column per patient:

Jnci_genemap

This so-called swimmer plot shows one bar per patient:

Jnci_swimlanes

This spider plot shows the progression of individual patients over time. Key events are marked with symbols.

Jnci_spaghetti

These chart forms are distinguished from other ones that plot aggregated statistics: statistical averages, medians, subgroup averages, and so on.

One obvious limitation of such charts is their lack of scalability. The number of patients, the variability of the metric, and the timing of trends all drive up the amount of messiness.

I am left wondering what Question is being addressed by these plots. If we are concerned about treatment of an individual patient, then showing each line by itself would be clearer. If we are interested in the average trends of patients, then a chart that plots the overall average, or subgroup averages would be more accurate. If the interpretation of the individual's trend requires comparing with similar patients, then showing that individual's line against the subgroup average would be preferred.
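The last option, one individual set against the subgroup average, takes very little computation. The following is a minimal sketch; the patient IDs and values are invented for illustration (think of them as percent changes from baseline at four visits) and do not come from the paper.

```python
from statistics import mean

# Hypothetical percent changes from baseline at four visits.
# Patient IDs and values are invented for illustration only.
patients = {
    "P1": [0, -10, -20, -25],
    "P2": [0, 5, 15, 30],
    "P3": [0, -4, -16, -11],
}

def subgroup_average(series_by_patient):
    """Average across patients at each visit (assumes equal-length series)."""
    return [mean(values) for values in zip(*series_by_patient.values())]

def versus_subgroup(patient_id, series_by_patient):
    """Pair one patient's values with the subgroup average at each visit."""
    avg = subgroup_average(series_by_patient)
    return list(zip(series_by_patient[patient_id], avg))

average = subgroup_average(patients)
print(average)
print(versus_subgroup("P2", patients))
```

Plotting only the two series returned by `versus_subgroup` gives the reader the comparison directly, instead of asking them to average a tangle of lines by eye.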

When shown these charts of individual lines, readers are tempted to play the statistician - without using appropriate tools! Readers draw aggregate conclusions, performing the aggregation in their heads.

The authors of the paper note: "Spider plots only provide good visual qualitative assessment but do not allow for formal statistical inference." I agree with the second part. The first part is a fallacy - if the visual qualitative assessment is good enough, then no formal inference is necessary! The same argument is often made when people say they don't need advanced analysis because their simple analysis is "directionally accurate". When is something "directionally inaccurate"? How would one know?

Reference: Chia, Gedye, et al., "Current and Evolving Methods to Visualize Biological Data in Cancer Research", JNCI, 2016, 108(8). (link)

***

Meteorologists, whom I featured in the previous post, also have their own spider-like chart for hurricanes. They call it a spaghetti map:

Dorian_spaghetti

Compare this to the "cone of uncertainty" map that was featured in the prior post:

AL052019_5day_cone_with_line_and_wind

These two charts build upon the same dataset. The cone map, as we discussed, shows the range of probable paths of the storm center, based on all simulations of all acceptable models for projection. The spaghetti map shows selected individual simulations. Each line is the most likely trajectory of the storm center as predicted by a single simulation from a single model.

The problem is that each predictive model type has its own historical accuracy (known as "skill"), and so the lines embody different levels of importance. Further, it's not immediately clear if all possible lines are drawn, so any reader drawing conclusions about, say, the envelope containing x percent of these lines is likely to be fooled. Eyeballing the "cone" that contains x percent of the lines is not trivial either. We tend to naturally drift toward aggregate statistical conclusions without the benefit of appropriate tools.

Plots of individuals should be used to address the specific problem of assessing individuals.


Blog receives a facelift

Junkcharts_newdesign_2019


After a number of years, I finally took time this long weekend to refresh the blog design. I hope you like it.

The key changes are:

  • This design is responsive, so mobile users should have a better experience.
  • The Welcome message that was pinned to the top has been moved to the top navigation menu. Someone complained about that a long time ago, and I can finally say it's now fixed.
  • The Search box is shown at the top (for non-mobile users), which is another request from some time ago.
  • Many of the links on the side have been updated or made secure.

Comment below if you encounter any problems, especially if you're using mobile.


As Dorian confounds meteorologists, we keep our minds clear on hurricane graphics, and discover correlation as our friend

As Hurricane Dorian threatens the southeastern coast of the U.S., forecasters are fretting about the lack of consensus among the various models used to predict the storm’s trajectory. The uncertainty of these models, as reflected in graphical displays, has been a controversial issue in the visualization community for some time.

Let’s start by reviewing a visual design that has captivated meteorologists in recent years, something known as the cone map.

Charley_oldconemap

If asked to explain this map, most of us would read the line through the middle of the cone as the path of the center of the storm, the “cone” as the areas near the storm center that are affected, and the warmer colors (red, orange) as indicating higher levels of impact. [Note: This is the design for this type of map circa the 2000s.]

The above interpretation is complete, and feasible. Nevertheless, the data used to make the map are forward-looking, not historical. It is still possible to stick to the same interpretation by substituting historical measurement of impact with its projection. As such, the “warmer” regions are projected to suffer worse damage from the storm than the “cooler” regions (yellow).

After I replace the text that was removed from the map (see below), you may notice the color legend, which discloses that the colors on the map encode probabilities, not storm intensity. The text further explains that the chart shows the most probable path of the center of the storm – while the coloring shows the probability that the storm center will reach specific areas.

Charley_oldconemap

***

When reading a data graphic, we rarely first look for text about how to read the chart. In the case of the cone map, those who didn’t seek out the instructions may form one of these misunderstandings:

  1. For someone living in the yellow-shaded areas, the map does not say that the impact of the storm is projected to be lighter; it’s that the center of the storm has a lower chance of passing right through. If, however, the storm does pay a visit, the intensity of the winds will reach hurricane grade.
  2. For someone living outside the cone, the map does not say that the storm will definitely bypass you; it’s that the chance of a direct hit is below the threshold needed to show up on the cone map. The threshold is set to attain 66% accuracy. The actual paths of storms are expected to stay inside the cone two out of three times.
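To make the two-out-of-three idea concrete, here is a minimal sketch of one way such a threshold could be set: choose the radius that would have contained about two-thirds of past forecast errors. The error values are invented, and the procedure is an assumption for illustration, not a statement of how forecasters build the cone.

```python
import math

def cone_radius(errors, coverage=2 / 3):
    """Smallest radius that contains at least `coverage` of the past errors."""
    ordered = sorted(errors)
    k = math.ceil(coverage * len(ordered))  # how many errors must fall inside
    return ordered[k - 1]

# Hypothetical past errors in the predicted storm-center position
# (nautical miles) at a single lead time. Invented for illustration.
past_errors = [20, 35, 40, 55, 60, 75, 90, 120, 150, 200]

radius = cone_radius(past_errors)
inside = sum(e <= radius for e in past_errors) / len(past_errors)
print(radius, inside)
```

By construction, about a third of the past errors fall outside the chosen radius, which is why living outside the cone is no guarantee of being bypassed.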

Adding to the confusion, other designers have produced cone maps in which color is encoding projections of wind speeds. Here is the one for Dorian.

AL052019_wind_probs_64_F120

This map displays essentially what we thought the first cone map was showing.

One way to differentiate the two maps is to roll time forward, and imagine what the maps should look like after the storm has passed through. In the wind-speed map (shown below right), we will see a cone of damage, with warmer colors indicating regions that experienced stronger winds.

Projectedactualwinds_irma

In the storm-center map (below right), we should see a single curve, showing the exact trajectory of the center of the storm. In other words, the cone of uncertainty dissipates over time, just like the storm itself.

Projectedactualstormcenter_irma

 

After scientists learned that readers were misinterpreting the cone maps, they started to issue warnings, and also re-designed the cone map. The cone map now comes with a black-box health warning right up top. Also, in the storm-center cone map, color is no longer used. The National Hurricane Center even made a YouTube video pointing out the dos and don’ts of using the cone map.

AL052019_5day_cone_with_line_and_wind

***

The conclusion drawn from misreading the cone map isn’t as devastating as it’s made out to be. This is because the two issues are correlated. Since wind speeds are likely to be stronger nearer to the center of the storm, if one lives in a region that has a low chance of being a direct hit, then that region is also likely to experience lower average wind speeds than those nearer to the projected center of the storm’s path.

Alberto Cairo has written often about these maps, and in his upcoming book, How Charts Lie, there is a nice section addressing his work with colleagues at the University of Miami on improving public understanding of these hurricane graphics. I highly recommended Cairo’s book here.

P.S. [9/5/2019] Alberto also put out a post about the hurricane cone map.

 

 

 


Water stress served two ways

Via Alberto Cairo (whose new book How Charts Lie can be pre-ordered!), I found the Water Stress data visualization by the Washington Post. (link)

The main interest here is how they visualized the different levels of water stress across the U.S. Water stress is some metric defined by the World Resources Institute that, to my mind, measures the demand versus supply of water. The higher the water stress, the higher the risk of experiencing droughts.

There are two ways in which the water stress data are shown: the first is a map, and the second is a bubble plot.

Wp_waterstress

This project provides a great setting to compare and contrast these chart forms.

How Data are Coded

In a map, the data are usually coded as colors. Sometimes, additional details can be coded as shades, or moire patterns within the colors. But the map form locks down a number of useful dimensions - including x and y location, size and shape. The outline map reserves all these dimensions, rendering them unavailable to encode data.

By contrast, the bubble plot admits a good number of dimensions. The key ones are the x- and y- location. Then, you can also encode data in the size of the dots, the shape, and the color of the dots.

In our map example, the colors encode the water stress level, and a moire pattern encodes "arid areas". For the scatter plot, x = daily water use, y = water stress level, grouped by magnitude, color = water stress level, size = population. (Shape is constant.)

Spatial Correlation

The map is far superior in displaying spatial correlation. It's visually obvious that the southwestern states experience higher stress levels.

This spatial knowledge is relinquished when using a bubble plot. The designer relies on the knowledge of the U.S. map in the head of the readers. It is possible to code this into one of the available dimensions, e.g. one could make x = U.S. regions, but another variable is sacrificed.

Non-contiguous Spatial Patterns

When spatial patterns are contiguous, the map functions well. Sometimes, spatial patterns are disjoint. In that case, the bubble plot, which de-emphasizes the physical locations, can be superior. In our example, the vertical axis divides the states into five groups based on their water stress levels. Try figuring out which states are "medium to high" water stress from the map, and you'll see the difference.

Finer Geographies

The map handles finer geographical units like counties and precincts better. It's completely natural.

In the bubble plot, shifting to finer units causes the number of dots to explode. This clutters up the chart. Besides, while most (we hope) Americans know the 50 states, most of us can't recite counties or precincts. Thus, the designer can't rely on knowledge in our heads. It would be impossible to learn spatial patterns from such a chart.

***

The key, as always, is to nail down your message, then select the right chart form.

 

 


Women workers taken for a loop or four

I was drawn to the following chart in Business Insider because of the calendar metaphor. (The accompanying article is here.)

Businessinsider_payday

Sometimes, the calendar helps readers grasp concepts faster but I'm afraid the usage here slows us down.

The underlying data consist of just four numbers: the wage gaps between race and gender in the U.S., considered simply from an aggregate median personal income perspective. The analyst adopts the median annual salary of a white male worker as a baseline. Then, s/he imputes the number of extra days that others must work to attain the same level of income. For example, the median Asian female worker must work 64 extra days (at her daily salary level) to match the white guy's annual pay. Meanwhile, Hispanic female workers must work 324 days extra.
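The imputation can be written out in a few lines. This is a sketch of the arithmetic as I understand it, not the analyst's actual code: it assumes every calendar day counts as a workday and a 365-day year, and the $23,000 income is a hypothetical value chosen to make the result easy to check. The $46,000 baseline is the rough Census figure for white men cited in this post.

```python
def extra_days(baseline_income, income, days_per_year=365):
    """Extra days needed, at one's own daily pay, to reach the baseline's annual income."""
    daily_pay = income / days_per_year
    return baseline_income / daily_pay - days_per_year

def implied_income(baseline_income, extra, days_per_year=365):
    """Inverse: the annual income implied by a given number of extra days."""
    return baseline_income * days_per_year / (days_per_year + extra)

baseline = 46_000  # rough median for white male workers cited in the post

# A hypothetical worker earning half the baseline needs one full extra year.
print(round(extra_days(baseline, 23_000)))

# Incomes implied by the extra-day counts quoted above (64 and 324 days).
print(round(implied_income(baseline, 64)))
print(round(implied_income(baseline, 324)))
```

Running the inverse shows how the extra-day metric stretches the gap: the number of days grows faster than the income shortfall because the catching up is done at the lower daily pay.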

There are a host of reasons why the calendar metaphor backfired.

Firstly, it draws attention to an uncomfortable detail of the analysis - which papers over the fact that weekends or public holidays are counted as workdays. The coloring of the boxes compounds this issue. (And the designer also got confused and slipped up when applying the purple color for Hispanic women.)

Secondly, the calendar focuses on Year 2 while Year 1 lurks in the background - white men have to work to get that income (roughly $46,000 in 2017 according to the Census Bureau).

Thirdly, the calendar view exposes another sore point around the underlying analysis. In reality, the white male workers are continuing to earn wages during Year 2.

The realism of the calendar clashes with the hypothetical nature of the analysis.

***

One can just use a bar chart, comparing the number of extra days needed. The calendar design can be considered a set of overlapping bars, wrapped around the shape of a calendar.

The staid bars do not bring to life the extra toil - the message is that these women have to work harder to get the same amount of pay. This led me to a different metaphor - the white men got to the destination in a straight line but the women must go around loops (extra days) before reaching the same endpoint.

Redo_businessinsider_racegenderpaygap

While the above is a rough sketch, I made sure that the total length of the lines including the loops roughly matches the total number of days the women needed to work to earn $46,000.

***

The above discussion focuses solely on the V(isual) corner of the Trifecta Checkup, but this data visualization is also interesting from the D(ata) perspective. Statisticians won't like such a simple analysis that ignores, among other things, the different mix of jobs and industries underlying these aggregate pay figures.

Now go to my other post on the sister (book) blog for a discussion of the underlying analysis.