## The windy path to the Rugby World Cup

##### Sep 26, 2019

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently playing in Japan:

Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.

(This one hasn't been updated and still contains blank entries.)

***

What are some of the key messages one would want the dataviz to deliver?

• For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
• For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
• What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
• How many countries had a chance to win one of the eight spots?
• Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Did the losers immediately disqualify, or were they offered another chance?

Here's my take on this chart:

## The time of bird seeds and chart tuneups

##### Sep 25, 2019

The recent post about multi-national companies reminded me of an older post, in which I stepped through data table enhancements.

Here is a video of the process. You can use any tool to implement the steps; even Excel is good enough.

The video is part of a series called "Data science: the Missing Pieces". In these episodes, I cover the parts of data science that are between the cracks, the little things that textbooks and courses do not typically cover - the things that often block students from learning efficiently.

If you have encountered such things, please comment below to suggest future topics. What is something about visualizing data you wish you learned formally?

***

P.S. Placed here to please the twitter-bot

## Pulling the multi-national story out, step by step

##### Sep 18, 2019

Reader Aleksander B. found this Economist chart difficult to understand.

Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?

***

We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:

The horizontal gridlines attached to the zero line can also be removed:

Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.

The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:

Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:

For what it's worth, re-sort the sectors from largest to smallest share of multinationals:

Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:

***

To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)

***

Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.

## Tennis greats at the top of their game

##### Sep 12, 2019

The following chart of world No. 1 tennis players looks pretty but the payoff of spending time to understand it isn't high enough. The light colors against the tennis net backdrop don't work as intended. The annotation is well done, and it's always neat to tug a legend inside the text.

The original is found at Tableau Public (link).

The topic of the analysis appears to be the ages at which tennis players attained world #1 ranking. Here are the male players visualized differently:

Some players like Jimmy Connors and Federer have second springs after dominating the game in their late twenties. It's relatively rare for players to get to #1 after 30.

## Announcement: Advancing your data skills, Fall 2019

##### Sep 10, 2019

Interrupting the flow of dataviz with the following announcement.

If you're looking to shore up your data skills, modernize your skill set, or know someone looking for hands-on, high-touch instruction in Machine Learning, R, Cloud Computing, Data Quality, Digital Analytics,  A/B Testing and Financial Analysis, Principal Analytics Prep is offering evening classes this Fall. Click here to learn about our courses.

Our instructors are industry veterans with 10+ years of practical industry experience. And class size is capped to 10, ensuring a high-touch learning environment.

## Choosing between individuals and aggregates

##### Sep 05, 2019

Friend/reader Thomas B. alerted me to this paper that describes some of the key chart forms used by cancer researchers.

It strikes me that many of the "new" charts plot granular data at the individual level. This heatmap showing gene expressions show one column per patient:

This so-called swimmer plot shows one bar per patient:

This spider plot shows the progression of individual patients over time. Key events are marked with symbols.

These chart forms are distinguished from other ones that plot aggregated statistics: statistical averages, medians, subgroup averages, and so on.

One obvious limitation of such charts is their lack of scalability. The number of patients, the variability of the metric, and the timing of trends all drive up the amount of messiness.

I am left wondering what Question is being addressed by these plots. If we are concerned about treatment of an individual patient, then showing each line by itself would be clearer. If we are interested in the average trends of patients, then a chart that plots the overall average, or subgroup averages would be more accurate. If the interpretation of the individual's trend requires comparing with similar patients, then showing that individual's line against the subgroup average would be preferred.

When shown these charts of individual lines, readers are tempted to play the statistician - without using appropriate tools! Readers draw aggregate conclusions, performing the aggregation in their heads.

The authors of the paper note: "Spider plots only provide good visual qualitative assessment but do not allow for formal statistical inference." I agree with the second part. The first part is a fallacy - if the visual qualitative assessment is good enough, then no formal inference is necessary! The same argument is often made when people say they don't need advanced analysis because their simple analysis is "directionally accurate". When is something "directionally inaccurate"? How would one know?

Reference: Chia, Gedye, et. al., "Current and Evolving Methods to Visualize Biological Data in Cancer Research", JNCI, 2016, 108(8). (link)

***

Meteoreologists, whom I featured in the previous post, also have their own spider-like chart for hurricanes. They call it a spaghetti map:

Compare this to the "cone of uncertainty" map that was featured in the prior post:

These two charts build upon the same dataset. The cone map, as we discussed, shows the range of probable paths of the storm center, based on all simulations of all acceptable models for projection. The spaghetti map shows selected individual simulations. Each line is the most likely trajectory of the storm center as predicted by a single simulation from a single model.

The problem is that each predictive model type has its own historical accuracy (known as "skill"), and so the lines embody different levels of importance. Further, it's not immediately clear if all possible lines are drawn so any reader making conclusions of, say, the envelope containing x percent of these lines is likely to be fooled. Eyeballing the "cone" that contains x percent of the lines is not trivial either. We tend to naturally drift toward aggregate statistical conclusions without the benefit of appropriate tools.

Plots of individuals should be used to address the specific problem of assessing individuals.

## Blog receives a facelift

##### Sep 04, 2019

After a number of years, I finally took time this long weekend to refresh the blog design. I hope you like it.

The key changes are:

• This design is responsive, so mobile users should have a better experience.
• The Welcome message that was pinned to the top has been moved to the top navigation menu. Someone complained about that a long time ago, and I can finally say it's now fixed.
• The Search box is shown at the top (for non-mobile users), which is another request from some time ago.
• Many of the links on the side have been updated or made secure.

Comment below if you encounter any problems, especially if you're using mobile.

## As Dorian confounds meteorologists, we keep our minds clear on hurricane graphics, and discover correlation as our friend

##### Sep 03, 2019

As Hurricane Dorian threatens the southeastern coast of the U.S., forecasters are fretting about the lack of consensus among various predictive models used to predict the storm’s trajectory. The uncertainty of these models, as reflected in graphical displays, has been a controversial issue in the visualization community for some time.

Let’s start by reviewing a visual design that has captured meteorologists in recent years, something known as the cone map.

If asked to explain this map, most of us trace a line through the middle of the cone understood to be the center of the storm, the “cone” as the areas near the storm center that are affected, and the warmer colors (red, orange) as indicating higher levels of impact. [Note: We will  design for this type of map circa 2000s.]

The above interpretation is complete, and feasible. Nevertheless, the data used to make the map are forward-looking, not historical. It is still possible to stick to the same interpretation by substituting historical measurement of impact with its projection. As such, the “warmer” regions are projected to suffer worse damage from the storm than the “cooler” regions (yellow).

After I replace the text that was removed from the map (see below), you may notice the color legend, which discloses that the colors on the map encode probabilities, not storm intensity. The text further explains that the chart shows the most probable path of the center of the storm – while the coloring shows the probability that the storm center will reach specific areas.

***

When reading a data graphic, we rarely first look for text about how to read the chart. In the case of the cone map, those who didn’t seek out the instructions may form one of these misunderstandings:

1. For someone living in the yellow-shaded areas, the map does not say that the impact of the storm is projected to be lighter; it’s that the center of the storm has a lower chance of passing right through. If, however, the storm does pay a visit, the intensity of the winds will reach hurricane grade.
2. For someone living outside the cone, the map does not say that the storm will definitely bypass you; it’s that the chance of a direct hit is below the threshold needed to show up on the cone map. Thee threshold is set to attain 66% accurate. The actual paths of storms are expected to stay inside the cone two out of three times.

Adding to the confusion, other designers have produced cone maps in which color is encoding projections of wind speeds. Here is the one for Dorian.

This map displays essentially what we thought the first cone map was showing.

One way to differentiate the two maps is to roll time forward, and imagine what the maps should look like after the storm has passed through. In the wind-speed map (shown below right), we will see a cone of damage, with warmer colors indicating regions that experienced stronger winds.

In the storm-center map (below right), we should see a single curve, showing the exact trajectory of the center of the storm. In other words, the cone of uncertainty dissipates over time, just like the storm itself.

After scientists learned that readers were misinterpreting the cone maps, they started to issue warnings, and also re-designed the cone map. The cone map now comes with a black-box health warning right up top. Also, in the storm-center cone map, color is no longer used. The National Hurricane Center even made a youtube pointing out the dos and donts of using the cone map.

***

The conclusion drawn from misreading the cone map isn’t as devastating as it’s made out to be. This is because the two issues are correlated. Since wind speeds are likely to be stronger nearer to the center of the storm, if one lives in a region that has a low chance of being a direct hit, then that region is also likely to experience lower average wind speeds than those nearer to the projected center of the storm’s path.

Alberto Cairo has written often about these maps, and in his upcoming book, How Charts Lie, there is a nice section addressing his work with colleagues at the University of Miami on improving public understanding of these hurricane graphics. I highly recommended Cairo’s book here.

P.S. [9/5/2019] Alberto also put out a post about the hurricane cone map.