Did the pandemic drive mass migration?

The Wall Street Journal ran this nice compact piece about migration patterns during the pandemic in the U.S. (link to article)

Wsj_migration

I'd look at the chart on the right first. It shows the greatest net flow of people out of the Northeast to the South. This sankey diagram is nicely done. The designer shows restraint in not printing the entire dataset on the chart. If a reader really cares about the net migration from one region to a specific other region, it's easy to estimate the number even though it's not printed.

The maps succinctly provide readers the definition of the regions.

To keep things in perspective, we are talking about around 100,000 people when the death toll of Covid-19 is nearing 600,000. Some people have moved but almost everyone else hasn't.

***

The chart on the left breaks down the data in a different way - by urbanicity. This is a variant of the stacked column chart. It is a chart form that fits the particular instance of the dataset. It works only because in every month of the last three years, there was a net outflow from "large metro cores". Thus, the entire series for large metro cores can be pointed downwards.

The fact that this design is sensitive to the dataset is revealed in the footnote, which said that the May 2018 data for "small/medium metro" was omitted from the chart. Why didn't they plot that number?

It's the one datum that sticks out like a sore thumb. It's the only negative number in the entire dataset that is not associated with "large metro cores". I suppose they could have inserted a tiny medium green sliver in the bottom half of that chart for May 2018. I don't think it hurts the interpretation of the chart. Maybe the designer thinks it might draw unnecessary attention to one data point that really doesn't warrant it.

***

See my collection of posts about Wall Street Journal graphics.


Illustrating differential growth rates

Reader Mirko was concerned about a video published in Germany that shows why the new coronavirus variant is dangerous. He helpfully provided a summary of the transcript:

The South African and the British mutations of the SARS-CoV-2 virus are spreading faster than the original virus. On average, one infected person infects more people than before. Researchers believe the new variant is 50 to 70% more transmissible.

Here are two key moments in the video:

Germanvid_newvariant1

This seems to be saying the original virus (left side) replicates 3 times inside the infected person while the new variant (right side) replicates 19 times. So we have a roughly 6-fold jump in viral replication.

Germanvid_newvariant2

Later in the video, it appears that every replicate of the old virus finds a new victim while the 19 replicates of the new variant land on 13 new people, meaning 6 replicates didn't find a host.

As Mirko pointed out, the visual appears to have run away from the data. (In our Trifecta Checkup, we have a problem with the arrow between the D and the V corners. What the visual is saying is not aligned with what the data are saying.)

***

It turns out that the scientists have been very confusing when talking about the infectiousness of this new variant. The most quoted line is that the British variant is "50 to 70 percent more transmissible". At first, I thought this was a comment on the famous "R number". Since the R number around December was roughly 1 in the U.K., the new variant might bring the R number up to 1.7.

However, that is not the case. From this article, it appears that being 50 to 70 percent more transmissible means R goes up from 1 to 1.4. R is interpreted as the average number of people infected by one infected person.

Mirko wonders if there is a better way to illustrate this. I'm sure there are many better ways. Here's one I whipped up:

Junkcharts_redo_germanvideo_newvariant

The left side is for the 40% higher R number. Both sides start at the center with 10 infected people. At each time step, if R=1 (right side), each of the 10 people infects one other person, so the total infections increase by 10 per time step. It's immediately obvious that a 40% higher R is very serious indeed. Starting with 10 infected people, in 10 steps, the total number of infections is almost 1,000, roughly nine times higher than when R is 1.

The lines of the graphs simulate the transmission chains. These are "average" transmission chains since R is an average number.
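The arithmetic behind this redo can be checked with a few lines of Python. This is only a minimal sketch, assuming a deterministic "average" chain in which every newly infected person infects exactly R others at the next time step; the function name and default arguments are my own, chosen for illustration.

```python
def total_infections(r, initial=10, steps=10):
    """Cumulative infections after `steps` time steps of an average chain.

    Assumes every newly infected person infects exactly `r` others in the
    next time step (a deterministic simplification, since R is an average).
    """
    new_cases = float(initial)
    total = float(initial)
    for _ in range(steps):
        new_cases *= r      # infections caused by the previous generation
        total += new_cases
    return total


baseline = total_infections(1.0)   # R = 1: 10 new infections per step
variant = total_infections(1.4)    # R is 40% higher

print(f"R=1.0: {baseline:.0f} total infections")
print(f"R=1.4: {variant:.0f} total infections")
print(f"ratio: {variant / baseline:.1f}x")
```

The totals include the 10 people at the center of the chart. A real epidemic is noisier than this, which is why the lines on the chart are best read as average chains.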

 

P.S. [1/29/2021: Added the missing link to the article in which it is reported that 50-70 percent more transmissible implies R increasing by 40%.]

 

 


Unlocking the secrets of a marvellous data visualization

Scmp_coronavirushk_paper

The graphics team at my hometown paper, SCMP, has developed a formidable reputation in data visualization, and I lapped up every drop of goodness in this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.

Scmp_coronavirushk_middle

The two major classes then occupy one half page each. I first looked at the top half, where my attention is drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, Middle East, Africa, and Asia.

Junkcharts_scmpcoronavirushk_americas1

Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.

Junkcharts_scmpcoronavirushk_bolivia

One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.

Junkcharts_scmpcoronavirushk_localband

Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.

***

This data graphic presents a perfect amalgam of art and science. For a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: the curvature and lengths. The order of the countries and regions takes up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 

 


The windy path to the Rugby World Cup

When I first saw the following chart, I wondered whether it is really that challenging for these eight teams to get into the Rugby World Cup, currently being played in Japan:

1920px-2019_Rugby_World_Cup_Qualifying_Process_Diagram.svg

Another visualization of the process conveys a similar message. Both of these are uploaded to Wikipedia.

Rugby_World_Cup_2019_Qualification_illustrated_v2

(This one hasn't been updated and still contains blank entries.)

***

What are some of the key messages one would want the dataviz to deliver?

  • For the eight countries that got in (not automatically), track their paths to the World Cup. How many competitions did they have to play?
  • For those countries that failed to qualify, track their paths to the point that they were stopped. How many competitions did they play?
  • What is the structure of the qualification rounds? (These are organized regionally, in addition to certain playoffs across regions.)
  • How many countries had a chance to win one of the eight spots?
  • Within each competition, how many teams participated? Did the winner immediately qualify, or face yet another hurdle? Were the losers immediately eliminated, or were they offered another chance?

Here's my take on this chart:

Rugby_path_to_world_cup_sm

 


The ebb and flow of an effective dataviz showing the rise and fall of GE

Wsj_ebbflowGE_800

A WSJ chart caught my eye the other day; I spotted someone looking at it in a coffee shop, and immediately got hold of a copy. The chart plots the ebb and flow of GE’s revenues from the 1980s to the present.

What grabbed my attention? The less-used chart form, and the appealing but not too gaudy color scheme.

The chart presents a highly digestible view of the structure of GE’s revenues. We learn about GE’s major divisions, as well as how certain segments split from or merged with others over time. Major acquisitions and divestitures are also depicted; if these events are the main focus, the designer should find ways to make these moments stand out more.

An interesting design decision concerns the sequence of the divisions. One possible order is by increasing or decreasing importance, typically indicated by proportional revenues. This is complicated by the changing nature of the business over the decades. So financial services went from nothing to the largest division by far to almost disappearing.

The sequencing need not be data-driven; it can be design-constrained. The merging and splitting of business units are conveyed via linking arrows. Longer arrows are unsightly, and meshes of arrows are confusing.

On this chart, the long arrow pointing from the orange to the gray around 2004 feels out of place. What if the financial services block is moved to the right of the consumer block? That will significantly shorten the long arrow. It won’t create other entanglements as the media block is completely disjoint and there are no other arrows tying financial services to another division.

 

***


To improve readability, the bars are spaced out horizontally. The addition of whitespace distorts the proportionality. So, in 2001, the annotation states that financial services (orange) accounted for “about half of the revenues,” which is directly contradicted by the visual perception – readers find the orange bar to be clearly shorter than the total length of the other bars. This is a serious deficiency of the chart form but this chart conveys the "ebb and flow" very well.


The merry-go-round of investment bankers

Here is the start of my blog post about the chart I teased the other day:

Businessinsider_ibankers

 

Today's post deals with the following chart, which appeared recently at Business Insider (hat tip: my sister).

It's immediately obvious that this chart requires a heroic effort to decipher. The question shown in the chart title "How many senior investment bankers left their firms?" is the easiest to answer, as the designer places the number of exits in the central circle of each plot relating to a top-tier investment bank (aka "featured bank"). Note that the visual design plays no role in delivering the message, as readers just scan the data from those circles.

Anyone persistent enough to explore the rest of the chart will eventually discover these features...

***

The entire post including an alternative view of the dataset is a guest blog at the JMP Blog here. This is a situation in which plotting everything will make an unreadable chart, and the designer has to think hard about what s/he is really trying to accomplish.


Made in France stereotypes

France is on my mind lately, as I prepare to bring my dataviz seminar to Lyon in a couple of weeks.  (You can still register for the free seminar here.)

The following Made in France poster brings out all the stereotypes of the French.

Made_in_france_small

(You can download the original PDF here.)

It's a sankey diagram with so many flows that it screams "it's complicated!" This is an example of a graphic for want of a story. In a Trifecta Checkup, it's failing in the Q(uestion) corner.

It's also failing in the D(ata) corner. Take a look at the top of the chart.

Madeinfrance_totalexports

France exported $572 billion worth of goods. The diagram then plots eight categories of exports, ranging from wines to cheeses:

Madeinfrance_exportcategories

Wine exports totaled $9 billion which is about 1.6% of total exports. That's the largest category of the eight shown on the page. Clearly the vast majority of exports are excluded from the sankey diagram.

Are the 8 the largest categories of exports for France? According to this site, those are (1) machinery (2) aircraft (3) vehicles (4) electrical machinery (5) pharmaceuticals (6) plastics (7) beverages, spirits, vinegar (8) perfumes, cosmetics.

Compare: (1) wines (2) jewellery (3) perfume (4) clothing (5) cheese (6) baked goods (7) chocolate (8) paintings.

It's stereotype central. Name 8 things associated with the French brand and cherry-pick those.

Within each category, the diagram does not show all of the exports either. It discloses that the bars for wines show only $7 of the $9 billion worth of wines exported. This is because the data only capture the "Top 10 Importers." (See below for why the designer did this... France exports wine to more than 180 countries.)

Finally, look at the parade of key importers of French products, as shown at the bottom of the sankey:

Madeinfrance_topimporters

The problem with interpreting this list of countries is best felt by attempting to describe which countries ended up on this list! It's the list of countries that belong to the top 10 importers of one or more of the eight chosen products, ordered by the total value of imports in those 8 categories only but only including the value in any category if it rises to the top 10 of the respective category.

In short, with all those qualifications, the size or rank of the black bars does not convey any useful information.
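To see how much the top-10 filter can distort the ranking, here is a minimal sketch. The import values below are invented purely for illustration (they are not from the poster or from Comtrade), and I use a top-2 cutoff instead of top-10 to keep the toy example small.

```python
from collections import defaultdict

# Hypothetical import values (in $ millions) by product and country.
# These numbers are made up solely to illustrate the selection rule.
imports = {
    "wine":   {"USA": 900, "UK": 800, "Germany": 700, "Italy": 50},
    "cheese": {"Germany": 400, "Italy": 300, "UK": 250, "USA": 20},
}
TOP_N = 2  # the poster uses 10; 2 keeps the toy example readable


def poster_ranking(imports, top_n):
    """Rank countries, counting a category only if the country is in its top N."""
    totals = defaultdict(int)
    for by_country in imports.values():
        top = sorted(by_country.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for country, value in top:
            totals[country] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def full_ranking(imports):
    """Rank countries by total imports across all the listed categories."""
    totals = defaultdict(int)
    for by_country in imports.values():
        for country, value in by_country.items():
            totals[country] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


print("poster rule:", poster_ranking(imports, TOP_N))
print("all values: ", full_ranking(imports))
```

In this made-up example, Germany is the largest importer overall but lands in third place under the poster's rule, because its wine imports just miss the cutoff and are dropped entirely.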

***

One feature of the chart that surprised me was the absence of flows in the Wine category from France to Italy or Spain. (Based on the above discussion, you should realize that no flows does not mean no exports.) So I went to the Comtrade database that is referenced in the poster, and pulled out all the wine export data.

How does one visualize where French wines are going? After fiddling around with the numbers, I came up with the following diagram:

Redo_jc_frenchwineexports

I like this type of block diagram which brings out the structure of the dataset. The key features are:

  • Total wine exports to the rest of the world were 1.4 billion litres in 2016
  • Half of it went to five European neighbors, the other half to the rest of the world
  • On the left half, Germany took a third of those exports; the UK and Switzerland together took another third; and the final third went to Belgium and the Netherlands
  • On the right half, the countries in the blue zone accounted for three-fifths with the unspecified countries taking two-fifths.
  • As indicated, the two-fifths (in gray) represent 20% of total wine exports, and were spread out among over 180 countries.
  • The three-fifths of the blue zone were split in half, with the first half going to North America (about 2/3 to USA and 1/3 to Canada) and the second half going to Asia (2/3 to China and 1/3 to Japan)
  • As the title indicates, the top 9 importers of French wine covered 80% of the total volume (in litres) while the other 180+ countries took 20% of the volume
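The nested halves, thirds and fifths in the list above multiply out to simple shares of the total. The sketch below does that arithmetic with exact fractions; the shares are the rounded ones described in the bullets, not the underlying Comtrade figures, and the variable names are mine.

```python
from fractions import Fraction as F

europe = F(1, 2)            # five European neighbors
rest = 1 - europe           # the rest of the world

shares = {
    "Germany": europe * F(1, 3),
    "UK + Switzerland": europe * F(1, 3),
    "Belgium + Netherlands": europe * F(1, 3),
}

blue = rest * F(3, 5)       # named countries outside Europe (blue zone)
other = rest * F(2, 5)      # the 180+ unspecified countries (gray)

north_america = blue / 2
asia = blue / 2
shares.update({
    "USA": north_america * F(2, 3),
    "Canada": north_america * F(1, 3),
    "China": asia * F(2, 3),
    "Japan": asia * F(1, 3),
    "Other countries": other,
})

# Everything except the gray block belongs to the nine named importers.
top9 = sum(v for k, v in shares.items() if k != "Other countries")

for name, share in shares.items():
    print(f"{name:22s} {float(share):6.1%}")
print(f"{'Top 9 importers':22s} {float(top9):6.1%}")
```

The nine named countries come out at four-fifths of the total and the gray block at one-fifth, which matches the 80/20 split in the chart title.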

 The most time-consuming part of this exercise was finding the appropriate structure which can be easily explained in a visual manner.

 

 


Visualizing movements of people

Long-time reader Daniel L. sends in this chart illustrating a large data set of interstate migration flows in the U.S. The original chart is at Vizynary by way of Daily Kos.

Viznary_migration1

***

There is no denying that this chart is beautiful to look at. But what is its message? That there are people migrating from and to every state? (assuming all fifty states are present)

Daily Kos describes how one can hover over any state to see its individual patterns. Something like this:

Viznary_migrationFL

This is a great way, perhaps the only way, to consume the chart. Essentially, the reader is asked to generate a small-multiples panel of charts. The chart does a better job at showing the pairs of states between which people migrate than at showing the relative size of the flows. The size of the flows is coded in the width of the arcs. The widths are too similar to tell apart; and it doesn't help that no legend is provided.

The choice of color is curious. Each region of the country is its own color, in a "nominal" way. It is a design decision to emphasize regions.

Another decision is to hide information on the distances of the migrations. Evidently, the designer sacrificed that information in order to create the neat circular arrangement of states.

A shortcoming of this representation is one missing dimension: the direction of the flow. Given any pair of states A and B, I can't tell whether the net migration is into A or into B.

***

I propose a solution using the map while preserving the interactive element of the original.

On this map, when you hover over a particular state, it highlights all other states for which there are migration flows into or out of that state. For color, use a blue-white-red scheme with blue indicating net inflow, red indicating net outflow, and white for near-zero flows. Include a legend.

Another important decision for the designer is absolute versus relative scales. In an absolute scheme, you rank the entire set of flows for all pairs of states; obviously, the resulting colors would be influenced by the state populations. Alternatively, you rank the flow sizes within each state; in this case, the smaller states will feel exaggerated.
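Here is a rough sketch of the two scaling choices. The flow counts are hypothetical, invented for illustration, and for simplicity I scale net flows linearly rather than ranking them; the color is expressed as a number from -1 (full red, net outflow) through 0 (white) to +1 (full blue, net inflow).

```python
# Hypothetical gross flows: (origin, destination) -> number of movers.
# These counts are invented; they are not from the Vizynary data.
flows = {
    ("NY", "FL"): 60000, ("FL", "NY"): 25000,
    ("NY", "VT"): 3000,  ("VT", "NY"): 2500,
    ("VT", "FL"): 1500,  ("FL", "VT"): 500,
}


def net_inflow(state, other):
    """Net number of people moving from `other` into `state` (negative = outflow)."""
    return flows.get((other, state), 0) - flows.get((state, other), 0)


def color(value, scale):
    """Map a net flow to [-1, 1]: -1 is red, 0 is white, +1 is blue."""
    if scale == 0:
        return 0.0
    return max(-1.0, min(1.0, value / scale))


states = sorted({s for pair in flows for s in pair})

# Absolute scheme: a single scale shared by every pair of states.
global_max = max(abs(net_inflow(a, b)) for a in states for b in states if a != b)


def colors_for(state, relative):
    """Colors shown on the map when hovering over `state`."""
    partners = [s for s in states if s != state]
    nets = {p: net_inflow(state, p) for p in partners}
    # Relative scheme: rescale by the largest net flow touching this state only.
    scale = max(abs(v) for v in nets.values()) if relative else global_max
    return {p: round(color(v, scale), 2) for p, v in nets.items()}


print("VT, absolute scale:", colors_for("VT", relative=False))
print("VT, relative scale:", colors_for("VT", relative=True))
```

With these made-up numbers, the small state is nearly white everywhere on the absolute scale but fully saturated on the relative scale, which is the trade-off described above.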

The map has the additional advantage of showing the approximate distance (and direction) moved, which, for me, is a useful piece of information.


New but is it better?

Conventionally, the bracket in a sports tournament is presented like this (link):

EUROCHART-A3_1523448a

In the Euro 2012 that's happening right now, the group stage is followed by the knockout stage (quarter-, semi- and final).

The knockout stage is pretty straightforward. The group stage presents some challenges because it's difficult to present the chronology together with the team standings at the same time.

***

The official site of Euro 2012 has an innovative "Tournament Map" that is an attempt to improve upon the traditional design. (link)

Euro_map

I have mixed feelings about this presentation. It's easier to get a sense of how each team performed chronologically over the course of the competition. But then, I can't figure out what day the winner of a quarterfinal would play in the semifinal.


Ron Paul confuses the charts

Andrew Sullivan (link) re-printed this grouped column chart showing the result of a Washington Post-ABC poll on how voters say they would react to Ron Paul running as an independent candidate in next year's U.S. presidential election.

As_ronpaul

One aspect of this chart bothers me: depending on one's familiarity with election politics, one needs to read carefully the titles at the bottom of the chart, the legend, and possibly also the title of the chart (or know that the Republican wears red and the Democrat blue) in order to orient oneself. You can experiment by blocking out one or two of these three items.

Here's the same chart with a small number of fixes. Printing the legend onto the bars themselves makes the data more readable. This change necessitates flipping the columns over to horizontal bars. There are pros and cons to using a stacked chart versus a grouped chart.

Redo_paul1

Neither of these charts answers the burning question in the reader's mind, which is likely to be from whom Paul would take his votes. The key message from above is that the insertion of Paul is projected to make the identity of the Republican candidate irrelevant. The following flow chart emphasizes the shift in votes as opposed to the vote totals.

Redo_paul2

It appears that the Others/Undecided voters who can still swing the election do not consider Ron Paul as a desirable alternative. Most of Ron Paul's supporters would come from voters who would have cast their votes for the Republican or Democratic candidate (by a ratio of 3 Republican votes to 1 Democratic vote if Romney is running, or 3 to 2 if Gingrich is running).