Following this pretty flow chart

Bloomberg did a very nice feature on how drought has been causing havoc with river transportation of grains and other commodities in the U.S., which included several well-executed graphics.

Mississippi_sankey

I'm particularly attracted to this flow chart/Sankey diagram that shows the flows of grains from various U.S. ports to foreign countries.

It looks really great.

Here are some things one can learn from this chart:

  • The Mississippi River (blue flow) is by far the most important conduit of American grain exports
  • China is by far the largest importer of American grains
  • Mexico is the second largest importer of American grains, and it has a special relationship with the "interior" ports (yellow). Notice how the Interior almost exclusively sends grains to Mexico
  • Similarly, the Puget Sound almost exclusively trades with China

The above list is impressive for one chart.

***

Some key questions are not as easy to see from this layout:

  • What proportion of the total exports does the Mississippi River account for? (Turns out to be almost exactly half.)
  • What proportion of the total exports go to China? (About 40%. This question is even harder than the previous one because of all the unlabeled values for the smaller countries.)
  • What is the relative importance of different ports to Japan/Philippines/Indonesia/etc.? (Notice how the green lines merge from the other side of the country names.)
  • What is the relative importance of any of the countries listed, outside the top 5 or so?
  • How do the destination countries rank in importance for each port? For the Mississippi River, it appears that the countries may have been drawn from least important (up top) to most important (down below). That is not the case for the other ports... otherwise the threads would tie up into knots.
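The questions in this list are easy to answer once the flows sit in a table rather than a picture. Here is a minimal sketch of that arithmetic. The port and country names and all the numbers are hypothetical placeholders, not Bloomberg's data; the function names are my own.

```python
# Hypothetical flow table: rows are ports, columns are destinations.
# All values are made-up placeholders for illustration only.
flows = {
    "Port A": {"Country X": 40, "Country Y": 15, "Country Z": 5},
    "Port B": {"Country X": 5, "Country Y": 0, "Country Z": 20},
    "Port C": {"Country X": 0, "Country Y": 15, "Country Z": 0},
}


def grand_total(flows):
    """Sum of every flow in the table."""
    return sum(sum(row.values()) for row in flows.values())


def port_shares(flows):
    """Share of total exports leaving through each port."""
    total = grand_total(flows)
    return {port: sum(row.values()) / total for port, row in flows.items()}


def destination_shares(flows):
    """Share of total exports arriving in each destination."""
    total = grand_total(flows)
    received = {}
    for row in flows.values():
        for dest, amount in row.items():
            received[dest] = received.get(dest, 0) + amount
    return {dest: amount / total for dest, amount in received.items()}


def ranked_destinations(flows, port):
    """Destinations of one port, ordered from largest flow to smallest."""
    row = flows[port]
    return sorted(row, key=row.get, reverse=True)


if __name__ == "__main__":
    print(port_shares(flows))
    print(destination_shares(flows))
    print(ranked_destinations(flows, "Port B"))
```

A sorted table or bar chart of these shares would complement the flow chart, which is strongest at showing who trades with whom.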

***

Some of the features that make the chart look pretty are not data-driven.

See this artificial "hole" in the brown branch.

Bloomberg_mississippigrains_branchgap

In this part of the flow, there are two tiny outflows to Myanmar and Yemen, so most of the goods that got diverted to the right side ended up merging back to the main branch. However, the creation of this hole allows a layering effect which enhances the visual cleanliness.

Next, pay attention to the yellow sub-branches:

Bloomberg_mississippigrains_subbranching

At the scale used by the designer, all of the countries shown essentially import about the same amount from the Interior (yellow). Notice the special treatment of Singapore and the Philippines. Instead of each having a yellow sub-branch coming off the "main" flow, these two countries share a sub-branch, which later splits.

 

 

 


Unlocking the secrets of a marvellous data visualization

Scmp_coronavirushk_paper

The graphics team at my hometown paper, the SCMP, has developed a formidable reputation in data visualization, and I lapped up every drop of goodness in this beautiful graphic showing how the coronavirus spread around Hong Kong (in the first wave in April). Marcelo uploaded an image of the printed version to his Twitter. This graphic occupied the entire back page of that day's paper.

An online version of the chart is found here.

The data graphic is a masterclass in organizing data. While it looks complicated, I had no problem unpacking the different layers.

Cases were divided into imported cases (people returning to Hong Kong) and local cases. A small number of cases are considered in-betweens.

Scmp_coronavirushk_middle

The two major classes then occupy one half page each. I first looked at the top half, where my attention was drawn to the thickest flows. The majority of imported cases arrived from the U.K., and most of those were returning students. The U.S. is the next largest source of imported cases. The flows are carefully ordered by continent, with the Americas on the left, followed by Europe, the Middle East, Africa, and Asia.

Junkcharts_scmpcoronavirushk_americas1

Where there are interesting back stories, the flow blossoms into a flower. An annotation explains the cluster of cases. Each anther represents a case. Eight people caught the virus while touring Bolivia together.

Junkcharts_scmpcoronavirushk_bolivia

One reads the local cases in the same way. Instead of flowers, think of roots. The biggest cluster by far was a band that played at clubs in three different parts of the city, infecting a total of 72 people.

Junkcharts_scmpcoronavirushk_localband

Everything is understood immediately, without a need to read text or refer to legends. The visual elements carry that kind of power.

***

This data graphic presents a perfect amalgam of art and science. For a flow chart, the data are encoded in the relative thickness of the lines. This leaves two unused dimensions of these lines: the curvature and the length. The order of the countries and regions takes up the horizontal axis, but the vertical axis is free. Unshackled from the data, the designer introduced curves into the lines, varied their lengths, and dispersed their endings around the white space in an artistic manner.

The flowers/roots present another opportunity for creativity. The only data constraint is the number of cases in a cluster. The positions of the dots, and the shape of the lines leading to the dots are part of the playground.

What's more, the data visualization is a powerful reminder of the benefits of testing and contact tracing. The band cluster led to the closure of bars, which helped slow the spread of the coronavirus. 

 


Powerful photos visualizing housing conditions in Hong Kong

I was going to react to Alberto's post about the New York Times's article on economic inequality in Hong Kong, which has been proposed as one root cause of the current protest movement. I agree that the best graphic in this set is the "photoviz" showing the "coffins" or "cages" that many residents live in, because of the population density.

Nyt_hongkong_apartment_photoviz

Then I searched the archives, and found this old post from 2015 which is the perfect response to it. What's even better, that post was also inspired by Alberto.

The older post featured a wonderful campaign by human rights organization Society for Community Organization that uses photoviz to draw attention to the problem of housing conditions in Hong Kong. They organized a photography exhibit on this theme in 2014. They then updated the exhibit in 2016.

Here is one of the iconic photos by Benny Lam:

Soco_trapped_B1

I found more coverage of Benny's work here. There is also a book that we can flip through on Vimeo.

In 2017, the South China Morning Post (SCMP) published drone footage showing the outside view of the apartment buildings.

***

What's missing is the visual comparison to the luxury condos where the top 1 percent live. For these, one can visit the real estate sites, such as Sotheby's. Here is their "12 luxury homes for sales" page.

Another comparison: a 1,000-square-foot apartment that sits between those extremes. The photo by John Butlin comes from SCMP's Post Magazine's feature on the apartment:

Butlin_scmp_home

***

Also check out my review of Alberto's fantastic, recent book, How Charts Lie.

Cairo_howchartslie_cover

 

 


Measles babies

Mona Chalabi has made this remarkable graphic to illustrate the effect of the anti-vaccine movement on measles cases in the U.S.: (link)

Monachalabi_measles

As a form of agitprop, the graphic seizes upon the fear engendered by the defacing red rash of the disease. And it's very effective in articulating its social message.

***

I wasn't able to find the data except for a specific year or two. So, this post is more inspired by the graphic than a direct response to it.

I think the left-side legend should say "1 case of measles in someone who was not vaccinated" (as opposed to 1 case of measles in aggregate).

The chart encodes the data in the density of the red dots. What does the density of the red dots signify? There are two possibilities: case counts or case rates.

2013 is a year in which I could find data. In 2013, the U.S. saw 187 cases of measles, only 4 of them in someone who was vaccinated. In other words, there are 49 times as many measles cases among the unvaccinated as the vaccinated.

But note that about 90 percent of the population (using 13-17 year olds as a proxy) are vaccinated. The chance of getting measles in the unvaccinated is 0.8 per million, compared to 0.002 per million in the vaccinated - 422 times higher.
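To make the distinction between case counts and case rates concrete, here is a rough sketch of the arithmetic. It uses only the numbers quoted above and assumes that every case not recorded as vaccinated was unvaccinated, so it approximates rather than reproduces the exact figures in this post, which rest on details of the data not repeated here. The function names are my own.

```python
def count_ratio(cases_unvaccinated, cases_vaccinated):
    """How many times more cases occur among the unvaccinated (raw counts)."""
    return cases_unvaccinated / cases_vaccinated


def rate_ratio(cases_unvaccinated, cases_vaccinated, share_vaccinated):
    """How many times higher the chance of measles is for the unvaccinated.

    Each group's rate is cases divided by group size. The total population
    cancels out of the ratio, so only the vaccinated share is needed.
    """
    rate_unvaccinated = cases_unvaccinated / (1.0 - share_vaccinated)
    rate_vaccinated = cases_vaccinated / share_vaccinated
    return rate_unvaccinated / rate_vaccinated


total_cases = 187        # U.S. measles cases in 2013, as quoted above
vaccinated_cases = 4     # cases in vaccinated people, as quoted above
# Assumption: all remaining cases were in unvaccinated people.
unvaccinated_cases = total_cases - vaccinated_cases
share_vaccinated = 0.90  # proxy based on 13-17 year olds, as quoted above

if __name__ == "__main__":
    print(count_ratio(unvaccinated_cases, vaccinated_cases))
    print(rate_ratio(unvaccinated_cases, vaccinated_cases, share_vaccinated))
```

Because about nine in ten people are vaccinated, the rate ratio is roughly nine times the count ratio. That multiplier is why the two readings of the dot density look so different.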

The following chart shows the relative appearance of the dot densities. The bottom row, which compares the relative chance of getting measles, uses the more appropriate metric, and it looks much worse.

Junkcharts_monachalabi_measles

***

Mona's Instagram has many other provocative graphics.

 


Pretty circular things

National Geographic features this graphic illustrating migration into the U.S. from the 1850s to the present.

Natgeo_migrationtreerings

 

What to Like

It's definitely eye-catching, and some readers will be enticed to spend time figuring out how to read this chart.

The inset reveals that the chart is made up of little colored strips that mix together. This produces a pleasing effect of gradual color gradation.

The white rings that separate decades are crucial. Without those rings, the chart becomes one long run-on sentence.

Once the reader invests time in learning how to read the chart, the reader will grasp the big picture. One learns, for example, that migrants from the most recent decades have come primarily from Latin America (orange) or Asia (pink). Migrants from Europe (green) and Canada (blue) came in waves but have been muted in the last few decades.

 

What's baffling

Initially, the chart is disorienting. It's not obvious whether the compass directions mean anything. We can immediately understand that the further out we go, the larger the number of migrants. But what about the direction?

The key appears in the legend - which should be moved from bottom right to top left as it's so important. Apparently, continent/country of origin is coded in the directions.

This region-to-color coding seems to be rough-edged by design. The color mixing discussed above provides a nice artistic effect. Here, the reader finds out that mixing is primarily between two neighboring colors, thus two regions placed side by side on the chart. Thus, because Europe (green) and Asia (pink) are on opposite sides of the rings, those two colors do not mix.

Another notable feature of the chart is the lack of any data other than the decade labels. We won't learn how many migrants arrived in any decade, or the extent of migration as it impacts population size.

A couple of other comments on the circular design.

The circles necessarily expand in size as time moves from the inside out. Thus, this design only works well for "monotonic" data, that is to say, when migration always increases as time passes.

The appearance of the chart is only mildly affected by the underlying data. Swapping the regions of origin changes the appearance of this design drastically.

 

 

 

 

 


The art of contaminating data

Schwab_indexfundassets_sm

This is one of those innocent-looking charts that could have been a poster child for artistic embellishment. The straightforward time-series chart is deemed too boring. The designer shows admirable restraint in inserting "information-free" content, such as the dense gridlines (graph paper) and the 3D effect (ticker).

They seem harmless, but not really.

Here I turn off the color.

Redo_schwab_indexassets_bw_sm

After the 3D effect is applied, the reader no longer knows whether to look at the top or bottom edge of the ticker.

This view makes this point even clearer.

Jc_redo_schwab_indexassets_bw2_sm

The art contaminates the data.


Light entertainment: Making art by making data

Chris P. sent in this link to a Wired feature on "infographics."

The first entry is by Giorgia Lupi and Stefanie Posavec.

Wired_Stefanie-Data-Final

These are fun images and I enjoy looking at them as hand-drawn art. But it's a stretch to call them "data visualization," "data," or "data analysis," which are all tags used by the Wired editing staff.

(PS. Wired chose a particular example of their work. There are many examples of Lupi's work that strike a balance between handicraft and data communications.)

 


Mapping the two Americas

If you type "two Americas map" into Google image search, you get the following top results:

Google_twoAmericasmaps

Designers overwhelmingly pick the choropleth map as the way to depict the two nations.

Now, look at these maps from the New York Times (link):


Nytimes_election2016_mapDem

and this:

Nytimes_election2016_mapRep

I believe the background is a relief map. I would like to see one where the color is based on the strength of support for Democrats or Republicans.

The pair of maps is extremely effective at bringing out the story about the splitting of the U.S. population. From a design standpoint, I really like it.

I love, love, love the cute annotations everywhere on the page. I imagine the designer had fun coming up with them.

Nytimes_election2016_mapRep_inset

Pittsburgh Puddle, Cleveland Cove, Cincinnati Slough, ...

***

There is an artistic (or data journalistic) license behind the way the data are processed. Most likely, a 50% cutoff is applied to determine which map a county sits atop. The analysis is at the county level so there is necessarily some simplification... in fact, this aggregation is needed to make the "islands" and other features contiguous.

I am a bit sad that at this moment, we are so focused on what sets us apart, and not what binds us together as a nation.

 

PS. Via twitter, Maciej reacted negatively to these maps: "Horribly tendentious map visualization from the NYT makes the candidate who won more votes look like a tiny minority."

This is a good illustration of selecting the chart form to bring out one's message. If the goal of the chart is to show that Clinton has more votes, I agree that these maps fail to convey that message.

What I believe the NYT designer wants to point out is that the supporters of Clinton are clustered into these densely populated urban areas, leaving the Republicans with most of the land mass. (Like I said above, because of the 50% cutoff criterion, we are over-simplifying the picture. There are definitely Democrats living somewhere in Trump's nation, and likewise Republicans residing in Clinton strongholds.)


Raining, data art, if it ain't broke

Via Twitter, reader Joe D. asked a few of us to comment on the SparkRadar graphic by WeatherSpark.

At the time of writing, the picture for Baltimore is very pretty:

Sparkradar

The picture for New York is not as pretty but still intriguing. We are having a bout of summer and hence the white space (no precipitation):

Sparkradar_newyork

Interpreting this innovative chart is a tough task - this is a given with any innovative chart. Explaining the chart requires all the text on this page.

The difficulty of interpreting the SparkRadar chart is twofold.

Firstly, the axes are unnatural. Time runs vertically, defying the horizontal convention. Also, "now" - the most recent time depicted - is at the very bottom, which tempts readers to read bottom to top, meaning we are reading time running backwards into the past. In most charts, time runs left to right from past to present (at least in the left-right-centric part of the world that I live in).

Location has been reduced to one dimension. The labels "Distance Inside" and "Distance from Storm" confuse me - perhaps those who follow weather more closely can justify the labels. Conventionally, location is shown in two dimensions.

The second difficulty is created by the inclusion of irrelevant data (aka noise). The square grid prescribes a fixed box inside which all data are depicted. In the New York graphic, something is going on in the top right corner - far away in both time and space - how does it help the reader?

***

Now, contrast this chart to the more standard one, a map showing rain "clouds" moving through space.

Bing_precipitationradar_baltimore

(From Bing search result)

The standard one wins because it matches our intuition better.

Location is shown in two dimensions.

Distance from the city is shown on the map as scaled distance.

Time is shown as motion.

Speed is shown as speed of the motion. (In SparkRadar, speed is shown by the slope of imaginary lines.)

Severity is shown by density and color.

Nonetheless, a panel of the new charts makes great data art.

 

 


After seeing this chart, my mouth needed a rinse

The credit for today's headline goes to Andrew Gelman, who said something like that when I presented the following chart at his Statistical Graphics class yesterday:

Fidelityad_consumerstaples_adj_sm

With this chart (which appeared in a large ad in the NY Times), Fidelity Investments wants to tell potential customers to move money into the consumer staples category because of "greater return" and "lower risk". You just might wonder what a "consumer staple" is. Toothbrushes, you see.

There are too many issues with the chart to fit into one blog post. My biggest problem concerns the visual trickery used to illustrate "greater" and "lower". The designer wants to focus readers on the two orange brushes: return for consumer staples is higher, and risk is lower, you see.

The "greater" (i.e. right-facing) toothbrush is associated with longer brushes and higher elevation; the "lower" (left-facing) toothbrush, with shorter brushes and lower elevation.

But looking carefully at the scales reveals that the return ranges from 6% to 14% and the risk ranges from 10% to 25%. So larger numbers are depicted by shorter brushes and lower elevation, exactly the opposite of one's expectation. The orange brushes happen to represent the same value of 14.3% but the one on the right is at least four times as large as the one on the left. As the dentist says, time to rinse out!

The vertical axis represents ranking of the investment categories in terms of decreasing return and/or risk so on both toothbrushes, the axis should run from 1 to 10.

***

How would the dentist fix this?

The first step is to visit the Q corner of the Trifecta Checkup. The purpose of this chart is for investors to realize that (using the chosen metrics) consumer staples have the best combination of risk and return. In finance, risk is measured as the volatility of return. So, in effect, all the investors care about is the probability of getting a certain level of return.

The trouble with any chart that shows both risk and return is that readers have no way of going from the pair of numbers to the probability of getting a certain level of return.

The fix is to plot the probability of returns directly.

Redo_fidelity_staples

In the above sketch, I just assumed a normal probability model, which is incorrect; but it is not hard to substitute this with an empirical distribution, if you obtain the raw data.

Unlike in the original chart, consumer staples does not appear to be a clear-cut winner.
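For readers who want to see how a return/risk pair turns into a probability, here is a minimal sketch under the same normal assumption, which, as noted, is a simplification. It treats the return as the mean and the risk as the standard deviation. The 14.3% values are the ones read off the orange brushes; the function name and the choice of thresholds are my own.

```python
from statistics import NormalDist


def prob_return_above(mean_return, volatility, threshold):
    """Probability that the return exceeds `threshold`, assuming returns
    follow a normal distribution (an assumption, not a fact about markets).
    All inputs are in percentage points."""
    model = NormalDist(mu=mean_return, sigma=volatility)
    return 1.0 - model.cdf(threshold)


# Consumer staples: return and risk both read as 14.3% from the chart.
staples_return = 14.3
staples_risk = 14.3

if __name__ == "__main__":
    # Chance of not losing money, and chance of beating a 10% return.
    print(prob_return_above(staples_return, staples_risk, 0.0))
    print(prob_return_above(staples_return, staples_risk, 10.0))
```

Running the same function for every category at a few thresholds gives the numbers behind a chart like the sketch. With raw data in hand, the normal model could be swapped for the empirical share of periods in which the return exceeded each threshold.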