Morphing small multiples to investigate Sri Lanka's religions

Earlier this month, the bombs in Sri Lanka led to some data graphics in the media, educating us on the religious tensions within the island nation. I like this effort by Reuters using small multiples to show which religions are represented in which districts of Sri Lanka (lifted from their twitter feed):

Reuters_srilanka_religiondistricts

The key to reading this map is the legend at the top. From there, you'll notice that many of the color blocks, especially for Muslims and Catholics, fall well short of 50 percent. The absence of the darkest tints of green and blue conveys important information: looking at the blue map by itself misleads, since Catholics are in the minority in every district except one. In this setup, readers are expected to compare between maps, and between map and legend.

The overall distribution at the bottom of the chart is a nice piece of context.

***

The above design isolates each religion in its own chart, and displays the spatial spheres of influence. I played around with different ways of paneling the small multiples.

In the following graphic, the panels represent levels of dominance within each district. The first panel shows the districts in which the top religion is practiced by at least 70 percent of the population. (If each religion were spread evenly across all districts, every district would be about 70 percent Buddhist, since that is the Buddhist share of the national population.) The second panel shows the religions that account for 40 to 70 percent of a district's residents. By this definition, no district can appear on both the left and middle maps. This division is effective at separating districts with one dominant religion from those that are "mixed".

In the middle panel, the displayed religion is the top religion in a mixed district. The last panel shows the second religion in each mixed district; these religions typically account for between 25 and 40 percent of residents.

Redo_srilankareligiondistricts_v2

The chart shows that, other than Buddhism, Hinduism is the only religion that dominates specific districts, concentrated at the northern end of the island. The districts along the east and west coasts and the "neck" are mixed, with the top religion accounting for 40 to 70 percent of residents. By reading the second and third panels together, the reader sees both the top and the second religions in each of these mixed districts.

***

This example shows why, in the Trifecta Checkup, the Visual is a separate corner from the Question and the Data. Both maps utilize the same visual design, in terms of forms, colors and so on, but they deliver different experiences to readers by answering different questions and cutting the data differently.


P.S. [5/7/2019] Corrected spelling of Hindu.


Form and function: when academia takes on weed

I have a longer article on the sister blog about the research design of a study claiming that "420 Day" (April 20, the cannabis holiday) caused more road accident fatalities (link). The blog also has a discussion of the graphics used to present the analysis, which I'm excerpting here for dataviz fans.

The original chart looks like this:

Harperpalayew-new-420-fig2

The question being asked is whether April 20 is a special day when viewed against the backdrop of every day of the year. The answer is pretty clear. From this chart, the reader can see:

  • that April 20 is part of the background "noise", not standing out from the pack;
  • that other days, such as July 4, Labor Day, and Christmas, stand out more than April 20.

It doesn't even matter what the vertical axis is measuring. The visual elements did their job. 

***

If you look closely, you can even assess the "magnitude" of the evidence, not just the "direction." While April 20 isn't special, it nonetheless is somewhat noteworthy. The vertical line associated with April 20 sits on the positive side of the range of possibilities, and appears to sit above most other days.

The chart form shown above is better at conveying the direction of the evidence than its strength. If the strength of the evidence is required, we use a different chart form.

I produced the following histogram, using the same data:

Redo_420day_2

The histogram is produced by first placing the midpoints# of the vertical lines into buckets, and then counting the number of days that fall into each bucket. (# Strictly speaking, I use the point estimates.)

The midpoints# are estimates of the fatal crash ratio, which is defined as the excess crash fatalities reported on the "analysis day" relative to the "reference days," situated one week before and one week after the analysis day. So April 20 is compared to April 13 and 27. A ratio of 1 indicates no excess fatalities on the analysis day, and the further the ratio rises above 1, the more unusual the analysis day.
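
To make the ratio definition concrete, here is a minimal R sketch with simulated daily counts. The numbers are made up for illustration; this is not the study's estimation procedure, which produced the point estimates and intervals shown in the chart.

# hypothetical daily fatality counts for one year
# ratio for an analysis day = deaths that day, divided by the
# average of deaths one week before and one week after

fatal_ratio = function(deaths, day) {
    deaths[day] / mean(c(deaths[day - 7], deaths[day + 7]))
}

set.seed(420)
deaths = rpois(365, lambda = 100)   # stand-in for real daily counts
days = 8:358                        # days with both reference days in range
ratios = sapply(days, function(d) fatal_ratio(deaths, d))

# bucket the ratios and count the days per bucket
hist(ratios, breaks = 20, main = "", xlab = "Fatal crash ratio")
abline(v = 1, lty = 2)              # ratio of 1 = no excess fatalities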

If we were to pick a random day from the histogram above, we would likely land somewhere in the middle, that is, on a day of the year for which no excess car crash fatalities could be confirmed in the data.

As shown above, the ratio for April 20 (about 1.12) is located on the right tail, at roughly the 94th percentile, meaning that only about 6 percent of analysis days had more extreme ratios.

This is in line with our reading above, that April 20 is noteworthy but not extraordinary.


P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. The newer version contains the point estimates inside the vertical lines, which are used to generate the histogram.



The Bumps come to the NBA, courtesy of 538

The team at 538 did a post-mortem of their in-season forecasts of NBA playoffs, using Bumps charts. These charts have a long history and can be traced back to Cambridge rowing. I featured them in these posts from a long time ago (link 1, link 2). 

Here is the Bumps chart for the NBA West Conference showing all 15 teams, and their ranking by the 538 model throughout the season. 

Fivethirtyeight_nbawest_bumps

The highlighted team is the Kings. It's a story of ascent, especially in the second half of the season. It's also a story of close but no cigar: the team knocked at the door for the last five weeks but failed to grab the final playoff spot. The beauty of the Bumps chart is how easy it is to see this story.

Now focus on the dotted line labeled "Makes playoffs": beyond the half-way point (1/31), there are no further crossings. This means that by that point, the 538 model had already correctly identified the eight playoff teams.

***

Now what about NBA East?

Fivethirtyeight_nbaeast_bumps

This chart highlights the two top teams. This conference is pretty easy to predict at the top. 

What is interesting is the spaghetti around the playoff line. The playoff race was heart-stopping and it wasn't until the last couple of weeks that the teams were settled. 

Also worthy of attention are the bottom-dwellers. Note that the chart is disconnected in the last four rows (ranks 12 to 15). These four teams never left the cellar, and the model figured out their final rankings around February.

Using a similar analysis, you can see that the model had found the top five teams in this conference by mid-December, as there are no further crossings beyond that point.

***
Go check out the FiveThirtyEight article for their interpretation of these charts. 

While you're there, read the article about when to leave the stadium if you'd like to leave a baseball game early, work that came out of my collaboration with Pravin and Sriram.


How to describe really small chances

Reader Aleksander B. pointed me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful" (link).

Dailymail-image-a-35_1431545452562

One can't argue with that statement. The chart fails the self-sufficiency test: anyone reading it ends up reading the data printed in the right column, gaining nothing from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that for "car"; the size of the risk for "car" should be 100 times larger than that for "airplane". The risk of riding motorcycles is then roughly 3,000 times that of flying in an airplane.

The chart does not appear to be sized properly as a bubble chart:

Dailymail_travelrisk_bubble

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.

Nor is it sized as a bar chart:

Dailymail_travelrisk_bar

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite thorny. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksander poses one such decision problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of the reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is: what's the chance of dying given the amount of traveling I do in a year? Because that chance is very tiny, it is easier to express the risk as the number of years of travel before one death is expected.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.
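
These conversions are easy to script. Here is a small R function (my own, mirroring the two scenarios above) that turns a risk rate into the expected number of years of travel before one death:

# years of a travel pattern before one death is expected, given a
# risk rate in deaths per billion passenger-miles and the
# passenger-miles contributed per year

years_to_one_death = function(deaths_per_billion, annual_passenger_miles) {
    (1e9 / deaths_per_billion) / annual_passenger_miles
}

years_to_one_death(7.3, 300 * 100)         # driver: ~4,600 years
years_to_one_death(0.07, 24 * 1000 * 100)  # flyer: ~6,000 years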

For the average person on those travel schedules, there is nothing to worry about. 

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips in rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or shortly before landing, so the risk depends on the number of flights taken rather than the miles flown; a per-mile metric understates the risk of short flights.



Bar-density and pie-density plots for showing relative proportions

In my last post, I described a bar-density chart to show paired data of proportions with an 80/20-type rule. The following example illustrates that a small proportion of Youtubers generate a large proportion of views.

Jc_redo_youtube_bar_2col

Other examples of this type of data include:

  • the top 10% of families own 75% of U.S. household wealth (link)
  • the top 1% of artists earn 77% of recorded music income (link)
  • the top 5% of AT&T customers consume 46% of the bandwidth (link)

In all these examples, the message of the data is the importance of a small number of people (top earners, superstars, bandwidth hogs). A good visual should call out this message.

The bar-density plot consists of two components:

  • the bar chart, which shows the distribution of the data (views, wealth, income, bandwidth) among segments of people;
  • the Voronoi diagram embedded within each bar, which encodes the relative importance of each people segment, as measured by the (inverse) density of the population in that segment. A people segment is more important if each individual accounts for more of the data; in other words, the density of people within that segment is lower.

The bar chart can adopt a more conventional horizontal layout.

Jc_redo_youtube_bar_h_2col

Voronoi tessellation

To understand the Voronoi diagram, think of a fixed number (say, 100) of randomly placed points inside a bar. Every location inside the bar area has a nearest neighbor among those 100 fixed points; assign each location to its nearest neighbor. One can then draw a boundary around each of the 100 points enclosing all the locations assigned to it. The resulting tessellation is the Voronoi diagram. (The following illustration comes from this AMS column.)

Ams_voronoi
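
For a hands-on feel, here is a minimal sketch using the spatstat package (the same package used in the code section below): generate 100 random points in a rectangle and plot their Voronoi tessellation.

library(spatstat)

# 100 uniformly random points inside a 4 x 1 rectangle
pts = runifpoint(100, win = owin(c(0, 4), c(0, 1)))

# dirichlet() computes the Voronoi (Dirichlet) tessellation
plot(dirichlet(pts), main = "")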


The density of points in the respective bars encodes the relative proportions of people within those groups. For my example, I placed 6 points in the red bar, 66 points in the yellow bar, and ~2,000 points in the gray bar, precisely matching the relative proportions of creators in the three segments.

Density is represented statistically

Notice that the density is represented statistically, not empirically. According to the annotation on the original chart, the red bar represents 14,000 super-creators. Correspondingly, there are 4.5 million creators in the gray bar. Any attempt to plot those as individual pieces will result in a much less impactful graphic. If the representation is interpreted statistically, as relative densities within each people segment, the message of relative importance of the units within each group is appropriately conveyed.

A more sophisticated way of deciding how many points to place in the red bar remains to be developed; here, I just used the convenient number of 6.

The color shades are randomly applied to the tessellation pieces, and serve to make the densities easier to read.

***

In this section, I provide R code for those who want to explore this some more. This is prototype code, and you're welcome to improve it. The general strategy is as follows:

  • Set up the rectangular area (bar) in which the Voronoi diagram is to be embedded. The length of the bar is set to the proportion of views, appropriately scaled. The code uses the owin function in the spatstat package to define the rectangle.
  • Set the number of points (n) to be embedded in the bar, determined by the relative proportion of creators, appropriately scaled. Generate a data frame containing the x-y coordinates of n randomly placed points within the rectangle defined above.
  • Use the as.ppp function to convert the points into a point pattern, then the dirichlet function to generate the Voronoi tessellation.
  • Set up a colourmap for plotting the Voronoi diagram.
  • Plot the Voronoi diagram, assigning shades at random to the tiles (in production code, these random numbers should be set as marks on the ppp, but it's easier to play around with the shades if they are supplied at plotting time).

The code generates separate charts for each bar segment. A post-processing step is currently required to align the bars to attain equal height. I haven't figured out whether the multiplot option helps here.

library(spatstat)

# Enter the scaled proportions of creators and views.
# The Youtube example has three creator segments.

# The number of randomly generated points is proportional to the
# proportion of creators. Multiply nc by a scaling factor if desired.

nc = c(3, 33, 965)*2

# Bar widths are proportional to the proportion of views.
# Set the total width based on the width of your page.

wide = c(378, 276, 346)/2

# Set the bar height, to attain a particular aspect ratio.
bar_h = 50

# Define a function that places n random points in a rectangular
# window and returns the Voronoi (Dirichlet) tessellation.

makepoints = function (n, wide, height) {
    df <- data.frame(x = runif(n,0,wide), y = runif(n,0,height))
    W <- owin( c(0, wide), c(0,height) )  # rectangular window
    pp1 <- as.ppp( df, W )                # convert to a point pattern
    y <- dirichlet(pp1)                   # Voronoi tessellation
    # y$marks <- sample(0:wide, n, replace=TRUE) # marks are for colors
    return (y)
}

y_red = makepoints(nc[1], wide[1], bar_h) # height of each bar fixed
y_yel = makepoints(nc[2], wide[2], bar_h)
y_gry = makepoints(nc[3], wide[3], bar_h)

# Set colors (4 shades per bar, one color family per bar).

cr_red = colourmap(c("lightsalmon","lightsalmon2", "lightsalmon4", "brown"), breaks=round(seq(0, wide[1], length.out=5)))

cr_yel = colourmap(c("burlywood1", "burlywood2", "burlywood3", "burlywood4"), breaks=round(seq(0, wide[2], length.out=5)))

cr_gry = colourmap(c("gray80", "gray60", "gray40", "gray20"), breaks=round(seq(0, wide[3], length.out=5)))

# Plotting: each plot() call dispatches to plot.tess.

par(mar=c(0,0,0,0))

# Wrap the plots in png(...) and dev.off() to save the images to file.

# Remove values= if the colors are set as marks in the ppp.

plot(y_red, main="", border="pink3", do.col=TRUE, values=sample(0:wide[1], nc[1], replace=TRUE), col=cr_red, xlim=c(0, wide[1]), ylim=c(0,bar_h), ribbon=FALSE)

plot(y_yel, main="", border="darkgoldenrod4", do.col=TRUE, values=sample(0:wide[2], nc[2], replace=TRUE), col=cr_yel, xlim=c(0, wide[2]), ylim=c(0,bar_h), ribbon=FALSE)

plot(y_gry, main="", border="darkgray", do.col=TRUE, values=sample(0:wide[3], nc[3], replace=TRUE), col=cr_gry, xlim=c(0, wide[3]), ylim=c(0,bar_h), ribbon=FALSE)

# Because the points are random, the tessellation looks different each time.
# Post-processing: make each bar the same height when aligned side by side.
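
One possible way to avoid the post-processing step (a sketch, not fully tested): draw the three bars into the same device, stacked vertically, and give every panel the same xlim so that the bar lengths remain proportional.

# stack the three bars in one device; a shared x-range keeps the
# bar widths proportional without post-processing

par(mfrow = c(3, 1), mar = c(0, 0, 0, 0))
common_xlim = c(0, max(wide))

plot(y_red, main="", border="pink3", do.col=TRUE, values=sample(0:wide[1], nc[1], replace=TRUE), col=cr_red, xlim=common_xlim, ylim=c(0,bar_h), ribbon=FALSE)

plot(y_yel, main="", border="darkgoldenrod4", do.col=TRUE, values=sample(0:wide[2], nc[2], replace=TRUE), col=cr_yel, xlim=common_xlim, ylim=c(0,bar_h), ribbon=FALSE)

plot(y_gry, main="", border="darkgray", do.col=TRUE, values=sample(0:wide[3], nc[3], replace=TRUE), col=cr_gry, xlim=common_xlim, ylim=c(0,bar_h), ribbon=FALSE)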

***

A cousin of the bar-density plot is the pie-density plot. Since I'm using only three creator segments, each of which accounts for about 30-40% of total views, it is natural to use a pie chart. In this case, we embed the Voronoi diagrams into the pie sectors.

Jc_redo_youtube_pie_lobsided

If the distribution were more even, that is to say, if the creators were more or less equally important, the pie-density plot would look like this:

Redo_jc_youtube_pie_even

***

Something that is more like 80/20

The original chart shows the top 0.3 percent generating almost 40 percent of the views. A more typical insight is that the top X percent generates 80 percent of the data. For the YouTube data, X is 11 percent. What does the pie-density chart look like if the top 11 percent of creators generate 80 percent of the views, the middle 33 percent generate 11 percent, and the bottom 56 percent generate 8 percent?

Jc_youtube_8020_barh_pie

Roughly speaking, the second segment includes three times as many people as the largest, and the third five times as many.
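
Finding X for a given dataset is a simple cumulative-sum exercise. A sketch with simulated per-creator view counts (the distribution below is hypothetical, chosen only to be heavily skewed):

# fraction of creators (sorted by views, descending) needed to
# account for a target share of total views

top_share = function(views, target = 0.8) {
    v = sort(views, decreasing = TRUE)
    k = which(cumsum(v) / sum(v) >= target)[1]
    k / length(v)
}

set.seed(1)
views = rlnorm(10000, meanlog = 0, sdlog = 2)  # hypothetical, skewed
top_share(views)    # fraction of creators generating 80% of views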


P.S.

1) Check out my first Linkedin "article" on this topic. 

2) The first post on bar-density charts is here.


Visualizing the 80/20 rule, with the bar-density plot

Through Twitter, Danny H. submitted the following chart, which shows that a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.

Dannyheiman_twitter_youtubedata

In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.

I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slow down cognition while adding hardly anything to our understanding.

I like the chart title that explains what it is about.

Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.

***

My initial reaction on Twitter was to suggest a mirrored bar chart, like this:

Jc_redo_youtube_mirrorbar_lobsided

I ended up spending quite a bit of time exploring other concepts. In particular, I wanted to find an integrated way to present this information. Most chart forms, such as the mirrored bar chart, the Bumps chart (slopegraph), and the Lorenz curve, keep the two series of percentages separate.

Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers, while the top creators ("super-creators") are cramped inside a sliver of a bar, which is invisible in the original chart.

What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.

Jc_redo_youtube_bar_h_2col

Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.

The embedded tessellation shows the structure of the data: the bottom third of the views is generated by a huge number of creators producing a few views each, resulting in a high density. The top 38% of the views correspond to a small number of super-creators, appropriately shown by a bar of low density.

For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)

Here is what the bar-density plot looks like when the distribution is essentially uniform:

Jc_redo_youtube_bar_even
The density inside each bar is roughly the same, indicating that the creators are roughly equally important.


P.S.

1) The next post on the bar-density plot, with some experimental R code, will be available here.

2) Check out my first Linkedin "article" on this topic.


Nice example of visual story-telling in the FT

I came across this older chart in the Financial Times, which is a place to find some nice graphics:

Ft_uklifeexpectancy

The key to success here is having a good story to tell. Blackpool is an outlier when it comes to improvement in life expectancy since 1993. Its average life expectancy has improved, but the magnitude of improvement lags other areas by quite a margin.

The design then illustrates this story in two ways.

On the right side, one sees Blackpool occupying a lone spot on the left side of the histogram. On the left chart, the gap between Blackpool and the national average is plotted over time. The gap is clearly widening; the size of the gap is labeled so the reader immediately knows it grew from 1.8 to 4.9 years.

Although they're not labeled, the reader understands that the other two lines represent the best and worst areas. The comparison between Glasgow City and Blackpool is also informative: Glasgow City, which has the worst life expectancy in the U.K., is fast catching up with Blackpool, the second worst.

I also like the color-coded titles, which draw attention to Blackpool and link the conclusion to both charts in an efficient manner.


Pretty circular things

National Geographic features this graphic illustrating migration into the U.S. from the 1850s to the present.

Natgeo_migrationtreerings


What to Like

It's definitely eye-catching, and some readers will be enticed to spend time figuring out how to read this chart.

The inset reveals that the chart is made up of little colored strips that mix together. This produces a pleasing effect of gradual color gradation.

The white rings that separate decades are crucial. Without those rings, the chart becomes one long run-on sentence.

Once the reader invests time in learning how to read the chart, the reader will grasp the big picture. One learns, for example, that migrants from the most recent decades have come primarily from Latin America (orange) or Asia (pink). Migrants from Europe (green) and Canada (blue) came in waves but have been muted in the last few decades.


What's baffling

Initially, the chart is disorienting. It's not obvious whether the compass directions mean anything. We immediately understand that the further out we go, the larger the number of migrants. But what do the directions signify?

The key appears in the legend, which should be moved from the bottom right to the top left given its importance. Apparently, the continent/country of origin is coded in the directions.

This region-to-color coding seems rough-edged by design. The color mixing discussed above provides a nice artistic effect, and the reader finds that mixing occurs primarily between two neighboring colors, that is, between two regions placed side by side on the chart. Because Europe (green) and Asia (pink) sit on opposite sides of the rings, those two colors never mix.

Another notable feature of the chart is the lack of any data other than the decade labels. We won't learn how many migrants arrived in any decade, or the extent of migration as it impacts population size.

A couple of other comments on the circular design.

The rings necessarily expand in size as time moves from the inside out. Thus, this design only works well for "monotonic" data, that is to say, when migration always increases as time passes.

The appearance of the chart is only mildly affected by the underlying data; by contrast, merely swapping the positions of the regions of origin would change the appearance of this design drastically.


Check out the Lifespan of News project

Alberto Cairo introduces another one of his collaborations with Google, visualizing Google search data. We previously looked at other projects here.

The latest project, designed by Schema, Axios, and Google News Initiative, tracks the trending of popular news stories over time and space, and it's a great example of making sense of a huge pile of data.

The design team produced a sequence of graphics to illustrate the data. The top news stories are grouped by category, such as Politics & Elections, Violence & War, and Environment & Science, each given a distinct color maintained throughout the project.

The first chart is an area chart that looks at individual stories, tracking their search volume over time.

Lifespannews_areachart

To read this chart, you have to notice that the vertical axis measuring volume is on a log scale, meaning that each tick mark up represents a 10-fold increase. Log scales are frequently used to draw far-away data closer to the middle, making it possible to see both ends of a wide distribution on the same chart. The log transformation introduces distortion deliberately: the smaller data look disproportionately large because of it.
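
A quick way to see the distortion (an illustration in base R, not the project's code): plot the same series on a linear scale and on a log scale, and notice how the log scale pulls the small values toward the large ones.

# search volumes spanning several orders of magnitude
volume = c(120, 950, 14000, 380000, 2100000)

par(mfrow = c(1, 2))
plot(volume, type = "b", main = "linear scale")
plot(volume, type = "b", log = "y", main = "log scale")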

The time axis scrolls automatically so that you experience the rise and fall of various news stories. It's a great way to relive the news cycle of the past year. The overlapping areas show competing news stories that shared the limelight at a given point in time.

Just bear in mind that you have to mentally reverse the distortion introduced by the log scale.

***

In the second part of the project, they tackle regional patterns. Now you see a map with proportional symbols. The top story in each locality is highlighted with the color of the topic. As time flows by, the sizes of the bubbles expand and contract.

Lifespannews_bubblemap

Sometimes, the entire nation was consumed by the same story, e.g. certain obituaries. At other times, people in different regions focused on different topics.

***

In the last part of the project, they describe general shapes of the popularity curves. Most stories have one peak, although certain stories, like the U.S. government shutdown, have multiple peaks. There is also variation in how fast a story rises to its peak and how quickly it fades away.

The most interesting aspect of the project can be learned from the footnote. The data are not direct hits on the Google News stories but searches on Google. For each story, one or more unique search terms are matched, and only searches for those terms are counted. A "control" is established, which is an excellent idea, because the control gives meaning to the raw counts. The control used here is the number of searches for the generic term "Google News." Presumably this is a relatively stable number that serves as a proxy for general search activity. Thus, the "volume" metric is really a relative measure against this control.