People flooded this chart presented without comment with lots of comments

Oct 05, 2022

The recent election in Italy has resulted in some dubious visual analytics. A reader sent me this Excel chart:

In brief, an Italian politician (trained as a PhD economist) used the graph above to make a point that support of the populist Five Star party (M5S) is highly correlated with poverty - the number of people on RDC (basic income). "Senza commento" - no comment needed.

Except a lot of people noticed the idiocy of the chart, and ridiculed it.

The chart appeals to those readers who don't spend time understanding what's being plotted. They notice two lines that show similar "trends", which they take as a signal of high correlation.

It turns out the signal in the chart isn't found in the peaks and valleys of the "trends". It is tempting to observe that when the blue line peaks (Campania, Sicilia, Lazio, Piemonte, Lombardia), the orange line also pops.

But look at the vertical axis. He's plotting the number of people, rather than the proportion of people. Population varies widely between Italian provinces. The five mentioned above all have over 4 million residents, while the smaller ones such as Umbria, Molise, and Basilicata have under 1 million. Thus, so long as the number of people, not the proportion, is plotted, no matter what demographic metric is highlighted, we will see peaks in the most populous provinces.
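To see how population alone can manufacture a correlation, here is a minimal sketch with made-up numbers (five hypothetical regions, not the Italian data). The two per-capita rates are constructed to move in opposite directions, yet the raw counts still rise and fall together because both are dominated by population size.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical regions: all values below are invented for illustration.
populations = [10_000_000, 5_000_000, 4_000_000, 1_000_000, 1_000_000]
rate_x = [0.10, 0.20, 0.10, 0.20, 0.10]  # e.g. share of people on welfare
rate_y = [0.20, 0.10, 0.20, 0.10, 0.20]  # e.g. share voting for a party

# Raw counts are what the original chart plots.
count_x = [p * r for p, r in zip(populations, rate_x)]
count_y = [p * r for p, r in zip(populations, rate_y)]

corr_counts = pearson(count_x, count_y)  # clearly positive
corr_rates = pearson(rate_x, rate_y)     # perfectly negative by construction

print(f"correlation of counts: {corr_counts:.2f}")
print(f"correlation of rates:  {corr_rates:.2f}")
```

In this toy setup the counts look positively correlated even though the proportions are perfectly negatively correlated, which is why the vertical axis matters so much.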

***

The other issue with this line chart is that the "peaks" are completely contrived. That's because the items on the horizontal axis do not admit a natural order. This is NOT a time-series chart, for which there is a canonical order. The horizontal axis contains a set of provinces, which can be ordered in whatever way the designer wants.

The following shows how the appearance of the lines changes as I select different metrics by which to sort the provinces:

This is the reason why many chart purists frown on people who use connected lines with categorical data. I don't like this hard rule, as my readers know. In this case, I have to agree the line chart is not appropriate.
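The dependence of "peaks" on ordering can be sketched in a few lines. The category labels and values below are hypothetical, and `count_peaks` is a helper name invented for this illustration. The same seven values show three peaks in alphabetical order and none when sorted by value.

```python
def count_peaks(values):
    """Count interior points that are strictly higher than both neighbours."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )

# Hypothetical categories and values (not the Italian data).
data = {"A": 3, "B": 9, "C": 2, "D": 8, "E": 1, "F": 7, "G": 4}

# Same data, two equally defensible orderings of the horizontal axis.
alphabetical = [data[name] for name in sorted(data)]
by_value = sorted(data.values(), reverse=True)

peaks_alphabetical = count_peaks(alphabetical)
peaks_by_value = count_peaks(by_value)

print(peaks_alphabetical, peaks_by_value)
```

Since the data are identical in both cases, any peak is a product of the ordering the designer picked, not of the data.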

***

So, where is the signal on the line chart? It's in the ratio of the heights of the two values for each province.

Here, we find something counter-intuitive. I've highlighted two of the peaks. In Sicilia, about the same number of people voted for Five Star as there are people who receive basic income. In Lombardia, more than twice the number of people voted for Five Star as there are people who receive basic income.

Now, Lombardy is where Milan is, essentially the richest province in Italy while Sicily is one of the poorest. Could it be that Five Star actually outperformed their demographics in the richer provinces?

***

Let's approach the politician's question systematically. He's trying to say that the Five Star movement appeals especially to poorer people. He's chosen basic income as a proxy for poverty (this is like people on welfare in the U.S.). Thus, he's divided the population into two groups: those on welfare, and those not.

What he needs is the relative proportions of votes for Five Star among these two subgroups. Say, Five Star garnered 30% of the votes among people on welfare, and 15% of the votes among people not on welfare, then we have a piece of evidence that Five Star differentially appeals to people on welfare. If the vote share is the same among these two subgroups, then Five Star's appeal does not vary with welfare.

The following diagram shows the analytical framework:

What's the problem? He doesn't have the data needed to establish his thesis. He has the total number of Five Star voters (which is the sum of the two yellow boxes) and he has the total number of people on RDC (which is the dark orange box).

As shown above, another intervening factor is the proportion of people who voted. It is conceivable that the propensity to vote also depends on one's wealth.
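The data gap can be made concrete with a small sketch. All numbers are invented, and for simplicity it assumes everyone votes, which sets aside the turnout issue. Two very different worlds produce exactly the same totals that the politician has in hand.

```python
def subgroup_shares(m5s_from_rdc, m5s_from_non_rdc, rdc_total, non_rdc_total):
    """Return the Five Star vote share within each welfare subgroup."""
    return m5s_from_rdc / rdc_total, m5s_from_non_rdc / non_rdc_total

# Hypothetical province: 1,000 voters, 200 on RDC, 300 Five Star votes in total.
rdc_total, non_rdc_total, m5s_total = 200, 800, 300

# Scenario A: Five Star does much better among RDC recipients.
scenario_a = (100, 200)  # (votes from RDC recipients, votes from everyone else)
# Scenario B: Five Star does equally well in both subgroups.
scenario_b = (60, 240)

for m5s_rdc, m5s_non_rdc in (scenario_a, scenario_b):
    # Both scenarios match the only totals that are actually observed.
    assert m5s_rdc + m5s_non_rdc == m5s_total

shares_a = subgroup_shares(*scenario_a, rdc_total, non_rdc_total)
shares_b = subgroup_shares(*scenario_b, rdc_total, non_rdc_total)

print("Scenario A shares (RDC, non-RDC):", shares_a)
print("Scenario B shares (RDC, non-RDC):", shares_b)
```

Because the totals cannot tell these scenarios apart, no chart built from the totals alone can settle the question.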

So, in this case, fixing the visual will not fix the problem. Finding better data is key.

Aug 12, 2021

On July 30, Israel began administering third doses of mRNA vaccines to targeted groups of people. This decision was controversial since there is no science to support it. The policymakers do have educated guesses by experts based on best-available information. By science, I mean actual evidence. Since no one has previously been given three shots, there can be no data on which anyone can root such a decision. Nevertheless, the pandemic does not always give us time to collect relevant data, and so speculative analysis has found its calling.

Dvir Aran, at Technion, has been diligently tracking the situation in Israel on his Twitter. Ten days after July 30, he posted the following chart, which immediately led many commentators to bounce out of their seats crowning the third shot as a magic bullet. Notably, Dvir himself did not endorse such a claim. (See here to learn how other hasty conclusions by experts have fared.)

When you look at Dvir's chart, what do you see?

Possibly one of the following two things, depending on what concern you have in your head.

1) The red line sits far above the other two lines, showing that unvaccinated people are much more likely to get infected.

2) The blue line diverges from the green line almost immediately after the 3rd shots started getting into arms, showing that the 3rd shot is super effective.

If you take another moment to look, you might start asking questions, as many in Twitter world did. Dvir was startlingly efficient at answering these queries.

A) Does the green line represent people with 2 or 3 doses, or is it strictly 2 doses? Aron asked this question and got the answer (the former):

It's time to check our presumptions. When you read that chart, did you presume it's exactly 2 doses or did you presume it's 2 or 3 doses? Or did you immediately spot the ambiguity? As I said in this article, graphs attain efficiency at communication because the designer leverages unspoken rules - the chart conveys certain information without explicitly placing it on the chart. But this can backfire. In this case, I presumed the three lines to display three non-overlapping groups of people, and thus the green line indicates those with 2 doses but not 3. That presumption led me to misinterpret what's on the chart.

B) What is the denominator of the case rates? Is it literal - by that I mean, all unvaccinated people for the red line, and all people with 3 doses for the blue line? Or is the denominator the population of Israel, the same number for all three lines? Lukas asked this question, and got the answer (the former).

C) Since third shots are recommended for 60 year olds and over who were vaccinated at least 5 months ago, and most unvaccinated Israelis are below 60, this answer opens the possibility that the lines compare apples and oranges. Joe S. asked about this, and received an answer (all lines display only 60 year olds and over).

Jason P. asked, and learned that the 5-month-out criterion is immaterial since 90% of the vaccinated have already reached that time point.

D) We have even more presumptions. Like me, did you presume that the red line represents the "unvaccinated," meaning people who have not had any vaccine shots? If so, we may both be wrong about this. It has become the norm among vaccine researchers to lump "partially vaccinated" people with "unvaccinated", and call this combined group "unvaccinated". Here is an excerpt from a recent report from Public Health Ontario (link to PDF), which clearly states this unintuitive counting rule:

Notice that in this definition, someone who got infected within 14 days of the first shot is classified as an "unvaccinated" case and not a "partially vaccinated" case.

In the following tweet, Dvir gave a hint of what he plotted:

In a previous analysis, he averaged the rates of people with 0 doses and 1 dose, which is equivalent to combining them and calling them unvaccinated. It's unclear to me what he did to the 1-dose subgroup in our featured chart - did it just vanish from the chart? (How people and cases are classified into these groups is a major factor in all vaccine effectiveness calculations - a topic I covered here. Unfortunately, most published reports do a poor job explaining what the analysts did).

E) Did you presume that all three lines are equally important? That's far from true. Since Israel is the world champion in vaccination, the bulk of the 60+ population form the green line. I asked Dvir and he responded that only 7.5%, or roughly 100K are unvaccinated.

That means 1.2 million people are part of the green line, 12 times higher. There are roughly 50 cases per day among unvaccinated, and 370 daily cases among those with 2 or 3 doses. In other words, vaccinated people account for almost 90% of all cases.

Yes, this is inevitable when over 90% of the age group has been vaccinated (but it was predictable from the first day someone broadcast everywhere that real-world VE is proved by the fact that almost all new cases were in the unvaccinated).

If your job is to minimize infections, you should be spending most of your time thinking about the 370 cases among the vaccinated rather than the 50 cases among the unvaccinated. If you halve the case rate, that would be a difference of 185 cases vs 25. In Israel, the vaccination campaign has already succeeded; it's time to look forward, which is exactly why they are re-focusing on the already vaccinated.
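The arithmetic behind these statements is short enough to write out. The sketch below uses only the rounded daily case counts quoted above; the variable names are mine.

```python
# Rounded daily case counts quoted in the text (60+ age group).
unvaccinated_cases = 50
vaccinated_cases = 370  # people with 2 or 3 doses

# Share of all daily cases that occur among the vaccinated.
total_cases = unvaccinated_cases + vaccinated_cases
vaccinated_share = vaccinated_cases / total_cases

# Cases avoided per day if the case rate in each group were halved.
avoided_vaccinated = vaccinated_cases / 2
avoided_unvaccinated = unvaccinated_cases / 2

print(f"vaccinated share of cases: {vaccinated_share:.1%}")
print(f"cases avoided by halving: {avoided_vaccinated:.0f} vs {avoided_unvaccinated:.0f}")
```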

***

If what you worry about most is the effectiveness of the original two-dose regimen, Dvir's chart raises a puzzle. Ignore the blue line, and remember that the green line already includes everybody represented by the blue line.

In the following chart, I removed the blue line, and added reference lines in dashed purple that correspond to 25%, 50% and 75% vaccine effectiveness. The data plotted on this chart are unadjusted case rates. A 75% effective vaccine cuts case rate by three quarters.

This chart shows the 2-dose mRNA vaccine was nowhere near 90% effective. (As regular readers know, I don't endorse this simplistic calculation and have outlined the problems here, but this style of calculation keeps getting published and passed around. Those who use it to claim real-world studies confirm prior clinical trial outcomes can either (a) insist on using it and retract their earlier conclusions, or (b) admit that such a calculation was, and is, a bad take.)
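For readers who want to see the simplistic calculation spelled out, here is a rough sketch. It plugs in the rounded figures quoted earlier in this post (about 100K unvaccinated with 50 daily cases, about 1.2 million vaccinated with 370 daily cases) rather than the data behind the chart, so treat the result as a back-of-envelope number only. The function names are invented for this illustration.

```python
def rate_per_100k(cases, population):
    """Unadjusted daily case rate per 100,000 people."""
    return cases / population * 100_000

def crude_ve(rate_vaccinated, rate_unvaccinated):
    """Simplistic vaccine effectiveness: relative reduction in case rate."""
    return 1 - rate_vaccinated / rate_unvaccinated

def reference_rate(rate_unvaccinated, ve):
    """Case rate a vaccine of the given effectiveness would produce."""
    return rate_unvaccinated * (1 - ve)

# Rounded figures from the text, not the chart's underlying data.
rate_unvax = rate_per_100k(50, 100_000)
rate_vax = rate_per_100k(370, 1_200_000)

ve_estimate = crude_ve(rate_vax, rate_unvax)

# Where the dashed reference lines would sit relative to the unvaccinated rate.
reference_lines = {ve: reference_rate(rate_unvax, ve) for ve in (0.25, 0.50, 0.75)}

print(f"crude VE from rounded figures: {ve_estimate:.0%}")
for ve, rate in reference_lines.items():
    print(f"{ve:.0%} effective -> {rate:.1f} cases per 100K")
```

With these inputs the crude estimate lands between the 25% and 50% reference lines, which is the sense in which the unadjusted rates sit nowhere near 90%.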

Also observe how the vaccinated (green) line is moving away from the unvaccinated (red) line. The vaccine apparently is becoming more effective, which runs counter to the trend used by the Israeli government to justify third doses. This improvement also precedes the start of the third-shot campaign. When the analytical method is bad, it generates all sorts of spurious findings.

***

As Dvir said, it is premature to comment on the third doses based on 10 days of data. For one thing, the vaccine developers insist that their vaccines must be given 14 days to work. In a typical calculation, all of the cases in the blue line fall outside the case-counting window. The effective number of cases that would be attributed to the 3-dose group right now is zero, and the vaccine effectiveness using the standard methodology is 100%, even better than shown in the chart.

There is an alternative interpretation of this graph. Statisticians call this the selection effect. On July 30, the blue line split out of the green: some people were selected to receive the 3rd dose - this includes an official selection (the government makes certain subgroups eligible) as well as a self-selection (within the eligible subgroup, certain people decide to get the 3rd shot earlier.) If those who are less exposed to the virus, or more risk averse, get the shots first, then all that is happening may be that we have split off a high VE subgroup from the green line. Even if the third shot were useless, the selection effect itself could explain the gap.
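A toy calculation shows how the selection effect alone can open a gap. Everything below is hypothetical: the subgroup sizes and case rates are invented, and the third dose is assumed to do nothing at all.

```python
def rate_per_100k(cases, people):
    """Daily case rate per 100,000 people."""
    return cases / people * 100_000

# Hypothetical vaccinated population, split by exposure to the virus.
low_risk_people, low_risk_rate = 400_000, 10    # rate per 100K per day
high_risk_people, high_risk_rate = 600_000, 40  # rate per 100K per day

# Assumption: the first 200,000 boosters all go to low-risk people,
# and the third dose has zero effect on anyone's risk.
boosted_people = 200_000
boosted_cases = boosted_people * low_risk_rate / 100_000

# Everyone else: the remaining low-risk people plus all high-risk people.
rest_low_people = low_risk_people - boosted_people
rest_people = rest_low_people + high_risk_people
rest_cases = (
    rest_low_people * low_risk_rate / 100_000
    + high_risk_people * high_risk_rate / 100_000
)

boosted_rate = rate_per_100k(boosted_cases, boosted_people)
rest_rate = rate_per_100k(rest_cases, rest_people)

# The "effectiveness" a naive comparison would attribute to the useless shot.
apparent_ve = 1 - boosted_rate / rest_rate

print(f"boosted: {boosted_rate:.1f} per 100K, not boosted: {rest_rate:.1f} per 100K")
print(f"apparent effectiveness of a useless third dose: {apparent_ve:.0%}")
```

The gap here comes entirely from who was selected. Real data will mix this effect with whatever the third dose actually does.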

Statistics is about grays. It's not either-or. It's usually some of each. If you feel like Groundhog Day, you're getting the picture. When they rolled out two doses, we lived through an optimistic period in which most experts rejoiced about 90-100% real-world effectiveness, and then as more people got vaccinated, the effect washed away. The selection effect gradually disappears when vaccination becomes widespread. Are we starting a new cycle of hope and despair? We'll find out soon enough.

Bloomberg made me digest these graphics slowly

Oct 12, 2020

Ask the experts to name the success metric of good data visualization, and you will receive a dozen answers. The field doesn't have an all-encompassing metric. A useful reference is Andrew Gelman and Antony Unwin (2012) in which they discussed the tradeoff between beautiful and informative, which derives from the familiar tension between art and science.

For a while now, I've been intrigued by metrics that measure "effort". Some years ago, I described the concept of a "return on effort" in this post. Such a metric can be constructed like the dominating financial metric of return on investment. The investment here is an investment of time, of attention. I strongly believe that if the consumer judges a data visualization to be compelling, engaging or well constructed, s/he will expend energy to devour it.

Imagine grub you discard after the first bite, compared to the delicious food experienced slowly, savoring every last bit.

I'm writing this post while enjoying the September issue of Bloomberg Businessweek, which focuses on the upcoming U.S. Presidential election. There are various graphics infused into the pages of the magazine. Many of these graphics operate at a level of complexity above what typically shows up in magazines, and yet I spent energy learning to understand them. This response, I believe, is what visual designers should aim for.

***

Today, I discuss one example of these graphics, shown on the right. You might be shocked by the throwback style of these graphics. They look like they arrived from decades ago!

Grayscale, simple forms, typewriter font, all caps. Have I gone crazy?

The article argues that a town like Ambridge in Beaver County, Pennsylvania may be pivotal in the November election. The set of graphics provides relevant data to understand this argument.

It's evidence that data visualization does not need whiz-bang modern wizardry to excel.

Let me focus on the boxy charts from the top of the column. These:

These charts solve a headache with voting margin data in the U.S.  We have two dominant political parties so in any given election, the vote share data split into three buckets: Democratic, Republican, and a catch-all category that includes third parties, write-ins, and none of the above. The third category rarely exceeds 5 percent.  A generic pie chart representation looks like this:

Stacked bars have this look:

Using my Trifecta framework (link), the top point is articulating the question. The primary issue here is the voting margin between the winner and the runner-up, which is the loser in what is typically a two-horse race. There exist two sub-questions: the vote-share difference between the top two finishers, and the share of vote effectively removed from the pot by the remaining candidates.

Now, take another look at the unusual chart form used by Bloomberg:

The catch-all vote share sits at the bottom while the two major parties split up the top section. This design demonstrates a keen understanding of the context. Consider the typical outcome, in which the top two finishers are from the two major parties. When answering the first sub-question, we can choose the raw vote shares, or the normalized vote shares. Normalizing shifts the base from all candidates to the top two candidates.

The Bloomberg chart addresses both scales. The normalized vote shares can be read directly by focusing only on the top section. In an even two-horse race, the top section is split in half - this holds true regardless of the size of the bottom section.
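Normalizing is a one-line calculation. The sketch below uses hypothetical vote shares (in percent), and the function name is invented for this illustration.

```python
def normalized_shares(dem, rep):
    """Rebase the two major-party vote shares on the top-two total only."""
    top_two = dem + rep
    return dem / top_two, rep / top_two

# Hypothetical election: raw shares in percent, with 5% going to all others.
dem_raw, rep_raw, other_raw = 48.0, 47.0, 5.0
dem_norm, rep_norm = normalized_shares(dem_raw, rep_raw)

# An even two-horse race splits the top section in half,
# whatever the size of the catch-all category.
even_small_other = normalized_shares(47.5, 47.5)  # 5% other
even_large_other = normalized_shares(45.0, 45.0)  # 10% other

print(f"raw: {dem_raw}% vs {rep_raw}%, normalized: {dem_norm:.1%} vs {rep_norm:.1%}")
print(even_small_other, even_large_other)
```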

This is a simple chart that packs a punch.

Election visual 3: a strange, mash-up visualization

Sep 21, 2020

Continuing our review of FiveThirtyEight's election forecasting model visualization (link), I now look at their headline data visualization. (The previous posts in this series are here, and here.)

It's a set of 22 maps, each showing one election scenario, with one candidate winning. What chart form is this?

Small multiples may come to mind. A small-multiples chart is a grid in which every component graphic has the same form - same chart type, same color scheme, same scale, etc. The only variation from graphic to graphic is the data. The data are typically varied along a dimension of interest, for example, age groups, geographic regions, years. The following small-multiples chart, which I praised in the past (link), shows liquor consumption across the world.

Each component graphic changes according to the data specific to a country. When we scan across the grid, we draw conclusions about country-to-country variations. By convention, there are as many graphics as there are countries in the dataset. Sometimes, the designer includes only countries that are directly relevant to the chart's topic.

***

What is the variable FiveThirtyEight chose to vary from map to map? It's the scenario used in the election forecasting model.

This choice is unconventional. The 22 scenarios are a subset of the 40,000 scenarios from the simulation - we are left wondering how those 22 were chosen.

Returning to our question: what chart form is this?

Perhaps you're reminded of the dot plot from the previous post. On that dot plot, the designer summarized the results of 40,000 scenarios using 100 dots. Since Biden is the winner in 75 percent of all scenarios, the dot plot shows 75 blue dots (and 25 red).

The map is the new dot. The 75 blue dots become 16 blue maps (rounded down) while the 25 red dots become 6 red maps.

Is it a pictogram of maps? If we ignore the details on the maps, and focus on the counts of colors, then yes. It's just a bit challenging because of the hole in the middle, and the atypical number of maps.

As with the dot plot, the map details are a nice touch. It connects readers with the simulation model which can feel very abstract.

Oddly, if you're someone familiar with probabilities, this presentation is quite confusing.

With 40,000 scenarios reduced to 22 maps, each map should represent 1818 scenarios. On the dot plot, each dot should represent 400 scenarios. This follows the rule for creating pictograms. Each object in a pictogram - dot, map, figurine, etc. - should encode an equal amount of the data. For the 538 visualization, is it true that each of the six red maps represents 1818 scenarios? This may be the case but not likely.
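The pictogram arithmetic, written out as a quick check. The inputs are the figures quoted in this post (40,000 scenarios, 100 dots, 22 maps, Biden winning 75 percent of scenarios); the variable names are mine.

```python
total_scenarios = 40_000
n_dots = 100
n_maps = 22
biden_win_share = 0.75

# Under the pictogram rule, each object stands for an equal slice of the data.
scenarios_per_dot = total_scenarios / n_dots
scenarios_per_map = total_scenarios / n_maps

# 75% of 22 maps is 16.5, which the graphic shows as 16 blue and 6 red.
exact_blue_maps = n_maps * biden_win_share
blue_maps = int(exact_blue_maps)  # rounded down
red_maps = n_maps - blue_maps

# The share of blue maps a reader would count off the grid.
implied_blue_share = blue_maps / n_maps

print(f"{scenarios_per_dot:.0f} scenarios per dot, {scenarios_per_map:.0f} per map")
print(f"{blue_maps} blue + {red_maps} red maps, implied share {implied_blue_share:.1%}")
```

Counting maps therefore gives a slightly lower Biden share than the dot plot does, a side effect of the atypical number of maps.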

Recall the dot plot where the most extreme red dot shows a scenario in which Trump wins 376 out of 538 electoral votes (margin = 214). Each dot should represent 400 scenarios. The visualization implies that there are 400 scenarios similar to the one on display. For the grid of maps, the following red map from the top left corner should, in theory, represent 1,818 similar scenarios. Could be, but I'm not sure.

Mathematically, each of the depicted scenarios, including the blowout win above, occurs with 1/40,000 chance in the simulation. However, one expects few scenarios that look like the extreme scenario, and ample scenarios that look like the median scenario.

So, the right way to read the 538 chart is to ignore the map details when reading the embedded pictogram, and then look at the small multiples of detailed maps bearing in mind that extreme scenarios are unique while median scenarios have many lookalikes.

(Come to think of it, the analogous situation in the liquor consumption chart is the relative population size of different countries. When comparing country to country, we tend to forget that the data apply to large numbers of people in populous countries, and small numbers in tiny countries.)

***

There's a small improvement that can be made to the detailed maps. As I compare one map to the next, I'm trying to pick out which states have flipped to change the vote margin. Conceptually, the number of states painted red should decrease as the winning margin decreases, and the states that shift colors should be the toss-up states.

So I'd draw the solid Republican (Democratic) states with a lighter shade, forming an easily identifiable bloc on all maps, while the toss-up states are shown with a heavier shade.

Here, I just added a darker shade to the states that disappear from the first red map to the second.

Super-informative ping-pong graphic

May 11, 2016

Via Twitter, Mike W. asked me to comment on this WSJ article about ping pong tables. According to the article, ping pong table sales track venture-capital deal flow:

This chart is super-informative. I learned a lot from this chart, including:

• Very few VC-funded startups play ping pong, since the highlighted reference lines show 1000 deals and only 150 tables (!)
• The one San Jose store interviewed for the article is the epicenter of ping-pong table sales, therefore they can use it as a proxy for all stores and all parts of the country
• The San Jose store only does business with VC startups, which is why they attribute all ping-pong tables sold to these companies
• Startups purchase ping-pong tables in the same quarter as their VC deals, which is why they focus only on within-quarter comparisons
• Silicon Valley startups only source their office equipment from Silicon Valley retailers
• VC deal flow has no seasonality
• Ping-pong table sales have no seasonality either
• It is possible to predict the past (VC deals made) by gathering data about the future (ping-pong tables sold)

Further, the chart proves that one can draw conclusions from a single observation. Here is what the same chart looks like after taking out the 2016 Q1 data point:

This revised chart is also quite informative. I learned:

• At the same level of ping-pong-table sales (roughly 150 tables), the number of VC deals ranged from 920 to 1020, about one-third of the vertical range shown in the original chart
• At the same level of VC deals (roughly 1000 deals), the number of ping-pong tables sold ranged from 150 to 230, about half of the horizontal range of the original chart

The many quotes in the WSJ article also tell us that people in Silicon Valley are no more data-driven than people in other parts of the country.

Apr 28, 2015

Reader Jamie H. pointed me to the following chart in the Guardian (link), which originated from Spotify.

This chart is likely inspired by the Arctic ice cover chart discussed here last year (link):

Spotify calls its chart "the Coolness Spiral of Death" while the other one is called "Arctic Death Spiral".

The spiral chart has many problems, some of which I discussed in the post from last year. Just take a look at the headline, and then the black dotted spiral. Does the shape evoke the idea of rapid evolution, followed by maturation? Or try to figure out the amount of evolution between ages 18 and 30.

***

Instead of the V corner of the Trifecta, I'd like to focus on the D corner today. When I look at charts, I'm always imagining the data behind the chart. Here are some questions to ponder:

• Given that Spotify was founded in 2006 (not quite 10 years ago), how are they able to discern someone's music taste from 14 through 48?
• The answer to the above question is they don't have a longitudinal view of anyone's music taste. They are comparing today's 14-year-old kid with today's 48-year-old adult. Under what assumptions would such an analysis yield the same outcome as a proper analysis that tracks the same people over time?
• If the phenomenon under study follows a predictable trend, there will be little difference between the two ways of looking at the data. For example, teeth in the average baby follow a certain sequence of emergence, first incisors at six months, and first molars at 14 months (according to Wikipedia). Observing John's teething at six months and David's at 14 months won't yield much difference from looking at John at six then 14 months. Does music taste evolve like human growth?
• Unfortunately, no. Imagine that a new genre of music suddenly erupts and it becomes popular among every generation of listeners. This causes the Spotify curve to shift towards the origin at all ages. However, if you take someone who is currently 30 years old, the emergence of the new genre should affect his profile at age 30 but not anytime before. In fact, the new music creates a sharp shift at different locations of everyone's taste profile depending on one's age!
• Let's re-interpret the chart, and accept that each spoke in the wheel concerns a different cohort of people. So we are looking at generational differences. Is the Spotify audience representative of music listeners? Particularly, is each Spotify cohort representative of all listeners of that age?
• I find it unlikely since Spotify has that "cool" factor. It is probably more representative for younger age groups. Among older customers, there should be some bias. How does this affect the interpretation of the taste profile?
• If we find that one cohort differs from another cohort, it is important to establish that the gap is a generational difference and not due to the older age group being biased (self-selected) in some way.
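The difference between a snapshot across ages and a track of one person over time can be sketched with a toy model. Everything here is hypothetical: `baseline` is an invented "distance from mainstream" score that grows with age, and a new genre is assumed to pull everyone 5 points toward the mainstream starting in the snapshot year.

```python
SNAPSHOT_YEAR = 2015  # year in which every age group is observed
SHOCK_YEAR = 2015     # hypothetical year the new genre erupts
SHIFT = -5            # hypothetical pull toward the mainstream

def baseline(age):
    """Invented 'distance from mainstream' score that grows with age."""
    return 2 * (age - 14)

def taste(age, year):
    """Score for a listener of a given age in a given calendar year."""
    shift = SHIFT if year >= SHOCK_YEAR else 0
    return baseline(age) + shift

ages = range(14, 31)

# Cross-sectional view: different people, all observed in the snapshot year.
cross_section = {age: taste(age, SNAPSHOT_YEAR) for age in ages}

# Longitudinal view: one person who is 30 in the snapshot year,
# observed at each earlier age in the calendar year they were that age.
longitudinal = {age: taste(age, SNAPSHOT_YEAR - (30 - age)) for age in ages}

gap_at_20 = cross_section[20] - longitudinal[20]
gap_at_30 = cross_section[30] - longitudinal[30]
print(gap_at_20, gap_at_30)
```

In the snapshot, the shift shows up at every age; in the individual's history, it shows up only at age 30. The two views agree only when nothing like the shock happens.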

World Bank fails to lead the way in #dataviz

Aug 12, 2014

Matthew Yglesias, writing for Vox, cited the following chart from a World Bank project:

His comment was: "We can see that while China has overtaken Germany and Japan to become the world's second-largest economy (i.e., total area of the rectangle) its citizens are nowhere near being as rich as those of those countries or even Mexico."

Yes, the chart encodes the size of the economy in a rectangular area, with one side being the per-capita GDP and the other being the population. I am not sure about the "we can see". I am not confident that the short and wide rectangle for China is larger than the thin and tall ones for Japan and for Germany. Perhaps Matthew is relying on knowledge in his head, rather than knowledge on the chart, to come to this conclusion.

This is the trouble with rectangular area charts: they have a nerdy appeal since side x side = area but as a communications device, they fail.
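To get a feel for why the comparison is hard by eye, consider two rectangles with made-up dimensions in arbitrary units (they are not read off the World Bank chart). One is short and wide, the other tall and thin.

```python
# Hypothetical rectangles: width stands for population, height for per-capita GDP.
wide = {"width": 1350, "height": 7}
tall = {"width": 127, "height": 40}

def area(rect):
    """Area of a rectangle, i.e. width x height (total GDP in this encoding)."""
    return rect["width"] * rect["height"]

area_wide = area(wide)
area_tall = area(tall)
ratio = area_wide / area_tall

print(f"wide: {area_wide}, tall: {area_tall}, ratio: {ratio:.2f}")
```

The wide rectangle is almost twice as large, but it is under one-fifth the height of the tall one and over ten times its width. Judging such a pair visually means multiplying two very different ratios in your head.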

Here are some problems with the chart:

• it's difficult to compare rectangular areas
• the columns can only be sorted in one way (I'd have chosen to order it by population)
• labeling is inelegant
• colors are necessitated by the chart type not the data
• the cumulative horizontal axis makes no sense unless the vertical axis is cumulative GDP (or cumulative GDP per capita)

Matthew should also have mentioned PPP (Purchasing Power Parity). If GDP is used as a measure of "wellbeing", then costs of living should be taken into account in addition to incomes. The cost of living in China is much lower than in Japan or Germany and using the prevailing exchange rates disguises this point.

In the Trifecta Checkup, this is a Type QDV.

Try your hand at fixing this one. There are no easy solutions. Does interactivity help? How about multiple charts? You will learn why I classify it as QDV instead of just DV.

[Update, 8/18/2014:] Xan Gregg created a scatter plot version of the chart. He also added, "There is still the issue of what the question is, but I'm assuming it's along the lines of "How do economies compare regarding GDP, population, and GDP/capita?" I'm using the PPP-based GDP, but I didn't read the report carefully enough to figure out if another measure was better."

A reader submits a Type DV analysis

Jun 09, 2014

Darin Myers at PGi was kind enough to send over an analysis of a chart using the Trifecta Checkup framework. I'm reproducing the critique in full, with a comment at the end.

***

At first glance this looks like a valid question, with good data, presented poorly (Type V). Checking the fine print (glad it’s included), the data falls apart.

Question

It’s a good question…What device are we using the most? With so much digital entertainment being published every day, it pays to know what your audience is using to access your content. The problem is this data doesn’t really answer that question conclusively.

DATA

This was based on survey data asking respondents “Roughly how long did you spend yesterday…watching television (not online) / using the internet on a laptop or PC / on a smartphone / on a tablet?” Survey respondents were limited to those who owned or had access to a TV and a smartphone and/or tablet.

• Did they ask everyone on the same day, random days, or are some days over represented here?
• This is self-reported, not tracked…who accurately remembers their average screen time on each device a day later? I imagine the vast majority of answers were round numbers (30, 45 minutes or 2 hours). This data shows accuracy to the minute that is not really provided by the users.

In fact the Council for Research Excellence found that self-reported screen time does not correlate with actual screen time. “Some media tend to be over-reported whereas others tend to be under-reported – sometimes to an alarming extent.” -Mike Bloxham, director of insight and research for Ball State

VISUAL

The visual has the usual problems with stacked bar charts where it is easy to see the first bar and the total, but not to judge the other values. This may not be an issue based on the question, but the presentation is focusing on an individual piece of tech (smartphones), so the design should focus on smartphones. At the very least, smartphones should be the first column in the chart and it should be sorted by smartphone usage.

My implementation is simply to compare the smartphone usage to the usage of the next highest device. Overall 53% of the time people are using a smartphone compared to something else. I went back and forth on whether I should keep the Tablet category in the Key though it was not the first or second used device. In the end, I decided to keep it to parallel the source visual.

Despite the data problems, I was really interested in seeing the breakdowns in each country by device, so I built the chart below with rank added (in bold). I also built some simple interaction to sort by column when you click the header [Ed: I did not attach the interactive Excel sheet that came with the submission]. As a final touch, I displayed the color corresponding to the highest usage as a box to the left of the country name. It’s easy to see that the vast majority of countries use smartphones the most.

***

Hope you enjoyed Darin's analysis and revamp of the chart. The diagnosis is spot on. I like the second revision of the chart, especially for analysts who really want to know the exact numbers. The first redo has the benefit of greater simplicity, though it can be a tough sell to an audience, especially when using color to indicate the second most popular device while disassociating the color from the length of the bar.

The biggest problem in the original treatment is the misalignment of the data with the question being asked. In addition to the points made by Darin, the glaring issue relates to the responder population. The analysis only includes people who have at least a smartphone or a tablet. But many people in lesser developed countries do not have either device. In those countries, it is likely that the TV screen time has been strongly underestimated. People who watch TV but do not own a smartphone or tablet are simply dropped from consideration.
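A small weighted-average calculation shows how large this underestimate could be. All figures are hypothetical: a country where 40 percent of people own a smartphone or tablet, with invented daily TV minutes for owners and non-owners.

```python
# Hypothetical country; every number is invented for illustration.
owner_pct = 40             # percent owning a smartphone or tablet
non_owner_pct = 100 - owner_pct
owner_tv_minutes = 100     # average daily TV time among device owners
non_owner_tv_minutes = 180 # average daily TV time among everyone else

# The survey only reaches device owners, so it reports their average.
survey_estimate = owner_tv_minutes

# The true population average weights both groups by their size.
population_average = (
    owner_pct * owner_tv_minutes + non_owner_pct * non_owner_tv_minutes
) / 100

underestimate = population_average - survey_estimate
print(f"survey: {survey_estimate} min, population: {population_average} min")
```

The smaller the share of device owners, the further the survey figure can drift from the population figure, which is why the problem is worst in less developed countries.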

For this same reason, the other footnoted comment claiming that the sampling frame accounts for ~70 percent of the global population is an irrelevance.

Small multiples with simple axes

Feb 05, 2014

Jens M., a long-time reader, submits a good graphic! This small-multiples chart (via Quartz) compares the consumption of liquor from selected countries around the world, showing both the level of consumption and the change over time.

What they did right:

• Did not put the data on a map
• Ordered the countries by the most recent data point rather than alphabetically
• Scale labels are found only on outer edge of the chart area, rather than one set per panel
• Only used three labels for the 11 years on the plot
• Did not overdo the vertical scale either

The nicest feature was the XL scale applied only to South Korea. This destroys the small-multiples principle but draws attention to the top left corner, where the designer wants our eyes to go. I would have used smaller fonts throughout.

Having done so much work to simplify the data and expose the patterns, it's time to look at whether we can add some complexity without going overboard. I'd suggest using a different color to draw attention to curves that are strangely shaped -- Ukraine comes to mind, so does Brazil.

I'd also consider adding the top liquor in each country... the writeup made a big deal out of the fact that most of the drinking in South Korea is of Soju.

***

One way to appreciate the greatness of the chart is to look at alternatives.

Here, the Economist tries the lazy approach of using a map: (link)

For one thing, they have to give up the time dimension.

A variation is a cartogram in which the physical size and shape of countries are mapped to the underlying data. Here's one on Worldmapper (link):

One problem with this transformation is what to do with missing data.

Wikipedia has a better map with variations of one color (link):

The Atlantic realizes that populations are not evenly distributed on the map so instead of coloring countries, they put bubbles on top of the map (link):

Unfortunately, they scaled the bubbles to total consumption rather than per-capita consumption. You guessed it: China gets the biggest bubble, much larger than any other, but from a per-capita standpoint, China is behind many other countries depicted on the map.
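A small sketch of the total versus per-capita problem, using three hypothetical countries with invented figures (nothing here is taken from the Atlantic's data). It ranks the same countries by total consumption and by consumption per person.

```python
# Hypothetical countries with invented figures, for illustration only.
# population_m: population in millions; litres_m: total consumption in millions of litres.
countries = {
    "Country A": {"population_m": 1000, "litres_m": 2000},
    "Country B": {"population_m": 50, "litres_m": 500},
    "Country C": {"population_m": 10, "litres_m": 60},
}

# Litres per person: total consumption divided by population.
per_capita = {
    name: d["litres_m"] / d["population_m"] for name, d in countries.items()
}

# Rank by the quantity the bubbles encode (total) and by the per-capita rate.
rank_total = sorted(countries, key=lambda n: countries[n]["litres_m"], reverse=True)
rank_per_capita = sorted(per_capita, key=per_capita.get, reverse=True)

print("By total:     ", rank_total)
print("By per capita:", rank_per_capita)
```

With these made-up inputs, the most populous country tops the total ranking and comes last per capita, which is the pattern described above.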

PS. A note on submissions. I welcome submissions, especially if you have a good chart to offer. Please ping me if I don't reply within a few weeks. I may have just missed your email. Also, realize that submissions take even more time to research, since they are likely in an area I have little knowledge about, and mostly because you sent them to me hoping I'll research them. Sometimes I give up since it's taking too much time. If you ping me again, I'll let you know if I'm working on it.

The above does not apply to emails from people who are building traffic for their infographics.

PPS. Andrew Gelman chimes in with his take on small multiples.

Beyond the obvious

Dec 02, 2013

Flowing Data has been doing some fine work on the baby names data. The names voyager is a successful project by Martin Wattenberg that has received praise from many corners. It's one of these projects that have taken on a commercial life as you can see from the link.

Here is a typical area chart presentation of the baby names data:

The typical insight one takes from this chart is that the name "Michael" (as a boy's name) reached a peak in the 1970s and has not been as popular lately. The data is organized as a series of trend lines, for each name and each gender.

Speaking of area charts, I have never understood their appeal. If I were to click on Michael in the above chart, the design responds by restricting itself to all names starting with "Michael", meaning it includes Michael given to a girl, and Michaela, for example. See below.

What is curious is that the peak has a red lining. At first thought, one expects to find hiding behind the blue Michael a girl's name that is almost as popular. But this is a stacked area chart so in fact, the girl's name (Michael given to a girl, if you mouse over it) is much less popular than the boy Michael (20,000 to 500 roughly).

***

Nathan decides to dig a layer deeper. Is there more information beyond the popularity of baby names over time?

In this post, Nathan zeroes in on the subset of names that are "unisex," that is to say, have been used to name both boys and girls. He selects the top 35 names based on a mean-square-error criterion and exposes the gender bias for each name. The metric being plotted is no longer pure popularity but gender popularity. The larger the red area, the greater the proportion of girls being given that name.
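The exact form of the mean-square-error criterion is not given here, so the following is only one plausible reading, not Nathan's actual method. It assumes the criterion measures how far a name's yearly share of girls sits from an even 50/50 split, with smaller values meaning "more unisex". The names, shares and the `mse_from_even_split` helper are all hypothetical.

```python
def mse_from_even_split(girl_shares):
    """Mean squared distance of the yearly girl share from 0.5 (assumed criterion)."""
    return sum((share - 0.5) ** 2 for share in girl_shares) / len(girl_shares)

# Invented yearly shares of girls among babies given each name.
girl_share_by_name = {
    "Name X": [0.45, 0.50, 0.55],  # close to even throughout
    "Name Y": [0.10, 0.20, 0.90],  # swings from mostly boys to mostly girls
    "Name Z": [0.95, 0.97, 0.99],  # almost always girls
}

# Smaller score means closer to an even split over the period.
ranked = sorted(
    girl_share_by_name,
    key=lambda name: mse_from_even_split(girl_share_by_name[name]),
)

for name in ranked:
    print(name, round(mse_from_even_split(girl_share_by_name[name]), 4))
```

Note that a name like the hypothetical "Name Y", which flips from one gender to the other, scores poorly under this reading even though it is used for both boys and girls, so the choice of criterion shapes which names make the list.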

You can readily see some interesting trends. Kim (#34) has become predominantly female since the 1960s. On the other hand, Robbie (#18) used to be predominantly female but is now mostly a boy's name.

One useful tip when performing this analysis is to pay attention to the popularity of each name (the original metric) even though you've decided to switch to the new metric of gender bias. This is because the relative proportions are unstable and difficult to interpret for less popular names. For example, the Name Voyager shows no values for Gale (#29) after the 1970s, which probably explains the massive gyrations in the 1990s and beyond.
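To see why proportions are shaky for rare names, here is a rough sketch that attaches a binomial standard error to the girl share. This is my own illustration rather than anything taken from the Name Voyager, the counts are invented, and `girl_share_with_error` is a hypothetical helper.

```python
import math

def girl_share_with_error(girls, boys):
    """Return (share of girls, binomial standard error), or None if no babies got the name."""
    total = girls + boys
    if total == 0:
        # No data for this year: the proportion is undefined, not zero.
        return None
    share = girls / total
    std_error = math.sqrt(share * (1 - share) / total)
    return share, std_error

# Invented counts: the same 50/50 split for a rare name and a popular one.
rare = girl_share_with_error(5, 5)
popular = girl_share_with_error(5000, 5000)
missing = girl_share_with_error(0, 0)

print("rare name:   ", rare)
print("popular name:", popular)
print("no births:   ", missing)
```

Both names show a 50 percent girl share, but the standard error for the rare name is roughly thirty times larger, so a handful of births in either direction can swing its line wildly. Returning `None` for years with no births keeps an undefined proportion from being plotted as if it were a real value.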