Losing the big picture
Jan 08, 2014
One of the dangers of "Big Data" is the temptation to get lost in the details. You become so absorbed in the peeling of the onion that you don't realize your tear glands have dried up.
Hans Rosling linked to a visualization of tobacco use around the world from Twitter (link to original). The setup is quite nice for exploration. I'd call this a "tool" rather than a visual.
Let's take a look at the concentric circles on the right.
I appreciate the designer's concept -- the typical visualization of this type of data is looking at relative rates, which obscures the fact that China and India have far and away the most smokers even if their rates are middling (24% and 13% respectively).
This circular chart is supposed to show the absolute distribution of smokers across so-called "super-regions" of the world.
Unfortunately, the designer decided to pile on additional details. The concentric circles present a geography lesson, in effect. For example, the high-income super-region is composed of high-income North America, Western Europe, high-income Asia Pacific, etc., and then high-income North America is composed of USA, Canada, etc.
Notice something odd? The further out you go, the larger the circular segments but the fewer people they represent! There are more people in the high-income super-region worldwide than in high-income North America and, in turn, there are more people in the high-income North American region than in the USA. But the size of the graphical elements is reversed.
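The geometry behind this reversal is easy to check. Here is a minimal sketch, assuming a sunburst with three rings of equal width (the radii below are hypothetical, not measured from the tool). A segment is an annular sector, so its area grows with the difference of the squared radii, not with the ring width.

```python
import math

def wedge_area(angle_rad, r_inner, r_outer):
    """Area of an annular sector (one sunburst segment)."""
    return 0.5 * angle_rad * (r_outer ** 2 - r_inner ** 2)

# Hypothetical layout: three rings of equal width, radii 1-2, 2-3, 3-4.
rings = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
angle = math.radians(90)  # the same angular span at every level

areas = [wedge_area(angle, r_in, r_out) for r_in, r_out in rings]
ratios = [a / areas[0] for a in areas]  # ink relative to the innermost ring

def equal_area_outer_radius(r_inner, angle_rad, target_area):
    """Outer radius that gives a wedge starting at r_inner the target area."""
    return math.sqrt(r_inner ** 2 + 2.0 * target_area / angle_rad)

for (r_in, r_out), ratio in zip(rings, ratios):
    print(f"ring {r_in:.0f}-{r_out:.0f}: {ratio:.2f}x the area of the innermost ring")
```

Under these assumed radii, the same angular span gets roughly 1.67 and 2.33 times as much ink in the second and third rings as in the first, even though it stands for the same group of people or fewer. The helper equal_area_outer_radius shows one possible adjustment: if the second ring should carry the same area as the first, it would have to end at about 2.65 rather than 3.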
In principle, the "bumps"-like chart used to show the evolution of tobacco prevalence in individual countries makes for a nice visual. In fact, Rosling marvelled that the global rate of consumption has fallen in recent years.
However, I'm often irritated when the designer pays no attention to what not to show. There are probably well above 200 lines densely packed into this chart. Over-plotting almost surely means that some of these lines literally never see the light of day. Try hovering over these lines and see for yourself.
The same chart with, say, 10 judiciously chosen lines (countries or regions) would give the reader a lot more.
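What counts as "judicious" is a design decision, but one simple starting rule is to keep the lines that represent the most smokers. The sketch below assumes a hypothetical mapping from country name to number of smokers; the names and counts are placeholders, not figures from the tool.

```python
def pick_lines(smokers_by_country, n=10):
    """Return the n countries with the most smokers, largest first."""
    ranked = sorted(smokers_by_country.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Placeholder data (millions of smokers); not real estimates.
example = {"A": 280.0, "B": 110.0, "C": 3.5, "D": 0.2, "E": 41.0}

chosen = pick_lines(example, n=3)
print(chosen)
```

Everything not chosen can still be drawn in a faint grey, or left to the interactive highlight, so the reader sees the selected trajectories first.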
The discerning reader figures out that the best visual actually does not even show up on the dashboard. Go ahead, and click on the tab called "Data" on top of the page. You now see a presentation of each country's "data" by age group and by gender. This is where you can really come up with stories for what is going on in different countries.
For example, the British have really done extremely well in reducing tobacco use. Look at how steep the declines are across the board for British men. (In most parts of the world, the prevalence of smoking is much higher among men than women.)
Bulgaria on the other hand shows a rather odd pattern. It is one of the few countries in the bumps chart that showed a climb in smoking rates, at least in the early 2000s. Here the data for men is broken down into age groups.
This chart exposes a weakness of the underlying data. The error bars indicate to us that what is being plotted is not actual data but modeled data. The error bars here are enormous. With the average at about 40% to 50% for many age groups, the confidence interval is also about 40 percentage points wide. Further, note that there were only three or four observations (purple dots) and curves are being fitted to these three or four dots, plus extrapolation outside the window of observation. The end result is that the apparent uplift in smoking in the early 2000s is probably a figment of the modeler's imagination. You'd want to understand whether there were changes in methodology around that time.
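A designer could screen for this kind of weakness before plotting. The following is a rough sketch, not the method used by the tool: the Estimate record, its field names, and both thresholds (an interval wider than 20 percentage points, or fewer than 5 observations) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    label: str     # e.g. a country and age group (hypothetical label)
    mean: float    # modeled prevalence, in percent
    lower: float   # lower end of the interval, in percent
    upper: float   # upper end of the interval, in percent
    n_obs: int     # number of actual observations behind the curve

def is_weak(est, max_width=20.0, min_obs=5):
    """Flag an estimate whose interval is too wide or whose data are too sparse."""
    too_wide = (est.upper - est.lower) > max_width
    too_sparse = est.n_obs < min_obs
    return too_wide or too_sparse

# Placeholder values shaped like the case described above:
# a mean near 45%, an interval about 40 points wide, three observations.
shaky = Estimate("group X", mean=45.0, lower=25.0, upper=65.0, n_obs=3)
solid = Estimate("group Y", mean=30.0, lower=27.0, upper=33.0, n_obs=12)

for est in (shaky, solid):
    print(est.label, "weak" if is_weak(est) else "ok")
```

Flagged series can then be greyed out, dashed, or dropped from the headline chart, rather than drawn with the same confidence as everything else.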
As a responsible designer of data graphics, you should focus less on comprehensiveness and more on highlighting the good data. I'm a firm believer in "no data is better than bad data".
Designers really do love using circular charts, don't they?
This is an interesting post. I particularly like your point about the large error bars and lack of observations undermining the certainty shown in the global analysis. This is particularly telling if you look at Kiribati, which appears to have by far the highest levels in the combined chart up until around 2005, where things tighten up a bit.
However, when looking at the data for Kiribati alone, these levels are based on no more than 3 observations, and several age groups have 0. Most of these do show very high levels of smoking, but the level of uncertainty is surprising given the combined chart.
Posted by: Jamie O'Hare | Jan 08, 2014 at 01:33 PM
Thanks so much for the constructive criticism of the tool – it will help me (and hopefully others) make better visualizations in the future. Here are a few responses:
On the sunburst (concentric circles) diagram: I agree that the “geography lesson” is a bit taxing, though we wanted to communicate results at aggregate regional levels while showing the hierarchy involved. The biggest concern that you point out is that the outer areas represent smaller numbers. This is an oddity that could be resolved with a shorter radius for outer segments (which might come at a considerable aesthetic loss).
On the line chart (“bumps”): The lines are obviously uninterpretable (you’re right, there are over 200). However, I find it useful that one can highlight them via clicking the map, or the menu below. This allows users to see time trends for countries of interest.
On uncertainty: You point out that there is considerable uncertainty in many estimates, which we’ve highlighted in the “data” and “country” tabs. In the “worldwide” tab, uncertainty is available through mouseover, though not symbolically built into any of the charts. Communicating this visually would be preferable, and I’d be interested to hear suggestions on effective ways to do that.
On “good data”: I’m torn about the idea that “no data is better than bad data”. These estimates are the only comprehensive and comparable estimates of tobacco usage patterns that are available. The models use data from adjacent countries/years to generate estimates (more info from the paper here: http://jama.jamanetwork.com/article.aspx?articleid=1812960). Should policy makers ignore these estimates in favor of using no data? I’m not sure. I think emphasizing uncertainty is a better alternative to not showing the data.
Posted by: Mike Freeman (@mf_viz) | Jan 08, 2014 at 05:41 PM
Hi Mike, always excited to hear from the designer himself! Also gives readers a chance to hear the thought process that went into designing these graphics.
While some readers will surely object, I prefer a chart with fewer lines, which means not showing some of the smallest countries. That partly takes care of the uncertainty issue because larger sample sizes lead to smaller error bars. You can also cluster the countries by the shape of their trajectories. With fewer lines on the chart, there would be more room to show the error bars.
On "no data is better than bad data", I want to clarify that I don't mean throw out the entire data set. I just mean identify the weak points in the data set and hide them rather than treat them on equal terms with everything else.
Posted by: Kaiser | Jan 09, 2014 at 01:15 AM
The classic example of this is the right-hand side of a survival curve, where there are few subjects remaining and hence high error. Truncating the axis to the region where there is meaningful data will produce something that is easier to understand.
Posted by: Ken | Jan 09, 2014 at 11:17 PM