Visualization as an analysis tool

Visualizing data has many uses. We often explore how charts can be used to convey data insights and tell stories. We talk less on this blog about how slicing and dicing data helps us form impressions about the structure of the data sets we're analyzing.

I have been digging around some payroll employment data recently. (You can find the data at the Bureau of Labor Statistics website.) I found the following two charts quite instructive.

The first one surfaces one type of recurring pattern: there is a seasonal pattern running from January to December that repeats every year. I use a small-multiples setup, with each chartlet indexed by year.

Seasonalfactor_monthly_by yeargroup

The second chart shows a different kind of regularity: there is a cyclical pattern running from 2002 to 2012, no matter which month we're looking at. Again, we have a small-multiples setup, this time with each chartlet indexed by month of the year.

Unadj_yeartoyeartrend bymonth

This second chart is a simple form of "seasonal adjustment". The data used in this plot are unadjusted. The chart shows that there is a larger cyclical pattern during the period of 2002-2012 that affects every month of the year.
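
For readers who want to try the two views themselves, here is a minimal Python sketch of the reshaping behind them. The series is synthetic (a made-up slow cycle plus a made-up seasonal wiggle), not the BLS data, and the names by_year and by_month are my own labels for illustration.

```python
from collections import defaultdict
import math

# Synthetic monthly series (NOT the BLS data): a slow cycle plus a seasonal wiggle.
series = {}
for y in range(2002, 2013):
    for m in range(1, 13):
        cycle = 100 + 5 * math.sin((y - 2002) / 10 * 2 * math.pi)
        season = 2 * math.cos((m - 1) / 12 * 2 * math.pi)
        series[(y, m)] = cycle + season

# View 1: one chartlet per year, months along the x-axis (seasonal pattern).
by_year = defaultdict(list)
# View 2: one chartlet per month, years along the x-axis (cyclical pattern).
by_month = defaultdict(list)
for (y, m), value in sorted(series.items()):
    by_year[y].append(value)
    by_month[m].append(value)

print(len(by_year), "yearly panels of", len(by_year[2002]), "points each")
print(len(by_month), "monthly panels of", len(by_month[1]), "points each")
```

Each list in by_year is one line in the first chart, and each list in by_month is one line in the second. The same numbers feed both charts; only the grouping changes.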

I already hear grumbling about using a line chart when there is no continuity from one dot to the next. In this chart, in fact, time runs left to right, top to bottom, then starts again at the first chartlet, and so on. This is a profile chart. As the name suggests, we should be focused on the shape of the line. It doesn't have to have physical meaning; we are only looking for regularity.

***

Statisticians love to find these kinds of regular patterns because they are easy to describe. Of course, most data are much messier.


A matter of compactness

Andrew Gelman may have nominated himself the graphics advisor for the World Happiness Report (link). That would be a very good thing.

To kick this off, I re-made the Figures 2.1-2.2.8 in the report, which summarized the findings of the Gallup World Poll covering annual samples of 1,000 people aged 15 and over from each of 150 countries. (These charts are effectively the first charts to appear in the report. There is no Figure 1 because Chapter 1 has no charts. The report also inexplicably follows the outdated academic-publishing convention of banishing all diagrams to the end of the report as if they were footnotes.)

In the report, they presented histograms of the 0-10 ratings (10 = happiest) by region of the world, two charts a page running to 5 pages. Here's one such page:

Whr_histogram

If you're presenting regional data, you're expecting readers to want to compare regions. It's not very nice to make them flip back and forth and task their memory in order to do these comparisons.

This data set is where small multiples show their power. Small multiples are a set of charts all sharing the same execution (type, axes, etc.) but each showing different subsets of the data. This sort of chart is designed for group comparisons, and is one of the key propositions by Edward Tufte in his classic book.

In the following junkart version, I plotted each region's histogram against the global average histogram (indicated in gray as background). The average rating in each region is indicated with the light blue vertical line. The regions are sorted from highest average happiness to lowest.

Redo2_whr_smlines

The same data now occupies only one page of the report. (A topic for a different post: does the higher average rating in N. America/Europe indicate greater happiness or grade inflation?)

***

Redo_whr_lines

Alternatively, one can stack up the line charts into a column, as shown on the right. This view is somewhat better for any pairwise comparisons. (Calling JMP developers: how do I rotate the text labels to make them horizontal?)

***

Finally, I made a chart for exploratory purposes, using a scatterplot matrix (see also this post). In this version, every pair of regions is under the microscope. Since there are 10 regions (including the global total), we have (10*9)/2 = 45 pairwise comparisons. Each of these comparisons has its own chart in the matrix, indexed by the labels on the axes.
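
The count of 45 is easy to check by enumerating the pairs. This is a small sketch in Python; the region labels are placeholders of my own, not the names used in the report.

```python
from itertools import combinations

# Placeholder labels: nine regions plus the global total (names are illustrative).
regions = [f"Region {i}" for i in range(1, 10)] + ["World"]

# Every unordered pair gets one panel in the scatterplot matrix.
pairs = list(combinations(regions, 2))
print(len(pairs))  # same as 10 * 9 / 2
```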

Each individual chart is a scatter plot of the proportions selecting a particular rating. If the histograms in region A and region B are identical, then we see 11 dots all lined up in a diagonal line going from bottom left to top right.

In addition, the pink area of the chart contains 95% of the data. So the more the pink area resembles a diagonal line, the more correlated are the histograms between the two regions being compared.
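
As a rough numeric stand-in for "how thin is the pink area", one could compute the correlation between two regions' histograms. The sketch below is not how JMP draws the pink area; it is a plain Pearson correlation, and the two sets of proportions are hypothetical, not the Gallup numbers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical proportions choosing each rating 0..10 (each list sums to 1).
region_a = [0.01, 0.02, 0.04, 0.08, 0.12, 0.23, 0.18, 0.15, 0.10, 0.04, 0.03]
region_b = [0.00, 0.01, 0.02, 0.04, 0.07, 0.13, 0.17, 0.24, 0.20, 0.08, 0.04]

# Identical histograms: all 11 dots sit on the diagonal, so the correlation is 1.
print(round(pearson(region_a, region_a), 6))
# Different histograms: dots scatter off the diagonal and the correlation drops.
print(round(pearson(region_a, region_b), 3))
```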

Redo_whr_scattermat

For example, the very top chart compares CIS with East Asia. The thinness of the pink area tells us that the histograms of happiness ratings in those two regions resemble each other. You can easily verify this finding by looking at the first two line charts shown in the column of line charts above.

By contrast, the chart comparing CIS and Europe has an expansive pink area, meaning the happiness ratings follow different distributions. This is also verified by looking at the line charts, which show that Europeans are generally happier than people in CIS. There is an "excess" of people with ratings around 6-8 in Europe compared to CIS. The dots corresponding to these ratings would appear above the diagonal.

This scatterplot matrix explores all possible comparisons on one page, but it is a lab exercise, not suitable for mass consumption because it has too much detail.

***

For those curious, the small-multiples of line charts was made using R. The column of line charts and the scatterplot matrix were created using JMP.


Is that my third leg?

We look at another idea from the visualization project "Gaps in the US Healthcare System" (link). This was a tip from reader Jordan G. (link). One of the bright points about this project is the conscious attempt to try something different although the end result is not always successful.

A tree-like branching chart was used to represent cancer death rates, broken down by racial group, gender and type of cancer, in that order.

Visualizing_cancerrates

The tree structure loses its logic after the race and gender splits. Why link different types of cancers (the gray squares) together in a sequence? Stranger still is the existence of a third branch coming out of every race node (the four closest to the center). One branch is male, the other branch is female, what's the third leg? It appears to be prostate cancer which is male only--why doesn't it go with the male branch?

It's not easy to find the connection between what's depicted here, and the idea of "gaps" in the US healthcare system. I think the question is ill-posed to begin with.  The rate of death reflects both the possible differential quality of healthcare between groups and the differential incidence of cancers between groups so no visualization tricks could be used to find reliable answers to the question being posed.

Jc_trifecta

The chart fails the first corner of the Trifecta checkup. The chart type also does not fit the data.

***

The following chart plots the same data in a Bumps style.

Redo_cancerdeathrate

I separated the male and female data since certain cancers are limited to one gender, and the gender difference is not likely to be the primary interest. The gender difference, incidentally, is clearly observed: the male death rates are generally about twice as high as the female rates of the same type of cancer, except for colorectal.

In terms of the "race gap", we find that black death rates are generally quite a bit higher than white death rates, especially for prostate cancer but except for lung cancer in females.

Asians and American Indians have practically the same death rates but in both cases the sample sizes are small.

The raw data can be found at the CDC website here.


Taking pages from Gelman

Andrew Gelman has posted a few times recently on graphics-related topics. Here are the links, and my reaction:

  • He and I both think line charts are under-valued. Some people really, really hate using line charts when the horizontal axis consists of categorical data; as I've explained repeatedly (see posts on profile charts), by drawing lines to connect these categories, all I'm doing is exposing our eye movements while reading the bar charts that are often the default option for such data.
  • Ag_militaryspend Regarding a very "ugly" chart on factors affecting military spending, Gelman wrote the following spot-on sentences:
    • Just as a lot of writing is done by people without good command of the tools of the written language, so are many graphs made by people who can only clumsily handle the tools of graphics. The problem is made worse, I believe, because I don't think the creators of the graph thought hard about what their goals were.
    • That last point is exactly why I placed at the top of the Trifecta checkup the question of figuring out what is the key question the chart is supposed to address.
  • Seems to me the above chart presents in a complicated fashion a simplistic model of military spending share: military spend = military share of GDP x GDP, therefore relative military spend increases if either relative GDP increases or relative military share of GDP increases (or both). So, in each period, all we need to know is whether the US has increased/decreased its military share of GDP relative to the rest of the world, and whether the US has increased/decreased its GDP relative to the rest of the world. End of story.
  • 201104_community_call_map_441 Some work on visually displaying telephone call data. Gelman's correspondent nominated this and another chart printed in the NYT as worst of the year. Chris Volinsky disagrees and points us to a nice article. The map shown here is definitely not close to being worst of the year. The other chart, with a lot of lines, is pretty bad - and raises the question I asked the other day: what makes a "pretty" chart?
  • Regarding the AT&T analysis, I have a few questions for the researchers: how representative is AT&T data especially at county level? do we have to worry about nonrandom missing data? Also, how should one interpret the large swath of the Midwest which had the "background color"? Is it that there weren't sufficient data or that the data showed that all of those states belong together in one super-cluster? Finally, how does a shift in the "similarity" metric change the look of the map?
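
The simple model of military spending mentioned in the list above (military spend = military share of GDP x GDP) can be sketched in a few lines of Python. All inputs here are hypothetical, and the function and variable names are mine; the point is only that the relative spend factors into two ratios.

```python
import math

def relative_spend(share_us, gdp_us, share_row, gdp_row):
    """US military spend divided by rest-of-world military spend."""
    return (share_us * gdp_us) / (share_row * gdp_row)

# Hypothetical inputs: military share of GDP (fraction) and GDP (arbitrary units).
share_us, gdp_us = 0.04, 15.0
share_row, gdp_row = 0.02, 45.0

ratio = relative_spend(share_us, gdp_us, share_row, gdp_row)
share_factor = share_us / share_row  # relative military share of GDP
gdp_factor = gdp_us / gdp_row        # relative GDP

# The relative spend is just the product of the two relative factors.
print(math.isclose(ratio, share_factor * gdp_factor))
```

So relative spend moves up if either factor moves up, which is all the original chart is saying.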

Drugged-up American graphic

Reader Chris P. found this chart on Visualizing.org, which is one of those sites that invite anyone to contribute graphics to it:

Visualizing_drug_info

It looks like the designer has taken Tufte's advice of maximizing the data-ink ratio too literally. There are many, many things going on in a tight space, which leaves the reader feeling drugged-up and cloudy.

From a cosmetic standpoint, fixing the following would help a lot:

  • Make fonts 1-2 points larger in all cases, especially the text on the left hand side
  • Use colors judiciously to stress the key data. In this version, the trends, which are more interesting, are shown in pale gray while the raw data, which are not very exciting, are shown in loud red. Just flip the gray with the red. 
  • Rethink the American flag motif: is drug abuse a uniquely American phenomenon? Should data about the American people always be accompanied by the American flag?
  • Separately present in two charts the time-series data on total arrests, and the cross-sectional data (2008)

Stars_and_drugs Also, realize that by forcing the data into the 50-star configuration, one arbitrarily decides that the data should be rounded to 2-percent buckets (see right).
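
To see what the 50-star grid does to the numbers, here is a small Python sketch. I am assuming the designer rounds to the nearest star (halves rounded up); the original does not state its rounding rule, and the function names are mine.

```python
import math

STARS = 50
BUCKET = 100 / STARS  # each star stands for 2 percentage points

def to_stars(pct):
    """Number of filled stars for a percentage, rounding halves up (assumed rule)."""
    return math.floor(pct / BUCKET + 0.5)

def displayed_pct(pct):
    """The percentage a reader would recover by counting stars."""
    return to_stars(pct) * BUCKET

for pct in (7, 33, 50):
    print(pct, "->", to_stars(pct), "stars ->", displayed_pct(pct))
```

Under this rule any odd percentage is shown one point off, which is the price of the flag.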

And always ask the fundamental question: what makes this data tick?

***

As I explored the data, I noticed various arithmetic problems. For example, the arrests by race analysis is itself split into two parts: White/black/Indian/Asian add up to 100 percent and then Hispanic Latino and Latino non Hispanic add up to 100 percent. In some surveys, Hispanics are counted within whites but that doesn't seem to be the case here. The numbers just don't add up.

Also, adding the types of drugs involved does not yield the total number of arrests. Perhaps the category of "others" has been omitted without comment. Now I closed my eyes and proceeded to make a chart out of this.

***

The new version focuses on one insight: that certain races seem to get arrested for certain drugs. The relative incidence of arrests is not similar among the races for any given drug. Asians and Native Americans appear to have higher proportions of people arrested for marijuana or meth while blacks are much more likely to be arrested for crack.

Redo_drugs

You're going to need to click on the chart for the large version to see the text.

Doing this chart gives me another chance to plug the Profile chart. We deliberately connect the categorical data with lines. The lines are not meant to mean anything; they are meant to guide our eyes towards the important features of the chart.

One can sometimes superimpose all the lines onto the same plot but the canvas clogs up quickly with more lines, and then a small-multiples presentation like this one is preferred.

We have a temptation to generalize arrest data to talk about drug habits by race but if you intend to do so, bear in mind that arrests need not correlate strongly with usage.


Worst statistical graphic nominated

Phil, over at the Gelman blog, nominates this jaw-dropping graphic as the worst of the year. I have to agree:

GR_GraficFIN-web

Should we complain about the "pie chart"/4 quadrants representation with no reference to the underlying data? Or the "pie within a pie within a pie" invention, again defiantly not to scale? Or the creative license to exaggerate the smallest numbers in the chart ($2 billion, $0.3 billion), making them disproportionate to the other pieces? Or the complete usurping of proportions (e.g. the $0.2 billion green strip on the top right quadrant compared to the $0.3 billion tiny blue arc on the top left quadrant)?

Or the random sprinkling of labels and numbers around the circle even though if one takes the time, one notices that the entire chart contains only 8 numbers, as follows:

Energysub_data

***

Instead, we can display the data with a small multiples layout showing readers how the data is structured along two dimensions.

Redo_energysub1

Or a profile chart may also work:

Redo_energysub2


Gelman joins in the fun

The great Andrew Gelman did a Junk Charts style post today, and very well indeed.

The offending Economist plot is the donut chart, which is a favorite of that magazine.  I commented on this type of chart before.

Econ_timespent

Andrew created two alternatives: one is a line chart (profile chart), which is often a better option (despite the data being categorical); the other is more creative, and the better of the two.

Redo_timespent1

Redo_timespent2

Some of Gelman's readers complained that he arbitrarily "standardized" the data by indexing against the average of the countries depicted; one can further grumble that a 50% "excess" may sound impressive but it would be equivalent to less than an hour, perhaps not as startling. These types of complaints are fair but do realize that blog posts like these are primarily concerned with how data is best visualized. If one prefers a different indexing method, or a different set of countries, or a different color for the lines, etc., one can easily revise the chart to reflect those preferences.
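
The indexing complaint is easy to make concrete. In the Python sketch below the minutes are hypothetical (not the Economist data) and the country names are placeholders; it shows how a 50% "excess" over the group average translates back into minutes.

```python
# Hypothetical minutes per day spent on one activity; not the Economist data.
minutes = {"Country A": 45, "Country B": 30, "Country C": 25, "Country D": 20}

# Index each country against the average of the countries depicted.
average = sum(minutes.values()) / len(minutes)
index = {country: m / average for country, m in minutes.items()}

# The same gap expressed two ways: percent above average, and minutes above average.
excess_pct = {country: (i - 1) * 100 for country, i in index.items()}
excess_minutes = {country: m - average for country, m in minutes.items()}

print(excess_pct["Country A"], "% above average")
print(excess_minutes["Country A"], "minutes above average")
```

Swapping in a different baseline (say, a median, or a different set of countries) only changes the line that computes average.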

The easiest way to see why the third chart is better than the first is that the strongest message coming off the first chart is that there are no material differences between these six countries in terms of time usage but in the third chart, the designer (here, it's Gelman) is asserting that there are interesting differences.


The best way to handle two dimensions may be to not use two dimensions

Guess what the designer at Nielsen wanted to tell you with this chart:

Smartphone-age-os
Reader Steven S. couldn't figure it out, and chances are neither can you.

What about...

  • The smartphone (OS) market is dominated by three top players (Android, Apple and Blackberry) each having roughly 30% share, while others split the remaining 10%.
  • The age-group mix for each competitor is similar (or are they?)

Maybe those are the messages; if so, there is no need to present a bivariate plot (the so-called "mosaic" plot, or in consulting circles, the Marimekko). Having two charts carrying one message each would accomplish the job cleanly.

***

Trying to do too much in one chart is a disease; witness the side effects.  Smartphone_sm1

The two columns, counting from the right, contain rectangles that appear to be of different sizes, and yet the data labels claim each piece represents 1%, and in some cases "< 1%".  The simultaneous manipulation of both the height and the width plays mind tricks.

Also, while one would ordinarily applaud the dropping of decimals from a chart like this, doing so actually creates the ugly problem that the five pieces of 1% (on the left column shown here) have the same width but clearly varying heights!

Smartphone_sm2 What about this section of the plot shown on the left? Does the smaller green box look like it's less than 1/3 the size of the longer green box? This chart is clearly not self-sufficient, and as such one might prefer a simple data table.

The downfall of the mosaic plot is that it gives the illusion of having two dimensions but only an illusion: in fact, the chart is dominated by one dimension, as all proportions are relative to the grand total.

For instance, the chart says that 6% of all smartphone users are between the ages of 18 and 24 AND use an Android phone. It also tells us that 2% of all smartphone users are between 35 and 44 AND use a Palm phone. Those are not two numbers anyone would desire to compare. There are hardly any practical questions that require comparing them.

Sometimes, the best way to handle two dimensions is not to use two dimensions.

***

The original article notes that "Of the three most popular smartphone operating systems, Android seems to attract more young consumers." In the chart shown below, we assume that the business question is the relative popularity of phone operating systems across age groups.

Redo_phoneos

The right metric for comparison is the market share of each OS within an age group.

For example, tracing the black line labeled "Android", this chart tells us that Android has 37% of the 18-24 market while it has about 20% of the 65 and up market. 

Android has an overall market share of about 30%, and that average obscures a youth bias that is linear with age.

On the other hand, the iPhone (green line) also has an average market share of about 30% but its profile is pretty flat in all age groups except 65 and up where it has considerable strength.

Further, the gap between Android and iPhone at the older age group actually opens up at 55 years and up. In the 55-64 age group, the iPhone holds a market share that is similar to its overall average while the Android performs quite a bit worse than its average. We note that Palm OS has some strength in the older age groups as well while the Blackberry also significantly underperforms in 65 and over.

Why aren't all these insights visible in the mosaic chart? It is all because the chosen denominator, the entire market (as opposed to each age group), makes a lot of segments very small, and then the differences between small segments become invisible when placed beside much larger segments.
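
The change of denominator is a one-step calculation. Here is a minimal Python sketch; the cell values are hypothetical shares of all smartphone users (not Nielsen's numbers), and within_group_share is a helper name I made up.

```python
# Hypothetical shares of ALL smartphone users (percent), keyed by (OS, age group).
cells = {
    ("Android", "18-24"): 6.0, ("iPhone", "18-24"): 5.0, ("Other", "18-24"): 5.0,
    ("Android", "65+"): 1.0, ("iPhone", "65+"): 2.0, ("Other", "65+"): 1.0,
}

def within_group_share(cells):
    """Re-base each cell on its age-group total instead of the grand total."""
    totals = {}
    for (_, age), pct in cells.items():
        totals[age] = totals.get(age, 0.0) + pct
    return {(os_name, age): 100 * pct / totals[age]
            for (os_name, age), pct in cells.items()}

shares = within_group_share(cells)
print(shares[("Android", "18-24")])  # share of the 18-24 market
print(shares[("Android", "65+")])    # share of the 65-and-up market
```

On the grand-total scale, the 65+ cells in this made-up table are all tiny (1 or 2 percent). Re-based on their own age group, the differences between them become as visible as those in the younger group.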

Now, the reconstituted chart gives no information about the relative sizes of the age groups. The market size for the older groups is quite a bit smaller than the younger groups. This information should be provided in a separate chart, or as a little histogram tucked under the age-group axis.


Be guided by the questions

Information graphics is one of many terms used to describe charts showing data -- and a very ambitious one at that. It promises the delivery of "information". Too often, readers are disappointed, sometimes because the "information" cannot be found on the chart, and sometimes because the "information" is resolutely hidden behind thickets.

Statistical techniques are useful to expose the hidden information. They work by getting rid of the extraneous or misleading bits of data, and by accentuating the most informative parts. A statistical graphic distinguishes itself by not showing all the raw data.

Guardian_pisa_sm Here is the Guardian's take on the OECD PISA scores that were released recently. (Perhaps some of you are playing around with this data, which I featured in the Open Call... alas, no takers so far.) I only excerpted the top part of the chart.

This graphic is not bad, could have been much worse, and I'm sure there are much worse out there.

But think about this for a moment: what question did the designer hope to address with this chart? The headline says comparing UK against other OECD countries, which is a simple objective that does not justify such a complex chart.

The most noticeable features are the line segments showing the correlation of ranks among the three subject areas within each country. So, South Korea is ranked first in reading and math, and third in science. Equally prominent is the rank of countries shown on the left-hand-side of the chart (which, on inspection, shows the ranking of reading scores); this ranking also determines the colors used, another eye-catching part of this chart. (The thick black UK line is, of course, important also.)

In my opinion, those are not the three or four most interesting questions about this data set. In such a rich data set, there could be dozens of interesting questions. I'm not arguing that we have to agree on which ones are the most prominent. I'm saying the designer should be clear in his or her own mind what questions are being answered -- prior to digging around the data.

***
With that in mind, I decided that a popular question concerns the comparison of scores between any pair of countries. From there, I worked on how to simplify the data to bring out the "information". Specifically, I used a little statistics to classify countries into 7 groups; countries within each group are judged to have performed equally well in the test and any difference could be considered statistical noise. (I will discuss how I put countries into these groups in a future post, just focusing on the chart here.)

Here is the result: (PS. Just realized the axis should be labelled "PISA Reading Score Differentials from the Reference Country Group" as they show pairwise differences, not scores.)

Redo_pisa2a

Each row uses one of the country groups as the reference level. For example, the first row shows that Finland and South Korea, the two best performing countries, did significantly better than all other country groups, except those in A2. The relative distance of each set of countries from the reference level is meaningful, and gives information about how much worse they did. 

(The standard error seems to be about 3-6 based on some table I found on the web, which may or may not be correct. This value leads to very high standardized score differentials, indicating that the spread between countries is very wide.

I have done this for the reading test only. The test scores were standardized, which is not necessary if we are only concerned about the reading test. But since I was also looking at correlations between the three subjects, I chose to standardize the scores, which is another way of saying putting them on an identical scale.)
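
For readers unfamiliar with standardizing, here is one common way to do it, sketched in Python. I am assuming a z-score (subtract the mean, divide by the standard deviation across countries); the post does not spell out the exact method, and the scores below are hypothetical, not PISA results.

```python
from statistics import mean, pstdev

# Hypothetical reading scores; not actual PISA results.
scores = {"Country A": 540, "Country B": 520, "Country C": 500,
          "Country D": 480, "Country E": 460}

values = list(scores.values())
mu = mean(values)
sigma = pstdev(values)  # population standard deviation across the countries shown

# z-score: how many standard deviations each country sits from the average.
z = {country: (s - mu) / sigma for country, s in scores.items()}

for country, value in z.items():
    print(country, round(value, 2))
```

After this step every subject has mean 0 and standard deviation 1, so reading, math and science can be laid side by side on one scale.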

Before settling on the above chart, I produced this version:

Redo_pisa2b

This post is getting too long so I'll be brief on this next point. You may wonder whether having all 7 rows is redundant. The reason why they are all there is that the pairwise differences lack "transitivity": e.g., the difference between Finland and UK is not the difference between Finland and Sweden plus the difference between Sweden and the UK. The right way to read it is to cling to the reference country group, and only look at the differences between the reference group and each of the other groups. The differences between two country groups neither of which is a reference group should be ignored in this chart: instead look up the two rows for which those countries are a reference group.

Before that, I tried a more typical network graph. It looks "sophisticated" and is much more compact but it contains less information than the previous chart, and gets murkier as the number of entities increases. Readers have to work hard to dig out the interesting bits.

Redo_pisa1


Reading this before the long weekend may save your life

Here's a chart that can save your life.

Ewg_sunblock
Here's my version (pretty much any such grouped column chart can be replaced by a line chart):

Junkcharts_sunblock

(Chart purists: I like profile charts which means I like to connect categorical data with lines.)

Anyhow, this data supposedly came from an FDA study, which the FDA has apparently now disowned, according to this AOL News report. Rats were used in this study, and the rate at which they developed significant tumors or lesions was measured. The graph illustrated a clear trend that the higher the doses of Vitamin A, the faster the rats developed cancer; this correlation was intact whether they were exposed to high or low levels of UV rays.

Notice that I switched the primary categorical axis to Vitamin A doses rather than high/low UV because the study concerned Vitamin A primarily, and levels of UV secondarily.

JC_trifecta Using the Trifecta Checkup, we can see that they have the right question, and the right data but a suboptimal chart. Also, the original chart fails the self-sufficiency test: no point in printing the data on top of the columns when there is a vertical scale.

***

How will this save your life?

Vitamin A is widely added to sunblocks -- not because it has any screening value -- but because it may slow aging of the skin. But the study found that Vitamin A actually partially nullifies the screening ability of sunblocks.

About half of the 500 most popular sunblocks sold in the U.S. contain Vitamin A and only 39 out of the 500 are deemed safe by the Environmental Working Group, which has compiled a database of these products. (There are several other potentially harmful ingredients.)

The FDA denied that such a study existed although the reporter as well as EWG have copies of it. If this study is authentic, the FDA knew about this perhaps ten years ago.


Reference: "Study: Many Sunscreens May Be Accelerating Cancer", Andrew Schneider, AOL News, May 24 2010.


PS. I should explain to my non-U.S. readers that the U.S. is celebrating Memorial Day, the beginning of summer, on Monday so lots of people are going to beaches and other vacations.