More power brings more responsibility

Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)

Mlbsalaries

This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.

Sorting the bars by total salary would be a start.

The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.

Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.

***

This is the standard stacked bar chart showing the distribution of salary cap usage by team:

Tableau_mlbsalaries

 I have never understood the appeal of stacking data. It's not easy to compare the middle segments.

After quite a bit of work, I arrived at the following:

Redo_mlbsalaries

The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto; these teams spread the wealth among the D, F, and M players while not spending much on the goalie and "others". On the other hand, Groups 2 and 3, and especially Group 3, allocate 30-45% of the cap to the midfield.
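
For concreteness, here is a minimal sketch of the grouping step in Python. The post doesn't say how the five clusters were formed, so k-means is an assumption on my part, and the teams, positions, and dollar figures below are made up for illustration.

```python
# A minimal sketch of the grouping step, assuming k-means on the cap proportions.
# Teams, positions, and dollar figures are made up for illustration.
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical cap spend (in millions) by team and position group.
cap = pd.DataFrame(
    {"D": [0.9, 1.1, 0.7], "F": [1.0, 1.3, 0.6], "M": [1.2, 1.1, 1.8],
     "GK": [0.3, 0.4, 0.2], "Other": [0.2, 0.1, 0.3]},
    index=["CHI", "LA", "DAL"],
)

# Convert dollars to each position's share of the team's total cap spend.
props = cap.div(cap.sum(axis=1), axis=0)

# Cluster teams with similar allocation profiles.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(props)
print(pd.Series(labels, index=props.index, name="cluster"))
```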

Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (the "others" are mostly hyphenated positions like D-F, F-M, etc.). DAL and VAN spend a lot less on midfield players than the other teams. VAN spends a lot on defense.

My version has far fewer data points (although the underlying data set is the same), but it's easier to interpret.

***

I tried various other chart types, including bar charts and even pie charts, but I still like the profile (line) charts best.

Redo_mlbsalaries_bar

In modern software (I'm using JMP's Graph Builder here), it takes only one click to go from line to bar, and another click to go to pie.

Redo_mlbsalaries_pie

 


Bad charts can happen to good people

I shouldn't be surprised by this. No sooner did I sing the praises of Significance magazine (link) than a reader sent me to some charts that are not up to its standard.

Here is one such chart (link):

Sig_ukuni1
Quite a few problems crop up here. The most hurtful is that the context of the chart is left to the text. If you read the paragraph above it, you'll learn that the data represent only a select group of institutions known as the Russell Group; in particular, Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision, as the designer weighed one missing year against one missing institution (and a mighty important one at that). The issue is easily fixed by a few choice words.

You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". This chart draws our attention to the tangle of up and down segments, giving us the impression that the data is too complicated to extract a clear message.

The decision to use 21 colors for 21 schools is baffling, as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is needing more than, say, three or four colors.

The order of institutions in the legend is approximately the reverse of their order of appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the legend entries.

If the whitespace were removed (I'm talking about the space between 0% and 2.25% and between 8% and 10%), the lines could be more spread out, and labels could perhaps be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.

The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart. The schools are ordered by their "inclusiveness" from left to right.

Redo_hesa

This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).

In addition, I see a negative reputation effect, with the proportion of students from low-participation areas decreasing with increasing reputation. I'm basing this on name recognition. Perhaps UK readers can confirm if this is correct. If correct, it's a big miss in terms of interesting features in this dataset.

 

 


The state of charting software

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, and compare the output of the various programs.

I'll leave you to decide whether the programs he created are easier than Excel.

***

Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found at the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that into "months from start of recession" is not straightforward. If we don't want to hard-code the details, that is, if we want the definition of a recession to stay flexible and the application to be more general, the challenge is greater.
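
To make that transformation concrete, here is a rough sketch of the alignment step in Python (pandas), with made-up numbers. Note that the recession start dates are not part of the employment data and have to be supplied by hand, which is precisely the extra work described above.

```python
# A sketch of the alignment step: given a monthly employment series and a list
# of recession start dates (both made up here), re-index each recession's data
# as "months from start of recession".
import pandas as pd

# Hypothetical monthly employment levels, indexed by calendar month.
emp = pd.Series(
    range(100, 220),
    index=pd.period_range("2001-01", periods=120, freq="M"),
)

# Recession definitions are not part of the BLS data; they have to be supplied.
recession_starts = [pd.Period("2001-03", freq="M"), pd.Period("2007-12", freq="M")]

aligned = {}
for start in recession_starts:
    series = emp[emp.index >= start].reset_index(drop=True)
    aligned[str(start)] = series  # index is now 0, 1, 2, ... months from start

# One column per recession, lined up on month 0; lengths are uneven.
aligned = pd.DataFrame(aligned)
print(aligned.head())
```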

***

Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for later years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have, say, 25 values while others have only 10. Most software packages can handle this, but more code needs to be written, either during data processing or during plotting.
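
Here is a rough sketch, in Python (pandas), of that truncation step, which mimics the blank Excel cells mentioned below; the percentage-loss numbers are made up.

```python
# A sketch of the truncation step: once a series climbs back to 0% job losses,
# the remaining months are blanked out (NaN). The numbers are made up.
import numpy as np
import pandas as pd

loss = pd.DataFrame({
    "1990": [0.0, -0.8, -1.4, -1.1, -0.3, 0.1, 0.4, 0.6],
    "2007": [0.0, -1.5, -4.0, -6.1, -5.8, -5.2, -4.7, -4.3],
})  # percent job losses by month from start of recession

for col in loss.columns:
    recovered = loss[col] >= 0
    recovered.iloc[0] = False          # ignore month 0, which is 0% by definition
    if recovered.any():
        first = recovered.idxmax()     # first month back at or above 0% loss
        loss.loc[loss.index > first, col] = np.nan  # drop everything after recovery

print(loss)
```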

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.

***

In the last section, Andrew checked how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement over when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of treating the really short 1980 recession as separate from that of 1981, the straight lines match superbly.)

  Wheeler_JunkChallenge4

***

I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.


Which software is responsible for this?

@guitarzan wants us to see this chart from north of the border, and read the comments. Please hold your nose first.

Cbc-drinking-graph

 

Here's one insightful comment: "I think it's insane to debate the ages 18 or 19. Why not cap it off at the much more rounded and sensible numbers 18.2 or 19.4??"

Reminds me of signs that say this elevator holds 13 people, or this auditorium holds 147 people safely.

***

I mean, which software package enables this chart?

For the vertical axis, it appears that the major gridlines are set at intervals of 0.4, with minor gridlines 0.2 apart. The lower limit of the vertical axis was specifically set to 17, which violates the start-at-zero rule for bar charts.

The software also allows the axis labels to be printed twice, once in a super tiny font in the expected locations, and again turned sideways and printed inside the bars.

And Canadians, please tell us why the provinces were ordered in this way.

***

This data calls for a simple map, with two colors.

 


Remaking a great chart

One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:

JobLossesJan2013

I think a lot of readers have seen this one. It's a very effective chart.

The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter then needs to be set up for each segment, resetting to zero at the start of each new recession. All this creates the effect of time-shifting.

And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look at the prior cycle, extract the peak employment level, and use it as the base for computing the percentages being plotted.
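
For concreteness, here is a rough sketch of just the percent-from-peak step in Python (pandas). The employment numbers and recession date are made up, and the real work of locating peaks across many cycles is not shown.

```python
# A sketch of the "% job losses relative to the prior peak" computation.
# Employment levels and the recession date are illustrative, not the actual
# BLS series or NBER dates.
import pandas as pd

emp = pd.Series(
    [130.0, 131.2, 132.0, 131.5, 129.8, 128.9, 129.4, 130.2, 131.0, 132.5],
    index=pd.period_range("2007-06", periods=10, freq="M"),
)  # payroll employment, millions

recession_start = pd.Period("2007-12", freq="M")

# Peak employment at or before the start of the recession.
peak = emp[emp.index <= recession_start].max()

# Percent job losses relative to that peak, for months from the start onward.
pct_loss = 100 * (emp[emp.index >= recession_start] / peak - 1)
print(pct_loss.round(2))
```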

One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.

***

Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the page is printed in grayscale. So I asked CR for his data, and remade the chart like this:

FIGURE 5A-1

You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V shape with two linear segments: the left arm shows the average rate of decline leading to the bottom of the recession, while the right arm shows the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jittering is just noise.
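
Here is a rough sketch of that two-segment smoothing in Python (numpy), applied to a made-up series of percentage job losses; it assumes the trough does not occur in month 0.

```python
# A sketch of the two-segment smoothing: each series is replaced by a straight
# line from 0% down to its trough, then a straight line from the trough back up,
# using the average monthly rates over each arm. The data is made up.
import numpy as np

loss = np.array([0.0, -1.2, -2.6, -4.1, -5.0, -4.4, -3.1, -1.8, -0.6, 0.1])
months = np.arange(len(loss))

trough = loss.argmin()                     # month of maximum job losses (assumed > 0)
down_rate = loss[trough] / trough          # average decline per month (negative)
up_rate = (loss[-1] - loss[trough]) / (len(loss) - 1 - trough)  # average recovery per month

smoothed = np.where(
    months <= trough,
    down_rate * months,                          # left arm: straight decline
    loss[trough] + up_rate * (months - trough),  # right arm: straight recovery
)
print(np.round(smoothed, 2))
```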

I also chose a small-multiples layout to separate the curves into groups by decade. When you only have one color, you can't plot ten lines on top of one another.

One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.

(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)


The coming commoditization of infographics

An email lay in my inbox with the tantalizing subject line: "How to Create Good Infographics Quickly and Cheaply?" It's a half-spam from one of the marketing sites that I signed up for a long time ago. I clicked on the link, which led me to a landing page that required yet another click to get to the real thing (link). (Now you wonder why marketers keep putting things in your inbox!)

Easelly_walkway

The article was surprisingly sane. The author, Carrie Hill, suggests that the first thing to do is to ask "who cares?" This is the top corner of my Trifecta Checkup, which asks what the point of the chart is. Some of us not so secretly hope that the answer to "who cares?" is no one.

Carrie then lists a number of resources for creating infographics "quickly and cheaply".

Easel.ly caught my eye. This website offers templates for creating infographics. You want time-series data depicted as a long, hard road ahead? You have the template shown on the right.

You want several sections of multi-colored bubble charts? You have this theme:

Easelly_angel

 

In total, they have 15 ready-made templates that you can use to make infographics. I assume paid customers will have more.

infogr.am is another site with similar capabilities, apparently aimed at those with some data in hand.

***

Based on this evidence, the avalanche of infographics is not about to pass. In fact, we are going to see the same styles over and over. It's like looking at someone's PowerPoint presentation and realizing that they are using the "Advantage" theme (one of the less ugly themes loaded by default). In the same way, we will have a long, winding road of civil rights, a long, winding road of Argentina's economy, a long, winding road of Moore's Law, and so on.

But I have long been an advocate of drag-and-drop interfaces for producing statistical charts. So I hope the vendors out there learn from these websites and make their products ten times better, so that it is as "quick and cheap" to make nice statistical charts as it is to make infographics.

 


Look what I found: two amazing charts

While doing some research for my statistics blog, I came across a beauty by Lane Kenworthy from almost a year ago (link) via this post by John Schmitt (link).

How embarrassing is the cost effectiveness of U.S. health care spending?

Americasinefficienthealthcaresystem-figure1-version2

When a chart is executed well, no further words are necessary.

I'd only add that the other countries depicted are "wealthy nations".

***

Even more impressive is this next chart, which plots the evolution of cost effectiveness over time. An important point to note is that the U.S. started out in 1970 in a position similar to the other nations.

Americasinefficienthealthcaresystem-figure2-version5

Let's appreciate this beauty:

  • Let the data speak for itself. Time goes from bottom left to upper right. As more money is spent, life expectancy goes up. However, the slope of the line is much smaller for the US than for the other countries. There is no need to add colors, data labels, interactivity, animation, etc.
  • Recognize what's important and what's not. The US line is in a different color, much thicker, and properly placed in the foreground of the chart.
  • Rather than clutter up the chart, the other 19 lines are anonymized. They all have the same color and thickness, and are all given one aggregate label. This is an example of overcoming loss aversion (see this post for more): it is ok to suppress some of the data.
  • The axis labeling is superb. Tufte preaches this clean style. There is no need to use regularly-spaced axis labels... use data-informed labels. Unfortunately, software is way behind on this issue. You can do this in R but that's about it. (A sketch of the idea follows this list.)
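
For what it's worth, here is a rough sketch of what data-informed labels mean, using Python (matplotlib) as a stand-in; the spending and life-expectancy numbers are made up.

```python
# A sketch of "data-informed" axis labels: ticks are placed at values taken
# from the data rather than at regular intervals. All numbers are made up.
import matplotlib.pyplot as plt

spend = [2000, 3500, 4200, 5000, 8500]   # hypothetical per-capita spending
life = [77.1, 79.4, 80.2, 81.0, 78.7]    # hypothetical life expectancy

fig, ax = plt.subplots()
ax.plot(spend, life, marker="o")

# Label only the values that matter: the minimum, the outlier, and the maximum.
ax.set_yticks([min(life), 78.7, max(life)])
ax.set_xticks([min(spend), max(spend)])
ax.set_xlabel("Health spending per capita")
ax.set_ylabel("Life expectancy")
plt.show()
```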

 


The "data" corner of the Trifecta

Trifecta

In the JunkCharts Trifecta checkup, we reserve a corner for "data". The data used in a chart must be in harmony with the question being addressed, as well as the chart type being selected. When people think about data, they often think of cleaning the data and processing the data, but what comes before that is collecting the data -- specifically, collecting data that directly address the question at hand.

Our previous post on the smartphone app crashes focused on why the data was not trustworthy. The same problem plagues this "spider chart", submitted by Marcus R. (link to chart here)

Qlikview_Performance

Despite the title, it is impossible to tell how QlikView is "first" among these brands. In fact, with several shades of blue, I find it hard to even figure out which part refers to QlikView.

The (radial) axis is also a great mystery because it has labels (0, 0.5, 1, 1.5). I have never seen surveys with such a scale.

The symmetry of this chart is its downfall. These "business intelligence" software packages are ranked along 10 dimensions. There may not be a single decision-maker who would assign equal weight to each of these criteria. It's hard to imagine that "project length" is as important as "product quality", for example.

Take one step back. This data came from respondents to a survey (link). There is very little information about the composition of the respondents. Were they asked to rate all 10 products along the 10 dimensions? Did they only rate the products they are familiar with? Or only the products they actively use? If the latter, how are responses for different products calibrated so that a 1 rating from QlikView users equals a 1 rating from MicroStrategy users? Given that each of these products has broad but not completely overlapping coverage, and users typically deploy only a part of the solution, how does the analysis address the selection bias?

***

The "spider chart" is, unfortunately, most often associated with Florence Nightingale, who created the following chart:

Nightingale

This chart isn't my cup of tea either.

***

Also note that the spider chart has so much over-plotting that it is impossible to retrieve the underlying data.

 

 


Want a signed book?

JMP is giving away signed copies of Numbers Rule Your World.  See details here.

JMP is a great piece of software for those who like to point and click, drag things around, and interactively build models. People I hire who are analytical but don't have proper statistical training seem to enjoy using it and produce good work with it. There are other similar packages on the market; I haven't tried them, so I don't know whether they are better or worse, but I can say I have had a pleasant time with JMP.

***

Speaking of which, if you haven't already, do subscribe to my sister blog, where I discuss the statistical thinking behind everything that's happening around us.

The RSS feed: here. The Twitter feed combines the two blogs.

 


Quick fix for word clouds: area not length

Here is one of my several suggestions for word-cloud design: encode the data in the area enclosed by each word, not the length of each word.

Every word cloud out there contains a distortion because readers are sizing up the areas, not the lengths of words. The extent of the distortion is illustrated in this example:

Wordcloud_distortion

The word "promise" is about 3.5 times the size of "McCain" but the ratio of frequency of occurrence is only 1.6 times.

This is a quick fix that Wordle and other word cloud software can implement right away. There are other more advanced issues I bring up in my presentation (see here).