What's in a cronut? Let me find out

Analyticsseo_ga

Reader Ross S. did not join the line for this cronut, which illustrates the popularity of different makers of tracking software on 1.3 million websites.

Original by Analytics SEO is here.

***

The biggest beef I have with this cronut is the quality of the data. As I read their description of the underlying data, I see several red flags.

The analysis is hobbled by ignoring the competitive landscape in tracking software. Google Analytics carves out a huge share of the market by virtue of offering a richly featured product for free. (They justify this by establishing a gigantic spying operation on unsuspecting users.) However, industry insiders know that Omniture (owned by Adobe) is the heavyweight enterprise solution, with a complete feature set.

In other words, most of the 670,000 "customers" of Google Analytics are tiny websites; in addition, many large websites maintain Google Analytics alongside Omniture since the former is free. It would be great if the researcher gave us one of two alternative views of market share: the share of revenues in the tracking-software market, or the share of e-commerce revenues represented by the customers of each vendor. These two views give a fuller picture of the competitive landscape.

You'll notice this is the same game Google is playing in the mobile universe. Android has the most users but Apple makes the bulk of revenues.

***

The SEO agency says the chart is "based on 1.3 million e-commerce websites in May 2013". Are there really 1.3 million websites out there selling us stuff? How do they define e-commerce? Is NYTimes.com an e-commerce website, for example? Or facebook.com for that matter?

In the summary, they made a pretty startling claim--that "a large number of websites have no tracking software at all". The only problem is that readers can't find out what proportion of websites don't track users. The data in the cronut excluded sites without tracking, which is a big problem.

***

Here is the link to the annual Top 500 Retailers report by Internet Retailer magazine. In Sep 2011, they found that 217 out of the top 500 use Omniture, 161 use Google Analytics, and 103 use Coremetrics (now owned by IBM).

Another place to look for corroborating evidence is Google Trends, which measures the popularity of search keywords. The relative order of the major vendors (excluding Google Analytics) does not match well with the data shown by Analytics SEO.

Googletrends_on_tracking

Compared to:

Analyticsseo_gatabletop

Coremetrics is way down in the list compiled by Analytics SEO.


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.

***

He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of an Excel chart made with the default settings. The blue-red-green color scheme is the most telling.

Schwabish_bls1


Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.

Redo_schwabishbls1

The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and the use of lines should not be limited to so-called continuous data.

Redo_schwabishbls2

Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.

***

I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.



Stutter steps, and functional legends

Dona Wong asked me to comment on a project by the New York Fed visualizing funding and expenditure at NY and NJ schools. The link to the charts is here. You have to click through to see the animation.

Nyfed_funding

Here are my comments:

  • I like the "Takeaways" section up front, which uses words to tell readers what to look for in the charts to follow.
  • I like the stutter steps that are inserted into the animation. This gives me time to process the data. The point of these dynamic maps is to showcase the changes in the data over time.
  • I really, really want to click on the green boxes (the legend) and have the corresponding school districts highlighted. In other words, turning the legend into something functional. Tool developers, please take notes!
  • The other options on the map are federal, state and local shares of funding, given in proportions. These are controlled by the three buttons above. This is a design decision that privileges showing how federal funds are distributed across districts and across time. The tradeoff is that it's harder to comprehend the mix of sources of funds within each district over time.
  • I usually like to flip back and forth between actual values and relative values. I find that both perspectives provide information. Here, I'd like to see dollars and proportions.

I also find the line charts to be much clearer but the maps are more engaging. Here is an example of the line chart: (the blue dashed line is the New York state average)

Nyfed_linechart

After looking at these charts, I also want to see a bivariate analysis. How are funding per student and expenditure per student related?

Do you have any feedback for Dona?


Dampened by Google

Robert Kosara has a great summary of the "banking to 45 degrees" practice first proposed by Bill Cleveland (link). Roughly speaking, the idea is that the slope of a line chart should be close to 45 degrees for the best perception. It's not a rule that you see much on Junk Charts because it's one of those rules about which I don't hold a strong opinion.

Here are the examples given by Kosara:

Eager_eyes_aspect-ratios
The same data is presented three ways. The slope is a reflection of the scales used on the two axes.
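A common simplification of Cleveland's idea is median-absolute-slope banking: pick the aspect ratio so that the typical line segment is drawn at about 45 degrees. Here is a minimal sketch of that heuristic in plain Python; the data series is made up for illustration, and this is only one of several banking methods Cleveland studied:

```python
# Cleveland's "banking to 45 degrees", median-slope flavor: pick a
# chart aspect ratio (height/width) so that the median absolute slope
# of the drawn line segments is 1, i.e. the typical segment sits at
# roughly 45 degrees. Minimal sketch; the series below is made up.

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def banked_aspect_ratio(x, y):
    """Return height/width so the median |slope| plots at 45 degrees."""
    x_range = max(x) - min(x)
    y_range = max(y) - min(y)
    # slopes in data units, normalized to the unit square
    slopes = [abs((y1 - y0) / (x1 - x0)) * (x_range / y_range)
              for (x0, x1), (y0, y1) in zip(zip(x, x[1:]), zip(y, y[1:]))]
    # stretching the height by a factor r multiplies every drawn slope
    # by r, so r = 1 / (median slope) banks the typical segment to 45
    return 1 / median(slopes)

x = list(range(10))
y = [0, 3, 1, 4, 2, 6, 5, 9, 7, 10]
r = banked_aspect_ratio(x, y)
```

A ratio below 1 calls for a short, wide chart; a ratio above 1 for a tall, narrow one. The point of the heuristic is that neither extreme in Kosara's trio is "correct" a priori.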

***
Well, I lied when I said I didn't care. Look at this particular chart below:

Redo_aspectratio
Some of you may recognize this style... I'm imitating Google Analytics charts. Several of the other Web charting tools also seem to come up with gems like this. Pretty much every chart you see in the Google Analytics interface looks like a flat line. The chart above looks like nothing more than noisy data from week to week.

But then look at the scale! The leftmost part of the line is a rise over two weeks. The actual rise was 50% or 300,000, i.e. an earth-shattering change.

If you use Google Analytics, you are better off downloading the data to Excel and drawing your own charts.


More power brings more responsibility

Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)

Mlbsalaries

This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.

Sorting the bars by total salary would be a start.

The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.

Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.

***

This is the standard stacked bar chart showing the distribution of salary cap usage by team:

Tableau_mlbsalaries

I have never understood the appeal of stacking data. It's not easy to compare the middle segments.

After quite a bit of work, I arrived at the following:

Redo_mlbsalaries

The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.

Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.). DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.

My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
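The grouping step can be done with a simple clustering routine on each team's cap-share profile. A tiny k-means sketch in plain Python shows the idea; the team labels and shares below are hypothetical, not the actual MLS data:

```python
# Grouping teams by how they split the salary cap: each team becomes a
# vector of cap shares by position, and similar profiles are clustered
# together. Minimal k-means sketch in plain Python; the teams and
# shares below are made up, not the actual MLS numbers.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50):
    centroids = [list(p) for p in points[:k]]  # seed with first k points
    assign = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centroid
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

# cap shares: (defense, forward, midfield, goalie/other) -- hypothetical
teams = {
    "AAA": (0.30, 0.30, 0.30, 0.10),
    "BBB": (0.28, 0.32, 0.30, 0.10),
    "CCC": (0.15, 0.25, 0.45, 0.15),
    "DDD": (0.14, 0.26, 0.44, 0.16),
}
labels = kmeans(list(teams.values()), k=2)
```

With real data you would pick k by eye or by a fit statistic; the five MLS groups plus three singleton teams suggest the cluster structure there is quite uneven.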

***

I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.

Redo_mlbsalaries_bar

In modern software (I'm using JMP's Graph Builder here), it takes only one click to go from line to bar, and one more to go to pie.

Redo_mlbsalaries_pie



Bad charts can happen to good people

I shouldn't be surprised by this. No sooner did I sing the praises of Significance magazine (link) than a reader sent me some charts that are not up to its standard.

Here is one such chart (link):

Sig_ukuni1
Quite a few problems crop up here. The most hurtful is that the context of the chart is left to the text. If you read the paragraph above, you'll learn that the data represents only a select group of institutions known as the Russell Group; and in particular, Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision as the designer weighs one missing year against one missing institution (and a mighty important one at that). This issue is easily fixed by a few choice words.

You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". This chart draws our attention to the tangle of up and down segments, giving us the impression that the data is too complicated to extract a clear message.

The decision to use 21 colors for 21 schools is baffling, as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is needing more than, say, three or four colors.

The order of institutions listed in the legend is approximately the reverse of their order of appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the legend entries to match.
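That sorting is easy to automate: order the legend entries by each series' final value, which matches the vertical order of the lines at the right edge of the chart. A minimal sketch in Python, with made-up school names and values:

```python
# Sorting legend entries to match the vertical order of the lines at
# the right edge of the chart: sort series by their last value,
# descending. The school names and numbers below are made up.

series = {
    "School A": [2.1, 2.4, 2.2, 2.8],
    "School B": [5.0, 4.8, 5.2, 5.5],
    "School C": [3.3, 3.1, 3.6, 3.4],
}

legend_order = sorted(series, key=lambda name: series[name][-1],
                      reverse=True)
```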

If the whitespace were removed (I'm talking about the space between 0% and 2.25% and between 8% and 10%), the lines could be more spread out, and perhaps labels could be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.

The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart. The schools are ordered by their "inclusiveness" from left to right.

Redo_hesa

This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).

In addition, I see a negative reputation effect, with the proportion of students from low-participation areas decreasing with increasing reputation. I'm basing this on name recognition. Perhaps UK readers can confirm if this is correct. If correct, it's a big miss in terms of interesting features in this dataset.



The state of charting software

Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm, than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."

Andrew disagreed, saying "anyone savvy with a statistical package would call bs". He goes on to take the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then make the Junk Charts version of the chart.

I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, and the output of the various programs.

I'll leave you to decide whether the programs he created are easier than Excel.

***

Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.

What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting and ending month of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to hard-code the details, that is, if we allow the definition of a recession to be flexible and make this a more general application, the challenge is even greater.
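The re-indexing step can be sketched as follows, assuming the recession start dates are supplied as a separate list; the employment numbers here are made up, not BLS figures:

```python
# Re-indexing a calendar-dated series into "months since the start of
# each recession". Recession start dates are supplied separately as
# (year, month) tuples; the employment levels are made up.

def month_index(year, month):
    return year * 12 + (month - 1)

def align_to_recessions(employment, recession_starts, horizon=24):
    """employment: {(year, month): level}. Returns one time-shifted
    series per recession: {start: [level at month 0, 1, 2, ...]}."""
    aligned = {}
    for start in recession_starts:
        base = month_index(*start)
        series = []
        for t in range(horizon):
            y, m = divmod(base + t, 12)
            level = employment.get((y, m + 1))
            if level is None:   # ran off the end of the data
                break
            series.append(level)
        aligned[start] = series
    return aligned

employment = {(2001, m): 100 - m for m in range(1, 13)}
aligned = align_to_recessions(employment, [(2001, 3)])
```

Note that the recession definitions live outside the loop, so swapping in a different set of start dates (the flexibility discussed above) costs nothing.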

***

Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.

By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
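In code, the equivalent of leaving cells blank is to truncate each series at its first return to zero. A minimal Python sketch, with a made-up series of percentage job losses:

```python
# The Excel blank-cell trick, in code: cut each series at the point
# where job losses first return to zero (full recovery) and drop the
# rest. The percentage series below is made up for illustration.

def truncate_at_recovery(losses):
    """losses: per-month % change from peak (0 at the start, negative
    during the downturn). Keep values up to and including the first
    return to zero or above after the start."""
    out = [losses[0]]
    for v in losses[1:]:
        out.append(v)
        if v >= 0:
            break
    return out

series = [0.0, -1.2, -2.5, -1.8, -0.6, 0.1, 0.4, 0.9]
trimmed = truncate_at_recovery(series)
```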

***

In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data sources as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)

Wheeler_JunkChallenge4

***

I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.


Which software is responsible for this?

@guitarzan wants us to see this chart from north of the border, and read the comments. Please hold your nose first.

Cbc-drinking-graph


Here's one insightful comment: "I think it's insane to debate the ages 18 or 19. Why not cap it off at the much more rounded and sensible numbers 18.2 or 19.4??"

Reminds me of signs that say this elevator holds 13 people, or this auditorium holds 147 people safely.

***

I mean, which software package enables this chart?

For the vertical axis, the major gridlines appear to be spaced 0.4 apart, with minor gridlines every 0.2. The lower limit of the vertical axis was deliberately set to 17, which violates the start-at-zero rule for bar charts.
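A quick back-of-the-envelope check shows how much damage the baseline of 17 does. The ages 18 and 19 are from the chart; the arithmetic is mine:

```python
# How much a bar-chart baseline of 17 exaggerates the difference
# between drinking ages of 18 and 19.

baseline = 17
age_lo, age_hi = 18, 19

true_ratio = age_hi / age_lo                                # ~1.06x
apparent_ratio = (age_hi - baseline) / (age_lo - baseline)  # 2x
```

The bars make a 6% difference look like a doubling, which is the whole case for starting bar charts at zero.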

The software also allows the x-axis labels to be printed twice: once in a super-tiny font in the expected locations, and once turned sideways and printed inside the bars.

And Canadians, please tell us why the provinces were ordered in this way.

***

This data calls for a simple map, with two colors.



Remaking a great chart

One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:

JobLossesJan2013

I think a lot of readers have seen this one. It's a very effective chart.

The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter then needs to be set up for each segment, resetting to zero at the start of each new recession. All this creates the time-shifting effect.

And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look at the prior cycle and extract the peak employment level, which is then used as the base for computing the percentages being plotted.
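That computation can be sketched in a few lines; the employment levels below are made up for illustration, not the real series:

```python
# The vertical axis of the Calculated Risk chart: for each recession,
# job losses as a percentage of the employment peak reached before the
# downturn. Minimal sketch with made-up monthly levels.

def percent_from_prior_peak(levels, recession_start):
    """levels: monthly employment; recession_start: index where the
    recession begins. The base is the highest level seen up to and
    including that month."""
    peak = max(levels[:recession_start + 1])
    return [100 * (v - peak) / peak for v in levels[recession_start:]]

levels = [98, 100, 99, 96, 94, 95, 97, 100, 101]
losses = percent_from_prior_peak(levels, recession_start=1)
```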

One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.

***

Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the printed page is grayscale. So I asked CR for his data, and re-made the chart like this:

FIGURE 5A-1

You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V shape with two linear segments: the left arm showing the average rate of decline leading to the bottom of the recession, and the right arm showing the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jitter is just noise.
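The smoothing amounts to connecting the start to the trough with one straight segment, and the trough to the end with another. A minimal Python sketch, using a made-up loss series rather than the actual data:

```python
# The V-shaped smoothing described above: replace each jobs curve with
# two straight arms, one from the start to the trough (average rate of
# decline) and one from the trough to the end (average rate of
# growth). The loss series below is made up for illustration.

def v_smooth(losses):
    """losses: % change from peak by month, starting at 0."""
    trough = min(range(len(losses)), key=lambda i: losses[i])
    n = len(losses) - 1
    down_rate = losses[trough] / trough                  # % per month
    up_rate = (losses[n] - losses[trough]) / (n - trough)
    return [down_rate * t if t <= trough
            else losses[trough] + up_rate * (t - trough)
            for t in range(len(losses))]

losses = [0.0, -1.0, -2.6, -3.0, -2.1, -1.2, -0.1]
smooth = v_smooth(losses)
```

Each arm passes exactly through the start, the trough, and the endpoint, which is all the original chart's wiggly lines really communicate.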

I also chose a small-multiples layout to separate the curves into groups by decade. When you only have one color, you can't plot ten lines on top of one another.

One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.

(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)


The coming commodization of infographics

An email lay in my inbox with the tantalizing subject line: "How to Create Good Infographics Quickly and Cheaply?" It's a half-spam from one of the marketing sites I signed up for a long time ago. I clicked on the link, which led me to a landing page that required yet another click to get to the real thing (link). (Now you wonder why marketers keep putting things in your inbox!)

Easelly_walkway

The article was surprisingly sane. The author, Carrie Hill, suggests that the first thing to do is to ask "who cares?" This is the top corner of my Trifecta Checkup, asking what's the point of the chart. Some of us not-so-secretly hope that the answer to "who cares?" is no one.

Carrie then lists a number of resources for creating infographics "quickly and cheaply".

Easel.ly caught my eye. This website offers templates for creating infographics. You want time-series data depicted as a long, hard road ahead, you have this on the right.

You want several sections of multi-colored bubble charts, you have this theme:

Easelly_angel


In total, they have 15 ready-made templates that you can use to make infographics. I assume paid customers will have more.

infogr.am is another site with similar capabilities, and apparently for those with some data in hand.

***

Based on this evidence, the avalanche of infographics is not about to pass. In fact, we are going to see the same styles over and over. It's like looking at someone's PowerPoint presentation and realizing that they are using the "Advantage" theme (one of the less ugly default themes). In the same way, we will have a long, winding road of civil rights, and a long, winding road of Argentina's economy, and a long, winding road of Moore's Law, etc.

But I have long been an advocate of drag-and-drop interfaces for producing statistical charts. So I hope the vendors out there learn from these websites and make their products ten times better, so that it is as "quick and cheap" to make nice statistical charts as it is to make infographics.