It's your fault when you use defaults

The following chart showed up on my Twitter feed last week. It's a cautionary tale for using software defaults.

Booksaleschart_sourceBISG_fromtwitter

 At first glance, the stacking of years in a bar chart makes little sense. This is particularly so when there appears not to be any interesting annual trend: the four segments seem to have roughly equal length almost everywhere.

This designer might be suffering from what I have called "loss aversion" (link). Loss aversion in data visualization is the fear of losing your data, which causes people to cling on to every little bit of data they have.

Several challenges of the chart come from the software defaults. The bars are ordered alphabetically, making it difficult to discern a trend. The horizontal axis labels are given in single dollars and units, and yet the intention of the designer is to use millions, as indicated in the chart titles.

The one horrifying feature of this chart is the 3D effect. The third dimension contains no information at all. In fact, it destroys information, as readers who use the vertical gridlines to estimate the lengths of the bars will be sadly misled. As shown below, readers must draw imaginary lines to figure out the horizontal values.

Twitter_booksalescategories_0

The question this chart addresses is the distribution of book sales (revenues and units) across different genres. In choosing to stack the bars (i.e. sum the yearly data), the designer decided that the details of specific years are not as important as the total. This is the right conclusion since the bar segments have similar lengths within each genre.

So let's pursue the solution of averaging the data, plotting average yearly sales.

Redo_twitter_bookssalescategories

This chart shows that there are two major types of genres. In the education world, the unit prices of (text)books are very high and unit sales are relatively small, but in aggregate, the dollar revenues are high. In the "adult" world, whether it's fiction or non-fiction, the unit price is low while the number of units is high, which results in total dollar revenues similar to those of the education genres.
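The averaging, rescaling and reordering steps discussed in this post can be sketched in a few lines of Python. This is only an illustration: the genre names and dollar figures are made-up placeholders, not the BISG data, and `average_in_millions` is a hypothetical helper name.

```python
from statistics import mean

# Placeholder yearly revenues in single dollars (NOT the BISG figures).
yearly_revenue = {
    "Genre A": [100_000_000, 120_000_000, 110_000_000, 130_000_000],
    "Genre B": [400_000_000, 410_000_000, 390_000_000, 400_000_000],
    "Genre C": [250_000_000, 240_000_000, 260_000_000, 250_000_000],
}


def average_in_millions(series_by_genre):
    """Average each genre's yearly values, express them in millions,
    and sort from largest to smallest instead of alphabetically."""
    rows = [
        (genre, mean(values) / 1_000_000)
        for genre, values in series_by_genre.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for genre, millions in average_in_millions(yearly_revenue):
    print(f"{genre}: ${millions:,.0f}M per year")
```

Sorting by value and labelling the axis in millions are exactly the choices the software default did not make for the designer.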

***

Simple lesson here: learn to hate software defaults


How to print cash, graphically

Twitter user @glennrice called out a "journalist" for producing the following chart:

Columbiaheartbeat_cashbalance

You can't say the Columbia Heartbeat site doesn't deserve a beating over this graph. I don't recognize the software but my guess is that it is one of those business intelligence (BI) tools that produce canned reports with a button click.

Until I read the article, I kept thinking that there were several overlapping lines being plotted. But it's really a 3D plus color effect!

Wait there's more. This software treats years as categories rather than a continuous number. So it made equal-sized intervals of 2 years, 1 year, 2 years, and 8 years. I am still not sure how this happened because the data set given at the bottom of the article contains annual data.
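To see how much a category axis distorts the picture, here is a minimal Python sketch that compares where the five plotted points land on an evenly spaced (categorical) axis versus a time-proportional (continuous) axis. The only input taken from the chart is the sequence of gaps (2, 1, 2 and 8 years); the function names are my own.

```python
# Gaps, in years, between consecutive plotted points, as read off the chart.
gaps_in_years = [2, 1, 2, 8]


def categorical_positions(n_points):
    """Evenly spaced positions in [0, 1], as a category axis places them."""
    return [i / (n_points - 1) for i in range(n_points)]


def continuous_positions(gaps):
    """Positions in [0, 1] proportional to elapsed time."""
    elapsed = [0]
    for gap in gaps:
        elapsed.append(elapsed[-1] + gap)
    total = elapsed[-1]
    return [e / total for e in elapsed]


cat = categorical_positions(len(gaps_in_years) + 1)
cont = continuous_positions(gaps_in_years)

# Share of the axis width given to the final 8-year interval.
print("categorical:", cat[-1] - cat[-2])
print("continuous: ", round(cont[-1] - cont[-2], 3))
```

On the category axis the final 8 years get a quarter of the width; on a proper time axis they would take up 8/13 of it, so any change over that stretch looks much steeper than it should.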

The y-axis labels, the gridlines, the acronym in the chart title, the unnecessary invocation of start-at-zero, etc. almost make this feel like a parody.

***

Aside from visual design issues, I am not liking the analysis either. The claim is that taxes have been increasing every year in Columbia, Missouri, and that the additional revenue ended up sitting in banks as cash. 

We need to see a number of other data series in order to accept this conclusion. What was the growth in tax revenues relative to the increase in cash? What was the growth in population in Columbia during this period? Did the cash holding per capita increase or decrease? What were the changes in expenditure on schools, public works, etc.?

This is a Type DV chart. There is an interesting question being asked but the analysis must be sharpened and the graphing software must be upgraded asap.

PS. On second thought, I think the time axis might be deliberately distorted. Judging from the slope of the line, the cumulative increase in the last 8 years equals the increase in past two-year increments, so if the proper scale were used, the line would flatten out significantly, demolishing the thesis of the article. Thus, it is a case of printing cash, graphically.

 

 


Statistics report raises mixed emotions

It's gratifying to live through the incredible rise of statistics as a discipline. In a recent report by the American Statistical Association (ASA), we learned that enrollment at all levels (bachelor's, master's and doctorate) has exploded in the last 5-10 years, as "Big Data" gathers momentum.

But my sense of pride takes a hit while looking at the charts that appear in the report. These graphs demonstrate again the hegemony of Excel defaults in the world of data visualization.

Here are all five charts organized in a panel:

Asa_enrollment_panel

Chart #5 (bottom right) catches the eye because it is the only chart with two lines instead of three. You then flip to the prior page to find the legend. The legend tells you the red line is Bachelor and the green line is PhD. That seems wrong, unless biostats departments do not give out Master's degrees.

This is confirmed by chart #2, where we find the blue line (Master) hugging zero.

Presumably the designer removed the blue line from chart #5 because the low counts mean that it fluctuates wildly between 0 and 100 percent and so disrupts the visual design. But the designer forgot to tell readers why the blue line is missing.

***

It turns out the article itself contradicts all of the above:

For biostatistics degrees, for which NCES started providing data specifically in 1992, master’s degrees track the overall increase from 2010–2014 at 47%... The number of undergraduate degrees in biostatistics remains below 30.

Asa_enrollment_legend

In other words, the legend is mislabeled. The blue line represents Bachelor while the red line, Master. (The error was noticed after the print edition went out because the online version has the correct legend.)

***

There is another mystery. Charts #2, #3, and #5, all dealing with biostats, have time starting from 1992, while Charts #1 and #4 start from 1987. The charts aren't lined up in a way that would allow comparisons across time.

Similarly, the vertical scale of each chart is different (aside from Charts #3 and #4). This design choice impairs comparison across charts.

In the article, it is explained that 1992 was when the agency started collecting data about biostatistics degrees. Between 1987 and 1992, were there no biostatistics majors? Were biostatistics majors lumped into the counts of statistics majors? It's hard to tell.

***

While Excel is a powerful tool that has served our community well, its flexibility is often a source of errors. The remedy to this problem is to invest ample time in overriding pretty much every default decision in the system.

For example:

Redo_asa_enrollment

This chart, a reproduction of Chart #1 above, was entirely produced in Excel.

 

 

 

 

 

 


Me and Alberto Cairo in one room tomorrow

JMP_Logo

I have been a fan of Alberto Cairo for a while, and am slowly working my way through his great book, The Functional Art, which I will review soon.

Thanks to the folks at JMP, the two of us will be appearing together in the Analytically Speaking webcast, on Friday, 1-2 pm EST. Sign up here. We are both opinionated people, so the discussion will be lively. Come and ask us questions.

 


Learn EDA (exploratory data analysis) from the experts

The Facebook data science team has put together a great course on EDA at Udacity.

EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data. 

Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
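As a rough illustration of what sits behind two of these plots, the Python sketch below computes the five-number summary that a boxplot draws and the bin counts that a histogram draws. The course itself uses R and ggplot, so this is not its code; the sample values are made up and the function names are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, three quartiles and maximum: the numbers behind a boxplot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins beginning at `start`."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Made-up sample, purely for illustration.
sample = [2, 3, 3, 4, 5, 5, 5, 6, 9]

print(five_number_summary(sample))
print(histogram_counts(sample, start=2, width=2, n_bins=4))
```

Note that quartile conventions differ between tools; the "inclusive" method is just one choice and other software may report slightly different quartiles for the same data.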

The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.

This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not a bunch of tuition-paying students in some remote classroom.

Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.


Update on Dataviz Workshop 2

The class practised doing critiques on the famous Wind Map by Fernanda Viegas and Martin Wattenberg.

Windmap

Click here for a real-time version of the map.

I selected this particular project because it is a heartless person indeed who does not see the "beauty" in this thing.

Beauty is a word that is thrown around a lot in data visualization circles. What do we mean by beauty?

***

The discussion was very successful and the most interesting points of discussion were these:

  • Something that is beautiful should take us to some truth.
  • If we take this same map but corrupt all the data (e.g. reverse all wind directions), is the map still beautiful?
  • What is the "truth" in this map? What is its utility?
  • The emotional side of beauty is separate from the information side.
  • "Truth" comes before the emotional side of beauty.

Readers: would love to hear what you think.

 

PS. Click here for class syllabus. Click here for first update.


What's in a cronut? Let me find out

Analyticsseo_ga

Reader Ross S. did not join the line for this cronut, which illustrates the popularity of different makers of tracking software on 1.3 million websites.

Original by Analytics SEO is here.

***

The biggest beef I have with this cronut is the quality of the data. As I read their description of the underlying data, I see several red flags.

The analysis is hobbled by ignoring the competitive landscape in tracking software. Google Analytics carves out a huge share of the market by virtue of offering a richly featured product for free. (They justify this by establishing a gigantic spying operation on unsuspecting users.) However, industry insiders know that Omniture (owned by Adobe) is the heavyweight enterprise solution, with a complete feature set.

In other words, most of the 670,000 "customers" of Google Analytics are tiny websites; moreover, a lot of large websites maintain Google Analytics alongside Omniture since the former is free. It would be great if the researcher gave us one of two alternative views of market share: the share of revenues in the tracking software market, or the share of e-commerce revenues represented by the customers of each tracking software vendor. These two views give a fuller picture of the competitive landscape.

You'll notice this is the same game Google is playing in the mobile universe. Android has the most users but Apple makes the bulk of revenues.

***

The SEO agency says the chart is "based on 1.3 million e-commerce websites in May 2013". Are there really 1.3 million websites out there selling us stuff? How do they define e-commerce? Is NYTimes.com an e-commerce website, for example? Or facebook.com for that matter?

In the summary, they made a pretty startling claim--that "a large number of websites have no tracking software at all". The only problem is readers can't find out what proportion of websites don't track users. The data in the cronut excluded sites without tracking, which is a big problem.

***

Here is the link to the annual Top 500 Retailers report by Internet Retailer magazine. In Sep 2011, they found that 217 out of the top 500 use Omniture, 161 use Google Analytics, and 103 use Coremetrics (now owned by IBM).
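The Top 500 counts quoted above convert to adoption rates with simple arithmetic. Here is a small Python sketch using only the three counts in this post; the variable names are my own.

```python
# Counts of top-500 retailers using each tool, as quoted in the post.
top500_users = {"Omniture": 217, "Google Analytics": 161, "Coremetrics": 103}
TOTAL_RETAILERS = 500

# Percentage of the top 500 using each tool, rounded to one decimal place.
adoption_pct = {
    vendor: round(100 * count / TOTAL_RETAILERS, 1)
    for vendor, count in top500_users.items()
}

for vendor, pct in adoption_pct.items():
    print(f"{vendor}: {pct}% of the top 500")
```

These are adoption rates rather than shares of a pie: a retailer may run more than one tool, so the percentages do not have to add up to 100.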

Another place to look for corroborating evidence is Google Trends, which measures the popularity of search keywords. The relative order of the major vendors (excluding Google Analytics) does not match well with the data shown by Analytics SEO.

Googletrends_on_tracking

Compared to:

Analyticsseo_gatabletop

Coremetrics is way down in the list compiled by Analytics SEO.


Hate the defaults

One piece of advice I give to those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.

Here are the highlights of his piece.

***

He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.

Schwabish_bls1

 

Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.

Redo_schwabishbls1

 The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.

Redo_schwabishbls2

Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).

This version is considerably cleaner than the original.

***

I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.

Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.

 


Stutter steps, and functional legends

Dona Wong asked me to comment on a project by the New York Fed visualizing funding and expenditure at NY and NJ schools. The link to the charts is here. You have to click through to see the animation.

Nyfed_funding

Here are my comments:

  • I like the "Takeaways" section up front, which uses words to tell readers what to look for in the charts to follow.
  • I like the stutter steps that are inserted into the animation. This gives me time to process the data. The point of these dynamic maps is to showcase the changes in the data over time.
  • I really, really want to click on the green boxes (the legend) and have the corresponding school districts highlighted. In other words, turning the legend into something functional. Tool developers, please take notes!
  • The other options on the map are federal, state and local shares of funding, given in proportions. These are controlled by the three buttons above. This is a design decision that privileges showing how federal funds are distributed across districts and across time. The tradeoff is that it's harder to comprehend the mix of sources of funds within each district over time.
  • I usually like to flip back and forth between actual values and relative values. I find that both perspectives provide information. Here, I'd like to see dollars and proportions.

I also find the line charts to be much clearer but the maps are more engaging. Here is an example of the line chart: (the blue dashed line is the New York state average)

Nyfed_linechart

After looking at these charts, I also want to see a bivariate analysis. How is funding per student and expenditure per student related?

Do you have any feedback for Dona?


Dampened by Google

Robert Kosara has a great summary of the "banking to 45 degrees" practice first proposed by Bill Cleveland (link). Roughly speaking, the idea is that the slope of a line chart should be close to 45 degrees for the best perception. It's not a rule that you see much on Junk Charts because it's one of those rules about which I don't hold a strong opinion.

Here are the examples given by Kosara:

Eager_eyes_aspect-ratios
The same data is presented three ways. The slope is a reflection of the scales used on the two axes.
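For readers who want to try banking on their own data, here is a minimal Python sketch of one common variant, the median-absolute-slope criterion: pick the height-to-width ratio that makes the median absolute slope of the line segments equal to 1 (45 degrees) on screen. I am assuming this variant for illustration; it is not necessarily how the examples above were produced, and `bank_to_45` is a hypothetical function name.

```python
from statistics import median


def bank_to_45(xs, ys):
    """Return the height/width aspect ratio at which the median absolute
    slope of the line segments is 45 degrees on screen."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    slopes = []
    for i in range(len(xs) - 1):
        # Segment run and rise as fractions of the full axis ranges.
        dx = (xs[i + 1] - xs[i]) / x_range
        dy = (ys[i + 1] - ys[i]) / y_range
        # Flat or vertical segments carry no slope information; skip them.
        if dx != 0 and dy != 0:
            slopes.append(abs(dy / dx))
    if not slopes:
        raise ValueError("no sloped segments to bank")
    # On screen, slope = normalized slope * (height / width).
    # Setting the median on-screen slope to 1 gives the ratio below.
    return 1 / median(slopes)


# A zigzag that rises or falls by the full y-range at every step.
aspect = bank_to_45([0, 1, 2, 3, 4], [0, 2, 0, 2, 0])
print(aspect)
```

For this zigzag the result is a height that is one quarter of the width, which makes each of the four segments a 45-degree diagonal.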

***
Well, I lied when I said I didn't care. Look at this particular chart below:

Redo_aspectratio
Some of you may recognize this style... I'm imitating Google Analytics charts. Several of the other Web charting tools also seem to come up with gems like this. Pretty much every chart you see in the Google Analytics interface looks like a flat line. The chart above looks like nothing more than noisy data from week to week.

But then look at the scale! The leftmost part of the line is a rise over two weeks. The actual rise was 50% or 300,000, i.e. an earth-shattering change.

If you use Google Analytics, you are better off downloading the data to Excel and drawing your own charts.