I have been a fan of Alberto Cairo for a while, and am slowly working my way through his great book, The Functional Art, which I will review soon.
Thanks to the folks at JMP, the two of us will be appearing together in the Analytically Speaking webcast, on Friday, 1-2 pm EST. Sign up here. We are both opinionated people, so the discussion will be lively. Come and ask us questions.
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
The biggest beef I have with this cronut is the quality of the data. As I read their description of the underlying data, I see several red flags.
The analysis is hobbled by ignoring the competitive landscape in tracking software. Google Analytics carves out a huge share of the market by virtue of offering a richly featured product for free. (They justify this by establishing a gigantic spying operation on unsuspecting users.) However, industry insiders know that Omniture (owned by Adobe) is the heavyweight enterprise solution, with a complete feature set. In other words, most of the 670,000 "customers" of Google Analytics are tiny websites; in addition, a lot of large websites maintain Google Analytics alongside Omniture since the former is free. It would be great if the researcher gave us one of two alternative views of market share: the share of revenues in the tracking software market; and the share of e-commerce revenues represented by the customers of each tracking software vendor. These two views give a fuller picture of the competitive landscape.
You'll notice this is the same game Google is playing in the mobile universe. Android has the most users but Apple makes the bulk of revenues.
The SEO agency says the chart is "based on 1.3 million e-commerce websites in May 2013". Are there really 1.3 million websites out there selling us stuff? How do they define e-commerce? Is NYTimes.com an e-commerce website, for example? Or facebook.com for that matter?
In the summary, they made a pretty startling claim--that "a large number of websites have no tracking software at all". The only problem is readers can't find out what proportion of websites don't track users. The data in the cronut excluded sites without tracking, which is a big problem.
Here is the link to the annual Top 500 Retailers report by Internet Retailer magazine. In Sep 2011, they found that 217 out of the top 500 use Omniture, 161 use Google Analytics, and 103 use Coremetrics (now owned by IBM).
Another place to look for corroborating evidence is Google Trends, which measures the popularity of search keywords. The relative order of the major vendors (excluding Google Analytics) does not match well with the data shown by Analytics SEO.
Coremetrics is way down in the list compiled by Analytics SEO.
One piece of advice I give for those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.
Here are the highlights of his piece.
He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.
Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.
The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.
Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).
This version is considerably cleaner than the original.
I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.
Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.
Dona Wong asked me to comment on a project by the New York Fed visualizing funding and expenditure at NY and NJ schools. The link to the charts is here. You have to click through to see the animation.
Here are my comments:
I like the "Takeaways" section up front, which uses words to tell readers what to look for in the charts to follow.
I like the stutter steps that are inserted into the animation. This gives me time to process the data. The point of these dynamic maps is to showcase the changes in the data over time.
I really, really want to click on the green boxes (the legend) and have the corresponding school districts highlighted. In other words, turning the legend into something functional. Tool developers, please take notes!
The other options on the map are federal, state and local shares of funding, given in proportions. These are controlled by the three buttons above. This is a design decision that privileges showing how federal funds are distributed across districts and across time. The tradeoff is that it's harder to comprehend the mix of sources of funds within each district over time.
I usually like to flip back and forth between actual values and relative values. I find that both perspectives provide information. Here, I'd like to see dollars and proportions.
I also find the line charts to be much clearer but the maps are more engaging. Here is an example of the line chart: (the blue dashed line is the New York state average)
After looking at these charts, I also want to see a bivariate analysis. How is funding per student and expenditure per student related?
Robert Kosara has a great summary of the "banking to 45 degrees" practice first proposed by Bill Cleveland (link). Roughly speaking, the idea is that the slope of a line chart should be close to 45 degrees for the best perception. It's not a rule that you see much on Junk Charts because it's one of those rules about which I don't hold a strong opinion.
Here are the examples given by Kosara:
The same data is presented three ways. The slope is a reflection of the scales used on the two axes.
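To make the link between axis scales and slope concrete, here is a minimal Python sketch of one common way to operationalize banking: pick the chart's height-to-width ratio so that the median absolute slope of the line segments is drawn at 45 degrees. This is an assumption about how to implement the idea, not necessarily Cleveland's original procedure or the method behind Kosara's examples, and the function name `banked_aspect_ratio` is hypothetical.

```python
from statistics import median


def banked_aspect_ratio(xs, ys):
    """Return a height/width ratio for the plot area such that the median
    absolute slope of the line segments is drawn at 45 degrees.

    Assumption: the data fill the plot area exactly, so the axis ranges
    equal the data ranges.
    """
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)

    # Slopes of consecutive segments, in data units.
    points = list(zip(xs, ys))
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0
    ]
    if not slopes or x_range == 0 or y_range == 0:
        raise ValueError("need at least two distinct points with some variation")

    mid_slope = median(slopes)
    if mid_slope == 0:
        raise ValueError("median slope is zero; banking is undefined")

    # On screen, a data slope s is drawn as s * (height / y_range) / (width / x_range).
    # Setting that equal to 1 for the median slope gives the ratio below.
    return y_range / (x_range * mid_slope)


# A straight line fills a square plot at exactly 45 degrees.
ratio = banked_aspect_ratio([0, 1, 2, 3], [0, 2, 4, 6])
```

A zigzag that swings across the full vertical range on every step yields a ratio well below 1, that is, a short and wide chart, which is the usual result of banking a noisy time series.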
*** Well, I lied when I said I didn't care. Look at this particular chart below:
Some of you may recognize this style... I'm imitating Google Analytics charts. Several of the other Web charting tools also seem to come up with gems like this. Pretty much every chart you see in the Google Analytics interface looks like a flat line. The chart above looks like nothing more than noisy data from week to week.
But then look at the scale! The leftmost part of the line is a rise over two weeks. The actual rise was 50% or 300,000, i.e. an earth-shattering change.
If you use Google Analytics, you are better off downloading the data to Excel and drawing your own charts.
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.) DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.
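The conversion from cap dollars to proportions of the total cap is simple arithmetic; a minimal Python sketch follows. The function name `cap_shares` and the dollar figures are made up for illustration and are not the actual MLS data.

```python
def cap_shares(cap_by_position):
    """Convert one team's cap dollars by position into shares of its total cap."""
    total = sum(cap_by_position.values())
    if total == 0:
        raise ValueError("team has no cap dollars to allocate")
    return {pos: amount / total for pos, amount in cap_by_position.items()}


# Hypothetical team: the numbers are invented, not taken from the chart.
example_team = {"D": 1.0, "F": 1.0, "M": 2.0}
shares = cap_shares(example_team)
```

Once every team is expressed as shares, the profiles become comparable across teams regardless of how much each team spends in total, which is what makes the grouping possible.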
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.
In modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.
Quite a few problems crop up here. The most hurtful is that the context of the chart is left to the text. If you read the paragraph above, you'll learn that the data represents only a select group of institutions known as the Russell Group; and in particular, Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision as the designer weighs one missing year against one missing institution (and a mighty important one at that). This issue is easily fixed by a few choice words.
You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". This chart draws our attention to the tangle of up and down segments, giving us the impression that the data is too complicated to extract a clear message.
The decision to use 21 colors for 21 schools is baffling as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is the fact that you need more than say three or four colors.
The order of institutions listed in the legend is approximately the reverse of their appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the order of legend entries.
If the whitespace were removed (I'm talking about the space between 0% and 2.25% and between 8% and 10%), the lines could be more spread out, and perhaps labels can be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.
The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart. The schools are ordered by their "inclusiveness" from left to right.
This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).
In addition, I see a negative reputation effect, with the proportion of students from low-participation areas decreasing with increasing reputation. I'm basing this on name recognition. Perhaps UK readers can confirm if this is correct. If correct, it's a big miss in terms of interesting features in this dataset.
Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm, than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."
Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.
I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts. You can compare the output of various programs to generate the charts.
I'll leave you to decide whether the programs he created are easier than Excel.
Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.
What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, that is, if we allow the definition of a recession to be flexible and make this a more general application, the challenge is more severe.
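To give a sense of the processing involved, here is a minimal Python sketch of the calendar-to-recession-time transformation. It assumes the employment levels sit in a dictionary keyed by (year, month), that the recession start is passed in as a parameter rather than hard coded, and that each series is expressed as percent change from the level in the start month. The function names, the data layout, and the sample numbers are all hypothetical; this is not Andrew's code or the Calculated Risk method.

```python
def months_between(start, current):
    """Whole calendar months from start to current, each given as (year, month)."""
    return (current[0] - start[0]) * 12 + (current[1] - start[1])


def align_to_recession(series, start):
    """Re-index a monthly series to "months from start of recession".

    series: dict mapping (year, month) -> employment level
    start:  (year, month) of the recession start, supplied by the caller
    Returns a list of (months_since_start, pct_change_from_start_level).
    """
    base = series[start]
    aligned = []
    for year_month in sorted(series):
        offset = months_between(start, year_month)
        if offset >= 0:  # drop months before the recession began
            pct = 100.0 * (series[year_month] - base) / base
            aligned.append((offset, pct))
    return aligned


# Made-up levels, for illustration only.
levels = {(2007, 11): 100.0, (2007, 12): 200.0, (2008, 1): 190.0, (2008, 2): 180.0}
aligned = align_to_recession(levels, start=(2007, 12))
```

Because the start date is an argument, changing the definition of a recession means changing an input, not the code. The harder part, which this sketch leaves out, is deciding where those start dates come from.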
Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.
By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
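For comparison, here is a minimal Python sketch of the extra code needed outside Excel. It assumes each series is a list of percent changes from the start of a recession, cuts the series at the first return to the horizontal axis, and then pads the shorter series with `None`, which plays the role of the blank cells. The function names and the sample numbers are hypothetical.

```python
def truncate_at_recovery(values):
    """Keep values up to and including the first return to zero or above
    after the series has dipped below zero."""
    kept = []
    dipped = False
    for value in values:
        kept.append(value)
        if value < 0:
            dipped = True
        elif dipped:
            break  # the line has reached the horizontal axis again
    return kept


def pad_to_same_length(series_list, fill=None):
    """Pad shorter series with a filler so that all have equal length."""
    longest = max(len(series) for series in series_list)
    return [series + [fill] * (longest - len(series)) for series in series_list]


# Made-up series: one recovers after four months, one has not recovered yet.
recovered = truncate_at_recovery([0.0, -1.0, -2.0, -0.5, 0.3, 1.0])
ongoing = truncate_at_recovery([0.0, -1.0, -2.0, -3.0, -4.0, -5.0, -6.0])
table = pad_to_same_length([recovered, ongoing])
```

Most plotting libraries can skip missing values, but the exact handling differs by package, so the padding step may need adjusting.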
In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data sources as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)
I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.