The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
One of the dangers of "Big Data" is the temptation to get lost in the details. You become so absorbed in the peeling of the onion that you don't realize your tear glands have dried up.
Hans Rosling linked to a visualization of tobacco use around the world from Twitter (link to original). The setup is quite nice for exploration. I'd call this a "tool" rather than a visual.
Let's take a look at the concentric circles on the right.
I appreciate the designer's concept -- the typical visualization of this type of data is looking at relative rates, which obscures the fact that China and India have far and away the most smokers even if their rates are middling (24% and 13% respectively).
This circular chart is supposed to show the absolute distribution of smokers across so-called "super-regions" of the world.
Unfortunately, the designer decided to pile on additional details. The concentric circles present a geography lesson, in effect. For example, high-income super-region is composed of high-income North America, Western Europe, high-income Asia Pacific, etc. and then high-income North America is composed of USA, Canada, etc.
Notice something odd? The further out you go, the larger the circular segments but the smaller the number of people they represent! There are more people in the super-region of high-income worldwide than in high-income North America and in turn, there are more people in the high-income North American region than in USA. But the size of the graphical elements is reversed.
In principle, the "bumps"-like chart used to show the evolution of tobacco prevalence in individual countries makes for a nice visual. In fact, Rosling marvelled that the global rate of consumption has fallen in recent years.
However, I'm often irritated when the designer pays no attention to what not to show. There are probably well above 200 lines densely packed into this chart. It is almost certain that over-plotting will cause some of these lines to literally never see the light of day. Try hovering over these lines and see for yourself.
The same chart with say 10 judiciously chosen lines (countries or regions) provides the reader with a lot more profit.
The discerning reader figures out that the best visual actually does not even show up on the dashboard. Go ahead, and click on the tab called "Data" on top of the page. You now see a presentation of each country's "data" by age group and by gender. This is where you can really come up with stories for what is going on in different countries.
For example, the British have really done extremely well in reducing tobacco use. Look at how steep the declines are across the board for British men. (In most parts of the world, the prevalence of smoking is much higher among men than women.)
Bulgaria on the other hand shows a rather odd pattern. It is one of the few countries in the bumps chart that showed a climb in smoking rates, at least in the early 2000s. Here the data for men is broken down into age groups.
This chart exposes a weakness of the underlying data. The error bars indicate to us that what is being plotted is not actual data but modeled data. The error bars here are enormous. With the average at about 40% to 50% for many age groups, the confidence interval is also 40% wide. Further, note that there were only three or four observations (purple dots) and curves are being fitted to these three or four dots, plus extrapolation outside the window of observation. The end result is that the apparent uplift in smoking in the early 2000s is probably a figment of the modeler's imagination. You'd want to understand if there are changes in methodologies around that time.
As a responsible designer of data graphics, you should focus less on comprehensiveness and more on highlighting the good data. I'm a firm believer in "no data is better than bad data".
One piece of advice I give for those wanting to get into data visualization is to trash the defaults (see the last part of this interview with me). Jon Schwabish, an economist with the government, gives a detailed example of how this is done in a guest blog on the Why Axis.
Here are the highlights of his piece.
He starts with a basic chart, published by the Bureau of Labor Statistics. You can see the hallmarks of the Excel chart using the Excel defaults. The blue, red, green color scheme is most telling.
Just by making small changes, like using tints as opposed to different colors, using columns instead of bars, reordering the industry categories, and placing the legend text next to the columns, Schwabish made the chart more visually appealing and more effective.
The final version uses lines instead of columns, which will outrage some readers. It is usually true that a grouped bar chart should be replaced by overlaid line charts, and this should not be limited to so-called discrete data.
Schwabish included several bells and whistles. The three data points are not evenly spaced in time. The year-on-year difference is separately plotted as a bar chart on the same canvas. I'd consider using a line chart here as well... and lose the vertical axis since all the data are printed on the chart (or else, lose the data labels).
This version is considerably cleaner than the original.
I noticed that the first person to comment on the Why Axis post said that internal BLS readers resist more innovative charts, claiming "they don't understand it". This is always a consideration when departing from standard chart types.
Another reader likes the "alphabetical order" (so to speak) of the industries. He raises another key consideration: who is your audience? If the chart is only intended for specialist readers who expect to find certain things in certain places, then the designer's freedom is curtailed. If the chart is used as a data store, then the designer might as well recuse him/herself.
Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm, than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."
Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.
I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts. You can compare the output of various programs to generate the charts.
I'll leave you to decide whether the programs he created are easier than Excel.
Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.
What I started with was the employment level data from BLS. What such data lacks is the definition of a recession, that is, the starting year and ending year of each recession. The data also comes in calendar months and years, and transforming that to "months from start of recession" is not straightforward. If we don't want to "hard code" the details, i.e. if we allow the definition of a recession to be flexible and make this a more general application, the challenge is more severe.
Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.
By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
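In a scripting language, the equivalent of leaving cells blank is to cut each series at the point where it climbs back to the axis. Here is a minimal sketch in Python, assuming each recession's series is already a list of (months since start, percent change from peak) pairs. The function name and the toy numbers are made up for illustration, not taken from Andrew's post or from the BLS data.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (months since start of recession, % change from peak)


def truncate_at_recovery(series: List[Point]) -> List[Point]:
    """Keep points up to and including the first return to 0% or above."""
    kept: List[Point] = []
    dipped = False  # becomes True once the series has gone below the axis
    for month, pct in series:
        kept.append((month, pct))
        if pct < 0:
            dipped = True
        elif dipped:
            # The line has come back to the axis: drop everything after this.
            break
    return kept


# Toy series: recovers at month 3, so month 4 is deemed extraneous.
recovered = truncate_at_recovery([(0, 0.0), (1, -1.0), (2, -0.5), (3, 0.2), (4, 1.0)])
# Toy series that never recovers: returned unchanged.
ongoing = truncate_at_recovery([(0, 0.0), (1, -2.0), (2, -3.0)])
```

The result is exactly the awkward dataset described above: each recession ends up with its own number of values, and the plotting code has to cope with that.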
In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement between the data as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)
I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.
One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:
I think a lot of readers have seen this one. It's a very effective chart.
The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter now needs to be set up for each segment, re-setting to zero, for each new recession. All this creates the effect of time-shifting.
And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look at the prior recession and extract out the peak employment level, which is then used as the base to compute the percentage that is being plotted.
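To make the steps concrete, here is a rough sketch in Python of the time-shifting and the percentage calculation. It rests on several assumptions of mine: the monthly levels arrive as a dictionary keyed by (year, month) with no missing months, the recession start months are supplied by hand, and the level at each start month stands in for the peak. The function name and the toy numbers are hypothetical, not the actual BLS figures.

```python
from typing import Dict, List, Tuple

Month = Tuple[int, int]  # (year, month)


def align_to_recession_starts(
    levels: Dict[Month, float],
    starts: List[Month],
) -> Dict[Month, List[Tuple[int, float]]]:
    """For each start month, return (months since start, % change from base)."""
    months = sorted(levels)
    ordered_starts = sorted(starts)
    aligned: Dict[Month, List[Tuple[int, float]]] = {}
    for i, start in enumerate(ordered_starts):
        # A segment runs until the next recession starts, or until the data ends.
        end = ordered_starts[i + 1] if i + 1 < len(ordered_starts) else None
        base = levels[start]  # assumption: the start month is the employment peak
        segment = [m for m in months if m >= start and (end is None or m < end)]
        # The counter k resets to zero for every recession (the time shift).
        aligned[start] = [
            (k, 100.0 * (levels[m] - base) / base) for k, m in enumerate(segment)
        ]
    return aligned


# Toy employment levels, invented for illustration only.
toy_levels = {
    (2000, 1): 100.0, (2000, 2): 98.0, (2000, 3): 99.0,
    (2000, 4): 200.0, (2000, 5): 190.0,
}
toy_aligned = align_to_recession_starts(toy_levels, [(2000, 1), (2000, 4)])
```

Note that the hard part described above, deciding when each recession begins, is still done by a human here: the start months are an input, not something the code discovers.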
One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.
Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the printed page is grayscale. So I asked CR for his data, and re-made the chart like this:
You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V-shaped curve with two linear segments, the left arm showing the average rate of decline leading to the bottom of the recession, while the right arm shows the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jittering is just noise.
I also chose a small-multiples layout to separate the curves into groups by decade. When you only have one color, you can't have ten lines plotted on top of one another.
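For readers who want to try this kind of smoothing, here is one way it could be sketched in Python. This is my reading of the idea rather than a record of how the chart was actually built: find the lowest point, then draw one straight line from the start to the trough and another from the trough to the last point. The function name and the toy series are invented.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (months since start, % change from peak)


def v_shape(series: List[Point]) -> Dict[str, object]:
    """Summarize a series by its trough and the average slope of each arm."""
    start_month, start_pct = series[0]
    end_month, end_pct = series[-1]
    trough_month, trough_pct = min(series, key=lambda p: p[1])

    # Left arm: average change per month from the start down to the trough.
    left_span = trough_month - start_month
    decline = (trough_pct - start_pct) / left_span if left_span else 0.0

    # Right arm: average change per month from the trough to the last point.
    right_span = end_month - trough_month
    recovery = (end_pct - trough_pct) / right_span if right_span else 0.0

    return {
        "trough": (trough_month, trough_pct),
        "decline_per_month": decline,
        "recovery_per_month": recovery,
    }


# Toy series with some jitter around a V that bottoms out at month 4.
toy = [(0, 0.0), (1, -0.5), (2, -2.5), (3, -3.0), (4, -4.0),
       (5, -3.5), (6, -1.0), (7, -0.8), (8, 0.0)]
summary = v_shape(toy)
```

Three numbers per recession (the trough and the two slopes) are then enough to redraw each curve.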
One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.
(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)
An email lay in my inbox with the tantalizing subject line: "How to Create Good Infographics Quickly and Cheaply?" It's a half-spam from one of the marketing sites that I signed up for a long time ago. I clicked on the link, which led me to a landing page which required yet another click to get to the real thing (link). (Now, you wonder why marketers keep putting things in your inbox!)
The article was surprisingly sane. The author, Carrie Hill, suggests that the first thing to do is to ask "who cares?" This is the top corner of my Trifecta Checkup, asking what's the point of the chart. Some of us not so secretly hope that the answer to "who cares?" is no one.
Carrie then lists a number of resources for creating infographics "quickly and cheaply".
Easel.ly caught my eye. This website offers templates for creating infographics. You want time-series data depicted as a long, hard road ahead, you have this on the right.
You want several sections of multi-colored bubble charts, you have this theme:
In total, they have 15 ready-made templates that you can use to make infographics. I assume paid customers will have more.
infogr.am is another site with similar capabilities, and apparently for those with some data in hand.
Based on this evidence, the avalanche of infographics is not about to pass. In fact, we are going to see the same styles repetitively. It's like looking at someone's PowerPoint presentation and realizing that they are using the "Advantage" theme (one of the less ugly themes loaded by default). In the same way, we will have a long, winding road of civil rights, and a long, winding road of Argentina's economy, and a long, winding road of Moore's Law, etc.
But I have long been an advocate of drag-and-drop style interfaces for producing statistical charts. So I hope the vendors out there learn from these websites and make their products ten times better so that it is as "quick and cheap" to make nice statistical charts as it is to make infographics.
The scatter-plot matrix is one of the lesser known graphical tools beloved by statisticians. A scatter plot displays the correlation between a pair of variables. Given a set of n variables, there are n-choose-2 pairs of variables, and thus the same number of scatter plots. These scatter plots can be organized into a matrix, making it easy to look at all pairwise correlations in one place.
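The counting is easy to check for yourself. The short Python sketch below lists every pair for four illustrative variable names (my choice, purely as an example) and then counts the panels needed for a dozen variables.

```python
from itertools import combinations
from math import comb

# Illustrative variable names; any list of distinct names would do.
factors = ["housing affordability", "transit", "green space", "nightlife"]

# Each unordered pair of variables gets one scatter plot.
pairs = list(combinations(factors, 2))
for x, y in pairs:
    print(f"{x} vs. {y}")

print(len(pairs), "panels for", len(factors), "variables")
print(comb(12, 2), "panels for 12 variables")
```

The number of panels grows quickly with the number of variables, which is why arranging them in a matrix helps.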
Since Nate Silver's feature article about New York neighborhoods came out, I have been working on capturing the data because so much was left unsaid in that article. His ranking formula takes 12 factors (housing affordability, transit, green space, nightlife, etc.) and combines individual scores into an overall score based on chosen weights (e.g. housing affordability counted for 25%). Scores are then converted to ranks.
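The mechanics of such a formula can be sketched in a few lines of Python. Only the 25% weight on housing affordability comes from the article. The other weights, the three-factor simplification, the neighborhood names and all the scores below are made up for illustration.

```python
# Weights must sum to 1. Only the 0.25 for housing affordability is from
# the article; the rest are invented for this toy example.
weights = {"housing_affordability": 0.25, "transit": 0.50, "green_space": 0.25}

# Hypothetical neighborhoods with made-up factor scores out of 100.
scores = {
    "A": {"housing_affordability": 80, "transit": 60, "green_space": 40},
    "B": {"housing_affordability": 40, "transit": 90, "green_space": 60},
    "C": {"housing_affordability": 60, "transit": 40, "green_space": 80},
}


def overall(factor_scores):
    """Weighted sum of the individual factor scores."""
    return sum(weights[f] * factor_scores[f] for f in weights)


totals = {name: overall(s) for name, s in scores.items()}

# Convert scores to ranks: the highest overall score gets rank 1.
ordered = sorted(totals, key=totals.get, reverse=True)
ranks = {name: position + 1 for position, name in enumerate(ordered)}
```

Changing the weights can reshuffle the ranks, which is one reason the individual factor scores are worth looking at on their own.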
Silver's discussion focuses on explaining which factors caused which neighborhoods to be ranked high (or low). I'm interested in whether the individual factors are correlated. For example, do neighborhoods with more expensive housing also tend to have higher-quality housing? What about better schools? Are more diverse neighborhoods also more creative? And so on. There is really a treasure trove of information locked up in this data.
A scatter-plot matrix neatly organizes all of the pairwise correlation information. See below.
Each small chart shows the correlation between the given pair of variables (one listed on the right, the other listed below). The dots represent the neighborhoods. The pink patch contains the "middle 75%" of the neighborhoods, and we can use the orientation of these patches to get a sense of whether the two variables are positively, negatively or not correlated.
There is a lot to see in this chart. I just picked a random few things for illustration:
In the top left corner, the slant shows that the more affordable the homes are, the worse is the transit.
The better the shopping, the better the dining.
Interestingly, more diversity seems to mean lower creative capital (although the correlation is only moderate).
Wellness scores fall within a rather narrow range compared to other categories, and they seem to be almost completely unrelated to any of the other factors.
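The slants described above can also be summarized with a number. Below is a small Python sketch of the usual Pearson correlation coefficient, written out by hand. The score lists are toy values I invented to mimic the direction of two of the patterns. They are not the New York magazine data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: +1 is a perfect upward slant, -1 a downward one."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy scores for five hypothetical neighborhoods (invented numbers).
affordability = [1, 2, 3, 4, 5]
transit = [9, 7, 6, 4, 2]   # falls as affordability rises
shopping = [2, 4, 5, 7, 9]
dining = [1, 3, 6, 7, 8]    # rises with shopping

r_afford_transit = pearson(affordability, transit)
r_shop_dine = pearson(shopping, dining)
```

A single coefficient hides the shape of the cloud, so it complements the matrix rather than replacing it.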
(Note: I used JMP to generate this matrix. Excel unfortunately does not make scatter-plot matrices natively. JMP is great for such exploration... if the developers are reading this, please make it easier to man-handle the category labels! I made a mess of rotating the text on the right.)
P.S. I had an adventure processing the data from New York magazine. There appear to have been quite a few typos. For more, see my writeup on the book blog.
In response to my call for positive examples, reader Merle H. sent in an example of how good charts can make our lives simpler and easier.
All of us have seen the following presentation of air travel data.
Not trying to pick on Travelocity - it's the same format whether you use Expedia or any of the airline sites. For those customers who are looking to decide what dates to travel so as to minimize their air fare, this format is very cumbersome to use.
What about this fare chart at FuncTravel.com?
As you mouse along the line chart, the average fare for each day is visible. Clicking on a particular day will fix the departure or return dates.
So much easier, isn't it?
A few caveats, though:
Instead of just providing the historical averages, they should consider including information on variability, such as bars that indicate the middle 50% or 75% of prices. Also, what about a sliding control for customers to decide which period of past history the averages should use? More recent data may be more representative.
This particular feature appeals to the price-sensitive, date-flexible customer segment. Not everyone will pick itineraries based on those criteria. There is an easy fix. If some controls are available for customers to indicate other preferences, e.g. exclude all British Airways flights, include only evening flights, etc., and the chart can update itself based on such selections, then the chart becomes a lot more flexible, and useful to many more customers.
As with many automatically generated charts, the chosen labels on the vertical axis are laughable. That should be relatively easy to fix, you'd think.
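The first caveat, about showing variability, is cheap to act on. Here is a minimal sketch in Python that computes the middle 50% of past fares for one travel date, using the quartiles from the standard library. The fares are invented, and nothing here reflects how FuncTravel actually computes its averages.

```python
from statistics import quantiles


def middle_half(fares):
    """Return the lower and upper quartiles, which bound the middle 50%."""
    lower, _median, upper = quantiles(fares, n=4)
    return lower, upper


# Invented historical fares (in dollars) for a single departure date.
past_fares = [200, 220, 250, 300, 320, 400, 450]
low, high = middle_half(past_fares)
print(f"Half of past fares fell between ${low:.0f} and ${high:.0f}")
```

Restricting `past_fares` to a more recent window before calling the function would give the effect of the sliding control suggested above.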
A great start. I happened to notice that Travelocity has a beta feature that shows a similar chart. A revolution in how travel sites present data to us is long overdue.