I have been a fan of Alberto Cairo for a while, and am slowly working my way through his great book, The Functional Art, which I will review soon.
Thanks to the folks at JMP, the two of us will be appearing together in the Analytically Speaking webcast, on Friday, 1-2 pm EST. Sign up here. We are both opinionated people, so the discussion will be lively. Come and ask us questions.
The New York Times graphics team shows us how to do an infographics poster the right way. They recently put up a feature showing how the repeal of helmet laws is linked to increasing motorcycle fatalities. The graphic is here.
One of the key charts is this one (second to last screen):
The graphic tells the story; no additional words are needed. (Actually, you'd have to come from the prior page to know that the white vertical line represents the year in which Florida repealed its helmet law.)
Of course, one state does not prove a trend. It appears that other states face the same situation. It would be nicer if they could start this next chart at an earlier time.
I'm surprised by how much these lines fluctuate given that the raw counts are in the hundreds.
I wonder if there is any active debate in Florida or elsewhere as it would appear that the helmet law repeal may have caused hundreds of unnecessary deaths. Have people been coming up with other explanations for the sharp rise in motorcycle fatalities involving those not wearing helmets?
Today's post examines an example of Big Data analysis, submitted by a reader, Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis. This post concerns the graphical element.)
The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).
Observational: the data are ad impressions coming from the Chitika Ad Network observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other form of tracking devices. The analyst did not plan this study and then collect the data.
Lacking Controls: There will be a time trend but what should we compare against? How do we know if something is out of the ordinary or not?
Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.
Adapted: All weblog data are adapted in the sense that web logs originally serve web developers who are interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and we need that data to understand how webpages render differently. I wrote about the adaptedness of this data in a separate blog post. (link)
The analysis did not require merging data, the fifth element of the framework.
Here is the chart type used to present the analysis. There are many problems.
The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits on top of the orange line during certain times of the day.
Daniel is very annoyed with the way the data is processed, and rightfully so. The chart actually does not say what it appears to say. This is because of the use of indexing.
This simple chart is not so simple to interpret!
This is because each line is "indexed to self". For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at 2/3 of their peak-hour usage. The trouble is the peak-hour usage by iOS users is more than 2.5 times as high as the peak-hour usage of Android users, so 100% blue is less than half of 100% orange by count.
Later in the same post, the analyst re-indexed both series to the iOS peak. This chart tells us that iOS users are more active no matter what time of the day.
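To make the arithmetic concrete, here is a minimal sketch in Python. The hourly counts are made up for illustration (they are not Chitika's data); I picked them only so that the noon values and the ratio of the two peaks roughly match the figures quoted above. The function name index_to is also my own.

```python
# Hypothetical hourly volumes (illustrative only, not Chitika data).
android = {"9am": 300, "12pm": 300, "9pm": 400}   # Android peak = 400
ios = {"9am": 700, "12pm": 680, "9pm": 1020}      # iOS peak = 1020, about 2.5x Android's


def index_to(series, base):
    """Express each value in `series` as a percentage of `base`."""
    return {hour: 100 * value / base for hour, value in series.items()}


# "Indexed to self": each series is divided by its own peak.
android_self = index_to(android, max(android.values()))
ios_self = index_to(ios, max(ios.values()))

# Common base: both series are divided by the same number (the iOS peak).
common_base = max(ios.values())
android_common = index_to(android, common_base)
ios_common = index_to(ios, common_base)

print("self-indexed at noon:", android_self["12pm"], ios_self["12pm"])
print("common base at noon: ", android_common["12pm"], ios_common["12pm"])
```

With these made-up numbers, the self-indexed Android value at noon sits above the iOS value, while on the common base the order flips. The raw counts never changed; only the denominator did.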
The Chitika analyst is not doing anything unusual. This type of indexing is a pandemic in Web analytics. The worst thing about it is that a lot of Web data is long-tailed and the maximum value is an outlier. Indexing data to an outlier isn't wise. (The index is usually used to hide the actual values of the data, typically to keep company secrets. But there are better ways to accomplish this.)
Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.
Are users with both Apple and Microsoft devices counted on both blue and orange lines?
How is "volume" of Web usage determined? Is it strictly number of ad impressions?
Why is total volume displayed? If Microsoft PCs dominate Macs, and the chart shows the PC line well above the Mac line, is it speaking to market share or is it speaking to usage patterns of the average user?
How representative is the traffic in the Chitika network?
How did the analyst deal with bot traffic?
Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users, and compare their curves, without even looking at the data, you can surmise that you will see a similar shape but time-shifted by approximately three hours. Ignoring time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."
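Here is a rough sketch of the adjustment I have in mind, in Python. It assumes each impression comes with the user's UTC offset, which is a hypothetical field (I don't know what Chitika's logs contain), and it uses fixed offsets that ignore daylight saving time. The three impressions are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# EST as a fixed UTC-5 offset (assumption: daylight saving is ignored).
EASTERN = timezone(timedelta(hours=-5))

# Hypothetical impressions: (timestamp recorded in EST, user's UTC offset in hours).
impressions = [
    (datetime(2014, 3, 3, 12, 0, tzinfo=EASTERN), -5),  # New York user
    (datetime(2014, 3, 3, 15, 0, tzinfo=EASTERN), -8),  # California user
    (datetime(2014, 3, 3, 21, 0, tzinfo=EASTERN), -8),  # California user
]


def local_hour(stamp, utc_offset_hours):
    """Return the hour of day on the user's own clock."""
    user_zone = timezone(timedelta(hours=utc_offset_hours))
    return stamp.astimezone(user_zone).hour


# Bucketing by the recorded (EST) hour versus by the user's local hour.
by_est_hour = Counter(stamp.hour for stamp, _ in impressions)
by_local_hour = Counter(local_hour(stamp, offset) for stamp, offset in impressions)

print(sorted(by_est_hour.items()))
print(sorted(by_local_hour.items()))
```

In EST, the New York and California users appear in different hours; on local time, the 12 pm EST and 3 pm EST impressions both land at noon.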
My chart making workshop has passed the point where each participant (except one) has presented the first draft of his or her project, and the class has opined on these efforts. Previously, I posted the syllabus of the course here. Also catch up on previous updates (1, 2).
So far, I am very pleased with the results, and importantly, the students have given rave reviews. The in-class discussions have been very constructive, and civil. In every case, the chart designer went home with a few ideas for improvement. The types of issues that came up ranged widely. Here are some examples:
Figuring out what the message is in the data set
Thinking about what other data can be obtained to clarify the message
Discussing the level of detail appropriate for a legend
Dealing with data with a large number of small values
Examining how charts appear to a color-blind reader (we have a color-blind student in the class)
Reducing the complexity of a chart
As the course draws to a close, several students have expressed an interest in keeping the class together via a meetup group or something similar. I'm thinking about how to accomplish this.
One lesson learned so far is that a few students got stuck trying to restructure the data, and were late submitting their work. I should stress that all submissions in the course are works in progress, and maybe I should offer some data processing help during the course.
The next workshop will be offered in the summer.
PS. Don't miss Andrew Gelman's summary of his graphics tips here.
Peter Cock sent this Venn diagram to me via Twitter. (Original from this paper.)
For someone who doesn't know genetics, it is very hard to make sense of this chart. It seems like there are five characteristics that each unit of analysis can have (listed on the left column) and each unit possesses one or more of these characteristics.
There is one glaring problem with this visual display. The area of each subset is not proportional to the count it represents. Look at the two numbers in the middle of the chart, each accounting for a large chunk of the area of the green tree. One side says 5,724 while the other says 13 even though both sides have the same area.
In this respect, Venn diagrams are like maps. The area of a country or state on a map is not related to the data being plotted (unless it's a cartogram).
If you know how to interpret the data, please leave a comment. I'm guessing some kind of heatmap will work well with this data.
A Twitter follower submitted this chart showing the shift in ethnicity in Texas:
If you blinked, you probably took away the wrong message. Our "prior" tells us that the proportion of Hispanics has been rising quite rapidly in Texas. So, like me, you might home in on the blue columns, which have increased drastically from 32% to 68%.
Things start to fall apart.
First, you might notice the blue label says "Non-Hispanic Whites," which is exactly the opposite of our hypothesis. For a moment, we are confused. Could it be that the Hispanic population in Texas has been shrinking?
Then, you might notice that the "information in our head" made us assume that the horizontal axis represents time. On a closer look, we discover that it's not time; what's being plotted from left to right are age groups. In fact, it's kind of a reversed time. The generations on the right side were born earlier and represent the ethnic distribution today of people born over 60 years ago while the columns on the left represent younger generations.
Finally, the gray columns are redundant and distracting.
On the other hand, the designer is admirably restrained with data labels, and included the baby and crooked man with a stick icons to provide some guidance, both of which are good ideas.
If I apply the Trifecta checkup to this chart, the biggest issue is misalignment between the interesting question of ethnic changes in Texans and the data used to explore this question. The current ethnic mix is not only impacted by the ethnic composition at birth but also by net migrations of different races and by their longevity. As pointed out above, the split by age groups forced us into a kind of reversed time thinking.
A simple fix involves expressing ages as birth years, and using a single line instead of columns:
This version doesn't address the tendency to interpret the left-right axis as time, and the excessive number of age groups.
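The relabeling of ages as birth years is simple arithmetic, sketched below in Python. The age bands and the survey year are hypothetical (I am not reproducing the chart's actual groups), and the conversion is approximate since a person's birth year also depends on whether the birthday has passed in the survey year.

```python
def birth_year_range(age_low, age_high, survey_year):
    """Map an age band to the (earliest, latest) birth years it roughly covers."""
    return survey_year - age_high, survey_year - age_low


# Hypothetical age bands and survey year, for illustration only.
SURVEY_YEAR = 2012
bands = [(0, 4), (30, 34), (60, 64)]

labels = ["%d-%d" % birth_year_range(low, high, SURVEY_YEAR) for low, high in bands]

# Sorting by birth year puts the oldest cohort on the left,
# so the axis runs forward in time rather than in reverse.
ordered = sorted(labels)

print(ordered)
```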
An even better chart would put time on the horizontal axis, then have multiple lines each representing the proportion of non-Hispanic whites of a specific age group. It may be a political choice--I'm not sure why they chose to plot the declining proportion of non-Hispanic whites and lump Hispanics into "all others" as opposed to plotting the increasing mix of Hispanics.
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of its own idiosyncratic syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.
This classic Excel chart has some basic construction issues:
The data labels are excessive
The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just include the legend text in the existing text annotation!
In terms of the Trifecta checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income and this distribution has not changed noticeably despite the Long Tail phenomenon.
But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.
It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.
So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
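Getting from the stacked amounts to the proportions is a one-line calculation per year, as in this Python sketch. The dollar figures are invented for illustration and are not the report's data; I only made them roughly consistent with a total that falls by about $1 billion and a top-1-percent share in the neighborhood of 75 percent.

```python
# Hypothetical incomes in $ billions (illustrative only, not the report's data).
income = {
    2000: {"top_1_percent": 3.0, "everyone_else": 1.1},
    2013: {"top_1_percent": 2.4, "everyone_else": 0.7},
}


def top_share(year_amounts):
    """Share of total income going to the top 1 percent, as a percentage."""
    total = sum(year_amounts.values())
    return 100 * year_amounts["top_1_percent"] / total


# One number per year: this is the series to plot if the share is the headline.
shares = {year: round(top_share(amounts), 1) for year, amounts in income.items()}

for year in sorted(shares):
    print(year, shares[year])
```

Plotting the shares as a single line across the years puts the dominance story front and center, while the total income can be left to the text.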
A Twitter follower @mdjoner felt that something is amiss with the squares in this chart comparing real estate prices in major cities around the world. I'm not sure where the chart originally came from but there is a CNBC icon.
There is one thing I really like about the chart, which is the metric that has been selected. The original data is likely to be price per square metre for luxury property in various places. The designer turned this around and computed the size of what you can buy assuming you spend $1 million. I think we have a better ability to judge areas than dollars.
The notion of floor area meshes well with the area on a chart, so there is an intuitive appeal as well.
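The conversion from price to area, and from area to the side of a square on the page, can be sketched as follows in Python. I am assuming the source data is price per square metre, as guessed above; the city names and prices are hypothetical, as are the function names.

```python
import math


def area_for_budget(price_per_sqm, budget=1_000_000):
    """Floor area (in square metres) that `budget` dollars buys at a given price."""
    return budget / price_per_sqm


def square_side(area):
    """Side length of a square whose drawn area is proportional to `area`."""
    return math.sqrt(area)


# Hypothetical prices in dollars per square metre (not the chart's data).
prices = {"City A": 10_000, "City B": 40_000}

areas = {city: area_for_budget(price) for city, price in prices.items()}
sides = {city: square_side(area) for city, area in areas.items()}

for city in prices:
    print(city, areas[city], sides[city])
```

Note that a city four times as expensive gets a quarter of the area but a square with half the side length. Scaling the sides directly by the data, instead of the areas, would exaggerate the differences.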
So in the Trifecta checkup, they did well posing an interesting question, and picking some data. But like Mike, I'm not excited about the graphical construct.
There are a few problems with this chart:
It requires using colors when the colors do nothing other than delineating one city from the next.
There's overcrowding at the bottom of the chart because the designer maintained a fixed spacing throughout the chart.
The city label is always positioned above the middle of the diamond. I find it very confusing in the bottom half of the chart, where the diamonds start overlapping.
The shadows plus the overlapping make it almost impossible to make out the actual areas of the pieces.
Here is an alternative display of the data:
Notice that I designed this for an American audience. I'd change certain decisions if using this for a non-American reader. I chose New York as the focal point, and split the cities into two parts. On the left are the cities less expensive than New York and on the right are those cities more expensive than New York.
Also, along the bottom, I provide some clues to help people bridge the gap between the areas shown on the graphic, and real-life areas. For example, the orange square represents 400 square feet but without the annotation telling you it's about the size of a typical Manhattan studio, you may not know how to map the size of the orange square to your perception of real spaces. I also included images (although if I'm publishing this, I'd want better ones).
Finally, note that the data set did not show up on my version of the chart.