I have been traveling quite a lot recently, and last week, I read the Wall Street Journal cover to cover for the first time in a while. I am happy to report that there are many more data graphics than I remember from past editions.
The following chart illustrating findings of an FCC report on broadband speeds has a number of issues (a related blog post containing this chart can be found here):
The biggest problem with the visual elements is the lack of linkage between the two components. The two charts should be connected: the one on the right presents ISP averages by broadband technology while the one on the left presents individual ISP results. Evidently, the designer treats the two parts as separate.
If that was the intention, there are two decisions that create confusion for readers. First, the charts use two different but related scales. Just add 100% to the scale of the left chart and you get the scale of the right chart. There really is no need for two different scales.
Secondly, orange and blue are used in both charts but for different purposes. In the left chart, orange denotes all ISPs whose actual speeds were below their advertised speeds. In the right chart, orange denotes ISPs using DSL technology.
I also do not understand why some ISP names are bolded. The bolded companies include several cable providers (but not all), several DSL providers (but not all), one fiber provider, and no satellite providers.
Lastly, I'd prefer they stick to either "advertised" or "promised". I do like the axis labels, which say "faster than" and "slower".
One challenge of the data is that the FCC report (here) does not provide a mathematical linkage between the technology averages and the ISP data. We know that 91% for DSL is the average of the ISPs that use DSL as shown on the left of the chart, but we don't know the weights (relative popularity) of each ISP so we can't check the computation.
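For the record, the computation we'd like to check is just a weighted average. Here is a minimal sketch in R; the ratios and weights are invented for illustration, since the report discloses neither:

```r
# Hypothetical check of the DSL average: actual/advertised speed ratios
# for three DSL ISPs, weighted by relative popularity. All numbers are
# made up; the FCC report does not disclose the weights.
isp_ratio  <- c(0.85, 0.90, 0.95)
isp_weight <- c(0.5, 0.3, 0.2)    # must sum to 1
weighted.mean(isp_ratio, isp_weight)
# [1] 0.885  -- the figure we cannot verify without the weights
```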
But if we think of the average by technology as a reference point against which to measure individual ISPs, we can still use the data, and more efficiently, as in the following dot plot, where the vertical lines indicate the appropriate technology average:
(The cable section should have come before the DSL section but you get the idea.)
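For those who want to play along, here is a minimal R sketch of this dot plot; the ISP names, technologies, and ratios below are invented stand-ins for the FCC data:

```r
library(ggplot2)

# Invented stand-in for the FCC data: one row per ISP, with the ratio
# of actual to advertised speed and the broadband technology used.
fcc <- data.frame(
  isp   = c("ISP A", "ISP B", "ISP C", "ISP D"),
  tech  = c("DSL", "DSL", "Cable", "Cable"),
  ratio = c(0.85, 0.95, 1.05, 1.10)
)
tech_avg <- aggregate(ratio ~ tech, data = fcc, FUN = mean)

# Dots for individual ISPs; dashed vertical line marks each
# technology's average, one panel per technology.
ggplot(fcc, aes(x = ratio, y = isp)) +
  geom_point() +
  geom_vline(data = tech_avg, aes(xintercept = ratio), linetype = "dashed") +
  facet_grid(tech ~ ., scales = "free_y") +
  labs(x = "Actual speed as a fraction of advertised", y = NULL)
```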
The key message of the chart, in my mind, is that DSL providers as a class over-promise and under-deliver.
For those who don't use an iPhone, what you are staring at is the new keyboard. Is the SHIFT key on or off?
Most of us who do use the iPhone can't tell you either. It's been confusing and exasperating.
The answer: when the SHIFT key is gray, it is off. When the SHIFT key is white, as shown in the following image, it is ON.
This design plays games with our heads. We see all the white letter keys and none of them are pressed, so we assume white keys are not pressed. This is especially annoying when we are entering names into a text box. Typically, the app developer saves us a keystroke by pre-pressing the SHIFT key. But when we see a white SHIFT key, our heads tell us it is not pressed, so our fingers press it to turn it gray, and then we learn that we have just turned off the SHIFT key.
Here's the issue. Even after months of using this keyboard, and capitalizing words daily, I still haven't gotten used to it. I keep getting confused and frustrated. The knowledge in my head just won't go away.
This is not a rant. This is a lesson for graphics designers.
Reader and tipster Chris P. found this "death spiral" chart dizzying (link).
It's one of those charts that has conceptual appeal but does not do the data justice. As the name implies, the designer has a strong message, that Arctic sea ice volume has dramatically declined over time. This message is there in the chart but the reader has to work hard to find it.
Why doesn't this spider chart work? We can be more precise.
A big problem is the lack of scalability. This chart looks different every year. If you add an extra year to the chart, you either have to increase the density of the years or you have to drop the earliest year.
Years are not circular or periodic so the metaphor doesn't quite work.
Axis labeling is also awkward. Because of the polar coordinates, the axes radiate from the center, so the numbers run upward toward the top of the chart but downward toward the bottom.
This specific instance of the spider chart benefits from well-behaved data: the between-year variability is much lower than the within-year variability. As a result, the lines don't cross each other much. If the variability from year to year had fluctuated more, we would have seen a bunch of noodles.
This is a pity because the designer did very well in aligning two corners of the Trifecta Checkup, namely what is the question and what does the data show? It is a great idea to control for month of year, and look at year to year changes. (A more typical view would be to look at month to month changes and plot one line per year.)
This is an example of a chart that does well on one side of the checkup but the failure is that the graph isn't in tune with the data or the question being addressed.
Whenever I see a spider chart, I want to unroll the spiral and see if a line chart is better. Thus:
The dramatic decrease in Arctic ice volume (no matter the month) is clear as day. You can actually read off the magnitude of the drop. (Try doing that in the spider chart, say between 1978 and 1995.)
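Here is a quick sketch of the unrolling in R/ggplot, using fabricated data with the same basic structure as the real series (a declining trend plus a strong seasonal cycle); `ice` is a made-up stand-in, not the actual measurements:

```r
library(ggplot2)

# Fabricated stand-in for the monthly Arctic ice volume series:
# one row per year-month, a declining trend plus a seasonal cycle.
ice <- expand.grid(year  = 1979:2012,
                   month = factor(month.name, levels = month.name))
ice$volume <- 30 - 0.5 * (ice$year - 1979) +              # downward trend
  8 * sin(2 * pi * as.integer(ice$month) / 12)            # seasonality

# The unrolled view: one line per month, years along the horizontal axis.
ggplot(ice, aes(x = year, y = volume, color = month)) +
  geom_line() +
  labs(x = NULL, y = "Ice volume (fabricated units)", color = "Month")
```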
This chart still has issues, namely too many colors. One can color the lines by season of the year, like this:
Or switch to a small-multiples setup with three lines per chart and one chart per season.
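Continuing the sketch above, a season column plus one `facet_wrap` call produces the small multiples (the grouping of months into seasons is my own, for illustration):

```r
# Map month number to season (Dec-Feb winter, and so on), then facet.
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring",
               "Summer", "Summer", "Summer", "Fall", "Fall", "Fall",
               "Winter")
ice$season <- factor(season_of[as.integer(ice$month)],
                     levels = c("Winter", "Spring", "Summer", "Fall"))

# Three lines per panel, one panel per season.
ggplot(ice, aes(x = year, y = volume, group = month, color = season)) +
  geom_line() +
  facet_wrap(~ season) +
  labs(x = NULL, y = "Ice volume (fabricated units)")
```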
The seasonal arrangement is not arbitrary. You can see the effect of season by looking at side-by-side boxplots:
The pattern is UP-DOWN-DOWN-UP.
In fact, a side-by-side boxplot of the data provides a very informative look:
In this view, the monthly series is obscured, absorbed into the vertical variability, which we can see is quite stable. The idea of controlling for month is to make it irrelevant. This view emphasizes the year-on-year decline of the entire distribution.
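In ggplot terms (still using the fabricated `ice` data from the sketches above), one box per year does the trick:

```r
# One box per year; the twelve monthly values feed each box.
ggplot(ice, aes(x = factor(year), y = volume)) +
  geom_boxplot() +
  labs(x = NULL, y = "Ice volume (fabricated units)")
```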
If you're worried that this drops too much information, the data can be grouped by season, as before, in a small-multiples setup like this:
Regardless of season, the trend is down.
PS. Alberto reminds me of his post about one example of a spider chart (radar chart) that works. Here's the link. It works because the graphical element is more in tune with the data. While the ice cap data has a linear trend over time, the voting data is all about differences in distribution. Also, the designer is expecting readers to care about the high-level pattern, not about the specifics.
Note: If you are here to read about Google Flu Trends, please see this roundup of the coverage. My blog is organized into two sections: the section you are on is about data visualization; the other section concerns Big Data and use of statistical thinking in daily life--click to go there. Or, you can follow me on Twitter which combines both feeds.
Because the visual medium is powerful, it is a favorite of advocates. Creating a chart for advocacy is tricky. One must strike the proper balance between education and messaging. The chart needs to present the policy position strongly and also enlighten the unconverted with useful information.
In my interview with MathBabe Cathy O'Neil (link), she points to this graphic by Pew that illustrates where death-penalty executions have been administered in the past two decades in the U.S. (link) Here is a screenshot of the geographic distribution for 2006:
The chart is a variant of the CDC map of obesity, which I discussed years ago. At one level, the structure of the data is the same. Each state is evaluated on a particular metric (proportion obese, or number of executions) once a year. Both designers choose to roll through a sequence of small-multiple maps.
The key distinction is that the obesity map encodes the data in color while the executions map encodes data in the density of semi-transparent, overlapping dots, each dot representing a single execution.
Perhaps the idea is to combat one of the weaknesses of color encoding: humans don't have an instinctive sense of the mapping between a numerical scale and a color scale. If the color transitions from yellow to orange, how many more executions would that map to? By contrast, if you see 200 dots instead of 160, we know the difference is 40.
The switch to the dots aesthetic introduces a host of problems.
Density, as you recall from geometry class, is the count divided by the area. High density can be due to a lot of executions or a very small area. Look at Delaware (DE) versus Georgia (GA). The density of red appears similar but there have been far fewer executions in Delaware.
This is a serious mistake. By using dot density, the designer encourages readers to think in terms of area of each state but why should the number of executions be related to area? As Cathy pointed out, a more relevant reference point is the population of each state. An even cleverer reference point might be the number of criminals/convictions in each state.
Another design issue relates to the note at the bottom of the chart (shown on the right). Here, the designer is fighting against the knowledge in the reader's head. It is natural for a dot on a map to represent location, and yet the spatial distribution of the dots here provides no information. Credit the designer for clarifying this in a footnote; but also let this be a warning that there are other visual representations that do not require such disclaimers.
I am confused by why dots appear but never disappear. It seems that the chart is plotting cumulative counts of executions from 1977, rather than the number of executions in each year, as the chart title suggests. (If you go to the Pew website, you find a version with "cumulative" in the title; when they produced the animated gif, they decided to simplify the title, which is a poor decision.)
It requires a quick visit to Wikipedia to learn that there was a break in executions in the 70s. This is a missed opportunity to educate readers about the context of this data. Similarly, a good chart presenting this data should distinguish between states that have banned the death penalty and states that have zero or low numbers of executions.
A great way to visualize this data is via a heatmap. Here, I whipped up a quick sketch (pardon the sideways text on the legend):
I forgot to add the footnote listing the states where the death penalty is banned. I could also add axis labels to the side histogram showing counts.
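If you want to try this at home, here is a bare-bones version of the heatmap in ggplot; the states, years, and counts below are simulated placeholders, not the Pew figures:

```r
library(ggplot2)

# Simulated placeholder data: one row per state-year with a count of
# executions. The real chart would use the actual published counts.
set.seed(1)
heat <- expand.grid(state = c("TX", "OK", "VA", "FL", "GA", "DE"),
                    year  = 1994:2013)
heat$count <- rpois(nrow(heat), lambda = 3)

# States down the side, years across, cells shaded by count.
ggplot(heat, aes(x = year, y = state, fill = count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "firebrick") +
  labs(x = NULL, y = NULL, fill = "Executions")
```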
I have been a fan of Alberto Cairo for a while, and am slowly working my way through his great book, The Functional Art, which I will review soon.
Thanks to the folks at JMP, the two of us will be appearing together in the Analytically Speaking webcast, on Friday, 1-2 pm EST. Sign up here. We are both opinionated people, so the discussion will be lively. Come and ask us questions.
Today's post examines an example of Big Data analyses, submitted by a reader, Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis. This post concerns the graphical element.)
The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).
Observational: the data are ad impressions coming from the Chitika Ad Network observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other form of tracking devices. The analyst did not plan this study and then collect the data.
Lacking Controls: There will be a time trend but what should we compare against? How do we know if something is out of the ordinary or not?
Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.
Adapted: All weblog data are adapted in the sense that web logs originally serve web developers who are interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and we need that data to understand how webpages render differently. I wrote about the adaptedness of this data in a separate blog post. (link)
The analysis did not require merging data, the fifth element of the framework.
Here is the chart type used to present the analysis. There are many problems.
The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits on top of the orange line during certain times of the day.
Daniel is very annoyed with the way the data is processed, and rightfully so. The chart actually does not say what it appears to say. This is because of the use of indexing.
This simple chart is not so simple to interpret!
This is because each line is "indexed to self". For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at 2/3 of their peak-hour usage. The trouble is the peak-hour usage by iOS users is more than 2.5 times as high as the peak-hour usage of Android users, so 100% blue is less than half of 100% orange by count.
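A toy example makes the distortion plain (the impression counts are invented to match the proportions described above):

```r
# Impressions in thousands, invented to match the proportions above.
android <- c(peak = 400, noon = 300)
ios     <- c(peak = 1050, noon = 700)

round(android / android["peak"], 2)  # peak 1.00, noon 0.75
round(ios / ios["peak"], 2)          # peak 1.00, noon 0.67
# The indexed Android line sits above the indexed iOS line at noon,
# yet by raw count iOS serves more than twice as many impressions.
```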
Later in the same post, the analyst re-indexed both series to the iOS peak. This chart tells us that iOS users are more active no matter what time of the day.
The Chitika analyst is not doing anything unusual. This type of indexing is pandemic in Web analytics. The worst thing about it is that a lot of Web data is long-tailed, and the maximum value is an outlier. Indexing data to an outlier isn't wise. (Often, the index is used to hide the actual values of the data, usually to keep company secrets. But there are better ways to accomplish this.)
Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.
Are users with both Apple and Microsoft devices counted on both blue and orange lines?
How is "volume" of Web usage determined? Is it strictly number of ad impressions?
Why is total volume displayed? If Microsoft PCs dominate Macs, and the chart shows the PC line well above the Mac line, is it speaking to market share or is it speaking to usage patterns of the average user?
How representative is the traffic in the Chitika network?
How did the analyst deal with bot traffic?
Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users, and compare their curves, without even looking at the data, you can surmise that you will see a similar shape but time-shifted by approximately three hours. Ignoring time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."
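To see the artifact, simulate two groups of users with identical local-time behavior and plot both on EST (a sketch with made-up curves):

```r
# Two user groups with the same local-time usage pattern, plotted on EST.
# New York local time equals EST; California local time is EST minus 3.
hour_est <- 0:23
ny <- dnorm(hour_est, mean = 14, sd = 4)             # peaks at 2 pm EST
ca <- dnorm((hour_est - 3) %% 24, mean = 14, sd = 4) # same shape, 3 hours later

plot(hour_est, ny, type = "l", xlab = "Hour (EST)", ylab = "Usage (made up)")
lines(hour_est, ca, lty = 2)
# The curves differ only by the time-zone shift, not by behavior.
```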
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
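To give a flavor of that syntax, here is the kind of layered, additive code I mean (using the `diamonds` dataset that ships with ggplot2):

```r
library(ggplot2)

# A basic histogram, then the same histogram faceted, simply by
# adding another layer to the plot specification.
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~ cut)
```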
This course is free, and the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
My Twitter followers have been sending in several howlers.
Twitter (link) made a bunch of bold claims about its own influence, using the number of tweets about the Oscars as fodder. They also adopt the euphemism common to the digital marketing universe, the so-called "view", which, credit to them, they define as "how many times tweets are displayed to users". Yes, you read that right: displaying is the same as viewing in this world, and Twitter is just a follower, not a trendsetter, here.
In the meantime, @wilte found this unfortunate donut chart, created by PWC in the Netherlands.
Both designers basically appropriated a graphical form and deprived it of data. In one, the designer threw the concept of scale to the wind. In the other, the designer dumped the law of total probability. In either case, the fundamental rationale for the particular graphical form is sacrificed.
Both are examples that fail our self-sufficiency test. This test says if a visual display cannot be understood unless the entire data set is printed on the chart, then why create a visual display? In both charts, if you block out the numbers, you are left with nothing!
The PWC chart was submitted by @graphomate, who also submitted the following KPMG chart:
The complaint was that the total adds up to 101%. I'm not really bothered by this, as it is a rounding issue. That said, I like to "hide" such rounding issues. I have never understood why it is necessary to display the imperfection. Flip a coin and remove the decimals from one of the categories!
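For those who want something more principled than a coin flip, largest-remainder rounding guarantees the displayed values sum to exactly 100. A quick sketch (the shares are invented):

```r
# Largest-remainder rounding: floor everything, then hand out the
# leftover percentage points to the largest fractional remainders.
largest_remainder <- function(x, total = 100) {
  floored <- floor(x)
  deficit <- total - sum(floored)
  bump <- order(x - floored, decreasing = TRUE)[seq_len(deficit)]
  floored[bump] <- floored[bump] + 1
  floored
}

shares <- c(33.5, 33.5, 33.0)   # invented; naive rounding gives 101
round(shares)                   # 34 34 33 -- sums to 101
largest_remainder(shares)       # 34 33 33 -- sums to 100
```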