
Dissecting charts from a Big Data study

Today's post examines an example of Big Data analyses, submitted by a reader, Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis. This post concerns the graphical element.)

The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).

Observational: the data are ad impressions from the Chitika Ad Network observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other tracking device. The analyst did not plan this study and then collect the data.

Lacking Controls: There will be a time trend but what should we compare against? How do we know if something is out of the ordinary or not?

Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.

Adapted: All weblog data are adapted in the sense that web logs originally serve web developers who are interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and we need that data to understand how webpages render differently. I wrote about the adaptedness of this data in a separate blog post. (link)

The analysis did not require merging data, the fifth element of the framework.


Here is the chart type used to present the analysis. There are many problems.


The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits on top of the orange line during certain times of the day.

Daniel is very annoyed with the way the data is processed, and rightfully so. The chart actually does not say what it appears to say. This is because of the use of indexing.

This simple chart is not so simple to interpret!

This is because each line is "indexed to self". For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at 2/3 of their peak-hour usage. The trouble is that the peak-hour usage of iOS users is more than 2.5 times as high as the peak-hour usage of Android users, so 100% on the blue line is less than half of 100% on the orange line by count.

Later in the same post, the analyst re-indexed both series to the iOS peak. This chart tells us that iOS users are more active no matter what time of the day.
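The difference between the two indexing schemes can be sketched in a few lines of Python. The hourly counts below are made up for illustration (the actual Chitika numbers are not published); the point is that indexing each series to its own peak forces both to top out at 100, erasing the level difference that indexing to a shared peak preserves.

```python
import math  # not needed here, but harmless; stdlib only

# Hypothetical hourly ad impressions (thousands) -- illustrative values only
android = [40, 55, 75, 100, 90]
ios = [150, 180, 200, 260, 240]

def index_to_self(series):
    """Index a series to its own peak: its maximum becomes 100."""
    peak = max(series)
    return [round(100 * x / peak) for x in series]

def index_to_common_peak(series, peak):
    """Index a series to a shared reference peak, preserving relative levels."""
    return [round(100 * x / peak) for x in series]

# Self-indexed: both series peak at 100, hiding the gap in absolute volume
print(index_to_self(android))
print(index_to_self(ios))

# Indexed to the shared peak: Android never comes close to 100
shared_peak = max(max(android), max(ios))
print(index_to_common_peak(android, shared_peak))
print(index_to_common_peak(ios, shared_peak))
```

Indexing to the shared peak is essentially what the analyst's second chart does, and it restores the comparison of absolute levels between the two platforms.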



The Chitika analyst is not doing anything unusual. This type of indexing is a pandemic in Web analytics. The worst thing about it is that a lot of Web data is long-tailed and the maximum value is an outlier; indexing data to an outlier isn't wise. (Typically, indexing is used to hide the actual values of the data, usually to protect company secrets. But there are better ways to accomplish this.)


Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.

Are users with both Apple and Microsoft devices counted on both blue and orange lines?

How is "volume" of Web usage determined? Is it strictly number of ad impressions?

Why is total volume displayed? If Microsoft PCs dominate Macs, and the chart shows the PC line well above the Mac line, is it speaking to market share or is it speaking to usage patterns of the average user?

How representative is the traffic in the Chitika network?

How did the analyst deal with bot traffic?


Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users, and compare their curves, without even looking at the data, you can surmise that you will see a similar shape but time-shifted by approximately three hours. Ignoring time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."



Update on Dataviz Workshop 3

My chart making workshop has passed the point where each participant (except one) has presented the first draft of his or her project, and the class has opined on these efforts. Previously, I posted the syllabus of the course here. Also catch up on previous updates (1, 2).

So far, I am very pleased with the results, and importantly, the students have given rave reviews. The in-class discussions have been very constructive, and civil. In every case, the chart designer went home with a few ideas for improvement. The types of issues that came up ranged widely. Here are some examples:

  • Figuring out what the message is in the data set
  • Thinking about what other data can be obtained to clarify the message
  • Discussing the level of detail appropriate for a legend
  • Dealing with data with a large number of small values
  • Examining how charts appear to a color-blind reader (we have a color-blind student in class)
  • Reducing the complexity of a chart

As the course draws to a close, several students have expressed an interest in keeping the class together via a meetup group or something similar. I'm thinking about how to accomplish this.

One lesson learned so far is that a few students got stuck trying to restructure their data, and were late submitting their work. I should stress that all submissions in the course are works in progress, and perhaps I should offer some data-processing help during the course.


The next workshop will be offered in the summer.


PS. Don't miss Andrew Gelman's summary of his graphics tips here.



Some chart types are not scalable

Peter Cock sent this Venn diagram to me via twitter. (Original from this paper.)


For someone who doesn't know genetics, it is very hard to make sense of this chart. It seems like there are five characteristics that each unit of analysis can have (listed on the left column) and each unit possesses one or more of these characteristics.

There is one glaring problem with this visual display. The area of each subset is not proportional to the count it represents. Look at the two numbers in the middle of the chart, each accounting for a large chunk of the area of the green tree. One side says 5,724 while the other says 13, even though both sides have the same area.

In this respect, Venn diagrams are like maps. The area of a country or state on a map is not related to the data being plotted (unless it's a cartogram).

If you know how to interpret the data, please leave a comment. I'm guessing some kind of heatmap will work well with this data. 

The closer you look, the more confused

A twitter follower submitted this chart showing the shift in ethnicity in Texas:


If you blinked, you probably took away the wrong message. Our "prior" tells us that the proportion of Hispanics has been rising quite rapidly in Texas. So, like me, you might home in on the blue columns, which increase drastically from 32% to 68%.

Things start to fall apart.

First, you might notice the blue label said "Non-Hispanic Whites," which is exactly the opposite of our hypothesis. For a moment, we are confused. Could it be that the Hispanic population in Texas has been shrinking?

Then, you might notice that the "information in our head" made us assume that the horizontal axis represents time. On a closer look, we discover that it's not time; what's being plotted from left to right are age groups. In fact, it's a kind of reversed time: the generations on the right side were born earlier, so those columns show the ethnic distribution today of people born over 60 years ago, while the columns on the left represent younger generations.

Finally, the gray columns are redundant and distracting.

On the other hand, the designer is admirably restrained with data labels, and included the baby and crooked-man-with-a-stick icons to provide some guidance, both of which are good ideas.


If I apply the Trifecta checkup to this chart, the biggest issue is misalignment between the interesting question of ethnic changes in Texans and the data used to explore this question. The current ethnic mix is not only impacted by the ethnic composition at birth but also by net migrations of different races and by their longevity. As pointed out above, the split by age groups forced us into a kind of reversed time thinking.

A simple fix involves expressing ages as birth years, and using a single line instead of columns:


This version doesn't address the tendency to interpret the left-right axis as time, and the excessive number of age groups.
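The relabelling from age brackets to birth years is simple arithmetic. A minimal sketch, assuming (hypothetically) the data were collected in 2010:

```python
# Hypothetical survey year -- the actual chart's data year is not stated
SURVEY_YEAR = 2010

def age_group_to_birth_years(lo_age, hi_age):
    """An age bracket [lo_age, hi_age] maps to birth years
    [SURVEY_YEAR - hi_age, SURVEY_YEAR - lo_age]."""
    return (SURVEY_YEAR - hi_age, SURVEY_YEAR - lo_age)

print(age_group_to_birth_years(0, 9))    # youngest bracket -> most recent births
print(age_group_to_birth_years(60, 69))  # oldest bracket -> earliest births
```

Sorting by birth year then makes the horizontal axis run left-to-right as actual time, which is the intuition readers bring to the chart anyway.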

An even better chart would put time on the horizontal axis, then have multiple lines each representing the proportion of non-Hispanic whites of a specific age group. It may be a political choice--I'm not sure why they chose to plot the declining proportion of non-Hispanic whites and lump Hispanics into "all others" as opposed to plotting the increasing mix of Hispanics.



Learn EDA (exploratory data analysis) from the experts

The Facebook data science team has put together a great course on EDA at Udacity.

EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data. 

Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.

The course uses R throughout, and in particular Hadley Wickham's ggplot2 package. I highly recommend the course for anyone who wants to become an expert in ggplot2, which has quite a bit of its own syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot; by working hard, I mean reading the supplementary materials and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.

This course is free, and the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, not to a bunch of tuition-paying students in some remote classroom.

Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.

Pi Day Special: #onelesspie initiative to clean up Wikipedia

Xan Gregg and I talked about how we should celebrate Pi Day. How about cleaning up Wikipedia by getting rid of those ugly, confusing, multicolor, 3D, exploded pie charts? (PS. Here is Xan's post.)

A quick search on Google reveals the extent of this PIe pollution. Click this link to check it out!



So, I confess I didn't know much about editing Wikipedia, but it turns out to be easy!

  • Find a chart.
  • Make your chart.
  • Create a Wikipedia account.
  • Use the upload wizard to get an image tag.
  • Go to the Wiki page, click Edit, and paste the image tag there.

And you're done.

Here's my first contribution. On the Ozone Depletion page, there used to be this pie chart:




There are way too many colors; the labels on the smaller pie pieces bleed onto their neighbors and the separation between natural and anthropogenic sources isn't as clear as it could be.

Here's what my revised version looks like:


In this case, I merely shrank the pie, using it as a legend. I also fixed the typo in the word anthropogenic.


You can do it too. Get started now!  #onelesspie

Setting the right priority

On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.


This classic Excel chart has some basic construction issues:

  • The data labels are excessive
  • The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
  • With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just include the legend text in the existing text annotation!

In terms of the Trifecta checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn about 75 percent of total income, and this distribution has not changed noticeably despite the Long Tail phenomenon.

But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.

It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.

So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.


Spatial perception: on the chart and in real life

A twitter follower @mdjoner felt that something was amiss with the squares in this chart comparing real estate prices in major cities around the world. I'm not sure where the chart originally came from, but there is a CNBC icon.

There is one thing I really like about the chart, which is the metric that has been selected. The original data is likely to be price per square metre for luxury property in various places. The designer turned this around and computed the size of what you can buy assuming you spend $1 million. I think we have a better ability to judge areas than dollars.

The notion of floor area meshes well with the area on a chart, so there is an intuitive appeal as well.
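The metric flip described above is a one-line computation: divide the fixed budget by the unit price. A quick sketch with hypothetical prices (the chart's actual figures are not reproduced here), which also derives the side length of the square each city would get on the graphic:

```python
import math

BUDGET = 1_000_000  # dollars to spend on luxury property

# Hypothetical price per square metre for luxury property -- illustrative only
price_per_sqm = {
    "Monaco": 60_000,
    "London": 33_000,
    "New York": 22_000,
    "Mumbai": 11_000,
}

for city, price in price_per_sqm.items():
    area = BUDGET / price          # floor area the budget buys, in sq metres
    side = math.sqrt(area)         # side of the equivalent square on the chart
    print(f"{city}: {area:.0f} sq m (a {side:.1f} m x {side:.1f} m square)")
```

Presenting the side lengths (rather than raw areas) is exactly what makes the square icons honest: the icon's area, not its side, should scale with the floor area.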


So in the Trifecta checkup, they did well posing an interesting question, and picking some data. But like Mike, I'm not excited about the graphical construct.

There are a few problems with this chart:

  • It requires using colors when the colors do nothing other than delineating one city from the next.
  • There's overcrowding at the bottom of the chart because the designer maintained a fixed spacing throughout the chart.
  • The city label is always positioned above the middle of the diamond. I find it very confusing in the bottom half of the chart, where the diamonds start overlapping.
  • The shadows plus the overlapping make it almost impossible to make out the actual areas of the pieces.


Here is an alternative display of the data:


Notice that I designed this for an American audience; I'd change certain decisions if using this for non-American readers. I chose New York as the focal point, and split the cities into two parts: on the left are the cities less expensive than New York, and on the right are those more expensive.

Also, along the bottom, I provide some clues to help people bridge the gap between the areas shown on the graphic, and real-life areas. For example, the orange square represents 400 square feet but without the annotation telling you it's about the size of a typical Manhattan studio, you may not know how to map the size of the orange square to your perception of real spaces. I also included images (although if I'm publishing this, I'd want better ones).

Finally, note that the data set did not show up on my version of the chart.

Two charts that fail self-sufficiency

My twitter followers have been sending in several howlers.

Twitter (link) made a bunch of bold claims about its own influence by using the number of tweets about the Oscars as fodder. They also adopt the euphemism common to the digital marketing universe, the so-called "view", which, credit to them, they define as "how many times tweets are displayed to users". Yes, you read that right: displaying is the same as viewing in this world, and Twitter is just a follower, not a trendsetter, here.

For @dtellom, it is this bubble chart about the Ellen tweet that displeased him:



In the meantime, @wilte found this unfortunate donut chart, created by PWC in the Netherlands.


Both designers basically appropriated a graphical form and deprived it of data. In one, the designer threw the concept of scale to the wind. In the other, the designer dumped the law of total probability. In either case, the fundamental rationale for the particular graphical form is sacrificed.

Both are examples that fail our self-sufficiency test. This test says if a visual display cannot be understood unless the entire data set is printed on the chart, then why create a visual display? In both charts, if you block out the numbers, you are left with nothing!


The PWC chart was submitted by @graphomate, who also submitted the following KPMG chart:


The complaint was that the total adds up to 101%. I'm not really bothered by this, as it is a rounding issue. That said, I like to "hide" such rounding issues; I have never understood why it is necessary to display the imperfection. Flip a coin and remove the decimals from one of the categories!
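One standard way to make rounded percentages sum to exactly 100 is the largest-remainder method: round everything down, then hand the leftover points to the categories with the largest fractional parts. A minimal sketch:

```python
def round_to_100(shares):
    """shares: raw percentages summing to roughly 100.
    Returns integers summing to exactly 100 (largest-remainder method)."""
    floors = [int(s) for s in shares]
    leftover = 100 - sum(floors)
    # Rank categories by fractional part, largest first
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - floors[i],
                   reverse=True)
    # Award the leftover points to the largest remainders
    for i in order[:leftover]:
        floors[i] += 1
    return floors

# Naive rounding of these shares gives 34 + 33 + 34 = 101;
# the largest-remainder method rounds one of the .5s down instead.
print(round_to_100([33.5, 33.0, 33.5]))
```

This is the "flip a coin" step done systematically: the categories closest to the next integer win the tiebreak.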

Nothing to see here

Some graphics are made to inform, some to amuse, some to delight. But the following scatter plot makes one wonder why why why...


What does the designer want to say?


I saw this chart inside an infographic titled "Where in the World are the Best Schools and the Happiest Kids?", via the Cool Infographics blog. The horizontal axis is happiness and the vertical axis is average test score.

So it appears that happy kids can get the best and the worst test scores, and kids with the best test scores can be both happy and sad.

That means the happiness of kids does not depend on their test scores.