Reader Aaron W. came across this "Facts and Figures" infographic about Boise State University that is seemingly aimed at alumni of the school. Given that Boise State has a good reputation for analytics, Aaron found it disconcerting to see such a low-quality data graphic. (Click on the image to see it in full size.)
There are numerous little things to grumble about in each section of the chart. The larger issue though is the overall composition. When assembling a chart like this, it is important to provide a navigation path for readers, whether explicitly or through cues.
It's difficult to discern the organizing principles of this chart. Aaron felt this way: "the total information flow is haphazard, if not entirely incoherent. There is some valuable information here, but at best it gets lost in the shuffle."
For example, some statistics are for undergraduate students only, some are for graduate students, and some are offered in aggregate.
Confusion reigns. We learn that the school has total enrollment of 22K students but it's a little math quiz to learn how many are undergraduates. In certain sections, data about faculty members are mixed with those about students.
Not breaking out undergraduates from graduates is a particular problem when presenting demographics, such as age distributions, ethnicity, etc.
It's odd to present this distribution of age without remarking that the undergrads are shown on the left and the graduate students are shown on the right.
Then, the sections presenting counts of students, faculty, degrees, etc. overlap with sections presenting financial data.
A rethinking of this page should start with identifying the key questions readers would be interested in learning, and then organizing the data to suit those needs.
Alberto Cairo just gave a wonderful talk to my workshop, in which he complains about the state of dataviz teaching. So, it's quite opportune that reader Maja Z. sent in a couple of examples from a recent course on data visualization for academics. She was surprised to see these held out as examples of good work. I'll discuss one chart today, and the other one some other day.
The instructor for the course praised this chart for this principle: "always try to find a graphic that relates to your subject, like the bullets here representing military spending, and use it in the chart."
Students who take my class learn the opposite lesson: I like to say that imagery often backfires. I do like charts with imagery that makes the data come alive, but more often than not, the designer falls in love with the imagery and lets the data down.
This chart presumably shows the top 10 military spenders in the world by total amount spent in 2013. You'd think that the Chinese spent a bit more than half what the Americans did. But the data labels say $640 billion vs $188 billion, only about 30%. Next, the Russian spend is 46% of the Chinese according to the data, etc. So, is this really a data visualization or just some pictures with numbers printed next to them?
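The mismatch between the labels and the picture can be checked with a quick calculation. The following is a minimal sketch: the two spending figures are the data labels quoted above, while the 0.55 height ratio is only an assumed stand-in for "a bit more than half", not a measurement taken from the chart.

```python
# Spending in billions of US dollars, as quoted from the chart's data labels.
us_spend = 640
china_spend = 188

# Ratio implied by the printed numbers.
ratio_from_labels = china_spend / us_spend

# ASSUMPTION: the Chinese column looks like "a bit more than half" of the
# American one; 0.55 is an illustrative guess, not a measured value.
ratio_from_heights = 0.55

# How much the picture overstates the Chinese spend relative to the data.
distortion = ratio_from_heights / ratio_from_labels

print(f"ratio from labels:  {ratio_from_labels:.1%}")
print(f"ratio from heights: {ratio_from_heights:.1%}")
print(f"visual exaggeration: {distortion:.2f}x")
```

Under that assumed height ratio, the drawing would overstate the Chinese figure by nearly a factor of two relative to the labels.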
It's possible that the data is encoded in the surface areas or the volumes of these warheads but in reality, this is a glorified column chart, so most readers will respond to the heights of the columns.
Perhaps the shadows are there to demonstrate shadow spending.
The designer seems to appreciate that total spending is not necessarily a great metric. Spending as a proportion of GDP is provided as a secondary metric. I'm not so sure what to make of this though: should we expect richer nations to need/want to spend more building bombs and such? It just doesn't seem very logical to me.
Instead, a more meaningful metric might be military spending per capita. Controlling for population seems somewhat logical; the more people you have to protect, the more money you have to spend.
In the end, I made this scatter plot that tries to have it both ways:
(The percentages are of GDP.)
Here, we can see that Saudi Arabia and the U.S. are particularly aggressive spenders, spending over $2000 per person per year. The respective two dots are way above the average line (for the top 10 spenders). At the richer end of the scale, the American spending is way above the international average. On the other hand, Japan and Germany both spend significantly less than would be predicted by their GDP per capita levels.
Of note, readers more easily relate to the per-capita numbers than the aggregate figures in the original chart. They learn, for instance, that Saudi Arabia's average GDP was $27,000 per head, of which $2,500 went to arming itself up.
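For readers who want to see how the per-capita figure and the share of GDP relate, here is a small sketch using the rounded Saudi Arabia numbers quoted above. The function name is hypothetical, and the inputs are the approximate values from the text rather than official statistics.

```python
def military_share_of_gdp(spend_per_capita, gdp_per_capita):
    """Return military spending as a fraction of GDP.

    Population cancels out, so the per-capita ratio equals the
    aggregate ratio of military spending to GDP.
    """
    return spend_per_capita / gdp_per_capita

# Rounded figures quoted in the text for Saudi Arabia (US dollars per head).
saudi_gdp_per_capita = 27_000
saudi_spend_per_capita = 2_500

share = military_share_of_gdp(saudi_spend_per_capita, saudi_gdp_per_capita)
print(f"Saudi Arabia: {share:.1%} of GDP goes to the military")
```

Because the population cancels, the same percentage can be read either as spending per person over income per person, or as total spending over total GDP.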
Learn how to make knock-out data visualization in an innovative, immersive and fun setting, with classmates who are similarly passionate about making the numbers speak visually.
The class is conducted in the style of creative-writing workshops. Each student will focus on one data visualization project during the term, and gain knowledge through drafting and revisions, offering and receiving critique, and above all, learning from others.
You will develop a discriminating eye for good visualizations. For students enrolled in the Certificate in Data Visualization, the course offers an ideal setting to demonstrate mastery of the integrated approach combining the perspectives of statistical graphics, graphical design, and information visualization.
Prerequisite: We welcome students from all backgrounds. A more diverse class makes a better experience for everyone. In order to be a full participant in the course, you should have prior experience making data graphics for an audience (broadly defined), and feel comfortable offering critique of others’ work.
Because of the workshop structure, enrollment is limited to 12 students. Enroll now to reserve your spot.
This chart published in Harvard Magazine has won my heart.
It is well executed in many ways. The chart illustrates a study of time spent by assistant and associate professors. It focuses specifically on time spent working versus time spent on household chores. One of the obvious questions of the study is whether female professors are disadvantaged when they have family obligations.
The general visual framework is the profile chart. Four segments of professors are arranged left to right from single with no children to married, with children and both parents working or single parent. The chart makes these points clear:
Having children adds about 15-30 hours to time spent on household duties, per partner
Household duties are not evenly split by gender, with the expected bias. (Of course, this observation must be carefully vetted. The men and women are not married to each other, even on the right side of the chart. But I presume the usual interpretation should hold.)
Male professors with kids do spend more time on household chores than those without but not as much as female professors with kids
In the meantime, the amount of time spent working is about the same for all four segments, raising a side question: what other activities got displaced? The juxtaposition of the lines allows us to see that the displaced hours are almost 50 percent of the total time spent working! What did they do less of?
I especially like the explicit depiction and labeling of the "gender gap" (the orange vertical lines). Also, the use of median hours instead of average hours.
My one little complaint is that the designer forgot to tell us the hours are on a weekly basis (I'm guessing here). Just adding "per week" after "median hours" would have fixed this.
One simple chart cannot address all possible questions on such a complicated subject. I like the restraint the designer exercised in not saddling the chart with too many questions.
I will just mention one tricky statistical issue. Getting tenure and making babies are both activities that occur within some time window in a professor's life, if at all. So there is a survivorship bias. The professors who receive tenure drop out of the picture. If you are older, and still in the pool, you probably are less "accomplished" from the perspective of the tenure-granting process. The longer you stay in that pool, the more likely you will have gotten married and/or have children--thus, there is an age bias going from left to right, as well as a survivorship bias. This implies that the characteristics of the professors in the four groups are likely to be different not just on their marital and child-rearing statuses but also on age and probability of tenure.
I'm excited to announce that there will be a summer session for my Dataviz Workshop at NYU (starting June 21). This is a chart-building workshop run like a creative writing workshop. You will work on a personal project throughout the term, receive feedback from classmates, and continually improve the product. I have previously written about the First Workshop here (with syllabus), here, here and here.
Here is the link to register for the course. (Note: the correct class time is 10a - 1p.)
The participants in the First Workshop were very happy with their experience. I can now report on the end-of-course survey. Ten people took the class, and seven responded to the survey. The satisfaction scores are as follows:
It's very gratifying to see that almost everyone thought the class time was well spent. During class, students gave each other feedback on projects. A key to making these sessions work is that students should be both givers and takers. It is really important that they become as comfortable giving critique as taking feedback. I asked the students to self-assess and this is what they said:
I'd also add that the few students who enrolled in the course with less background than the average ended up participating fully and actively in the discussion. As an instructor, I want to get out of the way while keeping the conversation on track. Based on the following rating, I think I did fine:
One piece of feedback I received during class--not reflected here--is that some students want to spend more time discussing the reading. I assign three books, which everyone loved, but I believe that it is hard for them to finish reading all three books in time for the second class. They would like to spread the discussion of the books over the course of the term. This arrangement would present a challenge. Due to the nature of a workshop, the first two sessions cannot involve project discussion, which is one of the reasons why I give introductory lectures and assign the books. In addition, students spend a lot of time during the term both working on their own projects and reviewing their classmates' projects, and I worry that assigning more reading distracts from the other activities.
Indeed, the course is not a gut course. Several students were surprised by how much work they put in. One or two learned that preparing the data took ten times as much time as they expected. (They selected particularly difficult datasets to work with.)
One specific suggestion is to add a session in the computer lab. This creates an opportunity for students to share their knowledge. Those who are good coders can help others who are not with pre-processing tasks. Those who are good with Illustrator can show others how to make the charts pretty. I am not ready for this change in the summer session but in the fall, I'll likely experiment with this.
Finally, the tools used by students are diverse: Excel (5), Illustrator (3), R (2), followed by Powerpoint, Pixelmator (draft stage), Tableau, Stata, Paint and SQL Server (1 each). Three of the students put their work on a Web page, which was the most popular format.
If you are serious about dataviz, please join me this summer for the Second Art of Data Visualization Workshop.
Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:
These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)
What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!
Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:
And this is the "caramel" question:
The set of maps referred to in the 2009 post can be found here.
Now, the maps on the left are more truthful to the data (at the zip code level) while Katz applies smoothing liberally to achieve the pleasing effect.
Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.
The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be the reason why in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.
Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.
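To make the two steps concrete, here is a toy sketch of "average the closest data, then keep only the top answer". It is not Katz's actual method or code: the plain k-nearest-neighbor average, the function names, the value of k, and the four made-up respondents are all assumptions for illustration.

```python
from math import dist

def knn_smooth(point, respondents, k=3):
    """Return the share of each answer among the k respondents nearest to `point`."""
    nearest = sorted(respondents, key=lambda r: dist(point, r["xy"]))[:k]
    answers = {r["answer"] for r in nearest}
    return {a: sum(r["answer"] == a for r in nearest) / len(nearest) for a in answers}

def top_choice(shares):
    """Keep only the most frequent answer, discarding the other shares."""
    return max(shares, key=shares.get)

# ASSUMPTION: made-up toy respondents; coordinates and answers are illustrative only.
respondents = [
    {"xy": (0.0, 0.0), "answer": "soda"},
    {"xy": (1.0, 0.0), "answer": "pop"},
    {"xy": (0.0, 1.0), "answer": "soda"},
    {"xy": (5.0, 5.0), "answer": "coke"},
]

shares = knn_smooth((0.2, 0.2), respondents, k=3)
print(shares)              # smoothed answer shares near the map location
print(top_choice(shares))  # the single answer that would be drawn on the map
```

In this toy case the map location would be colored for "soda" even though a third of the nearby respondents said "pop", which is the information lost by keeping only the top choice.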
My chart making workshop has passed the point where each participant (except one) has presented the first draft of his or her project, and the class has opined on these efforts. Previously, I posted the syllabus of the course here. Also catch up on previous updates (1, 2).
So far, I am very pleased with the results, and importantly, the students have given rave reviews. The in-class discussions have been very constructive, and civil. In every case, the chart designer went home with a few ideas for improvement. The types of issues that came up ranged widely. Here are some examples:
Figuring out what the message is in the data set
Thinking about what other data can be obtained to clarify the message
Discussing the level of detail appropriate for a legend
Dealing with data with a large number of small values
Because we have a color-blind student, we can examine how charts appear to the color-blind reader
How to reduce the complexity of a chart?
As the course draws to a close, several students have expressed an interest in keeping the class together via a meetup group or something similar. I'm thinking about how to accomplish this.
One lesson learned so far is that a few students got stuck trying to restructure the data, and were late submitting their work. I should stress that all submissions in the course are works in progress, and maybe I should offer some data processing help during the course.
The next workshop will be offered in the summer.
PS. Don't miss Andrew Gelman's summary of his graphics tips here.
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R and in particular, Hadley's ggplot package throughout. I highly recommend the course for anyone who wants to become an expert in ggplot. ggplot does use quite a bit of proprietary syntax. This EDA course offers a lot of instruction in coding. You do have to work hard, but you will learn a lot. By working hard, I mean reading supplementary materials, and doing the exercises throughout the course. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, plus the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, and not to a bunch of tuition-paying students in some remote classroom.
Sign up before they get started at Udacity. Disclaimer: No one paid me to write this post.
Some graphics are made to inform, some to amuse, some to delight. But the following scatter plot makes one wonder why why why...
What does the designer want to say?
I saw this chart inside an infographic titled "Where in the World are the Best Schools and the Happiest Kids?", via the Cool Infographics blog. The horizontal axis is happiness and the vertical axis is average test score.
So it appears that happy kids can get the best and the worst test scores, and kids with the best test scores can be both happy and sad.
That means the happiness of kids does not depend on their test scores.