On the sister blog, I wrote about a new report on the music industry lamenting that the hype over "Long Tail" retail has not really helped small artists (as a group). This was a tip sent by reader Patrick S. He was rightfully unhappy about the chart that was included in this summary of the report.
This classic Excel chart has some basic construction issues:
The data labels are excessive
The number of ticks on the vertical axis should be halved, given the choice to not show decimal places
With only two colors, it is a big ask for readers to shift their sight to the legend on top to understand what the blue and gray signify. Just fold the legend text into the existing text annotation!
In terms of the Trifecta checkup, the biggest problem is the misalignment between the intended message and the chart's message. If you read the report, you'd learn that one of their key findings is that the top 1% (superstar) artists continue to earn ~75 percent of total income and this distribution has not changed noticeably despite the Long Tail phenomenon.
But what is the chart's message? The first and most easily read trend is the fall in total income in the last 12-13 years. And it's a drastic drop of about $1 billion, almost 25 percent. Everything else is hard to compute on this stacked column chart. For example, the decline in the gray parts is even more drastic than the decline in the blue.
It also is challenging to estimate the proportions from these absolute amounts. Recognizing this, the designer added the proportions as text. But only for the most recent year.
So we have identified two interesting stories, one about the decline in total income and the other about the unending dominance of the 1 percent. This is where the designer has to set priorities. Given that the latter message is the headline of the report, it is better to plot the proportions directly, while hiding the story about total income. The published chart has the priority reversed. Even though you can find both messages on the same chart, it is still not a good idea to highlight your lesser message.
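To make "plot the proportions directly" concrete, here is a minimal Python sketch of the arithmetic a reader would otherwise have to do in their head. The income figures are hypothetical, invented purely for illustration (they are not the numbers from the report), and the names `income` and `top_share` are my own.

```python
# Hypothetical income figures in $ millions, invented for illustration only.
# They are NOT the numbers from the report.
income = {
    2000: {"top_1_percent": 3000, "everyone_else": 1000},
    2006: {"top_1_percent": 2700, "everyone_else": 900},
    2013: {"top_1_percent": 2250, "everyone_else": 750},
}


def top_share(year_income):
    """Return the top 1%'s share of that year's total income, as a fraction."""
    total = sum(year_income.values())
    return year_income["top_1_percent"] / total


# The series worth plotting if the headline is about the distribution.
shares = {year: top_share(amounts) for year, amounts in income.items()}

for year in sorted(shares):
    total = sum(income[year].values())
    print(f"{year}: total = ${total}M, top 1% share = {shares[year]:.0%}")
```

In this made-up example the total drops by a quarter while the share stays flat, which is exactly the kind of pattern a stacked column of absolute amounts buries.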
Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.
I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.
The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an artform. An artform implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience.
The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.
We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.
The composition of the class brings me great excitement. There are 12 enrolled students, which is probably the maximum for a class of this type. One student subsequently dropped out, after learning that the workshop is really not for true beginners.
The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.
REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:
Opening-day ratings from sites like Rotten Tomatoes
New York City water quality measures by county (or other geographical unit), probably from an environmental agency
Data about donors/donations to public media companies
Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.
The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student is allowed to name more than one tool). This presents a simple piece of information simply: Excel is the most popular although the long tail indicates the variety of tools people use in practice.
What the first option doesn't bring out is the correlation between tools, indicated by several tools used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer as it also provides information on how many tools the average student uses, and the relationship between different tools.
The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option.
This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).
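For readers who want to try both views on their own survey, here is a rough Python sketch of the two data structures involved. The responses are hypothetical, invented for illustration (they are not my students' answers), and the variable names are my own.

```python
from collections import Counter

# Hypothetical poll responses, invented for illustration (not the class data).
responses = {
    "student_1": {"Excel", "R"},
    "student_2": {"Excel"},
    "student_3": {"Excel", "Tableau", "R"},
    "student_4": {"D3", "Illustrator"},
}

# View 1: how many students name each tool (what the column chart plots).
tool_counts = Counter(tool for tools in responses.values() for tool in tools)

# Order tools from most to least named, breaking ties alphabetically.
tools_by_popularity = [
    tool for tool, _ in sorted(tool_counts.items(), key=lambda kv: (-kv[1], kv[0]))
]

# View 2: one row per tool, one column per student (the student-by-tool grid).
students = sorted(responses)
grid = {
    tool: [tool in responses[student] for student in students]
    for tool in tools_by_popularity
}

# Extra information only the second view carries: tools per student.
average_tools = sum(len(tools) for tools in responses.values()) / len(responses)

for tool in tools_by_popularity:
    row = "".join("x" if used else "." for used in grid[tool])
    print(f"{tool:<12}{row}  ({tool_counts[tool]})")
print(f"average tools per student: {average_tools}")
```

The first view is just the row totals of the second, which is why the grid is richer but also why it stops working once there are too many columns.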
Which version do you like? Are there even better ways to present this information?
Long-time reader Daniel L. sends in this chart illustrating a large data set of inter-state migration flows in the U.S. The original chart is at Vizynary by way of Daily Kos.
There is no denying that this chart is beautiful to look at. But what is its message? That there are people migrating from and to every state? (assuming all fifty states are present)
Daily Kos describes how one can hover over any state to see its individual patterns. Something like this:
This is a great way, perhaps the only way, to consume the chart. Essentially, the reader is asked to generate a small-multiples panel of charts. The chart does a better job at showing the pairs of states between which people migrate than at showing the relative size of the flows. The size of the flows is coded in the width of the arcs. The widths are too similar to tell apart; and it doesn't help that no legend is provided.
The choice of color is curious. Each region of the country is its own color, in a "nominal" way. It is a design decision to emphasize regions.
Another decision is to hide information on the distances of the migrations. Evidently, the designer sacrificed that information in order to create the neat circular arrangement of states.
A shortcoming of this representation is one missing dimension: the direction of the flow. I'm not sure given any pair of states A and B, whether the net migration is into A or into B.
I propose a solution using the map while preserving the interactive element of the original.
On this map, when you hover over a particular state, it highlights all other states for which there are migration flows into or out of that state. For color, use a blue-white-red scheme with blue indicating net inflow, red indicating net outflow, and white for near-zero flows. Include a legend.
Another important decision for the designer is absolute versus relative scales. In an absolute scheme, you rank the entire set of flows for all pairs of states; obviously, the resulting colors would be influenced by the state populations. Alternatively, you rank the flow sizes within each state; in this case, the smaller states will feel exaggerated.
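Here is a minimal Python sketch of the net-flow coloring and of the absolute-versus-relative choice. Everything in it is an assumption for illustration: the three "states" and their flow counts are invented, the function names are mine, and I scale each net flow by the largest net flow (overall, or within the hovered state) as a simple stand-in for ranking.

```python
# Hypothetical data: flows[origin][destination] = number of movers.
# States and counts are invented for illustration.
flows = {
    "A": {"B": 500, "C": 100},
    "B": {"A": 300, "C": 50},
    "C": {"A": 20, "B": 80},
}


def net_flow(flows, state, other):
    """Net migration into `state` from `other`; positive means net inflow."""
    return flows[other].get(state, 0) - flows[state].get(other, 0)


def color(net, tolerance=0):
    """Blue for net inflow, red for net outflow, white for near-zero flows."""
    if net > tolerance:
        return "blue"
    if net < -tolerance:
        return "red"
    return "white"


def scaled_net_flows(flows, state, relative):
    """Scale the hovered state's net flows to the range [-1, 1].

    relative=False divides by the largest net flow among all pairs of states;
    relative=True divides by the largest net flow involving `state` only.
    """
    nets = {other: net_flow(flows, state, other) for other in flows if other != state}
    if relative:
        scale = max(abs(net) for net in nets.values())
    else:
        scale = max(abs(net_flow(flows, a, b)) for a in flows for b in flows if a != b)
    return {other: (net / scale if scale else 0.0) for other, net in nets.items()}


for relative in (False, True):
    label = "relative" if relative else "absolute"
    print(label, scaled_net_flows(flows, "C", relative))
```

In this toy example, state C's net inflow of 80 from A gets an intensity of 0.4 on the absolute scale but 1.0 on the relative scale, which is the exaggeration of smaller states I mentioned.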
The map has the additional advantage of showing the approximate distance (and direction) moved, which, for me, is a useful piece of information.
If your chart is titled "The Most Popular TV Show Set in Every State," what would you expect the data to look like?
You'd think the list would be dominated by the hit shows like The Walking Dead and Downton Abbey, and you might guess that there are probably only four or five unique shows on the list.
But then it's easy to miss the word "set" in the title. They are looking for the most popular show given that it is set in a particular state. Now this is a completely different question -- and conversely, it guarantees that there will be 50 different shows for the 50 states, assuming that one show can't be set in multiple states. This is also, computationally, a much more complex question. Some locations, like New York, Mass. (Boston), and Illinois (Chicago), are many times more likely to be the settings of TV shows than other states. This means one might need to go back many years to find the "popular" shows in the less attention-grabbing states.
I put the word "popular" in quotation marks because if one has to dig deep into history for a specific state, then it is possible that the selected show would not be popular in the aggregate! This is not unlike the issue of whether having your kids pick up a popular sport (like basketball) or instrument (like violin) is better or worse than an unpopular one (like squash or trombone). The latter route is potentially the shorter path to standing out, but their achievement will be known only to a niche audience.
This brings me to how one should look at a map like this one in Business Insider (link):
The first thing that strikes you are the colors. The colors that signify nothing. Since each state has its own TV show, by definition each piece of information is unique. As far as I can tell, the choice of which states share the same color is totally up to the designer.
As I have remarked in the past, too often the designer uses the map as a lesson in geography. The only information presented to readers through the map type is where each state is located in the union. Without the state names, even this lesson is incomplete. We learn nothing about the relative popularity of these shows, the longevity, the years in which they went on air, etc.
Geographical data should not automatically be placed on a map.
Is there any "data" in this map? It depends on how you see it. Here's how the author described what went into pairing each state with a TV show:
To qualify, we looked at television series as opposed to reality shows.* Selections were based on each show’s longevity, audience and critical acclaim using info from IMDB/Metacritic, awards, and lasting impact on American culture and television... *When there wasn't a famous enough series to choose from, we selected a more popular reality show. That happens once on this list (IA).
A reader sends me to Adam Obeng, who did the dirty work deconstructing a set of charts by the U.S. National Highway Traffic Safety Administration on his blog. Here's an example of these charts:
Aside from the sneaker chart, they concocted a pop stick, a pencil, a tower of Hanoi, etc. These objects are ones I think should be evaluated as art. Adam gamely tells us that the proportions are totally off, and they are both internally and externally inconsistent.
I'll add two small points to Adam's post.
First, these charts pass my self-sufficiency test, that is to say, they did not print the entire data set (just one number here) on the page. Alas, given the distortion identified by Adam, not printing the data means everyone is free to create their own data. Herein lies the problem: there is an argument for allowing a small degree of distortion in exchange for "beauty" but these charts without any data have gone too far.
Second, see Adam's last point (the footnote). The original data is something quite convoluted: “3 out of 4 kids are not as secure in the car as they should be because their car seats are not being used correctly.” (How would they know this, I wonder.) This is a statistic about kids while the picture shows a statistic about their parents (or drivers).
A reader, Stephen M., who's a high school math Information Technology teacher in Australia, assigned the following chart to his class as a Junk Charts style assignment. (link to original here)
We have seen racetrack charts before (e.g. here or here), and we have dual racetracks here.
Stephen's class identified the following problems with the chart:
- The group agreed this should be better called a data visualisation than an infographic
- The purpose of the 'infographic' seems to be more on the design/form, than the function of conveying an understanding of the data
- There seems to be a bit of an optical illusion with the lower upper circle for the US appearing larger than the upper lower one (we checked, there isn't)
- There are no clear labels to assist. It is an assumption that because in the heading and the figures, population is on top of donations, that the lines are the same. The class agreed that country labels would help to the left of each line start.
- No scale on the lines and where do you measure from/to (especially as the US line is a single line for a proportion of the way)
- It's too abstract and the spatial separation of the curves makes comparison difficult.
Wow, that's great critique from the 16-year-olds. They are working on ways to re-make this graphic. One good idea is to collapse the two dimensions into one: per-capita donations.
Another issue with this chart is that the countries are sorted in different ways from one chart to the next. It's really difficult to compare one country to another.
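The per-capita idea, together with a single consistent sort order, takes only a few lines of Python. This is a sketch under loud assumptions: the country names, populations and donation totals below are placeholders I invented, not the figures from the graphic.

```python
# Hypothetical (population, total donations in USD) pairs.
# Names and numbers are placeholders invented for illustration.
countries = {
    "Country A": (300_000_000, 30_000_000_000),
    "Country B": (60_000_000, 3_000_000_000),
    "Country C": (20_000_000, 4_000_000_000),
}

# Collapse the two dimensions into one: donations per person.
per_capita = {
    name: donations / population
    for name, (population, donations) in countries.items()
}

# One ordering, used everywhere, so countries can be compared directly.
ranked = sorted(per_capita.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name}: ${value:,.0f} per person")
```

Note how the smallest country in this invented example comes out on top once population is taken into account, a reversal that two separate racetracks would make hard to see.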
It is also instructive to discuss what the key message is in this data. Why those six countries? What kinds of donations are being counted? Does the counting methodology differ by country? How comparable is the data?
Finally, is this art or is this science?
P.S. [12/2/2012] Stephen noted that another deficiency identified by the students is the lack of sourcing. Indeed, where did the data come from? They think it's the CIA Factbook.
Reader Steve S. sent in this article that displays nominations for the "Information is Beautiful" award (link). I see "beauty" in many of these charts but no "information". Several of these charts have appeared on our blog before.
Let's use the Trifecta checkup on these charts. (More about the Trifecta checkup here.)
The topic of this chart is both tangible and interesting. As someone who loves books, I do want to know what genres of books typically win awards.
However, both the data collection and graphical design make no sense.
The data collection problem presents a huge challenge and it's easy to get wrong. The problem is how narrow should a theme be. If it's too narrow, you can imagine every book has its own set of themes. If it's too wide, each theme maps to lots of books. The challenge is how to select the themes such that they have similar "widths". For example, "death" is a very wide theme and lots of books contain it, as indicated by the black lines. "Nanny trust issues" is a very narrow theme, and only one of those books deals with this theme. When there is such a theme, is its lack of popularity due to its narrow definition or due to writers not being interested in it?
The caption of this chart said "Cover stars: Charting 50 years up until 2010, this graphic shows The Beatles to be the most covered act in living memory." If that is the message, a much simpler chart would work a lot better.
Since the height of the chart indicates the number of covers sold in that year, the real information being shown is the boom and bust cycles of the worldwide economy. So, a lot more records were sold in 2005, and then the market tanked in 2008, for example.
That's why the data analyst should think twice before plotting raw data. Most data like these should be adjusted. In this case, you could either compare artists against one another in each year (by using proportions) or you have to do a seasonal and trend adjustment. I also don't see the point of highlighting year-to-year fluctuations. Nor do I understand why only in certain years is the top-rated cover identified by name and laurel wreath.
I talked about this stream graph of 311 calls back in 2010. See the post here.
I featured this set of infographics/pie charts back in 2011. See the post here.
This chart is a variant of the one from New York Times that I discussed here. I like the proper orientation on the NYT's version. The color scheme here may be slightly more attractive.
Reader Jim S. was rightfully mystified by the following map that appeared on the Ars Technica blog (link), and purported to demonstrate that high temperatures of March 2012 across most of the U.S. were of historical significance.
I must say the production values of this map, produced by the people at NOAA, are superb. I love, love, love the caption that the Ars Technica editors added to the map. I wish they had blown it up to 20-point font, and made it shiny :) Besides that, the colors are well-chosen, and it doesn't feel cluttered despite having 48 numbers printed on it.
Like Jim, I'm hypnotized by the drumbeat of 118, 118, 118, ... all over the red area. What could the numbers mean? They could be temperatures in Fahrenheit (although 118 degrees in March surely would have been newsworthy). The legend does lend support to this interpretation (see right), what with the extra-large font announcing "Temperature". Jim commented: "But it seems odd that such a large area would have precisely the same high."
Not so fast, Jim. The NOAA also made the chart shown on the right (link). So indeed, the entire country could be given one value of 118.
If not Fahrenheit, what could the numbers mean? They could be some kind of index in which case the average value would seem to be 50 (the white patch). That would be one strange index.
Too bad this map is produced by specialists for specialists, leaving us commoners guessing. The only clue we got is in the title, "Statewide Ranks".
But this isn't very helpful either. The 118s are still ringing in my ear. If the numbers are ranks, then 118 would likely be the maximum rank, given that there are so many 118s. But I can't figure out which metric has 118 levels.
I finally found my way to this page, which explains what NOAA calls "climatological ranking". The page also has a chart (below), which can serve as a sort of legend for the maps, but is almost as difficult to read.
Apparently there are 118 years' worth of recorded temperatures, going back to 1895. And within each state, the annual temperatures for the past 118 years were ranked from lowest to highest, meaning that 118 is the hottest on record.
Given that there is lop-sided attention to hotter temperatures (global warming), it would be much better to reverse the ranking so that 1 is the hottest on record!
The chart also explains that the years are grouped into three equal buckets to indicate "below normal", "near normal" and "above normal".
Too bad this chart gives us three or five levels of ranking while in the map they use seven colors (levels).
They really ought to include on the map (a) the definition of the ranking and (b) the range of ranks corresponding to each color.
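For those who, like me, had to puzzle out what the numbers mean, here is a small Python sketch of the ranking as I understand it from the NOAA page: rank the years within a state from coldest to hottest, optionally flip the ranks, and split them into three buckets. The temperatures are invented, the function names are mine, and the tie-breaking and the exact bucket boundaries (118 does not divide evenly by three) are my assumptions, not NOAA's documented method.

```python
# Hypothetical temperatures (degrees F) for one state; invented for illustration.
temps_by_year = {
    2007: 41.2,
    2008: 39.5,
    2009: 40.1,
    2010: 42.0,
    2011: 40.8,
    2012: 45.3,
}


def climatological_ranks(temps_by_year):
    """Rank years from coldest (1) to hottest (n).

    Ties are broken by year, which is an assumption of this sketch.
    """
    ordered = sorted(temps_by_year, key=lambda year: (temps_by_year[year], year))
    return {year: rank for rank, year in enumerate(ordered, start=1)}


def reversed_rank(rank, n):
    """Flip the ranking so that 1 is the hottest and n the coldest."""
    return n + 1 - rank


def tercile(rank, n):
    """Assign a coldest-to-hottest rank to one of three roughly equal buckets."""
    third = n / 3
    if rank <= third:
        return "below normal"
    if rank <= 2 * third:
        return "near normal"
    return "above normal"


ranks = climatological_ranks(temps_by_year)
n = len(ranks)
for year in sorted(ranks):
    rank = ranks[year]
    print(year, rank, reversed_rank(rank, n), tercile(rank, n))
```

With 118 years on record, a state's hottest year would get rank 118 (or rank 1 after flipping) and land in the "above normal" bucket.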
While researching this post, I found this wonderful page of NOAA maps (link). This is a beautiful illustration of the process of statistical aggregation. Notice the trade-off between simplicity and loss of information. The art in statistics is to figure out the right balance between the two.
I always like to explore doing away with the unofficial rule that says spatial data must be plotted on maps. Conceptually I'd like to see the following heatmap, where a concentration of red cells at the top of the chart would indicate extraordinarily hot temperatures across the states.
I couldn't make this chart because the NOAA website has this insane interface where I can only grab the rank for one state for one year one at a time. But you get the gist of the concept.
Did I tell you I love, love, love the caption? Go right ahead, and make a slogan for your chart today!
[PS: Reader Mark Bulling (see his comment below) contributes a realization of my heatmap suggestion above. One of the benefits of this chart is its economy, as a small version of it shows: