
Reference page for Trifecta Checkup

It's here! Many readers have requested a reference to the Junk Charts Trifecta Checkup. I finally found time to write this up. Here is the introduction:

The Junk Charts Trifecta Checkup is a general framework for data visualization criticism. It captures how I like to organize the thinking behind my critique pieces.

The need for such a framework is clear. Opinion pieces on specific data graphics frequently come across as stream of consciousness. Proclaiming a chart "mind-blowing" or "worst of the century" isn't worth much if the author cannot articulate why. The state of dataviz criticism has not progressed much beyond assembling a set of "rules of thumb".

In putting this framework together, I aimed to make it simple to use and broadly applicable.

The Trifecta Checkup framework allows me to classify all dataviz critiques into eight types. A visualization of the eight types is as follows.

[Image: Junkcharts_trifecta_taxonomy]

Please click here to read the entire post. You can also link to that permanent page for reference.

As usual, clicking on the "Trifecta Checkup" tag brings up all prior posts that apply the concept.

On the cool maps about baseball fandom

Josh Katz, who did the dialect maps I featured recently, is at it again. He's one of the co-authors of a series of maps (link) published by the New York Times about the fan territories of major league baseball teams.

[Image: Nyt_baseballfandom]

Similar to the dialect maps, these are very pleasing to look at, and also statistically interesting. The authors correctly point out that the primary points of interest are at the boundaries, and they provide fourteen insets on particular regions. This small gesture represents a major shift from years past, when designers would have just published an interactive map and let readers figure out where the interesting stuff is.

The other interesting areas are the "no-man's-lands," the areas in which there are no local teams. The map uses the same kind of spatial-averaging technology that blends the colors. The challenge here is the larger number of colors.

I'd have preferred that they give distinct colors to teams like the Yankees and the Red Sox that have broader appeal. Maybe the Yankees are the only national team they discovered, since they do have a unique gray color, which is very subtle.

I also think it is smart to hide the political boundaries of states, zip codes, etc. in the maps (unless you click on them).

I'd like to see a separate series of maps: small multiples by team, showing the geographical extent of each team. This is a solution to the domination issue to be addressed below.

***

[Image: Nyt_yankeesterritory]

The issue of co-dominant groups I discussed in the dialect maps also shows up here. Notably, in New York, the Mets are invisible, and in the Bay Area, the Oakland A's similarly do not appear on the map.

Recall that each zip code is represented by the team with the highest absolute proportion of fans. It may be true that the Mets are the perennial #2 in all relevant zip codes. Zooming into Yankee territory, I didn't see any zip code in which Mets fans are more numerous. So this may be the perfect example of what falls through the cracks when the algorithm drops everything but the top level.
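
To see how winner-take-all coloring can bury a close runner-up, here is a minimal sketch with invented zip codes and fan shares (not the NYT's actual data):

```python
# Hypothetical illustration: keeping only the top team per zip code
# hides a strong perennial runner-up. All numbers are invented.
fan_shares = {
    "10001": {"Yankees": 0.48, "Mets": 0.44, "Red Sox": 0.08},
    "11368": {"Yankees": 0.46, "Mets": 0.45, "Red Sox": 0.09},
}

for zip_code, shares in fan_shares.items():
    top_team = max(shares, key=shares.get)  # the only thing the map retains
    print(zip_code, "->", top_team)

# Both zips render as Yankees territory even though the Mets trail
# by only a point or two in each.
```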

***

Now, in the Trifecta Checkup, we want to understand what the data is saying. I have to say this is a bit challenging. The core dataset contains Facebook Likes, aggregated to the zip-code level. It is not even clear what the base of those proportions is. Is it the total population in a zip code? The total number of Facebook users? The total number of potential baseball fans?

As I have said elsewhere, Facebook data is often taken to be "N=All". This is an assumption, not a fact of the data. Different baseball teams may have different social-media/Facebook strategies. Different teams may have different types of fans, who are more/less likely to be on Facebook. This is particularly true of cross-town rivals.

Apart from the obvious problem of brands buying or otherwise managing Likes, a "Like" is a binary metric that doesn't measure fan fervor. It is also a static measure, as I don't believe Facebook users actively manage their lists of Likes (please correct me if I am wrong about this behavior).

We are not provided any real numbers, and none of the maps have scales. Unless we see some absolute counts, it is hard to know if the data make sense relative to other measures of fandom, like merchandise and ticket sales. With Facebook data, it is sometimes possible to have too much--in other words, you might find there are more team fans than potential baseball fans, or even more fans than residents, in a specific zip code.
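
If raw counts were released, a quick sanity check along these lines would expose that kind of impossibility; the zip codes and figures here are made up:

```python
# Hypothetical sanity check: flag zip codes where the implied number of
# team fans exceeds the resident population. All figures are invented.
zip_stats = [
    # (zip code, population, implied fan count)
    ("10001", 21_000, 26_500),
    ("11368", 109_000, 54_000),
]

for zip_code, population, fans in zip_stats:
    if fans > population:
        print(f"{zip_code}: {fans} fans > {population} residents -- suspect")
```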

It is very likely that Facebook, which is the source of the aggregated data, did not want raw counts published. This is par for the course for the Internet giants, and also something I find completely baffling. Here are the evangelizers of "privacy is dead": they stockpile our data, and yet they lock it up in their data centers, away from our reach. Does that make any sense?


The index of an index is confusion

[Image: Nytimes_cpi_components_price_fall]

Through Twitter, Antonio Rinaldi sent the following chart, which accompanied a New York Times piece about the CPI (inflation index). The article concerns a very important topic--that many middle- to lower-income households have barely any savings after spending on necessities--and only touches upon the issue raised by this chart, which is that the official CPI is an average of prices of a basket of goods, and there is much variability in the price changes of different categories of goods.

I cover this subject in much greater detail in Chapter 7 of Numbersense (link). There are many reasons why the official inflation rate seems to diverge from our own experience. One reason is that we tend to notice and worry about price increases but take price decreases for granted or fail to notice them at all. In the book, I cover the fascinating subject of the psychology of remembering prices. Obviously, this subject is of utmost importance if we are to use surveys to understand perceived prices.

The price of an unbranded T-shirt has stayed the same or even declined over the last few decades. The chart also reveals that phones and accessories, computers, and televisions have all enjoyed deflation over the last decade. Actually, much of this "deflation" is due to a controversial adjustment known as "hedonics," which attributes part of any price change to product or technology improvements. So, if you pay the same price today for an HDTV as you did in the past for a standard-definition TV, then in real terms, the price you pay today is considered lower than the price in the past.
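
Here is a stylized version of that adjustment. The 20% quality factor is invented, and real statistical agencies estimate such factors with regression models rather than a single number:

```python
# Stylized hedonic adjustment (numbers invented for illustration):
# if today's $500 TV is judged 20% "better" than the old $500 TV,
# the quality-adjusted price is deemed to have fallen.
old_price = 500.0
new_price = 500.0           # same sticker price as before
quality_improvement = 0.20  # assumed improvement factor

adjusted_new_price = new_price / (1 + quality_improvement)
change = adjusted_new_price / old_price - 1
print(f"quality-adjusted price change: {change:.1%}")  # -16.7%
```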

That adjustment is reasonable only to a certain extent. For instance, my cell phone company stuffs my plan with hundreds of unused and unusable minutes, so on a per-minute basis, I am sure prices have come down substantially, but on a per-used-minute basis, I'm not so sure.

***

Let's get to what we care about on this blog... the visual. There is one big puzzle embedded in this chart. Look at the line for televisions: it dipped below -100 percent! Like Antonio, many readers should be scratching their heads. Did the price of televisions go negative? Did the hedonic adjustment go bonkers?

As an aside, I don't like the current NYT convention of hiding too many axis labels. What period of time does this chart depict? You'd only find out by reading the label on the vertical axis! I mentioned something similar the other day.

The key to understanding a chart like this is to learn what is being plotted. The first instinct is to assume it's the change in prices over time. A quick glance at the vertical-axis label corrects that misunderstanding. It reads: "Change in prices relative to a 23% increase in price for all items, 2005-2014."

This label is doing a lot of work--probably too much for its inconspicuous location and unbolded, uncolored status.

Readers have to know that the official CPI is a weighted average of changes in prices of a specified basket of goods. Some but not all of the components are being graphed.

Then readers have to understand that there is an index of an index. The prices of each "item" (i.e., category or component of the CPI) are indexed to 1984 levels. So the television index is first re-indexed to 2005 as the baseline. This establishes a growth trajectory for televisions. But this is not what is being depicted.
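
A sketch of that first re-indexing step (the index levels below are invented, not actual BLS numbers):

```python
# The component series are indexed to 1984 levels; the chart re-bases
# them so that 2005 = 1.0. The levels here are invented for illustration.
tv_1984_base = {"2005": 50.0, "2014": 5.0}   # hypothetical index levels

tv_2005_base = {year: level / tv_1984_base["2005"]
                for year, level in tv_1984_base.items()}
print(tv_2005_base)  # {'2005': 1.0, '2014': 0.1} -- i.e., down 90% since 2005
```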

[Image: Redo_nytinflation1]

Here is what the chart would have looked like if we plotted the growth of the television index (red), the apparel index (gray), and the all-items index (blue).

The blue line reflects the 23% average increase in prices over that 10-year period. Notice that the red line does not exhibit any weirdness: television prices have gone down by 90 percent; the index is not negative.

What the designer tried to do is index this data a second time. Think of pulling the blue line down to the horizontal axis, and then see what happens to the gray and red lines.

***
Now, even this index on an index should not present a mathematical curiosity. If all items moved to 1.23 while apparel moved to 1.10, you might compute 110%/123%, which is roughly 0.9. You'd say the apparel index went 90% of the way to where the all-items index went. Similarly for TVs, you would compute 10%/123%, which is 0.08. That would be saying the TV index ended up at 8% of where the all-items index landed.

That still doesn't yield -100%. The clue here is that the baseline is zero percent, not 100%, not 1.0, etc. If an item moved in sync with all items, its trajectory would be horizontal at zero percent. That means the second index is not a division but a subtraction. So for TVs, it's -90% - 23% = -113%. For apparel, it's +10% - 23% = -13%.
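
A small sketch contrasting the two candidate formulas for the second index, using the same levels (2005 = 1.0):

```python
# Second-stage index: compare each component's move to the all-items move.
all_items = 1.23   # +23% over 2005-2014
apparel = 1.10     # +10%
tv = 0.10          # -90%

def by_division(item, base):
    """Ratio version: the component index as a fraction of all items."""
    return item / base

def by_subtraction(item, base):
    """Subtraction version, which the NYT chart appears to use."""
    return (item - 1) - (base - 1)  # component's % change minus all items'

for name, level in [("apparel", apparel), ("tv", tv)]:
    print(f"{name}: division {by_division(level, all_items):.2f}, "
          f"subtraction {by_subtraction(level, all_items):+.0%}")
# apparel: division 0.89, subtraction -13%
# tv: division 0.08, subtraction -113%
```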

Even though I reverse-engineered the chart, I don't understand the reason for using subtraction rather than division for the second layer of indexing. It's strange to me to add or subtract two indices that have different baseline quantities.

Here is the same chart but using division:

[Image: Redo_nytinflation2]

I usually avoid telescoping indices. They are more trouble than they're worth. Here is an old post on the same subject.

How effective visualization brings data alive

Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:

[Image: Dialectmap_soda]

These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)

The entire set of maps can be found here.

***

What more evidence do we need that effective data visualization brings data alive... the corollary being that bad data visualization takes the life out of data!

Look at the side-by-side comparisons of two ways to visualize the same data. This is the "soft drinks" question:

[Image: Sidebyside_soda]

And this is the "caramel" question:

[Image: Side_by_side_caramel]

 

The set of maps referred to in the 2009 post can be found here.

***

Now, the maps on the left are more truthful to the data (at the zip-code level), while Katz applied smoothing liberally to achieve the pleasing effect.

Katz has a poster describing the methodology: at each location on the map, he averages the nearby data. This is why the white areas on the left-side maps disappear from Katz's maps.
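
Here is a toy sketch of that kind of nearest-neighbor smoothing. It is not Katz's exact method (his poster describes a weighted kernel approach), and the coordinates and responses below are invented:

```python
import math

# Toy version of the smoothing idea: at each map location, look at the
# k nearest survey responses and let them vote, weighted by distance.
# Coordinates (lat, lon) and answers are invented for illustration.
observations = [
    ((40.7, -74.0), "soda"),
    ((41.9, -87.6), "pop"),
    ((33.7, -84.4), "coke"),
    ((42.4, -71.1), "soda"),
]

def smoothed_answer(point, obs, k=3):
    """Return the locally dominant answer near `point`."""
    # Euclidean distance on lat/lon is a crude simplification.
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    nearest = sorted(obs, key=lambda o: dist(point, o[0]))[:k]
    votes = {}
    for loc, answer in nearest:
        votes[answer] = votes.get(answer, 0.0) + 1.0 / (dist(point, loc) + 1e-9)
    return max(votes, key=votes.get)

print(smoothed_answer((40.0, -75.0), observations))  # "soda"
```

Because every location gets an answer from its neighbors, even places with no respondents are filled in, which is exactly why the white gaps vanish.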

The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lose the granularity of how strongly the responses are biased toward that answer. This may be why, in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.

Katz also tells us that his maps use only part of the data. For each point on his maps, he uses only the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice, but if the responses are evenly split, or well balanced among, say, the top two choices, then using only the top choice presents a problem.

Going overboard with simplicity

Today I look at an unlikely oversight by the New York Times:

[Image: Nyt_whiteviews]

I think they tried to simplify the scale but ended up making a mess.

Tufte preaches getting rid of all unnecessary ink, but sometimes you go overboard.

***

I had a tough time understanding the scale of this chart. In particular, it is hard to figure out what the numbers at the top of the chart represent, since all six data labels fall in the middle of the chart. There is no vertical axis, and there are not enough gridlines to easily see what levels the three white lines represent. All the labeled data fall under the middle gridline. Another question is whether the vertical axis starts at zero.

So I tried drawing in reference lines (first mentally, but eventually I needed them physically):

[Image: Jc_nytwhiteamerica]

After this, it still took a few minutes to see that the gridlines were set at 25, 50, and 75 percent, so this chart actually starts at zero. Without axis labels, readers can't tell whether the vertical axis starts at zero or not! The numbers near the top of the chart are in the seventies.

I am now convinced that the individual charts share the same vertical scale. (Sometimes, putting charts on different scales is preferred, as is the case here.)

***

To summarize, a number of design elements were taken out of these charts:

- the vertical axis and, with it, the axis labels

- minor horizontal gridlines

- all data labels except the most recent numbers.

Each of these tactics, if done separately, is a best practice. All three together create a barrier to comprehension.

***

Finally, I should note the hybrid dot-and-line plot utilized here. It's a clever idea: the lines appear only when there is a rather large swing from one data point to the next, which neatly draws attention to where the big shifts are.