Via Dean Eckles on Twitter. We have this from Vox:
Have a great Labor Day! And thanks for keeping this blog alive.
An anonymous reader sent in a Type V critique of the following map of July unemployment rates by state. The map was published by the Bureau of Labor Statistics (BLS), and used in a recent article in Vox.
Matt @ Vox took the BLS's bait, and singled out Mississippi as the worst in the nation. Our reader-contributor is none too pleased with this conclusion.
He noted that the red state stands out only because the top range of the legend is "out of sample": three of the seven colors are not found on the map at all! This is like the white-space problem that arises when a line plot of large values is forced onto an axis starting at zero (for example, here), but in the opposite direction. All the states are compressed into four colors, three of which are shades of orange.
The reader investigated, and reported back:
The top end of the legend seems to be set by Puerto Rico's 13.1%. Puerto Rico is omitted from the Vox map as well as from the BLS publication (link to PDF).
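To see the reader's point concretely, here is a minimal sketch with hypothetical state rates; the bin edges are my assumption, anchored at the bottom by the lowest state rate and at the top by Puerto Rico's 13.1%:

```python
import numpy as np

# Hypothetical July rates for a handful of states; the real map tops out around 8%.
state_rates = np.array([2.9, 3.6, 4.1, 4.8, 5.3, 6.0, 6.4, 8.0])
# Seven equal-width color buckets spanning from the lowest rate up to
# Puerto Rico's 13.1% -- even though Puerto Rico is not on the map.
bins = np.linspace(2.9, 13.1, 8)

used = set(np.digitize(state_rates, bins))
print(f"{len(used)} of 7 color buckets used")  # prints: 4 of 7 color buckets used
```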
Happy Labor Day!
Vox published this chart:
This sort of chart is, unfortunately, quite common in business circles. Just about the only thing one can read readily from this chart is the overall growth in the plug-in vehicle market (the heights of the columns).
To fix this chart, start subtracting. First, we can condense the monthly data to quarterly:
This version is a bit less busy but there are still too many colors, and too many things to look at.
Next, we can condense the makes of the vehicles and focus on the manufacturers:
This version is even less busy and more readable. We can now see that Chevrolet, Nissan, Toyota, Ford and Tesla are the five biggest manufacturers in this category. All the small brands have been aggregated into the "Others" category. The stacked column chart still makes it hard to know what's going on with each individual brand's share, other than the one brand situated at the bottom of the stack.
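For readers who want to follow along, here is a rough sketch of the two condensing steps in pandas, assuming a long-format table with hypothetical column names (month, manufacturer, units):

```python
import pandas as pd

# Hypothetical long-format sales data; the real dataset has one row per make per month.
sales = pd.DataFrame({
    "month": pd.to_datetime(["2013-01-01", "2013-02-01", "2013-01-01", "2013-04-01"]),
    "manufacturer": ["Chevrolet", "Chevrolet", "Smart", "Tesla"],
    "units": [1140, 980, 60, 1425],
})

TOP5 = ["Chevrolet", "Nissan", "Toyota", "Ford", "Tesla"]

# Step 1: condense months into quarters.
sales["quarter"] = sales["month"].dt.to_period("Q")
# Step 2: lump all small brands into an "Others" category.
sales["mfr"] = sales["manufacturer"].where(sales["manufacturer"].isin(TOP5), "Others")

quarterly = sales.groupby(["quarter", "mfr"])["units"].sum().unstack(fill_value=0)
print(quarterly)
```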
Next, we switch to a line chart:
This shows the growth in the overall market, as well as several interesting developments:
A smoothed version of the line chart is even more readable:
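Continuing the sketch above, switching to lines and smoothing takes only a few more lines; the window size of the rolling mean is a judgment call, not something prescribed by the original chart:

```python
import matplotlib.pyplot as plt

# `quarterly` is the table built in the previous sketch.
# A centered rolling mean smooths out quarter-to-quarter noise.
smoothed = quarterly.rolling(window=3, center=True, min_periods=1).mean()

fig, ax = plt.subplots()
for mfr in smoothed.columns:
    ax.plot(smoothed.index.to_timestamp(), smoothed[mfr], label=mfr)
ax.set_ylabel("Plug-in vehicles sold per quarter")
ax.legend()
plt.show()
```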
Graphics is a discipline that often rewards subtracting. Less is more.
In the above discussion, I focused on the Visual aspect of the Trifecta Checkup. This dataset is really difficult to interpret, and I'd not want to visualize it directly.
The real question we are after is which manufacturer is leading the pack in plug-in vehicles.
There are a number of obstacles in our path. First, different makes are launched at different times, and it takes many months for a new make to establish itself in the market. Comparing a make that just launched with one that has been in the market for twelve months is therefore a problem.
Second, the makes span different vehicle types: compacts, SUVs, sedans, etc. More expensive vehicles will have fewer sales whether they are plug-ins or not.
Third, the population grows over time. The analyst would need to establish growth above and beyond population growth.
Is data visualization worth paying for? In some quarters, this may be a controversial question.
If you are having doubts, just look at some examples of great visualization. This week, the NYT team brings us a wonderful example. The story is about whether dogs feel jealousy. Researchers had dog owners play with (a) a stuffed toy shaped like a dog, (b) a Jack-o'-lantern, and (c) a book; and they measured several behaviors suggestive of jealousy, such as barking or pushing/touching the owner.
This is how the researchers presented their findings in PLOS:
And this is how the same chart showed up in NYT:
Same data. Same grouped column format. Completely different effect on the readers.
Let's see what the NYT team did to the original, roughly in order of impact:
Even simple charts illustrating simple data can be done well or done poorly.
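To make the point concrete, here is a hedged sketch of the grouped column format with the usual cleanup moves (muted palette, quiet frame, no gridline clutter); the numbers are invented placeholders, not the study's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

behaviors = ["Snapping", "Pushing owner", "Pushing object"]
stuffed_dog = [0.25, 0.30, 0.28]       # invented shares, not the PLOS results
jack_o_lantern = [0.02, 0.10, 0.08]
book = [0.01, 0.05, 0.04]

x = np.arange(len(behaviors))
w = 0.25
fig, ax = plt.subplots()
ax.bar(x - w, stuffed_dog, w, label="Stuffed dog", color="#336699")
ax.bar(x, jack_o_lantern, w, label="Jack-o'-lantern", color="#aaaaaa")
ax.bar(x + w, book, w, label="Book", color="#dddddd")
ax.set_xticks(x)
ax.set_xticklabels(behaviors)
ax.set_ylabel("Share of dogs exhibiting behavior")
for side in ("top", "right"):
    ax.spines[side].set_visible(False)  # quiet the frame
ax.legend(frameon=False)
plt.show()
```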
Note to New York metro readers: I'm an invited speaker at NYU's "Art and Science of Brand Storytelling" summer course which starts tomorrow. I will be speaking on Thursday, 12-1 pm. You can still register here.
The home run data set, compiled by ESPN and visualized by Mode Analytics, is pretty rich. I took a quick look at one aspect of the data. The question I ask is what differences exist among the 10 hitters that are highlighted in the previous visualization. (I am not quite sure how those 10 were picked because they are not the Top 10 home run hitters in the dataset for the current season.)
The following chart focuses on two metrics: the total number of home runs by this point in the season; and the "true" distances of those home runs. I split the data by whether the home run was hit on a home field or an away stadium, on the hunch that we'd need to correct for such differences.
The hitters are sorted by total number of home runs. Because I am using a single season, my chart doesn't suffer from a cohort bias. If you go back to the original visualization, it is clear that some of these hitters are veterans with many seasons of baseball in them while others are newbies. This cohort bias explains the difference in dot densities of those plots.
Not having followed baseball recently, I don't know many of the names on the list. I had to look up Todd Frazier - does he play in a hitter-friendly ballpark? His home-to-away ratio is massive. Frazier plays for Cincinnati, at the Great American Ballpark. That ballpark has the third-highest number of home runs hit of all ballparks this season, although up till now, opponents have hit more home runs there than home players have. For reference, Troy Tulowitzki's home field is Colorado's Coors Field, which is a hitter's paradise. Giancarlo Stanton, who also hits quite a few more home runs at home, plays for Miami at Marlins Park, which is below the median in terms of home run production; thus his achievement is probably the most impressive among those three.
Josh Donaldson is the odd man out, as he has hit more away home runs than home runs at home. His O.co Coliseum is middle-of-the-road in terms of home runs.
In terms of how far the home runs travel (bottom part of the chart), there are some interesting tidbits. Brian Dozier's home runs are generally the shortest, regardless of home or away. Yasiel Puig and Giancarlo Stanton generate deep home runs. Adam Jones, Josh Donaldson, and Yoenis Cespedes have hit the ball quite a bit deeper away from home. Giancarlo Stanton is one of the few who has hit the home-run ball deeper at his home stadium.
The baseball season is still young, and the sample sizes at the individual hitter's level are small (~15-30 total), thus the observed differences at the home/away level are mostly statistically insignificant.
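A quick sanity check of that claim: with samples this small, a two-sample test on the distances rarely clears the significance bar. The numbers below are simulated stand-ins, not the ESPN data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
home = rng.normal(400, 25, size=12)  # simulated "true" distances in feet
away = rng.normal(390, 25, size=10)  # a 10-foot gap, buried in noise

t, p = stats.ttest_ind(home, away, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3f}")  # p is typically well above 0.05 at these sample sizes
```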
The prior post on the original graphic can be found here.
It's a nice small-multiples setup with two tabs, one showing the states in order of descending spend and the other, alphabetical.
In the article itself, they excerpt the top of the chart containing the states that have suspiciously high per-patient spend.
Several types of comparisons are facilitated: comparison over time within each state, comparison of each state against the national average, comparison of trend across states, and comparison of state to state given the year.
The first comparison is simple as it happens inside each chart component.
The second type of comparison is enabled by the orange line being replicated on every component. (I'd have removed the columns from the first component, as they are both redundant and potentially confusing, although I suspect the designer may have needed them for technical reasons.)
The third type of comparison is also relatively easy. Just look at the shape of the columns from one component to the next.
The fourth type of comparison is where the challenge lies for any small-multiples construction. It is also where this chart hides its secret. If you mouse over any year on any component, every component highlights that year's data, so that one can easily make state-by-state comparisons. Like this for 2008:
You see that every chart now shows 2008 on the horizontal axis and the data label is the amount for 2008. The respective columns are given a different color. Of course, if this is the most important comparison, then the dimensions should be switched around so that this particular set of comparisons occurs within a chart component--but obviously, this is a minor comparison so it gets minor billing.
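For those who want to play with the idea, here is a static sketch of the construction, with invented numbers: one panel per state, the orange national-average line replicated in every panel, and one year highlighted across all panels (the real chart does the highlighting on mouse-over):

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2003, 2013)
rng = np.random.default_rng(1)
# Invented per-patient spend series for four placeholder states.
states = {s: 5000 + np.abs(rng.normal(0, 300, len(years))).cumsum()
          for s in ["NY", "TX", "FL", "OH"]}
national = 5000 + 180 * (years - years[0])  # stand-in for the national average
highlight = 2008

fig, axes = plt.subplots(1, 4, sharey=True, figsize=(10, 2.5))
for ax, (state, spend) in zip(axes, states.items()):
    colors = ["#cc3300" if y == highlight else "#cccccc" for y in years]
    ax.bar(years, spend, color=colors)          # highlighted year pops in every panel
    ax.plot(years, national, color="orange")    # reference line replicated per panel
    ax.set_title(state)
plt.tight_layout()
plt.show()
```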
I love to see this type of thoughtfulness! This is an example of using interactivity in a smart way, to enhance the user experience.
The Boston subway charts I featured before also introduce interactivity in a smart way. Make sure you read that post.
Also, I have a few comments about the data analysis on the sister blog.
Announcement: I'm giving a free public lecture on telling and finding stories via data visualization at NYU on 7/15/2014. More information and registration here.
The Economist states the obvious: the current World Cup is atypically high-scoring (or poorly defended, for anyone who's never been bothered by the goal count). They dubiously dub it the Brazil effect (link).
Perhaps in a sly vote of dissent, the graphic designer came up with this effort:
(Thanks to Arati for the tip.)
The list of problems with this chart is long but let's start with the absence of the host country and the absence of the current tournament, both conspiring against our ability to find an answer to the posed question: did Brazil make them do it?
It turns out that without 2014 on the chart, the only other year in which Brazil hosted the tournament was 1950. But 1950 is not even comparable to the modern era. In 1950, there was no knockout stage. The group stage had four groups, but they were unevenly sized: two groups of four, one group of three, and one group of two. Then, four teams were selected to play a round-robin final stage. This format is so different from today's that I find it silly to place the two eras on the same chart.
These data simply provide no clue as to whether there is a Brazil effect.
The chosen design is a homework assignment for the fastidious reader. The histogram plots the absolute number of drawn matches. The number of matches played has tripled from 16 to 48 over those years so the absolute counts are highly misleading. It's worse than nothing because the accompanying article wants to make the point that we are seeing fewer draws this World Cup compared to the past. The visual presents exactly the opposite message! (Hint: Trifecta Checkup)
Unless you realize this is a homework assignment. You can take the row of numbers listed below the Cup years and compute the proportion of draws yourself. BYOC (Bring Your Own Calculator). Now, pay attention because you want to use the numbers in parentheses (the number of matches), not the first number (that of teams).
Further, don't get too distracted by the typos: in both 1982 and 1994, there were 24 teams playing, not 16 or 32. The number of matches (52 in each case) is correctly stated.
Wait, the designer provides the proportions at the bottom of the chart, via this device:
As usual, the bubble chart does a poor job conveying the data. I deliberately cropped out the data labels to demonstrate that the bubble element cannot stand on its own. This element fails my self-sufficiency test.
I find the legend challenging as well. The presentation should be flipped: look at the proportion of ties within each round, instead of looking at the overall proportion of ties and then breaking those ties down by round.
The so-called "knockout round" has taken many formats over the years. In the early years, there were often two round-robin stages, followed by a smaller knockout round. Presumably the second round-robin stage has been classified as the "knockout stage".
Also notice the footnote, stating that third-place games are excluded from the histogram. This is exactly how I would do it too because the third-place match is a dead rubber, in which no rational team would want to play extra-time and penalty shootout.
The trouble is inconsistency. The number of matches shown underneath the chart includes that third-place match so the homework assignment above actually has a further wrinkle: subtract one from the numbers in parentheses. The designer gets caught in this booby trap. The computed proportion of draws displayed at the bottom of the chart includes the third-place match, at odds with the histogram.
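Here is the homework assignment done once in code, with placeholder counts standing in for the chart's numbers; the key line subtracts the third-place match before dividing:

```python
# Placeholder counts; substitute the numbers printed under the chart.
knockout = {
    1986: {"draws": 4, "matches_listed": 16},   # listed count includes the 3rd-place game
    1990: {"draws": 6, "matches_listed": 16},
}
for year, d in knockout.items():
    matches = d["matches_listed"] - 1           # drop the dead-rubber 3rd-place match
    print(year, f"{d['draws'] / matches:.0%} of knockout matches drawn")
```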
Here is a revised version of the chart:
A few observations are in order:
Another reason for separate treatment is that the knockout stage has not started yet in 2014 when this chart was published. Instead of removing all of 2014, as the Economist did, I can include the group stage for 2014 but exclude 2014 from the knockout round analysis.
In the Trifecta Checkup, this is Type DV. The data do not address the question being posed, and the visual conveys the wrong impression.
Finally, there is one glaring gap in all of this. Some time ago (the football fans can fill in the exact timing), FIFA decided to award three points for a win instead of two. This was a deliberate effort to increase the point differential between winning and drawing, supposedly to reduce the chance of ties. Any time-series exploration of the frequency of ties would clearly have to look into this issue.
A graphic illustrating how Americans spend their time is a perfect foil to make the important case that the reader's time is a scarce resource. I wrote about this at the ASA forum in 2011 (link).
The visual form is a treemap displaying the results of the recently released Time Use Survey (link to PDF).
What does the designer want us to learn from this chart?
What jumps out first is the importance of various activities, starting with sleep, then work, TV, leisure/sports, etc.
If you read the legend, you'll notice that the colors mean something. The blue activities take up more time in 2013 compared to 2003. Herein, we encounter the first design hiccup.
The size of the blocks (which codes the absolute amount) and the color of the blocks (which codes the relative change in the amount) compete for our attention. According to Bill Cleveland's research, size is perceived more strongly than color. Thus, the wrong element wins.
Next, if we have time on our hands, we might read the data labels. Each block has two labels, the absolute values for 2003 and for 2013. In this, the designer is giving an arithmetic test. The reader is asked to compute the change in time spent in his or her head.
It appears that the designer's key message is "Aging Americans sleep more, work less", with the subtitle "TV remains No.1 hobby".
Now compare the treemap to this set of "boring" bar charts.
This visualization of the same data appears in WSJ online in lieu of the treemap. Here, the point of the article is made clear; the reader need not struggle with mental gymnastics.
(One can grumble about the red-green color-blindness blindness but otherwise, the graphic is pretty good.)
When I see this sort of data, I like to make a Bumps chart. So here it is:
The labeling of the smaller categories poses a challenge because the lines are so close together. However, those numbers are so small that none of the changes would be considered statistically significant.
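For the curious, a Bumps chart of this kind is only a few lines of matplotlib; the values below are placeholders standing in for the ATUS estimates:

```python
import matplotlib.pyplot as plt

# Placeholder hours per day for 2003 and 2013; the real values come from the ATUS tables.
activities = {"Sleeping": (8.57, 8.74), "Working": (3.69, 3.46), "Watching TV": (2.57, 2.77)}

fig, ax = plt.subplots()
for name, (v2003, v2013) in activities.items():
    ax.plot([0, 1], [v2003, v2013], marker="o", color="#555555")
    ax.text(-0.05, v2003, f"{name} {v2003:.2f}", ha="right", va="center")
    ax.text(1.05, v2013, f"{v2013:.2f}", ha="left", va="center")
ax.set_xticks([0, 1])
ax.set_xticklabels(["2003", "2013"])
ax.set_xlim(-0.6, 1.3)
plt.show()
```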
From a statistical/data perspective, a very important question must be raised. What is the error bar around these estimates? Is there anything meaningful about an observed difference of fewer than 10 minutes?
Amusingly, the ATUS press release (link to pdf) has a technical note that warns us about reliability of estimates, but nowhere in the press release can one actually find the value of the standard error, or a confidence interval, etc. After emailing them, I did get the information promptly. The standard error of one estimate is roughly 0.025-0.05 hours, which means that the standard error of a difference is roughly 0.05-0.1 hours, which means that a confidence interval around any estimated difference is roughly 0.1-0.2 hours, or 6-12 minutes.
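Spelling out my back-of-envelope arithmetic above in code:

```python
# Doubling at each step is a conservative shortcut; sqrt(2) ~ 1.41 is the
# exact factor for the SE of a difference of two independent estimates.
se_single = (0.025, 0.05)                    # SE of one estimate, in hours
se_diff = tuple(2 * s for s in se_single)    # roughly 0.05-0.1 hours
ci_half = tuple(2 * s for s in se_diff)      # roughly 0.1-0.2 hours around a difference
print([round(60 * h) for h in ci_half], "minutes")  # prints [6, 12] minutes
```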
Except for the top three categories, it's hard to know whether the reported differences are real or merely artifacts of sampling.
A further problem with the data is their detachment from reality. There are two layers of averaging going on: one across the population and one across days. In reality, not everyone does these things every day. This dataset is really only interesting to statisticians.
So, in a Trifecta Checkup, the treemap is a Type DV and the bar chart is a Type D.
Norris's theory - originating from an economist at Hanley Wood, a real estate research firm - is that in a recovering market, the share of new home sales by home builders should be higher than the share by banks, as the bank share is associated with foreclosed houses. The data offered are both in aggregate and by region. I'm particularly interested in the regional chart from a design perspective.
The published chart is the one shown on the left below. I am not a fan of nested bar charts. I don't think there is any justification for treating two data series (here, share by banks and share by builders) differently. Which of the two series should one assign to the fatter bars?
If we slim the fat bars down, we retrieve the more conventional paired bars chart, shown on the right. Among these two, I prefer the paired version.
This presentation also shines a light on a dark corner of Norris's analysis. In every city but Detroit, an unmentioned group of sellers accounts for the majority of home sales! Nowhere in the article does Norris tell readers who those sellers are, and why they are ignored.
In all these charts, I have kept the original order of cities. Before reading further, see if you can tease out the criterion for sorting the cities.
With some effort, you'll learn that the cities are arranged in the order of degree of housing recovery, which is measured by the difference in share: the cities at the top (Houston, Dallas, etc.) have a higher share of builders selling than banks selling.
Ironically, the difference in share is the least emphasized data in a nested bar chart. In fact, how you compute the difference depends on the relative share! When the olive bar is longer than the blue bar, the reader sizes up the white space between the edges of the bars; when the blue bar is longer, though, the reader must look inside the blue area, and compute the interior distance.
The reader can use some help here. Possible fixes include using a footnote, or adding a note informing readers that up implies stronger recovery, or creating a visual separation between those cities in which the share by builders exceeds that by banks, and vice versa.
Here is a dotplot with annotations. The separation between the dots is easily estimated.
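A sketch of such a dotplot, with invented shares, sorted so that stronger recovery sits at the top:

```python
import matplotlib.pyplot as plt

# Invented shares; the point is the encoding, not the data.
cities = ["Houston", "Dallas", "Phoenix", "Detroit"]
builders = [0.25, 0.22, 0.15, 0.05]
banks = [0.08, 0.10, 0.18, 0.55]

fig, ax = plt.subplots()
y = range(len(cities))
ax.scatter(builders, y, color="#808000", label="Builders", zorder=3)
ax.scatter(banks, y, color="#336699", label="Banks", zorder=3)
ax.set_yticks(list(y))
ax.set_yticklabels(cities)
ax.invert_yaxis()  # stronger recovery (builders > banks) at the top
ax.set_xlabel("Share of home sales")
ax.legend()
plt.show()
```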
Recall the theory that in recovering markets, banks account for a lesser share of home sales. The analyst turned this into a metric by taking the difference between the share of sales by builders and the share by banks.
This metric is highly problematic. The first problem, already discussed, is that there exist more than these two types of sellers, and it is absolutely not the case that if the share by banks goes down, the share by builders goes up.
Another issue is that the structure of the housing market probably differs from city to city. The chart promotes the view that there is a general trend extending to all markets. In fact, the variation over time within one city should be more telling than the variation across twenty cities at a single point in time.
And there is the third strike.
This is a confusion between forward and reverse causation (see Andrew's post here for a general discussion of this important practical issue). The Floyd Norris/Hanley Wood theory expresses a forward causation: if a housing market is recovering, then banks will work through its inventory of foreclosed homes, and account for a decreasing share of home sales.
The analysis addresses the reverse of this relationship. The analyst observes that banks (in some cities) are selling fewer homes, and concludes that the housing market is recovering. Notice that this is a problem of reverse causation: instead of cause -> effect, we have effect -> cause. The rub is that any given outcome has many possible causes. Banks sell fewer homes for many possible reasons, only one of which is a recovering market.
Here are some other possibilities. The banks expect prices to rise in the future, and they are holding on to the inventory. The economy is sputtering and banks are tightening up on mortgage lending, making it harder to sell homes. Instead of selling the homes, the banks decide to destroy the homes to reduce supply and raise prices. The mysterious third group of sellers has put a lot of homes on the market. etc.
In making claims based on observational data, one must conduct side investigations to rule out other causes.
From a Trifecta Checkup perspective, this chart addresses an interesting Question. The Visual design has hiccups. The biggest problem is that the Data provide an unsatisfactory answer to the question at hand. (Type DV)
Through Twitter, Antonio Rinaldi sent the following chart, which accompanied a New York Times piece about the CPI (inflation index). The article concerns a very important topic--that many middle- to lower-income households have barely any savings after spending on necessities--and only touches upon the issue raised by this chart: the official CPI is an average of prices over a basket of goods, and there is much variability in the price changes of different categories of goods.
I cover this subject in much greater detail in Chapter 7 of Numbersense (link). There are many reasons why the official inflation rate seems to diverge from our own experiences. One of them is that we tend to notice and worry about price increases, but we take price decreases for granted or fail to notice them at all. In the book, I cover the fascinating subject of the psychology of remembering prices. Obviously, this is a subject of utmost importance if we are to use surveys to understand perceived prices.
The price of an (unbranded) T-shirt has remained the same, or may even have declined, over the last few decades. Meanwhile, the chart reveals that phones and accessories, computers, and televisions have all enjoyed deflation over the last decade. Actually, much of this "deflation" is due to a controversial adjustment known as "hedonics," which attributes part of any price change to product or technology improvements. So, if you pay the same price today for an HDTV as you did in the past for a standard-definition TV, then in reality the price you pay today is lower than the price in the past.
That adjustment is reasonable only to a certain extent. For instance, my cell phone company stuffs my plan with hundreds of unused and unusable minutes, so on a per-minute basis, I am sure prices have come down substantially, but on a per-used-minute basis, I'm not so sure.
Let's get to what we care about on this blog... the visual. There is one big puzzle embedded in this chart. Look at the line for televisions. It dipped below -100 percent! Like Antonio, many readers should be scratching their heads--did the price of television go negative? did the hedonic adjustment go bonkers?
As an aside, I don't like the current NYT convention of hiding too many axis labels. What period of time is this chart depicting? You'd only find out by reading the label of the vertical axis! I mentioned something similar the other day.
The key to understanding a chart like this is to figure out what is being plotted. The first instinct is to assume it is the change in prices over time. A quick glance at the vertical axis label corrects that misunderstanding. It reads: "Change in prices relative to a 23% increase in price for all items, 2005-2014".
This label is doing a lot of work--probably too much for its inconspicuous location and unbolded, uncolored status.
Readers have to know that the official CPI is a weighted average of changes in prices of a specified basket of goods. Some but not all of the components are being graphed.
Then readers have to understand that there is an index of an index. The prices of each "item" (i.e. category or component of the CPI) are indexed to 1984 levels. So the price of televisions is first re-indexed to a 2005 baseline. This establishes a growth trajectory for televisions. But this is not what is being depicted.
The blue line reflects the 23% average increase in prices in that 10-year period. Notice that the red line does not exhibit any weirdness--television prices have gone down by 90 percent. It's not negative.
What the designer tried to do is to index this data another time. Think of pulling the blue line down to the horizontal axis, and then see what happens to the gray and red lines.
Now, even this index on an index should not present a mathematical curiosity. If all items moved to 1.23 while apparel moved to 1.10, you might compute 110%/123%, which is roughly 0.9. You'd say the apparel index went 90% of the way to where the all-item index went. Similarly for TVs, you would compute 10%/123%, which is 0.08. That would be saying the TV index ended up at 8% of where the all-item index landed.
That still doesn't yield -100%. The clue here is that the baseline is zero percent, not 100, not 1.0, etc. So if there is an item that moved in sync with all items, its trajectory would have been horizontal at zero percent. That means that the second index is not a division but a subtraction. So for TV, it's -90% - 23% = -113%. For apparel, it's +10%-23% = -13%.
Even though I reverse-engineered the chart, I don't understand the reason for using subtraction rather than division for the second layer of indexing. It's strange to me to add or subtract two indices that have different baseline quantities.
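To make the two constructions explicit, here is the arithmetic in code, using the growth factors read off the chart (+23% all items, +10% apparel, -90% televisions):

```python
all_items, apparel, tv = 1.23, 1.10, 0.10  # growth factors, 2005-2014

def by_subtraction(item):  # what the NYT chart appears to do
    return (item - 1) - (all_items - 1)    # e.g. TV: -90% - 23% = -113%

def by_division(item):     # the alternative shown below
    return item / all_items - 1            # e.g. TV: 0.10/1.23 - 1 = -92%

for name, g in [("apparel", apparel), ("television", tv)]:
    print(name, f"subtraction: {by_subtraction(g):+.0%}",
          f"division: {by_division(g):+.0%}")
```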
Here is the same chart but using division:
I usually avoid telescoping indices. They are more trouble than they're worth. Here is an old post on the same subject.