Reader Steph G. didn't like the effort by WRAL (North Carolina) to visualize the demographics of protestors in Raleigh. It sounds like the citizens of NC are making their voices heard. Maybe my friends in Raleigh can give us some background.
There are definitely problems with the choice of charts. But I rate this effort a solid B. In the Trifecta Checkup, they did a good job describing the central question and compiling an appropriate dataset. I love it when people go out to collect the right data rather than use whatever they can grab. The issue was the execution of the charts.
The first was a map showing where the arrested protestors came from.
Maps are typically used to show geographical distribution. The chosen color scheme (two levels of green and gray) compresses the data so much that we learn almost nothing about distribution. I clicked on Wake County to learn that there were 178 arrests there. The neighboring Randolph County had only 1 arrest but you can't tell from the colors.
The next chart shows the trend of arrests over time. I like the general appearance (except for the shadows). The problem is the even spacing of the columns when the gaps between the arrest dates are uneven.
Here's a quick redo, with proper spacing:
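One way to fix the spacing is to position each column by its actual date rather than by category order. Here's a minimal Python sketch, using hypothetical arrest dates (the real dates are in WRAL's data):

```python
from datetime import date

# Hypothetical arrest dates and counts -- illustrative only,
# not the actual WRAL data.
arrests = {
    date(2013, 4, 29): 17,
    date(2013, 5, 6): 30,
    date(2013, 5, 13): 49,
    date(2013, 6, 3): 84,
}

# Position each column by its day offset from the first date,
# so uneven gaps between protest dates show up as uneven spacing.
start = min(arrests)
x_positions = [(d - start).days for d in sorted(arrests)]
print(x_positions)  # [0, 7, 14, 35]: two one-week gaps, then a three-week gap
```

Feeding these offsets to the x-axis, instead of treating each date as a category, is what produces the proper spacing.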
The final set of charts is inspired. They compare the demographics of those arrested protestors against the average North Carolina resident. For example:
For categories like Age with quite a few levels, the pie chart isn't a good choice. It's also hard to compare across pie charts. A column or dot chart works better.
Reader omegatron came back with another shocking instance of a pie chart:
Here is the link to the AVERT organization in the U.K. that published the chart and several others.
For the umpteenth time, the pie chart plots proportions. All proportions are percentages but some percentages are not proportions. The data here would appear to be "rate of diagnosis" rather than proportion of diagnoses by age.
The data came from Table 3a of this CDC report (link), and they are clearly labelled "Rate". The footnote even disclosed that the "Rate" is measured per 100,000 people so they are being mislabeled as percentages.
Let's summarize: the values add up to much more than 100%, so they are clearly not proportions. In fact, they are not even percentages; they are rates per 100,000.
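A quick sanity check makes the point; the numbers below are hypothetical stand-ins for the CDC figures:

```python
# Hypothetical diagnosis rates per 100,000 by age group
# (stand-ins, not the actual CDC Table 3a values).
rates_per_100k = {
    "13-14": 1.4, "15-19": 16.5, "20-24": 36.9,
    "25-29": 35.2, "30-34": 28.9, "35-39": 22.5,
}

total = sum(rates_per_100k.values())
assert total > 100  # proportions of a whole would sum to exactly 100

# A rate per 100,000 converts to a true percentage by dividing by 1,000:
pct = {k: v / 1000 for k, v in rates_per_100k.items()}
# e.g. 36.9 per 100,000 is about 0.037 percent, nowhere near 36.9%
```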
omegatron even got confused by the colors. You'd think that the slices would be arranged by age group but no! The slices are ordered by size, with one exception--the lime green slice of 11.4%, which I cannot explain. In practice, this means the order goes from Under 13 to 13-14 to Over 65 to 60-64 to 50-54, etc.
A smarter use of color here would be to stick to one color while varying the tint according to the rate of diagnosis. Using 13 colors for 13 age groups is distracting.
As a teacher, I find it shocking that such pie charts continue to see the light of day. It's very disappointing, as I'd assume every teacher who covers the pie chart points out its pitfalls. Why does this keep happening?
With this chart, I'm mostly baffled by the top corner of the Trifecta Checkup. What is the point of this data? If I understand the "per 100,000 population" definition, these rates are computed as the number of diagnoses divided by the population in each age group. So the diagnosis rate is a function of how many people in each age group are actually infected, how effective the diagnostic procedures are, and whether that effectiveness varies with age. Add to that the completeness of reporting by age group (the footnote acknowledges that the mathematical model does not account for incomplete reporting; to call a spade a spade, that means the model assumes complete reporting).
The rate of diagnosis can be low because the rate of infection is low, or because the proportion of the infected who get diagnosed is low. I just can't conceive of a use for data that confound these factors.
A time series treatment would be interesting, although that addresses a different question.
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto; these teams spread the wealth among the D, F, and M players while not spending much on the goalie and "others". By contrast, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team ("others" are mostly hyphenated positions like D-F, F-M, etc.). DAL and VAN spend a lot less on midfield players than other teams, and VAN spends a lot on defense.
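The cap-proportion conversion behind this clustering is straightforward; here is a sketch with made-up numbers for a single team:

```python
# Hypothetical cap spending by position for one team (made-up figures;
# the real data come from the Tableau MLS submission).
team = {"D": 600_000, "F": 750_000, "M": 900_000, "GK": 150_000, "Other": 100_000}

# Convert dollars to proportions of the team's total cap, putting
# every team on the same 0-1 scale before comparing or clustering.
cap_total = sum(team.values())
shares = {pos: amt / cap_total for pos, amt in team.items()}
print(round(shares["M"], 2))  # 0.36 of the cap goes to midfielders
```

Once every team is expressed this way, teams that allocate the cap similarly sit close together regardless of how much they spend in absolute dollars.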
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.
In modern software (I'm using JMP's Graph Builder here), it takes only one click to go from line to bar, and one more click to go to pie.
Here's a chart in the November edition of Bloomberg Markets:
Curiosities include how they split up the lamb chop, and why an onion was chosen to represent "fresh vegetables/melons".
The chart contains some strange data that make readers nervous. For example, the fish image seems to say 88 percent of seafood eaten in the States is imported, and yet the two largest importing countries listed below (China and Vietnam) together account for only 22.5 percent. So the residual 65.5 percent must be split among at least 10 countries, each accounting for no more than 6.5 percent of the total.
Then when you look at vegetables, Mexico and Canada together supply 72 percent. But the onion graphic tells us it's less than 20 percent. The categorization seems to be different between the top and the bottom layers. We have "fruit and nuts" / "fresh vegetables/melons" on the one side, and "fruit" / "vegetables" on the other side.
And why are melons combined with fresh vegetables rather than fruit?
Reader Steve S. sent in this article that displays nominations for the "Information is Beautiful" award (link). I see "beauty" in many of these charts but no "information". Several of these charts have appeared on our blog before.
Let's use the Trifecta checkup on these charts. (More about the Trifecta checkup here.)
The topic of this chart is both tangible and interesting. As someone who loves books, I do want to know what genres of books typically win awards.
However, both the data collection and graphical design make no sense.
The data collection presents a huge challenge and it's easy to get wrong. The problem is how narrow a theme should be. If it's too narrow, you can imagine every book having its own set of themes. If it's too wide, each theme maps to lots of books. The challenge is to select themes of similar "width". For example, "death" is a very wide theme and lots of books contain it, as indicated by the black lines. "Nanny trust issues" is a very narrow theme, and only one of the books deals with it. When there is such a theme, is its lack of popularity due to its narrow definition or to writers not being interested in it?
The caption of this chart said "Cover stars: Charting 50 years up until 2010, this graphic shows The Beatles to be the most covered act in living memory." If that is the message, a much simpler chart would work a lot better.
Since the height of the chart indicates the number of covers sold in that year, the real information being shown is the boom and bust cycles of the worldwide economy. So, a lot more records were sold in 2005, and then the market tanked in 2008, for example.
That's why the data analyst should think twice before plotting raw data. Most data like these should be adjusted. In this case, you could either compare artists against one another in each year (by using proportions) or you have to do a seasonal and trend adjustment. I also don't see the point of highlighting year-to-year fluctuations. Nor do I understand why only in certain years is the top-rated cover identified by name and laurel wreath.
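The proportion adjustment I have in mind is simple; here is a sketch with made-up cover counts for one year:

```python
# Hypothetical counts of cover versions in one year
# (made-up numbers; the real data sit behind the infographic).
covers_2005 = {"The Beatles": 120, "Bob Dylan": 80, "All others": 300}

# Dividing by the yearly total strips out the overall market size,
# so artists can be compared across boom and bust years alike.
year_total = sum(covers_2005.values())
shares = {artist: n / year_total for artist, n in covers_2005.items()}
print(round(shares["The Beatles"], 2))  # 0.24 -- the Beatles' share of that year
```

Plotting these within-year shares, rather than raw counts, would let the artist comparison survive the economy's ups and downs.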
I talked about this stream graph of 311 calls back in 2010. See the post here.
I featured this set of infographics/pie charts back in 2011. See the post here.
This chart is a variant of the one from New York Times that I discussed here. I like the proper orientation on the NYT's version. The color scheme here may be slightly more attractive.
Reader James H. spotted this offensive pie chart in Forbes (link).
This chart tells us that emerging markets will be responsible for the greatest growth in medical spending up to 2016.
It is hard to find this message in the chart. The gray sector for Japan in 2006 reads 10%, the exact same number as the gray sector in 2016, which appears several times larger. In a pie chart, it is hard enough to compare sectoral areas within one pie, let alone sectors across different-sized pies.
James noticed that the pie areas are incorrect. The 2016 pie should have roughly double the area of the 2006 pie. This is not the case: the radius of the 2016 pie appears to be at least three times that of the 2006 pie, which implies an area at least nine times as large.
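The geometry is worth spelling out: pie area grows with the square of the radius, so doubling the data calls for a radius only about 1.41 times larger.

```python
import math

# To double a pie's area, grow the radius by sqrt(2), not by 2.
correct_radius_ratio = math.sqrt(2)
print(round(correct_radius_ratio, 2))  # 1.41

# A radius three times larger implies an area nine times larger --
# far more than the roughly 2x growth the data calls for.
implied_area_ratio = 3 ** 2
print(implied_area_ratio)  # 9
```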
As usual, a line chart brings out the trend more clearly:
The projected numbers should be clearly labelled as such: "2016" should read "2016P". I'm not sure if the 2011 number was also projected - that depends on when the data source was published.
The worst thing about this chart is that it's completely misleading. It fails to recognize that there are many billions of people in emerging markets and the "rest of the world", while the U.S., Europe and Japan combined have just over one billion people. Thus, all this chart really says is that population growth in the next several years will mostly occur in emerging markets. One can substitute medical spending with any kind of mass-market spending and get essentially the same picture.
Below is a rough estimate of per-capita medical spending by region, using population sizes in 2011. For emerging markets, I substituted BRIC, i.e. Brazil, Russia, India and China, which underestimates the population and thus overestimates the per-capita spend. These parts of the world spend a fraction of what industrialized countries spend. So what's the story?
Ryan McCarthy linked to a post by Ruchir Sharma running on Ezra Klein's blog analyzing global billionaires.
It has an accompanying chart, which fails our self-sufficiency test. That test involves erasing raw data from a chart, and figuring out how much information the graphical elements themselves convey.
The primary metric used by Sharma is the billionaires' total net worth as a percentage of the country's GDP. This metric is embedded in double concentric circles. Unfortunately, without mental gymnastics, readers can't tell what the proportion is. This means we must look at the raw data, which is supplied as a column on the right of the graphic. If readers are taking the information from the column of raw data, then why draw a chart?
The actual data is revealed on the left. Don't tell anyone you read it here, but pie charts would work well with this dataset. You might complain that there is a conceptual problem: if we sum up the net worth of everyone in a country, it would not equal GDP. I think the sum doesn't work - economists can chime in about this. Sharma seems to imply that the total would sum to 1. Anyone's net worth is accumulated over a number of years during which GDP fluctuates, while total GDP is given for a specific end of quarter of some year, so does it make sense to divide one by the other?
Also, the fact that some people may have negative net worth creates problems with the pie-chart format and it's not much better in a concentric-circle format either.
A maddening decision puts the United States, which is the biggest circle, at the bottom of the chart. Notice that the countries are sorted from largest billionaires' share to smallest. The U.S. belongs to the top five nations with the worst inequality by this metric, and yet a cheeky little bookmark sends us to the bottom of the list together with the more-equal nations.
Not only is the location of the U.S. privileged; the location of the text, the number of decimal places given in the net worth amount, and the presence of the GDP value all set the U.S. apart from the other countries plotted.
The most interesting piece of information is waiting to be reconstructed. In Malaysia, nine citizens own as much as 18.3% of the country's GDP. In Mexico, 11 people own 10.9% of the country's GDP.
To make the number even more telling, we have to incorporate population size. Malaysia's is 28 million, which means the top 0.000032% of the population owns 18.3% of GDP. Under perfect equality, that slice of the population would own 0.000032%. We can call the ratio of the two - 570,000 - an inequality index. In Mexico, the index is 1.1 million, so the concentration of wealth is in fact worse in Mexico than in Malaysia. For reference, the U.S. comes in at 78,000.
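The index arithmetic can be reconstructed as follows. The population figures are my own rough 2011 estimates (28 million for Malaysia, 112 million for Mexico), not numbers from the chart:

```python
# Inequality index: billionaires' share of GDP divided by their
# share of the population. Population figures are rough estimates.
def inequality_index(gdp_share, n_billionaires, population):
    return gdp_share / (n_billionaires / population)

malaysia = inequality_index(0.183, 9, 28_000_000)
mexico = inequality_index(0.109, 11, 112_000_000)
print(round(malaysia, -4))  # roughly 570,000
print(round(mexico, -5))    # roughly 1.1 million
```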
Of course, the use of billionaires as a filter for who gets counted is completely arbitrary. In measuring income inequality, one should look at, say, what proportion of the population controls 50% of the wealth.
There is no explanation for the choice of countries. The U.S. is the only developed nation in the entire chart.
Reader Dave S. was disturbed by the graphics in the inaugural World Happiness Report, published by Jeffrey Sachs's Earth Institute (link). It's a 200-page document with lots of graphs, many of which require rework.
Here's a pie chart showing (purportedly) what "happy" people in Bhutan are happy about:
I'm really curious how these domains add up to exactly 100%. Since the data came from some kind of survey, you would typically allow each respondent to pick more than one domain in which he or she is happy. If that is the case, then it would not make sense to add up the responses, nor would the total (100%) signify anything.
If, on the other hand, respondents were forced to pick only one domain, it is very suspicious that all 9 domains would receive essentially the same number of votes. Nor would it make sense to ask survey-takers to select only one domain if all 9 domains contribute to someone's happiness.
Pie charts are perhaps the most abused chart type. There are endless examples of poorly executed ones (just browse my last few posts). The prevalence of abuse may be reason enough to ban them.
Paired with Figure 4 shown above is Figure 5 shown below, which deepens the mystery:
Compare the captions. What's the difference between "In which domains do happy people enjoy sufficiency?" and "Indicators in which happy people enjoy sufficiency"? The categories are related but not identical (Education vs. Schooling, Health vs. Self-reported health status, etc.). However, in Figure 5, the distribution is not uniform as it is in Figure 4. Is the data contradictory? Or are the captions misleading?
This column chart would be better presented as a horizontal bar chart so that readers don't have to break their necks trying to read the category names.
The designer should also perform the routine task of removing the 120% tick mark on the proportion axis, an artifact of Excel's default settings.