Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.
Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.
This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall, although the average pledge per project tended to be above average.
The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amounts is probably quite high, and so the average may not be a good metric anyway. The Bumps format, though, suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines to link the metrics.
The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.
PS. Readers want to see a scatter plot:
The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?
Reader Steve S. tried to spoil my new year with this chart he didn't like:
Or maybe he's just chiding me for recommending Bumps charts. This example is very confusing, a tangled mess.
But not so fast.
The dataset has two characteristics that don't sit well with Bumps charts. One, too many things are being ranked (twenty). Two, too much rank swapping happens over time (14 periods).
The latter challenge can be tamed by aggregating the time dimension. For some reason, the period under examination was the first half year after the debut of these computers. Do we really need to know the weekly statistics?
We can keep all 14 periods. If so, we should be judicious in selecting the colors, the line styles (solid versus dashed), the gridlines, and so on. In particular, look for a story and use foreground/background techniques to highlight it.
Here's a version that focuses on the brands that moved up or down the most ranks during this period:
Here's one that tracks how the top five fared over this period of time. It turns out that despite all the noisy movements, not much happened at the top of the rankings:
Not knowing many of these computer brands, I really have no idea why seven colors were used, or why different tints of those colors were chosen. I also don't have a clue why some lines were dashed and others were solid.
Looking closely, I learn that the Sony PC was given a black color because its label does not show up on either side. It was a product that did not rank among the top 20 either at the start or at the end of this time period. This Sony PC should be consigned to the dustbin of history, and yet in the color scheme selected for the original chart, the black solid line is the most visible!
I'd like to see an interactive layer added to this chart that brings out the "information". Two of the tabs can be "top movers" and "top five brands" as discussed above. If you hover over these tabs, the appropriate lines are highlighted.
Thanks to reader Don M, I came across this fascinating chart published in the New York Times Review recently (link). The main article, about gender segregation in job categories, is found here.
This is one of those charts that require a reader's guide.
The chart shows the proportion of women in each job category in 1980 and in 2010 (and nothing in between). The jobs are divided into three large chunks: the top chunk (shaded) consists of jobs in which women account for more than 70 percent of the total; the middle chunk (white background) consists of jobs with 30 to 70 percent women; the bottom chunk (also shaded) consists of jobs with more than 70 percent men.
The designer then uses the red, green and gray colors (apologies to the color-blind folks) to group the jobs into three clusters. This is usually a great idea except that it is poorly executed here. Don is very annoyed with this because these colors lead the readers to the wrong conclusion, and I agree.
The color scheme is unnecessarily convoluted. Here is an alternative I prefer:
If the change is 5 percentage points or less, color the line gray no matter where it sits. (It is insane to color the line for housekeepers red for going from 87 to 89 percent in 30 years.)
If the change is more than 5 points in the female direction, color it red to indicate the occupation is becoming more female. (There would be many red lines, such as for managers in education, HR staff, social workers, architects, etc.)
If the change is more than 5 points in the male direction, color it blue to indicate the occupation is becoming more male. (There would be only one blue line, and that is for welfare service aides.)
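The proposed rules can be sketched as a simple function. This is only a hypothetical illustration (the `line_color` name and its inputs are mine), using the 5-point threshold suggested above:

```python
def line_color(pct_women_1980, pct_women_2010, threshold=5):
    """Assign a line color based on the 30-year change in percent women."""
    change = pct_women_2010 - pct_women_1980
    if abs(change) <= threshold:
        return "gray"   # little movement either way
    return "red" if change > 0 else "blue"  # more female / more male

# Housekeepers moved from 87% to 89% women, a 2-point shift,
# so under this rule the line stays gray rather than red.
print(line_color(87, 89))  # gray
```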
This would mean the lines for dentists and architects would be labelled progress. So too with most of the jobs that were predominantly male in 1980. In fact, there really isn't any occupation that went backwards--all those red lines in the bottom shaded chunk indicate shifts of only 1 to 4 percent, over 30 years!
This conclusion undercuts the premise of the column, in which the author claims that the conventional wisdom is wrong.
The other precaution in reading this chart is to realize that each occupation is put on equal footing in this chart even though some job categories employ a lot more people than others. Also confounded with this data is the differential growth/decline in job categories over the 30-year period. Further, the proportion of women entering the labor force must be accounted for.
This is a case in which less is less. The structure of the problem is complex, and it requires a more sophisticated approach.
Reader Sushil B. offers this chart from Business Week on hedge fund returns. (link)
Unmoored bubbles, slanted text, positive and negative returns undifferentiated, bubble within bubble, paired data scattered apart, and it's not even that attractive.
Here is a Bumps-chart style version of this data:
The author never explained how the five funds were chosen, so it's hard to know what the point of the chart is. It appears that Harbinger Capital Partners had a similar experience to Paulson. In addition, given the potentially huge gyrations from year to year, it's very odd that we are not shown the annual returns between 2007 and 2011... we can't rule out that some of the three other funds suffered a particularly bad year in between the end points shown here.
We look at another idea from the visualization project "Gaps in the US Healthcare System" (link). This was a tip from reader Jordan G. (link). One of the bright points about this project is the conscious attempt to try something different, although the end result is not always successful.
A tree-like branching chart was used to represent cancer death rates, broken down by racial group, gender and type of cancer, in that order.
The tree structure loses its logic after the race and gender splits. Why link different types of cancers (the gray squares) together in a sequence? Stranger still is the existence of a third branch coming out of every race node (the four closest to the center). One branch is male, the other branch is female, what's the third leg? It appears to be prostate cancer which is male only--why doesn't it go with the male branch?
It's not easy to find the connection between what's depicted here and the idea of "gaps" in the US healthcare system. I think the question is ill-posed to begin with. The rate of death reflects both the possible differential quality of healthcare between groups and the differential incidence of cancers between groups, so no visualization tricks can be used to find reliable answers to the question being posed.
The chart fails the first corner of the Trifecta checkup. The chart type also does not fit the data.
The following chart plots the same data in a Bumps style.
I separated the male and female data since certain cancers are limited to one gender, and the gender difference is not likely to be the primary interest. The gender difference, incidentally, is clearly observed: the male death rates are generally about twice as high as the female rates of the same type of cancer, except for colorectal.
In terms of the "race gap", we find that black death rates are generally quite a bit higher than white death rates, especially for prostate cancer but except for lung cancer in females.
Asians and American Indians have practically the same death rates, but in both cases the sample sizes are small.
The raw data can be found at the CDC website here.
Stefan S., who works for the UN data project and is a regular contributor to this blog, points us to a new report they have issued that contains a host of charts. The report is an update on what has happened to our Earth since 1992 (the Earth Summit). Link to the PDF file here.
This life expectancy chart (shown on left) uses a Bumps-type chart, and is very nicely done, clean and informative.
This age distribution chart shown on the right is unusual. It's a case of the data defeating the chart type. The magnitude of the 5-year changes is just not large enough as a percentage of the total to register. On a different data set, I can see this chart type being more effective.
Now, this criss-cross chart (bottom left) reminds me of Friedman's foolish attempt some time ago. It has various issues, like dual axes, excessive labels and an inattentive title (not indicating that the base population was only developing countries).
Instead, I attempted an area chart, using population size as the primary metric. Perhaps a more direct way to illustrate this point is to plot the growth rate of the slum population versus that of the total population.
This map is excellent, showing the spatial distribution of the countries with above-average and below-average GDP per capita. It would be even better if smaller geographic units could be used so that the distribution within each country could also be seen.
I'd like to salute all the people around the world who work at statistical agencies and who collect and make sense of all of this data, without which none of these charts would have been possible.
First, we must fix the vertical scale. For column charts, one must start at zero, without exception. The effect of not starting at zero is to chop an equal-length piece off the bottom of each column, and in doing so, the relative lengths/areas of the columns are distorted. The amount of distortion can be very severe. For example, look at the fourth set of columns as shown below:
In both charts, I made the length of the first column the same so that we are staring at comparable charts. The data plotted are exactly the same; the only difference is that the left chart starts the axis at zero. Notice that the huge difference seen on the right chart for the 4th pair of columns does not appear as extraordinary when the proper scale is used.
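The distortion from a truncated axis is easy to quantify. The sketch below is a hypothetical illustration (the `apparent_ratio` function and the baseline value are mine, not from the original chart): chopping a constant off every column inflates the apparent ratio between any two columns.

```python
def apparent_ratio(a, b, baseline=0):
    """Ratio of drawn column lengths when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Two values that differ by about 5%:
hi, lo = 2.00, 1.90

print(apparent_ratio(hi, lo))                 # about 1.05 with a zero baseline
print(apparent_ratio(hi, lo, baseline=1.85))  # about 3: the gap looks tripled
```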
A multitude of other problems exist, not least that this is a highly redundant chart. The same data (10 numbers) show up three times: once as data labels, once as column lengths (distorted), and once as levels on the vertical scale.
An alternative way to look at this data is the Bumps chart. Like this:
What this chart brings out is the variability of the estimated vehicle densities. In theory, the density estimate should be quite accurate for the "today" numbers. You'd think that in surveying 2,000+ people about how many vehicles they currently own, most people should be able to provide accurate counts.
The data paint a different picture. From quarter to quarter, the estimated "today" density ranges from 1.90 to 2.00 vehicles per household in the 5 periods analyzed, a spread of roughly 5 percent, which, according to the analyst, equates to 5 million vehicles! Given current vehicle sales of about 13 million per year, 5 million is almost 40% of the market.
So, one wonders how this survey was done, and one wants to know how large the margin of error of this estimate is. I also want to know whether the survey produces estimates of the number of households as well, since the vehicles-per-household metric has two variable components.
The reader must first read the beginning pages of the report to find one's bearings. The two charts are supposed to investigate the correlation between streaming video and regular TV viewing. What causes the confusion is that the populations being analyzed differ between the two charts.
In the left chart, they exclude anyone who does not watch streaming video (35% of the sample), and then divide those who watch streaming video into five equal-sized segments based on how much they watch. Then, they look at how much regular TV each segment watches on average.
In the right chart, they exclude anyone who does not watch regular TV (just 0.5% of the sample), and then divide those who watch regular TV into five equal-sized segments based on how much they watch. Then, they look at how much online streaming video each segment watches on average.
What crosses us up is the relative scales. Regular TV viewing is tightly clustered between 212 and 247 daily minutes on the left chart but ranges widely between 24 and 522 daily minutes on the right chart. The impression given by the designer is that the same population (18-34 year olds) is divided into five groups (quintiles) for each chart, albeit using different criteria. It just doesn't make sense that the group averages do not match.
The reason for this mismatch is the hugely divergent rates of exclusion as described above. What the chart seems to be saying is that the 65% who use streaming video have very similar TV viewing behavior (about 220 daily minutes). In other words, we surmise that most of those people on the left chart map to groups 2 and 3 on the right chart.
Who are the people in groups 1, 4 and 5 on the right chart? It appears that they are the 35% who don't watch streaming video. Thus, the real insight of this chart is that there are two types of people who don't watch streaming video: those who watch very little regular TV at all, and those who watch twice the average amount of regular TV.
Here's another puzzle: Nielsen claims that high streaming = low TV and low streaming = high TV. Is it really true that high streaming = low TV? Take the segment of highest streaming (#1 on the left chart). This group, which is 13% of the survey population, accounts for 83% of the streaming minutes -- almost 71,000 out of 86,000 minutes. Now look at the right chart. It turns out that the streaming minutes are quite evenly distributed among those TV-based quintiles, ranging from 15,000 minutes to 23,000 minutes each.
So, it is impossible to fit all of the top streaming quintile into any one TV quintile - they have too many streaming minutes. In fact, the top streaming quintile must be quite spread out among the TV quintiles since each of the TV quintiles is 1.5 times the size of a streaming quintile!
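The quintile arithmetic above can be sketched out explicitly. All figures are the ones quoted in the text; the variable names are mine:

```python
streaming_watchers = 65.0   # % of the survey; 35% excluded on the left chart
tv_watchers = 99.5          # % of the survey; 0.5% excluded on the right chart

streaming_quintile = streaming_watchers / 5   # 13% of the population
tv_quintile = tv_watchers / 5                 # ~19.9% of the population

top_streamers_minutes = 71_000    # streaming minutes of the top streaming quintile
max_tv_quintile_minutes = 23_000  # largest streaming total among the TV quintiles

# Each TV quintile is about 1.5 times the size of a streaming quintile...
print(round(tv_quintile / streaming_quintile, 2))

# ...and no single TV quintile has enough streaming minutes to contain
# the top streaming group, so that group must be spread across them.
print(top_streamers_minutes > max_tv_quintile_minutes)  # True
```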
So, we must conclude that customers who stream a lot include both fervent TV fans as well as those who watch little TV.