
Is it worth the drama?

Quite the eye-catching chart this:


The original accompanied this article in the Wall Street Journal about avian flu outbreaks in the U.S.

The point of the chart appears to be the peak in the flu season around May. The overlapping bubbles were probably used for drama.

A column chart, with appropriate colors, keeps much of the drama while retaining the ability to read the data.






I agree, the bubble chart might not be the best. IMO, the column chart a) is not visually pleasing and b) marginalizes the trend the authors are trying to emphasize, reducing it to about 4-6 lines in the histogram. Perhaps something more akin to a violin chart would be better? This is a case where the individual data points are not the -point-.


Bob: Smoothing the data first will solve this problem. The "trend" here requires interpolation.


I agree that this particular bar chart is not visually pleasing, and also that it doesn't quite drive the point home (though I don't think the original made much of a point either).

I don't agree that showing the point as "only" 4-6 lines in a histogram is marginalizing it. 4-6 lines in a histogram can mean a great deal.

Interpolation, or simply aggregation, will go a long way to fix that.

But I think the main point is that the bubbles, other than looking pretty, aren't really telling us much, and there are better ways to show it.
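To make the aggregation and smoothing suggested in these comments concrete, here is a minimal sketch. It assumes the data arrive as (date, birds destroyed) pairs; the function names and the counts are made up for illustration and are not the WSJ figures. It first sums daily counts into weeks, then applies a centered moving average.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_totals(records):
    """Sum daily counts into weeks, keyed by the Monday of each week."""
    totals = defaultdict(int)
    for day, count in records:
        monday = day - timedelta(days=day.weekday())
        totals[monday] += count
    return dict(sorted(totals.items()))

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Made-up illustrative counts, NOT the data behind the chart.
daily = [
    (date(2015, 4, 20), 100),
    (date(2015, 4, 22), 50),
    (date(2015, 4, 28), 300),
    (date(2015, 5, 5), 900),
    (date(2015, 5, 7), 100),
    (date(2015, 5, 12), 400),
]

weekly = weekly_totals(daily)
smoothed = moving_average(list(weekly.values()))
```

Weekly bins already remove most of the day-to-day spikes; the moving average is an optional second step, and a wider window trades detail for a cleaner trend.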


I had earlier wanted to link to Kosara's response to this blog post, but Typepad's spam filter struck again and removed a comment by the author of the blog!

Here is Kosara's take on this post: link.

As you can see from the above comments, I agree with him that smoothing helps.

I retained the spikes because the original chart highlighted certain days. But I agree that those dates were probably not very meaningful.


My first issue here is that the data doesn't directly talk about the number of infections, but rather reports the number of birds killed; without knowing more about the relationship between infection and culling we can't judge to what extent the one is a good proxy for the other.

I also think that looking at just the totals is misleading: there are more interesting stories here than these aggregate charts capture.

For example, a brief study of the data reveals that there are two strains of the virus: H5N8 and H5N2; the first large culls (23 Jan and 12 Feb) are of flocks with H5N8, so could (should?) be omitted if the story is about H5N2.

Based on the reported species, the vast majority (99.8%) of birds destroyed were chickens and turkeys of one sort or another. Keeping just these two groups, disregarding the ducks, pheasant and other or mixed species, and looking at the timeline, the virus appears to affect turkeys first and chickens later.

This detail could lead to hypotheses about spread from the smaller (?) turkey population through mixed flocks to the large chicken flocks.

I don't have knowledge of poultry or virology, so don't know if these are valid concerns and hypotheses, but these are the sorts of stories I'd like to tease out from the data.
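The breakdown described in this comment could be sketched roughly as follows. The row layout (date, strain, species, birds destroyed), the function name, and every count are assumptions for illustration only; the real data set may be organized differently. The sketch keeps one strain, drops species other than chickens and turkeys, and totals by month so the two timelines can be compared.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (date, strain, species, birds destroyed).
# All counts are made up for illustration.
rows = [
    (date(2015, 1, 23), "H5N8", "turkey", 100),
    (date(2015, 3, 5), "H5N2", "turkey", 200),
    (date(2015, 4, 1), "H5N2", "turkey", 300),
    (date(2015, 4, 20), "H5N2", "chicken", 5000),
    (date(2015, 5, 1), "H5N2", "duck", 10),
]

def species_timeline(rows, strain, species_keep=("chicken", "turkey")):
    """Monthly totals per species for one strain, ignoring other species."""
    timeline = defaultdict(lambda: defaultdict(int))
    for day, row_strain, species, count in rows:
        if row_strain != strain or species not in species_keep:
            continue
        timeline[species][(day.year, day.month)] += count
    # Convert to plain dicts with months in chronological order.
    return {s: dict(sorted(m.items())) for s, m in timeline.items()}

timeline = species_timeline(rows, "H5N2")
first_month = {s: min(months) for s, months in timeline.items()}
```

Comparing `first_month` across species is one crude way to check the "turkeys first, chickens later" impression; it says nothing about whether culls are a good proxy for infections.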
