## Some chart types are not scalable

##### Mar 21, 2014

Peter Cock sent this Venn diagram to me via Twitter. (Original from this paper.)

For someone who doesn't know genetics, it is very hard to make sense of this chart. It seems like there are five characteristics that each unit of analysis can have (listed on the left column) and each unit possesses one or more of these characteristics.

There is one glaring problem with this visual display. The area of each subset is not proportional to the count it represents. Look at the two numbers in the middle of the chart, each accounting for a large chunk of the area of the green tree. One side says 5,724 while the other says 13, even though both sides have the same area.

In this respect, Venn diagrams are like maps. The area of a country or state on a map is not related to the data being plotted (unless it's a cartogram).

If you know how to interpret the data, please leave a comment. I'm guessing some kind of heatmap will work well with this data.


The tree-shaped Venn diagram is likely inspired by another infamous genome paper figure, the six-set Venn diagram in the banana genome paper in Nature, DOI 10.1038/nature11241 (Figure 4).

That figure managed to keep one unique region for each set combination, which is a subtle complication in the pine tree version. Notice how many of the numbers are repeated? However, there are at least two typos (341 vs 79, and 27 vs 38), which I have emailed the authors about and which will hopefully be fixed for the final PDF of the article; see this annotated image.

Are Venn diagrams ever truly useful for data display?
One this complex certainly never could be.

Most that I have seen make no attempt to make areas match values in any sense, and exist purely to show that items do in fact overlap.

Surely there are better ways to show complex relationships...

Venn diagrams show in/out relationships: is an item inside or outside each set? So a single-circle Venn diagram shows two possibilities, a two-circle one four, a three-circle one eight... and a five-set Venn diagram counts the cases that fall into each of thirty-two areas. Venn diagrams can always be displayed as tables instead, and I'd suggest two (2x2)x(2x2) tables for this one. Obviously a table is two-dimensional, so you get the extra dimensions by nesting or, as in the biggest distinction in my suggested scheme, multiplicity.
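As a rough illustration of the counting argument and the two-table layout, here is a minimal Python sketch. The set labels `A` to `E` are placeholders, not the group names used in the figure, and the choice of which set splits the tables and which sets index rows and columns is just one assumed arrangement.

```python
from itertools import product

# Hypothetical labels: placeholders, not the group names used in the figure.
SETS = ("A", "B", "C", "D", "E")


def regions(n):
    """All in/out combinations for n sets, as tuples of booleans."""
    return list(product((False, True), repeat=n))


# Number of in/out combinations for 1, 2, 3, 4 and 5 sets.
region_counts = [len(regions(n)) for n in range(1, 6)]


def nested_tables():
    """Split the 32 combinations into two 4x4 grids.

    The first set picks the table, the next two sets index the rows,
    and the last two sets index the columns.
    """
    pairs = list(product((False, True), repeat=2))
    return {
        in_first: [[(in_first, *row, *col) for col in pairs] for row in pairs]
        for in_first in (False, True)
    }


def label(cell, sets=SETS):
    """Readable name for a cell, e.g. 'A+C', or 'none' for the outside."""
    return "+".join(s for s, inside in zip(sets, cell) if inside) or "none"


if __name__ == "__main__":
    print(region_counts)
    for in_first, grid in nested_tables().items():
        print(f"--- {SETS[0]} {'in' if in_first else 'out'} ---")
        for row in grid:
            print("  ".join(f"{label(c):<9}" for c in row))
```

Each cell of the two grids corresponds to one area of the diagram (plus the "none" cell for items outside every set), so the published counts could be typed into the matching cells.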

You could enter the numbers into an Excel grid and print it out. It just wouldn't look as cool as the green tree diagram above.

Just after I hit "post", I remembered we've been here before...

(man, 2007? what happened to the time?)

The groups listed on the left are distinct groupings of plants. A given plant can only fall in one of them. I assume you know what conifers and mosses are, roughly, at least. Monocots include plants such as maize, while dicots is the group that includes flowers such as magnolia.

I'm a bit confused why they're still using dicots, to be honest. I thought dicots was no longer used as a grouping because it turned out to be polyphyletic (that is, simplifying a bit, that the group evolved from multiple ancestors, not one).

The numbers then show the number of gene families that are common to different combinations of these major plant groupings. Such diagrams are actually pretty common in genetics, but I've always found they shed as much heat as light, and this one's rather cute decision to use a tree-shaped cutout doesn't make it any easier to read.

Oh, and on the scaling thing; I don't see how it would be possible to display this data to scale - you'd need to be able to show 1 clearly and 5724 clearly.
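To put a number on the range involved, here is a small arithmetic sketch. It assumes the regions would be drawn as similar shapes, so that area grows with the square of linear size; the variable names are made up for illustration.

```python
import math

# Two of the counts mentioned for the figure.
big, small = 5724, 1

# If area were proportional to count, this is the ratio of the two areas.
area_ratio = big / small

# For similar shapes, linear size scales with the square root of area.
linear_ratio = math.sqrt(area_ratio)

if __name__ == "__main__":
    print(f"area ratio {area_ratio:.0f}:1, linear ratio {linear_ratio:.1f}:1")
```

So a region for the larger count would need to be several dozen times wider than the region for the smaller one, while both still have to hold a readable label.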

Jack, I think Kaiser means that Venn diagrams start out cute with two or three sets, but don't scale up to five sets well. Venn diagrams almost never attempt to show the area of the sets proportional to their populations.

It's impossible for me to be sure because I can't see the whole paper, but I think they took the sequences of the loblolly pine genome, and asked "is this sequence common to one of the other pines named in the study, yes/no, at least one of the mosses yes/no, the basal example yes/no, and so on". The result is a 2x2x2x2x2 matrix of numbers of sequences fitting the criteria.
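A minimal sketch of how such a yes/no tally could be built, assuming each record is simply the set of groups in which a sequence (or gene family) was found. The group labels and the toy records are invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Hypothetical group labels; the paper's own names are not reproduced here.
GROUPS = ("group_1", "group_2", "group_3", "group_4", "group_5")


def tally(records, groups=GROUPS):
    """Count records by their yes/no pattern across the groups.

    Each record is the set of groups an item was found in. The result maps
    a 5-tuple of booleans (one per group) to a count, which is the sparse
    form of a 2x2x2x2x2 table.
    """
    counts = Counter()
    for found_in in records:
        pattern = tuple(g in found_in for g in groups)
        counts[pattern] += 1
    return counts


# Toy data, invented for illustration only.
toy_records = [
    {"group_1", "group_2"},
    {"group_1"},
    {"group_1", "group_2"},
    set(GROUPS),
]

counts = tally(toy_records)

if __name__ == "__main__":
    for pattern, n in sorted(counts.items()):
        print(pattern, n)
```

Patterns that never occur are simply absent from the `Counter`, and looking one up returns 0, so all thirty-two cells can still be read off.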

@derek - I think everyone understands *what* a Venn Diagram is intended to do.

The question is, does it do so in any way that's effective or useful?

With this level of complexity, I think the answer is a very clear "No!"

Yes, if you spend long enough studying it, you can come away with some information.

But the same would be true if the data were written in pictograms in the dirt with a stick.

With two or three sets, it may be superficially useful as a visual aid, but still not in a way that provides any depth of understanding of the data.

As far as looking cool, I'd say this one looks like a group of children took turns with a spirograph, one on top of the other :)

"Cool" was meant to be sarcastic.
