Tricks of the trade 2
Jun 29, 2007
In a previous post, I explained the value of sketching when creating graphs. Today, I will share a few other graphs that plot the same data we discussed the other day: the proportion of time spent on developing different modules of software.
A stacked column chart, suggested by John J., would look like this:
Compared to the profile chart, this chart has some weaknesses:
- it's difficult to read off the proportions for middle blocks like Blinksale-Billing;
- because the middle blocks "float", it is impossible to compare them properly;
- it requires as many colors as there are variables.
These problems get worse as the data set grows: the values become harder to read off, and more colors are needed.
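To make the "floating" problem concrete, here is a minimal Python sketch of where each block starts and ends in a stacked column versus a profile chart. The percentages are made up for illustration (they are not the figures behind the charts), and `stacked_segments` and `profile_segments` are hypothetical helper names.

```python
# Hypothetical percentages, invented for illustration only.
# Order of values: Application, Marketing, Support, Billing.
SHARES = {
    "WuFoo": [50, 20, 20, 10],
    "Blinksale": [25, 25, 25, 25],
}


def stacked_segments(shares):
    """Return (bottom, top) for each block when the shares are stacked in order."""
    segments = []
    bottom = 0
    for share in shares:
        segments.append((bottom, bottom + share))
        bottom += share
    return segments


def profile_segments(shares):
    """Return (bottom, top) for each value when every series starts at zero."""
    return [(0, share) for share in shares]


for company, shares in SHARES.items():
    print(company, "stacked:", stacked_segments(shares))
    print(company, "profile:", profile_segments(shares))
```

In the stacked version, the second block starts at a different height in each column, so the reader has to subtract two positions to recover its size. In the profile version, every value is measured from zero.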
The Marimekko, suggested by Bernard L., is the same chart as above except that the widths of the columns are made proportional to the relative number of lines of code. However, because these four companies do not make up the entire universe, proportional widths make little sense here.
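The width calculation can be sketched as follows. The line counts are invented for illustration, and `marimekko_columns` is a hypothetical helper, not part of any charting library.

```python
# Hypothetical lines-of-code counts, invented for illustration only.
LINES_OF_CODE = {
    "WuFoo": 1000,
    "FeedBurner": 3000,
    "Blinksale": 2000,
    "regonline": 2000,
}


def marimekko_columns(sizes):
    """Return {name: (left_edge, width)} with widths proportional to size.

    Widths are fractions of the total across the items supplied, so they
    always sum to 1 regardless of what has been left out of the data.
    """
    total = sum(sizes.values())
    columns = {}
    left = 0.0
    for name, size in sizes.items():
        width = size / total
        columns[name] = (left, width)
        left += width
    return columns


for name, (left, width) in marimekko_columns(LINES_OF_CODE).items():
    print(f"{name}: starts at {left:.3f}, width {width:.3f}")
```

Note that the denominator is the total for the four companies shown. The widths describe each company's share of this sample only, which is why they say little when the sample is not the whole universe.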
The profile chart can be drawn up in two ways:
These charts typically display the results of cluster analysis, a statistical data mining technique that discovers groups of like objects within a large data set. Often, the computer will only tell you that these 15 belong to Cluster 1, those 22 form Cluster 2, etc.
To figure out why the 15 belong together, the analyst needs to plot the explanatory variables against cluster index. Now, think of WuFoo, FeedBurner, etc. as clusters, and the proportion of code given to Application, etc. as variables.
While the line segments don't signify anything real, they trace out the precise paths our eyes would take when reading the stacked column chart above! Remember we wanted to compare the number of lines given to each function across companies. If shown the column chart, my eyes would flip across the top of the Application (blue) blocks from WuFoo to regonline. This path is exactly the brown line on our first profile chart.
The numbers for Marketing, Support and Billing are much easier to read too as they all start from zero for each company.
The right chart is another possibility but for this particular situation, I prefer the left one.
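The two layouts use the same numbers and differ only in which dimension supplies the lines. Here is a minimal sketch, assuming (from the description above) that the left chart draws one line per function across companies and the right chart draws one line per company across functions. The percentages are again invented, and the function names are hypothetical.

```python
COMPANIES = ["WuFoo", "FeedBurner", "Blinksale", "regonline"]
FUNCTIONS = ["Application", "Marketing", "Support", "Billing"]

# Rows are companies, columns are functions.
# Hypothetical percentages, invented for illustration only.
BY_COMPANY = [
    [50, 20, 20, 10],
    [60, 10, 20, 10],
    [40, 30, 20, 10],
    [70, 10, 10, 10],
]


def lines_per_company(table):
    """One line per company; the x-axis runs over functions."""
    return dict(zip(COMPANIES, table))


def lines_per_function(table):
    """One line per function; the x-axis runs over companies (the transpose)."""
    columns = (list(column) for column in zip(*table))
    return dict(zip(FUNCTIONS, columns))


print(lines_per_function(BY_COMPANY)["Application"])
print(lines_per_company(BY_COMPANY)["Blinksale"])
```

Sketching both versions costs one transpose, so it is cheap to try each and see which comparison the reader is meant to make.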
Finally, I am less familiar with the "parallel coordinates plot" that Derek talked about. I believe it is a variant of the profile chart.
Profile charts are just another name for the parallel coordinates plot (and I believe that pcp is the more "statistical" term, for what that's worth).
Posted by: Hadley Wickham | Jun 29, 2007 at 02:40 AM
Basically, it's what you just described: an exploratory data analysis (EDA) device for comparing the properties (on an interval scale) of a group of objects (on a nominal scale) using lines.
As I said, I used to think lines were a no-no unless there was a clear sequence from object to object (i.e. must be an ordinal or interval scale) but now I personally downgrade that rule to a weak guideline; weaker, for example, than the "rule" which says the interval scale of a line chart must start at zero.
Posted by: derek | Jun 29, 2007 at 04:15 AM
Sorry, that was incoherent: the objects are the lines, one line per object. The properties are the nominal categories, and the values of those properties are on the interval scale.
I would also note that each property may be expressed in different units, or if the same units, have a different scale. They are each scaled to use the vertical space fully, from top to bottom (or top at least, if starting from the zero). The vertical scales may not be supplied, as each property would require its own. Usually they are omitted, and you are expected to concentrate on the distribution: is this point far from the pack, and if so, what other points on the line are also far from the pack?
The number of properties is moderate, but usually larger than four, and the number of objects is usually much larger than four. The lines are not individually identified by colour, but are uniform. You are expected to pick out the unusual ones by eye, and apply some colour to those: the rest remain a mess of (e.g.) gray lines.
In the example above, the properties are the lines, and the objects are on the nominal scale, making this not the same as a parallel coordinates plot. Because of this, it is necessary for the properties to share the same scale. If not, each scale must be visibly supplied and labelled.
Posted by: derek | Jun 29, 2007 at 04:31 AM
Actually, I prefer a stacked bar chart, sorted so the size of the bars runs from high to low (or vice versa).
The lines above don't work well when the items they connect do not form a trend, e.g. a time series.
Posted by: dermot | Jun 29, 2007 at 04:43 AM
Stephen Few has a PDF article on parallel coordinates and their use in business intelligence.
These are going to be rare at Junk Charts for several reasons: first, they are more often used as an interactive display for exploration than a static display for presentation; second, they typically show high-dimensional data sets with large numbers of objects, which are rarely displayed in the media where they can be criticised and given the junk art treatment; and third, if they are, it's unusual for the data to be separately available, and impractical to reconstruct by hand from the graph itself.
Finally, my version of Kaiser's line chart done as parallel coords, with each line being a startup co., and each category on the x axis being a spend purpose.
I don't want to go on about them too much: I was just using them as an example of the sort of display that convinced me lines weren't verboten for nominal x-axes.
Posted by: derek | Jun 29, 2007 at 06:55 AM
Damn, I just realised Kaiser already did one.
Posted by: derek | Jun 29, 2007 at 06:57 AM
Sketching is important. You need to slice and dice the data several ways before a clear (or at least clearer) picture of the behavior emerges.
While a stacked column is not ideal for final presentation, it is useful as an initial sketch. In addition to changing chart type, one needs to rearrange the order of series and categories. For example, Kaiser and Derek each produced parallel coord/profile charts with Blinksale making a uniformly sloped line (albeit in opposite directions). I might have reordered Kaiser's first profile chart to put blinksale either far left or far right on the horizontal axis.
Another thing you often discover while sketching the relationships is that it might take two or three charts to clearly display all aspects of the data. I think that Kaiser's second profile chart is very effective, but in conjunction with the first, it gives a better picture.
Many people try to get two charts' worth of information into a single chart using secondary axes and combinations of different chart types, but two adjacent charts are more quickly understood.
Posted by: Jon Peltier | Jun 29, 2007 at 08:50 AM
If the application category is the most important, then it may not be a problem if the middle bars float. But four categories is probably the maximum that I would use for the display.
Posted by: John Johnson | Jun 29, 2007 at 11:12 AM
I have to agree with dermot: I don't believe in using line graphs when you're not plotting a series on the x-axis.
Posted by: Darius K. | Jun 29, 2007 at 12:15 PM
I'm with Darius. A line graph for nominal categories is bad practice.
Of what's been shown, I like the stacked bar the best.
Posted by: zbicyclist | Jun 29, 2007 at 11:08 PM
Another option for those who vehemently dislike the lines is to do a set of small multiples (aka trellises aka co-plots) of bars for each of the companies.
Some demonstration of the interactive use of parallel coordinate plots is available on the GGobi site. Interaction is highly software specific, and there are few features that are implemented by all interactive pcp software.
Derek: I don't believe that the type of scaling influences whether a plot is a parallel coordinates plot or not. The key feature is the parallel axes, i.e., you are drawing a projective coordinate system rather than the more usual Cartesian coordinate system. Both of the above examples are pcps.
Posted by: Hadley Wickham | Jun 30, 2007 at 05:43 AM
"A line graph for nominal categories is bad practice."
This generally is true, but Kaiser is not drawing a line chart per se, but a parallel coordinate plot. The lines are understood not to convey a trend but merely to connect related points.
"I prefer a stacked bar chart."
The problems here are at least twofold.
1. Since points in the same category are stacked end-to-end rather than side-by-side, it is difficult to compare their relative values.
2. Since points in the same series in different stacks have different baselines (i.e., the tops of the bars they are stacked upon), it is difficult to compare their relative values.
Posted by: Jon Peltier | Jul 09, 2007 at 09:40 AM