There is a tendency when producing dashboards to go for the cutesy-cutesy. Reader Daniel L. came across an attempt by Facebook to document its data center metrics (link). They chose this circular, spiraling design:
Notice that on a circular plot, the lines of equal value are the concentric circles. Thus, when the designers connect successive data points with straight segments, as if this were a standard line chart, the segments between points are distorted. The diagram below shows the problem:
One potential advantage (though not a worthwhile one) of wrapping the data into a circle is that the 24 hours become a continuous line. Except that isn't the case here! Weirdly, the purple and blue lines show a huge discontinuity at the ray pointing vertically upwards from the origin. This leads to an even more fascinating find.
The circle actually rotates! It's like a rotating restaurant. The time shown vertically pointing upwards keeps changing as I write this post. This makes the discontinuity even more baffling. You'd think the previous data point just shifts anti-clockwise but apparently not. If any of you can figure this out, please leave a comment.
As Daniel pointed out, the traditional line charts shown in the bottom half of the page would have done the job with less fuss. Not as eye-catching, but not as baffling either.
One innovation of on-line charts is the replacement of axis labels with mouse-over effects. Mousing over the chart here produces the underlying data values. This is elegant.
One horrendous trend with on-line charts is the poor choice of scale. Look at the top two charts, especially the orange line chart about power usage. It makes no sense to choose a scale that completely annihilates the underlying fluctuations.
I have found the same problems with many Google charts. It looks as if nothing is happening, but when you look more closely, you learn that a tiny distance represents a big percentage shift in the underlying data.
When we visualize data, we want to expose the information contained within, or to use the terminology Nate Silver popularized, to expose the signal and leave behind the noise.
When graphs are not done right, sometimes they manage to obscure the information.
Reader John H. found a confusing bar chart while studying a paper (link to PDF) in which the authors compared two algorithms used to determine the position of Wi-Fi access points under various settings.
The first reaction might be that the researchers are telling us there is no information here. The most important variable on this chart is what they call "datanum", and it runs from left to right across the page. A casual glance across the page gives the impression that nothing much is going on.
Then you look at the row labels, and realize that this dataset is very well structured. The target variable (AP Position Error) is compared along four dimensions: datanum, the algorithm (WCL or GPR+WCL), the number of access points, and the location of these access points (inner, boundary, all).
When the data has a nice structure, there should be better ways to visualize it.
John submitted a much improved version, which he created using ggplot2.
This is essentially a small multiples chart; a rough ggplot2 sketch of this structure follows the list below. The key differences between the two charts are:
Giving more dimensions a chance to shine
Spacing the "datanum" proportional to the sample size (we think "datanum" means the number of sample readings taken from each access point)
Using a profile chart, which also allows the y-axis to start from 2
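This is not John's actual code, just a minimal sketch of what a faceted profile chart along these lines might look like in ggplot2. The data frame and column names (ap_error, datanum, algorithm, location, error) are my guesses at how the table could be arranged in long format.

```r
library(ggplot2)

# ap_error: hypothetical long-format data frame with columns
#   datanum   - number of sample readings per access point (numeric, so spacing is proportional)
#   algorithm - "WCL" or "GPR+WCL"
#   location  - "inner", "boundary" or "all"
#   error     - AP position error
ggplot(ap_error, aes(x = datanum, y = error, colour = algorithm)) +
  geom_line() +
  geom_point() +
  facet_grid(. ~ location) +
  coord_cartesian(ylim = c(2, 8)) +   # start the y-axis at 2; the upper bound here is arbitrary
  labs(x = "Number of samples per AP", y = "AP position error")
```

Putting location into the facets and algorithm into the colour is what gives each dimension a chance to shine, instead of burying them in row labels.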
When you read this chart, you finally realize that the experiment has yielded several insights:
Increasing the sample size does not affect the aggregate WCL error rate, but it does reduce the aggregate GPR+WCL error rate.
The improvement of GPR+WCL comes only from the inner access points.
The WCL algorithm performs really well in inner access points but poorly in outer access points.
The addition of GPR to the WCL algorithm improves the performance of outer access points but deteriorates the performance of inner access points. (In aggregate, it improves the performance... this is only because there are almost two outer access points to every inner access point.)
Now, I don't know anything about this position estimation problem. The chart leaves me wondering why they don't just use WCL on inner access points. The performance under that setting is far and away the best of all the tested settings.
The researchers described their metric as AP Position Error (2drms, 95% confidence). I'm not sure what they mean by that because when I see 95% confidence, I expect to see confidence bands around the point estimates shown above.
And yet, the data table shows only point estimates -- in fact, estimates to two decimal places of precision. In statistics, the more precision you have, the less confidence.
Quite a few problems crop up here. The most hurtful is that the context of the chart is left to the text. If you read the paragraph above, you'll learn that the data represents only a select group of institutions known as the Russell Group; and in particular, Cambridge University was omitted because "it did not provide data in 2005". That omission is a curious decision as the designer weighs one missing year against one missing institution (and a mighty important one at that). This issue is easily fixed by a few choice words.
You will also learn from the text that the author's primary message is that among the elite institutions, little if any improvement has been observed in the enrollment of (disadvantaged) students from "low participation areas". This chart draws our attention to the tangle of up and down segments, giving us the impression that the data is too complicated to extract a clear message.
The decision to use 21 colors for 21 schools is baffling as surely no one can make out which line is which school. A good tip-off that you have the wrong chart type is the fact that you need more than say three or four colors.
The order of institutions in the legend is approximately the reverse of their order of appearance in the chart. If software can be "intelligent", I'd hope that it could automatically sort the legend entries to match.
If the whitespace were removed (I'm talking about the space between 0% and 2.25% and between 8% and 10%), the lines could be more spread out, and perhaps labels could be placed next to the vertical axes to simplify the presentation. I'd also delete "Univ." with abandon.
The author concludes that nothing has changed among the Russell Group. Here is the untangled version of the same chart. The schools are ordered by their "inclusiveness" from left to right.
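For readers who want to attempt the untangling themselves, here is a rough ggplot2 sketch under assumed column names (russell, with school, year, pct_low); it orders the schools by their average proportion of low-participation students and gives each school its own panel, so no colour legend is needed.

```r
library(ggplot2)

# russell: hypothetical data frame with columns school, year, pct_low
# order the schools by their average share of students from low-participation areas
russell$school <- reorder(russell$school, russell$pct_low, FUN = mean)

ggplot(russell, aes(x = year, y = pct_low)) +
  geom_line() +
  facet_wrap(~ school, nrow = 3) +
  labs(x = NULL, y = "Students from low-participation areas (%)")
```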
This is a case where the "average" obscures a lot of differences between institutions and even within institutions from year to year (witness LSE).
In addition, I see a negative reputation effect, with the proportion of students from low-participation areas decreasing with increasing reputation. I'm basing this on name recognition. Perhaps UK readers can confirm if this is correct. If correct, it's a big miss in terms of interesting features in this dataset.
Andrew Wheeler took the time to write code (in SPSS) to create the "Scariest Chart ever" (link). I previously wrote about my own attempt to remake the famous chart in grayscale. I complained that this is a chart that is easier to make in the much-maligned Excel paradigm than in a statistical package: "I find it surprising how much work it would be to use standard tools like R to do this."
Andrew disagreed, saying "anyone saavy with a statistical package would call bs". He goes on to do the "Junk Charts challenge," which has two parts: remake the original Calculated Risk chart, and then, make the Junk Charts version of the chart.
I highly recommend reading the post. You'll learn a bit of SPSS and R (ggplot2) syntax, and the philosophy behind these languages. You can compare and contrast different ways of creating the charts, and you can compare the output of the various programs.
I'll leave you to decide whether the programs he created are easier than Excel.
Unfortunately, Andrew skipped over one of the key challenges that I envision for anyone trying to tackle this problem. The data set he started with, which he found from the Minneapolis Fed, is post-processed data. (It's a credit to him that he found a more direct source of data.) The Fed data is essentially the spreadsheet that sits behind the Calculated Risk chart. One can just highlight the data, and create a plot directly in Excel without any further work.
What I started with was the employment level data from the BLS. What such data lacks is the definition of a recession, that is, the starting and ending year of each recession. The data also comes in calendar months and years, and transforming that into "months from start of recession" is not straightforward. If we don't want to "hard code" these details, i.e., we want the definition of a recession to be flexible so that the application is more general, the challenge is even greater.
Another detail that Andrew skimmed over is the uneven length of the data series. One of the nice things about the Calculated Risk chart is that each line terminates upon reaching the horizontal axis. Even though more data is available for out years, that part of the time series is deemed extraneous to the story. This creates an awkward dataset where some series have say 25 values and others have only 10 values. While most software packages will handle this, more code needs to be written either during the data processing phase or during the plotting.
By contrast, in Excel, you just leave the cells blank where you want the lines to terminate.
In the last section, Andrew did a check on how well the straight lines approximate the real data. You can see that the approximation is extremely good. (The two panels where there seems to be a difference are due to a disagreement in the data as to when the recession started. If you look at 1974 instead of 1973, and also follow Calculated Risk's convention of having a really short recession in 1980, separate from that of 1981, then the straight lines match superbly.)
I'm the last person to say Excel is the best graphing package out there. That's not the point of my original post. If you're a regular reader, you will notice I make my graphs using various software, including R. I came across a case where I think current software packages are inferior, and would like the community to take notice.
Here's a chart from one of the Italian dailies I picked up in Rome last August. It apparently plots the number of hectares of farmland burnt in various fires over time.
While the chart is clean and pleasing to the eye, it has a malformed time axis. In the side-by-side comparison shown below, you can see how the evenly-spaced time axis completely distorts the cadence of the data.
In fact, the data should be put into a bar chart rather than a line chart. Lines are used primarily to denote trends, and sometimes to compare profiles. Neither of these cases applies here.
The bar chart also requires proper spacing, so as to show the years in which no hectares were burnt.
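Here is a rough ggplot2 sketch of that idea, with invented numbers (these are not the figures from the Italian chart): fill in the missing years with zero before plotting, so that fire-free years appear as gaps at the right place on the axis.

```r
library(ggplot2)

# invented data purely for illustration: hectares burnt, with some years missing entirely
fires <- data.frame(year = c(1993, 1997, 1998, 2003, 2007, 2012),
                    hectares = c(116000, 41000, 56000, 92000, 103000, 74000))

# complete the year sequence so that fire-free years appear as zero-height bars
all_years <- data.frame(year = seq(min(fires$year), max(fires$year)))
fires_full <- merge(all_years, fires, all.x = TRUE)
fires_full$hectares[is.na(fires_full$hectares)] <- 0

ggplot(fires_full, aes(x = year, y = hectares)) +
  geom_col() +
  labs(x = NULL, y = "Hectares burnt")
```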
One of the best charts depicting our jobs crisis is the one popularized by the Calculated Risk blog (link). This one:
I think a lot of readers have seen this one. It's a very effective chart.
The designer had to massage the data in order to get this look. The data published by the government typically gives an estimated employment level for each month of each year. The designer needs to find the beginning and ending months of each previous recession. Then the data needs to be broken up into unequal-length segments. A month counter then needs to be set up for each segment, resetting to zero at each new recession. All this creates the effect of time-shifting.
And we're not done yet. The vertical axis shows the percentage job losses relative to the peak of the prior cycle! This means that for each recession, he has to look back at the prior cycle and extract the peak employment level, which is then used as the base for computing the percentage being plotted.
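To make those steps concrete, here is a rough R sketch, not the designer's code. It assumes a monthly employment series empl (columns date and employment) and a hand-coded vector of recession start dates, and it takes the employment level at the start of each recession as the prior peak, which is a simplification.

```r
# empl: assumed data frame with one row per month, columns date and employment
# recession_starts: assumed hand-coded vector of dates, one per recession

shift_and_scale <- function(empl, start) {
  peak <- empl$employment[empl$date == start]     # peak employment of the prior cycle (simplified)
  post <- empl[empl$date >= start, ]
  data.frame(recession = format(start, "%Y"),
             month     = seq_len(nrow(post)) - 1, # month counter, reset to zero at each recession
             pct_loss  = post$employment / peak - 1)
}

job_loss <- do.call(rbind, lapply(seq_along(recession_starts), function(i)
  shift_and_scale(empl, recession_starts[i])))
```

The month counter does the time-shifting; pct_loss is what ends up on the vertical axis, one line per recession.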
One thing you'll learn quickly from doing this exercise is that this is a task ill-suited for a computer (so-called artificial intelligence)! The human brain together with Excel can do this much faster. I'm not saying you can't create a custom-made application just for the purpose of creating this chart. That can be done and it would run quickly once it's done. But I find it surprising how much work it would be to use standard tools like R to do this.
Let me get to my point. While this chart works wonders on a blog, it doesn't work on the printed page. There are too many colors, and it's hard to see which line refers to which recession, especially if the printed page is grayscale. So I asked CR for his data, and re-made the chart like this:
You'd immediately notice that I have liberally applied smoothing. I modeled every curve as a V shape with two linear segments: the left arm showing the average rate of decline leading to the bottom of the recession, and the right arm showing the average rate of growth taking us out of the doldrums. If you look at the original chart carefully, you'd notice that these two arms suffice to represent pretty much every jobs trend... all the other jittering is just noise.
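For the curious, here is a minimal sketch of how such a two-segment smoothing could be done, reusing the job_loss layout assumed above. It illustrates the idea rather than reproducing the exact code behind my chart.

```r
# replace each recession's curve with a V: a straight descent to the trough,
# then a straight climb back out; the slopes are the average monthly rates
v_shape <- function(d) {
  d <- d[order(d$month), ]
  trough <- which.min(d$pct_loss)
  down <- data.frame(month = d$month[1:trough],
                     pct_loss = seq(0, d$pct_loss[trough], length.out = trough))
  up <- data.frame(month = d$month[trough:nrow(d)],
                   pct_loss = seq(d$pct_loss[trough], d$pct_loss[nrow(d)],
                                  length.out = nrow(d) - trough + 1))
  rbind(down, up[-1, ])
}

smoothed <- do.call(rbind, lapply(split(job_loss, job_loss$recession), function(d)
  cbind(recession = d$recession[1], v_shape(d))))
```

Grouping the smoothed curves into decades is then just a matter of assigning each recession label to a facet.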
I also chose a small-multiples setup to separate the curves into groups by decade. When you only have one color, you can't have ten lines plotted on top of one another.
One can extend the 2007 recession line to where it hits the 0% axis, which would really make the point that the jobs crisis is unprecedented and inexplicably not getting any kind of crisis management.
(Meanwhile, New York City calls a crisis with every winter storm... It's baffling.)
Felix linked to a set of charts about guns in the U.S. (and elsewhere). The original charts, by Liz Fosslien, are found here.
I like the clean style used by Fosslien. Some of the charts are thought-provoking. Many of them may raise more questions than they answer. Here are a few that caught my eye.
A simplistic interpretation would claim that banning handguns is futile, and may even have an adverse impact on the murder rate. However, this chart does not reveal the direction of causality. Did some countries ban handguns because they were reacting to higher violence? If so, this chart is confirming that the countries with handgun bans are a self-selected group.
The U.S. is an outlier, both in terms of firearm ownership and firearm homicides. This makes the analysis much harder because the U.S. is really in a class of its own. It's not at all clear whether there is a positive correlation in the cluster below, and even if there is, whether we can draw a straight line up to the U.S. dot is also dubious.
Fosslien is being cheeky in denying us the identity of the other outlier, the country with few firearms but an even higher death rate from intentional homicide. These scatter plots are, by the way, great for showing bivariate distributions.
I'd still prefer a line chart for this type of data, but this particular paired bar chart works for me as well. The content of this chart is a shock to me.
Reader Steve S. tried to spoil my new year with this chart he didn't like:
Or maybe he's just chiding me for recommending Bumps charts. This example is very confusing, a tangled mess.
But not so fast.
The dataset has two characteristics that don't sit well with bumps charts. One is that too many things are being ranked (twenty). The other is that too much rank swapping happens over time (14 periods).
The latter challenge can be tamed by aggregating the time dimension. For some reason, the period under examination was the first half year after the debut of these computers. Do we really need to know the weekly statistics?
We can keep all 14 periods. If so, we should be judicious in selecting the colors, the line styles (solid versus dashed), the gridlines, and so on. In particular, look for a story and use foreground/background techniques to highlight it.
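Here is a rough sketch of the foreground/background idea in ggplot2, under assumed column names (a long-format data frame ranks with brand, week, rank). It computes each brand's net rank change, draws every line in light gray, and overlays the biggest movers in color; the cut-off of 5 ranks is arbitrary.

```r
library(ggplot2)

# ranks: hypothetical long-format data frame with columns brand, week, rank
# net rank change between the first and last week, per brand
net_change <- sapply(split(ranks, ranks$brand), function(d) {
  d <- d[order(d$week), ]
  abs(d$rank[nrow(d)] - d$rank[1])
})
movers <- names(net_change)[net_change >= 5]   # arbitrary cut-off for "top movers"

ggplot(ranks, aes(x = week, y = rank, group = brand)) +
  geom_line(colour = "grey80") +                          # background: all twenty brands
  geom_line(data = ranks[ranks$brand %in% movers, ],
            aes(colour = brand)) +                        # foreground: the biggest movers
  scale_y_reverse() +                                     # rank 1 at the top
  labs(x = "Week", y = "Rank")
```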
Here's a version that focuses on the brands that moved the most number of ranks either up or down during this period:
Here's one that tracks how the top five fared over this period of time. It turns out that despite all the noisy movements, not much happened at the top of the rankings:
Not knowing many of these computer brands, I really have no idea why seven colors were used and why different tints of the six colors were chosen. I also don't have a clue why some lines were dashed and others were solid.
Looking closely, I learn that the Sony PC was given a black color because its label does not show up on either side. It was a product that did not rank among the top 20 at the start nor at the end of this time period. This Sony PC should be consigned to the dustbin of history, and yet in the color scheme selected for the original chart, the black solid line is the most visible!
I'd like to see an interactive layer added to this chart that brings out the "information". Two of the tabs can be "top movers" and "top five brands" as discussed above. If you hover over these tabs, the appropriate lines are highlighted.
Visualizing data has many uses. We often explore how charts can be used to convey data insights and tell stories. We talk less on this blog about how slicing and dicing data helps us form impressions about the structure of the data sets we're analyzing.
I have been digging around some payroll employment data recently. (You can find the data at the Bureau of Labor Statistics website.) I think the following two charts are quite instructive.
The first one surfaces one type of recurring pattern: there is a seasonal pattern running from January to December that repeats every year. I use a small-multiples setup, with each chartlet indexed by year.
The second chart shows a different kind of regularity: there is a cyclical pattern running from 2002 to 2012, no matter which month we're looking at. Again, we have a small-multiples setup, this time with each chartlet indexed by a month of the year.
This second chart is a simple form of "seasonal adjustment". The data used in this plot are unadjusted. The chart shows that there is a larger cyclical pattern during the period of 2002-2012 that affects every month of the year.
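For readers who want to reproduce this kind of view, here is a minimal ggplot2 sketch, assuming the BLS series has been read into a data frame payroll with columns year, month, and employment (the names are mine). The first plot facets by year, the second by month.

```r
library(ggplot2)

# payroll: assumed data frame with columns year (2002-2012), month (1-12), employment

# one chartlet per year: the within-year (seasonal) profile
ggplot(payroll, aes(x = month, y = employment)) +
  geom_line() +
  facet_wrap(~ year) +
  scale_x_continuous(breaks = c(1, 6, 12))

# one chartlet per month: the across-year (cyclical) profile
ggplot(payroll, aes(x = year, y = employment)) +
  geom_line() +
  facet_wrap(~ month)
```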
I already hear grumbling about using a line chart when there is no continuity from one dot to the next. In this chart, in fact, time runs left to right, top to bottom, then starts again at the first chartlet, and so on. This is a profile chart. As the name suggests, we should be focused on the shape of the line. It doesn't have to have physical meaning; we are only looking for regularity.
Statisticians love to find this kind of regular pattern because it is easy to describe. Of course, most data are much messier.