One of the most important steps in analyzing data is to remove noise. First, we have to identify where the noise is, then we find ways to reduce the noise, which has the effect of surfacing the signal.
The labor force participation rate data, discussed here and here, can be decomposed into two components, known as the trend and residuals. (See right.) The residuals are the raw data minus the trend; in other words, they are the data after removing the trend.
If the purpose of the analysis is to describe the evolution of the labor force participation rate over time, then the trend is the signal we're after.
Our purpose is the opposite. I want to remove the trend in order to surface correlations that are unrelated to time evolution. Thus, the residuals are where the signal is.
Another way to think about the residuals (bottom chart) is that positive values imply the actual data was above trend while negative values imply the actual data was below trend.
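A minimal sketch of such a decomposition follows. The post does not say how the trend was estimated, so the centered moving average here is an assumption, and the numbers are made up for illustration rather than taken from the participation-rate data.

```python
def moving_average_trend(series, window=5):
    """Centered moving average; the window shrinks near the two ends."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        trend.append(sum(chunk) / len(chunk))
    return trend


def decompose(series, window=5):
    """Split a series into (trend, residuals), where residual = data - trend."""
    trend = moving_average_trend(series, window)
    residuals = [x - t for x, t in zip(series, trend)]
    return trend, residuals


# Hypothetical rate-like values in percent (NOT the real data).
rate = [66.0, 66.2, 66.1, 66.4, 66.3, 66.0, 65.8, 65.9, 65.5, 65.4]
trend, residuals = decompose(rate, window=5)

for value, t, r in zip(rate, trend, residuals):
    position = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"{value:.1f}  trend={t:.2f}  residual={r:+.2f}  ({position} trend)")
```

A positive residual marks a period above trend and a negative one a period below trend, which is how the bottom chart is read.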
After decomposing the miles-driven data in the same way, I obtain two sets of residuals. These were plotted in the last post in a scatter plot.
The lack of correlation is also obvious in the plot below. You can see that the periods when one series of residuals went above trend were not well correlated with the periods when the other series was above trend (or below trend).
After I wrote the post about superimposing two time series to generate fake correlations, there was a lively discussion in the comments about whether a scatter plot would have done better. Here is the promised follow-up post.
The contentious issue is that X and Y might appear correlated but in fact, what we are observing is that both data series are strongly correlated with time (e.g. population almost always grows with time), and X and Y may not be correlated with each other.
Indeed, the first thing a statistician would do when encountering two data series is to create a scatter plot. Economists, by contrast, seem to prefer two line charts, superimposed.
The reason for looking at the scatter plot is to remove the time component. If X and Y are correlated systematically (and not individually with the time component), then even if we disturb the temporal order, we should still be able to see that correlation. If the correlation goes away in an x-y plot, then we know that the two variables are not correlated, and that the superimposed line charts created an illusion.
The catch is that the scatter plot analysis is necessary but not sufficient. In many cases, we will find strong correlation in the scatter plot. But that does not prove there is X-Y correlation beyond each data series being correlated with time. By plotting X and Y and ignoring time, we introduce time as an omitted variable, which can still be controlling both X and Y series.
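The omitted-variable effect can be illustrated with a small simulation. Everything in it is an assumption made for illustration: two series are built so that each depends only on time plus its own independent noise, and the slopes, noise levels and seed are arbitrary.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
t = list(range(100))

# Each series is driven by time alone; the noise terms are independent.
x = [1.0 * ti + rng.gauss(0, 5) for ti in t]
y = [2.0 * ti + rng.gauss(0, 10) for ti in t]

r_ordered = pearson(x, y)

# Scrambling the temporal order (keeping each x with its y) is what a
# scatter plot does: it hides time but does not remove its influence.
pairs = list(zip(x, y))
rng.shuffle(pairs)
xs, ys = zip(*pairs)
r_shuffled = pearson(xs, ys)

print(f"correlation in time order:  {r_ordered:.3f}")
print(f"correlation after shuffle:  {r_shuffled:.3f}")
```

The two correlations agree up to rounding and both are strong, even though X and Y share nothing except their dependence on time.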
The scatter plot (right) shows the per capita miles driven against the civilian labor force participation rate. Having hidden the time dimension, we still see a very strong correlation between the two data series.
This is because time is still the invisible hand. Time is running from left to right on the chart still. This pattern is visible if we have line segments connecting the data in temporal order, as in the chart below.
One solution to this problem is to de-trend the data. We remove the effect of time from each of the two data series individually, and then plot the residual signals against each other.
Here is the result (right). We now have a random scatter of points that average about zero. If anything, there may be a slightly negative correlation, meaning that when the labor force participation rate is above trend, the per-capita miles driven tend to be slightly below trend; this effect, if it exists, is small.
What I have done here is to establish the trend for each of the two time series. The actual data being plotted is what is above/below trend. What this chart is saying is that when one value is above trend, it gives us little information about whether the other value is above or below trend.
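Here is a sketch of de-trending two series and then correlating what is left. It assumes a straight-line trend fitted by least squares, which may not be the trend used for the charts in this post, and the two series are synthetic (a linear drift plus a sine or cosine wiggle), not the miles-driven or participation-rate data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def linear_detrend(series):
    """Residuals after subtracting a least-squares line fitted against time."""
    n = len(series)
    mt = (n - 1) / 2                      # mean of the time index 0..n-1
    my = sum(series) / n
    slope = (sum((i - mt) * (v - my) for i, v in enumerate(series))
             / sum((i - mt) ** 2 for i in range(n)))
    intercept = my - slope * mt
    return [v - (intercept + slope * i) for i, v in enumerate(series)]


# Synthetic series: both drift upward over time, with unrelated wiggles.
n = 120
x = [0.5 * k + 3.0 * math.sin(k) for k in range(n)]
y = [0.2 * k + 2.0 * math.cos(k) for k in range(n)]

r_raw = pearson(x, y)
r_resid = pearson(linear_detrend(x), linear_detrend(y))

print(f"raw correlation:       {r_raw:.3f}")
print(f"residual correlation:  {r_resid:.3f}")
```

The raw series look strongly correlated because both drift with time. Once each trend is removed, knowing that one series is above trend says almost nothing about the other.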
Business Insider (link) published the following chart and declared "the end of the car age in one chart". The chart superimposed the monthly motor vehicle miles driven per capita and the labor force participation rate.
This is the conclusion of the post:
There's a logical connection between the two. Not in the workforce? You're less inclined to drive.
It's strange that they chose to show a time series going back to the 1970s. The conclusion is logical only for the last five years of the data. Looking back even another decade, to the last recession (2001), one finds the exact opposite conclusion: as the work force participation rate fell, the per-capita miles driven went up.
The other problem is causation creep, about which I have written on the sister blog (link). This chart merely shows correlation (and that is questionable). The conclusion of cause and effect is purely theory. Another theory would be the rise in telecommuting and work-from-home situations. A counter-theory would be that the unemployed may have more free time to drive. Another theory is that gas prices have gone up:
Any time series you can find that has a peak during the 2000s can be similarly interpreted as having caused people to stop driving. Here's a chart of real house prices from Calculated Risk.
Falling house prices cause people to stop driving. Or perhaps falling house prices cause people to lose jobs.
Nick C. on Twitter sent us to the following chart of salaries in Major League Soccer. (link)
This chart is hosted at Tableau, which is one of the modern visualization software suites. It appears to be a user submission. Alas, more power did not bring more responsibility.
Sorting the bars by total salary would be a start.
The colors and subsections of the bars were intended to unpack the composition of the total salaries, namely, which positions took how much of the money. I'm at a loss to explain why those rectangles don't seem to be drawn to scale, or what it means to have rectangles stacked on top of each other. Perhaps it's because I don't know much about how the cap works.
Combined with the smaller chart (shown below), the story seems to be that while all teams have similar cap numbers, the actual salaries being paid could differ by multiples.
This is the standard stacked bar chart showing the distribution of salary cap usage by team:
I have never understood the appeal of stacking data. It's not easy to compare the middle segments.
After quite a bit of work, I arrived at the following:
The MLS teams are divided into five groups based on how they used the salary cap. Salary cap figures are converted into proportions of the total cap. For example, the first cluster includes Chicago, Los Angeles, New York, Seattle and Toronto, and these teams spread the wealth among the D, F, and M players while not spending much on goalie and "others". On the other hand, Groups 2 and 3, especially Group 3, allocated 30-45% of the cap to the midfield.
Three teams form their own clusters. CLB spends more of its cap on "others" than any other team (others are mostly hyphenated positions like D-F, F-M, etc.). DAL and VAN spend a lot less on midfield players than other teams. VAN spends a lot on defense.
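The conversion to proportions of the total cap can be sketched as follows. The team names and dollar figures are hypothetical placeholders, not the MLS data, and the grouping of teams into clusters is not reproduced here.

```python
# Hypothetical cap figures in $ thousands by position group (NOT the MLS data).
cap = {
    "Team A": {"D": 800, "F": 900, "G": 150, "M": 850, "Other": 100},
    "Team B": {"D": 600, "F": 500, "G": 200, "M": 1200, "Other": 300},
}


def to_shares(by_position):
    """Convert dollar amounts into proportions of the team's total cap."""
    total = sum(by_position.values())
    return {pos: amount / total for pos, amount in by_position.items()}


shares = {team: to_shares(positions) for team, positions in cap.items()}

for team, team_shares in shares.items():
    profile = "  ".join(f"{pos}={share:.0%}" for pos, share in team_shares.items())
    print(f"{team}: {profile}")
```

Working with shares puts every team on the same 0-100% scale, so the profiles can be compared even when total spending differs.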
My version has many fewer data points (although the underlying data set is the same) but it's easier to interpret.
I tried various chart types like bar charts, and even pie charts. I still like the profile (line) charts best.
In modern software (I'm using JMP's Graph Builder here), it's only one click to go from line to bar, and one click to go to pie.
There is a tendency when producing dashboards to go for the cutesy-cutesy. Reader Daniel L. came across an attempt by Facebook to document its data center metrics (link). They chose this circular, spiraling design:
Notice that the lines of equal distance on a circular plot are the concentric circles. Thus, when they connect different points in a continuous way, as if it were a standard line chart, the line segments between data points are distorted. The diagram below shows the problem:
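The distortion can be quantified with a little geometry. In this sketch, the function name, the value of 100 and the 24 hourly positions are illustrative assumptions. Two readings with the same value sit on the same concentric circle, but the straight segment joining them cuts inside that circle.

```python
import math


def segment_radius(r1, theta1, r2, theta2, frac=0.5):
    """Distance from the origin of a point on the straight segment that
    joins two points given in polar coordinates (angles in radians)."""
    x1, y1 = r1 * math.cos(theta1), r1 * math.sin(theta1)
    x2, y2 = r2 * math.cos(theta2), r2 * math.sin(theta2)
    x = x1 + frac * (x2 - x1)
    y = y1 + frac * (y2 - y1)
    return math.hypot(x, y)


# Two consecutive hourly readings with the same value of 100,
# placed 1/24 of a full turn apart.
step = 2 * math.pi / 24
mid_hourly = segment_radius(100.0, 0.0, 100.0, step)

# The same two values placed a quarter turn apart.
mid_quarter = segment_radius(100.0, 0.0, 100.0, math.pi / 2)

print(f"midpoint reads as {mid_hourly:.2f} with hourly spacing")
print(f"midpoint reads as {mid_quarter:.2f} with quarter-turn spacing")
```

For equal values r separated by an angle d, the midpoint of the segment sits at r * cos(d / 2), so the connecting line always suggests a dip that is not in the data, and the dip grows with the angular gap.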
One potential advantage (although not worthwhile) of wrapping the data into a circle is that the 24 hours become a continuous line. Except that it isn't the case here! Weirdly, the purple and blue lines show a huge discontinuity at the ray that points vertically upwards from the origin. This leads to an even more fascinating find.
The circle actually rotates! It's like a rotating restaurant. The time shown vertically pointing upwards keeps changing as I write this post. This makes the discontinuity even more baffling. You'd think the previous data point just shifts anti-clockwise but apparently not. If any of you can figure this out, please leave a comment.
As Daniel pointed out, the traditional line charts shown in the bottom half of the page would have done the job with less fuss. Not as eye-catching, but not as baffling either.
One innovation of on-line charts is the replacement of axis labels with mouse-over effects. Mousing over the chart here produces the underlying data values. This is elegance.
One horrible trend with on-line charts is the horrendous choice of scale. Look at the top two charts, especially the orange line chart about power usage. It makes no sense to choose a scale that completely annihilates the underlying fluctuations.
I have found the same problems with many Google charts. It looks as if nothing is happening, but when you look more closely, you learn that a tiny distance represents a big percentage shift in the underlying data.
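One way to see the scale problem is to compute how much of the axis the data actually occupies. The readings and axis limits below are hypothetical and are not taken from the Facebook dashboard.

```python
def share_of_axis(values, axis_min, axis_max):
    """Fraction of the axis height taken up by the range of the data."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# Hypothetical readings that fluctuate within a narrow band.
readings = [1.05, 1.08, 1.06, 1.10, 1.07]

wide = share_of_axis(readings, 0.0, 2.0)     # axis spans 0 to 2
tight = share_of_axis(readings, 1.0, 1.15)   # axis hugs the data

print(f"wide axis:  data fills {wide:.1%} of the height")
print(f"tight axis: data fills {tight:.1%} of the height")
```

On the wide axis the fluctuations are squeezed into a sliver of the plot area, which is why the line looks flat.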
Sometimes, a chart just strains your mind. Such is the case with the following, a tip from Augustine F. (@acfou)
There are just so many percentages on the chart it's really hard to figure out which is which.
Under the title, it hints that they are showing results from a poll. The legend implies that the poll asks for estimates of budget and revenue allocations: one imagines the questions were "what proportion of your marketing budget is allocated to digital?" and "what proportion of your revenues is attributed to digital?" On top of the bars are some percentages, presumably percentages of respondents. Perhaps, or perhaps not. The column labels clearly add up to over 100% since there are two columns in the 30-35% range.
Under the axis, we have buckets of percentages. Are they percentages of people, of budgets or of revenues? Why and how are they bucketed?
My best guess is that the survey is a multiple-choice with 11 choices corresponding to the groups of columns. The axis labels refer to both percentage of budget and percentage of revenues, depending on which column you're looking at.
What is maximally confusing is the last set of columns, labeled "Average", with values in the 35% range. It is most likely not a choice in the survey. They somehow came up with an average based on the responses. So maybe I was wrong about the multiple-choice format: if the raw data comes in buckets like 61 to 70%, there is no easy way to average these responses. Maybe they asked for two exact percentages, and then grouped them afterwards.
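If the raw data really came in buckets, one common workaround is to treat every respondent as sitting at the midpoint of their bucket. The sketch below does that and also computes the lowest and highest averages the buckets allow. The buckets and respondent shares are hypothetical, not read off the chart, and nothing here indicates that the survey's "Average" was computed this way.

```python
# Hypothetical buckets: (low %, high %, share of respondents). Not from the chart.
buckets = [(0, 10, 0.20), (11, 20, 0.30), (21, 30, 0.50)]


def weighted_mean(points_and_weights):
    """Weighted average of (value, weight) pairs."""
    total = sum(w for _, w in points_and_weights)
    return sum(v * w for v, w in points_and_weights) / total


# Midpoint assumption: everyone in a bucket sits at its center.
midpoint_avg = weighted_mean([((lo + hi) / 2, w) for lo, hi, w in buckets])

# Bounds: everyone at the bottom, or at the top, of their bucket.
lowest_avg = weighted_mean([(lo, w) for lo, _, w in buckets])
highest_avg = weighted_mean([(hi, w) for _, hi, w in buckets])

print(f"midpoint estimate: {midpoint_avg:.1f}%")
print(f"possible range:    {lowest_avg:.1f}% to {highest_avg:.1f}%")
```

The gap between the two bounds shows how much the bucketing hides: any average inside that range is consistent with the same bucket counts.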
To sum all that up, the percentages on top of the columns are percentages of respondents, except in the last set of columns, where they are percentages of budget (or revenues). The percentages of budget (or revenues) are sitting on the horizontal axis, except in the last label, called "Average", where it means the average respondent.
There is a problem with my interpretation. It makes the chart completely worthless!
What use is it to learn that "16% of the respondents say they allocate 11-20% of their budget on digital while 12% of the respondents say they derive 11-20% of their revenues from digital"?
You might be interested in whether there is a return on investment to the money spent on digital marketing. You'd then need to know, for a given company, what proportion of budget was spent on digital marketing versus what proportion of revenues was attributed to that marketing. In this chart, there is no linkage -- the companies who say they spend 11-20% on digital may or may not be the same set of companies who say they derive 11-20% from digital spend.
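The missing linkage can be made concrete with two made-up surveys. Every number below is hypothetical. Both surveys produce identical bar charts, because the separate distributions of budget share and revenue share are the same, yet they tell opposite stories about individual companies.

```python
# Each tuple is one hypothetical respondent: (budget share %, revenue share %).
survey_a = [(15, 15), (25, 25), (35, 35)]
survey_b = [(15, 35), (25, 25), (35, 15)]


def marginals(pairs):
    """What a bar chart of each question shows: the two separate distributions."""
    budget = sorted(b for b, _ in pairs)
    revenue = sorted(r for _, r in pairs)
    return budget, revenue


def gaps(pairs):
    """Per-respondent revenue share minus budget share (needs the linkage)."""
    return [r - b for b, r in pairs]


print("same marginals:", marginals(survey_a) == marginals(survey_b))
print("gaps in survey A:", gaps(survey_a))
print("gaps in survey B:", gaps(survey_b))
```

In survey A, revenue share tracks budget share for every company. In survey B, the smallest spender reports the largest revenue share. Only respondent-level data can tell the two apart.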
If the survey asked for exact percentages, then I'd prefer to see a scatter plot, showing proportion of budget on one axis, and proportion of revenues on the other axis, each dot representing a respondent.
A final note: it is worth asking what types of people answer this survey. Pretty much the only people in a company who can answer this question accurately are the heads of marketing. If you are working for the head of marketing, you likely know the details of a particular segment of marketing but not the aggregate numbers. If you work in a different department, there is little to no chance that you have any useful knowledge about marketing budgets and revenue allocations.
One would also appreciate it if all such pictures include the sample size.
Reader (and author) Bernard L. sends us to the Economist (link), where they walked through a few charts they sketched to show data relating to the types of projects that get funded on Kickstarter. The three metrics collected were total dollars raised, average dollars per project, and the success rate of different categories of projects.
Here's the published version, which is a set of bar charts, ranked by individual metrics, and linked by colors.
This bar chart does the job. The only challenge is the large number of colors. But otherwise, it's not hard to see that fashion projects have the worst success rate and raised relatively little money overall although the average pledge amount tended to be higher than average.
The following chart used more of a Bumps chart aesthetic. It dropped the average pledge per project metric, which I think is a reasonable design choice. The variance in pledge amount is probably pretty high and thus the average may not be a good metric anyway. The Bumps format, though, suffers because there are too many categories and the two metrics are rather uncorrelated, resulting in a spider web. Instead of using colors as a link, this format uses explicit lines as links between the metrics.
The following version combines features from both. It requires no colors. It drops the third metric, while adopting the bar chart format. The two charts retain the same order of categories so that one can read across to learn about both metrics.
PS. Readers want to see a scatter plot:
The overall pattern is clearer on a scatter plot. When there are so many categories, it's a pain to put the data labels on the chart. It's odd that the amount pledged for games is the highest of the categories and yet it has among the lowest rate of being fully funded. Is this a sign of inefficiency?