For information on upcoming meetings in which I am presenting, see this post on the sister blog.
Through Twitter, Antonio Rinaldi sent me the following chart, which accompanied a New York Times piece about the CPI (inflation index). The article concerns a very important topic--that many middle- to lower-income households have barely any savings left after spending on necessities--and only touches upon the issue raised by this chart: the official CPI is an average of prices of a basket of goods, and there is much variability in the price changes of the different categories of goods.
I cover this subject in much greater detail in Chapter 7 of Numbersense (link). There are many reasons why the official inflation rate seems to diverge from our own experiences. One of them is that we tend to notice and worry about price increases, while we take price decreases for granted or fail to notice them at all. In the book, I cover the fascinating subject of the psychology of remembering prices. Obviously, this is a subject of utmost importance if we are to use surveys to understand perceived prices.
The price of an unbranded T-shirt has stayed flat, or may even have declined, over the last few decades. Meanwhile, the chart reveals that phones and accessories, computers, and televisions have all enjoyed deflation over the last decade. Actually, much of this "deflation" is due to a controversial adjustment known as "hedonics," which attributes part of any price change to product or technology improvements. So, if you pay the same price today for an HDTV as you did in the past for a standard-definition TV, then in real terms you paid less today than in the past.
That adjustment is reasonable only to a certain extent. For instance, my cell phone company stuffs my plan with hundreds of unused and unusable minutes, so on a per-minute basis, I am sure prices have come down substantially; on a per-used-minute basis, I'm not so sure.
Let's get to what we care about on this blog... the visual. There is one big puzzle embedded in this chart. Look at the line for televisions. It dipped below -100 percent! Like Antonio, many readers should be scratching their heads--did the price of televisions go negative? Did the hedonic adjustment go bonkers?
As an aside, I don't like the current NYT convention of hiding too many axis labels. What period of time is this chart depicting? You'd only find out by reading the label of the vertical axis! I mentioned something similar the other day.
The key to understanding a chart like this is to learn what is being plotted. The first instinct is to think it's the change in prices over time. A quick glance at the vertical axis label corrects that misunderstanding. It says "Change in prices relative to a 23% increase in price for all items, 2005-2014".
This label is doing a lot of work--probably too much for its inconspicuous location and unbolded, uncolored status.
Readers have to know that the official CPI is a weighted average of changes in prices of a specified basket of goods. Some but not all of the components are being graphed.
Then readers have to understand that there is an index of an index. The prices of each "item" (i.e. category or component of the CPI) are indexed to 1984 levels. So the prices of televisions are first re-indexed to a 2005 baseline. This establishes a growth trajectory for televisions. But this is not what is being depicted.
The blue line reflects the 23% average increase in prices in that 10-year period. Notice that the red line does not exhibit any weirdness--television prices have gone down by 90 percent. It's not negative.
What the designer tried to do is to index this data another time. Think of pulling the blue line down to the horizontal axis, and then see what happens to the gray and red lines.
Now, even this index on an index should not present a mathematical curiosity. If all items moved to 1.23 while apparel moved to 1.10, you might compute 110%/123%, which is roughly 0.9. You'd say the apparel index went 90% of the way to where the all-items index went. Similarly for TVs, you would compute 10%/123%, which is 0.08. That would be saying the TV index ended up at 8% of where the all-items index landed.
That still doesn't yield -100%. The clue here is that the baseline is zero percent, not 100, not 1.0, etc. So if an item moved in sync with all items, its trajectory would have been horizontal at zero percent. That means the second index is not a division but a subtraction. So for TVs, it's -90% - 23% = -113%. For apparel, it's +10% - 23% = -13%.
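Here is a minimal sketch in R of the two schemes, using the numbers quoted above (the variable names are mine):

```r
# Price changes over 2005-2014, as read off the chart:
# all items +23%, apparel +10%, televisions -90%
change <- c(all_items = 0.23, apparel = 0.10, televisions = -0.90)

# Subtraction (what the chart appears to do): each item's change
# minus the all-items change. Televisions land at -113%.
change - change["all_items"]
#>   all_items     apparel televisions
#>        0.00       -0.13       -1.13

# Division: each item's index level relative to the all-items level.
# Televisions land at about -92%, which can never fall below -100%.
(1 + change) / (1 + change["all_items"]) - 1
#>   all_items     apparel televisions
#>        0.00       -0.11       -0.92
```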
Even though I reverse-engineered the chart, I don't understand the reason for using subtraction rather than division for the second layer of indexing. It's strange to me to add or subtract two indices that have different baseline quantities.
Here is the same chart but using division:
I usually avoid telescoping indices. They are more trouble than they're worth. Here is an old post on the same subject.
Carl Bialik used to be the Numbers Guy at the Wall Street Journal--he's now with FiveThirtyEight. Apparently, he left a huge void. John Eppley pointed me to this set of charts via Twitter.
This chart about Citibike is very disappointing.
Using the Trifecta Checkup, I first notice that it addresses a stale question and produces a stale answer. The caption below the chart says "the peak times ... seem to be around 9 am and 6 pm." What a shock!
I sense a degree of meekness in using "seem to be". There is not much to inspire confidence in the data: rather than the full statistics, which you'd think someone at Citibike has, the chart is based on "a two-day sample last autumn". The number of days is less concerning than the question of whether those two autumn days are representative of the whole year. Curious readers might want to know what data was collected, how it was collected, and the sample size.
Finally, the graph makes a mess of the data. While the black line appears to be data-rich, it is not. In fact, the blue dots might as well be randomly scattered and connected. As you can see from the annotations below, the scale of the chart makes no sense.
Plus, the execution is sloppy, with a missing data label.
The next chart is not much better.
The biggest howler is the choice of pie charts to illustrate three numbers that are not that different.
But I have to say the chart raises more questions than it answers. I am not an expert in pregnancy, but doesn't a pregnant woman's weight include the weight of the baby she's carrying? So the more weight the woman gains, on average, the heavier her baby. What a shock!
The last and maybe the least is this chart about basketball players in the playoffs.
It's the dreaded bubble chart. The players are arranged in a perplexing order. I wonder if there is a natural numbering system for basketball positions (center = #1, etc.), like there is in soccer. Even if there is such a natural numbering system, I still question the decision to confound that system with a complicated ranking of current-year playoff players against all-time players.
Above all, the question being asked is uninteresting, and so the chart is uninformative. A more interesting question to me is whether the best players are playing in this year's playoffs. To answer it, the designer should compare only currently active players, showing the all-time ranks of those playing in the playoffs versus those who aren't.
Reader and tipster Chris P. found this "death spiral" chart dizzying (link).
It's one of those charts that has conceptual appeal but does not do the data justice. As the name implies, the designer has a strong message, that the arctic sea ice volume has dramatically declined over time. This message is there in the chart but the reader has to work hard to find it.
Why doesn't this spider chart work? Let's be more precise.
This is a pity because the designer did very well in aligning two corners of the Trifecta Checkup: what is the question, and what does the data show. It is a great idea to control for month of year and look at year-to-year changes. (A more typical view would be to look at month-to-month changes and plot one line per year.)
This is an example of a chart that does well on one side of the Checkup but fails because the graph isn't in tune with the data or the question being addressed.
Whenever I see a spider chart, I want to unroll the spiral and see if a line chart is better. Thus:
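Here is a minimal sketch of the unrolling in R with ggplot2, assuming the ice data sits in a data frame called `ice` with columns `year`, `month`, and `volume` (hypothetical names):

```r
library(ggplot2)

# Unrolling the spiral: year runs along the horizontal axis,
# and each month becomes its own line
ggplot(ice, aes(x = year, y = volume, colour = factor(month), group = month)) +
  geom_line() +
  labs(x = "Year", y = "Arctic sea ice volume", colour = "Month")
```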
The dramatic decrease in Arctic ice volume (no matter the month) is clear as day. You can actually read off the magnitude of the drop. (Try doing that in the spider chart, say between 1978 and 1995.)
This chart still has issues, namely too many colors. One can color the lines by season of the year, like this:
Or switch to a small-multiples set up with three lines per chart and one chart per season.
The seasonal arrangement is not arbitrary. You can see the effect of season by looking at side-by-side boxplots:
The pattern is UP-DOWN-DOWN-UP.
In fact, a side-by-side boxplot of the data provides a very informative look:
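A minimal sketch of this view, using the same hypothetical `ice` data frame as above:

```r
library(ggplot2)

# One boxplot per year: the twelve monthly values collapse into a
# distribution, so the year-on-year decline becomes the focal point
ggplot(ice, aes(x = factor(year), y = volume)) +
  geom_boxplot() +
  labs(x = "Year", y = "Arctic sea ice volume")
```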
The monthly series is obscured in this view, absorbed into the vertical spread within each year, which we can see is quite stable. The idea of controlling for month is to make it irrelevant. This view emphasizes the year-on-year decline of the entire distribution.
If you're worried about dropping too much information, the data can be grouped by season, as before, in a small-multiples setup like this:
Regardless of season, the trend is down.
PS. Alberto reminds me of his post about an example of a spider chart (radar chart) that works. Here's the link. It works because the graphical element is more in tune with the data. While the ice cap data has a linear trend over time, the voting data is all about differences in distribution. Also, the designer expects readers to care about the high-level pattern, not the specifics.
This chart from Reuters is making the rounds on Twitter today.
Quickly, tell me whether the Gun Law in Florida did well or poorly.
That of course is the entire purpose of the chart.
This is the double edge of novelty in charts. There should be a very high bar against running counter to convention. Readers do bring their "baggage" to the chart, and the designer should take that into consideration.
Some commentators are complaining about trickery. That may be true. But it's also possible the designer actually thought reversing the direction of the vertical axis made the chart better.
Don't forget we have another convention: up is good and down is bad. Fewer murders is good and more murders is bad. So why not make it such that a rising line indicates goodness (fewer murders)?
Going back to the Trifecta Checkup, this chart has dual problems. We just talked about the syncing between the data and the graphical element.
The other issue is that the data is insufficient to draw conclusions about the underlying question: what explains the shift in the number of murders since the late 2000s? This is a complex problem--the chapter in Freakonomics about abortion and crime rates is still instructive, not for the disputed conclusion but for the process of testing various hypotheses. Reducing the complex causal structure to a single factor is dissatisfying.
The associated article is here.
The question on the table is motivated by the extraordinary performance of a young baseball player Mike Trout. The early success can be interpreted either as evidence of future potential or as evidence of a future drought. As an analogy, someone wins a lottery. You can argue that the odds are so low that winning again is impossible. Or you can argue that winning once indicates that this person is "lucky" and lucky people might win again.
The chart shows the proportion of players who performed even better after the initial success, given the age at which they first broke out. One way to read this chart is to mentally replace the bubbles with dots (or columns), and then interpret the size of each bubble as the statistical significance of the corresponding probability estimate. The legend says "number of players," which is the sample size, which governs the error bar associated with that particular estimate.
This bubble chart is no different from others: it is impossible to judge the relative sizes of the bubbles. Even though the legend provides two reference points (a nice enough idea on its own), it is still impossible to know, for example, what proportion of players did better later in life when they first peaked at age 24. The bubble for age 23 looks like it's exactly five players, but I still cannot figure out how many players the adjacent bubble represents.
The designer should have just replaced each bubble with an error bar, and the chart would instantly be more readable. (I have another version of this at the end of the post.)
The rest of the design elements are clean and well done, particularly the use of notes to point out interesting aspects of the data.
From a Trifecta Checkup perspective, I am uncertain about the nature of the data used to investigate the interesting question posed above.
Readers should note that the concepts of "early success" and "later success" are not universally defined. The author here selects two proxies. Reaching an early peak is equated with "batters first posting 15+ WAR over two seasons". Reversion to the mean is then defined as not having a better two-year span subsequent to that early peak.
Why two seasons? Why WAR and not a different metric? Why 15 as the cutoff? These are all design decisions made while working with the data.
One can make reasonable arguments to justify these choices. A bigger head-scratcher relates to the horizontal axis, which identifies the first time a player reaches his "early peak," as defined above. The way the chart is set up, it is almost preordained to exhibit a negative slope: the older a player is when he reaches his first peak, the fewer years remain in his playing career to emulate or surpass that feat.
This last point is nicely illustrated in the next chart of the article:
This chart is excellent on many levels. It's not clear, though, whether it says anything other than that players age.
Near the end of the post, the author rightfully pointed out that "there’s not really enough data to demonstrate this effect". Going back to the first chart, it appears that no single bubble contains a double-digit count of players. So every sample size is between one and, say, seven. We should be wary of conclusions based on so little data.
It's always fun to find examples of the Law of Small Numbers, courtesy of Kahneman & Tversky.
Here is a sketch of how I might re-make the first chart (I made up data; see the note below).
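For those who want to play along, here is a minimal sketch of the error-bar version in R; the numbers are invented, just as in the remake above:

```r
library(ggplot2)

# Invented data: n = players first peaking at each age,
# p = proportion of them who later posted a better two-year span
df <- data.frame(
  age = 20:28,
  n   = c(1, 3, 4, 5, 3, 2, 4, 2, 1),
  p   = c(1, 2/3, 0, 2/5, 1/3, 1/2, 1/4, 0, 0)
)

# Standard error of a proportion; crude with such tiny samples, but it
# puts the uncertainty on display instead of encoding n as bubble area
df$se <- sqrt(df$p * (1 - df$p) / df$n)

ggplot(df, aes(x = age, y = p)) +
  geom_point() +
  geom_errorbar(aes(ymin = pmax(p - se, 0), ymax = pmin(p + se, 1)),
                width = 0.2) +
  labs(x = "Age at first peak", y = "Proportion doing better later")
```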
While making this chart, I realized another issue with the original bubble chart. When the proportion of players improving on their early peak is zero percent, the number of players who failed to do so is hidden. In the revised chart, this data is clearly seen (look at age 22).
Note: I wonder if I totally missed the point of the original chart... I actually had trouble eyeballing the data, so I ended up making up numbers. The bubble at age 22 looks like it should stand for 5 players, and yet it sits at precisely 50%, which would map to 2.5 players. If I assume the age-22 bubble to be 4 players, then I don't know what the age-26 bubble is. If it is 4 players also, then the minimum non-zero proportion should be 1/4, but the bubble clearly lies below 25%. If it is 3 players, the minimum non-zero proportion is 1/3, which should sit at 33%.
Today's post examines an example of Big Data analyses, submitted by a reader, Daniel T. The link to the analysis is here. (On the sister blog, I discussed the nature of this type of analysis. This post concerns the graphical element.)
The analyst looked at "the influence of operating systems and device types on hourly usage behavior". This dataset satisfies four of the five characteristics in the OCCAM framework (link).
Observational: the data are ad impressions coming from the Chitika Ad Network, observed between February 26 and March 11, 2014. This means users are (unwittingly) being tracked by cookies, pixels, or some other form of tracking device. The analyst did not plan this study and then collect the data.
Lacking Controls: There will be a time trend, but what should we compare it against? How do we know if something is out of the ordinary or not?
Seemingly Complete: Right up top, we are impressed with the use of "a sample of tens of millions of device-specific online ad impressions". At least they understand this is a sample, not everything.
Adapted: All weblog data are adapted in the sense that web logs originally serve web developers who are interested in debugging their code. Operating systems and device types are tracked because each variant of OS and device requires customization, and we need that data to understand how webpages render differently. I wrote about the adaptedness of this data in a separate blog post. (link)
The analysis did not require merging data, the fifth element of the framework.
Here is the chart type used to present the analysis. There are many problems.
The conclusion the analyst drew from the above chart is: "North American Android users are more active than their iOS counterparts late at night and during the majority of the workday." In other words, the analyst points out that the blue line sits on top of the orange line during certain times of the day.
Daniel is very annoyed with the way the data is processed, and rightfully so. The chart actually does not say what it appears to say. This is because of the use of indexing.
This simple chart is not so simple to interpret!
This is because each line is "indexed to self". For example, at 12 pm EST, Android users are at 75% of their peak-hour usage while iOS users are at two-thirds of theirs. The trouble is that the peak-hour usage by iOS users is more than 2.5 times the peak-hour usage of Android users, so 100% on the blue line is less than half of 100% on the orange line by count.
Later in the same post, the analyst re-indexed both series to the iOS peak. This chart tells us that iOS users are more active no matter what time of the day.
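To see the difference between the two indexing schemes, here is a minimal sketch in R with invented hourly impression counts:

```r
# Invented hourly ad impression counts for 24 hours;
# the iOS series has roughly 2.5 times the volume of the Android series
hour    <- 0:23
android <- round(1000 * (1 + 0.8 * sin(2 * pi * hour / 24)))
ios     <- round(2500 * (1 + 0.8 * sin(2 * pi * hour / 24 + 0.5)))

# "Indexed to self": each series divided by its own peak.
# The curves are comparable in shape only; the volume gap disappears.
android_self <- android / max(android)
ios_self     <- ios / max(ios)

# Indexed to a common baseline (the iOS peak), as in the analyst's
# second chart: the vertical gap now reflects actual relative volume
android_common <- android / max(ios)
ios_common     <- ios / max(ios)
```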
The Chitika analyst is not doing anything unusual. This type of indexing is a pandemic in Web analytics. The worst thing about it is that a lot of Web data is long-tailed and the maximum value is an outlier. Indexing data to an outlier isn't wise. (The index is often used to hide the actual values of the data, usually to keep company secrets. But there are better ways to accomplish this.)
Digging a little deeper, we've got to note other key assumptions that the analyst must have made in producing this analysis -- and about which we are in the dark.
Are users with both Apple and Microsoft devices counted on both blue and orange lines?
How is "volume" of Web usage determined? Is it strictly number of ad impressions?
Why is total volume displayed? If Microsoft PCs outnumber Macs, and the chart shows the PC line well above the Mac line, is it speaking to market share or to the usage patterns of the average user?
How representative is the traffic in the Chitika network?
How did the analyst deal with bot traffic?
Finally, using EST (Eastern Standard Time) rather than local time is silly. Think of it this way: if you extract only New York and California users, and compare their curves, without even looking at the data, you can surmise that you will see a similar shape but time-shifted by approximately three hours. Ignoring time difference leads to silly statements like this: "Both sets of users are most active during the workday, with usage volume dropping off in the late evening/early morning."
A twitter follower submitted this chart showing the shift in ethnicity in Texas:
If you blinked, you probably took away the wrong message. Our "prior" tells us that the proportion of Hispanics has been rising quite rapidly in Texas. So, like me, you might home in on the blue columns, which increase drastically from 32% to 68%.
Things start to fall apart.
First, you might notice that the blue label says "Non-Hispanic Whites," which is exactly the opposite of our hypothesis. For a moment, we are confused. Could it be that the Hispanic population in Texas has been shrinking?
Then, you might notice that the "information in our head" made us assume that the horizontal axis represents time. On a closer look, we discover that it's not time; what's being plotted from left to right are age groups. In fact, it's a kind of reversed time: the columns on the right represent the ethnic distribution today of people born more than 60 years ago, while the columns on the left represent younger generations.
Finally, the gray columns are redundant and distracting.
On the other hand, the designer is admirably restrained with data labels, and includes the baby and crooked-man-with-a-stick icons to provide some guidance; both are good ideas.
If I apply the Trifecta Checkup to this chart, the biggest issue is the misalignment between the interesting question of ethnic changes in Texas and the data used to explore it. The current ethnic mix is shaped not only by the ethnic composition at birth but also by net migration of different groups and by their longevity. As pointed out above, the split by age groups forces us into a kind of reversed-time thinking.
A simple fix involves expressing ages as birth years, and using a single line instead of columns:
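A minimal sketch of this fix in R, assuming a data frame `tx` with columns `age_midpoint` and `pct_white` (hypothetical names) and a survey year of 2013:

```r
library(ggplot2)

# Convert each age group (represented by its midpoint) to a birth year,
# so that left-to-right once again reads as time
survey_year <- 2013
tx$birth_year <- survey_year - tx$age_midpoint

ggplot(tx, aes(x = birth_year, y = pct_white)) +
  geom_line() +
  labs(x = "Year of birth", y = "% non-Hispanic white")
```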
This version doesn't address the tendency to interpret the left-right axis as time, nor the excessive number of age groups.
An even better chart would put time on the horizontal axis, then have multiple lines, each representing the proportion of non-Hispanic whites in a specific age group. It may be a political choice--I'm not sure why they chose to plot the declining proportion of non-Hispanic whites and lump Hispanics into "all others," as opposed to plotting the increasing share of Hispanics.
The Facebook data science team has put together a great course on EDA at Udacity.
EDA stands for exploratory data analysis. It is the beginning of any data analysis when you have a pile of data (or datasets) and you need to get a feel for what you're looking at. It's when you develop some intuition about what sort of methodology would be appropriate to analyze the data.
Not surprisingly, graphical methods form a big part of EDA. You will commonly see histograms, boxplots, and scatter plots. The scatterplot matrix (see my discussion of this) makes an appearance here as well.
The course uses R, and in particular Hadley Wickham's ggplot package, throughout. I highly recommend the course for anyone who wants to become an expert in ggplot, which uses quite a bit of its own idiosyncratic syntax. This EDA course offers a lot of instruction in coding. You do have to work hard--reading the supplementary materials and doing the exercises throughout the course--but you will learn a lot. As good instruction goes, they expect students to discover things, and do not feed you bullet points.
This course is free, and the quality of the instruction is head and shoulders above other MOOCs out there. The course is designed from the ground up for online instruction, and it shows. If you have tried other online courses, you will immediately notice the difference in quality. For example, the people in these videos talk directly to you, not to a bunch of tuition-paying students in some remote classroom.
Sign up at Udacity before the course gets going. Disclaimer: No one paid me to write this post.
In the prior post, I linked to Eric P.'s (link) vetting of the Bloomberg chart on the drop in median male income in the U.S. in the last few decades. Just as a reminder, here is the key chart:
In the 25-34 age group (blue line), the median income has suffered two waves of drastic declines, about 25% from 1972 to 1992 and then about 18% from 1999 to 2011.
There is a different way to digest the chart above, which is what I want to talk about in this post. Notice that people age over time, so if you trace the blue line from left to right, at every point in time we are comparing different people.
Instead, let's trace the same people across time -- this is known as a cohort analysis. I traced a black line through the above chart:
This cohort consists of male workers who were 25 to 34 years old between 1972 and 1982. By 1982, they would have aged to between 35 and 44 years old, and so they would belong to the green line. Then they shifted up to the yellow line. So over the lifespan, the median worker increased their income.
You might notice that this analysis is very rough because the data is not granular enough. For example, if you were 34 years old in 1972, then by 1973 you had already moved from the blue line to the green line. With more granular data, this analysis can be made precise. The weird jump (indicated by the dashed lines) is most likely a consequence of the imperfections in the cohorting.
If we had birth year data, we could trace the people born in each year forward, and then stack all these traces on the same chart to figure out true generational changes. Imagine a chart with age on the horizontal axis.
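Here is a minimal sketch of that chart in R, assuming granular data in a data frame `income` with columns `year`, `birth_year`, and `median_income` (hypothetical names):

```r
library(ggplot2)

# Each worker's age at the time of the survey, and a decade-wide cohort label
income$age    <- income$year - income$birth_year
income$cohort <- cut(income$birth_year, breaks = seq(1930, 1990, by = 10))

# Age on the horizontal axis, one line per birth cohort: generational
# change shows up as vertical gaps between the lines
ggplot(income, aes(x = age, y = median_income, colour = cohort)) +
  stat_summary(fun = median, geom = "line") +
  labs(x = "Age", y = "Median income", colour = "Birth cohort")
```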
One of the key elements of numbersense is realizing that there is no single way to analyze any given dataset. When the data is rich, it holds many different insights.