Who is a millennial? An example of handling uncertainty

I found this fascinating chart from CNBC, which attempts to nail down the definition of a millennial.

Millennials2-01

It turns out everyone defines "millennials" differently. They found 23 different definitions. Some media outlets apply different definitions in different items.

I appreciate this effort a lot. The design is thoughtful. In making this chart, the designer added the following guides:

  • The text draws attention to the definition with the shortest range of birth years, and the one with the largest range.
  • The dashed gray gridlines help with reading the endpoints of each bar.
  • The yellow band illustrates the so-called average range. It appears that this average range is formed by taking the average of the beginning years and the average of the ending years. This indicates a desire to allow comparisons between each definition and the average range.
  • The bars are ordered by the ending birth year (right edge).

The underlying issue is how to display uncertainty. The interest here is not just to feature the "average" definition of a millennial but to show the range of definitions.

***

In making my chart, I apply a different way to find the "average" range. Given any year, say 1990, what is the chance that it is included in any of the definitions? In other words, what proportion of the definitions include that year? In the following chart, the darker the color, the more likely that year is included by the "average" opinion.

Redo_junkcharts_cnbcmillennials

I ordered the bars from shortest to the longest so there is no need to annotate them. Based on this analysis, 90 percent (or higher) of the sources list 19651985 to 1993 as part of the range while 70 percent (or higher) list 19611981 to 1996 as part of the range.

 

 


The rule governing which variable to put on which axis, served a la mode

When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:

20160402_WOC883_icecream_PISA

The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)

Redo_econ_icecreamIQ_1A

Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50 points increase in PISA score (IQ), this line says to expect ice cream consumption to raise by about 1-2 liters per person per year. So higher IQ makes people eat more ice cream.

***

If the convention is respected, then the following scatter plot results:

Redo_econ_icecreamIQ_2

The first thing to note is that the regression analysis is different here from that shown in the previous chart. The blue regression line is not equivalent to the black regression line from the previous chart. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse the roles of the x and y variables in a scatter plot.

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).

***

When you make a scatter plot, you have two variables for which you want to analyze their correlation. In most cases, you are exploring a cause-effect relationship.

Higher income households cares more on politics.
Less educated citizens are more likely to not register to vote.
Companies with more diverse workforce has better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome, the other variable is the explanatory factor. Always put the outcome in the vertical axis, and the explanation in the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention, otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!

 


Statistical significance explainer, and Instagram's experiment to hide Likes

There are some statistical concepts that all data visualization practitioners should know about, and the concept of statistical significance is one of them.

It's a hard concept to grasp because it requires one to think beyond the data that are collected. The abstract thinking is necessary since we typically want to make general statements - while using the collected data as evidence.

My new video in the Data Science: The Missing Pieces series explains statistical significance. To be precise, it explains NOT statistically significant. When something is not significant, it causes all sorts of anxieties, panics, half-measures, re-examinations, and havoc. Much of the time, the result is confusion and misinterpretation.

The video addresses a recent news item - Instagram's experiment to hide the Like count. See for example this article. After running this experiment, Instagram's analysts will look for statistical significance. If the result is NOT significant, what does it mean?

Check out the video for more.

 

***
Placed here to serve the machine:

DSTMP3_thumb_significance


Does this chart tell the sordid tale of TI's decline?

The Hustle has an interesting article on the demise of the TI calculator, which is popular in business circles. The article uses this bar chart:

Hustle_ti_calculator_chart

From a Trifecta Checkup perspective, this is a Type DV chart. (See this guide to the Trifecta Checkup.)

The chart addresses a nice question: is the TI graphing calculator a victim of new technologies?

The visual design is marred by the use of the calculator images. The images add nothing to our understanding and create potential for confusion. Here is a version without the images for comparison.

Redo_junkcharts_hustlet1calc

The gridlines are placed to reveal the steepness of the decline. The sales in 2019 will likely be half those of 2014.

What about the Data? This would have been straightforward if the revenues shown are sales of the TI calculator. But according to the subtitle, the data include a whole lot more than calculators - it's the "other revenues" category in the financial reports of Texas Instrument which markets the TI. 

It requires a leap of faith to believe this data. It is entirely possible that TI calculator sales increased while total "other revenues" decreased! The decline of TI calculator could be more drastic than shown here. We simply don't have enough data to say for sure.

 

P.S. [10/3/2019] Fixed TI.