It's Spring Break at NYU, which, for professors, is not a break. I have been marking midterms for my business analytics class. Since I like to set open-ended questions (is there anything else in statistics?), I get a variety of answers. One of the questions helps clarify what I mean by numbersense.
The question asks students to comment on the distribution of a variable (median income) in a dataset of customers. Every student should know how to generate a histogram and a boxplot, plus summary statistics and percentiles for this data. The figure below shows what each student was looking at. Before you read further, think about what features of this distribution attract your attention.
The responses I received fell into several categories. Let me list them out:
- The mean is $40,369 and the median is $43,174. Most of the customers have median income between $26,083 and $56,897.
- The mean is $40,369 and the median is $43,174. Most of the customers have median income between $26,083 and $56,897. There is a large range of incomes from $0 to $200,001, with a lot of high outliers.
- The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. Almost a quarter of the sample has $0. Based on the age distribution (skewing toward older people), I think these may be retirees.
- The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. There appear to be two types of customers, those with zero income and those with a standard distribution. Some of the entries with zero income may have been missing values coded as zeroes, because they correlate with unknowns or zeroes in other variables.
- The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. There appear to be two types of customers, those with zero income and those with a standard distribution. Since the data are not collected at the individual level but at the Zip+9 level, meaning they measure the median income of the residential blocks around each customer, $0 surely does not mean zero. The zero-income segment has average values of other variables not too different from those of the positive-income segment, so most likely, zero means unknown.
These answers are ordered from demonstrating the least numbersense to the most. Response type #1 makes no mention of the spike of zeroes despite the strong hint in the question: "Give plausible explanations for any parts of the distribution that is not smooth". Response #2 notes that incomes start at $0 but is not bothered enough to explain it.
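The spike itself is easy to surface mechanically. Here is a minimal sketch, assuming the incomes sit in a plain Python list; the `find_spikes` helper, the 10% threshold, and the sample values are all made up for illustration and are not the class dataset.

```python
from collections import Counter


def find_spikes(values, threshold=0.10):
    """Return {value: share} for each single value that accounts for
    more than `threshold` of all observations."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items() if c / n > threshold}


# Hypothetical incomes, invented for this sketch.
incomes = [0, 0, 0, 26_100, 31_500, 38_200, 43_000, 44_900,
           51_300, 56_800, 72_400, 200_001]

spikes = find_spikes(incomes)
print(spikes)  # {0: 0.25}
```

On a continuous variable like income, any exact value that holds a sizable share of the data deserves a second look before the mean or median is reported.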
Responses #3-#5 all attempt to explain the observed anomaly. Response #3 has a good theory ("retirees") but somehow looks past the fact that the zero-income segment spans a wide age range. (The highlighted parts of the histogram below are the zero-income customers.)
In fact, this chart was used by several students to prove that retirees accounted for the zero-income segment. This is a "strong priors" problem: it's all too easy to accept weak evidence when it agrees with a theory you already believe strongly.
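One way to test the retiree theory rather than assume it is to tabulate the share of zero-income customers within each age bracket. The sketch below is only an illustration: the `(age, income)` pairs, the 20-year bracket width, and the function name are assumptions, not the class data.

```python
from collections import defaultdict

# Hypothetical (age, median_income) pairs, invented for this sketch.
records = [(24, 0), (28, 41_000), (35, 0), (37, 52_000), (46, 0),
           (49, 39_000), (58, 0), (55, 61_000), (67, 0), (71, 33_000)]


def zero_share_by_bracket(rows, width=20):
    """Share of zero-income rows within each age bracket of `width` years."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for age, income in rows:
        bracket = (age // width) * width  # e.g. age 35 falls in bracket 20
        totals[bracket] += 1
        if income == 0:
            zeros[bracket] += 1
    return {b: zeros[b] / totals[b] for b in sorted(totals)}


shares = zero_share_by_bracket(records)
print(shares)  # {20: 0.5, 40: 0.5, 60: 0.5}
```

If retirees explained the zeroes, the share would climb sharply in the oldest bracket. A roughly flat share across brackets, as in this toy data, is what a wide age range in the zero-income segment looks like.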
One student divided the customers into zero-income versus not. This allows us to examine the distribution of other variables within each segment. For example, the median home value of those with "zero income" is almost the same as that of those with positive income.
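That split-and-compare step can be sketched in a few lines. The field names (`median_income`, `median_home_value`), the helper functions, and the records below are hypothetical, chosen only to show the shape of the check.

```python
from statistics import median

# Hypothetical customer records, invented for this sketch.
customers = [
    {"median_income": 0, "median_home_value": 200_000},
    {"median_income": 0, "median_home_value": 240_000},
    {"median_income": 38_000, "median_home_value": 190_000},
    {"median_income": 42_000, "median_home_value": 215_000},
    {"median_income": 47_000, "median_home_value": 230_000},
    {"median_income": 55_000, "median_home_value": 250_000},
    {"median_income": 61_000, "median_home_value": 225_000},
    {"median_income": 90_000, "median_home_value": 310_000},
]


def split_by_zero_income(rows, income_key="median_income"):
    """Return (zero_rows, positive_rows) based on the income field."""
    zero = [r for r in rows if r[income_key] == 0]
    positive = [r for r in rows if r[income_key] > 0]
    return zero, positive


def compare_segments(rows, variable, income_key="median_income"):
    """Compare the median of `variable` across the two income segments."""
    zero, positive = split_by_zero_income(rows, income_key)
    return {
        "zero_share": len(zero) / len(rows),
        "zero_median": median(r[variable] for r in zero),
        "positive_median": median(r[variable] for r in positive),
    }


result = compare_segments(customers, "median_home_value")
print(result)
```

When the two medians come out close, as they do in this toy data (220,000 versus 227,500), it becomes hard to believe that the zero-income group is a genuinely different population, and "zero means unknown" becomes the more plausible reading.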
***
Think about the people you hire to do analytics. While any of the answers above is acceptable, if you find someone who can give you Responses #3 to #5, you are in much better shape. That's what I mean by hiring for numbersense.