Ezra Klein (link) cited a particularly revealing graphic from the Hamilton Project. And I found this via Washington's blog (link).
The gist of the story is that most reports citing median annual earnings of US workers fail to disclose a severe selection bias: the number reflects the median of full-time wage earners.
The number cited is the black line in the chart to the left.
The more realistic number is the gray line, which includes part-timers, and the unemployed.
If you want to know the plight of 25-to-64-year-old (i.e., working-age) men in the States over time, looking at the black line is like judging a fruit vendor by the fruits displayed on top of the pile.
Removing part-timers and the unemployed is naughty because those are precisely the ones who would drag down the average. This is clearly shown in the chart, where the gray line is everywhere beneath the black line.
The other feature of interest in this analysis is the "plummeting" of the median. The median is the mid-point of a distribution: half the men earn above that amount, and half earn below it. It is not an easy number to manipulate (especially when the data have been adjusted for inflation).
If someone earning $25K (below the median) becomes unemployed, the event has no impact on the median because $0 is also a below-median number. To move the median down significantly, we need big shifts from above-median to below-median.
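This robustness of the median can be seen in a quick sketch (the earnings figures below are made up for illustration, using Python's built-in statistics module):

```python
from statistics import median

# Hypothetical earnings (in $K) for nine men; the median is 40.
earnings = [0, 10, 25, 32, 40, 48, 55, 70, 120]
print(median(earnings))  # 40

# A below-median earner ($25K) loses his job: the median is unchanged,
# because $0 is still on the same side of the mid-point.
after_layoff = [0 if x == 25 else x for x in earnings]
print(median(after_layoff))  # 40

# An above-median earner ($55K) drops to zero: now the median falls.
after_shift = [0 if x == 55 else x for x in earnings]
print(median(after_shift))  # 32
```

Only the third scenario, a shift from above-median to below-median, moves the number.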
Let's pick out 1997, when the median earnings of all men peaked at about $40,000. The median then declined over the next decade or so to roughly $33,000. This means that instead of 50% of men earning less than $40K, more than 50% now do, and the proportion keeps growing over time. One percentage point represents 800,000 people, as there are some 80 million men between 25 and 64.
The fact that the full-time median has stayed flat implies that it's the bottom that has fallen out. (There is an additional chart cited by Klein that showed that the average full-time earnings have soared in the same period, and that is possible because the median would not move if we suddenly doubled the earnings of the top 1%; such a policy would result in a drastic increase in the average earnings.)
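The mean-versus-median contrast is easy to demonstrate with made-up numbers (again a hypothetical sketch, not the Hamilton Project data):

```python
from statistics import mean, median

# Hypothetical full-time earnings (in $K) for ten men.
earnings = [20, 25, 30, 35, 38, 42, 50, 60, 80, 400]
print(median(earnings), mean(earnings))  # 40.0 78.0

# Double the top earner's pay: the mean soars, the median doesn't budge.
boosted = earnings[:-1] + [earnings[-1] * 2]
print(median(boosted), mean(boosted))  # 40.0 118.0
```

The median ignores how far the top stretches away; the mean does not. That is why soaring average earnings can coexist with a flat full-time median.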
When the median number is changing as drastically as is portrayed here, we are looking at a crisis.
There is a stimulating conversation going on between Cathy O'Neil (mathbabe) and CMU Prof. Cosma Shalizi about whether "data science" is different from "statistics". Cathy started by posting some comments about "how to hire data scientists" (link). Cosma responded with white is the new black (link): a "modern" statistics undergraduate training would prepare one well for such jobs. Cathy disagreed on several fronts, favoring PhD training (to be able to cobble together methodology on the fly and defend it) and dealing with people.
Cosma has some more thoughts (link), agreeing with Cathy on most points but unconvinced by her repeated argument that one should just hire some "smart" people and they will figure it out. He pointed to a bunch of wrong results in network science coming from physicists who are generally considered smart people.
Cathy has another follow-up (link, cross-posted to Naked Capitalism where I first picked up this thread). She doubled down on her position, arguing that statistics graduates do not have the necessary communications skills. She then railed against "poseurs" in the data science community, people who just know how to press a button and run some black-box algorithms.
Agreeing with Both
Since I have hired a few people in business statistics and seen how they fared, I have strong opinions on these topics. I think Cathy's post on which skills are the most necessary is a must-read, as is her point about asking the right questions. I agree with Cosma that a statistics degree should be very desirable to employers (who understand what they want from data science). To summarize Cathy's points, we look for creative problem-solvers.
People who follow my blogs know I have long stressed communications skills, which include Powerpoint-type presentations, in-person meetings, translations (connecting engineers and business people), and negotiations (balancing business and technical objectives). Emphatically, I did not mention dashboards, dynamic and interactive graphics, 3D charts, piles of spreadsheets, or volumes of statistical output. Some people may not want to interface with the business side; that's fine, but effective communication is still important when speaking to one's manager. It's a chance to demonstrate that one understands that the statistics serve business objectives.
Disagreeing with Both
Cathy and Cosma both feel that knowing specific programming languages is not essential. To quote Cathy, "you shouldn’t obsess over something small like whether they already know SQL." To put it politely, I reject this statement. To apply to a data science job without learning the five key SQL statements is a fool's errand. Simply put, I'd never hire such a person. To come to an interview and draw a blank trying to explain "left join" is a sign of (a) not being smart enough, (b) not wanting the job enough, or (c) not having recently done any data processing, or some combination of the above. If the job candidate is a fresh college grad, I'd be sympathetic. If he/she has been in the industry, he/she won't be called back. (One undisclosed detail in the Cosma-Cathy dialogue is what level of hire they are talking about.)
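For readers who would draw that blank: a left join keeps every row of the left table, filling in nulls where the right table has no match. A minimal sketch with hypothetical tables, run through Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical tables: users and their purchases.
cur.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "Ann"), (2, "Bob"), (3, "Cal")])
cur.executemany("INSERT INTO purchases VALUES (?, ?)",
                [(1, 9.99), (1, 4.50), (3, 20.00)])

# LEFT JOIN keeps all users, matched or not; Bob, who bought
# nothing, still appears -- with a NULL (None) amount.
rows = cur.execute("""
    SELECT u.name, p.amount
    FROM users u
    LEFT JOIN purchases p ON u.user_id = p.user_id
    ORDER BY u.name, p.amount
""").fetchall()
print(rows)  # [('Ann', 4.5), ('Ann', 9.99), ('Bob', None), ('Cal', 20.0)]
```

An inner join would silently drop Bob, which is exactly the kind of distinction a working data scientist should be able to explain on the spot.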
Why do I insist that all (experienced) hires demonstrate a minimum competence in programming skills? It's not because I think smart people can't pick up SQL. The data science job is so much more than coding -- you need to learn the data structure, what the data mean, the business, the people, the processes, the systems, etc. You really don't want to spend your first few months sitting at your desk learning new programming languages.
Both Cathy and Cosma also agree that basic statistical concepts are easily taught or acquired. Many studies have disproven this point, starting with the Kahneman-Tversky work. A recent example cited by Felix Salmon (and Andrew Gelman) showed that economists can't interpret a simple linear regression properly. Loads of shady research gets published in peer-reviewed journals across many fields, demonstrating little to no understanding of basic statistics. One of my favorite examples is a paper in transportation research, which I came across while writing my book, in which a t-test was used to show that an entire dataset is "statistically significant".
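To see why that t-test claim is nonsense: a t-test answers a question about a comparison, not about a dataset. A toy sketch with made-up travel times (the data and the scenario are invented for illustration; the t-statistics are computed from scratch with the stdlib):

```python
from math import sqrt
from statistics import mean, stdev

# Made-up travel times (minutes) under two hypothetical road designs.
old_design = [31, 29, 35, 40, 33, 37, 30, 36]
new_design = [27, 25, 30, 33, 28, 31, 26, 29]

def one_sample_t(data, mu=0.0):
    """t-statistic of the sample mean against a hypothesized value mu."""
    return (mean(data) - mu) / (stdev(data) / sqrt(len(data)))

# The misuse: testing the pooled dataset against zero. Travel times are
# obviously not zero, so t is enormous -- "significant" and meaningless.
print(one_sample_t(old_design + new_design))

def two_sample_t(a, b):
    """Welch's t-statistic for the difference in two group means."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# A sensible question: do the two designs differ?
print(two_sample_t(old_design, new_design))
```

The first number says nothing a glance at the data wouldn't; only the second addresses a real hypothesis.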
What really sets one apart in data science/statistics is intuition. Given large data sets with a gazillion dimensions, there are a gazillion ways to look at the data. How does the analyst figure out what to look at, and efficiently come to useful conclusions? When does the analyst discover that the data contain a chunk of user_ids equal to zero: at the start of the project, while digging through the results of the first or second analyses, half-way through the project, or never?
In the textbook, this last question is a completely solved problem: just follow the flowchart, do your data cleanliness checks. In the real world, things are not that simple. It may take hours, even days, to conduct a thorough check. There are millions of variables and you can't check them all. You may have developed some familiarity with the data, which leads you to skip certain checkpoints. The zero user-ids may have only recently appeared due to a mistake by someone upstream. They may not have affected your output in a noticeable way if your methodology is robust to outliers. One of the cues is typically an output data set much larger than you expected when you join the user-ids to some other data -- but I'd be impressed if you always do a count of every intermediate dataset you have ever produced. If you sense something wrong with the analysis output, can you come up with hypotheses as to why it went wrong, and have one of them check out, say, exposing the zero-id issue? "Sensing something wrong" is easier said than done when you are staring at pages of calculations.
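The row-count cue can be made concrete. In the sketch below (hypothetical tables and figures), an upstream bug writes user_id = 0 for unknown visitors, and the account table has also accumulated zero-id rows; joining on that key multiplies rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE visits (user_id INTEGER, page TEXT)")
cur.execute("CREATE TABLE accounts (user_id INTEGER, segment TEXT)")

# Hypothetical data: 2 real visits plus 5 zero-id visits,
# 2 real accounts plus 100 zero-id junk rows.
cur.executemany("INSERT INTO visits VALUES (?, ?)",
                [(1, "home"), (2, "pricing")] + [(0, "home")] * 5)
cur.executemany("INSERT INTO accounts VALUES (?, ?)",
                [(1, "free"), (2, "paid")] + [(0, "unknown")] * 100)

# 7 visit rows go in, but each of the 5 zero-id visits matches all
# 100 zero-id accounts: 2 + 5 * 100 = 502 rows come out.
n, = cur.execute("""
    SELECT COUNT(*) FROM visits v
    JOIN accounts a ON v.user_id = a.user_id
""").fetchone()
print(n)  # 502
```

An analyst who counts rows before and after the join catches the blowup immediately; one who doesn't may never notice.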
A lot of the intuition comes with experience. But experience is not sufficient. That's why I earlier mentioned the importance of learning the data and understanding how it was collected. That isn't sufficient either. It's a whole lot of things, many of them intangible, that together produce the intuition. This requirement is the toughest on fresh graduates, whether undergrads or PhDs.