Last week, I gave some brief remarks at the INFORMS New York Metro Student-Practitioner Forum (link), attended by a large group of enthusiastic students eager to enter the field of data science and analytics. (By the way, if you are at the INFORMS Analytics Conference in Orlando, come and find me. I am speaking on Ethics on Tuesday morning.)
I told the students that it is not too early to dispense with some myths about the data science and analytics career. The sooner they get rid of these wrong ideas, the brighter their future in this field.
Myth 1: Data science & analytics is all about coding and tools.
Wrong. Data science & analytics is all about problem solving. Coding and tools are useful for problem solving, but it is more important to be able to frame the business problem, collect good data, and understand your data.
Myth 2: Coding is hard.
Wrong. Coding is easy. It may not feel that way. The media tells us that we need coding bootcamps. Many academic degrees focus their training on R, Python, and other coding platforms. The truth is Google, StackExchange and similar websites have made coding as trivial as copying and pasting bits and pieces of text. The hard part is to know what you want to do with the data - once you know that, it is only a few clicks to code it up. [PS. This is not meant to say there aren't good vs. less good coders.]
Myth 3: Data analysis is easy.
Wrong. Data analysis is really hard. The myth that it is easy arises because we have great tools that generate output quickly with a few clicks. But using a standard methodology does not guarantee good results. I have various examples of good methods delivering horrible results in the Prologue to my book Numbersense (link).
Myth 4: Data science & analytics is fun.
Wrong, unless you like laborious and tedious work. As an analyst, you will spend 80 percent of your time doing grunt work, such as diagnosing errors in the data, and correcting data issues. Something as simple as correcting the format of a date may require you to move data through multiple servers (see here). You will need to probe to the deepest level of the data generation process, which frequently means you need to get in the face of engineers and others, who have other pressing concerns. This part of life is to be tolerated in order to enjoy the fun parts.
Myth 5: Machines will replace humans.
To end my comments on a high note, I believe this career is unlikely to get replaced by machines. Here is some food for thought:
Last year, I did an article for 538 about New York City restaurant health inspection ratings. In the dataset, each restaurant is classified by cuisine type, and that classification has its share of errors. Imagine that there is a business called Ivory that is described as a Thai restaurant in the data. You live next to Ivory, and so you know that it is an Ethiopian restaurant, not Thai.
How would a machine figure out this error? It doesn't. If one wants to argue, one could say that the machine can go and collect a huge dataset, then find all co-mentions of the restaurant Ivory with a cuisine type, eventually compute relative frequencies, and finally select the cuisine with the greatest likelihood of being correct.
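To make that machine approach concrete, here is a minimal Python sketch of the co-mention idea. It assumes someone has already collected (restaurant, cuisine) pairs from some large text source; the function name `guess_cuisine` and the sample data are made up for illustration and do not come from the 538 dataset.

```python
from collections import Counter

def guess_cuisine(mentions, restaurant):
    """Pick the cuisine most often co-mentioned with a restaurant.

    `mentions` is an iterable of (restaurant_name, cuisine) pairs.
    Returns (cuisine, relative_frequency), or None if the restaurant
    never appears in the data.
    """
    # Count how often each cuisine is mentioned alongside this restaurant.
    counts = Counter(cuisine for name, cuisine in mentions if name == restaurant)
    total = sum(counts.values())
    if total == 0:
        return None
    # Select the cuisine with the highest relative frequency.
    cuisine, n = counts.most_common(1)[0]
    return cuisine, n / total

# Hypothetical co-mentions, invented for this example only.
mentions = [
    ("Ivory", "Ethiopian"),
    ("Ivory", "Ethiopian"),
    ("Ivory", "Thai"),
    ("Ivory", "Ethiopian"),
    ("Lotus", "Thai"),
]

print(guess_cuisine(mentions, "Ivory"))  # ('Ethiopian', 0.75)
```

Note that the answer is only as good as the collected mentions: if most sources repeat the same mislabel, the most frequent cuisine is still the wrong one.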
So we have humans who need a sample size of one to get a guaranteed correct answer, and machines that need massive data to get a sometimes-wrong answer.
I'd like to comment on myth #5 (Machines will replace humans).
For any particular analysis, machines will likely replace humans over time. Once we figure out that data structure X, preparation steps Y and algorithm set Z provide an answer to a particular set of problems, there is a huge advantage to standardizing that analysis (i.e. embedding it in automated procedures) so the results can be validly compared across time, across markets, etc.
But for every problem we solve, there are new opportunities for analysis, either because the problem can now be explored more deeply (e.g. not "does advertising pay out" but "how do I optimize my advertising"), or because new data become available, or because there are new domains to explore (e.g. this analysis will work for airline frequent flyers -- how can we adapt it for frequent gambler analysis?).
Let's look at aviation: the big prize in the late 1920s was for flying solo across the Atlantic. Once that engineering problem was solved, it evolved into bigger issues -- how do we get lots of people across the Atlantic? Where do we locate waystations? How can we figure out how to lose lots of luggage? What sort of pricing model should be used? (etc, etc)
Posted by: zbicyclist | 04/12/2016 at 03:18 PM
Hi Professor Fung,
I'd like to comment on Myth 2.
I agree with the argument that code is only a tool and that we do not need much complex, difficult computer science knowledge. Usually we should focus on the problem and use Google and other sources to find the code and solutions we need to solve it.
However, there are many different ways to solve one specific problem. When the data is not so large, it may make no difference. But when the amount of data is very large, the computer will spend a lot of time running the algorithm. At that point, maybe deeper knowledge of data structures or computer science will play an important role? With such a huge amount of data, any small inefficiency will make a huge difference.
But I am not very sure. My experience in data science is somewhat limited. The above are only some naive ideas :)
XG
Posted by: plus.google.com/101727942306810250614 | 05/23/2016 at 03:39 AM
XG: Sure, for truly large problems, or problems in which you have a microscopic amount of time to do computations, coding skill becomes much more important. But despite what the media reports, most real-world problems do not have those characteristics. Copying and pasting ready-made code is not a bad thing - most people who cook at home consult recipes.
zbicyclist: I like to draw a distinction between engineering problems and statistical problems. Because machines (of today's vintage) operate on, to use your terminology, "data structure X and preparation steps Y and algorithm set Z," which presumably have been "figured out," these machines do not handle uncertainty. They may be able to deal with uncertainty that can be described by a probability model but that is a severe restriction.
Posted by: Kaiser | 05/24/2016 at 12:32 AM