How to Become a Data Scientist
“You can have data without information, but you cannot have information without data” – Daniel Keys Moran
The question of how to become a data scientist comes up often, and many seem to have the answer. However, before you take their advice, make sure that you’re speaking the same language, because there’s Paris, France and there’s Paris, Texas. Both lovely, of course, but what a surprise if you meant one and got directions to the other. Likewise, there are multiple interpretations of what a data scientist is and does.
So, let’s begin with some definitions and context on how we arrived here.
What exactly is data science?
Data science, also known as data-driven science, is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Data scientist is a relatively new job title, coined in 2008 by the likes of LinkedIn and Facebook. A quick Google Trends search will show that, after brief fits and starts, the term began a sustained increase in popularity after August 2012, as compared to the established job title of data engineer.
The title data engineer has historically referred to someone trained as a software engineer and working on database systems or scaling production machines. As technology changed, new analytic techniques were required, and with them came a new and distinct title: data scientist.
Enter Drew Conway, who attempted to define the skills needed by a data scientist with his now-famous Venn diagram. Although developed in 2010, it was not widely distributed until 2013.
This diagram began as a discussion of disciplines needed in a university-level data science curriculum. While there have been many revisions of this diagram and debates on topics omitted or emphasized, the three circles remain a relevant lay of the land.
The skillset of the data scientist is a comprehensive one. It requires hacking skills that facilitate data analysis and visualization, statistics, mathematics, and the knowledge of business operations.
Briefly, let’s examine each of the circles.
Substantive Expertise
Substantive expertise, or more directly, domain knowledge, is the underpinning of what makes data science an exciting career choice. The ability to work with subject matter experts to understand the business strategy and process, and so convert a business problem into an analytics solution, is what separates a data scientist from, for example, a business analyst who is simply analyzing the results of A/B testing.
Executive communication is a key business skill of the data scientist, as well. Domain knowledge added to the proficient delivery of actionable quantitative insights to a non-technical audience – leveraging effective visualization techniques – elevates the analytics solution provided to the client.
Math & Statistics
Statistics and probability are a must for a data scientist, which should surprise no one. Different classes of business problems require different statistical techniques. Furthermore, the rigor offered by formal mathematical analysis helps ensure that the results are statistically sound.
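To make the idea of matching a technique to a problem concrete, here is a minimal sketch of one common case: checking whether two variants in an A/B test really convert at different rates, using a two-proportion z-test. The function name and the visitor counts are hypothetical and chosen only for illustration; a real analysis would also consider sample size planning and test assumptions.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns the z statistic and the two-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a result at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200 of 5,000 visitors converted on A, 260 of 5,000 on B.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone; whether it matters to the business is where domain knowledge comes back in.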
Hacking Skills
Josh Wills seems to have the perfect definition with his quote:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
— Josh Wills (@josh_wills) May 3, 2012
Also, a software engineer is often writing production code, whereas typically only the data scientist sees their own code. Because of this, the rise of open source software like R and Python, which allows algorithms to be shared, has been a huge benefit to data scientists.
Python or R?
On that note, a widely debated question is whether you should learn R or Python. This question is difficult because it sets up a false choice and pits one language against another unnecessarily.
A programming language is a tool. As a data scientist, you should choose the tool that’s best for the problem at hand.
Python is known for its readability and is effective when your solution requires integration with web applications or involves a production database. R is great for exploratory analysis and efficient implementation of statistical models and tests.
R also has a very active community providing support to both novices and experts, not to mention many recognizable evangelists such as Hadley Wickham and Hilary Parker.
Python’s community is huge but less focused on data science specifically, so it’s a bit less organized. Nevertheless, the number of Python packages that are relevant to data science is growing steadily.
The bottom line is, for beginners, it’s advantageous to learn both languages to round out your tool kit. You’ll then be able to specialize in one, while still being able to read and work with the other if the need arises.
Why the emphasis on big data?
Big data and data science are often linked. But why the emphasis on big data? As recently as 2011, industry experts were sounding the alarm about the need to develop the analytics talent required to handle the coming deluge of data being generated every day.
In the report “Big data: The next frontier for innovation, competition, and productivity”, McKinsey predicted: “There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
Consider why this is so. The number of people with a “smartphone subscription” reached 2.6 billion in 2015, according to the Ericsson Mobility Report, and this figure is expected to grow to 6.1 billion by 2020. Smartphones allow convenient consumption of various multimedia. According to the Pew Research Center, 77% of all US adults now own a smartphone, versus 35% in 2011.
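The subscription figures above imply a steady yearly growth rate, which can be worked out as a compound annual growth rate. The sketch below uses only the two numbers quoted from the Ericsson Mobility Report; the function name is hypothetical, and the calculation assumes growth is spread evenly across the five years.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above: 2.6 billion subscriptions in 2015,
# 6.1 billion expected by 2020.
rate = compound_annual_growth_rate(2.6e9, 6.1e9, 2020 - 2015)
print(f"Implied growth: {rate:.1%} per year")
```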
Moore’s Law is the engine behind the past growth that both decreased the physical size of cell phones and increased their computing power. To put this in context, the Apollo Guidance Computer that helped reach the moon had only 2k of memory, 32k of storage, and a clock speed of 1.024 MHz. Today’s typical cell phone has a dual-core, 64-bit processor with maximum speeds of around 1.3 GHz, paired with 1 GB of RAM and a minimum of 16 GB of storage.
Every day social media produces even more data. See for yourself how much data is produced every minute of the day in this infographic by DOMO.com.
The talent to handle, analyze, and derive insights from such large volumes of data has never been more critical.
Resources for data science education
The October 2012 edition of the Harvard Business Review featured an article titled “Data Scientist: The Sexiest Job of the 21st Century.”
The article also noted that demand already exceeded supply as no universities were currently offering degrees in data science. However, soon that began to change and the landscape for data science education exploded.
So, how do you become a data scientist? The resources available to educate the next generation of data scientists depend mainly on the amount of time and money available to pursue your goal. Use these resources to close the gap between your current skills and the fundamental skills of a data scientist.
No article about how to become a data scientist would be complete without a discussion of MOOCs. A MOOC, or massive open online course, is an online course, typically free to access, aimed at unlimited participation via the internet. Coursera, possibly the largest and most well-known MOOC provider, was founded in 2012 in the wake of the successful online Machine Learning course taught by Stanford’s Andrew Ng. Other MOOC providers soon followed.
By far the greatest benefit of this approach is that most MOOCs are free and learning is on your own time, at your own pace.
Some drawbacks to MOOCs include the lack of community that naturally occurs in a physical classroom setting and the fact that self-study requires quite a bit of discipline and motivation.
For an introductory overview of Data Science, try BigDataU’s Data Science Fundamentals Learning Path followed by Big Data Fundamentals Learning Path. Continue to explore other big data aspects in learning paths and courses, per your interests and desired specialization. All courses are free of charge.
Consider also the following, which charge fees for a verified certificate of participation: Coursera’s 10-course Data Science Specialization, edX’s 4-course Data Science specialization, and Udacity’s Data Science nanodegrees.
Bootcamps came on the scene initially focused on front-end and back-end web designers and coders. The format soon expanded to include data science education. A bootcamp is an immersive, full-time experience of 12 to 26 weeks that promises to move participants from novice to work-ready. Bootcamps can be pricey once you add the 12 to 26 weeks of unemployment to the initial cost.
There are many to choose from, and you should investigate each bootcamp’s history of placement and alumni satisfaction.
If you asked people how to become a data scientist, their go-to answer would probably be to enroll in a relevant program at your local university.
The first university data science programs were simply the retooling of existing data analytics programs to include exercises working with big data. Today, it seems most major universities have centers dedicated specifically to data science.
When choosing a program, look for a well-respected institution with a program combining the three pillars of business skills, math/statistics, and data analysis.
But do you need an advanced degree? Nate Silver of FiveThirtyEight, Moneyball’s Paul DePodesta, and Cloudera’s Jeff Hammerbacher have only bachelor’s degrees. The necessary skills for data science can be obtained at the bachelor’s or master’s level. Supplement your education with the excellent resources available, mostly for free, add some experience, and becoming a data scientist is achievable without a PhD.
A partial list of universities offering data science master’s degrees:
- The NYU Center for Data Science
- Data Science Institute at Columbia University
- University of Washington Master of Science in Data Science (MSDS)
- Stanford University Master of Science in Statistics: Data Science
Having discussed data science and some of the options to become a data scientist, what’s your next move?
How to become a Data Scientist? Start where you are!
We often speak of the evolution of a data-driven company. The stages range from awareness and analysis to insight and, finally, to being strategically data-driven.
Similarly, becoming a data scientist is a process. Start the journey by focusing on the fundamentals of data science:
Fundamentals of Data Science
Develop your data awareness by learning to manipulate data and building basic data science skills. Using the free resources available, learn to code using both Python and R, and move from there. Using the BDU catalog, we recommend the following:
- Overview of Data Science
- Statistics and Probability
- Machine Learning Algorithms
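As a small taste of the data manipulation mentioned above, the following sketch groups rows of a tiny table and averages a value per group, using only Python’s standard library. The data, column names, and function name are hypothetical and inlined so the example runs on its own; in practice you would more likely reach for a dedicated data analysis library.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical data, inlined so the sketch runs without any external file.
RAW_CSV = """region,sales
north,120
south,95
north,80
south,115
"""


def mean_by_group(csv_text, group_col, value_col):
    """Average value_col for each distinct value found in group_col."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_col]].append(float(row[value_col]))
    return {name: mean(values) for name, values in groups.items()}


averages = mean_by_group(RAW_CSV, "region", "sales")
print(averages)
```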
Another great resource for projects to practice your skills is Kaggle, which provides real-world data sets and a community of aspiring data scientists.
Because data science is about the discovery of insights, a healthy dose of curiosity is also essential in becoming a data scientist. Incorporate the fundamentals of data science wherever you are as you evolve your career. Imagine data science as a journey, not a destination, which provides the opportunity for life-long learning.
Take the journey and enjoy the detours along the way.