My Instagram and LinkedIn DMs are often full of people asking me some variation of the same question:

How do I become a Data Scientist?

I always take some time to try to understand the motivations of each person who turns to me for advice, but since I often notice the same patterns, I decided to write this post as general guidance for everybody.

I noticed that there are mostly three kinds of people asking me this question:

  1. Students of non-technical subjects like Marketing or Business (sometimes even Art or Literature!)
  2. People who work in business functions within organizations. They mostly have the same background as [1]
  3. Students of technical subjects like Engineering or Physics

I came up with a framework to help anyone answer this question. It’s based on three steps:

  1. Answer the question: do I really want to become a Data Scientist?
  2. Find out your gaps
  3. Plan how to bridge your gaps

Process of becoming a Data Scientist

Step 1: understanding if you really want it

First of all, ask yourself if you know what a Data Scientist actually does. The typical (exceptions exist) work of a regular (no special skills, no special tasks) Data Scientist looks like this:

  • Understand business questions: usually, ~5-20% of your time. Depends on the specific problem (some are easier than others) and your team structure (your boss may be the one who talks to the business while you’re more operational).
  • Clean, explore, understand data: ~70% of your time
  • Create fancy ML models: ~10-20% of your time.
  • Reporting results: ~5-10% of your time.

Obviously, if you’re working on Machine Learning models to design drugs to cure cancer your work will be different. Take this as a general rule of thumb for most DS jobs in traditional organizations.

Now, if you look at the list above, you’ll realize that a big chunk of your life will be spent looking at a screen trying to figure out the data you’re working with (Clean, explore, understand data: ~70% of your time). This can be fun, but also frustrating, especially if you’re dealing with corporate datasets that are often messy, dirty, and tough to understand.

To give you an example, it’s extremely common that your job will require you to go through data sent to you by the IT department of a company that maybe hasn’t done much Data Science in the past. This is just the reality of most companies that are starting off with Data Science today. Often this scenario translates into a long and frustrating back and forth between you, the IT department, and some other stakeholder, trying to figure out what the hell that 3rd column in that Excel sheet is, why there are so many missing values, and why the 35th version of the file you received is different from the 34th.
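To make this concrete, here is a minimal Python sketch of the kind of first-pass checks that back and forth usually starts with: counting missing values per column and comparing two versions of the same file. Everything in it is made up for illustration: the sample data, the column names, the list of missing-value markers, and the helper names. Real files will need their own rules.

```python
import csv
import io

# Hypothetical sample data: two versions of the "same" file (invented values).
VERSION_34 = """customer_id,region,col_3
1,North,0.5
2,,1.2
3,South,
"""

VERSION_35 = """customer_id,region,col_3
1,North,0.5
2,West,1.2
3,South,
4,East,NA
"""

# Assumption: these strings all mean "no value". Agree on this with whoever owns the data.
MISSING_MARKERS = {"", "NA", "N/A", "null", "NULL"}


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or value.strip() in MISSING_MARKERS
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


def diff_versions(old_rows, new_rows, key):
    """Return the keys that were added, removed, or changed between two versions."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


if __name__ == "__main__":
    old_rows = read_rows(VERSION_34)
    new_rows = read_rows(VERSION_35)
    print("Missing per column (v34):", count_missing(old_rows))
    print("Missing per column (v35):", count_missing(new_rows))
    print("Added / removed / changed:", diff_versions(old_rows, new_rows, "customer_id"))
```

The code is the easy part. The hard part is finding out from IT and the business what the columns mean and which changes between versions were intentional.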

In short: the job of a Data Scientist is an office job spent staring at a screen, cursing at messy data most of the time. The rest of the time can be extremely satisfying though.

Once you know the truth, ask yourself whether you’re up for it. Now, time to think about step 2:

Step 2: find out your gaps

In general, a good Data Scientist needs 2 skills:

  1. Scientific knowledge: ML models, mathematics and statistics
  2. Computer Science (CS) knowledge: Ability to write good code (usually in Python or R)

Some people also add business acumen and domain knowledge, but I think these can be left off the list, as they can (and probably must) be learnt in the field. On the other hand, the importance of coding skills is hugely underestimated.

Too many times I’ve seen corporations hire people based on their “strong math/physics/scientific background”, without even testing the candidate’s coding skills. When it turns out that the new hire can only use Excel, all their scientific skills become useless at the first project more complex than a simple report.

Also, having some coding skills but no supervision from expert developers leads to buggy code, irreproducible results, and usually unmaintainable and unsuccessful projects. Knowing some Python or R syntax is not enough: having some idea of what it means to build a software project is key to being a real Data Scientist who can bring value to an organization.

You can now probably assess what your gaps are and move on to step 3 of the process:

Step 3: filling the gaps

Obviously, filling the gaps largely depends on your background. Going back to the three personas we identified in the beginning:

  1. If you have a non-technical major, you probably have a gap in both scientific and CS knowledge
  2. If you work for an organization in a business function, you’re probably in the same boat as [1]
  3. If you’re studying a technical subject like engineering, you probably have part of the scientific knowledge and part of the CS one

If you’re in situation 1 or 2, becoming a Data Scientist is hard but not impossible. The easiest way is to enroll in a university course, but this assumes that you have the time and the patience, and that you’re 100% sure you want to invest at least 3 years of your life pursuing something you’ve never tried.

If you go with self-learning, remember what Desmond Tutu once wisely said: “there is only one way to eat an elephant: a bite at a time”. In this case, I think it’s better to take two bites at a time :)

I believe that it’s best to start by building a basic knowledge of math and statistics, while also taking your first steps in coding. This way, you get two advantages:

  1. You balance practice (coding) and theory (math), and minimize the risk of getting bored
  2. You’re immediately exposed to the two core skills that make a Data Scientist, so you figure out faster whether it’s something for you

You can find countless maths and statistics courses on Coursera, and free coding classes on websites like DataCamp or Codecademy. Just Google and you’ll be amazed at what’s on offer!

Depending on how much time you put into this, I’d say that building some foundations in Maths and Statistics starting from nothing can take anywhere from 6 months (if you’re really into it) to 1 year.

Once you have some foundations, you can start building the actual house. This is the point you’d start from if you’re persona #3, someone with a technical degree.

In this case, I’d start by studying some ML and getting some coding experience on real projects. You can get the ML theory from many online courses; a classic one is Andrew Ng’s Coursera course named “Machine Learning”, with the only issue that it’s in MATLAB, a language that’s basically never used in industry. Apart from this, it’s a great course that will give you a solid knowledge of the basic ML algorithms. Then, move on to some more practical courses on Udacity or elsewhere on the web (it’s literally full of free offerings).

The hard part will be getting some coding experience on real projects. The ideal solution is having a coding mentor or landing an internship at a company that has some good coders. These are the two options for lucky people.

For regular people, a good approach that doesn’t require begging good professionals is to go on Kaggle.com, take one of the projects, and try to understand how the Kaggle pros structure an ML project. Then, try it yourself! I can’t stress enough the importance of deliberate practice: don’t wait for someone to give you a problem to work on, find one yourself and do your best to solve it. Kaggle can provide some examples of how other people have solved a common problem.

It’s not a straight path

The idea that you can simply follow three well-defined steps is reassuring, but it’s also a convenient lie. At many different points in your path, you’ll face challenges that will probably either make you rethink your decision or uncover new gaps. This is especially true at the beginning of your journey, when you’ll often ask yourself why you’re spending your days stuck on math problems or coding exercises. At that stage, your decision process will look more like this:

Process loop

The other lie is that there will be a point when you’ll feel like you are a Data Scientist and your journey is complete. Data Science is a vast and continuously evolving subject, so you’ll never finish studying and evolving. There’s always new stuff to learn, and this is part of the beauty of this job!

Conclusions

I hope this framework can be helpful for anyone who has ever thought about becoming a Data Scientist. It’s a hard, complex, sometimes frustrating job that can also give you immense satisfaction. You just need to figure out if and how badly you want it.