How to get started in the field of data science

Introduction

In this article I will describe the path you can take to get started with Data Science. It is a long path, but in my experience a reliable one toward mastery of Data Science and Analytics.

I am describing this path based on my personal journey into the field of Data Science. Below are the steps I think you should take to master it.

Start with Python or R

First and foremost, in order to start playing with datasets you need to learn a programming language that makes data manipulation easy. Python is one of the most popular and easiest languages to learn for getting started with Data Science.

There are a ton of free resources available online to learn Python. Beginner to intermediate knowledge is enough to get started. If you are a total beginner to programming, I would suggest buying the Programming Python book by Mark Lutz. There are also numerous courses on YouTube that you can start with.

Another very useful book is the Python Data Science Handbook by Jake VanderPlas.

If you already know Python, I would highly recommend learning R as well. In my experience R has some of the best statistical tools available in any language, and some of its time series libraries are hard to match elsewhere. Hence it is a very important tool in a Data Scientist’s arsenal.

Again, I am a big fan of books for learning anything new, and you cannot go wrong if you follow the book R for Data Science by Hadley Wickham.

Brush up on Linear Algebra

Once you have a beginner's handle on either of the programming languages above, it is time to revisit some of the important Math concepts that you might have learned in high school or during your Bachelor's degree. Linear Algebra is the backbone of both classical statistical analysis and modern machine learning and deep learning techniques.

For this I would recommend going through the free YouTube course on Linear Algebra by Gilbert Strang. You can skip it if you are already clear about basic concepts in Linear Algebra such as dot products, vector spaces, bases and components.
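If you want a quick check on whether those concepts are clear, here is a minimal sketch in plain Python (no libraries) of a dot product and of finding the components of a 2D vector in a non-standard basis. The function names and the example vectors are my own illustrative choices, not taken from any course.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def components_in_basis(v, b1, b2):
    """Solve v = c1*b1 + c2*b2 for (c1, c2) in 2D using Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent, so they are not a basis")
    c1 = (v[0] * b2[1] - b2[0] * v[1]) / det
    c2 = (b1[0] * v[1] - v[0] * b1[1]) / det
    return c1, c2


# Illustrative basis: two orthogonal (but not unit-length) vectors.
b1, b2 = (1, 1), (1, -1)
print(dot(b1, b2))                          # zero means the two vectors are orthogonal
print(components_in_basis((3, 1), b1, b2))  # coefficients that rebuild (3, 1) from b1 and b2
```

If you can predict both printed results before running the code, you are probably fine to skip the course.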

Learn the basics of Statistical Analysis

Once you have covered the basics of Linear Algebra, it is time to build on those concepts with Statistical Analysis. If you are new to Statistics, the online stats website is the best resource to begin with.

After you have finished all the exercises in the online stats book, it is time to get serious about Statistical Analysis. You can break your learning down into two parts: first Linear modeling and then Non-Linear modeling.

For Linear modeling you can follow the book Statistical Analysis by Springer, and for Non-Linear modeling you can refer to the book Extending the Linear Model with R by Julian Faraway.

Both of these topics are pretty intense, and they are the backbone of becoming a leading Data Scientist. Anybody can use Python ML libraries to predict and forecast, but only a person with a clear grasp of these basic concepts can tune a model's parameters well enough to create a truly competitive analytical model.
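As a small illustration of what a library does under the hood for the simplest linear model, here is a sketch of fitting a straight line by ordinary least squares using the closed-form formulas. It uses only the Python standard library; the function names and the toy data are made up for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data, invented for illustration: roughly y = 2x with some noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
print(intercept, slope, r_squared(xs, ys, intercept, slope))
```

Knowing where the slope and the R-squared come from is what lets you judge whether a fitted model can be trusted.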

Do learn about Copulas!

So fasten your seat belts and get ready to dive into the math-heavy field of Statistical Analysis.

Play with public datasets

Once you have a handle on the programming language, and while you are halfway through your statistical analysis journey, you can pick up some publicly available datasets from Kaggle and start applying what you have learned to find insights, build models and create visualizations.

I am leaving visualization out of this article. It is a skill with its own learning path, which I will describe in a dedicated article.

This is the fun part. Try to deep dive into any dataset you can get your hands on. Try collecting your own data, or design experiments to gather statistical observations from control and experimental groups. Get your hands dirty with data from different domains like medicine, finance, astronomy, etc.
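For example, once you have measurements from a control group and an experimental group, one common way to compare them is Welch's t-test. The sketch below computes only the t statistic and the approximate degrees of freedom with the standard library; the function name and the numbers are invented for illustration, and in practice you would look up the p-value with a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(control), len(treatment)
    # Sample variance of each group divided by its size.
    q1, q2 = variance(control) / n1, variance(treatment) / n2
    t = (mean(treatment) - mean(control)) / math.sqrt(q1 + q2)
    df = (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))
    return t, df


# Invented measurements for a control and an experimental group.
control = [5.1, 4.9, 5.6, 5.0, 5.3]
treatment = [5.9, 6.1, 5.7, 6.4, 6.0]
t_stat, dof = welch_t(control, treatment)
print(t_stat, dof)
```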

One of the major challenges of Data Science is getting the data into a format that you can build models on. In my experience this part is roughly 70-80% of the work in any Data Science project.
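To give a flavor of that preparation work, here is a minimal sketch that parses a small, messy CSV and keeps only the rows whose required columns are numeric. The column names, the sample rows and the helper names are all made up for this example, and dropping incomplete rows is just one possible policy (you might impute missing values instead).

```python
import csv
import io

# Invented sample with stray spaces, blanks and a non-numeric marker.
RAW = """patient_id,age,weight_kg
1, 34 ,70.5
2,,82
3,n/a,64.2
4,51,
"""


def to_float(text):
    """Return a float, or None when the cell is missing, blank or not numeric."""
    try:
        return float(text.strip())
    except (ValueError, AttributeError):
        return None


def load_clean_rows(raw_text, required):
    """Parse CSV text and keep only rows where every required column is numeric."""
    clean = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        values = {col: to_float(row[col]) for col in required}
        if all(v is not None for v in values.values()):
            clean.append({"patient_id": int(row["patient_id"]), **values})
    return clean


rows = load_clean_rows(RAW, required=["age", "weight_kg"])
print(rows)
```

Only the first of the four sample rows survives here, which is a fair picture of how much raw data can be unusable as delivered.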

Start with Machine Learning

After getting through the steps above, which will easily take 6 months of your time, you can start delving into traditional machine learning methods like SVM, Logistic Regression, etc.
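To show that these methods are less mysterious than they sound, here is a minimal sketch of Logistic Regression with a single feature, trained by plain batch gradient descent on the log loss. It is written from scratch for illustration only. The data, the learning rate and the number of epochs are assumptions I picked by hand, and in real projects you would use a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Invented data: hours studied versus whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(hours, passed)
print(predict(1.0, w, b), predict(4.0, w, b))
```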

This is the sexy part of data science, the one that gets written and talked about the most. You are now ready to go deeper into using Python ML libraries like scikit-learn to predict and forecast.

Again, there are a ton of free courses available online. If you would rather learn from a book, I recommend Hands-On Machine Learning by Aurélien Géron.

Have Fun with Deep Learning and AI

I know some of you got excited about Data Science because of the buzzwords Deep Learning and AI, but these are only a small subset of a Data Scientist’s toolbox.

Again, there are a ton of online resources like FastAI to get started with this. The book by Aurélien Géron also covers it for all practical purposes.

However, if you want to go deeper into the theory of Deep Learning, I would highly recommend the Deep Learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

This is the most fun part. You can dive deeper into many kinds of models, like Image Recognition, Speech Recognition, Deep Fakes, Reinforcement Learning, etc. You could easily do a PhD in any one of these topics. My personal favorite is Reinforcement Learning. I even created a bot to play Pacman while participating in one of the AI competitions organized by my University.

Do not neglect Time series modeling

Time Series modeling is another area of Data Science that is a specialization in its own right, with tons of books written on the topic. This is an area where I believe R has an advantage over Python.

Forecasting plays a major role in fields such as finance and weather prediction, and time series methods are highly useful there. There is a very good free online resource by Hyndman to start with Time Series.
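To make this concrete, here is a minimal sketch of simple exponential smoothing, one of the most basic forecasting methods. It is written in Python to stay consistent with the rest of this article. Starting the level at the first observation and picking the smoothing factor by hand are simplifying assumptions; real libraries usually estimate both from the data. The function name and the sales figures are invented.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    alpha close to 1 reacts quickly to recent values; alpha close to 0
    smooths heavily and reacts slowly.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]  # assumption: start the level at the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Invented monthly sales figures.
sales = [120, 132, 128, 141, 150, 147]
print(simple_exponential_smoothing(sales, alpha=0.3))
```

This method has no notion of trend or seasonality, which is exactly why the more advanced time series models are worth studying.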

Participate in Kaggle Competitions

Now, after gaining all of the knowledge above, if you are feeling really brave you can start participating in Kaggle competitions. I must warn you, though, that they are a great time sink, and you can get lost in the race to improve your model by a millionth of a decimal point.

So tread cautiously!

Join a formal University course

This is a big one and an expensive one. But for those really committed to getting the best knowledge available, this is a highly recommended option in my opinion. It is worth every penny if you enroll in a classroom course at a reputable university.

The interaction with professors and fellow students can be really insightful. So do give it a consideration if you have the means.

Create something useful

Last but not least, what use is an education if you do not use it to create something useful for people? Find your own projects, help NGOs, do something with your knowledge, and you will certainly rejoice once you see the fruits of your labor being used by someone.

Conclusion

Anybody can start on this journey because most of the information is freely available. But only a few can be the best at it, because it takes a lot of time and effort to get really good at creating useful models.

Remember: “All models are wrong, but some are useful” (George Box)

With that I will leave you to ponder what you want to do. Do let me know in the comments if you have any questions or feedback. Catch you in my next post!
