What do you need to become a data scientist?
There is no single starting point or path to becoming a data scientist. You can start from anywhere, whether you are a science, engineering or commerce graduate or hold a Ph.D., and continue your journey in many ways: coding solutions to problems you see around you, taking online courses, participating in a Kaggle competition or doing a data science project under a mentor. Even though there is no single path, there is a set of common skills and passions that you must possess. Mathematics and reasoning come first, and along with them you should have a passion for coding and problem solving.
It goes without saying that data scientists work in an ecosystem of evolving technology, so some of what you learned five or ten years ago may not be relevant today. You must be aware of which technologies are in use at present, whether they relate to database systems, big data, analytics, visualization or machine learning frameworks. Thanks to the thousands of blog posts on Medium and LinkedIn and the outstanding videos on YouTube, it should not be difficult to find out what is trending. The journey of a data scientist always starts with data: if there were no data, there would be no data scientists!
Data
If you look at data from a very narrow point of view, such as data from one particular domain, you will have limited room to show creativity in data science. Think about data in terms of how you interact with your environment through your mobile phone (text, audio, video, fitness records, etc.) or through your experiences. If you think about data only as nicely formatted CSV or Excel files or SQL tables, you may be disappointed when you start working on your first assignment as a data scientist (I started one of my first serious data science projects by scraping data on millions of commits from GitHub).
Do not depend on others to provide you with data tables; create your own tables from dirty data by scraping Twitter, Facebook, news websites or YouTube! Data scientists should know how to create data and, in some situations, guide members of other teams on how to collect data and what type of data to get. Data scientists are not expected to complain about the format, correctness or cleanliness of data. Every form of data must be welcome! Depending too much on proprietary tools for trivial data manipulation is risky. To be honest, the closer you are to your data, the more potential you have to show your creativity. Apart from some restricted formats, Python can be used to read, write and manipulate any kind of data you have. I cannot think of a kind of data where Python has a clearer upper hand than text data!
Data is the goldmine of information, and software tools are just tools! You cannot get gold from a mine where there is no gold, no matter how fancy your tools are. Tools can make digging easier but cannot implant gold where there is none! Data is the goldmine, the solution to the problem we are trying to solve is the gold, and the rest is just detail.
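As a small illustration of building your own table from dirty text, here is a minimal sketch that uses only the Python standard library. The sample lines, the `date | author | message` layout and the `parse_lines` helper are all made up for this example; real scraped data will need its own pattern.

```python
import re

# Hypothetical raw lines, e.g. scraped commit records (made up for illustration).
raw_lines = [
    "2021-03-01 | alice | Fix typo in README",
    "  2021-03-02|bob|Add unit tests  ",
    "not a valid line",
    "2021-03-05 | alice | Refactor data loader",
]

# Assumed layout: date | author | message, with stray whitespace tolerated.
pattern = re.compile(r"^\s*(\d{4}-\d{2}-\d{2})\s*\|\s*(\w+)\s*\|\s*(.+?)\s*$")


def parse_lines(lines):
    """Return a list of dict rows, skipping lines that do not match the pattern."""
    rows = []
    for line in lines:
        match = pattern.match(line)
        if match:
            date, author, message = match.groups()
            rows.append({"date": date, "author": author, "message": message})
    return rows


rows = parse_lines(raw_lines)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be handed to whatever table structure you prefer; skipping malformed lines silently is a design choice you may want to replace with logging.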
Data Structures & Algorithms
You can play with data without knowing much about data structures and algorithms, but that will not take you very far. At some point you will need to educate yourself about the different types of data structures used for different types of data. Data structures provide a framework to collect, store, process and share data. Broadly, there are two kinds of data structures: those you make use of immediately, such as Python lists, dictionaries, strings, tuples and data frames, and others (a bit more advanced) such as linked lists, stacks, queues and trees, which you may need to understand for more advanced work. Things become much easier once we convert text data into Python lists (of words), tabular data into (pandas) data frames and image data into NumPy arrays. Thanks to the large number of libraries and built-in functions available in Python, you may not need to code any algorithm from scratch (although that may be awesome!), but in your spare time it is worth coding some of the common algorithms, if only for historical interest (in case you care about that).
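To make the two groups of data structures a little more concrete, here is a minimal sketch using only built-in types and the standard `collections` module. The sentence is made up, and the stack and queue are shown in their simplest idiomatic Python form rather than implemented from scratch.

```python
from collections import Counter, deque

# Everyday structures: text -> list of words -> dictionary of counts.
text = "data is the gold mine and tools are just tools"
words = text.split()      # list of words
counts = Counter(words)   # dictionary-like mapping: word -> frequency

# Stack (last in, first out) using a plain list.
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()         # removes the most recently added item

# Queue (first in, first out) using collections.deque.
queue = deque()
queue.append("a")
queue.append("b")
first = queue.popleft()   # removes the oldest item

print(len(words), counts["tools"], top, first)
```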
Mathematics & Statistics
Mathematics is a vast subject, and if you are familiar with topology, differential geometry, number theory and graph theory, that is great. But to start your journey as a data scientist there is good news: you just need to brush up on your concepts in linear algebra (vectors, matrices, tensors) and calculus (mostly for understanding optimization problems). I think it greatly helps to leave behind the concept of vectors learned in physics and think in terms of general vectors, such as "word vectors" or the rows of an Excel file, and to think of matrices in terms of neural network weights and NumPy arrays. Linear transformations (multiplying a vector by a matrix) are very important, and so are eigenvalues, eigenvectors and singular value decomposition. It is important to have a "feeling" for "data vectors" and covariance matrices. A good understanding of linear and non-linear relationships is also very important.
Real-world data is "noisy", so there will be a lot of uncertainty in the information we extract from any real-world data set. In other words, statistical analysis is the only kind of analysis that is possible with a large volume of real-world data. The basic principles of probability, as well as the common probability distributions, must be mastered. Working through the multivariate Gaussian distribution and a line-fitting exercise with chi-square minimization, with pen and paper, on some Sunday afternoon should help. New data bring new information; knowing how that information updates our knowledge of the situation, and how this fits into the Bayesian framework, is extremely important. To be precise, I am talking about parameter estimation and model comparison in the Bayesian framework.
As I mentioned above, the analysis of any real-world data set must be statistical, so knowing the technical terms used to present results is a must. To be precise, I am talking about precision, recall, accuracy, F-score, ROC, AUC, type I and type II errors, etc. Finally, a basic knowledge of information theory (Shannon information and cross entropy) is also a must. Two concepts in statistical analysis clearly stand out: the central limit theorem and principal component analysis.
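The linear algebra ideas above can be tried out in a few lines of NumPy. This is only a sketch: the 2x2 matrix and the vector are arbitrary values chosen for illustration, and `np.linalg.eigh` is used because the example matrix is symmetric.

```python
import numpy as np

# An arbitrary symmetric matrix and vector, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 1.0])

# Linear transformation: multiply the vector by the matrix.
w = A @ v

# Eigen-decomposition of a symmetric matrix (eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(A)

# Singular value decomposition (singular values in descending order).
U, s, Vt = np.linalg.svd(A)

print("A @ v          =", w)
print("eigenvalues     =", eigvals)
print("singular values =", s)
```

Each column of `eigvecs` is an eigenvector, so `A @ eigvecs` should equal `eigvecs * eigvals` up to rounding, which is a handy check when experimenting.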
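The line-fitting exercise mentioned above can also be checked on a computer after doing it by hand. Below is a minimal sketch in plain Python of the standard closed-form weighted least-squares solution for a straight line y = a + b*x. The data points and error bars are made up, and the function names are my own.

```python
# Made-up measurements: x values, observed y values and their 1-sigma errors.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
sigmas = [0.2, 0.2, 0.2, 0.2, 0.2]


def chi_square(a, b):
    """Chi-square of the straight line y = a + b*x against the data."""
    return sum(((y - (a + b * x)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))


def fit_line():
    """Closed-form minimum of chi-square for y = a + b*x (weighted least squares)."""
    w = [1.0 / s ** 2 for s in sigmas]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / delta
    b = (S * Sxy - Sx * Sy) / delta
    return a, b


a_fit, b_fit = fit_line()
chi2_min = chi_square(a_fit, b_fit)
print(f"intercept a = {a_fit:.3f}, slope b = {b_fit:.3f}, chi-square = {chi2_min:.3f}")
```

Moving `a` or `b` slightly away from the fitted values and calling `chi_square` again is a quick way to convince yourself that the fit really sits at the minimum.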
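To see how several of the terms above relate to each other, here is a minimal sketch for a binary classifier written in plain Python. The label lists are made up, and in practice you would probably use a library for this; the point is only to show the definitions.

```python
# Made-up ground truth and predictions for a binary classifier (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives (type I errors)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives (type II errors)

precision = tp / (tp + fp)            # of the predicted positives, how many are right
recall = tp / (tp + fn)               # of the actual positives, how many were found
accuracy = (tp + tn) / len(pairs)     # fraction of all predictions that are right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, tn, fp, fn)
print(precision, recall, accuracy, f1)
```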
Visualisation
Data visualisation is as important as data analysis; as the quote goes, "if you do not see there, it is not there!". In many cases we can see what is there just by looking at the data, and no analysis is needed (apart from minor cleaning). This is particularly true for small data sets. Python's matplotlib offers so much that you may hardly need any other tool. Line charts, scatter plots, bar charts, pie charts and so on: there are many ways to visualize data. Sometimes plain matplotlib may not be sufficient, for example when you need animation (although matplotlib does support it) or when you need to show plots of dynamic data. In those cases it is good to have some knowledge of dashboards (I mostly use Dash for Python, and that works most of the time). Microsoft Power BI provides quite easy visualization tools without any coding, so it is worth exploring.
Software Tools
There are many software tools that a data scientist may need at some point, but here I mention just three: SQL, Git and Flask.
A lot of data still lives in SQL databases, so it is worth knowing how to get data out of them. As a data scientist you may not need very advanced knowledge of SQL, but you may have a hard time if you are not familiar with the basics. So it is worth knowing a) how to connect to a SQL server and b) how to create, read, update and delete (CRUD) tables.
To my knowledge, Git is the most common version control system, and most developers use it for collaboration. It is therefore very important for a data scientist to have a basic knowledge of Git, so that he or she can collaborate with other team members and write production-quality code.
One of the most common ways to deploy machine learning models is to create APIs and deploy them on a web server, and Flask helps with that. It does not take more than 30 minutes to write a Flask "hello world" program in Python and get it up and running.
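The connect-and-CRUD workflow described above can be practised without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real SQL server; the `commits` table and its rows are made up for illustration, and a production server would need its own driver and connection details.

```python
import sqlite3

# Connect: an in-memory SQLite database stands in for a real SQL server here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a (hypothetical) table and insert rows.
cur.execute("CREATE TABLE commits (id INTEGER PRIMARY KEY, author TEXT, message TEXT)")
cur.executemany(
    "INSERT INTO commits (author, message) VALUES (?, ?)",
    [("alice", "Fix typo"), ("bob", "Add tests")],
)
conn.commit()

# Read.
rows = cur.execute("SELECT author, message FROM commits ORDER BY id").fetchall()

# Update and delete, using placeholders instead of string formatting.
cur.execute("UPDATE commits SET message = ? WHERE author = ?", ("Add unit tests", "bob"))
cur.execute("DELETE FROM commits WHERE author = ?", ("alice",))
conn.commit()

remaining = cur.execute("SELECT author, message FROM commits").fetchall()
conn.close()

print(rows)
print(remaining)
```

The `?` placeholders are worth adopting from day one, since they let the driver handle quoting and avoid SQL injection.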
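For reference, a Flask "hello world" can look roughly like the sketch below. The `/predict` route and its `2 * x + 1` formula are placeholders I made up to stand in for a real trained model; only the `/` route is the usual hello-world part.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/")
def hello():
    # The classic hello-world endpoint.
    return "Hello, world!"


@app.route("/predict/<float:x>")
def predict(x):
    # Placeholder "model" (an assumption for illustration): y = 2x + 1.
    return jsonify({"input": x, "prediction": 2 * x + 1})
```

Assuming the file is saved as `hello.py` and a recent Flask version is installed, running `flask --app hello run` should start a local development server. A real deployment would load a trained model once at startup and sit behind a production web server.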
Machine Learning
If you think data scientists do only AI and machine learning, you are far from the truth. Machine learning is one of the tools in a data scientist's toolkit, and it is useful when you have a large volume of labelled data available. If you have a sound background in mathematics and statistics, it may take less than a month to understand the theoretical concepts behind the famous five machine learning algorithms (linear regression, logistic regression, decision trees, support vector machines and neural networks).
At present it is fashionable to call machine learning algorithms that employ neural networks "deep learning", and for these algorithms you need to understand (1) the architecture, (2) the loss function and (3) the optimization used. A beginner in deep learning must know the basic differences between multilayer perceptrons, convolutional neural networks and recurrent neural networks. If you are a beginner, do not mention "BERT", "ATTENTION" and "TRANSFORMER" in your CV!
There are many worked-out examples on the internet, so it is worth practising the most common ones, such as classification on the Iris and MNIST data sets. There are many machine learning libraries for Python, and two are worth mentioning: TensorFlow/Keras and scikit-learn. The first is important because it makes it possible to build complex neural network models very easily (by the way, automatic differentiation is a non-trivial offering of TensorFlow/Keras). Scikit-learn clearly stands out for its comprehensive documentation, examples and easy-to-use APIs.
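As one example of such a practice exercise, here is a minimal sketch of classification on the Iris data set with scikit-learn. The choice of logistic regression, the 75/25 split and the fixed random seed are my own assumptions for illustration; any of the classifiers mentioned earlier could be swapped in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Iris data: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for testing (split and seed are arbitrary choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier and measure accuracy on the held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"test accuracy: {accuracy:.2f}")
```

Evaluating on held-out data rather than on the training set is the habit worth building here, more than the particular model.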
Python
There is no doubt that Python (Python 3.x, to be precise) is the most favored language among data scientists, not because it is the best or the fastest but because it comes with very good documentation and tutorials and because a very large number of open-source Python libraries are available. Python runs on all kinds of machines (Mac, Windows, Linux) and has handy tools like Jupyter Notebook. As a data scientist you are expected to have a good knowledge of object-oriented programming in Python and of libraries like NumPy and pandas.
Cloud
Thanks to affordable network bandwidth, it has become practical to store your data and run your code remotely on compute infrastructure managed by big cloud companies such as Google, Amazon and Microsoft. There may be many reasons for a data scientist to use cloud computing, but I see roughly three main ones.
1. When you are short of compute or storage resources on your local system, you can buy storage or compute instances and pay according to your consumption. What you get is a server with a large number of computing cores and/or plenty of storage and memory. You are given keys to connect to the server (using ssh) and the option of a Linux or Windows instance. You must install all the software you need, build your compute pipeline and transfer data between the cloud and your local machine. This is a genuine use case, and you cannot do better than this one.
2. Some cloud companies provide not only storage and compute resources but also machine learning algorithms and pipelines, so you may not need to write a single line of code: you just upload your data, configure the pipeline and run it. In this case you mostly do not know what is going on behind the scenes, and you have very little control. To me it does not make sense to use the cloud for a linear regression problem that you can easily solve on your local machine.
3. In some cases you have even less control: cloud companies provide access to their APIs for using ML facilities on different kinds of data (text, images, time series, etc.). This case is the most opaque but also the most common (you can check how GPT-3 is used, for example).
In this article I have deliberately avoided mentioning tools and technologies that are tied to a particular company, since that makes the discussion quite biased. Commercial tools encourage users to switch to a "no-coding" environment. Most of these commercial tools can be learned easily whenever you need them, and what you learn about a particular tool remains useful only for that tool. The topics I covered above, however, should remain useful irrespective of the environment you work in or the vertical you work for.
Data science is a vast area, and what I have covered is just the tip of the iceberg. Maybe in future articles I will touch upon some other important points.
Please like, comment, share & post your feedback.