Updated: Jan 12
We hear a lot about how artificial intelligence and machine learning are changing the world and making everyone’s life easier. Data Science is the practice of analyzing data to extract useful insights, find solutions, or predict outcomes for a given problem statement.
Data Science helps companies run their businesses more intelligently and make better marketing decisions. The rapid growth of the field is already producing remarkable changes: companies are building successful prediction models, digital assistants like Alexa, Siri and Google Home, online fraud detection, early detection of diseases, and more. The applications are numerous, and in the near future we will see many more such changes with the help of Data Science.
In earlier days, data was stored in Excel sheets because there was little of it, it was structured, and it was easy to manage. Such small amounts of data were easy to analyze and process using simple business tools. Times have changed: the amount of data generated each day is now huge, and it keeps growing. So we need more sophisticated and efficient algorithms to manage data at this scale.
Data Science draws on many areas of Artificial Intelligence, Machine Learning and Deep Learning in order to analyze data and extract useful insights from it.
To better understand data science, let’s consider an example of how information is used to improve the user experience on Facebook. We have all seen flashback memories of posts, pictures and videos, friend suggestions, invitations to groups, and so on. Features like these are driven by analyzing data about how people use the platform.
Below is the life cycle of a Data Science project:
Stage 1: Business Requirements phase
Data science always begins with understanding the business requirement, that is, the problem we’re trying to solve. It is important to be clear about the central objective of the project.
Stage 2: Data Collection phase
The next stage is to collect all the relevant data from different sources. This process is known as data collection, and is sometimes loosely called data mining.
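As a rough sketch of what combining data from different sources can look like in pandas, consider the example below. The two sources, the column names and the values are all made up for illustration and do not come from a real dataset.

```python
import io

import pandas as pd

# Hypothetical source 1: a CSV export (simulated here with an in-memory string)
csv_export = io.StringIO("student_id,reading_score\n1,72\n2,90\n3,95\n")
reading = pd.read_csv(csv_export)

# Hypothetical source 2: records gathered from a survey
survey_records = [
    {"student_id": 1, "test_prep": "none"},
    {"student_id": 2, "test_prep": "completed"},
    {"student_id": 3, "test_prep": "none"},
]
survey = pd.DataFrame(survey_records)

# Join the two sources on the key they share
combined = reading.merge(survey, on="student_id", how="inner")
print(combined)
```

An inner join keeps only the students present in both sources; whether that is the right choice depends on the problem being solved.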
Stage 3: Data Preprocessing phase
Sometimes unnecessary data gets collected, and such data only increases the complexity of the problem. Irrelevant and inconsistent data is not needed for the analysis and has to be removed or fixed at this stage. The data, which comes from various sources, also needs to be transformed into the desired format. This stage is also known as data processing, and it is considered one of the most time-consuming tasks: data cleaning is often estimated to take about 50–80% of the total time. It is tedious work, as we need to find the relevant data and handle missing values.
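A minimal sketch of this kind of cleaning in pandas might look like the following. The table, its column names and the choice of filling gaps with the median are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing value, one unused column
raw = pd.DataFrame({
    "student_id": [1, 2, 2, 3, 4],
    "reading_score": [72.0, 90.0, 90.0, np.nan, 65.0],
    "notes": ["", "late", "late", "", ""],  # not relevant to the analysis
})

cleaned = (
    raw.drop(columns=["notes"])   # remove a column we do not need
       .drop_duplicates()         # remove exact duplicate rows
       .reset_index(drop=True)
)

# Fill the missing score with the column median (one of several possible strategies)
median_score = cleaned["reading_score"].median()
cleaned["reading_score"] = cleaned["reading_score"].fillna(median_score)

print(cleaned)
```

Dropping rows with missing values is another common option; filling them keeps more data but can hide real gaps.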
Stage 4: Data Exploration and Analysis phase
The data exploration stage is the brainstorming part of data analysis. This is a critical stage where we try to understand the patterns in our data and retrieve useful insights, which are then used to grow the business.
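Two simple starting points for exploration in pandas are summary statistics and pairwise correlations. The sketch below uses a small, made-up table of scores; the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical scores, used only to illustrate the calls
scores = pd.DataFrame({
    "reading_score": [72, 90, 95, 57, 78, 83],
    "writing_score": [74, 88, 93, 44, 75, 78],
    "math_score": [72, 69, 90, 47, 76, 71],
})

# Summary statistics per column: count, mean, std, min, quartiles, max
summary = scores.describe()
print(summary)

# Pairwise Pearson correlations show which columns tend to move together
correlations = scores.corr()
print(correlations["math_score"].sort_values(ascending=False))
```

The article also imports seaborn, which is often used to plot the same information, for example as a heatmap of the correlation table.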
Stage 5: Data Modelling phase
This stage involves building a machine learning model that makes the predictions our requirement calls for. The model is built using the insights and trends found in the exploration stage. It is trained by feeding it many records (for example, thousands of customer records) so that it can learn to predict the outcome more precisely. Implementing a machine learning model involves the following steps:
Step 1: Import data - The required libraries and the data gathered at the previous stages need to be imported in a readable format for the machine learning process. Consider an example of student performance information:
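# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sb
from sklearn import linear_model

# Load the data (the file name is a placeholder; use the path to your own dataset)
student = pd.read_csv('student_performance.csv')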
Step 2: Data cleaning - We have already performed data preprocessing and data exploration at earlier stages, but data cannot be cleaned all at once; data cleaning is an iterative process. Duplicate, missing or null values can lead to incorrect predictions, so any inconsistencies need to be handled at this stage.
# Checking the total null values in each column
student.isnull().sum()
Step 3: Build model - Here the dataset is split into two sets: one to train the model and the other to test it. After this, we build the model using the training dataset. The models are machine learning algorithms such as Linear Regression, Support Vector Machines, Decision Trees, etc.
# Creating linear regression object
model_LR = linear_model.LinearRegression()
Step 4: Train model - The machine learning model is trained on the training dataset. A large portion of the dataset is used for training so that the model can learn to map the input to the output from a varied set of values.
# Creating X and y
X = student.drop('math score', axis='columns')
y = student['math score']

# Train the LinearRegression model using the dataset
model_LR.fit(X, y)
Step 5: Test model - After the model is trained, it is evaluated using the testing dataset. The model is fed new data points and must predict the outcome for each of them.
# Predicting math_score (y_pred) with values [0,1,50,40]
y_pred = model_LR.predict([[0,1,50,40]])
print("The predicted math score is: ", y_pred)
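The split into training and testing sets described in Step 3 could be sketched with scikit-learn's train_test_split, as below. The synthetic data, the two feature columns, the 80/20 ratio and the random seeds are all assumptions for illustration; this is not the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the article's dataset)
rng = np.random.default_rng(0)
reading = rng.integers(40, 100, size=200)
writing = rng.integers(40, 100, size=200)
math = 0.5 * reading + 0.4 * writing + rng.normal(0, 3, size=200)
student = pd.DataFrame({
    "reading score": reading,
    "writing score": writing,
    "math score": math,
})

X = student.drop("math score", axis="columns")
y = student["math score"]

# Hold out 20% of the rows for testing; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_LR = linear_model.LinearRegression()
model_LR.fit(X_train, y_train)

# Predict on the held-out rows only
y_pred = model_LR.predict(X_test)
print("Rows used for training:", len(X_train))
print("Rows used for testing:", len(X_test))
```

Keeping the test rows out of training is what makes the evaluation a fair estimate of how the model behaves on new data.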
Step 6: Improve efficiency - After creating the model and evaluating it with the testing data, its accuracy is calculated. Different methods can then be used to improve the efficiency and accuracy of the model.
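For a regression model such as linear regression, "accuracy" is usually expressed through measures like R² and mean absolute error rather than a percentage of correct answers. The sketch below computes both on synthetic data and adds 5-fold cross-validation as one possible way to get a more stable estimate. The data and parameter choices are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: the target depends linearly on two features plus noise
rng = np.random.default_rng(1)
X = rng.uniform(40, 100, size=(200, 2))
y = 0.5 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 near 1 means most of the variation is explained; MAE is in the target's units
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2: {r2:.3f}, MAE: {mae:.2f}")

# 5-fold cross-validation; the default score for regressors is R^2
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Mean cross-validated R^2:", cv_scores.mean())
```

If the scores are poor, typical next steps include adding more informative features, cleaning the data further, or trying a different algorithm.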
Stage 6: Data Validation phase
At this stage, we check whether any new data behaves abnormally or whether any false predictions are being made. If such abnormalities are detected, a notification is sent to the data scientist to fix the problem.
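One simple way to sketch such a check is to flag incoming rows that have missing values or fall outside the range seen in the training data. The function name find_abnormal_rows, the columns and the values below are hypothetical, and real systems often use more elaborate checks.

```python
import pandas as pd

# Hypothetical training data and the value ranges observed in it
training = pd.DataFrame({
    "reading_score": [55, 72, 90, 64, 81],
    "writing_score": [50, 74, 88, 60, 79],
})
lower, upper = training.min(), training.max()


def find_abnormal_rows(new_data: pd.DataFrame) -> pd.DataFrame:
    """Return rows with a missing value or a value outside the training range."""
    below = new_data.lt(lower, axis="columns")
    above = new_data.gt(upper, axis="columns")
    abnormal = (below | above).any(axis=1) | new_data.isnull().any(axis=1)
    return new_data[abnormal]


# Hypothetical new data: row 1 is out of range, row 2 has a missing value
incoming = pd.DataFrame({
    "reading_score": [70, 140, 66],
    "writing_score": [68, 75, None],
})

flagged = find_abnormal_rows(incoming)
if not flagged.empty:
    # A real system might send an alert here; this sketch only prints
    print(f"{len(flagged)} abnormal row(s) need review:")
    print(flagged)
```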
Stage 7: Deployment and Optimization phases
The final stage of data science is deployment and optimization. After testing the model and improving its efficiency, it is deployed to the production environment for all users. After deployment, the performance of the model must be monitored. If users face any issues, they are fixed at this stage.
The need for Data Science keeps growing: for machines that make better decisions, for solutions to complex problems, for better healthcare through early prediction of diseases, and for applications such as sentiment analysis and emotion recognition. Tedious work can be handed to machines, which can be trained to do a job efficiently with fewer human errors. Accurate self-driving cars could reduce car accidents to a great extent, and many more accomplishments can be achieved in this field.
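A common first step towards deployment is saving the trained model so that a separate service can load it and make predictions. The sketch below uses Python's built-in pickle module on a tiny stand-in model. The data is made up, and a production setup would usually write the bytes to a file or model store.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small stand-in model on made-up data
X = np.array([[50, 40], [60, 55], [70, 65], [80, 85], [90, 95]])
y = np.array([45, 58, 66, 84, 91])
model = LinearRegression().fit(X, y)

# Serialize the trained model to bytes
model_bytes = pickle.dumps(model)

# Later, for example inside the serving application, restore the model
served_model = pickle.loads(model_bytes)

new_request = np.array([[75, 70]])
print("Prediction:", served_model.predict(new_request))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.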
There is a lot more to Data science, AI, machine learning and deep learning. So let’s keep learning!
Thanks for reading!