The Five-Step Process of Data Science
Data science affects modern life in more ways than we may realize. When we use Google Search or Bing, we are using a sophisticated application of data science; the search suggestions that appear as we type are one product of it. Data has become an integral part of every business and is unavoidable in everyday life. Even doctors increasingly rely on interpretations produced by data science.
Big data is the term for datasets so large and complex that traditional data processing software, such as SPSS or spreadsheets, cannot handle them. Big data is commonly described by three concepts known as the "Three V's": Volume, Variety, and Velocity.
Volume refers to the size of the dataset we are working with, which can be enormous. For example, there are over 250 million images and 2.5 trillion posts on Facebook.
Variety refers to how diverse the data in a dataset is, so that it is not biased towards a single kind of data. For example, Alexa is used by a great many people, who ask about different things in different ways. If Amazon keeps track of all the queries and comments directed at Alexa, it ends up with a large and varied dataset from which a lot of information can be derived.
Velocity refers to how fast the data in a dataset changes and how fast new data is added to it. For example, Instagram is a high-velocity dataset, with roughly 1 billion pictures and videos uploaded per day. At that rate, it would accumulate 1 trillion items in about 1,000 days, or a little under three years.
Together, the three V's characterize a dataset and give us a sense of its scale and nature, which helps us decide how to gain insights from the data.
The process of doing science on data can be broken down into five steps:
Capture the data
Process the data
Analyze the data
Communicate the results
Maintain the data
Capturing the data
First, we need data to analyze, so we have to capture or collect it. In a real-world situation, there may be many potential sources of data. We need to collect data from these sources, inventory it, and decide what to include. Deciding what to include requires knowing the business and the goals of the analysis. Where possible, we should integrate the data sources so that the information needed for the reports management requires is easy to reach.
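As a rough illustration of integrating sources, the sketch below joins two small in-memory sources on a shared key. The names `crm_rows`, `billing_rows`, `customer_id`, and the `integrate` helper are hypothetical and chosen only for this example; real sources would more likely be files, databases, or APIs.

```python
def integrate(primary, secondary, key):
    """Left-join two lists of dicts on `key` (hypothetical helper)."""
    # Index the secondary source by key for fast lookups.
    # If a key repeats in `secondary`, the last row wins.
    lookup = {row[key]: row for row in secondary}
    merged = []
    for row in primary:
        combined = dict(row)
        # Add fields from the secondary source when a match exists.
        combined.update(lookup.get(row[key], {}))
        merged.append(combined)
    return merged


# Two made-up sources describing the same customers.
crm_rows = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
billing_rows = [
    {"customer_id": 1, "total_spend": 120.0},
]

inventory = integrate(crm_rows, billing_rows, key="customer_id")
```

Rows without a match in the second source are kept as they are, so the missing fields can be handled explicitly during processing.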
Processing the data
This is one of the most important steps in data science. The collected data has to be processed into a dataset that is fit to work on. It should be cleaned to remove duplicates, replace or remove missing and inconsistent values, drop unwanted data, and manipulate string data and convert it to numerical data where needed. Cleaning and processing take considerable time and should be done carefully to avoid introducing bias. If the data is biased, the predictions may not be accurate.
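The cleaning tasks described above might look something like the following minimal sketch, which uses only the Python standard library. The column names (`id`, `city`, `age`), the sample rows, and the choice of the median as a fill value are assumptions made for illustration, not a recommended recipe.

```python
import statistics


def clean(rows):
    """Deduplicate, normalize strings, and fill missing ages (illustrative)."""
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize string data: trim whitespace and lowercase the city.
        city = row["city"].strip().lower()
        # Convert numeric strings to numbers; treat empty strings as missing.
        age = int(row["age"]) if row["age"].strip() else None
        record = (row["id"], city, age)
        if record in seen:  # drop exact duplicates after normalization
            continue
        seen.add(record)
        cleaned.append({"id": row["id"], "city": city, "age": age})

    # Replace missing ages with the median of the known ages.
    # This assumes at least one age is known.
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = statistics.median(known)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned


raw_rows = [
    {"id": 1, "city": " Paris ", "age": "30"},
    {"id": 1, "city": "paris", "age": "30"},  # duplicate after normalization
    {"id": 2, "city": "Lyon", "age": ""},     # missing age
    {"id": 3, "city": "Nice", "age": "40"},
]
tidy = clean(raw_rows)
```

Filling gaps with a median is only one option; removing the affected rows is another, and either choice can itself introduce bias, so it is worth recording which one was used.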
Analyzing the data
Once the dataset has been processed and cleaned, we can analyze it and predict outcomes. Many algorithms are available, such as regression algorithms, classifiers, and bagging and boosting algorithms, to find relationships between the variables in the dataset and predict the required outcomes. If the dataset is varied and unbiased, the predictions will tend to be more accurate.
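To make the idea of a regression algorithm concrete, here is a small sketch of an ordinary least-squares line fit written in plain Python. The `hours` and `scores` values are made-up sample data and `fit_line` is a hypothetical helper; in practice a library such as scikit-learn would typically be used instead.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
# Predict the score for an unseen input of 6 hours.
predicted = slope * 6 + intercept
```

The fitted line summarizes the relationship in the sample data and can then be applied to inputs it has not seen, which is the sense in which the analysis step "predicts" outcomes.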
Communicating the results
After data wrangling and analysis, we have the predictions inferred from the data, which are the results. We have to present these results either to management or to the customer. A graphical representation is usually the best way to present them, because most people find information easier to take in visually. We can use the Matplotlib package for Python for data visualization.
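A minimal Matplotlib sketch of such a chart is shown below. It assumes Matplotlib is installed; the segment labels, the prediction values, and the `plot_results` helper are invented for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def plot_results(labels, values, title):
    """Draw a simple bar chart of results (hypothetical helper)."""
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_xlabel("Segment")
    ax.set_ylabel("Predicted value")
    return fig


# Made-up results to present.
segments = ["A", "B", "C"]
predictions = [12.5, 18.0, 9.75]

fig = plot_results(segments, predictions, "Predicted value by segment")
```

Calling `fig.savefig("predictions.png")` afterwards would write the chart to an image file that can be dropped into a report or slide.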
Maintaining the data
After we have the first round of predictions or answers, most of us will close the current project and move on to the next one. There is a good chance, however, that we will come back to the same project with more questions. So it is always useful to maintain all the artifacts and documentation of the project, so that it can be restarted quickly.
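One possible way to keep a project restartable is to store a small manifest next to its artifacts. The sketch below is only an assumption about what such a manifest could contain; the field names, the sample dataset, and the `build_manifest` helper are hypothetical.

```python
import hashlib
import json


def build_manifest(data_bytes, params, notes):
    """Summarize a finished analysis so it can be resumed later (illustrative)."""
    return {
        # A checksum lets us confirm later that the dataset has not changed.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
        "notes": notes,
    }


# Made-up dataset contents and settings.
dataset = b"id,age\n1,30\n2,35\n"
manifest = build_manifest(
    dataset,
    {"model": "linear", "features": ["age"]},
    "First round of predictions.",
)
# Serialize to JSON so the manifest can be saved alongside the project files.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Writing `manifest_json` to a file in the project folder gives whoever returns to the work a record of which data and settings produced the earlier answers.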
These are the five basic steps in a data science project. The outcome of any data science project depends mainly on the quality of the data used. Because data is the core requirement of any such project, data processing is a very important and unavoidable part of data science.