I have been in IT for more than 15 years, primarily on the application development side, and I never got an opportunity to work in ML/AI/Data Science. I have had a strong interest in the field for many years, though. I tried to learn it on my own a few years back, but it didn't work out well. Recently my ten-year-old son Sagnik developed an interest in Java/Golang/AWS by following me and, most importantly, he started coding with very little help from me. I am surprised by his progress and keep encouraging him to take his coding to the next level. About two to three weeks ago he expressed an interest in AI and kept pushing me for a few days to help. I started chasing my son's dream of learning AI in order to help him, and eventually joined him in learning ML/AI. It's funny when and how people get inspired and motivated and keep chasing their dreams. To summarize: inspire, motivate and appreciate the people around you. In turn, take inspiration, motivation and appreciation from others, and with those aim high, fail fast and learn from failure, but do not stop learning. Life will slowly change around you.
Anyway, going back to ML/AI: I was trying to figure out which topics matter most for a beginner, and it looks like data pre-processing is more important than anything else. Yes, we need to know models, and over time we will get there, but first we need to know how to prepare data so that it fits a model. That looks a little strange, doesn't it? In practice, though, it is very true: getting data into the shape its consumer needs is a big challenge. In this blog I explain some beginner techniques (listed below) for the data pre-processing that almost any model requires. Because these steps apply to most models, you can treat them as a framework.
Importing the libraries
Importing the dataset
Taking care of missing data
Encoding categorical data
Encoding the independent variable
Encoding the dependent variable
Splitting the dataset into the training set and test set (required for a model)
Feature scaling (optional for a model)
1. Importing the libraries
This step is required before any modelling activity. I imported the packages/libraries needed for pre-processing using the code below. The same imports are reusable in most pre-processing work, so keep them handy.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
2. Importing the datasets
This is the sample dataset I have used for my explanation. There are three feature variables (Country, Age and Salary) and one dependent variable (Purchased).
In this step I imported the data using the read_csv function of pandas. The iloc indexer selects rows and columns of the dataset by integer position, which I used to separate the feature variables (every column except the last) from the dependent variable (the last column).
dataset = pd.read_csv("/content/Data.csv")
# All rows, every column except the last: the features
x = dataset.iloc[:, :-1].values
# All rows, only the last column: the dependent variable
y = dataset.iloc[:, -1].values
print(" Matrix of Features ")
print(x)
print(" Dependent variable")
print(y)
The output below shows the feature and dependent variable values after they were separated. It also shows the missing values in the array, identified by nan.
Matrix of Features :
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Dependent variable:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
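To see what the two iloc slices select without needing the CSV file, here is a minimal sketch. The small in-memory DataFrame is my own stand-in for Data.csv (just the first four rows of the sample), so treat it as an illustration only.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: the first four rows of the sample data.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# iloc selects by integer position: [row selection, column selection].
x = dataset.iloc[:, :-1].values   # all rows, every column except the last
y = dataset.iloc[:, -1].values    # all rows, only the last column

print(x.shape)   # (4, 3): four rows, three feature columns
print(list(y))   # the Purchased values, one per row
```

The slice `:-1` means "everything up to, but not including, the last column", so the same two lines keep working if more feature columns are added in front of Purchased.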
3. Taking care of missing data
As shown above, the feature variables have some missing values, identified as nan. Models generally have issues with missing data. There are a few options for taking care of it:
1. Remove the rows with missing data from the dataset. This holds good for large datasets.
2. Replace the missing data with the mean (or another statistic, such as the median) of that specific feature.
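Option 1 is not used in this blog, but it can be sketched with pandas roughly like this. The three-row DataFrame is a made-up example, not the sample dataset.

```python
import numpy as np
import pandas as pd

# Made-up example: two of the three rows contain a missing value.
df = pd.DataFrame({
    "Age": [44.0, 40.0, np.nan],
    "Salary": [72000.0, np.nan, 52000.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

print(len(df), "rows before,", len(dropped), "rows after")
```

On a dataset as small as the one in this blog, dropping rows throws away a large share of the data, which is why replacing the missing values is the more practical choice here.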
I used option 2 with the SimpleImputer class. The fit method calculates the mean of Age and Salary from the input dataset. The transform method then replaces the missing values in the original dataset with those means.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 are Age and Salary
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
The output below shows the feature matrix with mean values in place of the missing values.
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
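To check where 38.78 and 63777.78 come from, the sketch below fits the imputer on just the Age and Salary columns of the sample and compares the stored means with a direct NumPy calculation. The variable names `numeric` and `filled` are mine.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from the sample data, including the two missing values.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)

# fit() stores one mean per column, computed over the non-missing values.
print(imputer.statistics_)
# The same means computed directly with NumPy, ignoring nan.
print(np.nanmean(numeric, axis=0))

# transform() returns a copy with each nan replaced by its column mean.
filled = imputer.transform(numeric)
print(np.isnan(filled).any())
```

Each mean is taken over the nine known values in its column: 349 / 9 for Age and 574000 / 9 for Salary.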
4. Encoding categorical data
If feature or dependent variables hold string values (categories), they need to be encoded as numbers.
i. Encoding the feature/Independent Variable
As the feature variables are represented as a 2D array, I applied OneHotEncoder through the ColumnTransformer class and made sure it was applied only to the string column (Country, at index 0). After the transformation I converted the result into a NumPy array.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# [0] limits the encoder to the Country column; the other columns pass through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)
The output below shows that the Country column was converted into encoded values (binary form).
As there were 3 countries in my dataset, each country got its own three-digit code: 1 0 0 for France, 0 1 0 for Germany and 0 0 1 for Spain.
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
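To find out which of the three new columns belongs to which country, the fitted encoder can be asked for its categories. This is a small sketch on three rows of the sample; the variable names `encoded` and `categories` are my own.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows of the sample data, one per country.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",                           # keep Age and Salary as they are
)
encoded = np.array(ct.fit_transform(x))

# The order of the categories is the order of the one-hot columns.
categories = ct.named_transformers_["encoder"].categories_[0]
print(list(categories))
print(encoded[:, :3])
```

The categories come back sorted, so the first one-hot column stands for France, the second for Germany and the third for Spain. The encoded columns are placed first and the passthrough columns after them.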
ii. Encoding the dependent Variable
As the dependent variable is a 1D array, I used LabelEncoder from the sklearn.preprocessing module as the encoding strategy.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
The output below shows that the dependent variable was converted into encoded values (binary form). As there were 2 distinct responses in my dataset, it assigned 0 to No and 1 to Yes.
[0 1 0 0 1 1 0 1 0 1]
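The mapping between labels and integers is kept on the fitted encoder, so it can be inspected and reversed later. A minimal sketch, using the Purchased values from the sample:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(y)

# The position of a label in classes_ is the integer it was assigned.
print(list(le.classes_))
print(encoded)

# inverse_transform maps the integers back to the original strings,
# which is handy for reading model predictions later.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```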
5. Splitting the datasets into the Training set and Test set
In this step I split the input dataset into training and test datasets using train_test_split.
The train_test_split function gave me four datasets: feature training, feature test, dependent training and dependent test. With test_size = 0.2, 2 of the 10 rows go to the test set, and random_state = 1 makes the split reproducible.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
The output below shows the feature training set after the split.
[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
The output below shows the feature test set after the split.
[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
The output below shows the dependent training set after the split.
[0 1 0 0 1 1 0 1]
The output below shows the dependent test set after the split.
[0 1]
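As a quick check of how the split sizes and random_state behave, here is a small sketch. The placeholder feature matrix is my own (10 rows and 5 columns, the same shape as the encoded sample); only the shapes and the repeatability matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the encoded sample: 10 rows, 5 columns.
x = np.arange(50).reshape(10, 5)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)   # (8, 5) (2, 5)
print(y_train.shape, y_test.shape)   # (8,) (2,)

# Running the split again with the same random_state picks the same rows.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.2, random_state=1
)
print(np.array_equal(X_test, X_test2))   # True
```

Without a fixed random_state the rows are shuffled differently on every run, which makes results hard to compare from one run to the next.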
6. Feature Scaling
Feature scaling is a method used to normalize the range of the independent variables, or features, of the data. To me, modelling is like a mixed fruit juice. To get the best juice, we need to mix the fruits not by their size but in the right proportion, and the mixer takes longer if the fruits are not the same size. In the same way, it is important to bring all features onto the same scale to get a model that performs well. I used the StandardScaler class to transform Age and Salary onto a comparable scale. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Columns 3 onwards are Age and Salary; the one-hot columns are left as they are
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
The output below shows the training set after scaling.
[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
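StandardScaler replaces each value with (value - mean) / standard deviation, where the mean and standard deviation are taken per column from the data it was fitted on. The sketch below checks this by hand on a few Age and Salary rows borrowed from the sample. The choice of rows is mine, so the numbers will not match the output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few Age/Salary rows, used only for illustration.
X_train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0], [50.0, 83000.0]])
X_test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)   # learn mean/std from training data, then scale it
X_test_scaled = sc.transform(X_test)         # reuse the training mean/std on the test data

# The same calculation by hand.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population standard deviation, as StandardScaler uses
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std

print(np.allclose(X_train_scaled, manual_train))   # True
print(np.allclose(X_test_scaled, manual_test))     # True
```

After scaling, each training column has mean 0 and standard deviation 1, so Age (tens) and Salary (tens of thousands) end up in the same range.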
The output below shows the test set after scaling.
[ [0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727] [1.0 0.0 0.0 -