Decision Tree Classification

Let's say I was at the zoo with my dad and we stopped at the crocodile and alligator section. Looking at the two reptiles, my mind kept insisting they must both be alligators, or both crocodiles. But even when two animals look almost exactly the same, is there a way to tell them apart? That's where Decision Tree Classification can help you out!

In this blog, I will go through creating and training a Decision Tree Classification model step by step, and if you want to follow along, you can use Google Colaboratory or Kaggle.


QUESTIONS:

Before we get cracking with the step-by-step for the Decision Tree Classification, let me answer a few short questions that will help during this blog.

  1. What is Decision Tree CLASSIFICATION (notice I put classification in all caps? That's because there is also Decision Tree Regression)?

  2. What coding languages is Decision Tree Classification compatible with?

  3. How should you use Decision Tree Classification?

Answer to Q1: Decision Tree Classification helps you classify your data into classes using a tree of yes/no questions. For example, let's say that I wanted to see whether a person is fit or not fit. The Decision Tree Classification model starts from a root node and keeps splitting it into smaller leaf nodes until each leaf gives a single answer.

Answer to Q2: Decision Tree Classification can be, and mostly is, used with Python (with libraries such as NumPy and scikit-learn) and R.

Answer to Q3: Decision Tree Classification is used for classifying data by asking a series of yes/no-type questions at each node.
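To make Q1 and Q3 concrete before we start coding: a decision tree boils down to a chain of yes/no questions. The little function below mimics the fit/not-fit example; the questions and thresholds are invented purely for illustration, not learned from data.

```python
# A toy "decision tree" written by hand as nested yes/no questions.
# The thresholds here are made up for illustration only.
def is_fit(age, exercises_weekly, eats_fast_food_often):
    if age < 30:
        # First split: for younger people, ask about fast food.
        return not eats_fast_food_often
    else:
        # Otherwise, ask about exercise instead.
        return exercises_weekly

print(is_fit(25, exercises_weekly=False, eats_fast_food_often=False))  # True  -> "fit"
print(is_fit(45, exercises_weekly=False, eats_fast_food_often=True))   # False -> "not fit"
```

A real Decision Tree Classifier learns which questions to ask, and in which order, from the training data instead of having them hard-coded.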

Now let's crack on with the coding part!


CODING:


1. Import the libraries (as always)

First, we will have to import all of the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Import the dataset (as always)

After the first step, we will now import the dataset.

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

3. Splitting the dataset into the Training set and Test set

Now, let's split the dataset into two parts, the training set and the test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
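Before printing the variables, we can sanity-check what test_size = 0.25 does on a tiny made-up dataset: it holds out a quarter of the rows for testing and keeps the rest for training.

```python
# Quick check (on made-up data) that test_size=0.25 keeps 25% of the rows
# for testing: with 8 samples we get 6 for training and 2 for testing.
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(16).reshape(8, 2)          # 8 samples, 2 features
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 8 made-up labels

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
print(len(Xtr), len(Xte))  # 6 2
```

On our dataset of 400 rows, the same split gives 300 training samples and 100 test samples.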

Let's check that the variables actually hold data, starting with X_train...

print (X_train)

If you ran it, you should have gotten an output that looks like this (in reality it looks much longer):

[[    44  39000]
 [    32 120000]
 [    38  50000]
 [    32 135000]
 [    52  21000]
 ...
 [    35  53000]
 [    37 144000]]

Now let's print out the value of y_train...

print (y_train)

If you ran it, you should have gotten an output that looks like this:

[0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 ... 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0]

You might be wondering why the output contains only 0's and 1's. Those are the class labels: one per row of X_train, where 0 stands for "no" and 1 stands for "yes" (here, whether the user made a purchase). It's similar to how, in a computer's electrical system, the binary digit 0 represents "off" and 1 represents "on".

After that brief explanation, we shall now see what the output of X_test is...

print (X_test)

If you ran that, you should have gotten an output similar to this:

[[    30  87000]
 [    38  50000]
 [    35  75000]
 [    30  79000]
 [    35  50000]
 ...
 [    48  90000]
 [    42 104000]]

Now this output is definitely a lot shorter than X_train's, but let's also print the output of y_test...

print (y_test)

If you ran that, you should have gotten an output similar to this:

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 1 0 1 1 1]

After printing all of those outputs from the variables, we should now jump into the next step.

4. Feature Scaling

Feature Scaling is a method used to normalize the range of the independent variables (features) of the data. For example, let's say you wanted to make a fruit smoothie containing two types of fruit, such as watermelon and pineapple. If you plopped both of them into the blender whole, what would happen? It would definitely clog, wouldn't it?

But what if you cut them into tiny 1-inch cubes first? Now if you plop them in, the blender works smoothly instead of clogging. Scaling does the same thing to the features: it cuts them down to a comparable range (here, a mean of 0 and a standard deviation of 1) so that a large-valued feature like salary doesn't drown out a small-valued one like age. (Strictly speaking, a decision tree doesn't need feature scaling, since each split looks at one feature at a time, but it doesn't hurt here.) Enough about the example; here is the code.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
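If you're curious what StandardScaler actually computes, it's simply (x - mean) / standard deviation for each column. The quick check below, on a single made-up column, reproduces fit_transform by hand.

```python
# StandardScaler applies (x - mean) / std column by column.
# Reproduce it by hand on a tiny made-up feature column.
import numpy as np
from sklearn.preprocessing import StandardScaler

col = np.array([[20.0], [30.0], [40.0]])   # one made-up feature column
scaled = StandardScaler().fit_transform(col)

manual = (col - col.mean()) / col.std()    # np.std defaults to the population std, like sklearn
print(np.allclose(scaled, manual))         # True
```

Note that we call fit_transform on the training set but only transform on the test set: the test set must be scaled with the mean and standard deviation learned from the training data.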

I'm not going to show the scaled outputs because they're very long lists, but if you really want to, you can print the variables yourself. Let's crack on with the next step now!

5. Training the Decision Tree Classification model on the Training set

Now, let's train the Decision Tree Classification model on the training set so that we can predict something in the next step.

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

If you run this piece of code, it should print a summary of the classifier like this (newer versions of scikit-learn print a shorter summary):

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')
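A quick aside on criterion = 'entropy': it tells the tree to score candidate splits by how much they reduce the entropy (impurity) of the labels in each node. Here is a tiny sketch of the entropy formula on made-up labels.

```python
# Entropy of a set of class labels: -sum(p * log2(p)) over the classes.
# A pure node scores 0; a 50/50 node scores 1 (the maximum for two classes).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p)) + 0.0   # "+ 0.0" turns -0.0 into 0.0

print(entropy([0, 0, 0, 0]))  # 0.0 -> a pure node
print(entropy([0, 0, 1, 1]))  # 1.0 -> a maximally mixed node
```

The tree picks, at each node, the feature and threshold whose split lowers this number the most.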

Let's dive into the next step now!

6. Predicting a new result

Let's predict a new result with the Decision Tree Classification model: will a 30-year-old with an estimated salary of 87,000 make a purchase? Note that we pass the new data point through sc.transform first, because the model was trained on scaled features.

print(classifier.predict(sc.transform([[30,87000]])))

And the output you get should be a simple 0, meaning "no":

[0]

Let's rocket into the next step now!

7. Predicting the Test set results

Now let's predict the Test set results with the code down below:

y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

Let's see what the output would be if you ran it:

[[0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 ...
 [1 1]
 [1 1]]

Looking at this, each row pairs a prediction (left column) with the true answer (right column), where 0 is no and 1 is yes. Most rows match, which is a good sign.

And now, let's get on with the final step!

8. Making the Confusion Matrix

Before we start the last step, there is a question that needs answering: what is a Confusion Matrix? A Confusion Matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It counts the four possible outcomes of a binary prediction:

Legend:

  • TP: True Positive

  • FP: False Positive

  • FN: False Negative

  • TN: True Negative
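To make those four terms concrete, here is a confusion matrix computed on six made-up labels: rows are the true classes and columns are the predicted classes (scikit-learn's convention).

```python
# A confusion matrix on a tiny made-up example.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]   # made-up true labels
y_pred = [0, 0, 1, 1, 1, 0]   # made-up predictions
cm_demo = confusion_matrix(y_true, y_pred)
print(cm_demo)
# [[2 1]   <- 2 true negatives (TN), 1 false positive (FP)
#  [1 2]]  <- 1 false negative (FN), 2 true positives (TP)
```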

Now let's create the Confusion Matrix with the code down below:

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

You might have noticed something called accuracy_score in the code up above. It compares y_pred with y_test and returns the fraction of test-set predictions that were correct.

After that brief explanation of accuracy_score, let's run the code and look at the output:

[[62  6]
 [ 3 29]]
0.91

Did you notice the decimal 0.91? It means the model classified 91% of the test set correctly: 62 true negatives plus 29 true positives out of 100 test samples.
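You can recompute that accuracy by hand from the confusion matrix printed above: it's just the correct predictions (the diagonal) divided by all predictions.

```python
# Accuracy from the confusion matrix: (TN + TP) / total.
import numpy as np

cm = np.array([[62,  6],
               [ 3, 29]])                     # the matrix printed above
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()   # (62 + 29) / 100
print(accuracy)  # 0.91
```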

CONCLUSION:

And with that, this blog comes to an end. I hope you liked it, and without further ado, happy coding!
