Decision Tree Classification

Let's say I was at a zoo with my dad and we went to the crocodile and alligator section. Looking at the two reptiles, my mind would insist they were both alligators, or both crocodiles. But even if they look almost exactly the same, is there a way to tell these reptiles apart? Well, Decision Tree Classification can help you out!

In this blog, I will go through creating and training a Decision Tree Classification model step-by-step, and if you want to follow along, you can use Google Colaboratory or Kaggle.


QUESTIONS:

Before we get cracking with the step-by-step for Decision Tree Classification, let me answer a few short questions that will help during this blog.

  1. What is Decision Tree CLASSIFICATION (Notice I put CLASSIFICATION in all caps? That's because there is also Decision Tree Regression)?

  2. What coding languages is Decision Tree Classification compatible with?

  3. When should you use Decision Tree Classification?

Answer to Q1: Decision Tree Classification helps you classify your data by repeatedly splitting it into smaller groups. The tree asks yes/no questions about the features at each decision node, and the final groups at the bottom, called leaf nodes, hold the predicted classes. For example, let's say I wanted to predict whether a person is fit or not fit. The tree might first ask about their age, then whether they exercise, and each leaf node would answer "fit" or "not fit."

Answer to Q2: Decision Tree Classification can be (and mostly is) used with Python (via scikit-learn and NumPy) and R.

Answer to Q3: Decision Tree Classification is a good choice when you want to classify data by following a chain of yes/no-style questions, ending in a clear category for each sample.
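To make that concrete, here is a tiny sketch of the fit/not-fit idea using scikit-learn. The data here is completely made up for illustration (ages and weekly exercise hours are my own invented numbers, not from any real dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: [age, weekly exercise hours]
X = [[25, 5], [40, 0], [30, 6], [55, 1], [22, 4], [60, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = fit, 0 = not fit (invented labels)

# The tree learns yes/no splits on the features automatically
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict for a new (made-up) person: 28 years old, 5 hours of exercise
print(clf.predict([[28, 5]]))
```

Behind the scenes, the tree finds the feature thresholds (the yes/no questions) that best separate the two classes.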

Now let's crack on with the coding part!


CODING:


1. Import the libraries (as always)

First, we will have to import all of the needed libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Import the dataset (as always)

After the first step, we will now import the dataset.

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
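If you don't have Social_Network_Ads.csv handy, here is a quick sketch of what those two iloc lines are doing, using a tiny made-up DataFrame with the same column layout (the rows below are invented, not real values from the file):

```python
import pandas as pd

# Tiny stand-in for Social_Network_Ads.csv (invented rows, same shape of columns)
dataset = pd.DataFrame({
    'Age': [19, 35, 26],
    'EstimatedSalary': [19000, 20000, 43000],
    'Purchased': [0, 0, 1],
})

X = dataset.iloc[:, :-1].values  # every column except the last -> the features
y = dataset.iloc[:, -1].values   # only the last column -> the labels

print(X.shape, y.shape)  # (3, 2) (3,)
```

So X ends up as a 2-D array of features and y as a 1-D array of labels.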

3. Splitting the dataset into the Training set and Test set

Now, let's split the dataset into two parts: the training set and the test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
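In case the test_size argument isn't clear: 0.25 means a quarter of the rows go to the test set and the rest to the training set. Here's a small sketch with invented data (20 dummy samples, not the real dataset) showing the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 invented samples with 2 features each, and alternating labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 -> 25% of rows held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))  # 15 5
```

The random_state just fixes the shuffle so you get the same split every run.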

Let's check what each of these variables actually holds, starting with X_train...

print (X_train)

If you ran it, you should have gotten an output that looks like this (in reality it looks much longer):

[[    44  39000]
 [    32 120000]
 [    38  50000]
 [    32 135000]
 [    52  21000]
 [    53 104000]
 [    39  42000]
 [    38  61000]
 [    36  50000]
 [    36  63000]
 ...]

Now let's print out the value of y_train...

print (y_train)

If you ran it, you should have gotten an output that looks like this:

[0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 1
 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1
 ...]
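With the data split, the natural next step is fitting the Decision Tree Classifier itself. Here's a hedged sketch of that step using synthetic stand-in data (the ages, salaries, and the purchase rule below are all invented so the snippet runs without the CSV, and criterion='entropy' is just one common choice; 'gini' is scikit-learn's default):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the age/salary data (invented, not the real CSV)
rng = np.random.RandomState(0)
ages = rng.randint(18, 60, size=200)
salaries = rng.randint(15000, 150000, size=200)
X = np.column_stack([ages, salaries])
y = (ages > 40).astype(int)  # invented rule: older customers "purchase"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# criterion='entropy' splits by information gain; 'gini' is the default
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

print(classifier.score(X_test, y_test))  # accuracy on the held-out 25%
```

Once the classifier is fitted, you can call classifier.predict on new samples the same way we printed X_train and y_train above.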