The decision tree is one of the most popular and powerful classification algorithms in machine learning, and it is mostly used for predicting categorical outcomes. Entropy (information gain) and Gini impurity are the 2 key metrics used to decide which split is best at each node when constructing a decision tree model.
In this blog, let's build a decision tree classifier model using both Gini and Entropy to detect heart disease. I used a Kaggle notebook to write the code and the UC Irvine Heart Disease dataset from Kaggle, with the aim of finding the factor that has the most impact on heart disease in a patient.
Let's do it step by step as shown below:
To begin with, let's import all the required modules as shown below:
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
import graphviz
To access the UCI heart dataset from Kaggle, let's read the CSV file into a pandas DataFrame.
# Import dataset
balance_data = pd.read_csv('../input/heart-disease-uci/heart.csv')
balance_data
Out of the 14 columns in the dataset, let's take just 3 features that I expected to be related to heart disease as "X", and the target column as "Y", which indicates whether heart disease is present or not. The features for "X" are:
cp - chest pain type
trestbps - resting blood pressure (Good Blood Circulation)
ca - number of major vessels (0-3) colored by fluoroscopy (Blocked Arteries)
# Separating the target variable
X = balance_data[['cp', 'trestbps', 'ca']]
Y = balance_data.target
The next step is to split the dataset into train and test sets.
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
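To make the effect of test_size and random_state more concrete, here is a simplified sketch of a seeded 70/30 split in plain Python. It is not scikit-learn's implementation; the function name `simple_train_test_split` and the ten row ids are made up for illustration.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.3, seed=100):
    """Shuffle row indices with a fixed seed, then hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle on every run
    n_test = math.ceil(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # ten hypothetical row ids, not the heart dataset
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Because the seed is fixed, rerunning the split returns the same train and test rows, which is what keeps the accuracy figures reproducible between runs.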
Let's create a decision tree classifier model and train using Gini as shown below:
# Perform training with giniIndex
# Creating the classifier object
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)
# Fit the model
clf_gini.fit(X_train, y_train)
Let's create a decision tree classifier model and train using Entropy as shown below:
# Perform training with entropy
# Decision tree with entropy
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)
# Fit the model
clf_entropy.fit(X_train, y_train)
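The two classifiers differ only in the impurity measure used to score candidate splits. As a rough illustration (not scikit-learn's internal code), the sketch below computes both measures for the labels in a single node. The helper names `class_proportions`, `gini_impurity` and `entropy`, and the toy label lists, are made up for this example.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes in the node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the classes."""
    return sum(-p * log2(p) for p in class_proportions(labels))


# Toy nodes (hypothetical labels, not taken from the heart dataset)
pure_node = [1, 1, 1, 1]
mixed_node = [0, 0, 1, 1]

print(gini_impurity(pure_node), entropy(pure_node))
print(gini_impurity(mixed_node), entropy(mixed_node))
```

A pure node scores 0 under both measures, while an evenly mixed two-class node scores 0.5 for Gini and 1.0 for entropy. Both measures reward splits that make the child nodes purer, which is why the two models often end up with similar trees.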
Since we are going to build decision trees using 2 different approaches (Gini and Entropy), let's write a function that takes a trained model and X_test as input and returns the predicted values for that approach. The function code is shown below:
# Function to make predictions
def prediction(X_test, clf_object):
    # Predict on the test set with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred
Similarly, let's write another function to calculate the accuracy of both the models.
# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)
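For intuition, accuracy is simply the share of test rows where the predicted label matches the true label. The sketch below spells that out in plain Python; the function name `accuracy_percent` and the label lists are made up for illustration and stand in for what accuracy_score multiplied by 100 reports.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


y_true = [1, 0, 1, 1, 0]  # made-up true labels
y_pred = [1, 0, 0, 1, 0]  # made-up predictions with one mistake
print("Accuracy : ", accuracy_percent(y_true, y_pred))
```

Here 4 of the 5 made-up predictions are correct, so the sketch reports 80.0.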
Now, let's execute the above 2 functions to get the predicted values and accuracy of each model as shown below:
# Operational Phase
print("Results Using Gini Index:")
# Prediction using gini
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)

print("Results Using Entropy:")
# Prediction using entropy
y_pred_entropy = prediction(X_test, clf_entropy)
cal_accuracy(y_test, y_pred_entropy)
We can see that both models have similar accuracy. With other data or a different split, the two criteria can give different results.
Now let's print the trees created by both models and see which feature is the root node, that is, the feature that has the most impact on heart disease. Again, we will write a function that takes the model as input, as shown below:
# Print tree
def printTree(classifier):
    feature_names = ['Chest Pain', 'Blood Circulation', 'Blocked Arteries']
    # export_graphviz assigns class_names to the class labels in ascending
    # order (target 0 first, then 1); check this order against the dataset
    target_names = ['HD-Yes', 'HD-No']
    # Build the tree
    dot_data = tree.export_graphviz(classifier, out_file=None,
                                    feature_names=feature_names,
                                    class_names=target_names, filled=True)
    # Draw tree
    tr = graphviz.Source(dot_data, format="png")
    return tr
Let's print the tree generated by the Gini model using the printTree() function.
# Print Gini tree
tr_gini = printTree(clf_gini)
tr_gini
Let's print the tree generated by the Entropy model using the printTree() function.
# Print entropy tree
tr_entropy = printTree(clf_entropy)
tr_entropy
From the trees of both decision tree classifier models above, we can see that "Blocked Arteries" is the root node, indicating that, among the three selected features, "Blocked Arteries" is the main factor that impacts heart disease in a patient.
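To see why one feature ends up at the root, it helps to score each candidate split by the impurity left in the child nodes: the feature with the lowest weighted impurity is chosen first. The sketch below is a simplified illustration using Gini on made-up binary features; the names `feature_a`, `feature_b`, `gini` and `weighted_gini_after_split` are hypothetical, and the real classifier searches over numeric thresholds rather than exact values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of the labels in one node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted_gini_after_split(feature_values, labels):
    """Average Gini of the child nodes, weighted by child size."""
    total = len(labels)
    score = 0.0
    for value in set(feature_values):
        child = [y for x, y in zip(feature_values, labels) if x == value]
        score += len(child) / total * gini(child)
    return score


# Hypothetical toy data, not taken from the heart dataset
features = {
    "feature_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "feature_b": [0, 0, 0, 1, 1, 1, 0, 1],
}
target = [1, 1, 1, 0, 0, 0, 1, 0]

scores = {name: weighted_gini_after_split(vals, target)
          for name, vals in features.items()}
root = min(scores, key=scores.get)  # lowest remaining impurity wins
print(scores, root)
```

In this toy data, feature_b separates the two classes perfectly, so it leaves no impurity and would be picked as the root.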
The predictions depend on the data and on the train_test_split percentage, and the bias and variance in the data also have an impact.
To increase the accuracy, we can split the dataset into many small datasets, create a model on each, and consolidate their results to get an optimum accuracy. This is the idea behind ensemble methods such as bagging and random forests.
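One common way to realise this idea is bagging, where each small dataset is a bootstrap sample (rows drawn with replacement) and the models' predictions are combined by majority vote. That choice is an assumption on my part, since the text above does not name a specific method. The sketch below shows only the sampling and voting steps; the function names, row ids and predictions are all made up for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement to form one resampled dataset."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions_per_model):
    """Combine per-model predictions row by row, keeping the most common label."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


rng = random.Random(100)
rows = list(range(8))  # hypothetical row ids
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Made-up predictions from three models on the same four test rows
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(preds))
```

In a full version, a DecisionTreeClassifier would be fitted on each sample and its predictions on X_test would take the place of the made-up lists. Using an odd number of models avoids tied votes in a two-class problem.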
I hope this blog gives a basic idea of how to use a dataset to create decision tree classifier models using Gini and Entropy.
For further reference on Decision Trees, Gini Impurity and Entropy please review these links: