Detection of heart disease using Decision Tree Classifier



The decision tree is one of the most popular and powerful classification algorithms in machine learning and is mostly used for predicting categorical outcomes. Entropy/Information Gain and Gini Impurity are two key metrics used to choose the best split at each node when constructing a decision tree model.

To know more about these you may want to review my other blogs on Decision Trees, Entropy/Information Gain and Gini Impurity.
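As a quick refresher before the walkthrough, here is a minimal sketch of how the two metrics can be computed for a single node from its class counts. The helper names `gini_impurity` and `entropy` and the example counts are my own illustration, not part of scikit-learn or the dataset.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(class_counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

# Hypothetical node with 5 samples of each class (maximally mixed)
print(gini_impurity([5, 5]))  # 0.5
print(entropy([5, 5]))        # 1.0

# Hypothetical node with a 9:1 class mix (purer, so both values drop)
print(gini_impurity([9, 1]))
print(entropy([9, 1]))
```

Both values are highest for an evenly mixed node and fall to zero for a pure node, which is why a tree prefers splits that reduce them the most.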

In this blog, let's build decision tree classifier models using both Gini and Entropy to detect heart disease. I used a Kaggle notebook for the code and the UC Irvine (UCI) Heart Disease dataset from Kaggle to find out which factor has the most impact on heart disease in a patient.

Let's do it step by step as shown below:

Step 1:

To begin with, let's import all the required modules as shown below:

# Importing the required packages 
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
import graphviz

Step 2:

To access the UCI heart dataset from Kaggle, let's read the CSV file into a pandas DataFrame.

# import Dataset 
balance_data = pd.read_csv('../input/heart-disease-uci/heart.csv')
balance_data

Output:

UCI Heart Disease Dataset

Step 3:

Out of the 14 columns in the dataset, let's take just 3 features that I thought would be the main reasons for heart disease in a person as "X", and the target column, which indicates whether heart disease is present or not, as "Y". The features for "X" are:

cp - Chest Pain

trestbps - resting blood pressure (Good Blood Circulation)

ca - number of major vessels (0-3) colored by fluoroscopy (Blocked Arteries)

# Separating the target variable 
X = balance_data[['cp','trestbps','ca']]
Y = balance_data.target 

Step 4:

The next step is to split the dataset into train and test sets.

# Splitting the dataset into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100) 
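The split above is purely random. One possible variation, shown here only as a sketch, is to pass `stratify` so that the train and test sets keep roughly the same share of each class. The DataFrame `demo` below is synthetic stand-in data with made-up values, not the heart dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cp": rng.integers(0, 4, size=200),
    "trestbps": rng.integers(90, 200, size=200),
    "ca": rng.integers(0, 4, size=200),
    "target": np.repeat([0, 1], [80, 120]),
})
X_demo = demo[["cp", "trestbps", "ca"]]
y_demo = demo["target"]

# stratify=y_demo keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100, stratify=y_demo
)

print("Train/test sizes:", X_tr.shape, X_te.shape)
print("Share of class 1 in train/test:", y_tr.mean(), y_te.mean())
```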

Step 5:

Let's create a decision tree classifier model and train using Gini as shown below:

# perform training with giniIndex
# Creating the classifier object 
clf_gini = DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=3, min_samples_leaf=5)
# Fit the model 
clf_gini.fit(X_train, y_train) 

Step 6:

Let's create a decision tree classifier model and train using Entropy as shown below:

# perform training with entropy
# Decision tree with entropy 
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100,max_depth = 3, min_samples_leaf = 5)
# Fit the model  
clf_entropy.fit(X_train, y_train) 

Step 7:

Since we are building decision trees using 2 different criteria (Gini and Entropy), let's write a function that takes a model and X_test as input and returns the predicted values. The function code is shown below:

# Function to make predictions 
def prediction(X_test, clf_object):
    # Prediction on the test set with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

Step 8:

Similarly, let's write another function to calculate the accuracy of both the models.

# Function to calculate accuracy 
def cal_accuracy(y_test, y_pred):
    # Accuracy as a percentage of correctly classified test samples
    print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)

Step 9:

Now, let's execute the above 2 functions to get the predicted values and accuracy of each model as shown below:

# Operational Phase 
print("Results Using Gini Index:")
# Prediction using gini 
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)

print("Results Using Entropy:")
# Prediction using entropy 
y_pred_entropy = prediction(X_test, clf_entropy)
cal_accuracy(y_test, y_pred_entropy) 

Output:

We can see that both models have similar accuracy here. The two criteria do not always choose the same splits, so on other data or other splits their accuracies may differ.
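Accuracy alone does not show which kind of mistake a model makes. As a hedged sketch, a confusion matrix can be printed next to the accuracy for each criterion. The data here is synthetic (generated with `make_classification`), so the numbers say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 3 features (a stand-in, NOT the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=100,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

results = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(
        criterion=criterion, random_state=100, max_depth=3, min_samples_leaf=5
    )
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    # Rows are the true classes, columns are the predicted classes
    cm = confusion_matrix(y_te, y_pred)
    results[criterion] = (accuracy_score(y_te, y_pred), cm)
    print(criterion, "accuracy:", results[criterion][0] * 100)
    print(cm)
```

The diagonal of each matrix counts the correct predictions, so the accuracy is the diagonal sum divided by the total.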

Step 10:

Now let's print the trees created by both models and see which feature is at the root node, that is, the feature the model chose for its first split and so the one with the most impact on heart disease. Again, we will write a function that takes the model as input, as shown below:

#Print tree
def printTree(classifier):
    feature_names = ['Chest Pain', 'Blood Circulation', 
                         'Blocked Arteries']
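    # class_names are matched to the class labels in ascending order,
    # so the first name labels target == 0; check this against the data
    target_names = ['HD-Yes', 'HD-No']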
    
    #Build the tree
    dot_data = tree.export_graphviz(classifier,                                      
                         out_file=None,feature_names=feature_names,
                         class_names=target_names, filled = True)
    
    #Draw tree
    tr = graphviz.Source(dot_data, format ="png")
    return tr

Step 11:

Let's print the tree generated by the Gini model using the printTree() function.

#Print Gini tree
tr_gini = printTree(clf_gini)
tr_gini

Output:

Gini Tree : Image by Author

Step 12:

Let's print the tree generated by the Entropy model using the printTree() function.

#Print entropy tree
tr_entropy = printTree(clf_entropy)
tr_entropy

Output:

Entropy Tree : Image by Author

In the trees drawn for both decision tree classifier models, "Blocked Arteries" is the root node, which suggests that, among the three features considered, "Blocked Arteries" is the main factor that impacts heart disease in a patient.
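The root feature can also be read from a fitted model without drawing the tree. The sketch below uses synthetic data in which the label is constructed to depend only on the third column, so the outcome is known in advance; it is an illustration of the lookup, not a result from the heart dataset.

```python
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

feature_names = ["Chest Pain", "Blood Circulation", "Blocked Arteries"]

# Synthetic data: the label is built to depend only on the third column
rng = np.random.default_rng(100)
X_demo = np.column_stack([
    rng.integers(0, 4, size=200),     # stand-in for cp
    rng.integers(90, 200, size=200),  # stand-in for trestbps
    rng.integers(0, 4, size=200),     # stand-in for ca
])
y_demo = (X_demo[:, 2] > 0).astype(int)

clf = DecisionTreeClassifier(
    criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5
)
clf.fit(X_demo, y_demo)

# tree_.feature[0] holds the column index used by the root split
root_index = clf.tree_.feature[0]
root_feature = feature_names[root_index]
print("Root feature:", root_feature)

# Plain-text rendering of the same tree
print(tree.export_text(clf, feature_names=feature_names))
```

`export_text` is a handy fallback when graphviz is not installed.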

The predictions depend on the data and on the train_test_split percentage, and the bias and variance of the model also have an impact.
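One way to see how much the accuracy moves with the split is k-fold cross-validation. This is a minimal sketch on synthetic data (a stand-in for the heart dataset), with 5 folds chosen arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 3 features (a stand-in, NOT the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=100,
)

clf = DecisionTreeClassifier(
    criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5
)

# Each fold is held out once as the test set; the model is refit every time
scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print("Accuracy per fold:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

A large spread between folds is a hint that a single train/test split may give an overly optimistic or pessimistic accuracy.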

To try to increase the accuracy, we can split the dataset into many small datasets, create a model on each, and consolidate the results of all the models.
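This idea is close to what is usually called bagging. The following is only a sketch under that assumption, using scikit-learn's `BaggingClassifier` on synthetic data; the number of trees (25) and the sample fraction (0.5) are arbitrary choices, and an accuracy gain is not guaranteed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 3 features (a stand-in, NOT the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    random_state=100,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# Each of the 25 trees is trained on a random half of the training rows
base_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=100
)
bag = BaggingClassifier(
    base_tree, n_estimators=25, max_samples=0.5, random_state=100
)
bag.fit(X_tr, y_tr)

# The ensemble combines the predictions of the individual trees
y_pred_bag = bag.predict(X_te)
print("Bagged accuracy:", accuracy_score(y_te, y_pred_bag) * 100)
```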

I hope this blog gives a basic idea of how to use a dataset to create decision tree classifier models using Gini and Entropy.

For further reference on Decision Trees, Gini Impurity and Entropy please review these links:

https://www.numpyninja.com/post/is-decision-tree-a-classification-or-regression-model

https://www.numpyninja.com/post/what-is-entropy-and-information-gain-how-are-they-used-to-construct-decision-trees

https://www.numpyninja.com/post/what-is-gini-impurity-how-is-it-used-to-construct-decision-trees


Happy Modeling!






© Numpy Ninja.