Introduction to Endometriosis:
Endometriosis is a challenging disease for females of reproductive age. Its stage differs from person to person, depending on the location and intensity of the recurrence. The endometrial layer within the uterus sheds during each menstruation, and endometriosis develops when tissue resembling this layer spreads to other sites. There are four types, distinguished by location: a) endometriosis inside the uterus, b) ovarian endometriosis, c) peritoneal endometriosis, and d) deep infiltrating endometriosis. Endometriosis is detected using scanning methods, and the conventional laparoscopic surgical approach is used to determine its exact location.
The severity of endometriosis affects women's physical and emotional health. External variables associated with endometriosis include severe abdominal pain, dysmenorrhea, dyspareunia, abnormal uterine haemorrhage, and breast tenderness, among others. These external variables play an important role in guiding the scanning and laparoscopic procedures. The internal factors found during the laparoscopic operation indicate the severity of endometriosis. Internal factors include adnexal mass, tissue-like structures, and changes in tissue colour, among others.
Why Decision Tree
The decision tree is a learning method that constructs a tree to analyze multiple aspects of a situation. The tree is made up of a root node at the top, then intermediate decision nodes, and finally leaf nodes. The decision tree determines which features are most appropriate for analysis. It splits nodes using a concept known as information gain, the reduction in impurity produced by a split. Impurity is commonly measured in two ways: a) entropy and b) the Gini index. Information gain determines the most suitable features for classification.
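To make the two impurity measures concrete, the following minimal sketch computes entropy, the Gini index, and the information gain of a single split in plain Python. The label lists are made-up toy values, and the function names (`entropy`, `gini`, `information_gain`) are illustrative rather than part of any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy example (hypothetical labels): 1 = endometriosis, 0 = no endometriosis.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # records where some binary feature equals 1
right = [1, 0, 0, 0]   # records where the same feature equals 0

gain_entropy = information_gain(parent, [left, right], entropy)
gain_gini = information_gain(parent, [left, right], gini)
print(f"entropy gain: {gain_entropy:.3f}, gini gain: {gain_gini:.3f}")
```

Both criteria reward splits whose child nodes are purer than the parent; they differ only in how impurity is scored.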
Dataset Description
Adnexal mass: Presence of an adnexal mass, represented in binary format (0 or 1).
Tube blockage: Presence of tube blockage, represented in binary format (0 or 1).
Lesion Color: Color of the lesion (black, brown, dark brown, or red).
Lesion Size: Size of the lesion.
Target: Presence of endometriosis, represented in binary format (0 or 1).
Dataset:
Python Execution steps for Endometrial analysis using Decision tree algorithm
The obtained data holds 600 records of external factors influencing endometriosis. It was preprocessed and split into training and test data.
from sklearn.model_selection import train_test_split
# hold out 33% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train.shape, X_test.shape
The training set and test set contain 402 and 198 records, respectively.
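The two set sizes follow from the `test_size` fraction. As a rough check, the sketch below reproduces the arithmetic in plain Python. It assumes that the test count is rounded up and the remainder goes to training, which is how scikit-learn's `train_test_split` is understood to handle a fractional `test_size`; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest is used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(600, 0.33)
print(n_train, n_test)
```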
import category_encoders as ce
# encode the categorical columns as integers
encoder = ce.OrdinalEncoder(cols=['Adnexal mass', 'Tube blockage', 'Lesion Colour', 'Le'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
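Conceptually, ordinal encoding replaces each category with an integer code learned from the training data. The sketch below illustrates the idea in plain Python on hypothetical lesion colours. It is not the `category_encoders` implementation; the helper names and the choice of -1 for unseen categories are assumptions made for illustration.

```python
def fit_ordinal(values):
    """Learn a category -> integer mapping in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # codes start at 1
    return mapping


def transform_ordinal(values, mapping, unknown=-1):
    """Apply a learned mapping; unseen categories get the `unknown` code."""
    return [mapping.get(value, unknown) for value in values]


# Hypothetical values for a 'Lesion Colour' column.
train_colours = ["Black", "Brown", "Dark brown", "Brown", "Red"]
test_colours = ["Red", "Black", "Purple"]  # 'Purple' never appears in training

mapping = fit_ordinal(train_colours)
encoded_train = transform_ordinal(train_colours, mapping)
encoded_test = transform_ordinal(test_colours, mapping)
print(mapping)
print(encoded_train, encoded_test)
```

Learning the mapping on the training data only and reusing it on the test data mirrors the `fit_transform` / `transform` pair used above, so the test set never influences the encoding.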
The encoder is fitted on the training data and then applied to both the training and test sets. Next, the decision tree is constructed, and its maximum depth is set at this step.
from sklearn.tree import DecisionTreeClassifier
# Gini-based tree limited to a depth of 3
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
# fit the model
clf_gini.fit(X_train, y_train)
# predict the test-set labels
y_pred_gini = clf_gini.predict(X_test)
The training and test scores of the model were evaluated, and a classification report was generated.
print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_gini))
Figure 1: Classification report for Decision tree analysis (using Gini) for Endometriosis dataset.
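The classification report summarizes precision, recall, and F1-score per class. As a reminder of what those columns mean, here is a small plain-Python sketch that derives them for the positive class from true and predicted labels. The label lists are invented for illustration and are not results from the endometriosis dataset; `binary_report` is a hypothetical helper.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

report = binary_report(y_true, y_pred)
print({name: round(value, 2) for name, value in report.items()})
```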
The decision tree can be constructed with either of two impurity criteria for computing information gain: entropy or the Gini index. Initially, the decision tree was constructed with the Gini index. The tree shows the most influential symptoms across its parent and leaf nodes, with the most influential symptom at the root node.
import matplotlib.pyplot as plt
from sklearn import tree
# draw the fitted Gini tree
plt.figure(figsize=(12, 8))
tree.plot_tree(clf_gini.fit(X_train, y_train))
Figure 2: Decision tree for endometriosis analysis using GINI.
Similarly, the classification report and decision tree were constructed for entropy as follows.
clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
# fit the model
clf_en.fit(X_train, y_train)
y_pred_en = clf_en.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_en))
Figure 3: Classification report for Decision tree analysis (using Entropy) for Endometriosis dataset.
from sklearn import tree
# draw the fitted entropy tree
plt.figure(figsize=(12, 8))
tree.plot_tree(clf_en.fit(X_train, y_train))
Figure 4: Decision tree for endometriosis analysis using Entropy.
In the decision trees constructed with both the Gini index and entropy, Adnexal mass is the root node, which identifies it as the most influential factor for predicting endometriosis.
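The root node is the split with the largest impurity decrease over the whole training set, which is why the root feature is read as the most influential. The sketch below illustrates that selection rule with the Gini index on a tiny invented table of binary features. The rows are hypothetical (chosen so that one feature separates the classes cleanly), and `best_root_feature` is an illustrative helper, not scikit-learn's internal routine.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(rows, labels, feature):
    """Impurity decrease obtained by splitting on a binary (0/1) feature."""
    left = [y for row, y in zip(rows, labels) if row[feature] == 0]
    right = [y for row, y in zip(rows, labels) if row[feature] == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_root_feature(rows, labels):
    """Return the feature with the largest Gini decrease, plus all scores."""
    scores = {f: gini_decrease(rows, labels, f) for f in rows[0]}
    return max(scores, key=scores.get), scores


# Invented records, for illustration only (not the 600-record dataset).
rows = [
    {"Adnexal mass": 1, "Tube blockage": 1},
    {"Adnexal mass": 1, "Tube blockage": 0},
    {"Adnexal mass": 1, "Tube blockage": 1},
    {"Adnexal mass": 0, "Tube blockage": 1},
    {"Adnexal mass": 0, "Tube blockage": 0},
    {"Adnexal mass": 0, "Tube blockage": 0},
]
labels = [1, 1, 1, 0, 0, 0]

root, scores = best_root_feature(rows, labels)
print(root, {f: round(s, 3) for f, s in scores.items()})
```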
Performance Evaluation using AUC-ROC Curve
An ROC (receiver operating characteristic) curve evaluates the performance of the model by plotting two parameters against each other across classification thresholds:
True Positive rate
False Positive rate
AUC (Area Under the Curve) measures the two-dimensional area under the entire ROC curve. AUC is an aggregate measure of performance across all classification thresholds. It can be interpreted as the probability that the model ranks a random positive example higher than a random negative example.
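The ranking interpretation of AUC can be checked directly on a handful of scores. The following plain-Python sketch computes the true and false positive rates at one threshold and estimates AUC as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are invented, both classes are assumed to be present, and the function names are illustrative.

```python
def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates when predicting 1 for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

tpr, fpr = tpr_fpr(y_true, scores, threshold=0.5)
auc = pairwise_auc(y_true, scores)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, AUC={auc:.2f}")
```

Sweeping the threshold from high to low and plotting each (FPR, TPR) pair traces the ROC curve; the pairwise estimate equals the area under that curve.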
from sklearn.metrics import roc_auc_score, roc_curve
# probability of the positive class (column 1) for each test record
y_probabilities = clf_en.predict_proba(X_test)[:, 1]
y_probabilities1 = clf_gini.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_probabilities)
false_positive_rate1, true_positive_rate1, threshold1 = roc_curve(y_test, y_probabilities1)
plt.figure(figsize=(10, 6))
plt.title('ROC for decision tree')
plt.plot(false_positive_rate, true_positive_rate, linewidth=5, linestyle='dotted', color='red')
plt.plot(false_positive_rate1, true_positive_rate1, linewidth=5, linestyle='dashdot', color='green')
plt.plot([0, 1], ls='--', linewidth=6)  # chance diagonal
plt.plot([0, 0], [1, 0], c='.5')  # left edge of the plot
plt.plot([1, 1], c='.5')  # top edge of the plot
plt.text(0.0, 0.8, 'AUC: {:.2f}'.format(roc_auc_score(y_test, y_probabilities)), size=16)
plt.text(0.2, 0.6, 'AUC: {:.2f}'.format(roc_auc_score(y_test, y_probabilities1)), size=16)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
location = 0  # 0 lets matplotlib choose the best location
legend_drawn_flag = True
plt.legend(["AUC for Entropy", "AUC for Gini"], loc=location, frameon=legend_drawn_flag)
plt.show()
Figure 5: AUC-ROC curve for entropy and Gini index for analyzing the endometriosis dataset.
The AUC-ROC curve shows the performance of the decision tree algorithm on the endometriosis dataset. The AUC obtained for entropy was 0.89, and the AUC for Gini was 0.87.