Decision Tree in Python

A decision tree is a supervised machine-learning technique used to solve classification and regression problems. It is a tree-like model in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label or a continuous value.

The fundamental idea of the decision tree algorithm is to partition the data recursively, at each step splitting on the attribute that provides the most information gain or the least impurity. Entropy is a measure of the impurity or randomness of the data, and information gain is the reduction in entropy that results from splitting the data on an attribute. The algorithm's goal is to reduce entropy as much as possible with each split.
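As a minimal sketch of these two quantities, the snippet below computes entropy and information gain for a tiny set of made-up labels. The helper names entropy and information_gain and the toy data are illustrative assumptions, not part of scikit-learn. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; passing criterion="entropy" selects the entropy-based measure described here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels (made up for illustration): 4 positives and 4 negatives.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]  # each child is pure
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]  # children are as mixed as the parent

print(entropy(parent))                          # 1.0
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves both children as mixed as the parent gains nothing, so the algorithm would prefer the first.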

In classification problems, each leaf node represents a class label, and the predicted label is the majority class among the training examples that reach that leaf. In regression problems, each leaf node represents a continuous value, and the predicted value is the average of the target values of the training examples that reach that leaf.
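The two leaf rules can be sketched in a few lines. The function names classification_leaf and regression_leaf and the sample values are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Predicted class for a leaf: the most common label among its examples."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Predicted value for a leaf: the mean target of its examples."""
    return mean(values)


# Made-up examples that are assumed to have reached the same leaf.
print(classification_leaf(["benign", "benign", "malignant"]))  # benign
print(regression_leaf([2.0, 3.0, 4.0]))                        # 3.0
```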

The decision tree algorithm has the benefit of being simple to grasp and interpret, since the tree structure is analogous to human decision-making. However, decision trees are susceptible to overfitting, which occurs when the tree grows so complex that it fits the training data too closely and performs poorly on unseen data.

Several methods can be used to avoid overfitting. One is pruning, which removes nodes that do not improve the tree's performance on unseen data. Another is requiring a minimum number of instances per leaf node, which stops the tree from creating leaves for only a handful of examples.
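As a rough sketch of both ideas in scikit-learn, the snippet below compares an unconstrained tree with one that sets min_samples_leaf and one that uses cost-complexity pruning through ccp_alpha. The values 10 and 0.01 are arbitrary assumptions for illustration, not tuned settings, and the scores will depend on the data split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Three variants: no limits, a minimum leaf size, and cost-complexity pruning.
# The hyperparameter values are illustrative guesses, not recommendations.
candidates = {
    "unconstrained": DecisionTreeClassifier(random_state=123),
    "min_samples_leaf=10": DecisionTreeClassifier(min_samples_leaf=10, random_state=123),
    "ccp_alpha=0.01": DecisionTreeClassifier(ccp_alpha=0.01, random_state=123),
}

results = {}
for name, tree in candidates.items():
    tree.fit(X_train, y_train)
    results[name] = {
        "leaves": tree.get_n_leaves(),
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    }
    print(name, results[name])
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting. In practice the constraint values would be chosen by cross-validation rather than fixed by hand.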

In conclusion, the decision tree is a simple yet powerful machine-learning algorithm for classification and regression problems. It splits the data recursively on the attribute that provides the most information gain or the least impurity, which makes it easy to understand, but a tree that is allowed to grow without limits is susceptible to overfitting.

Here's an example of a Python program for a Decision Tree classifier using the breast cancer dataset from the scikit-learn library. It includes making predictions, calculating accuracy, and plotting a tree diagram and ROC curve.

Importing the required packages:

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

Load breast cancer dataset

data = load_breast_cancer()
X = data.data
y = data.target

Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

Train Decision Tree classifier

model = DecisionTreeClassifier(random_state=123)
model.fit(X_train, y_train)

Make predictions on the test set

predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Output: Accuracy: 0.956140350877193

Plot tree diagram

plt.figure(figsize=(12, 8))
plot_tree(model, feature_names=data.feature_names, class_names=data.target_names, filled=True)
plt.show()

Calculate ROC curve

probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
auc = roc_auc_score(y_test, probabilities)

# Plot ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

To visualize the decision tree, we use 'plot_tree' from 'sklearn.tree' together with 'matplotlib.pyplot' to draw the tree diagram.
For the ROC curve, we obtain the predicted probabilities of the positive class (column 1, which is the 'benign' class in this dataset) using 'predict_proba', then compute the false positive rate (FPR), true positive rate (TPR), and thresholds using the 'roc_curve' function. We also calculate the area under the ROC curve (AUC) using 'roc_auc_score'.
Finally, we plot the ROC curve using 'matplotlib.pyplot' with the FPR and TPR values and show the AUC in the legend.
