Random forest is a well-known machine-learning technique that can be used for both classification and regression. It is an ensemble method that combines many decision trees to improve overall performance and reduce overfitting.
Each tree in a random forest is built from a random subset of the training data (a bootstrap sample) and considers a random subset of the features. Each tree therefore sees a different slice of the data and a different set of features, and this randomness reduces the correlation between trees and makes the combined model stronger.
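To make the sampling concrete, here is a minimal sketch (illustrative only, not scikit-learn's internal code) of drawing a bootstrap sample and a random feature subset for one tree:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
X = rng.normal(size=(n_samples, n_features))

# Bootstrap sample: draw n_samples row indices with replacement
sample_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[sample_idx]

# Random feature subset: sqrt(n_features) is a common default for classification
n_sub = int(np.sqrt(n_features))
feature_idx = rng.choice(n_features, size=n_sub, replace=False)
X_boot_sub = X_boot[:, feature_idx]

Note that scikit-learn actually draws a fresh random feature subset at every split (controlled by max_features) rather than once per tree, but the idea is the same.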
During training, each tree in the forest is grown by recursively splitting the data on the selected features, with each split chosen to maximise the information gain. Splits are picked according to criteria designed to increase the purity of the resulting subsets.
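As a toy illustration (a simplified sketch, not scikit-learn's optimised splitter), the quality of one candidate split can be scored by the decrease in Gini impurity it produces:

import numpy as np

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(x, y, threshold):
    # Impurity decrease from splitting feature values x at a threshold
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - child

The tree builder evaluates many candidate thresholds per feature and keeps the one with the largest gain.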
After training all of the trees in the forest, the algorithm aggregates their predictions to generate a final prediction. In classification tasks, the final prediction is the majority vote over the predictions from all the trees. In regression tasks, it is the average of the predictions from all the trees.
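For example, the two aggregation rules look like this (a sketch assuming the per-tree predictions for three samples from five trees have been collected into a NumPy array):

import numpy as np

# Hypothetical per-tree class predictions (rows: trees, columns: samples)
tree_preds = np.array([[0, 1, 1],
                       [1, 1, 0],
                       [0, 1, 1],
                       [0, 1, 1],
                       [1, 1, 0]])

# Classification: majority vote over trees for each sample
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, tree_preds)

# Regression: the per-tree outputs would instead be averaged
average = tree_preds.mean(axis=0)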
The random forest has several advantages over other machine-learning algorithms, including its capacity to handle high-dimensional data, its resistance to overfitting, and its ability to capture nonlinear relationships between features and the target. It is, nevertheless, computationally costly and may need a large number of trees to attain ideal performance.
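One simple way to check how many trees are enough (a sketch reusing the breast-cancer data from the example below) is to cross-validate over several values of n_estimators:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for n in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    print(n, cross_val_score(model, X, y, cv=5).mean())

Accuracy typically plateaus past some point, after which extra trees only add compute time.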
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data # Features
y = data.target # Target variable
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Random Forest classifier using the RandomForestClassifier class from sklearn.ensemble
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier
rf_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = rf_model.predict(X_test)
# Calculate the accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)
The confusion matrix is computed with the confusion_matrix function from sklearn.metrics, and it is visualized as a heatmap using the heatmap function from the seaborn library.
# Create a confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix, annot=True, fmt=".0f", cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# Get feature importances from the Random Forest model
feature_importances = rf_model.feature_importances_
# Create a DataFrame to store feature importances and corresponding feature names
feature_importances_df = pd.DataFrame({'Feature': data.feature_names, 'Importance': feature_importances})
# Sort the DataFrame by importance values in descending order
feature_importances_df = feature_importances_df.sort_values('Importance', ascending=False)
# Create a bar plot to visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()