Random forest is a widely used machine-learning technique for both classification and regression. It is an ensemble method that combines many decision trees to improve predictive performance and reduce overfitting.

Each tree in a random forest is built from a random sample of the training data (typically a bootstrap sample drawn with replacement) and considers only a random subset of the features when choosing each split. As a result, every tree sees a different view of the data and of the features. This randomness reduces the correlation between the trees, which makes the combined model more robust.
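To make the two sources of randomness concrete, here is a minimal sketch using only the Python standard library. The helper names `bootstrap_indices` and `random_feature_subset` are made up for illustration, and the "square root of the number of features" rule is an assumption (a common default for classification), not something every implementation is required to use.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, rng):
    """Pick roughly sqrt(n_features) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples, n_features = 10, 9  # toy sizes, chosen only for illustration

rows = bootstrap_indices(n_samples, rng)
out_of_bag = sorted(set(range(n_samples)) - set(rows))
features = random_feature_subset(n_features, rng)

print("rows drawn (repeats are expected):", rows)
print("rows never drawn (out-of-bag):", out_of_bag)
print("features considered at this split:", features)
```

Because rows are drawn with replacement, some rows usually appear more than once and others not at all, which is why two trees rarely train on exactly the same data.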

During training, each tree is grown by recursively splitting the data on feature values. Each split is chosen to maximise the information gain, that is, to make the resulting subsets as pure as possible under an impurity criterion such as entropy or Gini impurity.
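The idea of choosing a split that increases purity can be sketched in a few lines. This example uses Gini impurity (the default criterion of scikit-learn's RandomForestClassifier) rather than entropy-based information gain, and it searches a single toy feature. The function names are illustrative, and the sketch assumes the feature has at least two distinct values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(
        ((t, impurity_decrease(values, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # one toy feature
labels = [0, 0, 0, 1, 1, 1]                 # toy class labels

threshold, gain = best_threshold(values, labels)
print("best threshold:", threshold)
print("impurity decrease:", gain)
```

On this toy data the threshold 6.5 separates the two classes perfectly, so both children are pure and the impurity drops from 0.5 to 0. A real tree repeats this search over the candidate features at every node.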

Once all the trees in the forest are trained, the algorithm aggregates their predictions into a final prediction. For classification tasks, the final prediction is the majority vote across all the trees. For regression tasks, it is the average of the predictions from all the trees.
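The two aggregation rules can be sketched as follows. The per-tree predictions are made-up toy values, and `majority_vote` and `average` are illustrative helper names, not library functions.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Classification: most common label per sample across all trees."""
    per_sample = zip(*tree_predictions)  # regroup predictions by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]


def average(tree_predictions):
    """Regression: mean prediction per sample across all trees."""
    per_sample = zip(*tree_predictions)
    return [mean(values) for values in per_sample]


# Each inner list holds one tree's predictions for the same samples.
class_preds = [
    [0, 1, 1],  # tree 1
    [0, 0, 1],  # tree 2
    [1, 1, 1],  # tree 3
]
reg_preds = [
    [2.0, 4.0],  # tree 1
    [3.0, 5.0],  # tree 2
    [4.0, 6.0],  # tree 3
]

print(majority_vote(class_preds))  # [0, 1, 1]
print(average(reg_preds))          # [3.0, 5.0]
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from a strict majority vote.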

Random forests have several advantages over many other machine-learning algorithms, including the ability to handle high-dimensional data, resistance to overfitting, and the capacity to capture nonlinear relationships between the features and the target. They are, however, computationally expensive, and they may need a large number of trees to reach their best performance.
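One way to get a feel for the trade-off between the number of trees and the training cost is to fit forests of different sizes and compare them. The sketch below assumes scikit-learn is installed and uses its built-in breast cancer dataset. The tree counts (25, 100, 400) are arbitrary choices, and the out-of-bag (OOB) accuracy is used here only as a convenient built-in estimate. Timings depend on the machine, so no expected output is shown.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

results = []
for n_trees in (25, 100, 400):
    model = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,   # score each sample using only trees that did not train on it
        random_state=42,
    )
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    results.append((n_trees, model.oob_score_, elapsed))

for n_trees, oob, elapsed in results:
    print(f"{n_trees:>4} trees  OOB accuracy={oob:.3f}  fit time={elapsed:.2f}s")
```

Typically the fit time grows roughly in proportion to the number of trees, while the accuracy gains flatten out after a certain point.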

Let's look at an example of implementing random forest in Python, along with accuracy measures:

First, we'll start by loading the necessary libraries and dataset:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
Load the breast cancer dataset
data = load_breast_cancer()
X = data.data  # Features
y = data.target  # Target variable
Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Create a Random Forest classifier
A Random Forest classifier with 100 estimators (trees) is created using the RandomForestClassifier class from sklearn.ensemble, and then trained on the training set.
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier
rf_model.fit(X_train, y_train)
Make predictions on the test set
y_pred = rf_model.predict(X_test)
Calculate the accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)
The confusion matrix is created using the confusion_matrix function from sklearn.metrics.
The confusion matrix is visualized as a heatmap using the heatmap function from the seaborn library.
Plot the confusion matrix
# Create a confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix, annot=True, fmt=".0f", cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
Create a bar plot to visualize feature importance
# Get feature importances from the Random Forest model
feature_importances = rf_model.feature_importances_
# Create a DataFrame to store feature importances and corresponding feature names
feature_importances_df = pd.DataFrame({'Feature': data.feature_names, 'Importance': feature_importances})
# Sort the DataFrame by importance values in descending order
feature_importances_df = feature_importances_df.sort_values('Importance', ascending=False)
# Create a bar plot to visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
The code uses the breast cancer dataset to train a Random Forest classifier and then evaluates its accuracy on a test set. It also visualizes the confusion matrix and feature importances to provide insights into the performance and importance of different features in the classification task.
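Accuracy alone can hide how the errors are distributed between the two classes. As a possible extension, the sketch below repeats the same split and model settings and reports precision, recall, and F1 as well. It assumes scikit-learn is installed. In this dataset, label 1 means benign and label 0 means malignant, so the single-number scores (which use label 1 as the positive class by default) describe the benign class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

# Default positive class is label 1 (benign in this dataset)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")

# Per-class breakdown for both malignant and benign
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

The per-class report is often more informative for medical data, where missing a malignant case is usually costlier than a false alarm.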