Random forest is a well-known machine-learning technique that can be used for both classification and regression. It is an ensemble method that combines many decision trees to improve overall performance and reduce overfitting.
Each tree in a random forest is built from a random subset of the training data (a bootstrap sample) and considers a random subset of the features. Each tree therefore sees a different slice of the data and a different set of features, and this randomness reduces the correlation between trees and makes the combined model stronger.
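To make the sampling concrete, here is a minimal sketch (illustrative only, not scikit-learn's internal code) of drawing a bootstrap sample and a random feature subset for one tree:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 10
X = rng.normal(size=(n_samples, n_features))

# Bootstrap sample: draw n_samples row indices with replacement
sample_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[sample_idx]

# Random feature subset: sqrt(n_features) is a common default for classification
n_sub = int(np.sqrt(n_features))
feature_idx = rng.choice(n_features, size=n_sub, replace=False)
X_boot_sub = X_boot[:, feature_idx]

Note that scikit-learn actually draws a fresh random feature subset at every split (controlled by max_features) rather than once per tree, but the idea is the same.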
During training, each tree in the forest is grown by recursively splitting the data on the selected features, with each split chosen to maximise the information gain. Splits are picked according to criteria designed to increase the purity of the resulting subsets.
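As a toy illustration (a simplified sketch, not scikit-learn's optimised splitter), the quality of one candidate split can be scored by the decrease in Gini impurity it produces:

import numpy as np

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(x, y, threshold):
    # Impurity decrease from splitting feature values x at a threshold
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - child

The tree builder evaluates many candidate thresholds per feature and keeps the one with the largest gain.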
After training all of the trees in the forest, the algorithm aggregates their predictions to generate a final prediction. In classification tasks, the final prediction is the majority vote over the predictions from all the trees. In regression tasks, it is the average of the predictions from all the trees.
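For example, the two aggregation rules look like this (a sketch assuming the per-tree predictions for three samples from five trees have been collected into a NumPy array):

import numpy as np

# Hypothetical per-tree class predictions (rows: trees, columns: samples)
tree_preds = np.array([[0, 1, 1],
                       [1, 1, 0],
                       [0, 1, 1],
                       [0, 1, 1],
                       [1, 1, 0]])

# Classification: majority vote over trees for each sample
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, tree_preds)

# Regression: the per-tree outputs would instead be averaged
average = tree_preds.mean(axis=0)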
The random forest has several advantages over other machine-learning algorithms, including its capacity to handle high-dimensional data, its resistance to overfitting, and its ability to capture nonlinear relationships between features and the target. It is, nevertheless, computationally costly and may need a large number of trees to attain ideal performance.
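One simple way to check how many trees are enough (a sketch reusing the breast-cancer data from the example below) is to cross-validate over several values of n_estimators:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for n in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    print(n, cross_val_score(model, X, y, cv=5).mean())

Accuracy typically plateaus past some point, after which extra trees only add compute time.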
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data # Features
y = data.target # Target variable
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Random Forest classifier using the RandomForestClassifier class from sklearn.ensemble
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier
rf_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = rf_model.predict(X_test)
# Calculate the accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)
The confusion matrix is computed with the confusion_matrix function from sklearn.metrics, and it is visualized as a heatmap using the heatmap function from the seaborn library.
# Create a confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix, annot=True, fmt=".0f", cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# Get feature importances from the Random Forest model
feature_importances = rf_model.feature_importances_
# Create a DataFrame to store feature importances and corresponding feature names
feature_importances_df = pd.DataFrame({'Feature': data.feature_names, 'Importance': feature_importances})
# Sort the DataFrame by importance values in descending order
feature_importances_df = feature_importances_df.sort_values('Importance', ascending=False)
# Create a bar plot to visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()