Random forest is a well-known machine learning technique that can be used for both classification and regression. It is an ensemble method that combines many decision trees to improve overall performance and reduce overfitting.
Each tree in a random forest is built from a bootstrap sample of the training data (rows drawn at random with replacement), and only a random subset of the features is considered when choosing each split. Each tree therefore sees a different sample of the data and a different set of candidate features. This randomness reduces the correlation between the trees and makes the combined model more robust.
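The two sources of randomness can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn implements it: the helper names `bootstrap_sample` and `random_feature_subset`, the toy sizes, and the square-root rule for the number of features are assumptions made for the example.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate feature indices, without replacement."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)       # fixed seed so the sketch is reproducible
n_rows, n_features = 10, 9   # hypothetical toy sizes
k = int(n_features ** 0.5)   # sqrt(n_features), a common choice for classification

for tree_id in range(3):
    rows = bootstrap_sample(n_rows, rng)                # rows this tree trains on
    feats = random_feature_subset(n_features, k, rng)   # candidates for one split
    print(f"tree {tree_id}: rows={rows} candidate features={feats}")
```

Because rows are drawn with replacement, some indices repeat and others are left out of a given tree's sample, which is what makes the trees differ from one another.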
During training, each tree is grown by recursively splitting the data on the selected features, with each split chosen to maximise the information gain. The split criterion, such as Gini impurity or entropy, is designed to increase the purity of the resulting subsets.
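To make the notion of purity concrete, here is a minimal sketch of entropy, Gini impurity, and the information gain of a single split. The function names and the toy labels are assumptions for illustration, and the functions expect non-empty label lists. For reference, scikit-learn's `RandomForestClassifier` uses Gini impurity as its default criterion.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical labels at a node

# A perfect split separates the classes; a poor split leaves both sides mixed.
print(information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # 1.0
print(information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
print(gini(parent))                                          # 0.5
```

A tree-growing routine would evaluate candidate splits in this way and keep the one with the highest gain (or the lowest weighted impurity).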
After all the trees have been trained, the algorithm aggregates their predictions to produce a final prediction. In classification tasks, the final prediction is the majority vote over the trees' predictions. In regression tasks, it is the average of the trees' predictions.
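The aggregation step can be sketched as follows. The per-tree predictions are made-up values, and the helper names `majority_vote` and `average_prediction` are assumptions for the example. Note that some libraries, scikit-learn among them, average the trees' class probabilities rather than counting hard votes, so this shows the textbook version of the idea.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(predictions)

# Hypothetical outputs of five classification trees and three regression trees
tree_class_preds = ["benign", "malignant", "benign", "benign", "malignant"]
tree_reg_preds = [2.0, 3.0, 4.0]

print(majority_vote(tree_class_preds))      # benign
print(average_prediction(tree_reg_preds))   # 3.0
```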
Random forests have several advantages over many other machine learning algorithms, including the capacity to handle high-dimensional data, resistance to overfitting, and the ability to capture nonlinear relationships between the features and the target. They are, however, computationally expensive and may need a large number of trees to reach their best performance.
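One way to see why adding trees can help is a simplified voting model. The sketch below assumes every tree is correct with the same probability and, unrealistically, that the trees err independently of one another; real trees are correlated, so actual gains are smaller. The function name `majority_accuracy` and the 0.6 per-tree accuracy are assumptions chosen for illustration.

```python
from math import comb

def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right.

    Assumes n_trees is odd (so there are no ties) and that each tree is
    correct with probability p_correct, independently of the others.
    """
    need = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(need, n_trees + 1)
    )

for n in (1, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions the ensemble's accuracy rises with the number of trees, at the price of training and evaluating more of them.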
Import the libraries

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

Load the breast cancer dataset

data = load_breast_cancer()
X = data.data    # Features
y = data.target  # Target variable

Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create a Random Forest classifier

The classifier is created with the RandomForestClassifier class from sklearn.ensemble.

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier
rf_model.fit(X_train, y_train)

Make predictions on the test set

y_pred = rf_model.predict(X_test)

Calculate the accuracy

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)

The confusion matrix is computed with the confusion_matrix function from sklearn.metrics. The confusion matrix is visualized as a heatmap using the heatmap function from the seaborn library.

# Create a confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Create a bar plot to visualize feature importance

# Get feature importances from the Random Forest model
feature_importances = rf_model.feature_importances_
# Create a DataFrame to store feature importances and corresponding feature names
feature_importances_df = pd.DataFrame({'Feature': data.feature_names, 'Importance': feature_importances})
# Sort the DataFrame by importance values in descending order
feature_importances_df = feature_importances_df.sort_values('Importance', ascending=False)
# Create a bar plot to visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df)
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()