The previous post explained Logistic Regression with R. Take a minute to look at Logistic Regression in R before continuing.
Let's start by importing the required tools and loading the dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Loading the dataset
data = load_breast_cancer()
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)
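# Creating the Logistic Regression model; max_iter is raised above the default of 100
# because the default lbfgs solver may not converge on these unscaled features
model = LogisticRegression(max_iter=10000)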
# Training the model on the training data
model.fit(X_train, y_train)
# Making predictions on the testing data
y_pred = model.predict(X_test)
# Calculating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Creating a confusion matrix to summarize the performance of the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
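# Creating a ROC curve and calculating the area under the curve (AUC) from the predicted
# probabilities of the positive class; hard 0/1 labels would yield only one threshold
y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)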
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
In this example, we load the breast cancer dataset and divide it into training and testing sets with the train_test_split() function from scikit-learn, holding out 30% of the samples for testing. We use the LogisticRegression() class to construct a Logistic Regression model, then call the fit() method to train the model on the training data and the predict() method to make predictions on the testing data.
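To make the predict() step more concrete, here is a minimal pure-Python sketch of what a binary logistic regression does for a single sample: it computes a linear score, passes it through the sigmoid function, and compares the result to 0.5. The weights, bias, and feature values are hypothetical numbers chosen for illustration, not coefficients learned from the breast cancer data, and the helper names are assumptions rather than scikit-learn functions.

```python
import math

def sigmoid(z):
    # Logistic function: maps any real-valued score to a value in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(weights, bias, features):
    # Linear score followed by the sigmoid gives the probability of class 1
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_one(weights, bias, features, threshold=0.5):
    # Class 1 when the probability exceeds the threshold, otherwise class 0
    return 1 if predict_proba_one(weights, bias, features) > threshold else 0

# Hypothetical coefficients and samples (illustration only)
weights = [0.8, -1.2]
bias = 0.1

print(predict_proba_one(weights, bias, [1.0, 0.5]))  # score 0.3, probability above 0.5
print(predict_one(weights, bias, [1.0, 0.5]))        # class 1
print(predict_one(weights, bias, [0.0, 1.0]))        # score -1.1, class 0
```

In the scikit-learn model, the learned counterparts of these values are exposed as model.coef_ and model.intercept_ after fit() has been called.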
The accuracy of the model is then calculated with the accuracy_score() function, and the confusion matrix is generated with the confusion_matrix() function. Finally, we use the roc_curve() function to generate a ROC curve and the roc_auc_score() function to determine the area under the curve (AUC). We plot the ROC curve with matplotlib's plot() function.
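As a rough illustration of what accuracy_score() and confusion_matrix() compute, the sketch below counts the four outcomes by hand on a small set of made-up labels. The label lists and the confusion_counts() helper are assumptions for this example; the returned order (tn, fp, fn, tp) mirrors the row-by-row layout [[tn, fp], [fn, tp]] that scikit-learn uses for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    # Count each of the four possible (actual, predicted) outcomes
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all samples predicted correctly
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that were found

print([[tn, fp], [fn, tp]])          # [[3, 1], [1, 3]]
print(accuracy, precision, recall)   # 0.75 0.75 0.75
```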
It is important to note that the breast cancer dataset is a binary classification dataset, with an outcome variable that labels each tumor as malignant (0) or benign (1). The model's accuracy is the fraction of test samples whose class it predicts correctly. The confusion matrix reports the number of true positives, false positives, true negatives, and false negatives, and can assist us in understanding which kinds of errors the model makes. The ROC curve and AUC are essential for assessing the model's performance across various threshold levels, rather than at a single cutoff.
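The idea of evaluating a model across threshold levels can be sketched in pure Python: sweep the threshold over the predicted scores, record the false positive rate and true positive rate at each step, and integrate with the trapezoid rule. The labels and scores below are a made-up toy example, and roc_points() and auc_trapezoid() are hypothetical helpers written for this sketch; they are a simplified stand-in for roc_curve() and roc_auc_score(), not their actual implementation.

```python
def roc_points(y_true, scores):
    # One (FPR, TPR) point per distinct score, from the strictest threshold down
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    # Area under the piecewise-linear curve through the ROC points
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up labels and predicted probabilities for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)                  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_trapezoid(points))   # 0.75
```

Each distinct score contributes one point to the curve, which is why the curve is only informative when it is built from continuous scores such as predicted probabilities.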