Naive Bayes is a probabilistic algorithm used in machine learning for classification tasks. It is based on Bayes' theorem, which states that the probability of a hypothesis H given evidence E is proportional to the probability of the evidence given the hypothesis times the prior probability of the hypothesis. In the context of classification, hypothesis H is the class label of an instance, and evidence E is the set of features or attributes that describe that instance.
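In symbols, Bayes' theorem can be written as:

P(H | E) = P(E | H) * P(H) / P(E)

Because the denominator P(E) is the same for every candidate class, it can be dropped when comparing classes, which is the proportionality mentioned above.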
Naive Bayes assumes that the features are conditionally independent given the class label, meaning that once the class is known, the presence or absence of one feature tells you nothing about the presence or absence of any other feature. This assumption greatly simplifies the calculation of the posterior probability of the class label given the evidence, i.e. the probability that an instance belongs to a particular class given its features.
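Under this independence assumption, the likelihood of the evidence factorizes into a product of per-feature terms:

P(E | H) = P(x1 | H) * P(x2 | H) * ... * P(xn | H)

where x1, ..., xn are the individual feature values of the instance. This factorization is what makes the probabilities cheap to estimate from data.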
Naive Bayes works by first estimating the prior probability of each class label based on the training data, and then estimating the conditional probability of each feature given each class label. These probabilities are then used to compute the posterior probability of each class label given the features of a test instance. The class label with the highest posterior probability is then assigned to the test instance.
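To make these steps concrete, here is a minimal from-scratch sketch in Python. The tiny weather-style dataset and the helper names are made up purely for illustration; they are not taken from any library.

# Minimal from-scratch Naive Bayes for categorical features, illustrating the
# prior / likelihood / posterior steps described above. Toy data is made up.
from collections import Counter, defaultdict

# Toy training data: (outlook, windy) -> play tennis (yes/no)
X_train = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
           ("overcast", "no"), ("rainy", "yes"), ("overcast", "yes")]
y_train = ["yes", "no", "yes", "yes", "no", "yes"]

# 1. Prior probability of each class label
class_counts = Counter(y_train)
priors = {c: n / len(y_train) for c, n in class_counts.items()}

# 2. Conditional probability of each feature value given each class
cond_counts = defaultdict(Counter)   # (feature_index, class) -> value counts
for features, label in zip(X_train, y_train):
    for i, value in enumerate(features):
        cond_counts[(i, label)][value] += 1

def likelihood(i, value, label):
    counts = cond_counts[(i, label)]
    # Laplace smoothing so unseen feature values do not zero out the product
    n_values = len(set(v[i] for v in X_train))
    return (counts[value] + 1) / (sum(counts.values()) + n_values)

# 3. Posterior (up to a constant) for a test instance; pick the argmax
def predict(features):
    scores = {}
    for label in priors:
        score = priors[label]
        for i, value in enumerate(features):
            score *= likelihood(i, value, label)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(("sunny", "no")))   # prints "yes" for this toy data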
Naive Bayes is widely used in text classification tasks such as spam filtering, sentiment analysis, and topic classification. It is also used in other domains such as image classification and medical diagnosis.
One of the advantages of Naive Bayes is that it is computationally efficient and can handle a large number of features. It also works well with small training datasets and can handle missing values. However, the assumption of feature independence may not hold in some cases, which can lead to suboptimal performance.
There are different variants of Naive Bayes, such as Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes, each suited to a different type of data: continuous measurements, count data (such as word counts), and binary features, respectively.
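As a rough illustration of which variant goes with which feature type, here is a minimal scikit-learn sketch; the tiny arrays and labels are made up purely for illustration.

# Sketch: choosing a Naive Bayes variant based on the feature type (toy data)
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features (e.g. measurements) -> GaussianNB
X_cont = np.array([[1.2, 3.4], [2.1, 0.5], [0.3, 2.2], [1.8, 1.1]])
# Count features (e.g. word counts in documents) -> MultinomialNB
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4], [0, 0, 2]])
# Binary features (e.g. word present / absent) -> BernoulliNB
X_bin = (X_counts > 0).astype(int)
y = np.array([0, 1, 0, 1])

print(GaussianNB().fit(X_cont, y).predict(X_cont[:1]))
print(MultinomialNB().fit(X_counts, y).predict(X_counts[:1]))
print(BernoulliNB().fit(X_bin, y).predict(X_bin[:1]))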
In practice, Naive Bayes is available in most machine learning libraries, such as scikit-learn in Python. Here's an example of Naive Bayes classification in Python using the sklearn package:
# Import the required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data    # Features
y = data.target  # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)

# Create a Naive Bayes classifier and train it
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb_model.predict(X_test)

# Calculate the accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Create a confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix, annot=True, fmt=".0f", cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Get predicted probabilities for each class
y_prob = nb_model.predict_proba(X_test)

# Compute false positive rate, true positive rate, and thresholds for the ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob[:, 1])

# Calculate the AUC score
auc_score = metrics.auc(fpr, tpr)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC Curve (AUC = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
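One design note: GaussianNB fits this example because the breast cancer dataset's features are continuous, real-valued measurements; for word-count features in a text classification setting, MultinomialNB would usually be the better choice.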