Support Vector Machine (SVM) is a widely used machine learning technique for classification and regression tasks. For binary classification, its key principle is to find the hyperplane that separates the two classes with the largest margin, where the margin is the distance between the hyperplane and the closest data points of each class (the support vectors). A linear SVM works well when the classes are linearly separable or nearly so; when they are not, a kernel function implicitly maps the data points into a higher-dimensional feature space where they may become separable.
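To make the margin idea concrete, here is a minimal sketch on synthetic data. The data, the cluster centers, the variable names (`X_demo`, `linear_svm`, and so on) and the large `C` value are all assumptions chosen for illustration and are not part of the analysis below. For a linear SVM with weight vector w, the width of the margin is 2 / ||w||.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, clearly separated two-class data (illustrative only).
X_demo, y_demo = make_blobs(
    n_samples=60, centers=[[-3, -3], [3, 3]], cluster_std=0.6, random_state=0
)

# A large C approximates a hard-margin SVM on separable data.
linear_svm = SVC(kernel="linear", C=1000.0)
linear_svm.fit(X_demo, y_demo)

# For a linear kernel, coef_ holds the hyperplane's weight vector w.
w = linear_svm.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
n_support = len(linear_svm.support_vectors_)
train_accuracy = linear_svm.score(X_demo, y_demo)

print(f"margin width: {margin_width:.3f}")
print(f"support vectors: {n_support}")
print(f"training accuracy: {train_accuracy:.3f}")
```

Only the support vectors determine the hyperplane; removing any other training point would leave the fitted boundary unchanged.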
SVM is used across many industries because it handles complex datasets well and often achieves high accuracy. It can employ various kernel functions, including linear, polynomial, and radial basis function (RBF) kernels, and the choice of kernel affects both the model's accuracy and its training speed. In practice, the SVM hyperparameters must be tuned for good performance, and overfitting is controlled through the regularization parameter, which trades a wider margin against fewer training errors.
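One common way to compare kernels and tune the regularization parameter is a cross-validated grid search. The sketch below is an illustration under stated assumptions: it uses scikit-learn's built-in breast cancer dataset rather than the CSV file loaded later, and the parameter grid, the pipeline step names (`scale`, `svc`) and the variable names are hypothetical choices, not a recommended configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in dataset used only to keep the example self-contained (assumption).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(gamma="scale"))])

# Small illustrative grid: three kernels and three values of C.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

A smaller `C` tolerates more training errors in exchange for a wider margin, which is the usual lever against overfitting; a larger `C` fits the training data more tightly.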
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load the dataset; the file name must be passed as a string
cancer = pd.read_csv("cancar.csv")
cancer.columns = ["id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean", "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave_points_se", "symmetry_se", "fractal_dimension_se", "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst", "concave_points_worst", "symmetry_worst", "fractal_dimension_worst"]
cancer = cancer.drop("id", axis=1)
cancer.head(10)
X = cancer.drop("diagnosis", axis=1)
y = cancer["diagnosis"]
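# Split first so the imputer and scaler learn their statistics from the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
imp = SimpleImputer(strategy="mean")
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)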
svm_rbf = SVC(kernel="rbf", gamma="scale")
svm_rbf.fit(X_train, y_train)
pred = svm_rbf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred, pos_label="M")
recall = recall_score(y_test, pred, pos_label="M")
f1 = f1_score(y_test, pred, pos_label="M")
metrics = pd.DataFrame({"Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], "F1": [f1]})
print(metrics)
fpr, tpr, thresholds = roc_curve(y_test, svm_rbf.decision_function(X_test), pos_label="M")
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], "--", color="gray", label="Random guess")
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()