K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. To make a prediction for a new data point, the algorithm finds the k training points closest to it and returns the majority class among them (for classification) or the mean of their values (for regression).
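To make the procedure concrete, here is a minimal from-scratch sketch in base R. The function name `knn_predict` and the toy data are made up for illustration, and the code assumes Euclidean distance and numeric features; it is meant to show the idea, not to replace a tested package.

```r
# Minimal KNN sketch. knn_predict is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from new_x to every training row
  diffs <- sweep(train_x, 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  neighbors <- train_y[nearest]
  if (is.numeric(neighbors)) {
    # Regression: average the neighbors' values
    mean(neighbors)
  } else {
    # Classification: majority vote (ties go to the first class in table order)
    names(which.max(table(neighbors)))
  }
}

# Toy data: two well-separated clusters
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(7, 6), c(6, 7))
labels  <- c("A", "A", "A", "B", "B", "B")
values  <- c(1.0, 1.2, 0.8, 5.0, 5.5, 4.5)

class_pred <- knn_predict(train_x, labels, c(1.5, 1.5), k = 3)  # "A"
value_pred <- knn_predict(train_x, values, c(6.5, 6.5), k = 3)  # 5
```

The same function covers both tasks because the only difference is how the k neighbors are combined: a vote for labels, a mean for numeric targets.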
The distance metric used to determine the k closest data points can vary, but Euclidean distance is commonly used. The value of k is a hyperparameter that can be tuned for optimal performance. In general, a larger value of k results in a smoother decision boundary and less overfitting, although a value that is too large can smooth away real structure and underfit. A smaller value of k results in a more complex decision boundary and more overfitting.
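One way to see the effect of k is to compare leave-one-out cross-validated accuracy for several values. The sketch below uses `knn.cv` from the `class` package on R's built-in `iris` data. Both the dataset and the candidate values of k are illustrative choices, not taken from the example later in this article.

```r
library(class)  # recommended package shipped with standard R installations

# Standardize the four numeric features so each contributes equally to the distance
X <- scale(iris[, 1:4])
y <- iris$Species

# Candidate values of k (an arbitrary illustrative grid)
k_values <- c(1, 5, 15, 45)

# Leave-one-out cross-validated accuracy for each k
loocv_accuracy <- sapply(k_values, function(k) {
  set.seed(1)  # knn.cv breaks ties at random, so fix the seed
  mean(knn.cv(X, y, k = k) == y)
})

results <- data.frame(k = k_values, accuracy = loocv_accuracy)
print(results)
```

In practice the grid would be finer, and the k with the best cross-validated score would be used to classify new data.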
KNN has several advantages: it is simple to understand and implement, and it can handle non-linear decision boundaries. Its drawbacks include being computationally expensive for large datasets, since each prediction compares the new point against the stored training data, and being sensitive to the choice of distance metric and value of k.
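The sensitivity to the distance metric can be shown with three points. In this small sketch the coordinates are made up for illustration. The nearest neighbor of the query changes when Euclidean distance is swapped for Manhattan distance, using base R's `dist` function.

```r
# A query point and two candidate neighbors (illustrative coordinates)
pts <- rbind(query = c(0, 0), P = c(3, 3), Q = c(0, 5))

# Distances from the query to P and Q under each metric
euclid <- as.matrix(dist(pts, method = "euclidean"))["query", c("P", "Q")]
manhat <- as.matrix(dist(pts, method = "manhattan"))["query", c("P", "Q")]

nearest_euclid <- names(which.min(euclid))  # "P": sqrt(18) is about 4.24, less than 5
nearest_manhat <- names(which.min(manhat))  # "Q": 5 is less than 6
```

With k = 1, the two metrics would therefore assign the query to different neighbors, which is why the metric deserves the same scrutiny as k.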
Overall, KNN can be a useful algorithm for various tasks, especially when the dataset is small and the decision boundary is complex.
Here's an example of how to implement KNN for a binary classification task in R:
# Load the required packages
library(class)
library(ggplot2)

# Load the dataset
data <- read.csv("breast-cancer", header = FALSE)

# Data preprocessing: drop the first column (presumably an ID), then
# separate the class label (now column 1) from the features
data <- data[, -1]
X <- scale(data[, -1])       # standardize features so no single one dominates the distance
y <- as.factor(data[, 1])

# Split the data into training and testing sets (70% / 30%)
set.seed(123)
train_indices <- sample(1:nrow(data), floor(0.7 * nrow(data)))
X_train <- X[train_indices, ]
y_train <- y[train_indices]
X_test <- X[-train_indices, ]
y_test <- y[-train_indices]

# KNN model fitting with k = 5 and predictions on the testing set
# (class::knn does both in a single call and returns a factor)
pred <- knn(X_train, X_test, y_train, k = 5)
plot(pred)                   # bar chart of predicted class counts

# Calculating the accuracy measures, treating "M" as the positive class
accuracy <- sum(pred == y_test) / length(y_test)
precision <- sum(pred == "M" & y_test == "M") / sum(pred == "M")
recall <- sum(pred == "M" & y_test == "M") / sum(y_test == "M")
f1 <- 2 * precision * recall / (precision + recall)
metrics <- data.frame(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
print(metrics)

# Plotting of decision boundary: test points on the first two features,
# colored by predicted class and shaped by actual class
data_plot <- data.frame(X = X_test[, 1], Y = X_test[, 2], Class = pred, Actual = y_test)
ggplot(data_plot, aes(X, Y, color = Class, shape = Actual)) +
  geom_point() +
  ggtitle("KNN Decision Boundary")