K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. The algorithm makes predictions by finding the k closest data points in the training set to the new data point and choosing the majority class (for classification) or the mean value (for regression) of those k data points as the predicted value for the new data point.
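
To make the idea concrete, here is a minimal from-scratch sketch of the majority-vote rule. The knn_predict function below is purely illustrative and not part of any package; the worked example later in this post uses the class package instead.

# A minimal, illustrative KNN classifier (hypothetical helper, for exposition only)
knn_predict = function(X_train, y_train, x_new, k = 3) {
  # Euclidean distance from x_new to every training point
  dists = sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  # Labels of the k nearest neighbours
  nearest = y_train[order(dists)[1:k]]
  # Majority vote among those labels
  votes = table(nearest)
  names(votes)[which.max(votes)]
}

# Tiny worked example: two clusters in two dimensions
X_demo = rbind(c(1, 1), c(1, 2), c(5, 5), c(6, 5))
y_demo = c("A", "A", "B", "B")
knn_predict(X_demo, y_demo, c(1.5, 1.5), k = 3)  # returns "A"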

The distance metric used to find the k closest points can vary, but Euclidean distance is the most common choice. The value of k is a hyperparameter that can be tuned for optimal performance: in general, a larger value of k gives a smoother decision boundary and less overfitting, while a smaller value of k gives a more complex decision boundary and more overfitting.
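
As a rough illustration of tuning k, one can compare held-out accuracy over a grid of candidate values. The sketch below uses simulated data; the data, the 70/30-style split, and the k grid are all arbitrary choices for illustration.

# Sketch: choose k by comparing accuracy on a held-out set (simulated data)
library(class)
set.seed(1)
X = matrix(rnorm(200 * 2), ncol = 2)
y = factor(ifelse(X[, 1] + X[, 2] > 0, "pos", "neg"))
train = sample(1:200, 140)
for (k in c(1, 3, 5, 7, 9)) {
  pred = knn(X[train, ], X[-train, ], y[train], k = k)
  acc = mean(pred == y[-train])
  cat("k =", k, " held-out accuracy =", round(acc, 3), "\n")
}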

KNN has several advantages: it is simple to understand and implement, and it can handle non-linear decision boundaries. However, it also has drawbacks: every prediction requires computing distances to the entire training set, which is expensive for large datasets, and the results are sensitive to the choice of distance metric and value of k.

Overall, KNN can be a useful algorithm for various tasks, especially when the dataset is small and the decision boundary is complex.

Here's an example of how to implement KNN for a binary classification task in R:

# Load the required packages
library(class)
library(ggplot2)

# Load the dataset (Breast Cancer Wisconsin Diagnostic, wdbc.data)
data = read.csv("wdbc.data", header = FALSE)

# Data preprocessing: drop the ID column, separate the labels (y)
# from the features (X), and standardise the features
data = data[, -1]
X = data[, -1]
y = data[, 1]
X = scale(X)

# Split the data into training (70%) and testing (30%) sets
set.seed(123)
train_indices = sample(1:nrow(data), 0.7 * nrow(data))
X_train = X[train_indices, ]
y_train = y[train_indices]
X_test = X[-train_indices, ]
y_test = y[-train_indices]

# Fit the KNN model with k = 5: knn() directly predicts labels
# for the testing set from the training set's nearest neighbours
knn = knn(X_train, X_test, y_train, k = 5)
plot(knn)  # quick bar chart of the predicted class frequencies

# Store the predictions as a factor for the accuracy calculations
pred = as.factor(knn)

# Calculate the accuracy measures ("M" = malignant is the positive class)
accuracy = sum(pred == y_test) / length(y_test)
precision = sum(pred == "M" & y_test == "M") / sum(pred == "M")
recall = sum(pred == "M" & y_test == "M") / sum(y_test == "M")
f1 = 2 * precision * recall / (precision + recall)
metrics = data.frame(Accuracy = accuracy, Precision = precision,
                     Recall = recall, F1 = f1)
print(metrics)

# Plot the predicted (colour) and actual (shape) classes
# on the first two scaled features of the testing set
data_plot = data.frame(X = X_test[, 1], Y = X_test[, 2],
                       Class = pred, Actual = y_test)
ggplot(data_plot, aes(X, Y, color = Class, shape = Actual)) +
  geom_point() +
  ggtitle("KNN Decision Boundary")
The code above builds a K-Nearest Neighbours (KNN) model in R using the UCI Machine Learning Repository's Breast Cancer Wisconsin (Diagnostic) dataset (wdbc.data). It can be summarised as follows:

Data preparation is carried out first. The ID column is dropped (data[, -1]), the diagnosis labels (y) are separated from the features (X), and the scale() function standardises the features to zero mean and unit variance.

The data is divided into training and testing sets using the sample() function: 70% of the rows are chosen at random as the training set, and the remaining 30% serve as the testing set.

The KNN model is fitted with k = 5. There is no separate training step: the knn() function directly predicts class labels for the testing set (X_test) from each point's nearest neighbours in the training set (X_train).
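
As a side note, class::knn can also report how strong the winning vote was: with prob = TRUE, the proportion of neighbour votes for the winning class is attached as an attribute. A small sketch, reusing the objects above:

# Re-run the prediction, also requesting the winning-vote proportions
knn_prob = knn(X_train, X_test, y_train, k = 5, prob = TRUE)
head(attr(knn_prob, "prob"))  # e.g. 1.0 means all 5 neighbours agreed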

The predictions (pred) are converted to a factor before the accuracy measures are computed (knn() already returns a factor, so this step is a harmless safeguard).

The accuracy measures (accuracy, precision, recall, and F1 score, treating "M" as the positive class) are calculated and stored in a data frame (metrics).
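
The same counts can also be read off a confusion matrix built with table(); a small sketch, assuming the pred and y_test objects from the code above (and that both classes appear in each):

# Confusion matrix of predicted vs actual classes
cm = table(Predicted = pred, Actual = y_test)
print(cm)
# Accuracy is the share of the diagonal in the total
sum(diag(cm)) / sum(cm)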

Finally, ggplot2 is used to draw a scatter plot of the testing set on its first two scaled features: colour (Class) shows the predicted classes, while shape (Actual) shows the true classes, so misclassified points stand out wherever colour and shape disagree.
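
Strictly speaking, that scatter plot shows predicted versus actual labels rather than the boundary itself. If an actual decision boundary is wanted, one common approach is to classify a dense grid of points, here restricted to the first two scaled features. A sketch, reusing X_train, y_train, and data_plot from above; the grid resolution of 100 is an arbitrary choice, and a model on two features will differ from the full 30-feature model:

# Sketch: approximate KNN decision regions over the first two features
grid = expand.grid(
  X = seq(min(X_train[, 1]), max(X_train[, 1]), length.out = 100),
  Y = seq(min(X_train[, 2]), max(X_train[, 2]), length.out = 100)
)
grid$Class = knn(X_train[, 1:2], grid, y_train, k = 5)
ggplot(grid, aes(X, Y, fill = Class)) +
  geom_tile(alpha = 0.3) +
  geom_point(data = data_plot, aes(X, Y, color = Class, shape = Actual),
             inherit.aes = FALSE) +
  ggtitle("Approximate KNN Decision Regions (first two features)")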
