Logistic Regression is a statistical method for modelling the relationship
between a binary dependent variable and one or more independent variables.
The dependent variable indicates the presence or absence of a particular
trait or event and can take only two values, typically 0 or 1. The main aim
of Logistic Regression is to estimate the probability that the dependent
variable equals 1 given the values of the independent variables. This
probability is produced by the logistic function, which, because of its
S-shaped curve, is also known as the sigmoid function.
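To make the sigmoid concrete, here is a minimal sketch in R; the function name `sigmoid` is ours for illustration (base R already provides the same function as `plogis()`):

```r
# Logistic (sigmoid) function: maps any real-valued z to the interval (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)    # 0.5: the curve crosses 0.5 at z = 0
sigmoid(4)    # close to 1
sigmoid(-4)   # close to 0

# Plot the characteristic S-shaped curve
curve(sigmoid, from = -6, to = 6, xlab = "z", ylab = "P(y = 1)")
```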
The logistic regression model estimates the parameters of the logistic
function by maximum likelihood: it searches for the set of parameters that
maximises the likelihood of the observed data. Once the parameters have been
estimated, they can be used to predict the probability that the response
variable equals 1 for new observations, based on the values of their
independent variables.
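The maximum-likelihood idea can be illustrated directly. The sketch below simulates a small dataset, minimises the negative Bernoulli log-likelihood with `optim()`, and checks that the result agrees with `glm()`; the simulated data and the name `negloglik` are purely illustrative:

```r
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))  # true intercept -1, true slope 2

# Negative Bernoulli log-likelihood for beta = (intercept, slope)
negloglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit_optim <- optim(c(0, 0), negloglik)     # direct numerical maximisation
fit_glm   <- glm(y ~ x, family = binomial) # glm() does the same job internally

fit_optim$par  # roughly (-1, 2)
coef(fit_glm)  # essentially the same estimates
```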
Several industries, including finance, marketing, medicine, and the social
sciences, employ logistic regression:
- In finance, it may be used to estimate the probability of loan default.
- In marketing, it may be used to forecast the likelihood that a customer responds to a campaign.
- In medicine, it may be used to predict the presence or absence of a disease from multiple patient characteristics.
Overall, Logistic Regression is a useful tool for predicting binary outcomes
and can be applied in a wide range of situations.
Let's walk through an example in R using the "Titanic" dataset, a common
dataset for binary classification tasks.
Import and load the required packages:

```r
library(caTools)  # sample.split() for train/test splitting
library(ggplot2)  # plotting
library(pROC)     # ROC curves and AUC
library(dplyr)    # data manipulation
library(titanic)  # the Titanic dataset
```
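If any of these packages are not yet available on your machine, they can be installed first (a one-time step):

```r
# One-time setup: install any of the required packages that are missing
pkgs <- c("caTools", "ggplot2", "pROC", "dplyr", "titanic")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```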
Load the dataset:

```r
data(titanic_train)
```

Preprocess the data, keeping a few predictors and dropping rows with missing values:

```r
titanic <- titanic_train %>%
  select(Survived, Pclass, Sex, Age, SibSp) %>%
  na.omit()
```
Split the dataset into training and test sets:

```r
set.seed(123)  # for reproducibility
split <- sample.split(titanic$Survived, SplitRatio = 0.7)
train <- subset(titanic, split == TRUE)
test  <- subset(titanic, split == FALSE)
```
Fit the logistic regression model:

```r
logistic <- glm(Survived ~ ., data = train, family = binomial)
```
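After fitting, `summary()` shows which predictors matter, and exponentiating the coefficients turns them into odds ratios, which are often easier to interpret:

```r
summary(logistic)  # coefficients, standard errors, p-values

# Odds ratios: the multiplicative change in the odds of survival
# for a one-unit increase in each predictor
exp(coef(logistic))
```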
Make predictions on the test set:

```r
prob <- predict(logistic, newdata = test, type = "response")  # predicted probabilities
pred <- ifelse(prob > 0.5, 1, 0)                              # classify at the 0.5 threshold
```
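The same fitted model can also score an individual case. The passenger below is made up purely for illustration:

```r
# A hypothetical passenger: 3rd class, male, age 30, no siblings/spouses aboard
new_passenger <- data.frame(Pclass = 3, Sex = "male", Age = 30, SibSp = 0)
predict(logistic, newdata = new_passenger, type = "response")  # P(Survived = 1)
```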
Calculate accuracy measures from the confusion matrix:

```r
cm <- table(pred, test$Survived)  # rows: predicted class, columns: actual class
cm

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm)  # correct predictions / all predictions of that class
recall    <- diag(cm) / colSums(cm)  # correct predictions / all actual members of that class
f1        <- 2 * (precision * recall) / (precision + recall)

metrics <- data.frame(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
print(metrics)
```
Generate the ROC curve:

```r
roc <- roc(test$Survived, prob)
plot(roc, col = "blue", print.thres = "best", legacy.axes = TRUE)
```
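The area under the curve (AUC) and its confidence interval can be read directly off the `roc` object:

```r
auc(roc)     # area under the ROC curve: 0.5 = random guessing, 1 = perfect
ci.auc(roc)  # 95% confidence interval for the AUC
```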
Conclusion:
Judging by the ROC curve, whose AUC has a 95% confidence interval of
0.747-0.874, the model has reasonable discriminatory power: it separates
survivors from non-survivors distinctly better than random guessing, which
would correspond to an AUC of 0.5. Keep in mind that we used the Titanic
data here simply as an example to explain and explore logistic regression
and to walk through the steps of a logistic regression analysis.