Logistic regression is a statistical method for modelling the relationship
between a binary dependent variable and one or more independent variables.
The dependent variable indicates the presence or absence of a specific trait
or event and takes only two values, typically 0 or 1. The main aim of
logistic regression is to estimate the probability that the dependent
variable equals 1 given the values of the independent variables. It does so
by passing a linear combination of the independent variables through the
logistic function, which maps any real number to a value between 0 and 1.
Because of its S-shaped curve, the logistic function is also known as a
sigmoid function.
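Written out, the model takes the following form. The symbols are notational assumptions introduced here: k independent variables x_1, ..., x_k, an intercept beta_0, and coefficients beta_1, ..., beta_k.

```latex
\[
  P(Y = 1 \mid x_1, \dots, x_k) = \sigma(z) = \frac{1}{1 + e^{-z}},
  \qquad
  z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]
```

Since sigma(z) approaches 0 for very negative z, equals 0.5 at z = 0, and approaches 1 for very positive z, the output can always be read as a probability.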
The parameters of the logistic regression model are estimated by maximum
likelihood, which finds the set of parameters that maximises the likelihood
of the observed data. Once estimated, the parameters can be used to predict
the probability that the response variable equals 1 for new observations,
based on the values of their independent variables.
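As a rough illustration of the maximum likelihood step, the sketch below simulates data from a logistic model and estimates the parameters twice: once by minimising the negative log-likelihood directly with optim(), and once with glm(). The simulated data, the values in true_beta, and the helper neg_log_lik are assumptions made for this example and are not part of the Titanic analysis.

```r
# Simulate data from a known logistic model (all values are illustrative).
set.seed(42)
n <- 500
x <- rnorm(n)
true_beta <- c(-0.5, 1.2)                       # hypothetical intercept and slope
p <- plogis(true_beta[1] + true_beta[2] * x)    # logistic (sigmoid) function
y <- rbinom(n, size = 1, prob = p)

# Negative log-likelihood of a Bernoulli model with a logit link.
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x                  # linear predictor
  -sum(y * eta - log1p(exp(eta)))
}

# Minimising the negative log-likelihood maximises the likelihood.
fit_optim <- optim(c(0, 0), neg_log_lik, method = "BFGS")

# glm() fits the same model by iteratively reweighted least squares.
fit_glm <- glm(y ~ x, family = binomial)

# The two sets of estimates should agree closely.
print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))
```

Both approaches target the same likelihood, so the two rows of estimates should be almost identical. Both should also be near the values in true_beta, up to sampling noise.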
Logistic regression is used in many fields, including finance, marketing,
medicine, and the social sciences.
- In finance, it may be used to estimate the probability of loan default.
- In marketing, it may be used to predict whether a customer will respond to a campaign.
- In medicine, it may be used to predict the presence or absence of a disease from multiple patient characteristics.
Overall, Logistic Regression is a useful tool for predicting binary outcomes
that may be applied in several situations.
Let's work through an example in R using the "Titanic" dataset, which is
commonly used for binary classification tasks:
Load the required packages:
library(caTools)
library(ggplot2)
library(pROC)
library(dplyr)
library(titanic)
Load the dataset:
data(titanic_train)
Preprocess the data:
titanic <- titanic_train %>%
  select(Survived, Pclass, Sex, Age, SibSp) %>%
  na.omit()
Split the dataset into training and test sets:
set.seed(123)
split <- sample.split(titanic$Survived, SplitRatio = 0.7)
train <- subset(titanic, split == TRUE)
test <- subset(titanic, split == FALSE)
Fitting the Logistic Regression model
logistic <- glm(Survived ~ ., data = train, family = binomial)
Make predictions on the test dataset
prob <- predict(logistic, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
Calculate accuracy measures using the confusion matrix, the metrics derived from it, and the ROC curve:
cm <- table(Predicted = pred, Actual = test$Survived) # confusion matrix
print(cm)
# accuracy measures from the confusion matrix (rows = predicted, columns = actual)
accuracy <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm) # correct predictions / all predictions of that class
recall <- diag(cm) / colSums(cm) # correct predictions / all actual members of that class
f1 <- 2 * (precision * recall) / (precision + recall)
# one row per class (0 and 1); the overall accuracy is repeated on each row
metrics <- data.frame(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
print(metrics)
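To make the confusion-matrix measures concrete, here is a small worked example on ten made-up observations. The vectors actual_demo and pred_demo are hypothetical and serve only to show how each measure follows from the four cells of the matrix.

```r
# Hypothetical labels, for illustration only (not Titanic data).
actual_demo <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred_demo   <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Rows are predictions, columns are the true classes.
cm_demo <- table(Predicted = pred_demo, Actual = actual_demo)

tp <- cm_demo["1", "1"]  # predicted 1, actually 1
fp <- cm_demo["1", "0"]  # predicted 1, actually 0
fn <- cm_demo["0", "1"]  # predicted 0, actually 1
tn <- cm_demo["0", "0"]  # predicted 0, actually 0

accuracy_demo  <- (tp + tn) / sum(cm_demo)  # share of all predictions that are correct
precision_demo <- tp / (tp + fp)            # share of predicted positives that are real
recall_demo    <- tp / (tp + fn)            # share of real positives that were found
f1_demo <- 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(c(accuracy = accuracy_demo, precision = precision_demo,
        recall = recall_demo, f1 = f1_demo))
```

Here the counts are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives. That gives an accuracy of 0.8, with precision, recall, and F1 all equal to 0.75.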
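# Generate ROC curve
roc_obj <- roc(test$Survived, prob) # avoid masking pROC's roc() function
auc(roc_obj) # area under the ROC curve
ci.auc(roc_obj) # 95% confidence interval for the AUC
plot(roc_obj, col = "blue", print.thres = "best", legacy.axes = TRUE)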
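The area under the ROC curve (AUC) can be read as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The base R sketch below computes it that way on a handful of made-up scores. The vectors scores and labels are hypothetical, and auc_manual is a name chosen for this example.

```r
# Hypothetical predicted probabilities and true classes, for illustration only.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

pos <- scores[labels == 1]  # scores of the positive cases
neg <- scores[labels == 0]  # scores of the negative cases

# Compare every positive with every negative; ties count as half a win.
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_manual <- mean(wins)

print(auc_manual)  # 13 of the 16 pairs are ranked correctly: 0.8125
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, so values close to 0.5 indicate little discriminatory power.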
Conclusion:
The ROC analysis yields the figures 0.487 and 0.747-0.874. These cannot be
an AUC and its 95% confidence interval, because an estimate cannot lie
outside its own interval. With print.thres = "best", pROC labels the optimal
cutoff on the plot as threshold (specificity, sensitivity), so the figures
most likely describe a cutoff of 0.487 with a specificity of 0.747 and a
sensitivity of 0.874. Read that way, the model separates survivors from
non-survivors reasonably well rather than poorly. The AUC itself, which
pROC's auc() function returns, is the number to quote when judging overall
discriminatory power. Here we used the Titanic data only as an example to
explore logistic regression and to understand the steps of the analysis.