Data cleaning in R

Data cleaning is the process of fixing or removing erroneous, incorrectly formatted, duplicated, or missing data from a dataset. Data can become duplicated or mislabeled in a variety of ways, for example when separate data sources or datasets are merged. Even if the software processes the data without complaint, the results and the methods applied to them cannot be trusted when the underlying data is wrong. There is no universal approach for describing the various stages of data cleaning. The aspects of data cleaning are explained in the separate Data Cleaning article, and you can download the .csv file here.

First, the dataset is imported using the readr package in R.


library(readr)
data = read_csv("test_measurements.csv")

When we import a .csv or .xlsx file into R, the result is stored as a data frame (read_csv() returns a tibble, which is a kind of data frame). Next, we check the dimensions of the data frame and the summary of the data. The summary gives a clear view of the number of NA (missing) values in the corresponding columns.


dim(data)
summary(data)

To find the total number of missing values, and the number of missing values in each column:


total = sum(is.na(data))
print(total)
colSums(is.na(data))
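As a small self-contained illustration of these two counts, the sketch below uses a made-up data frame (`toy`, with hypothetical columns `a` and `b`) instead of the article's dataset, so it can be run without the .csv file.

```r
# Toy data frame (hypothetical, not the article's dataset)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# is.na() returns a logical matrix; summing it counts every missing cell
total_missing <- sum(is.na(toy))

# colSums() on the same matrix gives the number of missing values per column
missing_by_col <- colSums(is.na(toy))

print(total_missing)
print(missing_by_col)
```

Here the total is 3, made up of one missing value in `a` and two in `b`.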

The missing values in each affected column are replaced with that column's median using the code below:


New_df = data[,2:12]

New_df$Presentation = ifelse(is.na(New_df$Presentation),
                             median(New_df$Presentation, na.rm = TRUE),
                             New_df$Presentation)

New_df$`Influencing and Convincing` = ifelse(is.na(New_df$`Influencing and Convincing`),
                                             median(New_df$`Influencing and Convincing`, na.rm = TRUE),
                                             New_df$`Influencing and Convincing`)

New_df$`Stress Tolerance` = ifelse(is.na(New_df$`Stress Tolerance`),
                                   median(New_df$`Stress Tolerance`, na.rm = TRUE),
                                   New_df$`Stress Tolerance`)

New_df$`Achievement Orientation` = ifelse(is.na(New_df$`Achievement Orientation`),
                                          median(New_df$`Achievement Orientation`, na.rm = TRUE),
                                          New_df$`Achievement Orientation`)
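The same imputation can also be written once and applied to several columns. The following is only a sketch of that idea: the helper name `impute_median` and the toy data frame are assumptions made for illustration, and only the column names are borrowed from the code above.

```r
# Hypothetical helper: replace NA in a numeric vector with that vector's median
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Toy data (made up); check.names = FALSE keeps the space in the column name
toy <- data.frame(
  Presentation = c(4, NA, 6, 8),
  `Stress Tolerance` = c(NA, 2, 4, 9),
  check.names = FALSE
)

# Apply the helper to each listed column and write the results back
cols <- c("Presentation", "Stress Tolerance")
toy[cols] <- lapply(toy[cols], impute_median)

print(toy)
```

Writing the rule as a function keeps the column list in one place, which makes it easier to add or remove columns later.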

Again, we check for missing values in the data frame.


total = sum(is.na(New_df))
print(total)
summary(New_df)

To check for outliers, we draw boxplots of the columns and then replace the values flagged as outliers with NA:

boxplot(New_df)

# read_csv() keeps the original column names, so they contain spaces rather than dots
col = c('Presentation', 'Influencing and Convincing', 'Stress Tolerance', 'Achievement Orientation')

boxplot(New_df[, col])

for (x in col)
{
  # [[x]] extracts the column as a plain vector (New_df[, x] would stay a one-column tibble)
  value = boxplot.stats(New_df[[x]])$out
  # Set every value flagged as an outlier to NA
  New_df[[x]][New_df[[x]] %in% value] = NA
}
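To make the outlier rule concrete: with its default `coef = 1.5`, `boxplot.stats()` flags values that lie more than 1.5 times the length of the box (the spread between the hinges, roughly the interquartile range) beyond either end of the box. The sketch below applies it to a made-up vector `scores`, which is an assumption for illustration and not part of the article's data.

```r
# Made-up scores with one obviously extreme value
scores <- c(10, 12, 11, 13, 12, 11, 50)

# $stats holds the five-number summary used to draw the boxplot
stats <- boxplot.stats(scores)$stats

# $out holds the values that fall beyond the whiskers
out <- boxplot.stats(scores)$out

# Replace the flagged values with NA, as in the loop above
cleaned <- scores
cleaned[cleaned %in% out] <- NA

print(out)
print(cleaned)
```

Only the value 50 is flagged here, so `cleaned` ends with a single NA in the last position.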

Checking whether the outliers in the above-defined columns have been replaced by NA:


as.data.frame(colSums(is.na(New_df)))

In some cases, NA values can reduce the accuracy of later analysis, so we have to remove them. The rows containing NA values are removed with this code:

library(tidyr)
New_df = drop_na(New_df)
as.data.frame(colSums(is.na(New_df)))
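For comparison, the same row-wise removal can be sketched in base R with `complete.cases()`, which avoids the tidyr dependency. The toy data frame below is made up for illustration. Note that a whole row is dropped even when only one of its values is missing, so it is worth comparing the row count before and after.

```r
# Toy data frame (hypothetical): rows 2 and 3 each contain one NA
toy <- data.frame(a = c(1, NA, 3, 4), b = c(5, 6, NA, 8))

rows_before <- nrow(toy)

# complete.cases() is TRUE only for rows without any missing value
kept <- toy[complete.cases(toy), ]

rows_after <- nrow(kept)

print(c(before = rows_before, after = rows_after))
```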

To view the complete source code (R file):
