Data wrangling is the process of cleaning, converting, and organizing raw data into appropriate format for analysis. It is also known as data munging or data preparation. The quality and structure of the data have a large impact on the accuracy and dependability of any subsequent studies or machine learning models, therefore this phase is critical in the data analysis workflow. 

The following are the major steps in data wrangling: 

Load the necessary packages:

library(dplyr) # for data manipulation library(tidyr) # for data tidying library(stringr) # for string operations
Import the data into R
data = read.csv("data.csv") # replace "data.csv" with your file path
Explore the data
head(data) # view the first few rows of the data str(data) # check the structure of the data (data types, columns) summary(data) # get summary statistics of the data
Handle missing values:
data = na.omit(data) # remove rows with missing values # or data = data[complete.cases(data), ] # remove rows with missing values # or data$column[is.na(data$column)] = mean(data$column, na.rm = TRUE) # impute missing values with the mean
Remove duplicate rows:
data = distinct(data) # remove duplicate rows
Clean and transform:
data$column = as.numeric(data$column) # convert a column to numeric type data$column = as.Date(data$column, format = "%Y-%m-%d") # convert a column to date type data$column = tolower(data$column) # convert a column to lowercase data$column = str_trim(data$column) # remove leading and trailing whitespaces
Handle outliers:
data = data %>% filter(column >= lower_limit & column <= upper_limit) # remove outliers based on a specified range
Handle inconsistent values:
data$column[data$column == "incorrect_value"] = NA # replace incorrect values with NA
Save the cleaned data:
write.csv(data, "cleaned_data.csv", row.names = FALSE) # replace "cleaned_data.csv" with your desired file name
Handle consistent categorical variables:
data$column = recode(data$column, "Incorrect" = "Correct") # correct inconsistent categorical values
Correct data types:
data$column = as.factor(data$column) # convert a column to a factor
Handling inconsistent string formats:
# Calculate the z-scores for a column
data$z_score <- (data$column - mean(data$column)) / sd(data$column) # Remove outliers based on z-scores data <- data[data$z_score >= -3 & data$z_score <= 3, ]
Perform additional data validation checks:
# Check for logical inconsistencies in the data inconsistent_data = data[data$column1 > data$column2, ] # Check for missing values in specific columns missing_values = data[is.na(data$column), ]
Reorder and rename variable:
data = data %>% select(column3, column1, column2) # reorder columns data <- data %>% rename(NewColumnName = OldColumnName) # rename a column<- code="" data="">
Validate cleaned data:
summary(data) # verify the summary statistics of the cleaned data
Merge or join datasets.
merged_data = merge(data1, data2, by = "common_column") # merge two datasets based on a common column
Handle data truncation or rounding: 
data$column = ifelse(data$column %in% c("Val1", "Val2"), data$column, "Other") # replace inconsistent values with a default value
Aggregation or summarization:
aggregated_data = data %>% group_by(column1) %>% summarise(mean_column2 = mean(column2), max_column3 = max(column3)) # calculate mean and maximum values for specific groups
Handle categorical variables with many levels:
data = data %>% mutate(column = fct_lump(column, n = 5, other_level = "Other")) # lump infrequent levels into an "Other" category
Convert data between wide and long formats:
# Convert from wide to long format using tidyr data_long = data %>% pivot_longer(cols = starts_with("column"), names_to = "variable", values_to = "value") # Convert from long to wide format using tidyr data_wide = data_long %>% pivot_wider(names_from = "variable", values_from = "value")
Validate cleaned data integrity
stopifnot(all(is.finite(data))) # check if all values in the dataset are finite
Data wrangling is a repetitive as well as time-consuming procedure that involves an in-depth understanding of the analytic or specific requirements of modelling tasks. The objective is to create a clean, well-structured dataset that optimizes the insights that may be extracted from it.

Download the Data Wrangling cheatsheet - "Data Wrangling in R Cheatsheet"
Previous Post Next Post

Translate

AKSTATS

Learn it 🧾 --> Do it 🖋 --> Get it 🏹📉📊