Data wrangling is the process of cleaning, converting, and organizing raw
data into an appropriate format for analysis. It is also known as data munging
or data preparation. Because the quality and structure of the data largely
determine the accuracy and reliability of any subsequent analyses or machine
learning models, this phase is a critical part of the data analysis workflow.
The following are the major steps in data wrangling:
Load the necessary packages:
library(dplyr)   # for data manipulation
library(tidyr)   # for data tidying
library(stringr) # for string operations
Import the data into R:
data = read.csv("data.csv") # replace "data.csv" with your file path
Explore the data:
head(data)    # view the first few rows of the data
str(data)     # check the structure of the data (data types, columns)
summary(data) # get summary statistics of the data
Handle missing values:
data = na.omit(data) # remove rows with missing values
# or
data = data[complete.cases(data), ] # remove rows with missing values
# or
data$column[is.na(data$column)] = mean(data$column, na.rm = TRUE) # impute missing values with the mean
Remove duplicate rows:
data = distinct(data) # remove duplicate rows
Clean and transform:
data$column = as.numeric(data$column) # convert a column to numeric type
data$column = as.Date(data$column, format = "%Y-%m-%d") # convert a column to date type
data$column = tolower(data$column) # convert a column to lowercase
data$column = str_trim(data$column) # remove leading and trailing whitespace
Handle outliers:
data = data %>% filter(column >= lower_limit & column <= upper_limit) # remove outliers based on a specified range
# or remove them based on z-scores
data$z_score = (data$column - mean(data$column)) / sd(data$column) # calculate the z-scores for a column
data = data[data$z_score >= -3 & data$z_score <= 3, ] # keep only values within three standard deviations of the mean
Handle inconsistent values:
data$column[data$column == "incorrect_value"] = NA # replace incorrect values with NA
data$column = ifelse(data$column %in% c("Val1", "Val2"), data$column, "Other") # replace inconsistent values with a default value
Save the cleaned data:
write.csv(data, "cleaned_data.csv", row.names = FALSE) # replace "cleaned_data.csv" with your desired file name
Handle inconsistent categorical variables:
data$column = recode(data$column, "Incorrect" = "Correct") # correct inconsistent categorical values
Correct data types:
data$column = as.factor(data$column) # convert a column to a factor
Handle inconsistent string formats:
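Where a text column mixes formats, string operations can standardize it before analysis. A minimal sketch, assuming a hypothetical identifier column named code that varies only in separators, letter case, and spacing (the column name is illustrative):
data$code = str_replace_all(data$code, "/", "-") # unify the separator character
data$code = str_to_upper(data$code) # enforce a single letter case
data$code = str_squish(data$code) # collapse repeated internal whitespace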
Perform additional data validation checks:
# check for logical inconsistencies in the data
inconsistent_data = data[data$column1 > data$column2, ]
# check for missing values in specific columns
missing_values = data[is.na(data$column), ]
Reorder and rename variables:
data = data %>% select(column3, column1, column2) # reorder columns
data = data %>% rename(NewColumnName = OldColumnName) # rename a column
Validate cleaned data:
summary(data) # verify the summary statistics of the cleaned data
Merge or join datasets:
merged_data = merge(data1, data2, by = "common_column") # merge two datasets based on a common column
Handle data truncation or rounding:
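If numeric precision varies across sources, rounding or truncating to a fixed number of decimal places keeps values comparable. A minimal sketch, assuming a hypothetical numeric column named amount:
data$amount = round(data$amount, 2) # round to two decimal places
# or
data$amount = trunc(data$amount * 100) / 100 # truncate to two decimal places without rounding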
Aggregation or summarization:
aggregated_data = data %>% group_by(column1) %>% summarise(mean_column2 = mean(column2), max_column3 = max(column3)) # calculate mean and maximum values for specific groups
Handle categorical variables with many levels:
data = data %>% mutate(column = fct_lump(column, n = 5, other_level = "Other")) # lump infrequent levels into an "Other" category (fct_lump() requires the forcats package)
Convert data between wide and long formats:
data_long = data %>% pivot_longer(cols = starts_with("column"), names_to = "variable", values_to = "value") # convert from wide to long format using tidyr
data_wide = data_long %>% pivot_wider(names_from = "variable", values_from = "value") # convert from long to wide format using tidyr
Validate cleaned data integrity:
stopifnot(all(sapply(data, function(col) !is.numeric(col) || all(is.finite(col))))) # check that every numeric column contains only finite values
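Taken together, many of these steps can be chained with the pipe operator into a single pipeline. The following is a minimal sketch, assuming a hypothetical file sales.csv with columns region, amount, and date (the file and column names are illustrative only):
data = read.csv("sales.csv") # hypothetical input file
cleaned = data %>%
  distinct() %>%                                        # remove duplicate rows
  filter(!is.na(amount)) %>%                            # drop rows with missing amounts
  mutate(region = str_trim(tolower(region)),            # standardize the grouping column
         date = as.Date(date, format = "%Y-%m-%d")) %>% # correct the date type
  group_by(region) %>%
  summarise(total_amount = sum(amount))                 # summarize by group
write.csv(cleaned, "cleaned_sales.csv", row.names = FALSE) # save the cleaned result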
Data wrangling is an iterative and often time-consuming process that requires a thorough
understanding of the requirements of the analysis or modelling task. The objective is to produce
a clean, well-structured dataset that maximizes the insights that can be extracted from it.