Data wrangling is the process of cleaning, converting, and organizing raw
data into an appropriate format for analysis. It is also known as data munging
or data preparation. Because the quality and structure of the data largely
determine the accuracy and reliability of any subsequent analyses or machine
learning models, this phase is a critical part of the data analysis workflow.
The following are the major steps in data wrangling:
Load the necessary packages:
library(dplyr)   # for data manipulation
library(tidyr)   # for data tidying
library(stringr) # for string operations
Import the data into R:
data = read.csv("data.csv") # replace "data.csv" with your file path
Explore the data:
head(data)    # view the first few rows of the data
str(data)     # check the structure of the data (data types, columns)
summary(data) # get summary statistics of the data
Handle missing values:
data = na.omit(data) # remove rows with missing values
# or
data = data[complete.cases(data), ] # remove rows with missing values
# or
data$column[is.na(data$column)] = mean(data$column, na.rm = TRUE) # impute missing values with the mean
Remove duplicate rows:
data = distinct(data) # remove duplicate rows
Clean and transform:
data$column = as.numeric(data$column) # convert a column to numeric type
data$column = as.Date(data$column, format = "%Y-%m-%d") # convert a column to date type
data$column = tolower(data$column) # convert a column to lowercase
data$column = str_trim(data$column) # remove leading and trailing whitespace
Handle outliers:
data = data %>% filter(column >= lower_limit & column <= upper_limit) # remove outliers based on a specified range
# or remove them based on z-scores
data$z_score = (data$column - mean(data$column)) / sd(data$column) # calculate the z-scores for a column
data = data[data$z_score >= -3 & data$z_score <= 3, ] # keep only values within three standard deviations of the mean
Handle inconsistent values:
data$column[data$column == "incorrect_value"] = NA # replace incorrect values with NA
data$column = ifelse(data$column %in% c("Val1", "Val2"), data$column, "Other") # replace inconsistent values with a default value
Save the cleaned data:
write.csv(data, "cleaned_data.csv", row.names = FALSE) # replace "cleaned_data.csv" with your desired file name
Handle inconsistent categorical variables:
data$column = recode(data$column, "Incorrect" = "Correct") # correct inconsistent categorical values
Correct data types:
data$column = as.factor(data$column) # convert a column to a factor
Handle inconsistent string formats:
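Where a text column mixes formats, string operations can standardize it before analysis. A minimal sketch, assuming a hypothetical identifier column named code that varies only in separators, letter case, and spacing (the column name is illustrative):
data$code = str_replace_all(data$code, "/", "-") # unify the separator character
data$code = str_to_upper(data$code) # enforce a single letter case
data$code = str_squish(data$code) # collapse repeated internal whitespace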
Perform additional data validation checks:
# check for logical inconsistencies in the data
inconsistent_data = data[data$column1 > data$column2, ]
# check for missing values in specific columns
missing_values = data[is.na(data$column), ]
Reorder and rename variables:
data = data %>% select(column3, column1, column2) # reorder columns
data = data %>% rename(NewColumnName = OldColumnName) # rename a column
Validate cleaned data:
summary(data) # verify the summary statistics of the cleaned data
Merge or join datasets:
merged_data = merge(data1, data2, by = "common_column") # merge two datasets based on a common column
Handle data truncation or rounding:
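If numeric precision varies across sources, rounding or truncating to a fixed number of decimal places keeps values comparable. A minimal sketch, assuming a hypothetical numeric column named amount:
data$amount = round(data$amount, 2) # round to two decimal places
# or
data$amount = trunc(data$amount * 100) / 100 # truncate to two decimal places without rounding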
Aggregation or summarization:
aggregated_data = data %>% group_by(column1) %>% summarise(mean_column2 = mean(column2), max_column3 = max(column3)) # calculate mean and maximum values for specific groups
Handle categorical variables with many levels:
data = data %>% mutate(column = fct_lump(column, n = 5, other_level = "Other")) # lump infrequent levels into an "Other" category (fct_lump() requires the forcats package)
Convert data between wide and long formats:
data_long = data %>% pivot_longer(cols = starts_with("column"), names_to = "variable", values_to = "value") # convert from wide to long format using tidyr
data_wide = data_long %>% pivot_wider(names_from = "variable", values_from = "value") # convert from long to wide format using tidyr
Validate cleaned data integrity:
stopifnot(all(sapply(data, function(col) !is.numeric(col) || all(is.finite(col))))) # check that every numeric column contains only finite values
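Taken together, many of these steps can be chained with the pipe operator into a single pipeline. The following is a minimal sketch, assuming a hypothetical file sales.csv with columns region, amount, and date (the file and column names are illustrative only):
data = read.csv("sales.csv") # hypothetical input file
cleaned = data %>%
  distinct() %>%                                        # remove duplicate rows
  filter(!is.na(amount)) %>%                            # drop rows with missing amounts
  mutate(region = str_trim(tolower(region)),            # standardize the grouping column
         date = as.Date(date, format = "%Y-%m-%d")) %>% # correct the date type
  group_by(region) %>%
  summarise(total_amount = sum(amount))                 # summarize by group
write.csv(cleaned, "cleaned_sales.csv", row.names = FALSE) # save the cleaned result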
Data wrangling is an iterative and often time-consuming process that requires a thorough
understanding of the requirements of the analysis or modelling task. The objective is to produce
a clean, well-structured dataset that maximizes the insights that can be extracted from it.