Data wrangling is the process of cleaning, converting, and organizing raw data into an appropriate format for analysis. It is also known as data munging or data preparation. Because the quality and structure of the data have a large impact on the accuracy and reliability of any subsequent analysis or machine learning model, this phase is a critical part of the data analysis workflow.
The following are the major steps in data wrangling: 

Data Collection: Gather raw data from various sources, including databases, APIs, files (e.g., CSV, Excel, JSON), and web scraping.

Be clear about the file format before you import the data into your environment, since most file-format mistakes only surface as runtime errors. Load the data into a DataFrame:

import pandas as pd          # Importing Pandas library as alias "pd".
data = pd.read_csv('data.csv')   # for .csv 
print(data)
  
df = pd.read_excel("sample.xlsx")  # for .xlsx
print(df)

import json as js
with open('data.json') as f:            # Opening JSON file (closed automatically by the with block)
    dataset = js.load(f)                # Parsing the JSON into a Python object
for j in dataset['emps_details']:       # Iterating through the list stored under 'emps_details'
    print(j)

# Don't forget to replace 'data' with your exact file path and file format.
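Files are not the only source: when the data comes from an API, the JSON response can often be loaded straight into a DataFrame. A minimal sketch, assuming a hypothetical endpoint that returns a JSON array of records:

import requests
import pandas as pd

response = requests.get('https://api.example.com/records')  # hypothetical endpoint URL
response.raise_for_status()               # stop early if the request failed
data = pd.DataFrame(response.json())      # works when the response is a JSON array of objects
print(data.head())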

Data Inspection: Perform preliminary exploratory data analysis (EDA) to understand the raw data's structure, content, and quality. This helps identify potential problems and plan the cleaning and transformation steps. Inspect or explore the data with the code below:

data.head()  # view the first few rows of the data
data.info()  # check the data types and column information
data.describe() # Get summary statistics of the data 
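A quick count of missing values per column is also worth running at this stage:

data.isnull().sum()  # count missing values in each column
data.shape           # number of (rows, columns) in the dataset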

Data cleaning: 

Handling Missing Data: Identify missing or null values in the dataset and either remove them or fill them using suitable imputation techniques.
data = data.dropna()  # remove rows with missing values
# or
data = data.dropna(subset=['column1', 'column2'])  # remove rows with missing values in specific columns
Handle inconsistent or incorrect values:
import numpy as np
data.loc[data['column'] == 'incorrect_value', 'column'] = np.nan  # replace incorrect values with NaN

data['column'] = data['column'].fillna(data['column'].mean())    # impute missing values with the mean
Reformat and reshape the data if necessary:
data = data.melt(id_vars=['id'], value_vars=['column1', 'column2'], var_name='variable', value_name='value')    # convert from wide to long format
Handle inconsistent categorical variables:
data['column'] = data['column'].replace({'Incorrect': 'Correct'})    # correct inconsistent categorical values
Correct data types:
data['column'] = data['column'].astype('category')    # convert a column to a categorical type
Perform additional data validation checks:
inconsistent_data = data[data['column1'] > data['column2']]    # check for logical inconsistencies in the data 
missing_values = data[data['column'].isnull()]    # check for missing values in a specific column

Handling Duplicate values: To maintain data integrity and correctness, remove duplicate records.

data = data.drop_duplicates()  # remove duplicate rows (pass subset=['column'] to check specific columns only)
Outlier Detection and Handling: Identify and address outliers that may distort the outcomes of the study or modelling.
lower_limit = data['column'].quantile(0.05)
upper_limit = data['column'].quantile(0.95)
data = data[(data['column'] >= lower_limit) & (data['column'] <= upper_limit)]  # keep values within the specified range
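A common alternative to fixed percentile cut-offs is the interquartile range (IQR) rule; a minimal sketch, again assuming a numeric 'column':

q1 = data['column'].quantile(0.25)
q3 = data['column'].quantile(0.75)
iqr = q3 - q1                                                    # interquartile range
mask = data['column'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)    # flag values inside the 1.5*IQR fences
data = data[mask]                                                # keep only the non-outlier rows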
Transformation of data: 

Converting data types (for example, between numeric and categorical), converting uppercase to lowercase, and removing stray whitespace all help prepare a consistent dataset for further analysis.
Data Formatting: Maintain consistent data formatting and organization for better analysis and visualization.

data['column'] = pd.to_numeric(data['column'], errors='coerce') # convert a column to numeric type
data['column'] = pd.to_datetime(data['column'], format='%Y-%m-%d') # convert a column to date type
data['column'] = data['column'].str.lower() # convert a column to lowercase
data['column'] = data['column'].str.strip() # Remove leading and trailing whitespaces

Normalization and standardization: To enable fair comparisons and improved model performance, scale numerical features to a common scale.

from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()  # create a standard scaler object
data['column'] = scaler.fit_transform(data[['column']])  # standardize a column
# or
scaler = MinMaxScaler()  # create a min-max scaler object
data['column'] = scaler.fit_transform(data[['column']])  # normalize a column

Encoding categorical variables: For model training, convert categorical variables into numerical representations, as sketched below.
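A minimal encoding sketch using pandas built-ins, assuming a hypothetical categorical 'column':

data = pd.get_dummies(data, columns=['column'], prefix='column')  # one-hot encode a categorical column
# or, for ordinal integer codes:
data['column_code'] = data['column'].astype('category').cat.codes  # map each category to an integer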
Handle skewed or non-normal distributions:
import numpy as np
data['column'] = np.log(data['column'] + 1)  # apply log transformation to address skewness
Other important data wrangling operations are listed below:
Save the cleaned data:
data.to_csv('cleaned_data.csv', index=False) # replace 'cleaned_data.csv' with your desired file name
Handle inconsistent string formats:
data['column'] = data['column'].str.replace(r'\s+', ' ', regex=True)  # collapse multiple whitespaces within strings
data['column'] = data['column'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)  # remove non-alphanumeric characters from strings
Reorder and rename variables:
data = data[['column3', 'column1', 'column2']]  # reorder columns
data.rename(columns={'OldColumnName': 'NewColumnName'}, inplace=True)  # rename a column
Remove unnecessary variables:
data = data.drop(['column3', 'column4'], axis=1) # remove specific columns from the dataset
Merge or join datasets:
merged_data = data1.merge(data2, on='common_column') # merge two datasets based on a common column
Handle inconsistent or misspelled values:
data.loc[data['column'].isin(['Incorrect', 'Misspelled']), 'column'] = 'Correct' # replace inconsistent values with a correct value
Handle data truncation or rounding:
data['column'] = data['column'].round(2) # round numerical values to a specific decimal place
Perform data aggregation or summarization:
aggregated_data = data.groupby('category')['column'].agg(['mean', 'max']) # calculate mean and maximum values for each category
Perform data type conversions and factor re-leveling:
data['column'] = data['column'].astype('int')  # convert a column to integer type
data['column'] = pd.Categorical(data['column'], categories=['Level 1', 'Level 2'])  # redefine categorical levels

Data Validation: Ensure that the processed data fulfils the specified requirements and is ready for the analytical step.

data.info() # verify the data types and column information of the cleaned data
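Beyond info(), simple assertions can make the requirements explicit; a sketch with hypothetical checks on made-up columns:

assert data['id'].is_unique, 'duplicate ids found'               # hypothetical uniqueness requirement
assert data['column'].notnull().all(), 'missing values remain'   # no nulls left in a key column
assert (data['value'] >= 0).all(), 'negative values present'     # hypothetical range check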
Data wrangling is an iterative and often time-consuming process that requires an in-depth understanding of the specific requirements of the analysis or modelling task. The objective is to create a clean, well-structured dataset that maximizes the insights that can be extracted from it.