Data Collection: Gather raw data from various sources, including databases, APIs, files (e.g., CSV, Excel, JSON), and web scraping; a short sketch for database and API sources follows the file-loading examples below.
Be clear and cautious about the data format before you import it into the environment, because most file-format problems only surface as runtime errors. Load the data into a data frame:
import pandas as pd # Importing Pandas library as alias "pd".
data = pd.read_csv('data.csv') # for .csv
print(data)
df = pd.read_excel("sample.xlsx") # for .xlsx
print(df)
import json as js
f = open('data.json') # open the JSON file
dataset = js.load(f) # parse the file into a Python dictionary
for j in dataset['emps_details']: # iterate through the records stored under 'emps_details'
    print(j) # print each record
f.close() # close the file
# Don't forget to replace 'data' with the exact path and format of your file.
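Data can also be read directly from a database or a web API into a data frame. The lines below are a minimal sketch; the database file, table name, and API endpoint are hypothetical placeholders you would replace with your own.
import sqlite3
import requests
import pandas as pd
conn = sqlite3.connect('company.db') # connect to a (hypothetical) SQLite database file
df_db = pd.read_sql_query('SELECT * FROM employees', conn) # load a table into a data frame
conn.close() # close the database connection
response = requests.get('https://example.com/api/employees') # call a (hypothetical) REST endpoint
df_api = pd.json_normalize(response.json()) # flatten the returned JSON records into a data frame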
Data Inspection: Perform preliminary exploratory data analysis (EDA) to understand the raw data's structure, content, and quality. This helps identify potential issues and plan the data cleaning and transformation steps. Inspect or explore the data using the code below:
data.head() # view the first few rows of the data
data.info() # check the data types and column
data.describe() # Get summary statistics of the data
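A few further checks are often worth running at this stage (the column name is a placeholder for one of your own fields):
data.shape # number of rows and columns
data.isnull().sum() # count of missing values in each column
data['column'].value_counts() # frequency of each value in a single column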
Handle missing values:
data = data.dropna() # remove rows with missing values
# or
data = data.dropna(subset=['column1', 'column2']) # remove rows with missing values in specific columns
Handle inconsistent or incorrect values:
import numpy as np
data.loc[data['column'] == 'incorrect_value', 'column'] = np.nan # replace incorrect values with NaN
data['column'] = data['column'].fillna(data['column'].mean()) # impute missing values with the mean
Reformat and reshape the data if necessary:
data = data.melt(id_vars=['id'], value_vars=['column1', 'column2'], var_name='variable', value_name='value') # convert from wide to long format
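As a small illustration with made-up values, melting a wide table with one row per id yields one row per combination of id and variable:
wide = pd.DataFrame({'id': [1, 2], 'column1': [10, 30], 'column2': [20, 40]})
long = wide.melt(id_vars=['id'], value_vars=['column1', 'column2'], var_name='variable', value_name='value')
print(long) # four rows: one for each combination of id and column1/column2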
Handle inconsistent categorical variables:
data['column'] = data['column'].replace({'Incorrect': 'Correct'}) # correct inconsistent categorical values
Correct data types:
data['column'] = data['column'].astype('category') # convert a column to a categorical type
Perform additional data validation checks:
inconsistent_data = data[data['column1'] > data['column2']] # check for logical inconsistencies in the data
missing_values = data[data['column'].isnull()] # check for missing values in a specific column
Handling Duplicate Values: To maintain data integrity and correctness, remove duplicate records.
data = data.drop_duplicates() # remove duplicate rows
Outlier Detection and Handling: Identify and address outliers that may distort the outcomes of the study or modelling.
lower_limit = data['column'].quantile(0.05)
upper_limit = data['column'].quantile(0.95)
data = data[(data['column'] >= lower_limit) & (data['column'] <= upper_limit)] # remove outliers outside the 5th-95th percentile range
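An alternative convention, not used in the snippet above but common in practice, flags values more than 1.5 interquartile ranges beyond the quartiles:
q1 = data['column'].quantile(0.25)
q3 = data['column'].quantile(0.75)
iqr = q3 - q1 # interquartile range
data = data[(data['column'] >= q1 - 1.5 * iqr) & (data['column'] <= q3 + 1.5 * iqr)] # keep values inside the IQR fences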
Transformation of Data: Convert data types (for example, between numeric and categorical), convert uppercase to lowercase, and remove extra spaces; these conversions produce a proper dataset for further analysis.
Data Formatting: Maintain consistent data formatting and organization for better analysis and visualization.
data['column'] = pd.to_numeric(data['column'], errors='coerce') # convert a column to numeric type
data['column'] = pd.to_datetime(data['column'], format='%Y-%m-%d') # convert a column to date type
data['column'] = data['column'].str.lower() # convert a column to lowercase
data['column'] = data['column'].str.strip() # remove leading and trailing whitespace
Normalization and Standardization: To enable fair comparisons and improved model performance, scale numerical features to a common scale.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler() # create a standard scaler object
data['column'] = scaler.fit_transform(data[['column']]) # standardize a column
# or
scaler = MinMaxScaler() # create a min-max scaler object
data['column'] = scaler.fit_transform(data[['column']]) # normalize a column
Encoding Categorical Variables: For model training, convert categorical variables into numerical representations.
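Two common options are one-hot encoding with pandas or label encoding with scikit-learn; the sketch below keeps the placeholder column name used elsewhere in this section:
data = pd.get_dummies(data, columns=['column']) # one-hot encode a categorical column into 0/1 indicator columns
# or
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder() # create a label encoder object
data['column'] = encoder.fit_transform(data['column']) # replace categories with integer codes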
Handle skewed or non-normal distributions:
import numpy as np
data['column'] = np.log(data['column'] + 1) # apply log transformation to address skewness
Other important data wrangling operations are listed below.
Save the cleaned data:
data.to_csv('cleaned_data.csv', index=False) # replace 'cleaned_data.csv' with your desired file name
Handle inconsistent string formats:
data['column'] = data['column'].str.replace(r'\s+', ' ', regex=True) # collapse multiple whitespaces within strings
data['column'] = data['column'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True) # remove non-alphanumeric characters from strings
Impute missing values with the mean:
data['column'] = data['column'].fillna(data['column'].mean()) # impute missing values with the mean
Reorder and rename variables:
data = data[['column3', 'column1', 'column2']] # reorder columns
data.rename(columns={'OldColumnName': 'NewColumnName'}, inplace=True) # rename a column
Remove unnecessary variables:
data = data.drop(['column3', 'column4'], axis=1) # remove specific columns from the dataset
Merge or join datasets:
merged_data = data1.merge(data2, on='common_column') # merge two datasets based on a common column
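For example, with two small made-up tables sharing a key column, the join type controls which rows survive:
data1 = pd.DataFrame({'common_column': [1, 2, 3], 'name': ['A', 'B', 'C']})
data2 = pd.DataFrame({'common_column': [1, 2, 4], 'score': [90, 80, 70]})
inner_merge = data1.merge(data2, on='common_column') # keeps only keys 1 and 2, present in both tables
left_merge = data1.merge(data2, on='common_column', how='left') # keeps all rows of data1, with NaN where data2 has no match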
Handle inconsistent or misspelled values:
data.loc[data['column'].isin(['Incorrect', 'Misspelled']), 'column'] = 'Correct' # replace inconsistent values with a correct value
Handle data truncation or rounding:
data['column'] = data['column'].round(2) # round numerical values to a specific number of decimal places
Perform data aggregation or summarization:
aggregated_data = data.groupby('category')['column'].agg(['mean', 'max']) # calculate the mean and maximum values for each category
Perform data type conversions and factor re-leveling:
data['column'] = data['column'].astype('int') # convert a column to integer type
data['column'] = pd.Categorical(data['column'], categories=['Level 1', 'Level 2']) # redefine categorical levels
Data Validation: Ensure that the processed data fulfils the specified requirements and is ready for the analytical step.
data.info() # verify the data types and column information of the cleaned data
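Beyond data.info(), simple assertion checks can confirm that the cleaned data meets your expectations; the column names below are placeholders for your own fields:
assert data['id'].is_unique # no duplicate identifiers
assert data['column'].notnull().all() # no missing values remain in a key column
assert (data['column'] >= 0).all() # values fall within the expected range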
Data wrangling is an iterative and often time-consuming procedure that requires an in-depth understanding of the specific requirements of the analysis or modelling task. The objective is to create a clean, well-structured dataset that maximizes the insights that can be extracted from it.