Data cleaning is the practice of repairing or removing inaccurate, incorrectly formatted, duplicate, or missing data from a dataset. Data can be duplicated or mislabeled in several ways; this often happens when merging datasets from different sources. Without clean data, the results and methods are untrustworthy, even if the program runs without error. There is no single method that describes the exact phases of the data cleaning process for all data.

Handling missing values

The first and foremost step in data cleaning is handling missing values. This matters more than anything else in the cleaning process: if missing values are left untreated, accuracy may fluctuate, or statistical methods may not be applicable at all. Missing values can be handled in different ways depending on the nature of the data. Most often they are replaced with the mean or median, but in some cases the rows containing them are simply removed for better accuracy (though removal is generally not recommended, since it discards information).
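Median imputation, as described above, can be sketched in plain Python (the linked tutorial below uses R; here the helper name `impute_missing` and the use of `None` to mark missing entries are illustrative assumptions):

```python
from statistics import median

def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]

prices = [4, 6, None, 5, None, 7]
print(impute_missing(prices))  # missing entries become the median, 5.5
```

The median is often preferred over the mean here because it is less sensitive to extreme values already present in the data.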

Removing duplicate entries or unrelated observations

Remove any extraneous observations from your dataset, such as duplicates or irrelevant records. Duplicate observations are common during data collection and often arise when datasets are combined from different sources, so removing them is one of the most important aspects of data cleaning. Irrelevant observations are those that do not relate to the specific question you are trying to evaluate. For example, if you want to evaluate data from thousands of current consumers but your dataset contains observations from previous generations, you may eliminate those records. This improves analysis efficiency, removes distraction from your main purpose, and results in a more manageable, higher-performing dataset.
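Both steps can be combined in one pass, sketched here in Python under some assumptions of my own: records are hashable tuples, the first occurrence of a duplicate is kept, and relevance is expressed as a caller-supplied test (the year cutoff below is purely illustrative):

```python
def clean_records(records, relevant):
    """Drop exact duplicates (keeping the first occurrence) and drop
    rows that fail the relevance test."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec in seen or not relevant(rec):
            continue
        seen.add(rec)
        cleaned.append(rec)
    return cleaned

rows = [("anu", 2021), ("anu", 2021), ("ravi", 1998), ("mala", 2022)]
# keep only recent records, e.g. year >= 2020 (an assumed cutoff)
print(clean_records(rows, lambda r: r[1] >= 2020))
```

In practice the relevance test encodes the analysis question itself, so it should be agreed on before cleaning begins rather than improvised row by row.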

Outliers and inliers

Outliers are data points that differ greatly from the other observations in a dataset. Depending on its cause, an outlier can reduce both the accuracy and the efficiency of a model. For example, suppose we collect monthly pencil price data, and the average price usually lies between 2 and 8 rupees, but for one month it suddenly rises to 9.5 rupees. That month may appear as an outlier in the data.
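One common way to flag such points, not named in the post itself, is Tukey's interquartile-range (IQR) rule: anything outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] is treated as a candidate outlier. A minimal sketch, assuming this rule and illustrative pencil prices:

```python
from statistics import quantiles

def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# monthly pencil prices in rupees, with one suspicious month
monthly_prices = [4.0, 4.5, 5.0, 5.0, 5.5, 4.5, 5.0, 9.5]
print(iqr_outliers(monthly_prices))  # -> [9.5]
```

Whether a flagged point is deleted, capped, or kept should depend on its cause; a genuine price spike is information, while a data-entry slip is an error.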
 
An inlier is a data point that lies within the same range as the other observations but is nevertheless an error. Like an outlier, it should be removed to improve model accuracy, but it is much harder to detect precisely because it blends in. One rough screen is to examine the standard deviation: an unexpectedly high value suggests that erroneous points may be present in the data. Based on the nature of the dataset, the data cleaning process may vary from case to case.
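The standard-deviation screen mentioned above can be sketched as follows; note that the threshold is an assumption that must come from domain knowledge (here, how much pencil prices are expected to vary), not from the data itself:

```python
from statistics import stdev

def sd_screen(values, threshold):
    """Return the sample standard deviation and whether it exceeds a
    chosen threshold -- a rough signal that erroneous points may be
    hiding in the data. The threshold is a domain-knowledge assumption."""
    s = stdev(values)
    return s, s > threshold

# pencil prices in rupees; a spread much wider than expected raises a flag
s, flagged = sd_screen([4.0, 4.5, 5.0, 9.5, 5.0], threshold=1.0)
print(round(s, 2), flagged)  # -> 2.22 True
```

A high standard deviation only says that something in the data is unusual; identifying which point is the inlier still requires inspecting the records themselves, often against a related variable.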

The basic data cleaning process in the R language is explained here: Data cleaning in R
