What exactly is statistical modelling?

In data science, statistical modelling is the process of applying statistical analysis to datasets. A statistical model is a mathematical relationship between random and non-random variables. Applying statistical models to raw data lets data scientists produce comprehensible visualisations, discover correlations between variables, and generate predictions. Census data, public health data, and social media data are examples of typical datasets for statistical analysis.

Steps involved in model building:

Data collection

Data collection is the systematic process of acquiring and measuring information on variables of interest in order to answer specified research questions, test hypotheses, and assess outcomes. This is the most challenging phase. After finalizing the objective of the model and deciding on the dependent and independent variables, we need to collect all the relevant data from reliable sources and understand the relationship between the dependent variable and the independent variables.

Data cleaning

Data cleaning is the process of fixing or removing incorrect, incorrectly formatted, duplicated, or missing data from a dataset. When combining datasets from different sources, data can get duplicated or mislabeled in several ways. Even if the software processes the data without errors, the results are unreliable when the underlying data is wrong. No single set of cleaning steps applies to every dataset; the precise steps vary from one dataset to another.
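As a rough illustration of the ideas above, here is a minimal sketch in plain Python. The records, the field names (id, city, income) and the helper functions clean_record and clean_dataset are all made up for this example; real datasets will need their own rules.

```python
# Hypothetical records merged from two sources; field names are illustrative.
raw_records = [
    {"id": 1, "city": " Chennai ", "income": "52000"},
    {"id": 1, "city": "chennai", "income": "52000"},  # duplicate once normalised
    {"id": 2, "city": "Mumbai", "income": ""},         # missing income
    {"id": 3, "city": "DELHI", "income": "61,500"},    # badly formatted number
    {"id": 4, "city": "Pune", "income": "n/a"},        # not a number
]


def clean_record(record):
    """Return a normalised copy of the record, or None if it cannot be repaired."""
    city = record["city"].strip().title()
    income_text = record["income"].replace(",", "").strip()
    try:
        income = float(income_text)
    except ValueError:
        return None  # missing or non-numeric income: drop the row
    return {"id": record["id"], "city": city, "income": income}


def clean_dataset(records):
    """Normalise every record, then drop unrepairable rows and duplicates."""
    seen = set()
    cleaned_rows = []
    for record in records:
        fixed = clean_record(record)
        if fixed is None:
            continue
        key = (fixed["id"], fixed["city"], fixed["income"])
        if key in seen:
            continue  # same row already kept
        seen.add(key)
        cleaned_rows.append(fixed)
    return cleaned_rows


cleaned = clean_dataset(raw_records)
for row in cleaned:
    print(row)
```

Dropping rows with missing values is only one option; depending on the problem, filling them in (imputation) may be more appropriate.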

Model development

Once the input data is ready, we need to select a methodology based on the nature of the dependent variable and its relationship with the independent variables. For example, if the dependent variable is categorical, we can use logistic regression, a decision tree, an Artificial Neural Network (ANN), Gradient Boosting, etc.
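To make the categorical case concrete, the sketch below fits a logistic regression with one independent variable using plain gradient descent. The data (hours studied vs. pass/fail), the function names and the learning rate and epoch settings are assumptions chosen for illustration; in practice a library implementation would normally be used.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Made-up data: hours studied (independent) vs. pass/fail (dependent, 0 or 1).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
p_low = sigmoid(b0 + b1 * 1.0)
p_high = sigmoid(b0 + b1 * 4.5)
print(f"P(pass | 1.0 h) = {p_low:.2f}, P(pass | 4.5 h) = {p_high:.2f}")
```

The model outputs a probability for the "1" category, which is why it suits a binary dependent variable; a continuous dependent variable would call for a different method such as linear regression.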

Model diagnostics

After developing the model, we need to check it for the following three characteristics:
  • Status of the underlying assumptions.
  • Model accuracy.
  • Predictive power of the model.
Assuming that the model is a multiple linear regression:
  • First, we need to check the underlying assumptions and look for problems such as heteroscedasticity, autocorrelation, and multicollinearity.
  • If all the assumptions are satisfied, we need to check model accuracy. For regression, the coefficient of determination (R-squared) is commonly used to check accuracy.
  • If the model is accurate on the training data, we then need to check its predictive power, that is, how accurately the model predicts the test data.
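The checks above can be sketched in a few lines of plain Python. To keep the example short it uses a simple (single-predictor) linear regression rather than a multiple one, so multicollinearity does not arise here; the data and function names are made up for illustration. The Durbin-Watson statistic is used as one common check for autocorrelation in the residuals.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest little autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Made-up data, split into a training part and a held-out test part.
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
test_x = [9, 10, 11]
test_y = [18.2, 19.9, 22.1]

intercept, slope = fit_line(train_x, train_y)
train_pred = [intercept + slope * x for x in train_x]
test_pred = [intercept + slope * x for x in test_x]

# 1. Assumption check (autocorrelation of residuals).
residuals = [y - p for y, p in zip(train_y, train_pred)]
dw = durbin_watson(residuals)

# 2. Accuracy on the data the model was fitted to.
r2_train = r_squared(train_y, train_pred)

# 3. Predictive power on data the model has not seen.
r2_test = r_squared(test_y, test_pred)

print(f"Durbin-Watson: {dw:.2f}")
print(f"R-squared (train): {r2_train:.4f}")
print(f"R-squared (test):  {r2_test:.4f}")
```

A large gap between the training and test R-squared values would suggest that the model fits the training data well but does not generalise.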
Selecting the best model (accuracy measures) is covered in the next article.


AKSTATS
