To understand the role of VIF, let's first understand the process of making a linear regression model with two variables.
While building such a model, we have to make sure that the two variables are not strongly correlated with each other.
What is correlation?
If the changes in variable v1 can be explained by the other variable v2, for example when v1 is (roughly) a linear function of v2, then the two variables are said to be correlated. Since v1 can then be represented as a function of v2, v1 adds little or no value to the model (everything it explains can also be explained by v2).
Correlation can be checked by observing the correlation charts between the two variables just as shown below:
Fig. 1 : Examples of Correlation graphs
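Besides looking at the plots, the same check can be done numerically. Here is a minimal sketch, assuming NumPy is available; the data for v1, v2 and v3 is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Made-up data: v2 is almost a linear function of v1, v3 is unrelated noise.
v1 = rng.normal(size=500)
v2 = 2.0 * v1 + rng.normal(scale=0.1, size=500)
v3 = rng.normal(size=500)

# np.corrcoef returns the matrix of Pearson correlation coefficients;
# the [0, 1] entry is the correlation between the two inputs.
corr_v1_v2 = np.corrcoef(v1, v2)[0, 1]
corr_v1_v3 = np.corrcoef(v1, v3)[0, 1]

print(f"corr(v1, v2) = {corr_v1_v2:.3f}")  # expected to be close to 1
print(f"corr(v1, v3) = {corr_v1_v3:.3f}")  # expected to be close to 0
```

A coefficient near +1 or -1 means one variable is close to a linear function of the other; a value near 0 means there is little linear relationship.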
What if there are more than two variables?
While building a multivariate regression model, if a predictor is (approximately) a function of a combination of other predictors, that is, if a combination of other variables explains most of its variation, then that variable exhibits multicollinearity with the others. Severe multicollinearity may result in unstable coefficient estimates (you don't want this to happen).
We can't see the combined effect of several variables on one variable in a pairwise plot, so correlation plots are of little help in this case.
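To make this concrete, here is a small sketch (assuming NumPy, with made-up data) in which a variable t is almost exactly the sum of four independent variables. Each pairwise correlation looks only moderate, yet the four variables together explain nearly all of t.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Made-up predictors: p, q, r, s are independent of each other,
# while t is (almost exactly) their sum.
p, q, r, s = (rng.normal(size=n) for _ in range(4))
t = p + q + r + s + rng.normal(scale=0.1, size=n)

# Pairwise Pearson correlation of t with each of the other predictors.
pairwise = [np.corrcoef(t, v)[0, 1] for v in (p, q, r, s)]

# R^2 from regressing t on all the other predictors together.
A = np.column_stack([np.ones(n), p, q, r, s])  # column of ones = intercept
coef, *_ = np.linalg.lstsq(A, t, rcond=None)
ss_res = np.sum((t - A @ coef) ** 2)
ss_tot = np.sum((t - t.mean()) ** 2)
r2_joint = 1.0 - ss_res / ss_tot

print("pairwise correlations with t:", np.round(pairwise, 2))  # each only moderate
print("R^2 of t on p, q, r, s:", round(r2_joint, 4))           # very close to 1
```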
So, how do we check for multicollinearity in the model?
Solution: VIF or Variance Inflation Factor is used to check multicollinearity in the model.
How do we use it?
Playing with VIF and making a model less prone to multicollinearity is an iterative process.
In each step of the process, we quantify how much of each variable is explained by all the other variables in the model. We do this for every variable and iteratively remove the variable with the highest collinearity until the VIF is under the threshold for all the remaining variables.
Let's see how.
Let p, q, r, s, t,.... be the independent variables considered in the multivariate regression model and let Y be the dependent variable.
Step I: Create a model using all the independent variables.
Y = b0 + b1*p + b2*q + b3*r + .....
Note down the value of R^2 for this model.
Step II: Create a model with p as the dependent variable and all the remaining independent variables as predictors.
p = b0 + b1*q + b2*r + ...
Note down R^2 for this model.
The VIF of p is given by: VIF = 1/(1-R^2), where R^2 comes from the model with p as the dependent variable (not from the model for Y).
Since R^2 can vary from 0 to 1, the VIF lies between 1 and infinity.
Point to remember: The higher the VIF, the greater the collinearity.
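The calculation for a single variable can be sketched as follows. This is only an illustration, assuming NumPy and ordinary least squares; the helper names r_squared and vif, and the data for p, q and r, are made up for the example.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)   # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

def vif(r2):
    """Variance Inflation Factor for a given R^2."""
    return 1.0 / (1.0 - r2)

# Plain arithmetic first: how R^2 maps to VIF.
for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R^2 = {r2:.1f} -> VIF = {vif(r2):.1f}")

# Made-up data: p is partly explained by q and r.
rng = np.random.default_rng(0)
n = 1000
q = rng.normal(size=n)
r = rng.normal(size=n)
p = q + r + rng.normal(scale=0.5, size=n)

# Regress p on the other independent variables and convert R^2 to VIF.
r2_p = r_squared(np.column_stack([q, r]), p)
vif_p = vif(r2_p)
print(f"R^2 for p = {r2_p:.3f}, VIF for p = {vif_p:.2f}")
```

Note that an R^2 of 0.8 corresponds to a VIF of 5 and an R^2 of 0.9 to a VIF of 10.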
Similarly, repeat step II for each of the other independent variables, one at a time, and record the VIF values.
Using the values from the above calculations:
Independent variable VIF
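A table of this kind could be produced with a sketch like the one below. It assumes NumPy, and the data is made up: r is built to be nearly p + q, while s is unrelated to the rest. The function name vif_table is hypothetical.

```python
import numpy as np

def vif_table(X, names):
    """Return {name: VIF}, regressing each column of X on all the other columns."""
    vifs = {}
    for j, name in enumerate(names):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs[name] = 1.0 / (1.0 - r2)
    return vifs

# Made-up data for illustration only.
rng = np.random.default_rng(1)
n = 1000
p = rng.normal(size=n)
q = rng.normal(size=n)
r = p + q + rng.normal(scale=0.1, size=n)  # r is nearly p + q
s = rng.normal(size=n)                     # s is unrelated to the others

X = np.column_stack([p, q, r, s])
table = vif_table(X, ["p", "q", "r", "s"])

print("Independent variable  VIF")
for name, value in table.items():
    print(f"{name:<20}  {value:.2f}")
```

In this made-up data p, q and r all show a high VIF, because each of them can be reconstructed from the other two, while s stays close to 1.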
From the above table, remove the variable with the highest VIF value if it crosses the threshold. The appropriate threshold depends on the domain of the problem. Generally, a VIF value of 5 or 10 is used as the threshold.
Note: Do not remove all the variables with VIF value higher than the threshold, but only the one with the highest value.
What happens after removing the variable with the highest VIF value?
Congrats. You have started the process of removing variables using VIF.
Let 'r' be the removed variable. Now, repeat the same process with the variables left after removing r, and see whether the variable with the highest VIF crosses the threshold. If it does, remove that variable too, and keep repeating the process until no variable crosses the threshold.
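The whole loop can be sketched as below. This is one possible way to write it, assuming NumPy and a threshold of 5; the function names compute_vifs and drop_high_vif and the data are made up for the example.

```python
import numpy as np

def compute_vifs(X):
    """VIF of every column of X, each regressed on all the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        A = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

def drop_high_vif(X, names, threshold=5.0):
    """Remove one variable at a time (the one with the highest VIF)
    until no remaining variable crosses the threshold."""
    names = list(names)
    removed = []
    while X.shape[1] > 1:
        vifs = compute_vifs(X)           # recomputed from scratch every round
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break                        # nothing crosses the threshold: done
        removed.append(names.pop(worst)) # drop only the single worst variable
        X = np.delete(X, worst, axis=1)
    return names, removed

# Made-up data: r is nearly p + q, s is unrelated.
rng = np.random.default_rng(1)
n = 1000
p = rng.normal(size=n)
q = rng.normal(size=n)
r = p + q + rng.normal(scale=0.1, size=n)
s = rng.normal(size=n)

X = np.column_stack([p, q, r, s])
kept, removed = drop_high_vif(X, ["p", "q", "r", "s"], threshold=5.0)
print("kept:", kept)
print("removed:", removed)
```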
When I said it's an iterative process, I really meant it!
Why are we not removing all the variables with high VIF in one go?
Now that we know what multicollinearity is, we also know that removing one variable changes the VIF scores of the others, since we do not know in advance how much of each remaining variable was being explained by the removed one. So, once a variable is removed from the model, the previously computed VIF scores of the remaining variables are no longer valid, and we have to compute them all over again.
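Here is a small sketch of that effect, assuming NumPy and made-up data in which r is nearly p + q. Before anything is removed, three variables cross a threshold of 5; after removing only r, the VIFs of p and q are expected to fall back close to 1, so dropping all three at once would have thrown away two useful variables.

```python
import numpy as np

def compute_vifs(X):
    """VIF of every column of X, each regressed on all the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        A = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Made-up data for illustration only.
rng = np.random.default_rng(1)
n = 1000
p = rng.normal(size=n)
q = rng.normal(size=n)
r = p + q + rng.normal(scale=0.1, size=n)  # r is nearly p + q
s = rng.normal(size=n)

names = ["p", "q", "r", "s"]
X = np.column_stack([p, q, r, s])
threshold = 5.0

# VIFs with all variables present.
before = compute_vifs(X)
over_before = [nm for nm, v in zip(names, before) if v > threshold]
print("over threshold before removal:", over_before)

# Remove only r, then recompute the VIFs of the remaining variables.
X_reduced = np.delete(X, names.index("r"), axis=1)
names_reduced = [nm for nm in names if nm != "r"]
after = compute_vifs(X_reduced)
for nm, v in zip(names_reduced, after):
    print(f"VIF of {nm} after removing r: {v:.2f}")
```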
This is how we deal with the multicollinearity between the variables in a model.
Hope this helps!!