# Multicollinearity and VIF

July 30, 2017

To understand the role of VIF, let's first understand the process of making a linear regression model with two variables.

While building such a model, we have to make sure that the two variables are not strongly correlated with each other.

What is correlation?

If the changes in variable v1 can be explained by another variable v2, or if v1 is some function of v2 (in the simplest case, a linear function), the two variables are said to be correlated. Since v1 can be represented as a function of v2, v1 adds no real value to the model: everything it explains can also be explained by the other variable.
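As a rough illustration, the strength of a linear relationship between two variables can be put into a single number with the Pearson correlation coefficient. The sketch below is a minimal pure-Python version; the data and the names `pearson_r`, `v1`, `v2`, `v3` are made up for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative data: v1 is an exact linear function of v2,
# while v3 has no linear relationship with v2.
v2 = [1.0, 2.0, 3.0, 4.0, 5.0]
v1 = [2.0 * x + 1.0 for x in v2]
v3 = [2.0, -1.0, 0.0, -1.0, 2.0]

r_12 = pearson_r(v1, v2)  # essentially 1: v1 adds nothing beyond v2
r_32 = pearson_r(v3, v2)  # essentially 0: no linear relationship
print("corr(v1, v2) =", r_12)
print("corr(v3, v2) =", r_32)
```

A coefficient near +1 or -1 means one variable is almost a linear function of the other; a value near 0 means there is no linear relationship, although a non-linear one may still exist.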

Correlation can be checked by looking at correlation plots of the two variables, as shown below:

Fig. 1 : Examples of Correlation graphs

What if there are more than two variables?

While building a multivariate regression model, if a predictor is a function of a combination of several other variables, or a combination of several variables explains the effect of that predictor, then the predictor exhibits multicollinearity with the others. Severe multicollinearity can make the coefficient estimates unstable (you don't want this to happen).

We can't see the combined effect of several variables on one variable in a two-variable plot, so correlation plots are of no help in this case.
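To make this concrete, here is a small sketch on synthetic data (assuming NumPy is available; the variable names and the data are invented for illustration). One variable is built as the exact sum of four independent ones. Each pairwise correlation looks only moderate, yet the four variables together explain it completely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Four independent synthetic predictors (illustrative data only).
p, q, r, s = rng.normal(size=(4, n))
# t is an exact combination of the other four.
t = p + q + r + s

# Pairwise correlations of t with each single predictor are only moderate.
pairwise = [np.corrcoef(t, v)[0, 1] for v in (p, q, r, s)]

# Regressing t on all four predictors together explains it completely.
X = np.column_stack([np.ones(n), p, q, r, s])
coef, *_ = np.linalg.lstsq(X, t, rcond=None)
residuals = t - X @ coef
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((t - t.mean()) ** 2)

print("pairwise correlations:", np.round(pairwise, 2))
print("R^2 of t on (p, q, r, s):", round(float(r_squared), 4))
```

No single scatter plot of t against one predictor would reveal that t is redundant, which is why a measure that looks at all the other variables at once is needed.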

So, how do we check for multicollinearity in the model?

Solution: VIF or Variance Inflation Factor is used to check multicollinearity in the model.

How do we use it?

Playing with VIF and making a model less prone to multicollinearity is an iterative process.

In each step of the process, we quantify how much of a variable is explained by all the other variables in the model. We do this for every variable and iteratively remove the variables with high collinearity until the VIF is under the threshold for all the remaining variables.
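The iterative process described above could be sketched roughly as follows. This is only one possible reading of it: the post does not fix a threshold, so the value 10 used here is an assumption (a commonly quoted rule of thumb), as is the choice to drop the single worst variable in each round. The helper names `r_squared`, `vif_scores` and `drop_high_vif` and the data are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def vif_scores(data):
    """VIF of every variable in `data` (dict: name -> 1-D array)."""
    scores = {}
    for name in data:
        others = np.column_stack([data[k] for k in data if k != name])
        r2 = r_squared(others, data[name])
        scores[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return scores

def drop_high_vif(data, threshold=10.0):
    """Drop the highest-VIF variable until all VIFs are <= threshold.

    threshold=10.0 is an assumed rule of thumb, not a value from the post.
    """
    data = dict(data)  # work on a copy
    while len(data) > 1:
        scores = vif_scores(data)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del data[worst]
    return data

# Synthetic example: r is almost exactly p + q, s is unrelated.
rng = np.random.default_rng(0)
n = 400
p, q, s = rng.normal(size=(3, n))
r = p + q + 0.1 * rng.normal(size=n)

kept = drop_high_vif({"p": p, "q": q, "r": r, "s": s})
print("variables kept:", sorted(kept))
```

Recomputing every VIF after each removal matters, because dropping one variable changes how well the remaining ones can be explained by each other.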

Let's see how.

Let p, q, r, s, t, ... be the independent variables considered in the multivariate regression model and let Y be the dependent variable.

Step I:

Create a model using all the independent variables.

Y = b0  +  b1*p  +  b2*q  +  b3*r  +  .....

Note down the value of R^2 for this model.
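A minimal sketch of Step I, assuming NumPy and using synthetic data with three predictors (the coefficients and noise level are made up for illustration): fit the model by ordinary least squares and compute R^2 from the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors and response (illustrative only).
p, q, r = rng.normal(size=(3, n))
Y = 1.0 + 2.0 * p - 1.0 * q + 0.5 * r + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for b0, then one column per predictor.
X = np.column_stack([np.ones(n), p, q, r])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)  # b = [b0, b1, b2, b3]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((Y - X @ b) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("coefficients:", np.round(b, 2))
print("R^2:", round(float(r_squared), 3))
```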

Step II:

Create a model with p as the dependent variable and all the remaining variables as independent variables.

p = b0  +  b1*q  +  b2*r  +  ...

Note down R^2 for this model.

The VIF for p is given by: VIF = 1/(1-R^2), where R^2 is the value noted for this model.
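Step II and the VIF formula might look like this in code. It is a sketch on synthetic data, assuming NumPy; p is deliberately constructed so that q and r explain a good part of it, and the names `r2_p` and `vif_p` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Synthetic predictors: p is partly explained by q and r (illustrative only).
q, r = rng.normal(size=(2, n))
p = 0.8 * q + 0.6 * r + rng.normal(scale=0.5, size=n)

# Step II: regress p on the remaining predictors (with an intercept).
X_others = np.column_stack([np.ones(n), q, r])
coef, *_ = np.linalg.lstsq(X_others, p, rcond=None)
ss_res = np.sum((p - X_others @ coef) ** 2)
ss_tot = np.sum((p - p.mean()) ** 2)
r2_p = 1.0 - ss_res / ss_tot

# VIF for p from the R^2 of this auxiliary regression.
vif_p = 1.0 / (1.0 - r2_p)
print("R^2 of p on (q, r):", round(float(r2_p), 3))
print("VIF for p:", round(float(vif_p), 2))
```

Note that Y plays no part here: the VIF depends only on how the predictors relate to each other.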

Since R^2 can vary from 0 to 1, the VIF lies between 1 and infinity.
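A quick way to get a feel for this range is to plug a few R^2 values into the formula. The helper `vif_from_r2` below is a hypothetical name used only for this illustration.

```python
def vif_from_r2(r2):
    """VIF = 1 / (1 - R^2); returns infinity when R^2 equals 1."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("R^2 must lie in [0, 1]")
    return float("inf") if r2 == 1.0 else 1.0 / (1.0 - r2)

# Print the VIF for a handful of R^2 values across the range.
for r2 in (0.0, 0.5, 0.8, 0.9, 0.99, 1.0):
    print(f"R^2 = {r2:<4}  ->  VIF = {vif_from_r2(r2):.1f}")
```

The growth is far from linear: R^2 = 0.5 gives a VIF of 2 and R^2 = 0.9 gives 10, while R^2 = 0.99 already gives 100.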

Point to remember: The higher the VIF, the stronger the collinearity.

Step III:

Similarly, repeat Step II and calculate the VIF for all the independent variables one by one and record the valu