# Introduction: The problem with tutorials

The huge amount of data science tutorials on the internet has its pros and cons. The biggest problem, in my humble opinion, is that everyone wants to talk about data nowadays in order to get people's (especially recruiters') attention. As a result, several mistakes are made and no one corrects them, so they spread among beginners.

If you take 90% of the tutorials about Linear Regression on this platform, you'll see a common mistake: saying that multicollinearity is the same as correlation. Even worse, people think that multicollinearity invalidates their model and that you should use the Pearson correlation coefficient to deal with it. All of these statements are wrong. Multicollinearity is not correlation, you should not deal with it by using the Pearson coefficient, and, depending on your situation, it's not even a problem.

# What’s multicollinearity?

Multicollinearity appears when one of your predictors can be predicted by one or more of the others. Let's say you want to predict the salary of a young student. To do that, you have some variables, like the high school price, the parents' earnings, the neighborhood, the time spent in school and the number of subjects studied in high school. If the high school price can be predicted from the parents' earnings, which is very likely, then you have multicollinearity.

You might be thinking that this looks exactly like linear correlation, but that's not always the case. In the example above, correlation does indicate multicollinearity, but this doesn't happen every time. Let's say the parents' earnings are not a very good predictor of the school price by themselves. This variable, however, can be part of a model that predicts the school price. We might have a great model if we take not only the earnings, but also the neighborhood and the number of subjects. Now we have three independent variables (parents' earnings, neighborhood and number of subjects) that together predict another independent variable, the school price. This is a situation in which there is multicollinearity, but you won't be able to capture it by using the Pearson coefficient alone.
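To make this concrete, here is a small simulation (a sketch with made-up numbers, using stand-in variables for the example above): the school price is driven by three predictors together, so its pairwise Pearson correlation with any single one of them stays moderate, while a regression on all three explains most of its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Stand-ins for three roughly independent predictors
# (parents' earnings, neighborhood score, number of subjects).
earnings = rng.normal(size=n)
neighborhood = rng.normal(size=n)
subjects = rng.normal(size=n)

# The school price is driven by all three together, plus noise.
price = earnings + neighborhood + subjects + rng.normal(scale=0.5, size=n)

# Pairwise Pearson correlation with any single predictor is only moderate...
corr_earnings = np.corrcoef(earnings, price)[0, 1]  # moderate, not alarming by itself

# ...but regressing price on the three predictors together explains
# most of its variance, which is exactly what multicollinearity means.
X = np.column_stack([np.ones(n), earnings, neighborhood, subjects])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
r2 = 1 - ((price - X @ beta) ** 2).sum() / ((price - price.mean()) ** 2).sum()
print(corr_earnings, r2)
```

A pairwise-correlation filter would keep all four variables here, yet the school price is almost fully determined by the other three.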

# So, how can we detect multicollinearity?

To assess whether your model has multicollinearity, you should use a metric called the Variance Inflation Factor, also known as VIF. Mathematically, the VIF of a predictor is the ratio of the variance of its estimated coefficient in the full model to the variance it would have in a model containing that predictor alone.
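In practice, the VIF of predictor i is computed from an auxiliary regression of that predictor on all the other predictors. If R²ᵢ is the coefficient of determination of that auxiliary regression, then:

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

The better the other predictors explain predictor i, the closer R²ᵢ gets to 1 and the larger the VIF becomes.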

As mentioned on Wikipedia:

> The square root of the variance inflation factor indicates how much larger the standard error increases compared to if that variable had 0 correlation to other predictor variables in the model.

Usually, we use this rule of thumb:

- VIF equal to 1: no multicollinearity.
- VIF higher than 1 and lower than or equal to 5: mild multicollinearity. It's not a big problem to keep the variable.
- VIF higher than 5: there is multicollinearity and you should leave the variable out of your model.

If you want to see the step-by-step derivation of the formula, I highly recommend starting with Wikipedia.

# Wait, multicollinearity does affect your model!

Yes and no. There is definitely a problem when our model has independent variables that can be predicted by other independent variables, but that doesn't mean the model stops working. Multicollinearity inflates standard errors, which means our coefficient estimates are no longer trustworthy. Our predictions, however, are still safe. That's why I mentioned before that **depending on your situation, it's not even a problem.**

One last thought: this is only one of the assumptions people believe invalidates their model when it actually doesn't. If you are working on an inference problem and want to learn more about these assumptions, read an econometrics book. My favorite one is Introductory Econometrics, by Wooldridge.

Hope you enjoyed this article.