A FAMOUS MISCONCEPTION
Do you understand what multicollinearity is? To clarify, do you truly grasp the concept of multicollinearity and how to detect it?
If your answer involves recognizing multicollinearity as a high correlation among independent variables in a regression model, detected by examining the correlation matrix, then your understanding is only partially correct.
This is a common misconception, so don’t be concerned. A quick internet search will reveal numerous tutorials echoing this oversimplified interpretation of multicollinearity. In fact, I would encourage you to take a brief pause from this article and explore the prevailing internet literature on the subject. You’ll find that a significant majority, perhaps 99% of the available texts, propagate this same misunderstanding.
FIRST, WHAT IS MULTICOLLINEARITY?
In Data Science, when we discuss multicollinearity, we often grapple with the assumptions of linear regression. Specifically, we’re examining an independent variable that is influenced by one or more other independent variables. As pointed out in chapter 3 of “Introductory Econometrics” by Jeffrey Wooldridge, these assumptions help us derive the estimators known as BLUE (Best Linear Unbiased Estimators). Remember, our focus here is not on the validity of the regression and its predictions, but on the estimators!
Let’s say you decide to investigate this relationship by examining the correlation matrix (more specifically, Pearson Coefficients), as many tutorials suggest. Consider a scenario where nine independent variables collectively “explain” the tenth variable. If these nine variables account for only 5–25% of the variance of the last variable, the correlation matrix will exhibit only low values, even though we face a case of perfect multicollinearity. You see the issue? We have the most severe case of multicollinearity, which the correlation matrix fails to detect. That’s a major problem.
HOW TO CAPTURE MULTICOLLINEARITY?
So, how do we check for multicollinearity then?
In this case, we use the metric called VIF, variance inflation factor. The formula for VIF is shown in the image below:
What VIF does is rather intuitive: for each independent variable, it runs a regression using that variable as the dependent variable and the remaining variables as predictors. The objective is to calculate the R-squared value of this regression, which signifies the proportion of the variance in the ‘dependent’ variable that can be predicted from the ‘independent’ variables. This R-squared value is then incorporated into the above formula, and a higher R-squared leads to a correspondingly higher VIF.
In other words, VIF gauges to what extent the variability of an independent variable can be explained by the other independent variables. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
For instance, if the R-squared is high, let’s say 0.9, you’ll have a high VIF, as well — for this example, VIF = 10. That’s why when we have a high VIF, we know there is a multicollinearity issue. Note that the regression doesn’t look at variables in pairs, but rather one variable in relation to the others. That’s why we can capture cases of multicollinearity that the correlation matrix can’t.
WRAPPING UP (AND ONE LAST ISSUE)
Furthermore, as already mentioned before, another common misconception is to say that your linear regression is not valid if it violates the assumption of perfect multicollinearity. You see, multicollinearity impacts the coefficients, making their statistical tests unreliable. However, it’s not ruining your prediction. In fact, it is part of a Gauss-Markov assumption when we seek estimators called BLUE (best linear unbiased estimators). Therefore, your regression can still work very well, depending on its use.
I trust you found this article both enjoyable and insightful. If I’ve addressed issues present in your own tutorials, please accept my apologies in advance. I assure you it’s not about placing blame. My intention is to help students navigate some common misconceptions that often circulate on the internet.
If you speak Portuguese, you can find more content on my Ig @universidadedosdados. For English speakers, you can reach me on my Linkedin (https://www.linkedin.com/in/andreyukio/) or Kaggle (https://www.kaggle.com/andreyukio).
PS.: My first encounter with the distinction between multicollinearity and linear correlation was on Quora, through an answer by Peter Flom. It was enlightening! Despite my master’s in Economics, I had issues distinguishing between prediction and inference, and these forums greatly assisted me. My sincere thanks to those who assist the community, whether by participating in posts and discussions, or merely engaging in them.