visit
Multicollinearity is a well-known challenge in multiple regression. The term refers to the high correlation between two or more explanatory variables, i.e. predictors. It can be an issue in machine learning, but what really matters is your specific use case.
According to
As you can see, all of these relate to variable importance. In many cases, multiple regression is used with the purpose of understanding something. For instance, an ecologist might want to know what kind of environmental and biological factors lead to changes in the population size of chimpanzees. If understanding is the goal, checking for correlation between predictors is standard practice. They say that a picture is better than a thousand words so let’s see some multicollinearity.
Imagine you want to understand what drives fuel consumption in cars. We can use a dataset called mtcars that has information on miles per gallon (mpg) and other details about different models of cars. When we plot correlations of all the continuous variables, we can see a lot of strong correlations, i.e. multicollinearity.
When I ran a simple linear regression on this data, the car’s weight (wt) showed up as a statistically significant predictor. We also see that other variables strongly correlate with our target (mpg). Still, the model didn’t recognize them as important predictors due to multicollinearity.
When we think about regression, we need to make an assumption about the distribution, the shape of the data. When we need to specify a distribution, this is a parametric method. For example, the
Machine learning (ML) algorithms usually aim to achieve the best accuracy or low prediction error, not to explain the true relationship between predictors and the target variable. This can let you believe that multicollinearity is not an issue in machine learning. In fact, when searching for “multicollinearity and machine learning”, one of the top search results was a
I wouldn’t be so blunt.
Multicollinearity can be an issue in machine learning.
For example,
Well, let’s all just use neural networks for everything, right?
The problem is that in practice, you need to explain your system’s behaviour, especially if it makes decisions.
But what if you really don’t care about explaining and understanding anything? You just want a nice black box that will have an outstanding performance. If that’s the case, you are off the hook; you can ignore correlated predictors but then don’t check variable importance when someone asks you about it.
For those interested in handling correlated features, here are some tips.
You can deal with multicollinearity by
Both of the solutions will limit you differently. As
Correlated predictive variables can be an issue in machine learning and non-parametric methods. Multicollinearity is not your friend, so you should always double-check whether your chosen method can handle it automatically. Still, the solution will depend on your use case; mainly, whether you need to explain what the model has learnt or not.
For a deep dive on the topic, I recommend reading