Jakob Poerschmann ’21 (Data Science) has written an article called “Stop Dropping Outliers! 3 Upgrades That Prepare Your Linear Regression For The Real World” that was recently posted on Towards Data Science.
The real-world example he uses to set up the piece will resonate with every fan of FC Barcelona (and probably scare them, too):
You are working as a Data Scientist for the FC Barcelona and took on the task of building a model that predicts the value increase of young talent over the next 2, 5, and 10 years. You might want to regress the value over some meaningful metrics such as the assists or goals scored. Some might now apply this standard procedure and drop the most severe outliers from the dataset. While your model might predict decently on average, it will unfortunately never understand what makes a Messi (because you dropped Messi with all the other “outliers”).
The idea of dropping or replacing outliers in regression problems comes from the fact that simple linear regression is comparably prone to extremes in the data. However, this approach would not have helped you much in your role as Barcelona’s Data Scientist. The simple message: Outliers are not always bad!
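To make the sensitivity claim concrete, here is a minimal sketch on synthetic data (not taken from the article). The variable names, the made-up "goals vs. value" relationship, and the single extreme "Messi-like" point are all assumptions for illustration. It fits ordinary least squares with and without that one point, then checks how far the outlier-free model misses the star player.

```python
import numpy as np

def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]  # intercept, slope

# Hypothetical data: 20 "typical" players, value roughly 2 * goals + 5.
rng = np.random.default_rng(0)
goals = np.arange(1, 21, dtype=float)
value = 2.0 * goals + 5.0 + rng.normal(0.0, 1.0, size=goals.size)

# One hypothetical extreme player, far off the typical trend.
star_goals, star_value = 50.0, 400.0
goals_all = np.append(goals, star_goals)
value_all = np.append(value, star_value)

# Fit once with the outlier dropped, once with it kept.
intercept_typical, slope_typical = ols_fit(goals, value)
intercept_all, slope_all = ols_fit(goals_all, value_all)

# How badly does the "outliers dropped" model miss the star player?
star_prediction = intercept_typical + slope_typical * star_goals
star_miss = star_value - star_prediction

print(f"slope without outlier: {slope_typical:.2f}")
print(f"slope with outlier:    {slope_all:.2f}")
print(f"dropped-outlier model underestimates the star by {star_miss:.0f}")
```

In this toy setup a single point shifts the fitted slope substantially, which is why dropping it is tempting. The model fitted without it, however, badly underestimates exactly the player the club cares about most.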
Dig into the full article to find out how to prepare your linear regression for the real world and avoid a tragedy like this one!
Connect with the author
Jakob Poerschmann ’21 is a student in the Barcelona GSE Master’s in Data Science.