What people misunderstand about the test set in machine learning

Yukio
3 min read · Oct 27, 2024


Image source: https://datascience.stackexchange.com/questions/61467/clarification-on-train-test-and-val-and-how-to-use-implement-it

INTRODUCTION: WHAT IS THE TEST SET?

People think they understand the test set in machine learning, but frequent mistakes suggest otherwise. Many treat it as just another dataset for validating a model, yet errors like pre-processing the data before splitting it or unintentionally sharing information between splits reveal a deeper misunderstanding of its true purpose.

In machine learning, the test set is a critical part of building a model. It’s used to simulate the real-world conditions that the model will face when it’s actually in use, showing us how well it handles new data. However, a lot of people misunderstand the purpose of the test set, leading to something called data leakage. This can make the model look great during testing but lead to poor results when it’s actually deployed.

PRE-PROCESSING AND LEAKAGE

Now, let’s think about it. If the test set is a simulation of real-world conditions, it should be kept separate from the training process and shouldn’t interfere with any part of building the model. A common mistake is applying data processing steps — such as handling missing values, scaling, and encoding — to the entire dataset before splitting it into training and test sets. This blurs the line between the datasets, which is a critical error.
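To make the ordering concrete, here is a minimal sketch of the safe pattern using scikit-learn: split first, then let a Pipeline fit every pre-processing step on the training data alone. The file name and column name are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data.csv")                    # hypothetical dataset
X, y = df.drop(columns="target"), df["target"]  # hypothetical target column

# Split BEFORE any pre-processing so the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The Pipeline fits the scaler on the training data only; at evaluation
# time it merely applies the learned transform to the test set.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

A Pipeline also keeps cross-validation honest: each fold refits the pre-processing on that fold’s training portion alone.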

For instance, say you’re filling missing values with the average of a column. If you compute that average over the entire dataset, statistics from the test rows leak into the training data: the model is trained with information it could never have had in production. Instead, compute the fill value from the training set alone and reuse that same value on the test set, as the sketch below demonstrates.
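A minimal sketch of the difference, using scikit-learn’s SimpleImputer on toy numbers (the values are made up to make the leak visible):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [2.0], [np.nan]])
test = np.array([[100.0], [np.nan]])

# LEAKY: fitting on train AND test lets the test value 100.0 pull the
# fill value up to (1 + 2 + 100) / 3 ≈ 34.3.
leaky = SimpleImputer(strategy="mean").fit(np.vstack([train, test]))

# CORRECT: fit on the training set only (fill value = 1.5), then apply
# that same statistic when transforming the test set.
imputer = SimpleImputer(strategy="mean").fit(train)
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print(leaky.statistics_)    # [34.333...]
print(imputer.statistics_)  # [1.5]
```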

DEALING WITH IMBALANCED DATA

Image source: https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392

Another common issue is applying data augmentation before splitting the data. Augmentation, such as creating extra examples to balance the classes, can help build better models, but it belongs in the training set only. If augmented data ends up in the test set, the test set stops being a true reflection of real-world data. It’s even worse when augmentation happens before the split: augmented copies of training examples can land in the test set, so the model is evaluated on near-duplicates of data it was trained on and its scores are inflated. The test set should look as close to reality as possible for an honest check of the model’s performance (see the sketch below).
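Here is a minimal sketch of the safe ordering, assuming the third-party imbalanced-learn package for SMOTE oversampling; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

# Synthetic imbalanced dataset: roughly 90% / 10% classes.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Split FIRST, stratifying so both sets keep the real class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Oversample the training fold only; the test set keeps its
# original, real-world distribution.
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_train_bal, y_train_bal)
print(model.score(X_test, y_test))  # evaluated on untouched data
```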

SEQUENTIAL OVERFITTING

Image source: How to avoid machine learning pitfalls: a guide for academic researchers

Another form of data leakage is something called sequential overfitting. This happens when you build and test several models, making changes based on how each one does on the test set. When you keep using the test set to improve the model, information from it sneaks into the training process over time. This repeated tweaking gradually overfits the model to the test data, meaning it performs well on the test set but poorly on new data. To avoid this, use a separate validation set to adjust the model, saving the test set for the very last step to get a true picture of how it will perform in the real world.
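A minimal sketch of this discipline: tune hyperparameters against a validation split and touch the test set exactly once at the end. The candidate values here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)

# First carve off the test set; it stays untouched until the final step.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0
)

best_model, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:              # hyperparameter candidates
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)         # compare models on VALIDATION data
    if score > best_score:
        best_model, best_score = model, score

# The test set is used once, for the final, honest estimate.
print(best_model.score(X_test, y_test))
```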

FINAL CONSIDERATIONS

Being careful with the test set is key to building machine learning models that actually work in practice. Split first, fit pre-processing and augmentation on the training data alone, tune on a validation set, and evaluate on the test set exactly once. Following these guidelines keeps your performance estimates honest and your model ready for real-world data.


Written by Yukio

Mathematician with a master’s degree in Economics. Working as a Data Scientist for the last 10 years.
