Yukio
Dec 8, 2020

As far as I know, you should fit the imputation using only the train dataset. The test dataset plays the role of new data on which you evaluate your model, so you should pretend you have never seen it (like any new dataset). If you impute before splitting, you are using information from the test dataset (= data leakage). For instance, let's say you have a column with the values (1, 2, 1, 5) for the train and (3, 2) for the test. The mean of the train values alone is 2.25, but the mean computed before splitting is about 2.33, because it also used 3 and 2. So you used information from something you shouldn't have. You may want to take a look at this discussion: https://stats.stackexchange.com/questions/95083/imputation-before-or-after-splitting-into-train-and-test. Best regards
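To make the point concrete, here is a minimal sketch in plain Python using the toy column from the example. The missing entries (`None`) and the helper names `fit_mean` and `impute` are my own assumptions for illustration; they are not from any library.

```python
from statistics import mean

# Toy column from the example; the None entries are hypothetical
# missing values added only so there is something to impute.
train = [1, 2, 1, 5, None]
test = [3, 2, None]

def fit_mean(values):
    """Mean of the observed (non-missing) values."""
    return mean(v for v in values if v is not None)

def impute(values, fill):
    """Replace every missing value with the given fill value."""
    return [fill if v is None else v for v in values]

# Correct: the statistic is learned from the train split only.
train_mean = fit_mean(train)            # 2.25

# Leaky: the statistic also sees the test values 3 and 2.
leaky_mean = fit_mean(train + test)     # 14 / 6, about 2.33

# The same train-only mean fills the gaps in both splits.
train_imputed = impute(train, train_mean)
test_imputed = impute(test, train_mean)
```

If you use scikit-learn, the same idea is usually expressed by calling `fit` on the train split and `transform` on both splits, for example with `SimpleImputer`.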

Written by Yukio

Mathematician with a master's degree in Economics. Working as a Data Scientist for the last 10 years.
