Machine Learning typically assumes that the training and test sets are drawn independently from the same distribution, but this assumption is often violated in practice, which introduces a bias. When the test set is representative of the ground truth and the training set is a biased subset thereof, we speak of Selection Bias. Many attempts to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias?
In contrast to prior work, we aim to introduce a new method that identifies and mitigates Selection Bias in the case where we may not know if (and where) a bias is present, and hence no ground-truth information is available. This method could then be employed as a universal preprocessing step in any Machine Learning pipeline, enhancing model quality by improving data quality.
We propose a new method, IMITATE, which investigates the dataset's probability density and adds generated points to smooth out the density so that it resembles a Gaussian, the most common density occurring in real-world applications. If the artificial points concentrate in certain areas rather than being spread widely, this may indicate a Selection Bias in which those areas are underrepresented in the sample.
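To illustrate the intuition, a rough one-dimensional sketch could look as follows. This is not the published IMITATE algorithm; the function name, the histogram-based density estimate, and all parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fill_to_gaussian(x, n_bins=30):
    """Illustrative sketch: compare a 1-D sample's histogram with a
    Gaussian fitted to the sample and generate points in bins where the
    sample falls short of the Gaussian's expected count."""
    mu, sigma = x.mean(), x.std()
    counts, edges = np.histogram(
        x, bins=n_bins, range=(mu - 3 * sigma, mu + 3 * sigma)
    )
    centers = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]
    # Expected bin counts under the Gaussian fitted to the sample
    density = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (
        sigma * np.sqrt(2 * np.pi)
    )
    deficit = np.maximum(density * x.size * width - counts, 0).round().astype(int)
    # Generate uniform points inside each under-filled bin
    return np.concatenate(
        [rng.uniform(edges[i], edges[i + 1], size=d) for i, d in enumerate(deficit)]
    )

# Simulated Selection Bias: only the positive half of a normal sample is observed
sample = rng.normal(size=2000)
observed = sample[sample > 0]
added = fill_to_gaussian(observed)
# Many generated points fall below 0, the region the biased sample never covers
print(added.size, round(float((added < 0).mean()), 2))
```

In this toy setting, the generated points cluster in the unobserved region, which is exactly the signal the abstract describes: localized artificial points hint at an underrepresented area.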
While the proposed method is already suitable for many different datasets, it needs further improvements and extensions to work well in more general settings. This PhD project aims to extend IMITATE to a wide range of situations and to clearly identify its limitations.