Fast and Accurate Importance Weighting for Correcting Sample Bias
- URL: http://arxiv.org/abs/2209.04215v1
- Date: Fri, 9 Sep 2022 10:01:46 GMT
- Title: Fast and Accurate Importance Weighting for Correcting Sample Bias
- Authors: Antoine de Mathelin, Francois Deheeger, Mathilde Mougeot, Nicolas Vayatis
- Abstract summary: We propose a novel importance weighting algorithm that scales to large datasets by using a neural network to predict the instance weights.
We show that the proposed approach drastically reduces computation time on large datasets while maintaining sample bias correction performance similar to that of other importance weighting methods.
- Score: 4.750521042508541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bias in datasets can be very detrimental for appropriate statistical
estimation. In response to this problem, importance weighting methods have been
developed to match any biased distribution to its corresponding target unbiased
distribution. The seminal Kernel Mean Matching (KMM) method is, nowadays, still
considered as state of the art in this research field. However, one of the main
drawbacks of this method is the computational burden for large datasets.
Building on previous works by Huang et al. (2007) and de Mathelin et al.
(2021), we derive a novel importance weighting algorithm which scales to large
datasets by using a neural network to predict the instance weights. We show, on
multiple public datasets, under various sample biases, that our proposed
approach drastically reduces the computational time on large datasets while
maintaining similar sample bias correction performance compared to other
importance weighting methods. The proposed approach appears to be the only one
able to give relevant reweighting in a reasonable time for large datasets with
up to two million data points.
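For readers unfamiliar with the KMM baseline mentioned in the abstract, the following is a minimal sketch of its quadratic program as formulated by Huang et al. (2007): minimize (1/2) w^T K w - kappa^T w subject to 0 <= w_i <= B. It is not the neural-network method proposed in this paper. The names `rbf_kernel` and `kmm_weights`, the parameters `gamma`, `bound`, and `n_iter`, and the use of projected gradient descent in place of a QP solver are all assumptions made for illustration; the original constraint on the sum of the weights is omitted for brevity.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kmm_weights(source, target, gamma=1.0, bound=10.0, n_iter=2000):
    """Approximate KMM weights with projected gradient descent.

    Illustrative only: a real implementation would use a QP solver and
    also constrain the sum of the weights.
    """
    n_s, n_t = len(source), len(target)
    K = rbf_kernel(source, source, gamma)  # (n_s, n_s) Gram matrix
    kappa = (n_s / n_t) * rbf_kernel(source, target, gamma).sum(axis=1)
    # Step size 1/L, where L is the largest eigenvalue of K.
    step = 1.0 / np.linalg.eigvalsh(K)[-1]
    w = np.ones(n_s)
    for _ in range(n_iter):
        grad = K @ w - kappa  # gradient of the quadratic objective
        w = np.clip(w - step * grad, 0.0, bound)  # project onto [0, bound]
    return w


# Toy sample bias: the source sample is centred at 0, the target at 1.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(300, 1))
target = rng.normal(1.0, 1.0, size=(300, 1))
weights = kmm_weights(source, target)

print("target mean:          ", target.mean())
print("unweighted source mean:", source.mean())
print("reweighted source mean:", np.average(source[:, 0], weights=weights))
```

The Gram matrix `K` has one entry per pair of source points, so memory and time grow at least quadratically with the sample size. This is the computational burden the abstract refers to, and it is what motivates predicting the weights with a neural network instead.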
Related papers
- Data Pruning in Generative Diffusion Models [2.0111637969968]
Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets.
We show that eliminating redundant or noisy data in large datasets is beneficial particularly when done strategically.
arXiv Detail & Related papers (2024-11-19T14:13:25Z)
- Revisiting the Dataset Bias Problem from a Statistical Perspective [72.94990819287551]
We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute u and a non-class attribute b.
We propose to mitigate dataset bias by either weighting the objective of each sample n by $\frac{1}{p(u_n|b_n)}$ or sampling that sample with a weight proportional to $\frac{1}{p(u_n|b_n)}$.
arXiv Detail & Related papers (2024-02-05T22:58:06Z)
- Efficient Hybrid Oversampling and Intelligent Undersampling for Imbalanced Big Data Classification [1.03590082373586]
We present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework.
Our experimental results show the virtues of this approach, outperforming alternative resampling techniques for small- and medium-sized datasets.
arXiv Detail & Related papers (2023-10-09T15:22:13Z)
- Correcting sampling biases via importance reweighting for spatial modeling [2.6862667248315386]
In machine learning models, the estimation of errors is often complex due to distribution bias.
We introduce an approach based on the ideas of importance sampling to obtain an unbiased estimate of the target error.
arXiv Detail & Related papers (2023-09-09T15:36:28Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods while using far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabelled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z)
- Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding [51.48582649050054]
We propose a representation normalization method which aims at disentangling the correlations between features of encoded sentences.
We also propose Kernel-Whitening, a Nystrom kernel approximation method to achieve more thorough debiasing on nonlinear spurious correlations.
Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy.
arXiv Detail & Related papers (2022-10-14T05:56:38Z)
- Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff [66.59869239999459]
In finance, imputation of missing returns may be applied prior to training a portfolio optimization model.
There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data.
We propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation.
arXiv Detail & Related papers (2021-02-25T09:05:35Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
Undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate a method that we call prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)