Sampling Bias Correction for Supervised Machine Learning: A Bayesian
Inference Approach with Practical Applications
- URL: http://arxiv.org/abs/2203.06239v2
- Date: Tue, 15 Mar 2022 02:37:37 GMT
- Title: Sampling Bias Correction for Supervised Machine Learning: A Bayesian
Inference Approach with Practical Applications
- Authors: Max Sklar
- Abstract summary: We consider supervised learning where the training set has been subject to a known sampling bias, and correct for it within the Bayesian inference framework by altering the posterior to account for the sampling function.
We then apply this solution to binary logistic regression, and discuss scenarios where a dataset might be subject to intentional sample bias such as label imbalance.
This technique is widely applicable for statistical inference on big data, from the medical sciences to image recognition to marketing.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a supervised machine learning problem where the training set has been
subject to a known sampling bias, how can a model be trained to fit the
original dataset? We achieve this through the Bayesian inference framework by
altering the posterior distribution to account for the sampling function. We
then apply this solution to binary logistic regression, and discuss scenarios
where a dataset might be subject to intentional sample bias such as label
imbalance. This technique is widely applicable for statistical inference on big
data, from the medical sciences to image recognition to marketing. Familiarity
with it will give the practitioner tools to improve their inference pipeline
from data collection to model selection.
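A concrete numpy sketch of the label-imbalance case (illustrative only, not the paper's full Bayesian treatment; the data, rates, and helper below are made up): under a logistic model, subsampling the classes at known rates s_pos and s_neg shifts only the intercept, by log(s_pos / s_neg), so the slopes can be trusted and the intercept corrected after the fit.

```python
# Minimal sketch: correct a logistic model trained on an intentionally
# label-imbalanced subsample. Assumes positives were kept at rate s_pos
# and negatives at rate s_neg; under the logit model this shifts only
# the intercept, by log(s_pos / s_neg).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "original" dataset with rare positives.
n = 200_000
X = rng.normal(size=(n, 2))
true_w, true_b = np.array([1.5, -2.0]), -3.0
p = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
y = rng.random(n) < p

# Intentional sampling bias: keep all positives, 10% of negatives.
s_pos, s_neg = 1.0, 0.1
keep = np.where(y, True, rng.random(n) < s_neg)
Xb, yb = X[keep], y[keep]

def fit_logistic(X, y, iters=500, lr=0.5):
    """Plain gradient-descent logistic regression (no regularization)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

w_hat, b_hat = fit_logistic(Xb, yb.astype(float))

# Undo the known sampling function after training.
b_corrected = b_hat - np.log(s_pos / s_neg)
print("slopes:", w_hat, "intercept (biased, corrected):", b_hat, b_corrected)
```

The corrected intercept should land near the true value of -3, while the uncorrected one is inflated by log(1/0.1) ≈ 2.3, which is exactly the log-odds introduced by the sampling.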
Related papers
- Learning Augmentation Policies from A Model Zoo for Time Series Forecasting [58.66211334969299]
We introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning.
By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance.
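A toy of the underlying idea, not AutoTSAug itself: treat the augmentation strength as a policy learned from reward, here a simple epsilon-greedy bandit over jitter scales, rewarded by the change in forecast error after augmenting the hardest ("marginal") series. The naive forecaster and the candidate scales are invented for illustration.

```python
# Bandit-style "learnable augmentation policy" sketch (assumed toy).
import numpy as np

rng = np.random.default_rng(1)
scales = [0.01, 0.05, 0.1, 0.2]          # candidate jitter magnitudes
value = np.zeros(len(scales))            # running reward estimates
counts = np.zeros(len(scales))

def forecast_error(train, val):
    """Stand-in forecaster: predict each series' last training value."""
    return np.mean((val - train[:, -1]) ** 2)

train = rng.normal(size=(64, 30)).cumsum(axis=1)   # random-walk series
val = train[:, -1] + rng.normal(scale=0.1, size=64)

base = forecast_error(train, val)
per_series = (val - train[:, -1]) ** 2
marginal = per_series > np.quantile(per_series, 0.75)  # hardest quartile

for _ in range(200):
    a = rng.integers(len(scales)) if rng.random() < 0.1 else value.argmax()
    aug = train.copy()
    aug[marginal] += rng.normal(scale=scales[a], size=aug[marginal].shape)
    reward = base - forecast_error(aug, val)       # improvement as reward
    counts[a] += 1
    value[a] += (reward - value[a]) / counts[a]    # incremental mean

print("learned jitter scale:", scales[int(value.argmax())])
```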
arXiv Detail & Related papers (2024-09-10T07:34:19Z)
- Towards Bayesian Data Selection [0.0]
Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization.
We embed this kind of data addition into decision theory by framing data selection as a decision problem.
For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion.
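A hedged sketch of the decision-theoretic framing, not the paper's Bayes criterion: score each unlabeled point by its expected utility under the posterior predictive, with assumed costs for correct and incorrect pseudo-labels, and add only points whose expected utility is positive.

```python
# Data selection as a decision problem (toy utilities, stand-in posterior).
import numpy as np

rng = np.random.default_rng(2)

# Posterior predictive probabilities for unlabeled points (random stand-ins).
p = rng.random(1000)

# Utility of adding a pseudo-label: +1 if correct, -4 if wrong (assumed
# costs). Expected utility under the predictive, for the argmax label:
conf = np.maximum(p, 1 - p)
expected_utility = conf * 1.0 + (1 - conf) * (-4.0)

selected = expected_utility > 0          # decision rule: positive utility
print(f"pseudo-labeling {selected.sum()} of {len(p)} points")
```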
arXiv Detail & Related papers (2024-06-18T12:40:15Z)
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
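A minimal sketch of rejection via a density ratio (the ratio here is a hand-written stand-in, not the paper's learned one): abstain whenever the estimated ratio of the idealized to the actual data density falls below a threshold.

```python
# Classification with rejection via a density ratio r(x) = p_ideal / p_data.
import numpy as np

def predict_with_rejection(model, log_ratio_fn, X, tau=0.5):
    """Return predictions, with None where the density ratio is too low."""
    r = np.exp(log_ratio_fn(X))
    preds = model(X)
    return [p if ri >= tau else None for p, ri in zip(preds, r)]

# Toy usage: a trivial classifier and a Gaussian-shaped log-ratio that
# down-weights outlying inputs (both are assumptions for illustration).
X = np.random.default_rng(3).normal(size=(5, 1))
model = lambda X: (X[:, 0] > 0).astype(int)
log_ratio = lambda X: -0.5 * (X[:, 0] ** 2)
print(predict_with_rejection(model, log_ratio, X))
```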
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
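The key departure from "impute-then-generate" can be sketched as a loss masked to observed entries, so missing cells never contribute an imputed, biased target. A simplified one-step denoising version in numpy (an assumed simplification; the toy denoiser stands in for a real model):

```python
# Masked denoising loss: only observed entries drive the objective.
import numpy as np

rng = np.random.default_rng(4)

X = rng.normal(size=(128, 8))                 # tabular batch
mask = rng.random(X.shape) < 0.8              # True where observed

def masked_denoising_loss(denoise_fn, X, mask, sigma=0.5):
    """MSE between predicted and true noise, on observed entries only."""
    eps = rng.normal(size=X.shape)
    X_noisy = np.where(mask, X + sigma * eps, 0.0)   # zero out missing
    eps_hat = denoise_fn(X_noisy, sigma)
    return np.sum(mask * (eps_hat - eps) ** 2) / mask.sum()

# Toy denoiser: predicts zero noise everywhere (a real model goes here).
loss = masked_denoising_loss(lambda Xn, s: np.zeros_like(Xn), X, mask)
print("masked loss:", loss)   # ~1.0, the variance of the unexplained noise
```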
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
- Robust Outlier Rejection for 3D Registration with Variational Bayes [70.98659381852787]
We develop a novel variational non-local network-based outlier rejection framework for robust alignment.
We propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation.
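A far simpler stand-in for the paper's variational network, showing only the voting mechanic: sample minimal correspondence subsets, fit a rigid transform with the Kabsch algorithm, and let each hypothesis vote for the correspondences it explains.

```python
# Voting-based inlier search for rigid 3D registration (toy version).
import numpy as np

rng = np.random.default_rng(5)

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping points P onto Q."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, Q.mean(0) - R @ P.mean(0)

# Synthetic correspondences: 70 inliers under a true motion, 30 junk.
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
P = rng.normal(size=(100, 3))
Q = P @ R_true.T + np.array([1.0, 2.0, 3.0])
Q[70:] = rng.normal(size=(30, 3))               # outlier matches

votes = np.zeros(len(P))
for _ in range(200):                            # hypothesis voting
    idx = rng.choice(len(P), size=3, replace=False)
    R, t = kabsch(P[idx], Q[idx])
    votes += np.linalg.norm(P @ R.T + t - Q, axis=1) < 0.05

inliers = votes > 0.5 * votes.max()             # high-vote cluster
R, t = kabsch(P[inliers], Q[inliers])           # final estimate
print("recovered translation:", np.round(t, 2))  # ~[1, 2, 3]
```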
arXiv Detail & Related papers (2023-04-04T03:48:56Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
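A crude sketch of learning from aggregates only; where the paper uses a maximum-entropy hypothesis for the unobserved feature distribution, this toy uses a much blunter stand-in, placing each group's mass at its mean and fitting a count-weighted logistic model to (mean, positive-count) pairs.

```python
# Count-weighted logistic fit on aggregated (group mean, positives) data.
import numpy as np

rng = np.random.default_rng(6)

# Aggregated data: per group, a feature mean, a size, a positive count.
means = rng.normal(size=(50, 3))
sizes = rng.integers(50, 200, size=50)
true_w = np.array([1.0, -1.0, 0.5])
rates = 1 / (1 + np.exp(-means @ true_w))
positives = rng.binomial(sizes, rates)

w = np.zeros(3)
for _ in range(2000):                       # weighted logistic regression
    p = 1 / (1 + np.exp(-means @ w))
    grad = means.T @ (sizes * p - positives) / sizes.sum()
    w -= 1.0 * grad
print("recovered weights:", np.round(w, 2))
```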
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms [35.29582673348303]
Sampling biases in training data are a major source of algorithmic biases in machine learning systems.
We present adaptive sampling methods to determine, with high confidence, whether it is possible to assemble a representative dataset from the given data sources.
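The feasibility question itself reduces to a small linear program: a target vector of group proportions is achievable exactly when it lies in the convex hull of the sources' proportion vectors. A sketch assuming the proportions are known exactly (the paper's contribution is deciding this adaptively from noisy samples):

```python
# Convex hull membership as an LP feasibility check.
import numpy as np
from scipy.optimize import linprog

# Each row: a data source's group proportions (assumed known here).
sources = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]])
target = np.array([0.3, 0.3, 0.4])

# Find mixture weights w >= 0, sum w = 1, with w @ sources = target.
k = len(sources)
A_eq = np.vstack([sources.T, np.ones(k)])
b_eq = np.append(target, 1.0)
res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
print("representative dataset feasible:", res.success, res.x)
```

With the numbers above, the mixture (0.2, 0.2, 0.6) reproduces the target exactly, so the LP reports feasibility.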
arXiv Detail & Related papers (2022-04-13T23:14:05Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
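For reference, plain split conformal prediction, the baseline such methods build on; the paper's method additionally reweights for the feedback-induced distribution shift, which this sketch omits.

```python
# Split conformal interval from calibration residuals (standard recipe).
import numpy as np

rng = np.random.default_rng(7)

def conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Interval with ~90% coverage under exchangeability."""
    scores = np.abs(y_cal - model(X_cal))            # residual scores
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    pred = model(np.atleast_2d(x_new))[0]
    return pred - q, pred + q

X_cal = rng.normal(size=(500, 1))
y_cal = 2 * X_cal[:, 0] + rng.normal(scale=0.3, size=500)
model = lambda X: 2 * X[:, 0]                        # pretrained stand-in
print(conformal_interval(model, X_cal, y_cal, np.array([1.0])))
```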
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff [66.59869239999459]
In finance, imputation of missing returns may be applied prior to training a portfolio optimization model.
There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data.
We propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation.
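An illustrative special case, not the paper's estimator: when both posteriors are one-dimensional Gaussians, the Wasserstein-2 barycenter simply interpolates mean and standard deviation, making the bias-variance dial explicit. The numbers below are invented.

```python
# W2 barycenter of two 1-D Gaussians: interpolate mean and std linearly.
def gaussian_w2_barycenter(mu1, sd1, mu2, sd2, w=0.5):
    """Barycenter of two 1-D Gaussians with weight w on the first."""
    return w * mu1 + (1 - w) * mu2, w * sd1 + (1 - w) * sd2

mu_train, sd_train = 0.02, 0.10   # imputation from training data only
mu_full,  sd_full  = 0.01, 0.04   # imputation using the full sample
print(gaussian_w2_barycenter(mu_train, sd_train, mu_full, sd_full, w=0.7))
```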
arXiv Detail & Related papers (2021-02-25T09:05:35Z)
- Regularization Helps with Mitigating Poisoning Attacks: Distributionally-Robust Machine Learning Using the Wasserstein Distance [14.095523601311374]
We use distributionally-robust optimization for machine learning to mitigate the effect of data poisoning attacks.
We relax the distributionally-robust machine learning problem by finding an upper bound for the worst-case fitness.
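A sketch of the regularization view behind such relaxations (a known bound for linear models, not necessarily the paper's exact one): the worst-case loss over an epsilon-Wasserstein ball around the data is controlled by the empirical loss plus epsilon times the weight norm, so robust training reduces to a penalized fit.

```python
# Wasserstein-DRO upper bound for logistic regression via norm penalty.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=500))   # labels in {-1, +1}

def dro_logistic(X, y, eps=0.1, iters=2000, lr=0.1):
    """Minimize empirical logistic loss + eps * ||w||_2 (DRO upper bound)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = y * (X @ w)
        g = -(X * (y / (1 + np.exp(m)))[:, None]).mean(0)  # loss gradient
        norm = np.linalg.norm(w)
        g += eps * (w / norm if norm > 0 else 0.0)         # penalty subgrad
        w -= lr * g
    return w

print("robust weights:", np.round(dro_logistic(X, y), 2))
```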
arXiv Detail & Related papers (2020-01-29T01:16:19Z)
- Domain Adaptive Bootstrap Aggregating [5.444459446244819]
Bootstrap aggregating, or bagging, is a popular method for improving the stability of predictive algorithms.
This article proposes a domain adaptive bagging method coupled with a new iterative nearest neighbor sampler.
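A rough sketch of the flavor of this approach, not the authors' iterative sampler: draw each bootstrap bag from training points that are nearest neighbors of test-domain points, so every bagged learner trains on target-like data.

```python
# Domain-adaptive bagging: bags resampled near the test distribution.
import numpy as np

rng = np.random.default_rng(9)

X_train = rng.normal(size=(1000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(loc=0.8, size=(200, 2))          # shifted domain

def domain_adaptive_bags(X_train, X_test, n_bags=10, k=20):
    """Yield bootstrap index sets biased toward the test domain."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]                # k-NN per test point
    n = len(X_train)
    for _ in range(n_bags):
        t = rng.integers(len(X_test), size=n)        # anchor test points
        j = rng.integers(k, size=n)
        yield nn[t, j]                               # sampled train indices

votes = np.zeros(len(X_test))
for idx in domain_adaptive_bags(X_train, X_test):
    # Tiny stand-in base learner: 1-nearest-neighbor on the bag.
    Xb, yb = X_train[idx], y_train[idx]
    d = np.linalg.norm(X_test[:, None, :] - Xb[None, :, :], axis=2)
    votes += yb[d.argmin(axis=1)]
print("bagged positive rate:", (votes / 10 > 0.5).mean())
```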
arXiv Detail & Related papers (2020-01-12T20:02:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.