Sampling Bias Correction for Supervised Machine Learning: A Bayesian
Inference Approach with Practical Applications
- URL: http://arxiv.org/abs/2203.06239v2
- Date: Tue, 15 Mar 2022 02:37:37 GMT
- Title: Sampling Bias Correction for Supervised Machine Learning: A Bayesian
Inference Approach with Practical Applications
- Authors: Max Sklar
- Abstract summary: We consider supervised learning where the training set has been subject to a known sampling bias, and correct for it within the Bayesian inference framework by altering the posterior to account for the sampling function.
We then apply this solution to binary logistic regression, and discuss scenarios where a dataset might be subject to intentional sample bias such as label imbalance.
This technique is widely applicable for statistical inference on big data, from the medical sciences to image recognition to marketing.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a supervised machine learning problem where the training set has been
subject to a known sampling bias, how can a model be trained to fit the
original dataset? We achieve this through the Bayesian inference framework by
altering the posterior distribution to account for the sampling function. We
then apply this solution to binary logistic regression, and discuss scenarios
where a dataset might be subject to intentional sample bias such as label
imbalance. This technique is widely applicable for statistical inference on big
data, from the medical sciences to image recognition to marketing. Familiarity
with it will give the practitioner tools to improve their inference pipeline
from data collection to model selection.
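A concrete numpy sketch of the label-imbalance case (illustrative only, not the paper's full Bayesian treatment; the data, rates, and helper below are made up): under a logistic model, subsampling the classes at known rates s_pos and s_neg shifts only the intercept, by log(s_pos / s_neg), so the slopes can be trusted and the intercept corrected after the fit.

```python
# Minimal sketch: correct a logistic model trained on an intentionally
# label-imbalanced subsample. Assumes positives were kept at rate s_pos
# and negatives at rate s_neg; under the logit model this shifts only
# the intercept, by log(s_pos / s_neg).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "original" dataset with rare positives.
n = 200_000
X = rng.normal(size=(n, 2))
true_w, true_b = np.array([1.5, -2.0]), -3.0
p = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
y = rng.random(n) < p

# Intentional sampling bias: keep all positives, 10% of negatives.
s_pos, s_neg = 1.0, 0.1
keep = np.where(y, True, rng.random(n) < s_neg)
Xb, yb = X[keep], y[keep]

def fit_logistic(X, y, iters=500, lr=0.5):
    """Plain gradient-descent logistic regression (no regularization)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

w_hat, b_hat = fit_logistic(Xb, yb.astype(float))

# Undo the known sampling function after training.
b_corrected = b_hat - np.log(s_pos / s_neg)
print("slopes:", w_hat, "intercept (biased, corrected):", b_hat, b_corrected)
```

The corrected intercept should land near the true value of -3, while the uncorrected one is inflated by log(1/0.1) ≈ 2.3, which is exactly the log-odds introduced by the sampling.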
Related papers
- Learning Augmentation Policies from A Model Zoo for Time Series Forecasting [58.66211334969299]
We introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning.
By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance.
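A toy of the underlying idea, not AutoTSAug itself: treat the augmentation strength as a policy learned from reward, here a simple epsilon-greedy bandit over jitter scales, rewarded by the change in forecast error after augmenting the hardest ("marginal") series. The naive forecaster and the candidate scales are invented for illustration.

```python
# Bandit-style "learnable augmentation policy" sketch (assumed toy).
import numpy as np

rng = np.random.default_rng(1)
scales = [0.01, 0.05, 0.1, 0.2]          # candidate jitter magnitudes
value = np.zeros(len(scales))            # running reward estimates
counts = np.zeros(len(scales))

def forecast_error(train, val):
    """Stand-in forecaster: predict each series' last training value."""
    return np.mean((val - train[:, -1]) ** 2)

train = rng.normal(size=(64, 30)).cumsum(axis=1)   # random-walk series
val = train[:, -1] + rng.normal(scale=0.1, size=64)

base = forecast_error(train, val)
per_series = (val - train[:, -1]) ** 2
marginal = per_series > np.quantile(per_series, 0.75)  # hardest quartile

for _ in range(200):
    a = rng.integers(len(scales)) if rng.random() < 0.1 else value.argmax()
    aug = train.copy()
    aug[marginal] += rng.normal(scale=scales[a], size=aug[marginal].shape)
    reward = base - forecast_error(aug, val)       # improvement as reward
    counts[a] += 1
    value[a] += (reward - value[a]) / counts[a]    # incremental mean

print("learned jitter scale:", scales[int(value.argmax())])
```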
arXiv Detail & Related papers (2024-09-10T07:34:19Z)
- Towards Bayesian Data Selection [0.0]
Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization.
We embed this kind of data addition into decision theory by framing data selection as a decision problem.
For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion.
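A hedged sketch of the decision-theoretic framing, not the paper's Bayes criterion: score each unlabeled point by its expected utility under the posterior predictive, with assumed costs for correct and incorrect pseudo-labels, and add only points whose expected utility is positive.

```python
# Data selection as a decision problem (toy utilities, stand-in posterior).
import numpy as np

rng = np.random.default_rng(2)

# Posterior predictive probabilities for unlabeled points (random stand-ins).
p = rng.random(1000)

# Utility of adding a pseudo-label: +1 if correct, -4 if wrong (assumed
# costs). Expected utility under the predictive, for the argmax label:
conf = np.maximum(p, 1 - p)
expected_utility = conf * 1.0 + (1 - conf) * (-4.0)

selected = expected_utility > 0          # decision rule: positive utility
print(f"pseudo-labeling {selected.sum()} of {len(p)} points")
```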
arXiv Detail & Related papers (2024-06-18T12:40:15Z)
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
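A minimal sketch of rejection via a density ratio (the ratio here is a hand-written stand-in, not the paper's learned one): abstain whenever the estimated ratio of the idealized to the actual data density falls below a threshold.

```python
# Classification with rejection via a density ratio r(x) = p_ideal / p_data.
import numpy as np

def predict_with_rejection(model, log_ratio_fn, X, tau=0.5):
    """Return predictions, with None where the density ratio is too low."""
    r = np.exp(log_ratio_fn(X))
    preds = model(X)
    return [p if ri >= tau else None for p, ri in zip(preds, r)]

# Toy usage: a trivial classifier and a Gaussian-shaped log-ratio that
# down-weights outlying inputs (both are assumptions for illustration).
X = np.random.default_rng(3).normal(size=(5, 1))
model = lambda X: (X[:, 0] > 0).astype(int)
log_ratio = lambda X: -0.5 * (X[:, 0] ** 2)
print(predict_with_rejection(model, log_ratio, X))
```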
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
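The key departure from "impute-then-generate" can be sketched as a loss masked to observed entries, so missing cells never contribute an imputed, biased target. A simplified one-step denoising version in numpy (an assumed simplification; the toy denoiser stands in for a real model):

```python
# Masked denoising loss: only observed entries drive the objective.
import numpy as np

rng = np.random.default_rng(4)

X = rng.normal(size=(128, 8))                 # tabular batch
mask = rng.random(X.shape) < 0.8              # True where observed

def masked_denoising_loss(denoise_fn, X, mask, sigma=0.5):
    """MSE between predicted and true noise, on observed entries only."""
    eps = rng.normal(size=X.shape)
    X_noisy = np.where(mask, X + sigma * eps, 0.0)   # zero out missing
    eps_hat = denoise_fn(X_noisy, sigma)
    return np.sum(mask * (eps_hat - eps) ** 2) / mask.sum()

# Toy denoiser: predicts zero noise everywhere (a real model goes here).
loss = masked_denoising_loss(lambda Xn, s: np.zeros_like(Xn), X, mask)
print("masked loss:", loss)   # ~1.0, the variance of the unexplained noise
```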
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
- Robust Outlier Rejection for 3D Registration with Variational Bayes [70.98659381852787]
We develop a novel variational non-local network-based outlier rejection framework for robust alignment.
We propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation.
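A far simpler stand-in for the paper's variational network, showing only the voting mechanic: sample minimal correspondence subsets, fit a rigid transform with the Kabsch algorithm, and let each hypothesis vote for the correspondences it explains.

```python
# Voting-based inlier search for rigid 3D registration (toy version).
import numpy as np

rng = np.random.default_rng(5)

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping points P onto Q."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, Q.mean(0) - R @ P.mean(0)

# Synthetic correspondences: 70 inliers under a true motion, 30 junk.
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
P = rng.normal(size=(100, 3))
Q = P @ R_true.T + np.array([1.0, 2.0, 3.0])
Q[70:] = rng.normal(size=(30, 3))               # outlier matches

votes = np.zeros(len(P))
for _ in range(200):                            # hypothesis voting
    idx = rng.choice(len(P), size=3, replace=False)
    R, t = kabsch(P[idx], Q[idx])
    votes += np.linalg.norm(P @ R.T + t - Q, axis=1) < 0.05

inliers = votes > 0.5 * votes.max()             # high-vote cluster
R, t = kabsch(P[inliers], Q[inliers])           # final estimate
print("recovered translation:", np.round(t, 2))  # ~[1, 2, 3]
```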
arXiv Detail & Related papers (2023-04-04T03:48:56Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
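A crude sketch of learning from aggregates only; where the paper uses a maximum-entropy hypothesis for the unobserved feature distribution, this toy uses a much blunter stand-in, placing each group's mass at its mean and fitting a count-weighted logistic model to (mean, positive-count) pairs.

```python
# Count-weighted logistic fit on aggregated (group mean, positives) data.
import numpy as np

rng = np.random.default_rng(6)

# Aggregated data: per group, a feature mean, a size, a positive count.
means = rng.normal(size=(50, 3))
sizes = rng.integers(50, 200, size=50)
true_w = np.array([1.0, -1.0, 0.5])
rates = 1 / (1 + np.exp(-means @ true_w))
positives = rng.binomial(sizes, rates)

w = np.zeros(3)
for _ in range(2000):                       # weighted logistic regression
    p = 1 / (1 + np.exp(-means @ w))
    grad = means.T @ (sizes * p - positives) / sizes.sum()
    w -= 1.0 * grad
print("recovered weights:", np.round(w, 2))
```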
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms [35.29582673348303]
Sampling biases in training data are a major source of algorithmic biases in machine learning systems.
We present adaptive sampling methods to determine, with high confidence, whether it is possible to assemble a representative dataset from the given data sources.
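The feasibility question itself reduces to a small linear program: a target vector of group proportions is achievable exactly when it lies in the convex hull of the sources' proportion vectors. A sketch assuming the proportions are known exactly (the paper's contribution is deciding this adaptively from noisy samples):

```python
# Convex hull membership as an LP feasibility check.
import numpy as np
from scipy.optimize import linprog

# Each row: a data source's group proportions (assumed known here).
sources = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]])
target = np.array([0.3, 0.3, 0.4])

# Find mixture weights w >= 0, sum w = 1, with w @ sources = target.
k = len(sources)
A_eq = np.vstack([sources.T, np.ones(k)])
b_eq = np.append(target, 1.0)
res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
print("representative dataset feasible:", res.success, res.x)
```

With the numbers above, the mixture (0.2, 0.2, 0.6) reproduces the target exactly, so the LP reports feasibility.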
arXiv Detail & Related papers (2022-04-13T23:14:05Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
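For reference, plain split conformal prediction, the baseline such methods build on; the paper's method additionally reweights for the feedback-induced distribution shift, which this sketch omits.

```python
# Split conformal interval from calibration residuals (standard recipe).
import numpy as np

rng = np.random.default_rng(7)

def conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Interval with ~90% coverage under exchangeability."""
    scores = np.abs(y_cal - model(X_cal))            # residual scores
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    pred = model(np.atleast_2d(x_new))[0]
    return pred - q, pred + q

X_cal = rng.normal(size=(500, 1))
y_cal = 2 * X_cal[:, 0] + rng.normal(scale=0.3, size=500)
model = lambda X: 2 * X[:, 0]                        # pretrained stand-in
print(conformal_interval(model, X_cal, y_cal, np.array([1.0])))
```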
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff [66.59869239999459]
In finance, imputation of missing returns may be applied prior to training a portfolio optimization model.
There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data.
We propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation.
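An illustrative special case, not the paper's estimator: when both posteriors are one-dimensional Gaussians, the Wasserstein-2 barycenter simply interpolates mean and standard deviation, making the bias-variance dial explicit. The numbers below are invented.

```python
# W2 barycenter of two 1-D Gaussians: interpolate mean and std linearly.
def gaussian_w2_barycenter(mu1, sd1, mu2, sd2, w=0.5):
    """Barycenter of two 1-D Gaussians with weight w on the first."""
    return w * mu1 + (1 - w) * mu2, w * sd1 + (1 - w) * sd2

mu_train, sd_train = 0.02, 0.10   # imputation from training data only
mu_full,  sd_full  = 0.01, 0.04   # imputation using the full sample
print(gaussian_w2_barycenter(mu_train, sd_train, mu_full, sd_full, w=0.7))
```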
arXiv Detail & Related papers (2021-02-25T09:05:35Z)
- Regularization Helps with Mitigating Poisoning Attacks: Distributionally-Robust Machine Learning Using the Wasserstein Distance [14.095523601311374]
We use distributionally-robust optimization for machine learning to mitigate the effect of data poisoning attacks.
We relax the distributionally-robust machine learning problem by finding an upper bound for the worst-case fitness.
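A sketch of the regularization view behind such relaxations (a known bound for linear models, not necessarily the paper's exact one): the worst-case loss over an epsilon-Wasserstein ball around the data is controlled by the empirical loss plus epsilon times the weight norm, so robust training reduces to a penalized fit.

```python
# Wasserstein-DRO upper bound for logistic regression via norm penalty.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=500))   # labels in {-1, +1}

def dro_logistic(X, y, eps=0.1, iters=2000, lr=0.1):
    """Minimize empirical logistic loss + eps * ||w||_2 (DRO upper bound)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = y * (X @ w)
        g = -(X * (y / (1 + np.exp(m)))[:, None]).mean(0)  # loss gradient
        norm = np.linalg.norm(w)
        g += eps * (w / norm if norm > 0 else 0.0)         # penalty subgrad
        w -= lr * g
    return w

print("robust weights:", np.round(dro_logistic(X, y), 2))
```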
arXiv Detail & Related papers (2020-01-29T01:16:19Z)
- Domain Adaptive Bootstrap Aggregating [5.444459446244819]
Bootstrap aggregating, or bagging, is a popular method for improving the stability of predictive algorithms.
This article proposes a domain adaptive bagging method coupled with a new iterative nearest neighbor sampler.
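A rough sketch of the flavor of this approach, not the authors' iterative sampler: draw each bootstrap bag from training points that are nearest neighbors of test-domain points, so every bagged learner trains on target-like data.

```python
# Domain-adaptive bagging: bags resampled near the test distribution.
import numpy as np

rng = np.random.default_rng(9)

X_train = rng.normal(size=(1000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(loc=0.8, size=(200, 2))          # shifted domain

def domain_adaptive_bags(X_train, X_test, n_bags=10, k=20):
    """Yield bootstrap index sets biased toward the test domain."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]                # k-NN per test point
    n = len(X_train)
    for _ in range(n_bags):
        t = rng.integers(len(X_test), size=n)        # anchor test points
        j = rng.integers(k, size=n)
        yield nn[t, j]                               # sampled train indices

votes = np.zeros(len(X_test))
for idx in domain_adaptive_bags(X_train, X_test):
    # Tiny stand-in base learner: 1-nearest-neighbor on the bag.
    Xb, yb = X_train[idx], y_train[idx]
    d = np.linalg.norm(X_test[:, None, :] - Xb[None, :, :], axis=2)
    votes += yb[d.argmin(axis=1)]
print("bagged positive rate:", (votes / 10 > 0.5).mean())
```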
arXiv Detail & Related papers (2020-01-12T20:02:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.