Towards Bayesian Data Selection
- URL: http://arxiv.org/abs/2406.12560v2
- Date: Mon, 24 Jun 2024 08:27:13 GMT
- Title: Towards Bayesian Data Selection
- Authors: Julian Rodemann
- Abstract summary: Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization.
We embed this kind of data addition into decision theory by framing data selection as a decision problem.
For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A wide range of machine learning algorithms iteratively add data to the training sample. Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization. We embed this kind of data addition into decision theory by framing data selection as a decision problem. This paves the way for finding Bayes-optimal selections of data. For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion. We further show that deploying this criterion mitigates the issue of confirmation bias by empirically assessing our method for generalized linear models, semi-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data.
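The abstract's framing of data selection as a decision problem can be illustrated with a minimal, hypothetical sketch of self-training. Each unlabeled point is treated as an action, and the point that maximizes a utility function is added together with its pseudo-label. Everything named here (`fit_centroids`, `predict_proba`, `self_train`, `confidence_utility`, the toy 1-D data) is an assumption made for illustration. The utility is plain predictive confidence, a common baseline, and not the Bayes criterion derived in the paper, whose form is not given in this abstract.

```python
import math

def fit_centroids(labeled):
    """Return the per-class mean of 1-D features from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the class centroids."""
    scores = {y: -(x - c) ** 2 for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def confidence_utility(centroids, x):
    """Placeholder utility (assumption): highest predicted class probability."""
    return max(predict_proba(centroids, x).values())

def self_train(labeled, unlabeled, utility, rounds):
    """Iteratively add the utility-maximizing unlabeled point with its pseudo-label."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(min(rounds, len(unlabeled))):
        centroids = fit_centroids(labeled)
        # Decision step: every candidate x is an action; choose the maximizer.
        best = max(unlabeled, key=lambda x: utility(centroids, x))
        probs = predict_proba(centroids, best)
        pseudo_label = max(probs, key=probs.get)
        labeled.append((best, pseudo_label))
        unlabeled.remove(best)
    return labeled

# Toy data (hypothetical): two labeled points and three unlabeled ones.
result = self_train([(0.0, 0), (4.0, 1)], [0.5, 1.9, 3.8],
                    confidence_utility, rounds=3)
```

Passing the selection rule as the `utility` argument keeps the loop unchanged when a different criterion is substituted. Because the confidence baseline scores candidates with the same model that assigns their pseudo-labels, it is the kind of rule the abstract associates with confirmation bias.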
Related papers
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
arXiv Detail & Related papers (2024-04-26T06:00:27Z)
- Multiply Robust Estimation for Local Distribution Shifts with Multiple Domains [9.429772474335122]
We focus on scenarios where data distributions vary across multiple segments of the entire population.
We propose a two-stage multiply robust estimation method to improve model performance on each individual segment.
Our method is designed to be implemented with commonly used off-the-shelf machine learning models.
arXiv Detail & Related papers (2024-02-21T22:01:10Z)
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
- In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for Self-Training in Semi-Supervised Learning [0.0]
Self-training is a simple yet effective method within semi-supervised learning.
In this paper, we aim to make pseudo-label selection (PLS) more robust to the modeling assumptions involved.
Results suggest that in particular robustness w.r.t. model choice can lead to substantial accuracy gains.
arXiv Detail & Related papers (2023-03-02T10:00:37Z)
- Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms [35.29582673348303]
Sampling biases in training data are a major source of algorithmic biases in machine learning systems.
We present adaptive sampling methods to determine, with high confidence, whether it is possible to assemble a representative dataset from the given data sources.
arXiv Detail & Related papers (2022-04-13T23:14:05Z)
- Sampling Bias Correction for Supervised Machine Learning: A Bayesian Inference Approach with Practical Applications [0.0]
We discuss a problem where a dataset might be subject to intentional sample bias such as label imbalance.
We then apply this solution to binary logistic regression, and discuss scenarios where a dataset might be subject to intentional sample bias.
This technique is widely applicable for statistical inference on big data, from the medical sciences to image recognition to marketing.
arXiv Detail & Related papers (2022-03-11T20:46:37Z)
- Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z)
- Evaluating State-of-the-Art Classification Models Against Bayes Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
arXiv Detail & Related papers (2021-06-07T06:21:20Z)
- Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round.
Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z)
- Model Fusion with Kullback--Leibler Divergence [58.20269014662046]
We propose a method to fuse posterior distributions learned from heterogeneous datasets.
Our algorithm relies on a mean field assumption for both the fused model and the individual dataset posteriors.
arXiv Detail & Related papers (2020-07-13T03:27:45Z)
- BayesFlow: Learning complex stochastic models with invertible neural networks [3.1498833540989413]
We propose a novel method for globally amortized Bayesian inference based on invertible neural networks.
BayesFlow incorporates a summary network trained to embed the observed data into maximally informative summary statistics.
We demonstrate the utility of BayesFlow on challenging intractable models from population dynamics, epidemiology, cognitive science and ecology.
arXiv Detail & Related papers (2020-03-13T13:39:31Z)