Related papers: Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms

URL: http://arxiv.org/abs/2204.06664v1
Date: Wed, 13 Apr 2022 23:14:05 GMT
Title: Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms
Authors: Laura Niss, Yuekai Sun, Ambuj Tewari
Abstract summary: Sampling biases in training data are a major source of algorithmic biases in machine learning systems. We present adaptive sampling methods to determine, with high confidence, whether it is possible to assemble a representative dataset from the given data sources.
Score: 35.29582673348303
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sampling biases in training data are a major source of algorithmic biases in machine learning systems. Although there are many methods that attempt to mitigate such algorithmic biases during training, the most direct and obvious way is simply collecting more representative training data. In this paper, we consider the task of assembling a training dataset in which minority groups are adequately represented from a given set of data sources. In essence, this is an adaptive sampling problem to determine if a given point lies in the convex hull of the means from a set of unknown distributions. We present adaptive sampling methods to determine, with high confidence, whether it is possible to assemble a representative dataset from the given data sources. We also demonstrate the efficacy of our policies in simulations in a Bernoulli and a multinomial setting.
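
Illustration (not from the paper): the abstract frames the task as deciding whether a target point lies in the convex hull of unknown source means. In one dimension the hull is just the interval between the smallest and largest mean, so a rough sketch is possible with Hoeffding confidence bounds. The function name `hull_feasibility_1d`, its parameters, and the uniform (non-adaptive) sampling rule below are all assumptions made for illustration; the paper's own policies are adaptive and are not reproduced here.

```python
import math
import random


def hull_feasibility_1d(sources, target, delta=0.05, max_rounds=10_000):
    """Decide whether `target` lies between the smallest and largest source mean.

    `sources` is a list of zero-argument callables returning values in [0, 1]
    (hypothetical interface). Returns "feasible", "infeasible", or "undecided".
    """
    k = len(sources)
    sums = [0.0] * k
    for n in range(1, max_rounds + 1):
        # Uniform allocation: one draw per source per round (an assumption,
        # not the adaptive allocation studied in the paper).
        for i, draw in enumerate(sources):
            sums[i] += draw()
        # Hoeffding radius with a union bound over all sources and rounds,
        # so every interval holds simultaneously with probability >= 1 - delta.
        radius = math.sqrt(math.log(4 * k * n * n / delta) / (2 * n))
        lower = [s / n - radius for s in sums]
        upper = [s / n + radius for s in sums]
        # Feasible: one mean is confidently <= target and another is >= target.
        if min(upper) <= target <= max(lower):
            return "feasible"
        # Infeasible: every mean is confidently on the same side of the target.
        if min(lower) > target or max(upper) < target:
            return "infeasible"
    return "undecided"


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical Bernoulli sources with minority-group rates 0.2 and 0.8.
    bernoulli = [lambda p=p: float(rng.random() < p) for p in (0.2, 0.8)]
    print(hull_feasibility_1d(bernoulli, target=0.5))
```

In higher dimensions, such as the multinomial setting, the interval test would have to be replaced by a proper convex hull membership check, for example a linear program over mixture weights.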

Related papers

Adaptive teachers for amortized samplers [76.88721198565861]
Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions.
arXiv Detail & Related papers (2024-10-02T11:33:13Z)
Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators. Our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset. We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
Balanced Data Sampling for Language Model Training with Clustering [96.46042695333655]
We propose ClusterClip Sampling to balance the text distribution of training data for better model training. Extensive experiments validate the effectiveness of ClusterClip Sampling.
arXiv Detail & Related papers (2024-02-22T13:20:53Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs). In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
Group Distributionally Robust Dataset Distillation with Risk Minimization [18.07189444450016]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct dataset distillation (DD). We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
arXiv Detail & Related papers (2024-02-07T09:03:04Z)
Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets [24.551465814633325]
Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data in a semi-supervised manner would harm the generalization performance. We propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset.
arXiv Detail & Related papers (2022-06-17T14:29:52Z)
Sampling Bias Correction for Supervised Machine Learning: A Bayesian Inference Approach with Practical Applications [0.0]
We discuss scenarios where a dataset might be subject to intentional sample bias, such as label imbalance, and apply a Bayesian bias-correction approach to binary logistic regression. This technique is widely applicable for statistical inference on big data, from the medical sciences to image recognition to marketing.
arXiv Detail & Related papers (2022-03-11T20:46:37Z)
Sequential Targeting: an incremental learning approach for data imbalance in text classification [7.455546102930911]
Methods to handle imbalanced datasets are crucial for alleviating distributional skews. We propose a novel training method, Sequential Targeting (ST), independent of the effectiveness of the representation method. We demonstrate the effectiveness of our method through experiments on simulated benchmark datasets (IMDB) and data collected from NAVER.
arXiv Detail & Related papers (2020-11-20T04:54:00Z)
Optimal Importance Sampling for Federated Learning [57.14673504239551]
Federated learning involves a mixture of centralized and decentralized processing tasks. The sampling of both agents and data is generally uniform; however, in this work we consider non-uniform sampling. We derive optimal importance sampling strategies for both agent and data selection and show that non-uniform sampling without replacement improves the performance of the original FedAvg algorithm.
arXiv Detail & Related papers (2020-10-26T14:15:33Z)
Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model. The objective is to endow the trained model with robustness against adversarially manipulated input data. Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
Domain Adaptive Bootstrap Aggregating [5.444459446244819]
Bootstrap aggregating, or bagging, is a popular method for improving the stability of predictive algorithms. This article proposes a domain adaptive bagging method coupled with a new iterative nearest neighbor sampler.
arXiv Detail & Related papers (2020-01-12T20:02:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.