Adaptive Sampling Strategies to Construct Equitable Training Datasets
- URL: http://arxiv.org/abs/2202.01327v1
- Date: Mon, 31 Jan 2022 19:19:30 GMT
- Title: Adaptive Sampling Strategies to Construct Equitable Training Datasets
- Authors: William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda
Bogen, Stevie Bergman, Sharad Goel
- Abstract summary: In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
- Score: 0.7036032466145111
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In domains ranging from computer vision to natural language processing,
machine learning models have been shown to exhibit stark disparities, often
performing worse for members of traditionally underserved groups. One factor
contributing to these performance gaps is a lack of representation in the data
the models are trained on. It is often unclear, however, how to operationalize
representativeness in specific applications. Here we formalize the problem of
creating equitable training datasets, and propose a statistical framework for
addressing this problem. We consider a setting where a model builder must
decide how to allocate a fixed data collection budget to gather training data
from different subgroups. We then frame dataset creation as a constrained
optimization problem, in which one maximizes a function of group-specific
performance metrics based on (estimated) group-specific learning rates and
costs per sample. This flexible approach incorporates preferences of
model-builders and other stakeholders, as well as the statistical properties of
the learning task. When data collection decisions are made sequentially, we
show that under certain conditions this optimization problem can be efficiently
solved even without prior knowledge of the learning rates. To illustrate our
approach, we conduct a simulation study of polygenic risk scores on synthetic
genomic data -- an application domain that often suffers from
non-representative data collection. We find that our adaptive sampling strategy
outperforms several common data collection heuristics, including equal and
proportional sampling, demonstrating the value of strategic dataset design for
building equitable models.
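The sequential data-collection setting described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the paper's actual algorithm: the power-law learning curves, per-group costs, and the maximin-per-cost selection heuristic are all illustrative assumptions.

```python
import math
import random

random.seed(0)

# Hypothetical setup: each group's error follows a power law
# err_g(n) = a_g * n**(-b_g); the true (a_g, b_g) are hidden from the sampler
# and must be estimated from noisy observations, as in the sequential setting.
TRUE_CURVES = {"A": (1.0, 0.50), "B": (1.2, 0.35)}   # hidden (a_g, b_g)
COST = {"A": 1.0, "B": 2.0}                          # cost per sample, per group
BUDGET = 600.0
BATCH = 10

def observe_error(g, n):
    """Noisy measurement of group g's error after training on n samples."""
    a, b = TRUE_CURVES[g]
    return a * n ** (-b) * (1.0 + random.uniform(-0.05, 0.05))

def fit_power_law(obs):
    """Least-squares fit of log err = log a - b * log n on observed points."""
    xs = [math.log(n) for n, _ in obs]
    ys = [math.log(e) for _, e in obs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return math.exp(my + b * mx), b

def adaptive_sample():
    # Warm start: two batches per group so learning rates are estimable.
    n = {g: 2 * BATCH for g in TRUE_CURVES}
    obs = {
        g: [(BATCH, observe_error(g, BATCH)), (n[g], observe_error(g, n[g]))]
        for g in TRUE_CURVES
    }
    spent = sum(n[g] * COST[g] for g in n)
    while spent + BATCH * max(COST.values()) <= BUDGET:
        # Maximin-style heuristic: spend the next batch on the group whose
        # predicted error, adjusted for its sampling cost, is worst.
        preds = {}
        for g in n:
            a, b = fit_power_law(obs[g])
            preds[g] = a * n[g] ** (-max(b, 1e-3)) / COST[g]
        g = max(preds, key=preds.get)
        n[g] += BATCH
        spent += BATCH * COST[g]
        obs[g].append((n[g], observe_error(g, n[g])))
    return n

allocation = adaptive_sample()
```

In this sketch the allocation typically ends up unequal, unlike the equal- and proportional-sampling baselines the paper compares against, because the sampler keeps re-estimating each group's learning rate and steering the remaining budget toward the group that benefits most per unit cost.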
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- The Data Addition Dilemma [4.869513274920574]
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources.
But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings?
We identify this situation as the "Data Addition Dilemma", demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance.
arXiv Detail & Related papers (2024-08-08T01:42:31Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Deep Learning with Multiple Data Set: A Weighted Goal Programming Approach [2.7393821783237184]
Large-scale data analysis is growing at an exponential rate as data proliferates in our societies.
Deep learning models require substantial computational resources, so distributed training is needed.
This paper presents a Multicriteria approach for distributed learning.
arXiv Detail & Related papers (2021-11-27T07:10:25Z)
- A Topological-Framework to Improve Analysis of Machine Learning Model Performance [5.3893373617126565]
We propose a framework for evaluating machine learning models in which a dataset is treated as a "space" on which a model operates.
We describe a topological data structure, presheaves, which offer a convenient way to store and analyze model performance between different subpopulations.
arXiv Detail & Related papers (2021-07-09T23:11:13Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to improving subgroup performance and achieving population-level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Model-specific Data Subsampling with Influence Functions [37.64859614131316]
We develop a model-specific data subsampling strategy that improves over random sampling whenever training points have varying influence.
Specifically, we leverage influence functions to guide our selection strategy, proving theoretically, and demonstrating empirically that our approach quickly selects high-quality models.
arXiv Detail & Related papers (2020-10-20T12:10:28Z)
- Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
arXiv Detail & Related papers (2020-02-20T15:00:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.