Striving for data-model efficiency: Identifying data externalities on
group performance
- URL: http://arxiv.org/abs/2211.06348v1
- Date: Fri, 11 Nov 2022 16:48:27 GMT
- Title: Striving for data-model efficiency: Identifying data externalities on
group performance
- Authors: Esther Rolf, Ben Packer, Alex Beutel, Fernando Diaz
- Abstract summary: Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
- Score: 75.17591306911015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building trustworthy, effective, and responsible machine learning systems
hinges on understanding how differences in training data and modeling decisions
interact to impact predictive performance. In this work, we seek to better
understand how we might characterize, detect, and design for data-model
synergies. We focus on a particular type of data-model inefficiency, in which
adding training data from some sources can actually lower performance evaluated
on key sub-groups of the population, a phenomenon we refer to as negative data
externalities on group performance. Such externalities can arise in standard
learning settings and can manifest differently depending on conditions between
training set size and model size. Data externalities directly imply a lower
bound on feasible model improvements, yet improving models efficiently requires
understanding the underlying data-model tensions. From a broader perspective,
our results indicate that data-efficiency is a key component of both accurate
and trustworthy machine learning.
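A minimal sketch of how a negative data externality on group performance could be measured, assuming a synthetic two-group regression task and a pooled linear model. This is an illustration only, not the authors' experimental setup; the helper names (`fit_linear`, `make_group`, `externality_on_group`) and the chosen slopes and sample sizes are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w


def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((Xb @ w - y) ** 2))


def make_group(rng, n, slope):
    """Synthetic group (assumption): y = slope * x + small Gaussian noise."""
    X = rng.normal(size=(n, 1))
    y = slope * X[:, 0] + 0.1 * rng.normal(size=n)
    return X, y


def externality_on_group(base, extra, group_test):
    """Change in the group's test MSE when `extra` is added to training.

    A positive value means the added data raised error on the group,
    i.e. a negative externality in the sense described above.
    """
    Xb, yb = base
    Xe, ye = extra
    w_base = fit_linear(Xb, yb)
    w_all = fit_linear(np.vstack([Xb, Xe]), np.concatenate([yb, ye]))
    return mse(w_all, *group_test) - mse(w_base, *group_test)


rng = np.random.default_rng(0)
train_a = make_group(rng, 200, slope=1.0)    # data from group A
train_b = make_group(rng, 200, slope=-1.0)   # conflicting source: group B
test_a = make_group(rng, 1000, slope=1.0)    # held-out data from group A

delta = externality_on_group(train_a, train_b, test_a)
print(f"Change in group A test MSE after adding group B data: {delta:+.3f}")
```

In this toy setup the single linear model cannot fit both groups at once, so pooling the two sources raises group A's error. With a more flexible model (for example, one that takes group membership as a feature), the same measurement may show no externality, which is the kind of data-model interaction the abstract points to.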
Related papers
- Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection.
By leveraging influence scores, we effectively identify the most impactful data for system improvement.
We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z)
- Exploring the Impact of Dataset Statistical Effect Size on Model Performance and Data Sample Size Sufficiency [2.444909460562512]
We report on two experiments undertaken in an attempt to better ascertain whether or not basic descriptive statistical measures can be indicative of how effective a dataset will be at training a resulting model.
Our results appear to indicate that this is not effective for determining adequate sample size or projecting model performance, and therefore that additional work is still needed to better prospectively assess adequacy of data.
arXiv Detail & Related papers (2025-01-05T22:03:46Z)
- Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different types of preference data on model performance.
It aims to reduce the dependency of LLMs on extensive amounts of preference data, which are expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z)
- Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z)
- Decentralized Learning with Multi-Headed Distillation [12.90857834791378]
Decentralized learning with private data is a central problem in machine learning.
We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other.
arXiv Detail & Related papers (2022-11-28T21:01:43Z)
- Adaptive Sampling Strategies to Construct Equitable Training Datasets [0.7036032466145111]
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
arXiv Detail & Related papers (2022-01-31T19:19:30Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)