Generalized Group Data Attribution
- URL: http://arxiv.org/abs/2410.09940v2
- Date: Mon, 21 Oct 2024 14:36:35 GMT
- Title: Generalized Group Data Attribution
- Authors: Dan Ley, Suraj Srinivas, Shichang Zhang, Gili Rusak, Himabindu Lakkaraju
- Abstract summary: Data Attribution methods quantify the influence of individual training data points on model outputs.
Existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models.
We introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones.
- Score: 28.056149996461286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK results in up to 10x-50x speedups over standard DA methods while gracefully trading off attribution fidelity. For downstream applications such as dataset pruning and noisy label identification, we demonstrate that GGDA significantly improves computational efficiency and maintains effectiveness, enabling practical applications in large-scale machine learning scenarios that were previously infeasible.
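To make the group-attribution idea concrete, here is a minimal sketch (not the authors' implementation) of TracIn-style attribution at the group level: training points are partitioned into groups, each group's gradients are aggregated once, and influence is scored per group rather than per example. The k-means grouping, the single-checkpoint dot product, and all names below are illustrative assumptions.

```python
# Minimal sketch of group-level, TracIn-style attribution (not the authors' code).
# Assumptions: groups are formed by k-means on input features, and a group's
# attribution is the dot product between its aggregated gradient and the
# test-point gradient at a single checkpoint.
import numpy as np
from sklearn.cluster import KMeans

def group_attribution(train_grads, test_grad, features, n_groups=50, seed=0):
    """train_grads: (n_train, d) per-example gradients at a checkpoint.
    test_grad: (d,) gradient of the test loss.
    features: (n_train, p) representations used only to form groups."""
    labels = KMeans(n_clusters=n_groups, random_state=seed, n_init=10).fit_predict(features)
    scores = np.zeros(n_groups)
    for g in range(n_groups):
        # One aggregated gradient per group replaces n_g per-example gradients,
        # which is where the computational savings come from.
        group_grad = train_grads[labels == g].sum(axis=0)
        scores[g] = group_grad @ test_grad
    return labels, scores

# Example with random arrays standing in for real gradients and features.
rng = np.random.default_rng(0)
labels, scores = group_attribution(rng.normal(size=(1000, 64)),
                                   rng.normal(size=64),
                                   rng.normal(size=(1000, 16)),
                                   n_groups=20)
print(scores.argsort()[-5:])  # indices of the five most influential groups
```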
Related papers
- Wireless Channel Aware Data Augmentation Methods for Deep Learning-Based Indoor Localization [22.76179980847908]
We propose methods that utilize the domain knowledge about wireless propagation channels and devices.
We show that in the low-data regime, localization accuracy increases up to 50%, matching non-augmented results in the high-data regime.
The proposed methods may outperform the measurement-only high-data performance by up to 33% using only 1/4 of the amount of measured data.
arXiv Detail & Related papers (2024-08-12T19:01:49Z)
- Efficient Ensembles Improve Training Data Attribution [12.180392191924758]
Training data attribution methods aim to quantify the influence of individual data points on model predictions, with broad applications in data-centric AI.
Existing methods in this field, which can be categorized as retraining-based and gradient-based, have struggled with the trade-off between computational efficiency and attribution efficacy.
Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution.
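As a hedged illustration of the ensembling idea (not the paper's exact procedure), the sketch below simply averages the per-example attribution scores produced by several independently trained models; the function and variable names are assumptions.

```python
# Hedged sketch: average gradient-based attribution scores across K independently
# trained models to reduce the variance of any single model's attributions.
import numpy as np

def ensemble_attribution(per_model_scores):
    """per_model_scores: list of (n_train,) score arrays, one per trained model."""
    return np.mean(np.stack(per_model_scores, axis=0), axis=0)

# Toy usage: three noisy score vectors for the same five training points.
rng = np.random.default_rng(1)
true_scores = np.array([3.0, -1.0, 0.5, 2.0, -2.5])
noisy = [true_scores + rng.normal(scale=1.0, size=5) for _ in range(3)]
print(ensemble_attribution(noisy))  # closer to true_scores than any single run
```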
arXiv Detail & Related papers (2024-05-27T15:58:34Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
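A minimal sketch in the spirit of low-rank gradient similarity search, not the released LESS implementation: per-example gradients are compressed with a random projection, and training points are ranked by cosine similarity to the averaged gradient of a small target-task validation set. The projection dimension, selection fraction, and all names are illustrative assumptions.

```python
# Hedged sketch of gradient-similarity-based data selection.
import numpy as np

def select_by_gradient_similarity(train_grads, val_grads, proj_dim=128, top_frac=0.05, seed=0):
    """train_grads: (n_train, d); val_grads: (n_val, d) gradients for the target task."""
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    proj = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)  # random low-rank projection
    train_feat = train_grads @ proj
    target_feat = (val_grads @ proj).mean(axis=0)
    cos = (train_feat @ target_feat) / (
        np.linalg.norm(train_feat, axis=1) * np.linalg.norm(target_feat) + 1e-12)
    k = max(1, int(top_frac * len(cos)))
    return np.argsort(cos)[::-1][:k]  # indices of the most target-aligned training points

rng = np.random.default_rng(2)
idx = select_by_gradient_similarity(rng.normal(size=(1000, 512)), rng.normal(size=(20, 512)))
print(len(idx), idx[:5])
```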
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
The lack of a clear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z)
- Face Presentation Attack Detection by Excavating Causal Clues and Adapting Embedding Statistics [9.612556145185431]
Face presentation attack detection (PAD) uses domain adaptation (DA) and domain generalization (DG) techniques to address performance degradation on unknown domains.
Most DG-based PAD solutions rely on a priori knowledge, i.e., known domain labels.
This paper proposes to model face PAD as a compound DG task from a causal perspective, linking it to model optimization.
arXiv Detail & Related papers (2023-08-28T13:11:05Z)
- Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning [57.83232242068982]
Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms.
It remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL.
This work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy.
arXiv Detail & Related papers (2023-05-25T15:46:20Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
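A hedged sketch of cluster-level pseudo-labelling under generic assumptions (this is not the authors' pipeline): target-domain features are clustered, each cluster receives the majority class predicted by the source model, and every sample inherits its cluster's label for adaptation.

```python
# Hedged sketch of cluster-level pseudo-labelling for source-free adaptation.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(features, model_preds, n_clusters=7, seed=0):
    """features: (n, d) self-supervised target features; model_preds: (n,) predicted classes."""
    assign = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(features)
    pseudo = np.empty_like(model_preds)
    for c in range(n_clusters):
        mask = assign == c
        if not mask.any():
            continue
        # Majority vote over the source model's predictions inside the cluster.
        pseudo[mask] = np.bincount(model_preds[mask]).argmax()
    return pseudo

rng = np.random.default_rng(3)
print(cluster_pseudo_labels(rng.normal(size=(100, 32)), rng.integers(0, 7, size=100)))
```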
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification [34.15923302216751]
We present EPiDA, an easy plug-in data augmentation framework to support effective text classification.
EPiDA employs two mechanisms, relative entropy maximization (REM) and conditional entropy minimization (CEM), to control data generation.
EPiDA can support efficient and continuous data generation for effective classification training.
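The sketch below is a loose, hedged interpretation of the REM/CEM idea rather than the paper's exact objective: candidate augmentations are scored by a diversity term (KL divergence of the classifier's prediction from that on the source sentence) minus an uncertainty term (entropy of the candidate's prediction), and only the top-scoring candidates would be kept. The weights and names are assumptions.

```python
# Hedged sketch of entropy-based filtering for augmented text.
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def score_candidates(p_orig, p_cands, alpha=1.0, beta=1.0):
    """p_orig: (C,) predicted distribution on the source sentence.
    p_cands: (m, C) predicted distributions on m candidate augmentations."""
    return np.array([alpha * kl(p, p_orig) - beta * entropy(p) for p in p_cands])

p_orig = np.array([0.8, 0.1, 0.1])
p_cands = np.array([[0.7, 0.2, 0.1],     # diverse but still confident
                    [0.34, 0.33, 0.33]])  # diverse but uncertain -> penalized
print(score_candidates(p_orig, p_cands))
```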
arXiv Detail & Related papers (2022-04-24T06:53:48Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
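A minimal sketch of the linearization step that this style of augmentation relies on (the language-model training and sampling are omitted, and the tag scheme shown is just an example): label tokens are interleaved with words so a standard LM can generate new labeled sentences, which are then de-linearized back into (word, tag) pairs.

```python
# Hedged sketch of linearizing and de-linearizing tagged sentences.
def linearize(words, tags):
    out = []
    for w, t in zip(words, tags):
        if t != "O":            # only non-O tags are inserted, keeping sequences short
            out.append(t)
        out.append(w)
    return " ".join(out)

def delinearize(sequence, tag_prefixes=("B-", "I-")):
    words, tags, pending = [], [], "O"
    for tok in sequence.split():
        if tok.startswith(tag_prefixes):
            pending = tok
        else:
            words.append(tok)
            tags.append(pending)
            pending = "O"
    return words, tags

s = linearize(["John", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"])
print(s)               # "B-PER John lives in B-LOC Paris"
print(delinearize(s))  # (['John', 'lives', 'in', 'Paris'], ['B-PER', 'O', 'O', 'B-LOC'])
```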
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.