Finding High-Value Training Data Subset through Differentiable Convex
Programming
- URL: http://arxiv.org/abs/2104.13794v1
- Date: Wed, 28 Apr 2021 14:33:26 GMT
- Title: Finding High-Value Training Data Subset through Differentiable Convex
Programming
- Authors: Soumi Das, Arshdeep Singh, Saptarshi Chatterjee, Suparna Bhattacharya,
Sourangshu Bhattacharya
- Abstract summary: In this paper, we study the problem of selecting high-value subsets of training data.
The key idea is to design a learnable framework for online subset selection.
Using this framework, we design an online alternating minimization-based algorithm for jointly learning the parameters of the selection model and ML model.
- Score: 5.5180456567480896
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Finding valuable training data points for deep neural networks has been a
core research challenge with many applications. In recent years, various
techniques for calculating the "value" of individual training datapoints have
been proposed for explaining trained models. However, the value of a training
datapoint also depends on other selected training datapoints - a notion that is
not explicitly captured by existing methods. In this paper, we study the
problem of selecting high-value subsets of training data. The key idea is to
design a learnable framework for online subset selection, which can be learned
using mini-batches of training data, thus making our method scalable. This
results in a parameterized convex subset selection problem that is amenable to
a differentiable convex programming paradigm, thus allowing us to learn the
parameters of the selection model in end-to-end training. Using this framework,
we design an online alternating minimization-based algorithm for jointly
learning the parameters of the selection model and ML model. Extensive
evaluation on a synthetic dataset, and three standard datasets, show that our
algorithm finds consistently higher value subsets of training data, compared to
the recent state-of-the-art methods, sometimes ~20% higher value than existing
methods. The subsets are also useful in finding mislabelled training data. Our
algorithm takes running time comparable to the existing valuation functions.
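As a rough illustration of alternating between a model fit and a relaxed convex subset-selection step, the sketch below may help. It is not the paper's algorithm: it swaps in a simpler loss-based selection rule (closer to self-paced weighting), has no learned selection model, and passes no gradient through the convex program. The toy data and the names `select_weights`, `fit_slope`, `alternate`, `k` and `lam` are all assumptions made for this example.

```python
def select_weights(losses, k, lam=1.0, iters=60):
    """Relaxed subset selection as a small convex program (illustrative only).

    Solves: min_s sum(s_i * loss_i) + (lam / 2) * sum(s_i ** 2)
            subject to sum(s) = k and 0 <= s_i <= 1.
    The KKT conditions give s_i = clip((tau - loss_i) / lam, 0, 1);
    the multiplier tau is found by bisection.
    """
    def weights(tau):
        return [min(1.0, max(0.0, (tau - l) / lam)) for l in losses]

    lo, hi = min(losses), max(losses) + lam  # sum is 0 at lo and n at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if sum(weights(tau)) < k:
            lo = tau
        else:
            hi = tau
    return weights(0.5 * (lo + hi))


def fit_slope(xs, ys, s):
    """Weighted least squares for the one-parameter model y = w * x."""
    num = sum(si * x * y for si, x, y in zip(s, xs, ys))
    den = sum(si * x * x for si, x in zip(s, xs))
    return num / den


def alternate(xs, ys, k, rounds=10):
    """Alternate between fitting the model and re-solving the selection."""
    s = [1.0] * len(xs)  # start with every point fully selected
    for _ in range(rounds):
        w = fit_slope(xs, ys, s)                         # model step
        losses = [(y - w * x) ** 2 for x, y in zip(xs, ys)]
        s = select_weights(losses, k)                    # selection step
    return w, s


# Hypothetical toy data: y = 2x, with two "mislabelled" points (sign flipped).
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]
ys[2], ys[7] = -ys[2], -ys[7]

w, s = alternate(xs, ys, k=6)
print("slope:", w)
print("selection weights:", [round(v, 2) for v in s])
```

On this toy problem the two sign-flipped points should end with selection weight zero, loosely echoing the abstract's remark that selected subsets can help expose mislabelled data. The paper's method instead learns a parameterized selection model from mini-batches.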
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the effect of a small "forget set" of training data on a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Learning the Regularization Strength for Deep Fine-Tuning via a Data-Emphasized Variational Objective [4.453137996095194]
Grid search is computationally expensive, requires carving out a validation set, and requires practitioners to specify candidate values.
Our proposed technique overcomes all three disadvantages of grid search.
We demonstrate effectiveness on image classification tasks on several datasets, yielding heldout accuracy comparable to existing approaches.
arXiv Detail & Related papers (2024-10-25T16:32:11Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning
Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence to demonstrate that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z) - DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z) - Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z) - Exploring Data Redundancy in Real-world Image Classification through
Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model
Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times - 10\times$ faster and tune hyperparameters $20\times - 75\times$ faster than full-dataset training or tuning without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Dominant Set-based Active Learning for Text Classification and its
Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimum annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z) - Mixing Deep Learning and Multiple Criteria Optimization: An Application
to Distributed Learning with Multiple Datasets [0.0]
The training phase is the most important stage during the machine learning process.
We develop a multiple criteria optimization model in which each criterion measures the distance between the output associated with a specific input and its label.
We propose a scalarization approach to implement this model and report numerical experiments in digit classification using MNIST data.
arXiv Detail & Related papers (2021-12-02T16:00:44Z) - Training Data Subset Selection for Regression with Controlled
Generalization Error [19.21682938684508]
We develop an efficient majorization-minimization algorithm for data subset selection.
SELCON trades off accuracy and efficiency more effectively than the current state-of-the-art.
arXiv Detail & Related papers (2021-06-23T16:03:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.