Balancing Constraints and Submodularity in Data Subset Selection
- URL: http://arxiv.org/abs/2104.12835v1
- Date: Mon, 26 Apr 2021 19:22:27 GMT
- Title: Balancing Constraints and Submodularity in Data Subset Selection
- Authors: Srikumar Ramalingam, Daniel Glasner, Kaushal Patel, Raviteja
Vemulapalli, Sadeep Jayasumana, Sanjiv Kumar
- Abstract summary: We show that one can achieve similar accuracy to traditional deep-learning models, while using less training data.
We propose a novel diversity driven objective function, and balancing constraints on class labels and decision boundaries using matroids.
- Score: 43.03720397062461
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has yielded extraordinary results in vision and natural
language processing, but this achievement comes at a cost. Most deep learning
models require enormous resources during training, both in terms of computation
and in human labeling effort. In this paper, we show that one can achieve
similar accuracy to traditional deep-learning models, while using less training
data. Much of the previous work in this area relies on using uncertainty or
some form of diversity to select subsets of a larger training set.
Submodularity, a discrete analogue of convexity, has been exploited to model
diversity in various settings including data subset selection. In contrast to
prior methods, we propose a novel diversity driven objective function, and
balancing constraints on class labels and decision boundaries using matroids.
This allows us to use efficient greedy algorithms with approximation guarantees
for subset selection. We outperform baselines on standard image classification
datasets such as CIFAR-10, CIFAR-100, and ImageNet. In addition, we also show
that the proposed balancing constraints can play a key role in boosting the
performance in long-tailed datasets such as CIFAR-100-LT.
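To make the abstract's ingredients concrete, here is a minimal sketch of greedy subset selection under a class-balancing constraint. It is not the paper's method: the facility-location objective stands in for the authors' diversity objective, the per-class cap is one simple example of a matroid constraint (a partition matroid truncated at a total budget), and all names (`facility_location`, `greedy_partition_matroid`, `per_class_cap`, `budget`) are hypothetical. Similarities are assumed to be non-negative. For a monotone submodular objective under a single matroid constraint, the classical greedy guarantee is a factor of 1/2.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def facility_location(selected: Sequence[int], sim: List[List[float]]) -> float:
    """Sum, over all points, of the similarity to the closest selected point."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_partition_matroid(
    sim: List[List[float]],
    labels: Sequence[int],
    per_class_cap: Dict[int, int],
    budget: int,
) -> List[int]:
    """Greedily pick up to `budget` indices, at most per_class_cap[c] from class c.

    Assumes non-negative similarities; classes missing from the cap dict get cap 0.
    """
    n = len(sim)
    selected: List[int] = []
    counts: Counter = Counter()
    # best[i] caches the similarity of point i to its closest selected point.
    best = [0.0] * n
    while len(selected) < budget:
        best_gain = 0.0
        best_j: Optional[int] = None
        for j in range(n):
            # Skip candidates that are already chosen or would break the class cap.
            if j in selected or counts[labels[j]] >= per_class_cap.get(labels[j], 0):
                continue
            # Marginal gain of adding j to the facility-location objective.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if best_j is None or gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:  # no feasible candidate remains
            break
        selected.append(best_j)
        counts[labels[best_j]] += 1
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return selected
```

The cap forces the selection to spread across classes even when one class dominates the similarity structure, which is the intuition behind using balancing constraints on long-tailed data.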
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Uncertainty-aware Sampling for Long-tailed Semi-supervised Learning [89.98353600316285]
We introduce uncertainty into the modeling process for pseudo-label sampling, taking into account that the model performance on the tailed classes varies over different training stages.
This approach allows the model to perceive the uncertainty of pseudo-labels at different training stages, thereby adaptively adjusting the selection thresholds for different classes.
Compared to other methods such as the baseline method FixMatch, UDTS achieves an increase in accuracy of at least approximately 5.26%, 1.75%, 9.96%, and 1.28% on the natural scene image datasets.
arXiv Detail & Related papers (2024-01-09T08:59:39Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Frugal Reinforcement-based Active Learning [12.18340575383456]
We propose a novel active learning approach for label-efficient training.
The proposed method is iterative and aims at minimizing a constrained objective function that mixes diversity, representativity and uncertainty criteria.
We also introduce a novel weighting mechanism based on reinforcement learning, which adaptively balances these criteria at each training iteration.
arXiv Detail & Related papers (2022-12-09T14:17:45Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- Mixing Deep Learning and Multiple Criteria Optimization: An Application to Distributed Learning with Multiple Datasets [0.0]
The training phase is the most important stage of the machine learning process.
We develop a multiple criteria optimization model in which each criterion measures the distance between the output associated with a specific input and its label.
We propose a scalarization approach to implement this model and numerical experiments in digit classification using MNIST data.
arXiv Detail & Related papers (2021-12-02T16:00:44Z)
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem.
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
- Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in such scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z)