GLISTER: Generalization based Data Subset Selection for Efficient and
Robust Learning
- URL: http://arxiv.org/abs/2012.10630v3
- Date: Fri, 15 Jan 2021 21:00:10 GMT
- Title: GLISTER: Generalization based Data Subset Selection for Efficient and
Robust Learning
- Authors: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan,
Rishabh Iyer
- Abstract summary: We introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework.
We propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates.
We show that our framework improves upon the state of the art in both efficiency and accuracy (cases (a) and (c)) and is more efficient than other state-of-the-art robust learning algorithms (case (b)).
- Score: 11.220278271829699
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large scale machine learning and deep models are extremely data-hungry.
Unfortunately, obtaining large amounts of labeled data is expensive, and
training state-of-the-art models (with hyperparameter tuning) requires
significant computing resources and time. Secondly, real-world data is noisy
and imbalanced. As a result, several recent papers try to make the training
process more efficient and robust. However, most existing work either focuses
on robustness or efficiency, but not both. In this work, we introduce Glister,
a GeneraLIzation based data Subset selecTion for Efficient and Robust learning
framework. We formulate Glister as a mixed discrete-continuous bi-level
optimization problem to select a subset of the training data, which maximizes
the log-likelihood on a held-out validation set. Next, we propose an iterative
online algorithm Glister-Online, which performs data selection iteratively
along with the parameter updates and can be applied to any loss-based learning
algorithm. We then show that for a rich class of loss functions including
cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete
data selection is an instance of (weakly) submodular optimization, and we
analyze conditions for which Glister-Online reduces the validation loss and
converges. Finally, we propose Glister-Active, an extension to batch active
learning, and we empirically demonstrate the performance of Glister on a wide
range of tasks, including (a) data selection to reduce training time, (b)
robust learning under label noise and class imbalance, and (c) batch active
learning with several deep and shallow models. We show that our framework
improves upon the state of the art in both efficiency and accuracy (in cases (a)
and (c)) and is more efficient than other state-of-the-art robust learning
algorithms in case (b).
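The inner discrete selection step described in the abstract can be illustrated with a minimal sketch: at the current parameters, greedily pick the k training points whose accumulated one-step gradient update most increases the log-likelihood on the held-out validation set. The logistic-regression model, the step size `eta`, and the naive O(nk) greedy loop below are illustrative assumptions for exposition, not the paper's exact implementation, which relies on faster approximations to scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def val_log_likelihood(w, X_val, y_val):
    """Log-likelihood of a logistic model on the held-out validation set."""
    p = sigmoid(X_val @ w)
    eps = 1e-12  # numerical safety for log(0)
    return np.sum(y_val * np.log(p + eps) + (1.0 - y_val) * np.log(1.0 - p + eps))

def grad_point(w, x, y):
    """Gradient of the log-likelihood of a single training point."""
    return (y - sigmoid(x @ w)) * x

def glister_select(w, X_trn, y_trn, X_val, y_val, k, eta=0.1):
    """Greedy (weakly) submodular selection: pick k training points whose
    accumulated one-step gradient update maximizes validation log-likelihood."""
    selected = []
    g_sum = np.zeros_like(w)
    remaining = set(range(len(X_trn)))
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i in remaining:
            g = g_sum + grad_point(w, X_trn[i], y_trn[i])
            gain = val_log_likelihood(w + eta * g, X_val, y_val)
            if gain > best_gain:
                best_gain, best_i = gain, i
        g_sum += grad_point(w, X_trn[best_i], y_trn[best_i])
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

In Glister-Online this selection would be interleaved with ordinary parameter updates on the chosen subset, re-selecting every few epochs.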
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the influence of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in such challenging settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Neural Active Learning on Heteroskedastic Distributions [29.01776999862397]
We demonstrate the catastrophic failure of active learning algorithms on heteroskedastic datasets.
We propose a new algorithm that incorporates a model difference scoring function for each data point to filter out the noisy examples and sample clean examples.
arXiv Detail & Related papers (2022-11-02T07:30:19Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named as, Compactness Score (CSUFS) to select desired features.
Our proposed algorithm is more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Training Efficiency and Robustness in Deep Learning [2.6451769337566406]
We study approaches to improve the training efficiency and robustness of deep learning models.
We find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data.
We show that a redundancy-aware modification to the sampling of training data improves training speed, and we develop an efficient method for detecting diversity in the training signal.
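The idea of prioritizing learning on more informative training data can be sketched as loss-based sampling: draw each batch with probability weighted toward high-loss examples. This is a generic illustration, not the thesis's exact method; the softmax weighting and the `temperature` parameter are assumptions.

```python
import numpy as np

def prioritized_indices(losses, batch_size, temperature=1.0, rng=None):
    """Sample a batch of distinct indices, favoring high-loss (more
    informative) examples via a softmax over per-example losses.
    `temperature` controls how sharply sampling concentrates on hard examples."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(losses, dtype=float) / temperature
    p = np.exp(z - z.max())  # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(p), size=batch_size, replace=False, p=p)
```

With `temperature` large, this degenerates toward uniform sampling; with it small, batches concentrate on the hardest examples, so in practice it would be tuned per task.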
arXiv Detail & Related papers (2021-12-02T17:11:33Z)
- Training Data Subset Selection for Regression with Controlled Generalization Error [19.21682938684508]
We develop an efficient majorization-minimization algorithm for data subset selection.
SELCON trades off accuracy and efficiency more effectively than the current state-of-the-art.
arXiv Detail & Related papers (2021-06-23T16:03:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.