REDUCR: Robust Data Downsampling Using Class Priority Reweighting
- URL: http://arxiv.org/abs/2312.00486v2
- Date: Mon, 25 Nov 2024 10:50:55 GMT
- Title: REDUCR: Robust Data Downsampling Using Class Priority Reweighting
- Authors: William Bankes, George Hughes, Ilija Bogunovic, Zi Wang
- Abstract summary: This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting.
We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks.
- Score: 14.030860906244898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
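The abstract describes class-aware priority weights that are maintained by an online learning algorithm and used to pick the most useful datapoints from each incoming batch. Below is a minimal sketch of that general pattern, assuming a multiplicative-weights (Hedge-style) update on class priorities and a loss-times-priority selection score; both are illustrative assumptions, not REDUCR's exact objective.

```python
# Sketch of class-priority-reweighted online batch selection.
# The score function and the multiplicative-weights update are simplifying
# assumptions used for illustration; they are not REDUCR's exact method.
import numpy as np

def hedge_update(class_weights, class_losses, eta=0.1):
    """Raise the priority of classes that are currently doing poorly."""
    w = class_weights * np.exp(eta * class_losses)  # worse loss -> higher priority
    return w / w.sum()

def select_batch(per_example_loss, labels, class_weights, k):
    """Keep the k datapoints with the largest class-reweighted losses."""
    scores = class_weights[labels] * per_example_loss
    return np.argsort(-scores)[:k]

# toy usage on one incoming mega-batch
rng = np.random.default_rng(0)
num_classes, B, k = 5, 64, 16
class_weights = np.full(num_classes, 1.0 / num_classes)
labels = rng.integers(0, num_classes, size=B)
losses = rng.random(B)

chosen = select_batch(losses, labels, class_weights, k)
class_losses = np.array([losses[labels == c].mean() if (labels == c).any() else 0.0
                         for c in range(num_classes)])
class_weights = hedge_update(class_weights, class_losses)
print(chosen[:5], class_weights.round(3))
```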
Related papers
- Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information [2.133855532092057]
We propose an effective data reduction strategy based on Pointwise V-Information (PVI).
Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% reduction in accuracy when 10%-30% of the data is removed.
We have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese NLP tasks and base models.
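As a rough illustration of PVI-based filtering, the sketch below scores each example by how much the input raises the label's predictability over a prior-only null model, then drops the lowest-scoring fraction. The toy dataset, the logistic-regression model family, and the 30% drop rate are assumptions, and in practice the scoring models would be fit on held-out data.

```python
# Pointwise V-information (PVI) filtering sketch: keep examples whose input
# makes the label much more predictable than a null model does.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts  = ["great movie", "terrible plot", "loved it", "boring and bad",
          "fantastic acting", "awful film", "wonderful story", "dull ending"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X = CountVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X, labels)             # g: sees the input
p_with_x = clf.predict_proba(X)[np.arange(len(labels)), labels]

prior = np.bincount(labels) / len(labels)             # g': null model, label prior only
p_null = prior[labels]

pvi = np.log2(p_with_x) - np.log2(p_null)             # PVI(x -> y)
keep = np.argsort(-pvi)[: int(0.7 * len(labels))]     # drop the lowest-PVI 30%
print(sorted(keep.tolist()))
```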
arXiv Detail & Related papers (2025-06-19T06:59:19Z)
- Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [107.24906866038431]
We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they become useful for training.
We show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3, and 2.5 percentage points respectively across 22 diverse tasks.
arXiv Detail & Related papers (2025-06-05T07:12:12Z)
- SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning [30.34323856102674]
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations.
Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity.
We introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies.
arXiv Detail & Related papers (2025-05-28T17:45:05Z)
- Online-BLS: An Accurate and Efficient Online Broad Learning System for Data Stream Classification [52.251569042852815]
We introduce an online broad learning system framework with closed-form solutions for each online update.
We design an effective weight estimation algorithm and an efficient online updating strategy.
Our framework is naturally extended to data stream scenarios with concept drift and exceeds state-of-the-art baselines.
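A sketch of the kind of closed-form online update referred to above is given below, using a recursive least-squares (Sherman-Morrison) update of the output weights on a data stream. This is a generic illustration and does not reproduce Online-BLS's feature construction, weight estimation algorithm, or concept-drift handling.

```python
# Closed-form online update of output weights via recursive least squares.
import numpy as np

class OnlineRidge:
    def __init__(self, d, c, lam=1.0):
        self.P = np.eye(d) / lam          # running (H^T H + lam I)^{-1}
        self.W = np.zeros((d, c))         # output weights

    def update(self, h, y):               # h: (d,) features, y: (c,) one-hot target
        Ph = self.P @ h
        self.P -= np.outer(Ph, Ph) / (1.0 + h @ Ph)          # Sherman-Morrison step
        self.W += np.outer(self.P @ h, y - h @ self.W)       # gain * prediction error

rng = np.random.default_rng(0)
model = OnlineRidge(d=6, c=3)
for _ in range(200):                      # stream of (feature, one-hot label) pairs
    h = rng.normal(size=6)
    y = np.eye(3)[rng.integers(0, 3)]
    model.update(h, y)
print(model.W.round(2))
```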
arXiv Detail & Related papers (2025-01-28T13:21:59Z)
- Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA [15.542668474378633]
We propose a novel and efficient machine unlearning method on pre-trained models.
We leverage LoRA to decompose the model's intermediate features into pre-trained features and residual features.
The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set.
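The two sentences above suggest a simple training signal; the sketch below is one way to realize it, assuming a single frozen linear layer with a LoRA branch, an arbitrary constant shift target, and equal loss weights, none of which come from the paper.

```python
# Residual-feature-alignment sketch: a frozen base layer plus a LoRA branch
# whose output (the "residual feature") is pushed toward zero on retained data
# and toward a fixed shift on data to be unlearned.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.requires_grad_(False)               # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_out))

    def residual(self, x):                            # LoRA branch only
        return x @ self.A @ self.B

    def forward(self, x):
        return self.base(x) + self.residual(x)

layer = LoRALinear(16, 16)
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
shift = torch.ones(16)                                # assumed shift target for forget data

x_retain, x_forget = torch.randn(32, 16), torch.randn(8, 16)
for _ in range(100):
    loss = (layer.residual(x_retain).pow(2).mean()                 # zero residual on retain set
            + (layer.residual(x_forget) - shift).pow(2).mean())    # shifted residual on forget set
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```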
arXiv Detail & Related papers (2024-11-13T08:56:35Z)
- Data-Free Generative Replay for Class-Incremental Learning on Imbalanced Data [0.7366405857677227]
Continual learning is a challenging problem in machine learning, especially for image classification tasks with imbalanced datasets.
This paper proposes Data-Free Generative Replay (DFGR) for class incremental learning, where the generator is trained without access to real data.
DFGR achieves up to 88.5% and 46.6% accuracy on MNIST and FashionMNIST datasets, respectively.
arXiv Detail & Related papers (2024-06-07T17:51:27Z)
- DRoP: Distributionally Robust Pruning [11.930434318557156]
We conduct the first systematic study of the impact of data pruning on classification bias of trained models.
We propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks.
arXiv Detail & Related papers (2024-04-08T14:55:35Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
Selective Synaptic Dampening (SSD) is fast and performant, and does not require long-term storage of the training data.
We present a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data.
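The sketch below shows one way such a post hoc dampening step could look, assuming squared-gradient importance estimates and arbitrary alpha/lam constants; it follows the spirit of selective dampening rather than the paper's exact formula.

```python
# Selective-dampening sketch: estimate a diagonal importance for each parameter
# on the forget set and on the retain set (mean squared gradients), then shrink
# parameters that matter far more to the forget set.
import torch
import torch.nn as nn
import torch.nn.functional as F

def importances(model, loader):
    imp = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            imp[n] += p.grad.detach() ** 2
    return {n: v / len(loader) for n, v in imp.items()}

@torch.no_grad()
def dampen(model, imp_forget, imp_retain, alpha=10.0, lam=1.0):
    for n, p in model.named_parameters():
        f, r = imp_forget[n], imp_retain[n]
        mask = f > alpha * r                          # forget-specific parameters
        scale = torch.clamp(lam * r / (f + 1e-12), max=1.0)
        p[mask] *= scale[mask]

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))
retain = [(torch.randn(16, 8), torch.randint(0, 3, (16,))) for _ in range(4)]
forget = [(torch.randn(16, 8), torch.randint(0, 3, (16,))) for _ in range(2)]
dampen(model, importances(model, forget), importances(model, retain))
```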
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Efficient Augmentation for Imbalanced Deep Learning [8.38844520504124]
We study a convolutional neural network's internal representation of imbalanced image data.
We measure the generalization gap between a model's feature embeddings in the training and test sets, showing that the gap is wider for minority classes.
This insight enables us to design an efficient three-phase CNN training framework for imbalanced data.
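One simple way to quantify the per-class embedding gap mentioned above, assuming centroid distance between train-set and test-set features as the measure (the paper's exact metric may differ), is sketched below.

```python
# Per-class "embedding generalization gap" sketch: distance between each
# class's train and test feature centroids. The centroid-distance measure is
# an illustrative assumption.
import numpy as np

def per_class_gap(train_feats, train_y, test_feats, test_y, num_classes):
    gaps = []
    for c in range(num_classes):
        mu_train = train_feats[train_y == c].mean(axis=0)
        mu_test = test_feats[test_y == c].mean(axis=0)
        gaps.append(float(np.linalg.norm(mu_train - mu_test)))
    return gaps

rng = np.random.default_rng(0)
tr_f, tr_y = rng.normal(size=(500, 32)), rng.integers(0, 5, size=500)
te_f, te_y = rng.normal(size=(100, 32)), rng.integers(0, 5, size=100)
print(per_class_gap(tr_f, tr_y, te_f, te_y, num_classes=5))
```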
arXiv Detail & Related papers (2022-07-13T09:43:17Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift at deployment by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
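As a simplified illustration of class-aware alignment at test time, the sketch below pulls target features toward assumed per-class source means under pseudo-labels; the Euclidean loss, the toy encoder, and the precomputed means are stand-ins, not the CAFA loss itself.

```python
# Class-aware feature alignment sketch for test-time adaptation.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(10, 6)                 # stands in for a feature extractor
classifier = nn.Linear(6, 3)
source_means = torch.randn(3, 6)           # assumed: class means computed on source data

x_test = torch.randn(32, 10)               # unlabeled test batch
feats = encoder(x_test)
pseudo = classifier(feats).argmax(dim=1)   # pseudo-labels from current predictions

# pull each feature toward the mean of its pseudo-class
align_loss = F.mse_loss(feats, source_means[pseudo])
align_loss.backward()                      # gradients adapt the encoder at test time
print(float(align_loss))
```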
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
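A compressed sketch of the weighting structure is below: a small weight net maps per-sample loss and a class-size feature to a weight in [0, 1], and the classifier is trained on the weighted loss. The actual method meta-learns this net on a clean meta set (a bilevel step omitted here), and the particular inputs to the net are assumptions.

```python
# Learned class-aware sample weighting sketch (meta-learning step omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

weight_net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
classifier = nn.Linear(20, 5)

x = torch.randn(64, 20)
y = torch.randint(0, 5, (64,))
class_size = torch.bincount(y, minlength=5).float()

per_sample_loss = F.cross_entropy(classifier(x), y, reduction="none")
wnet_in = torch.stack([per_sample_loss.detach(),
                       (class_size[y] + 1.0).log()], dim=1)   # (loss, log class size)
weights = weight_net(wnet_in).squeeze(1)

weighted_loss = (weights.detach() * per_sample_loss).mean()   # weights treated as fixed here
weighted_loss.backward()
print(float(weighted_loss))
```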
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Learning Fast Sample Re-weighting Without Reward Data [41.92662851886547]
This paper presents a novel learning-based fast sample re-weighting (FSR) method that does not require additional reward data.
Our experiments show the proposed method achieves competitive results compared to the state of the art on label noise and long-tailed recognition.
arXiv Detail & Related papers (2021-09-07T17:30:56Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
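The sketch below captures the flavor of gradient-based selection: each candidate is scored by agreement between its gradient and the batch mean gradient plus agreement with a stored past-task gradient, and the top-k form the coreset. The linear-model gradients, the random stand-in for the past-task gradient, and the equal weighting of the two terms are assumptions.

```python
# Gradient-similarity coreset selection sketch with a linear classifier.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_sample_grads(W, X, y):
    """Per-sample gradient of cross-entropy w.r.t. W, flattened."""
    P = softmax(X @ W)                       # (n, c)
    P[np.arange(len(y)), y] -= 1.0           # softmax - onehot
    return np.einsum("nc,nd->ncd", P, X).reshape(len(y), -1)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(0)
d, c, n, k = 8, 4, 64, 16
W = rng.normal(size=(d, c))
X, y = rng.normal(size=(n, d)), rng.integers(0, c, size=n)
past_grad = rng.normal(size=d * c)           # stands in for a past-task coreset gradient

G = per_sample_grads(W, X, y)
score = cosine(G, G.mean(axis=0)) + cosine(G, past_grad)   # representativeness + affinity
coreset = np.argsort(-score)[:k]
print(sorted(coreset.tolist()))
```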
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and pre-training on a dataset filtered from a larger-scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.