Exploring Data Redundancy in Real-world Image Classification through
Data Selection
- URL: http://arxiv.org/abs/2306.14113v1
- Date: Sun, 25 Jun 2023 03:31:05 GMT
- Title: Exploring Data Redundancy in Real-world Image Classification through
Data Selection
- Authors: Zhenyu Tang, Shaoting Zhang, Xiaosong Wang
- Abstract summary: Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
- Score: 20.389636181891515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models often require large amounts of data for training,
leading to increased costs. This is particularly challenging in medical
imaging, where gathering distributed data for centralized training and
obtaining quality labels are both tedious jobs. Many methods have been proposed
to address this issue in various training paradigms, e.g., continual learning,
active learning, and federated learning, which indeed demonstrate certain forms
of the data valuation process. However, existing methods are either overly
intuitive or evaluated only on common clean/toy datasets. In this
work, we present two data valuation metrics based on Synaptic Intelligence and
gradient norms, respectively, to study the redundancy in real-world image data.
Novel online and offline data selection algorithms are then proposed via
clustering and grouping based on the examined data values. Our online approach
effectively evaluates data utilizing layerwise model parameter updates and
gradients in each epoch and can accelerate model training with fewer epochs and
a subset (e.g., 19%-59%) of data while maintaining equivalent levels of
accuracy in a variety of datasets. It also extends to the offline coreset
construction, producing subsets of only 18%-30% of the original. The codes for
the proposed adaptive data selection and coreset computation are available
(https://github.com/ZhenyuTANG2023/data_selection).
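A minimal sketch of the idea, assuming the gradient-norm variant of the valuation metric: score each sample by the norm of its loss gradient, then cluster and keep a fixed fraction of the highest-value samples per cluster. This is illustrative only, not the released code at the repository above; `gradient_norm_values` and `select_subset` are made-up names.

```python
import torch
from sklearn.cluster import KMeans

def gradient_norm_values(model, loss_fn, loader, device="cpu"):
    """Score each sample by the norm of its loss gradient w.r.t. the parameters."""
    model.to(device)
    values = []
    for x, y in loader:  # assumes batch_size=1 so scores are per-sample
        model.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
        values.append(float(sq.sqrt()))
    return values

def select_subset(features, values, keep_frac=0.3, n_clusters=50):
    """Cluster the samples, then keep the highest-value fraction of each cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    keep = []
    for c in range(n_clusters):
        idx = [i for i, l in enumerate(labels) if l == c]
        idx.sort(key=lambda i: values[i], reverse=True)
        keep.extend(idx[: max(1, int(keep_frac * len(idx)))])
    return sorted(keep)
```

Keeping a fixed fraction per cluster rather than a global top-k preserves coverage of every data mode, which is consistent with the paper's report of equivalent accuracy from 19%-59% of the data.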
Related papers
- Learning from Limited and Imperfect Data [6.30667368422346]
We develop practical algorithms for Deep Neural Networks that can learn from limited and imperfect data present in the real world.
These works are divided into four segments, each covering a scenario of learning from limited or imperfect data.
arXiv Detail & Related papers (2024-11-11T18:48:31Z)
- Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training.
We propose a clustering-based approach for building such datasets.
Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
arXiv Detail & Related papers (2024-05-24T14:58:51Z)
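A sketch of the hierarchical $k$-means idea in the entry above: each level clusters the centroids of the previous one, and the final sample is balanced across top-level clusters so rare modes are not drowned out. Function names and level sizes are assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(features, ks=(1000, 100)):
    """Successive k-means: each level clusters the previous level's centroids."""
    levels, points = [], features
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10).fit(points)
        levels.append(km)
        points = km.cluster_centers_
    return levels

def top_level_labels(features, levels):
    """Map every point up the hierarchy to its top-level cluster id."""
    labels = levels[0].predict(features)
    centers = levels[0].cluster_centers_
    for km in levels[1:]:
        parent = km.predict(centers)  # cluster id of each lower-level centroid
        labels = parent[labels]
        centers = km.cluster_centers_
    return labels

def balanced_sample(features, levels, per_cluster=10, seed=0):
    """Draw roughly the same number of points from each top-level cluster."""
    rng = np.random.default_rng(seed)
    labels = top_level_labels(features, levels)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        picked.extend(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return np.sort(np.array(picked))
```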
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Group Distributionally Robust Dataset Distillation with Risk Minimization [18.07189444450016]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct dataset distillation (DD).
We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
arXiv Detail & Related papers (2024-02-07T09:03:04Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method produces models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
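A hedged, one-direction sketch of the projected-gradient idea above: take a gradient *ascent* step on the forget data, but first remove the component of that gradient lying along the retained data's gradient, so retained knowledge is approximately untouched. The PGU paper projects onto a gradient subspace of the remaining data; this toy version keeps only a single retained direction.

```python
import torch

def projected_unlearning_step(model, loss_fn, forget_batch, retain_batch, lr=1e-3):
    def flat_grad(batch):  # assumes every parameter receives a gradient
        x, y = batch
        model.zero_grad()
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()])

    g_forget = flat_grad(forget_batch)
    g_retain = flat_grad(retain_batch)
    # Orthogonal projection: g_forget minus its component along g_retain.
    g_proj = g_forget - (g_forget @ g_retain) / (g_retain @ g_retain + 1e-12) * g_retain

    with torch.no_grad():  # ascend the forget loss along the projected direction
        offset = 0
        for p in model.parameters():
            n = p.numel()
            p.add_(lr * g_proj[offset:offset + n].view_as(p))
            offset += n
```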
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks a subset of the training data, referred to as a coreset, that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
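An illustrative sketch of message passing over a dataset graph in the spirit of D2 Pruning: difficulty scores diffuse over a kNN graph (forward pass), then each greedy pick down-weights its neighbours (reverse pass) so the coreset stays diverse. These are not the paper's exact update rules; `gamma` is a made-up damping weight.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def d2_style_coreset(embeddings, difficulty, budget, k=10, gamma=0.9):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dist, nbrs = nn.kneighbors(embeddings)
    w = np.exp(-dist[:, 1:])   # edge weights; self column dropped
    nbrs = nbrs[:, 1:]

    # Forward messages: a point is more valuable if its neighbourhood is hard.
    score = difficulty + (w * difficulty[nbrs]).sum(axis=1)

    picked = []
    for _ in range(budget):
        i = int(np.argmax(score))
        picked.append(i)
        score[i] = -np.inf
        # Reverse messages: discount the picked point's neighbours.
        score[nbrs[i]] *= gamma
    return picked
```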
- Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimum annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
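A hedged sketch of the OCS idea above: per-sample gradients are scored by similarity to the current minibatch's mean gradient (representativeness) and to a gradient from the stored coreset of past tasks (affinity), and the top-k samples enter the replay buffer. The exact scoring in the paper differs; `tau` is an assumed trade-off weight.

```python
import torch
import torch.nn.functional as F

def per_sample_grads(model, loss_fn, x, y):
    grads = []
    for i in range(x.size(0)):
        model.zero_grad()
        loss_fn(model(x[i : i + 1]), y[i : i + 1]).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    return torch.stack(grads)  # (batch, n_params)

def ocs_select(model, loss_fn, x, y, past_grad, k=8, tau=1.0):
    g = per_sample_grads(model, loss_fn, x, y)
    representative = F.cosine_similarity(g, g.mean(0, keepdim=True), dim=1)
    affinity = F.cosine_similarity(g, past_grad.unsqueeze(0), dim=1)
    top = torch.topk(representative + tau * affinity, k).indices
    return x[top], y[top]
```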
- Finding High-Value Training Data Subset through Differentiable Convex Programming [5.5180456567480896]
In this paper, we study the problem of selecting high-value subsets of training data.
The key idea is to design a learnable framework for online subset selection.
Using this framework, we design an online alternating minimization-based algorithm for jointly learning the parameters of the selection model and ML model.
arXiv Detail & Related papers (2021-04-28T14:33:26Z)
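A toy rendering of the alternating minimization scheme above: per-sample selection logits and the model are updated in turn, with the logits trained through a one-step lookahead so that the weighted batch improves a small validation batch. This is only a sketch; the paper casts selection as differentiable convex programming, and `logits` here is a leaf tensor with requires_grad=True registered in `opt_sel`.

```python
import torch
from torch.func import functional_call

def model_step(model, loss_fn, x, y, logits, opt_model):
    """Update the model on the batch, weighted by the current selection."""
    w = torch.sigmoid(logits).detach()
    per_sample = loss_fn(model(x), y)  # loss_fn uses reduction="none"
    opt_model.zero_grad()
    (w * per_sample).mean().backward()
    opt_model.step()

def selection_step(model, loss_fn, x, y, x_val, y_val, logits, opt_sel, lr=0.1):
    """Update selection logits via a differentiable one-step lookahead."""
    params = dict(model.named_parameters())
    w = torch.sigmoid(logits)
    inner = (w * loss_fn(functional_call(model, params, (x,)), y)).mean()
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    virtual = {n: p - lr * g for (n, p), g in zip(params.items(), grads)}
    val = loss_fn(functional_call(model, virtual, (x_val,)), y_val).mean()
    opt_sel.zero_grad()
    val.backward()
    opt_sel.step()
```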
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle the resulting increase in training data, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
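An illustrative sketch of the classifier-guided generator objectives in a DeGAN-style setup: alongside the usual adversarial loss on the available proxy data, the frozen trained classifier steers the generator toward samples it classifies confidently (low per-sample entropy) while covering many classes (high entropy of the batch-average prediction). The loss weights suggested in the trailing comment are made-up placeholders.

```python
import torch
import torch.nn.functional as F

def degan_generator_losses(classifier, fake_images, eps=1e-8):
    probs = F.softmax(classifier(fake_images), dim=1)  # classifier is frozen
    # Confidence: each generated sample should look like *some* class.
    sample_entropy = -(probs * (probs + eps).log()).sum(1).mean()
    # Diversity: the batch as a whole should spread over classes.
    mean_probs = probs.mean(0)
    batch_entropy = -(mean_probs * (mean_probs + eps).log()).sum()
    return sample_entropy, batch_entropy

# Hypothetical total generator loss, minimized alongside the adversarial term:
#   adv_loss + 1.0 * sample_entropy - 1.0 * batch_entropy
```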
This list is automatically generated from the titles and abstracts of the papers on this site.