Exploring Data Redundancy in Real-world Image Classification through
Data Selection
- URL: http://arxiv.org/abs/2306.14113v1
- Date: Sun, 25 Jun 2023 03:31:05 GMT
- Title: Exploring Data Redundancy in Real-world Image Classification through
Data Selection
- Authors: Zhenyu Tang, Shaoting Zhang, Xiaosong Wang
- Abstract summary: Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
- Score: 20.389636181891515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models often require large amounts of data for training,
leading to increased costs. This is particularly challenging in medical imaging,
where gathering distributed data for centralized training is difficult and
obtaining quality labels remains a tedious task. Many methods have been proposed
to address this issue in various training paradigms, e.g., continual learning,
active learning, and federated learning, each of which involves some form of
data valuation. However, existing methods are either overly intuitive or
evaluated only on common clean/toy datasets. In this
work, we present two data valuation metrics based on Synaptic Intelligence and
gradient norms, respectively, to study the redundancy in real-world image data.
Novel online and offline data selection algorithms are then proposed via
clustering and grouping based on the examined data values. Our online approach
evaluates data using layerwise model parameter updates and gradients in each
epoch, and it can accelerate model training with fewer epochs and a subset
(e.g., 19%-59%) of the data while maintaining equivalent accuracy across a
variety of datasets. It also extends to offline coreset
construction, producing subsets of only 18%-30% of the original. The code for
the proposed adaptive data selection and coreset computation is available at
https://github.com/ZhenyuTANG2023/data_selection.
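The abstract describes the pipeline only at a high level; the sketch below illustrates an offline variant of it under simple assumptions: the data value of a sample is the L2 norm of its loss gradient, and selection clusters those values and keeps a fixed fraction per cluster. All names (per_sample_grad_norms, select_by_clustering) and hyperparameters are hypothetical and are not taken from the released code.

```python
# Hypothetical sketch: gradient-norm data valuation followed by cluster-and-select.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def per_sample_grad_norms(model, dataset, device="cpu"):
    """Data value of a sample = L2 norm of its loss gradient w.r.t. all model parameters."""
    model.to(device).eval()
    values = []
    for x, y in dataset:  # assumes a map-style dataset yielding (image_tensor, int_label)
        x = x.unsqueeze(0).to(device)
        y = torch.tensor([y], device=device)
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
        values.append(sq.sqrt().item())
    return torch.tensor(values)

def select_by_clustering(values, keep_ratio=0.3, n_clusters=50, seed=0):
    """Group samples by data value and keep the top `keep_ratio` of each group,
    so the subset covers the whole value range instead of only the global top."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed) \
        .fit_predict(values.reshape(-1, 1).numpy())
    labels = torch.as_tensor(labels)
    selected = []
    for c in range(n_clusters):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        if len(idx) == 0:
            continue
        k = max(1, int(keep_ratio * len(idx)))
        selected.extend(idx[values[idx].argsort(descending=True)[:k]].tolist())
    return sorted(selected)
```

The online approach described in the abstract instead reuses the layerwise parameter updates and gradients already produced during each training epoch (in the spirit of Synaptic Intelligence), so it avoids the extra backward passes done here; that bookkeeping is omitted from the sketch.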
Related papers
- Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training.
We propose a clustering-based approach for building such datasets.
Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
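The summary only names the technique; as a rough, hypothetical sketch (not the paper's implementation), successive k-means over precomputed embeddings followed by balanced sampling across the coarse clusters could look like this:

```python
# Hypothetical sketch of hierarchical k-means curation: cluster embeddings,
# re-cluster the resulting centroids, then sample evenly across coarse clusters.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_curation(embeddings, k_levels=(1000, 100), per_cluster=10, seed=0):
    rng = np.random.default_rng(seed)
    # Level 1: fine-grained clusters over all embeddings.
    fine = KMeans(n_clusters=k_levels[0], n_init=4, random_state=seed).fit(embeddings)
    # Level 2: coarse clusters over the fine centroids.
    coarse = KMeans(n_clusters=k_levels[1], n_init=4, random_state=seed).fit(fine.cluster_centers_)
    coarse_of_sample = coarse.labels_[fine.labels_]  # map each sample to a coarse cluster
    selected = []
    for c in range(k_levels[1]):
        idx = np.flatnonzero(coarse_of_sample == c)
        if len(idx):
            selected.extend(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return np.sort(np.array(selected))
```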
arXiv Detail & Related papers (2024-05-24T14:58:51Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Exploring Federated Deep Learning for Standardising Naming Conventions in Radiotherapy Data [0.18749305679160366]
Standardising structure volume names in radiotherapy (RT) data is necessary to enable data mining and analyses.
No studies have considered that RT patient records are distributed across multiple data centres.
This paper introduces a method that emulates real-world environments to establish standardised nomenclature.
A multimodal deep artificial neural network is proposed to standardise RT data in federated settings.
arXiv Detail & Related papers (2024-02-14T07:52:28Z)
- Group Distributionally Robust Dataset Distillation with Risk Minimization [18.07189444450016]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct dataset distillation (DD).
We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
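As a generic illustration of combining clustering with a risk measure on the loss (the paper's full dataset-distillation objective is not reproduced here), one could minimize the mean loss of the worst clusters, a CVaR-style risk over precomputed cluster assignments; the names and the worst_frac parameter are illustrative:

```python
# Generic sketch: CVaR-style risk over per-cluster losses, encouraging
# robustness across subgroups rather than only on-average performance.
import torch
import torch.nn.functional as F

def worst_cluster_risk(model, x, y, cluster_ids, n_clusters, worst_frac=0.3):
    losses = F.cross_entropy(model(x), y, reduction="none")
    cluster_losses = torch.stack([
        losses[cluster_ids == c].mean()
        for c in range(n_clusters) if (cluster_ids == c).any()
    ])
    k = max(1, int(worst_frac * len(cluster_losses)))
    return cluster_losses.topk(k).values.mean()  # mean loss of the worst clusters
```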
arXiv Detail & Related papers (2024-02-07T09:03:04Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method produces models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
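The summary gives only the method's name. The sketch below shows the generic gradient-projection idea: remove from the unlearning update any component lying in the span of retained-data gradient directions. The ascent-on-the-forget-loss objective, the precomputed orthonormal basis, and all names are assumptions made for illustration, not the published PGU algorithm.

```python
# Illustrative gradient-projection unlearning step (not the paper's code).
import torch

def project_out(grad, retained_basis):
    """Project `grad` onto the orthogonal complement of the retained-gradient subspace.
    `retained_basis`: (k, d) with orthonormal rows, e.g. from an SVD of
    retained-data gradients (assumed precomputed)."""
    coeffs = retained_basis @ grad            # components along retained directions
    return grad - retained_basis.T @ coeffs   # remove them

def unlearning_step(model, forget_loss, retained_basis, lr=1e-3):
    # Assumes every parameter receives a gradient from `forget_loss`.
    model.zero_grad()
    forget_loss.backward()
    flat = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
    flat = project_out(flat, retained_basis)
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            # Gradient *ascent* on the forget loss, restricted to the projected direction.
            p.add_(lr * flat[offset:offset + n].view_as(p))
            offset += n
```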
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to choose a subset of the training data (a coreset) that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that represents the dataset as a graph and uses forward and reverse message passing over this graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
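A simplified, hypothetical illustration of the dataset-graph idea (not the authors' implementation): build a kNN graph over sample embeddings, blend each sample's difficulty score with those of its neighbours (forward messages), then select greedily while discounting neighbours of already-selected samples (reverse messages).

```python
# Illustrative graph-based pruning sketch in the spirit of D2 Pruning.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def d2_style_select(embeddings, difficulty, budget, k=10, gamma=0.9):
    n = len(difficulty)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nbrs.kneighbors(embeddings)   # idx[:, 0] is the point itself
    neighbors = idx[:, 1:]
    # Forward message passing: smooth difficulty scores over the graph.
    scores = difficulty + difficulty[neighbors].mean(axis=1)
    selected, active = [], np.ones(n, dtype=bool)
    for _ in range(budget):
        i = int(np.argmax(np.where(active, scores, -np.inf)))
        selected.append(i)
        active[i] = False
        # Reverse message passing: discount the selected point's neighbours for diversity.
        scores[neighbors[i]] *= gamma
    return np.array(selected)
```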
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimal annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves higher performance than state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (a coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
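As a rough sketch of per-iteration scoring (simplified: OCS's diversity term is omitted, and all names are illustrative rather than the authors' code), candidates can be scored by the similarity of their gradients to the current minibatch's mean gradient and to a reference gradient from past tasks:

```python
# Illustrative per-iteration coreset scoring in the spirit of OCS.
import torch
import torch.nn.functional as F

def per_sample_grads(model, xs, ys):
    # Assumes every parameter receives a gradient for each sample.
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads.append(torch.cat([p.grad.reshape(-1) for p in model.parameters()]))
    return torch.stack(grads)                                   # (batch, n_params)

def select_coreset(model, xs, ys, past_task_grad, k, alpha=1.0):
    g = per_sample_grads(model, xs, ys)
    mean_g = g.mean(dim=0, keepdim=True)
    representativeness = F.cosine_similarity(g, mean_g)         # similarity to current batch
    affinity = F.cosine_similarity(g, past_task_grad.unsqueeze(0))  # similarity to past tasks
    scores = representativeness + alpha * affinity
    return scores.topk(k).indices                                # samples kept in the coreset
```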
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Finding High-Value Training Data Subset through Differentiable Convex Programming [5.5180456567480896]
In this paper, we study the problem of selecting high-value subsets of training data.
The key idea is to design a learnable framework for online subset selection.
Using this framework, we design an online alternating minimization-based algorithm for jointly learning the parameters of the selection model and ML model.
arXiv Detail & Related papers (2021-04-28T14:33:26Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on this enlarged dataset, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.